gto: Mechanics change: `artifacts.yaml` is not mandatory

While discussing GTO with @dmpetrov, we agreed that the goal should be to provide easy, git-compatible mechanics instead of complicated ones. In general, there shouldn’t be a huge initial learning hill to climb. A Git user should be able to gradually transform their repo to support complicated scenarios and shouldn’t be required to implement complex manipulations before they’re justified by their workflow.

One of the current requirements is to create artifacts.yaml and add an artifact to it. Right now, this is the way to “register” an artifact. To simplify the workflow, we could lift this limitation. Then artifacts.yaml is no longer a prerequisite, but just a way to “enrich” your artifact with GTO.

We initially introduced artifacts.yaml because if you set VERSION_BASE=commit, there is no way to know the name of your artifact at all, as you don’t create git tags. But since we have decided to support tags-only approaches now, this limitation could be temporarily removed.

So now the user flow will look like this:

# starter option, no artifacts.yaml is there
gto register myartifact  # adds git tag myartifact@v0.0.1 -- NOT shown in Studio MR
gto register model  # adds model@v0.0.1 -- Shown in Studio MR. This is a special convention for a single name only ("model").

# now you can add GTO "enrichment" and specify it's a model
gto add model myartifact path/to.h5  # creates artifacts.yaml, now is shown in Studio, because we know it's a "model" TYPE
gto register myartifact  # adds git tag myartifact@v0.0.2 -- we trace the history back and see that the latest version was v0.0.1
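
The "trace the history back" step in the last command can be illustrated with a minimal Python sketch. It is not GTO's actual implementation: the function name `next_version` is hypothetical, the tag format `<name>@v<major>.<minor>.<patch>` is taken from the examples above, and bumping the patch number is an assumption based on the v0.0.1 to v0.0.2 step shown there.

```python
import re

# Assumed tag format, taken from the examples: "<name>@v<major>.<minor>.<patch>"
VERSION_TAG = re.compile(
    r"^(?P<name>[^@#]+)@v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


def next_version(tags, name):
    """Return the next version string for `name`, given all git tag names.

    Hypothetical helper: with no earlier version tag it starts at v0.0.1,
    otherwise it bumps the patch number of the highest existing version.
    """
    versions = []
    for tag in tags:
        match = VERSION_TAG.match(tag)
        if match and match["name"] == name:
            versions.append(
                tuple(int(match[part]) for part in ("major", "minor", "patch"))
            )
    if not versions:
        return "v0.0.1"
    major, minor, patch = max(versions)  # tuples compare numerically, not as text
    return f"v{major}.{minor}.{patch + 1}"
```

Comparing versions as integer tuples rather than strings avoids ordering v0.0.10 below v0.0.9.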

This also makes the question “How to deprecate a model” from #93 somewhat different. CC @omesser @shcheklein

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

I had a brief chat with @dberenbaum about this. His feedback was (please correct me if I misunderstood):

  1. How to incorporate the model path into the git tag (we had this idea a while ago, if you remember), and whether that is a valid scenario at all.
  2. We need a clear user transition from tags (the basic use case) to artifacts.yaml when users need more features.

My understanding: for (2) we have a good transition path; for (1), we are not optimizing for this flow right now (it stays a user-level convention with no automation; hopefully we can do better 😃).

I’ve implemented what we have been discussing. Here’s how it works. To reproduce, run pytest --basetemp=pytest-cache && cd pytest-cache/test_api0/.

Please let me know if this workflow looks clear and explicable to you. @shcheklein @dmpetrov @omesser @dberenbaum

Show registry

By default you can access “registered” artifacts only. To “register” an artifact, you need to create a git tag for it: create a version (e.g. “rf@v0.0.1”) or promote it (e.g. “rf#prod-5”).

$ gto show
╒════════╤═══════════╤════════════════════╤═════════════════╕
│ name   │ version   │ stage/production   │ stage/staging   │
╞════════╪═══════════╪════════════════════╪═════════════════╡
│ rf     │ v1.2.4    │ v1.2.4             │ -               │
│ nn     │ v0.0.1    │ -                  │ v0.0.1          │
╘════════╧═══════════╧════════════════════╧═════════════════╛
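
As an illustration of the two tag forms mentioned above (a version tag such as “rf@v0.0.1” and a promotion tag such as “rf#prod-5”), here is a rough Python sketch of how such tag names could be told apart. The names `ParsedTag` and `parse_tag` are hypothetical, and the exact grammar GTO accepts may differ; the patterns are inferred from the two examples only.

```python
import re
from typing import NamedTuple, Optional


class ParsedTag(NamedTuple):
    name: str
    kind: str                # "version" or "promotion"
    version: Optional[str]   # set for version tags, e.g. "v0.0.1"
    stage: Optional[str]     # set for promotion tags, e.g. "prod"
    number: Optional[int]    # trailing counter of a promotion tag, e.g. 5


# Patterns inferred from the examples "rf@v0.0.1" and "rf#prod-5" (assumption).
VERSION = re.compile(r"^(?P<name>[^@#]+)@(?P<version>v\d+\.\d+\.\d+)$")
PROMOTION = re.compile(r"^(?P<name>[^@#]+)#(?P<stage>[^@#]+)-(?P<number>\d+)$")


def parse_tag(tag):
    """Classify a git tag name; return None for tags that fit neither form."""
    match = VERSION.match(tag)
    if match:
        return ParsedTag(match["name"], "version", match["version"], None, None)
    match = PROMOTION.match(tag)
    if match:
        return ParsedTag(
            match["name"], "promotion", None, match["stage"], int(match["number"])
        )
    return None
```

Tags that match neither pattern (for example ordinary release tags) are simply ignored, so they can coexist with these tags in the same repo.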

There could be artifacts present in “artifacts.yaml” only, though they’re not registered. You could find them if you add the “--discover” flag:

$ gto show --discover
╒══════════╤═══════════╤════════════════════╤═════════════════╕
│ name     │ version   │ stage/production   │ stage/staging   │
╞══════════╪═══════════╪════════════════════╪═════════════════╡
│ rf       │ v1.2.4    │ v1.2.4             │ -               │
│ nn       │ v0.0.1    │ -                  │ v0.0.1          │
│ features │ -         │ -                  │ -               │
╘══════════╧═══════════╧════════════════════╧═════════════════╛

By default, discovery only happens on HEAD. You could also add “--all-commits” or “--all-branches” to check other commits.
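
The difference between the two tables above can be sketched in a few lines of Python. This is only an illustration under assumptions: `visible_artifacts` is a hypothetical name, the inputs are plain lists of artifact names (those that have git tags and those listed in artifacts.yaml at HEAD), and it ignores “--all-commits” and “--all-branches”.

```python
def visible_artifacts(tagged_names, yaml_names, discover=False):
    """Names to list: registered (tagged) artifacts, plus yaml-only ones on request.

    Hypothetical helper, not GTO's API. Order is preserved, duplicates dropped.
    """
    candidates = list(tagged_names)
    if discover:
        candidates += list(yaml_names)
    # dict.fromkeys keeps first-seen order while removing duplicates
    return list(dict.fromkeys(candidates))
```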

Show artifact

You can also see all versions/promotions of the artifact.

$ gto show nn
╒════════════╤════════╤═════════╤═════════════════════╤═══════════════════╤═════════════════╤══════════════╤═══════════════╕
│ artifact   │ name   │ stage   │ creation_date       │ author            │ commit_hexsha   │ discovered   │ enrichments   │
╞════════════╪════════╪═════════╪═════════════════════╪═══════════════════╪═════════════════╪══════════════╪═══════════════╡
│ nn         │ v0.0.1 │ staging │ 2022-04-11 10:39:30 │ Alexander Guschin │ ecbe85e         │ False        │ ['gto']       │
╘════════════╧════════╧═════════╧═════════════════════╧═══════════════════╧═════════════════╧══════════════╧═══════════════╛

If the artifact is not registered, you’ll see an error (I’ll fix the format of the error later):

$ gto show features
gto.exceptions.ArtifactNotFound: Cannot find artifact 'features'

But you can discover it:

$ gto show features --discover
╒════════════╤══════════════════════════════════════════╤═════════╤═════════════════════╤═══════════════════╤═════════════════╤══════════════╤═══════════════╕
│ artifact   │ name                                     │ stage   │ creation_date       │ author            │ commit_hexsha   │ discovered   │ enrichments   │
╞════════════╪══════════════════════════════════════════╪═════════╪═════════════════════╪═══════════════════╪═════════════════╪══════════════╪═══════════════╡
│ features   │ c87eeed5e15d479dfcf72e5baa7f432f67c0e23a │ -       │ 2022-04-11 10:39:31 │ Alexander Guschin │ c87eeed         │ True         │ ['gto']       │
╘════════════╧══════════════════════════════════════════╧═════════╧═════════════════════╧═══════════════════╧═════════════════╧══════════════╧═══════════════╛

$ gto show features --discover --all-commits
╒════════════╤══════════════════════════════════════════╤═════════╤═════════════════════╤═══════════════════╤═════════════════╤══════════════╤═══════════════╕
│ artifact   │ name                                     │ stage   │ creation_date       │ author            │ commit_hexsha   │ discovered   │ enrichments   │
╞════════════╪══════════════════════════════════════════╪═════════╪═════════════════════╪═══════════════════╪═════════════════╪══════════════╪═══════════════╡
│ features   │ ecbe85ebdc6a2d0957bb40ef0190e0660eea2c0c │ -       │ 2022-04-11 10:39:29 │ Alexander Guschin │ ecbe85e         │ True         │ ['gto']       │
│ features   │ c87eeed5e15d479dfcf72e5baa7f432f67c0e23a │ -       │ 2022-04-11 10:39:31 │ Alexander Guschin │ c87eeed         │ True         │ ['gto']       │
╘════════════╧══════════════════════════════════════════╧═════════╧═════════════════════╧═══════════════════╧═════════════════╧══════════════╧═══════════════╛

History

I’ve deprecated the audit command because it duplicated history, and the latter is a superset of the former. History works in a similar way: it finds all actions that are related to the artifact. For “features” the only action is “commit”, because there are only enrichments for “features” in “artifacts.yaml” and no git tags.

$ gto history features
No history found

$ gto history features --discover
╒═════════════════════╤══════════╤═════════════════════╤══════════╤═══════════════════╕
│ timestamp           │ name     │ event               │ commit   │ author            │
╞═════════════════════╪══════════╪═════════════════════╪══════════╪═══════════════════╡
│ 2022-04-11 10:39:31 │ features │ commit [discovered] │ c87eeed  │ Alexander Guschin │
╘═════════════════════╧══════════╧═════════════════════╧══════════╧═══════════════════╛

$ gto history features --discover --all-commits
╒═════════════════════╤══════════╤═════════════════════╤══════════╤═══════════════════╕
│ timestamp           │ name     │ event               │ commit   │ author            │
╞═════════════════════╪══════════╪═════════════════════╪══════════╪═══════════════════╡
│ 2022-04-11 10:39:29 │ features │ commit [discovered] │ ecbe85e  │ Alexander Guschin │
│ 2022-04-11 10:39:31 │ features │ commit [discovered] │ c87eeed  │ Alexander Guschin │
╘═════════════════════╧══════════╧═════════════════════╧══════════╧═══════════════════╛

Let’s see one more example. “nn” has a single version that was promoted to “staging”. Because of that, in history we’ll see three events: the artifact was committed, then registered, then promoted. If we add “--discover”, we’ll also see another commit. In that commit “nn” is present in “artifacts.yaml”.

$ gto history nn
╒═════════════════════╤════════╤══════════════╤═══════════╤═════════╤══════════╤═══════════════════╕
│ timestamp           │ name   │ event        │ version   │ stage   │ commit   │ author            │
╞═════════════════════╪════════╪══════════════╪═══════════╪═════════╪══════════╪═══════════════════╡
│ 2022-04-11 10:39:29 │ nn     │ commit       │ -         │ -       │ ecbe85e  │ Alexander Guschin │
│ 2022-04-11 10:39:30 │ nn     │ registration │ v0.0.1    │ -       │ ecbe85e  │ Alexander Guschin │
│ 2022-04-11 10:39:31 │ nn     │ promotion    │ v0.0.1    │ staging │ ecbe85e  │ Alexander Guschin │
╘═════════════════════╧════════╧══════════════╧═══════════╧═════════╧══════════╧═══════════════════╛
$ gto history nn --discover
╒═════════════════════╤════════╤═════════════════════╤═══════════╤═════════╤══════════╤═══════════════════╕
│ timestamp           │ name   │ event               │ version   │ stage   │ commit   │ author            │
╞═════════════════════╪════════╪═════════════════════╪═══════════╪═════════╪══════════╪═══════════════════╡
│ 2022-04-11 10:39:29 │ nn     │ commit              │ -         │ -       │ ecbe85e  │ Alexander Guschin │
│ 2022-04-11 10:39:30 │ nn     │ registration        │ v0.0.1    │ -       │ ecbe85e  │ Alexander Guschin │
│ 2022-04-11 10:39:31 │ nn     │ commit [discovered] │ -         │ -       │ c87eeed  │ Alexander Guschin │
│ 2022-04-11 10:39:31 │ nn     │ promotion           │ v0.0.1    │ staging │ ecbe85e  │ Alexander Guschin │
╘═════════════════════╧════════╧═════════════════════╧═══════════╧═════════╧══════════╧═══════════════════╛
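
To make the history behaviour above concrete, here is a minimal Python sketch, assuming events have already been collected from tags and commits. `Event` and `history` are hypothetical names rather than GTO's API, and the sketch leaves out the HEAD versus “--all-commits” distinction.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: datetime
    name: str
    event: str                # "commit", "registration" or "promotion"
    commit: str
    discovered: bool = False  # True if found only via artifacts.yaml


def history(events, name, discover=False):
    """Events for one artifact, oldest first.

    Events marked as discovered are included only when `discover` is set.
    """
    selected = [
        e for e in events
        if e.name == name and (discover or not e.discovered)
    ]
    # sorted() is stable, so events with equal timestamps keep their input order
    return sorted(selected, key=lambda e: e.timestamp)
```

With this model, an artifact that exists only in artifacts.yaml has an empty history unless discovery is enabled, which corresponds to the "No history found" output for “features”.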

I’ve implemented this, roughly. At least commands print what they have to. Please check out #108 for examples and provide your feedback. @dmpetrov @shcheklein @omesser @dberenbaum

Yes, I got it. It still forces us into some conventions (the default name “model” just works, it becomes “special”), so it becomes more like Rails. Probably not critical. And most importantly, it won’t complicate the implementation significantly unless I’m missing something, so we should be good.

Another way of thinking about it: initially you have a virtual, not yet created artifacts.yaml with the default model “model” (and dataset “dataset”), and it just works. If you need more, you create artifacts.yaml with more attributes for “model” or new models (like “m1”, “nn”).

Repo pretty much defines the model in this case. And path/name are not that important to me.

Right! It is for model-per-repo use case when there is nothing to distinguish with path.

I was asking about this special “model” tag support. Instead of providing a path (s3, local, etc) I can specify only repo.

Yes, but there is no “instead”. If you are in the repo (CLI), then you just promote the “model”. If you are outside the repo, you need to specify the repo anyway.

Okay, I understand it better now. So we reserve only one specific name, model, and it won’t describe anything at all: no name, no URL, nothing, only its version + stage (prod, staging, etc.)?

So (I’m guessing) this is the use case where, let’s say, I have a repo with a single model that is already being deployed, I have already hard-coded its path in CI/CD, etc., and I know that it’s a single model per repo, so I want to see it in Studio in the form of repo_url:model and be able to promote it to different stages.

Repo pretty much defines the model in this case. And path/name are not that important to me.

If you are talking about status changes - yes, these are tag based actions (in the current and the previous design).

I was asking about this special “model” tag support. Instead of providing a path (s3, local, etc) I can specify only repo.

could someone describe what exactly these tags will look like now

The user experience is the same. Technically the change is a bit bigger since it requires decoupling tags and artifacts.yaml.

The changes from user and API point of view:

  1. User will be able to create a model without pointing to a specific file/url (so, no artifacts.yaml creation/edits).
  2. User can create “simple” statuses without increments - prod, not only prod-N
  3. (1) & (2) lead to changes in command defaults.
  4. At the same time there are quite a few API changes based on feedback - not related to (1) & (2).

how will it affect Studio

It is the same as it was, but in the simple use case we are avoiding artifacts.yaml edits.

can we for now implement the tags-only approach in Studio and not do the file at all

If you are talking about status changes - yes, these are tag based actions (in the current and the previous design).

it doesn’t feel though it’s worth doing this, at least now.

Unfortunately, I don’t see how we can realistically introduce a widely used experience with the existing opinionated approach, where users (and the tool) have to deal with the combination of tags and files.

This is a core part of the product. By skipping this we are blocking the lightweight scenarios until next big redesign - a new product basically.

  • the biggest concern - we are making it opinionated / conventions based vs explicit - it’s not what we usually do (we are not rails).

I agree with the motivation. However, this proposal goes rather in the opposite direction. We are extracting the opinionated part (artifacts.yaml) as an addition (or enrichment) while focusing on the less opinionated part (tags). So this is a step from an opinionated solution to a more general one. This is what we usually do.

btw, if we go this direction - can we make an artifact type (model, dataset, etc) also as tag?

Right, the types are supported even without artifacts.yaml. Technically, the type is a tag prefix: model-v0.01, dataset-v1.3.0. With enrichment, the tag prefix is an artifact name/alias, as it was before.

Also in this case we can simplify the implementation by making it either purely tag based or using the file. Do not even try to consolidate and do not allow mixing things.

Right, we are basically making it tag based. artifacts.yaml becomes an “enrichment”.

UX - I would now be confused…

@omesser we are not dropping off any use cases from the existing functionality.

This proposal is only about introducing a bit more lightweight user flow. It requires the separation between the git tag (lightweight experience) and artifacts.yaml (our opinionated/advanced experience). See: Record in artifacts.yaml probably should become another enrichment.

Is gto just a tool to create git tags?

Yes 😄 but only for simple/lightweight scenarios.

@omesser

Tagging/registering a version for an artifact that was never added to the registry creates a vagueness - what is being tracked here and why?

We still track artifacts. E.g. gto show will show:

  1. Artifacts from artifacts.yaml
  2. Artifacts registered with git tags.

This actually confuses me also. Consider this example:

$ gto register model
$ gto register myartifact
$ gto add model myartifact path/to/it.h5
$ gto add dataset mydata other/path/to.data
$ git add artifacts.yaml && git commit -m "committed new artifacts"

There are three artifacts now: model exists in git tags only, myartifact in both git tags and artifacts.yaml, and mydata in artifacts.yaml only.

Now the --rev param in gto list behaves somewhat strangely, because you may have artifacts that are registered by tags, so you should output (1) what’s registered with git tags and (2) what’s in artifacts.yaml at --rev. As a result, (1) virtually exists in all commits, while (2) the artifacts.yaml content may differ across commits.

In fact, the previous dichotomy we had (gto ls --rev works on a specific revision, gto show works on the whole repo) no longer makes sense. gto ls becomes roughly the same as gto show, IIUC.

What does the registry contain and how can I see/control its contents, this should be simple and clear imo, one way to add and remove (or register/deregister, depending on naming).

In the concept we’re discussing, to deregister the artifact, you need to remove ALL git tags for it and make sure it is not listed in artifacts.yaml in HEADs of all branches. (Alternative for “remove all git tags” is to create some “stop-tag”, something like model@deprecated).
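
The deregistration condition described above can be written down as a small Python check. This is a sketch under assumptions, not GTO behaviour: `is_deregistered` is a hypothetical name, tags are assumed to start with `<name>@` or `<name>#`, and the contents of artifacts.yaml at each branch HEAD are passed in as a plain mapping from branch name to artifact names. The “stop-tag” alternative is not modelled.

```python
def is_deregistered(name, tags, yaml_names_by_branch):
    """True if `name` has no git tags and is absent from artifacts.yaml
    at the HEAD of every branch (hypothetical inputs, see lead-in)."""
    has_tags = any(tag.startswith((f"{name}@", f"{name}#")) for tag in tags)
    listed = any(name in names for names in yaml_names_by_branch.values())
    return not has_tags and not listed
```

For example, an artifact that has lost all its tags but is still listed in artifacts.yaml on one feature branch would still count as present.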

Is gto just a tool to create git tags?

In this concept it also parses all tags and makes sense out of them. Basically, it does “group by” tag prefix (==name). Also it allows you to add “GTO enrichments” that connect name with path, and then you can add MLEM/DVC enrichments too.
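
The “group by” tag prefix idea can be sketched in Python as follows. The function name `group_tags_by_name` is hypothetical, and the `@` and `#` separators are assumptions taken from the tag examples earlier in this thread (“rf@v0.0.1”, “rf#prod-5”); tags containing neither are skipped.

```python
import re
from collections import defaultdict


def group_tags_by_name(tags):
    """Group git tag names by the artifact name in front of '@' or '#'."""
    groups = defaultdict(list)
    for tag in tags:
        parts = re.split(r"[@#]", tag, maxsplit=1)
        if len(parts) == 2 and parts[0]:
            groups[parts[0]].append(tag)
    return dict(groups)
```

Each resulting group is what the tool would then interpret as one artifact with its versions and promotions.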

One more thought. gto history now can show only registrations and promotions. If a record in artifacts.yaml exists, does it mean that the artifact exists? Maybe you forgot to delete it. Record in artifacts.yaml probably should become another enrichment. So you can parse the commits, and when the model is there in artifacts.yaml, you can check other enrichments in that commit to further enrich the artifact.

$ gto history myartifact
ACTION        COMMIT   VERSION STAGE             ENRICHMENT
registration  commit1  v0.0.1
promotion     commit2  v0.0.1  production-ready
enrichment    commit3                            GTO
enrichment    commit4                            GTO, MLEM, DVC
registration  commit5  v0.0.2

To sum this up, IMO we’re sacrificing the integrity of the experience but gaining a gentler learning curve. That is, we have some prerequisites that make sense and make the experience in complex cases more logical and consistent, and we can discard them to allow easier scenarios with GTO (that don’t require GTO TBH) that can steadily evolve into something more complex (and where GTO can provide more value).

@aguschin UX - I would now be confused. Tagging/registering a version for an artifact that was never added to the registry creates a vagueness - what is being tracked here and why? What does the registry contain and how can I see/control its contents, this should be simple and clear imo, one way to add and remove (or register/deregister, depending on naming). Is gto just a tool to create git tags? What is the simple mechanic of adding an artifact - is it add or is it register? Is the basic resource artifact or artifact version? 🤔