gto: Mechanics change: `artifact.yaml` is not mandatory
While discussing GTO with @dmpetrov, we agreed that the goal should be to provide simple, git-compatible mechanics instead of complicated ones. In general, there shouldn’t be a steep initial learning curve. A Git user should be able to gradually transform their repo to support complicated scenarios, and shouldn’t be required to implement complex manipulations before their workflow justifies them.
One of the requirements now is to create `artifacts.yaml` and add an artifact to it. Currently this is the way to “register” the artifact. To simplify the workflow, we could lift this limitation. Then `artifacts.yaml` is not a prerequisite, but just a way to “enrich” your artifact with GTO.
We initially introduced `artifacts.yaml` because if you set `VERSION_BASE=commit`, there is no way to know the name of your artifact at all, as you don’t create git tags. But as we decided to support tags-only approaches now, this limitation could be temporarily removed.
So now the user flow will look like this:
# starter option, no artifacts.yaml is there
gto register myartifact # adds git tag myartifact@v0.0.1 -- NOT shown in Studio MR
gto register model # adds model@v0.0.1 -- Shown in Studio MR. This is a special convention for a single name only ("model").
# now you can add GTO "enrichment" and specify it's a model
gto add model myartifact path/to.h5 # creates artifacts.yaml, now is shown in Studio, because we know it's a "model" TYPE
gto register myartifact # adds git tag myartifact@v0.0.2 -- we trace the history back and now the latest version was 0.0.1
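The "trace the history back" step in the flow above can be sketched in Python. This is only an illustration under stated assumptions, not GTO's actual implementation: the tag format `<name>@v<major>.<minor>.<patch>` is taken from the examples above, the helper `next_version_tag` is hypothetical, and bumping the patch number is assumed because the example goes from 0.0.1 to 0.0.2.

```python
import re

# Assumed tag format, taken from the examples above: "<name>@v<major>.<minor>.<patch>"
VERSION_TAG = re.compile(
    r"^(?P<name>[^@#]+)@v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


def next_version_tag(name, existing_tags):
    """Hypothetical helper: return the tag that registering `name` would create."""
    versions = []
    for tag in existing_tags:
        match = VERSION_TAG.match(tag)
        if match and match.group("name") == name:
            versions.append(
                tuple(int(match.group(part)) for part in ("major", "minor", "patch"))
            )
    if not versions:
        # No version tags for this name yet: start from the first version.
        return f"{name}@v0.0.1"
    major, minor, patch = max(versions)
    # Assumption: a plain "register" bumps the patch number.
    return f"{name}@v{major}.{minor}.{patch + 1}"


tags = ["myartifact@v0.0.1", "model@v0.0.1"]
print(next_version_tag("myartifact", tags))  # myartifact@v0.0.2
print(next_version_tag("dataset", tags))     # dataset@v0.0.1
```

Note that nothing here reads `artifacts.yaml`: the name and the latest version are recovered from tags alone, which is the point of the tags-only flow.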
This also makes the question “How to deprecate a model” from #93 somewhat different. CC @omesser @shcheklein
About this issue
- State: closed
- Created 2 years ago
- Comments: 21 (21 by maintainers)
I had a brief chat with @dberenbaum about this. His feedback was (please correct me if I misunderstood):
My understanding: for (2) we have a good transitioning path; for (1) - we are not optimizing for this flow right now (it stays on the user level convention with no automation - hopefully we can do better 😃 ).
I’ve implemented what we have been discussing. Here’s how it works. To reproduce, run `pytest --basetemp=pytest-cache && cd pytest-cache/test_api0/`.
Please let me know if this workflow looks clear and explicable to you. @shcheklein @dmpetrov @omesser @dberenbaum
Show registry
By default you can access only “registered” artifacts. To “register” an artifact, you need to create a git tag for it: create a version (e.g. “rf@v0.0.1”) or promote it (e.g. “rf#prod-5”).
There could be artifacts that are present only in “artifacts.yaml” and are not registered. You can find them by adding the “--discover” flag:
By default, discovery only happens on HEAD. You can also add “--all-commits” or “--all-branches” to check other commits.
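The difference between the default view and the discover view can be sketched as follows. This is a simplified model, not the real command: the function names are hypothetical, tags and `artifacts.yaml` names are passed in as plain lists, and the tag shapes are assumed from the “rf@v0.0.1” and “rf#prod-5” examples above.

```python
import re

# Assumed tag shapes: "<name>@v1.2.3" (version) or "<name>#<stage>..." (promotion)
TAG = re.compile(r"^(?P<name>[^@#]+)(@v\d+\.\d+\.\d+|#.+)$")


def registered_names(tags):
    """Names that have at least one version or promotion tag."""
    names = set()
    for tag in tags:
        match = TAG.match(tag)
        if match:
            names.add(match.group("name"))
    return names


def show_registry(tags, artifacts_yaml_names, discover=False):
    """Hypothetical model of the listing: tags only, plus the file when discovering."""
    names = registered_names(tags)
    if discover:
        # Discovery adds artifacts that exist only as records in artifacts.yaml.
        names |= set(artifacts_yaml_names)
    return sorted(names)


tags = ["rf@v0.0.1", "rf#prod-5", "nn@v0.0.1"]
in_yaml = ["rf", "features"]
print(show_registry(tags, in_yaml))                 # ['nn', 'rf']
print(show_registry(tags, in_yaml, discover=True))  # ['features', 'nn', 'rf']
```

Here `in_yaml` stands for the content of the file at HEAD; checking other commits or branches would mean calling the same function with the names found at those revisions.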
Show artifact
You can also see all versions/promotions of the artifact.
If the artifact is not registered, you’ll see an error (I’ll fix the format of error later)
But you can discover it:
History
I’ve deprecated the `audit` cmd because it duplicated `history`, and the latter is a superset of the former. History works in a similar way: it finds all actions that are related to the artifact. For “features” the only action is “commit”, because there are only enrichments for “features” in “artifacts.yaml” and no git tags.
Let’s see one more example. “nn” has a single version that was promoted to “staging”. Because of that, in history we’ll see three events: the artifact was committed, then registered, then promoted. If we add “--discover”, we’ll also see another commit. In that commit “nn” is present in “artifacts.yaml”.
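A rough way to picture the event list described above is to merge commit, registration and promotion events and sort them by time. The sketch below is an assumption-heavy illustration: the `Event` class, the `history` function and the input shapes are hypothetical, the tag “nn#staging-1” is made up by analogy with “rf#prod-5”, and the difference between the default and “--discover” output is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int  # e.g. seconds since epoch of the commit or tag
    kind: str       # "commit", "registration" or "promotion"
    ref: str        # commit sha or tag name


def history(name, commits, tags):
    """Collect all events for one artifact, oldest first.

    commits: list of (timestamp, sha, names_listed_in_artifacts_yaml)
    tags:    list of (timestamp, tag_name)
    """
    events = []
    for timestamp, sha, names in commits:
        if name in names:
            events.append(Event(timestamp, "commit", sha))
    for timestamp, tag in tags:
        if tag.startswith(f"{name}@"):
            events.append(Event(timestamp, "registration", tag))
        elif tag.startswith(f"{name}#"):
            events.append(Event(timestamp, "promotion", tag))
    return sorted(events, key=lambda event: event.timestamp)


commits = [(100, "abc123", {"nn", "features"})]
tags = [(110, "nn@v0.0.1"), (120, "nn#staging-1")]
print([e.kind for e in history("nn", commits, tags)])
# ['commit', 'registration', 'promotion']
print([e.kind for e in history("features", commits, tags)])
# ['commit']
```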
I’ve implemented this, roughly. At least commands print what they have to. Please check out #108 for examples and provide your feedback. @dmpetrov @shcheklein @omesser @dberenbaum
Yes, I got it. It still forces us into some conventions (the default name “model” just works, it becomes “special”), so it becomes more like Rails. Probably not critical. And most importantly, it won’t complicate the implementation significantly unless I’m missing something, so we should be good.
Another way of thinking about it - initially you have a virtual, not yet created artifacts.yaml with the default model “model” (and dataset - “dataset”) and it just works. If you need more - you create artifacts.yaml with more attributes for “model” or new models (like “m1”, “nn”).
Right! It is for the model-per-repo use case, when there is nothing to distinguish by path.
Yes, but there is no “instead”. If you are in the repo (CLI) then you just promote the “model”. If you are outside the repo - you need to specify the repo anyway.
Okay, I better understand it now. So, we reserve only one specific name, `model`, and it won’t describe anything at all - no name, no URL, nothing, only its version + stage (prod, staging, etc)?
So (I’m guessing), this is the use case where, let’s say, I have a repo with a single model that is already being deployed, I have already hard-coded its path in CI/CD, etc., and I know that it’s a single model per repo, so I want to see it in Studio in the form of `repo_url:model` and be able to promote it to different stages.
The repo pretty much defines the model in this case, and path/name are not that important to me.
I was asking about this special “model” tag support. Instead of providing a path (s3, local, etc) I can specify only repo.
The user experience is the same. Technically the change is a bit bigger since it requires decoupling tags and artifacts.yaml.
The changes from user and API point of view:
`prod`, not only `prod-N`
It is the same as it was, but in the simple use case we are avoiding `artifacts.yaml` edits.
If you are talking about status changes - yes, these are tag based actions (in the current and the previous design).
Unfortunately, I don’t see how we can realistically introduce a widely used experience with the existing opinionated approach, where users (and the tool) have to deal with a combination of tags and files.
This is a core part of the product. By skipping this we are blocking the lightweight scenarios until next big redesign - a new product basically.
I agree with the motivation. However, this proposal is a bit opposite. We are extracting the opinionated part (artifacts.yaml) as an addition (or enrichment) while focusing on the less opinionated part (tags). So, this is the step from opinionated solution to a more general one. This is what we usually do.
Right, the types are supported even without `artifacts.yaml`. Technically, the type is a tag prefix: `model-v0.01`, `dataset-v1.3.0`. With enrichment, the tag prefix is an artifact name\alias, as it was before.
Right, we are basically making it tag based. `artifacts.yaml` becomes an “enrichment”.
@omesser we are not dropping off any use cases from the existing functionality.
This proposal is only about introducing a slightly more lightweight user flow. It requires the separation between `git tag` (lightweight experience) and `artifacts.yaml` (our opinionated/advanced experience). See:
Record in artifacts.yaml probably should become another enrichment.
Yes 😄 but only for simple/lightweight scenarios.
@omesser
We still track artifacts. E.g. `gto show` will show:
`artifacts.yaml`
This actually confuses me also. Consider this example:
There are three artifacts now: `model` exists in git tags only, `myartifact` in both git tags and `artifacts.yaml`, and `my data` in `artifacts.yaml` only.
Now the `--rev` param in `gto list` behaves kind of strangely, because you may have artifacts that are registered by tags, so you should output (1) what’s registered with git tags and (2) what’s in `artifacts.yaml` at `--rev`. As a result, (1) virtually exists in all commits, while (2) the `artifacts.yaml` content may differ across commits.
Now in fact the previous dichotomy we had (`gto ls --rev` works on a specific revision, `gto show` works on the whole repo) doesn’t make sense. `gto ls` becomes roughly the same as `gto show`, IIUC.
In the concept we’re discussing, to deregister the artifact, you need to remove ALL git tags for it and make sure it is not listed in `artifacts.yaml` in the HEADs of all branches. (An alternative to “remove all git tags” is to create some “stop-tag”, something like `model@deprecated`.)
In this concept it also parses all tags and makes sense out of them. Basically, it does a “group by” on the tag prefix (== name). It also allows you to add “GTO enrichments” that connect `name` with `path`, and then you can add MLEM/DVC enrichments too.
One more thought. `gto history` can now show only registrations and promotions. If a record in `artifacts.yaml` exists, does it mean that the artifact exists? Maybe you forgot to delete it. A record in `artifacts.yaml` should probably become another enrichment. So, you can parse the commits, and when the model is there in `artifacts.yaml`, you can check other enrichments - to further enrich the artifact - in that commit.
To sum this up, IMO we’re sacrificing the integrity of the experience but gaining a gentler learning curve. E.g. we have some prerequisites that make sense and make the experience in complex cases more logical and consistent, and we can discard them to allow easier scenarios with GTO (that don’t require GTO TBH) that can steadily evolve into something more complex (and where GTO can provide more value).
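The deregistration rule from this comment (no tags left, or a stop-tag, and no record in `artifacts.yaml` at any branch HEAD) can be written down as a small check. This is a sketch of the rule as worded here, not of any existing GTO code: `is_registered` and its arguments are hypothetical, and the stop-tag form `<name>@deprecated` is only the “something like” suggestion from the comment.

```python
def is_registered(name, tags, yaml_names_by_head):
    """Hypothetical check: is `name` still visible in the registry?

    tags:               all git tag names in the repo
    yaml_names_by_head: {branch: set of names listed in artifacts.yaml at its HEAD}
    """
    stop_tag = f"{name}@deprecated"  # assumed stop-tag convention
    own_tags = [
        tag for tag in tags
        if tag.startswith(f"{name}@") or tag.startswith(f"{name}#")
    ]
    # Tags count only if there are some and no stop-tag has been created.
    tagged = bool(own_tags) and stop_tag not in own_tags
    # A record at the HEAD of any branch keeps the artifact discoverable.
    listed = any(name in names for names in yaml_names_by_head.values())
    return tagged or listed


tags = ["model@v0.0.1", "myartifact@v0.0.1"]
heads = {"main": {"myartifact", "my data"}}
print(is_registered("model", tags, heads))                         # True
print(is_registered("model", tags + ["model@deprecated"], heads))  # False
print(is_registered("my data", tags, heads))                       # True
```

The example uses the three artifacts from the comment: `model` lives in tags only, so a stop-tag alone deregisters it, while `myartifact` would also need its record removed from every branch HEAD.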
@aguschin UX - I would now be confused. Tagging/registering a version for an artifact that was never added to the registry creates a vagueness - what is being tracked here and why? What does the registry contain and how can I see/control its contents, this should be simple and clear imo, one way to add and remove (or register/deregister, depending on naming). Is gto just a tool to create git tags? What is the simple mechanic of adding an artifact - is it add or is it register? Is the basic resource artifact or artifact version? 🤔