dvc: plots diff: don't make one plot break all the others
In https://github.com/iterative/vscode-dvc/issues/3222, one mis-configured plot for one revision breaks the entire dvc plots diff
command. in #5984, we made it so that an invalid dvc.yaml
doesn’t break other revisions in dvc plots diff
. However, even if the dvc.yaml
is valid, a single revision or plot within one revision can break dvc plots diff
if there are other errors encountered while rendering one of the plots.
To do:
- Skip revisions if any errors are encountered rendering plots for that revision.
- Skip only the specific plots within a revision that fail to render.
Related: #7787
Edit: The problem occurs because the workspace dvc.yaml
config is applied to all other revisions, and if the data structure has changed, it may be impossible to render an old revision with the current dvc.yaml
config.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 19 (19 by maintainers)
@pmrowla Makes sense and is simpler, although I have a couple hesitations:
I’m still not sure those outweigh the simplicity of always using the workspace but wanted to raise them for discussion.
@pmrowla mentioned that it was discussed whether we should be merging all the plots properties. I looked through some of the old issues and PRs and can’t find a good rationale for why we do this.
It does seem like it would be much simpler to use a single set of properties (either the leftmost revision or workspace in cases where it’s implied). Seems like that would resolve some related issues like #9188 (since it would not try to render the old custom template) and #7193.
Edit: Meant to reference #7913, not #7193.
We merge all of the plot props, so it’s not just limited to workspace dvc.yaml file it seems. https://github.com/iterative/dvc/blob/37761a2627c1e41c35c03e55cd5284d2f4f1386d/dvc/render/match.py#L19
Discussed at length in the VS Code planning call this morning. The problem boils down to this:
Changes made to a template can lead to changes in the pre-processed datapoints that we receive from plots diff. This makes the datapoints mutable as opposed to immutable. We have the same problem with errors being mutable.
We can start by always calling plots diff with the workspace revision and invalidating the cache whenever any
dvc.yaml
is changed but I do not think this will be enough to keep everything up to date.Let’s move forward with always using the workspace config as @pmrowla suggested.
@mattseddon WDYT about trying to always call
plots diff
with all the revisions now that performance is better in DVC, and we can follow up if caching is needed?I would say that even when
workspace
is not implied it still makes sense to be using the workspace config (and not the leftmost revision). If we are using the leftmost revision’s config, the user can never modify plots unless they also include the workspace data. Even if the user adds a new template, or modifies/fixes something in an existing template, they would not be able to populate it with purely historical data (they would have to also includeworkspace
data).This comment is also related to https://github.com/iterative/dvc/issues/8786.
The extension caches revision data (images + datapoints) which means that we should only call
plots diff
once for each revision (exceptworkspace
). It is looking less likely that this approach (caching on the extension side) actually works well. For example, after reading this from https://github.com/iterative/dvc/issues/9025#issuecomment-1472158045It seems like revision data is actually mutable depending on the combination of revisions passed to
plots diff
. This becomes even more error-prone when we throwFileNotFound
errors into the mix and whether or not these errors are important/should be displayed. In order to cache data on the VS Code side we need revisions to be static.So after a fair bit of deliberation triggered by the ongoing work with errors I think that the extension should only attempt to display templates/images that are available in the workspace’s
dvc.yaml
s. Everything else should be disregarded. Old(er) revision should be applied to the current workspace templates and that’s it. The user won’t lose much but this reduces/limits complexity greatly.Even doing this we will still get edge cases where the workspace’s
dvc.yaml
s get updated and we have to “refresh” all of the cached revisions on the extension side.If that means every time we call
plots diff
we call it with theworkspace
as the leftmost argument then no worries.Ideally, we would get
plots diff
performance to a place where we can call it with up to 7 revisions and it returns in a timely enough manner to drop all of the caching (mess) that is currently in the extension (will mention this in https://github.com/iterative/dvc/issues/8786 too).Please LMK what you think.
I would say, let’s agree on expected behavior and implement it.
To clarify, when I said:
I meant that it didn’t sound like what I thought the code was doing (because of https://github.com/iterative/dvc/issues/9025#issuecomment-1441609600), not that I was strongly against the idea of having a single source of truth for properties.
It does not merge in terms of recency, but in the order of revisions passed, so the revision on the left side is preferred (except
workspace
where it always gets higher priority).I guess the arguments are usually passed in the order of latest to oldest, so in that case, it will prefer recent revisions.