scikit-learn: Run more examples that do not start with plot_ on CircleCI

From https://github.com/scikit-learn/scikit-learn/pull/8847#issuecomment-300015905:

and we should have a CI test for non-plotted examples or convert as many as possible to plots

My proposal is to adopt a naming convention such as run_ for examples that do not produce any plots. sphinx-gallery accepts a regex that specifies which examples to run, so the pattern could be something like plot_|run_. See the doc for more details.
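To make the proposal concrete, here is a minimal sketch of how such a pattern would select files. It assumes the pattern is set through the `filename_pattern` key of `sphinx_gallery_conf` and that it is applied to the file path with a regex search. The `run_` file name and the `plot_something.py` path below are hypothetical, used only for illustration.

```python
import re

# Assumed pattern: a path separator followed by plot_ or run_, so that only
# the start of the file name is matched (POSIX-style "/" separators assumed).
FILENAME_PATTERN = r"/(plot_|run_)"

# Roughly how this might look in doc/conf.py (all other keys omitted).
sphinx_gallery_conf = {"filename_pattern": FILENAME_PATTERN}


def would_run(path):
    """Return True if the path matches the configured filename pattern."""
    return re.search(sphinx_gallery_conf["filename_pattern"], path) is not None


candidates = [
    "examples/some_dir/plot_something.py",       # hypothetical plot example
    "examples/text/run_document_clustering.py",  # hypothetical rename to run_
    "examples/applications/svm_gui.py",          # no prefix, so it is skipped
]

selected = [path for path in candidates if would_run(path)]
for path in selected:
    print(path)
```

Anchoring the pattern on the separator avoids accidentally matching names that merely contain `run_` or `plot_` somewhere in the middle.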

I looked at the examples whose filename does not start with plot_. Timings are in seconds, listed in increasing order.

examples/feature_selection/feature_selection_pipeline.py 1.39
examples/exercises/digits_classification_exercise.py 1.47
examples/applications/svm_gui.py 1.86
examples/missing_values.py 2.01
examples/model_selection/randomized_search.py 2.02
examples/feature_stacker.py 2.14
examples/text/document_clustering.py 3.21
examples/linear_model/lasso_dense_vs_sparse_data.py 3.98
examples/text/hashing_vs_dict_vectorizer.py 4.78
examples/model_selection/grid_search_digits.py 8.29
examples/text/document_classification_20newsgroups.py 8.93
examples/applications/topics_extraction_with_nmf_lda.py 10.53
examples/applications/face_recognition.py 25.02
examples/bicluster/bicluster_newsgroups.py 25.72
examples/hetero_feature_union.py 116.22
examples/applications/wikipedia_principal_eigenvector.py 139.77
examples/model_selection/grid_search_text_feature_extraction.py 156.86

With this in mind, I would be in favour of running all the examples except svm_gui.py and the last three.

More details: svm_gui.py pops up a GUI, so it should probably not be run. Whether we should run wikipedia_principal_eigenvector.py and grid_search_text_feature_extraction.py, which each take more than 2 minutes, is up for debate. On top of that, some of them may require a data download that does not use the usual ~/scikit_learn_data directory (e.g. the Wikipedia one). If that is the case, these examples would not benefit from the CircleCI cache.
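The selection described above can be sketched as a simple filter over the measured timings. The 60-second budget is an assumption chosen only so that the three slowest examples fall out; the thread does not propose a specific threshold.

```python
# Timings in seconds, copied from the list above.
TIMINGS = [
    ("examples/feature_selection/feature_selection_pipeline.py", 1.39),
    ("examples/exercises/digits_classification_exercise.py", 1.47),
    ("examples/applications/svm_gui.py", 1.86),
    ("examples/missing_values.py", 2.01),
    ("examples/model_selection/randomized_search.py", 2.02),
    ("examples/feature_stacker.py", 2.14),
    ("examples/text/document_clustering.py", 3.21),
    ("examples/linear_model/lasso_dense_vs_sparse_data.py", 3.98),
    ("examples/text/hashing_vs_dict_vectorizer.py", 4.78),
    ("examples/model_selection/grid_search_digits.py", 8.29),
    ("examples/text/document_classification_20newsgroups.py", 8.93),
    ("examples/applications/topics_extraction_with_nmf_lda.py", 10.53),
    ("examples/applications/face_recognition.py", 25.02),
    ("examples/bicluster/bicluster_newsgroups.py", 25.72),
    ("examples/hetero_feature_union.py", 116.22),
    ("examples/applications/wikipedia_principal_eigenvector.py", 139.77),
    ("examples/model_selection/grid_search_text_feature_extraction.py", 156.86),
]

# Examples that must never run in CI (svm_gui.py opens a GUI).
EXCLUDED = {"examples/applications/svm_gui.py"}

# Hypothetical per-example time budget in seconds (an assumption).
MAX_SECONDS = 60.0


def select_examples(timings, max_seconds, excluded):
    """Keep examples that are not excluded and fit within the time budget."""
    return [
        (path, seconds)
        for path, seconds in timings
        if path not in excluded and seconds <= max_seconds
    ]


selected = select_examples(TIMINGS, MAX_SECONDS, EXCLUDED)
total = sum(seconds for _, seconds in selected)
print(f"{len(selected)} examples, {total:.2f} s in total")
```

With these numbers the filter keeps 13 examples adding up to roughly 99 seconds, measured on the machine where the timings were taken rather than on CircleCI.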

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 1
  • Comments: 17 (17 by maintainers)

Most upvoted comments

There are still some examples that we don’t run (i.e. that don’t start with plot_):

❯ find examples -name '*.py' | grep -v plot_
examples/neighbors/approximate_nearest_neighbors.py
examples/applications/wikipedia_principal_eigenvector.py
examples/applications/svm_gui.py
examples/model_selection/grid_search_text_feature_extraction.py
  • examples/neighbors/approximate_nearest_neighbors.py is new (compared to ~4 years ago). I am guessing this is not run because it needs more dependencies, e.g. annoy and nmslib (available on conda-forge: python-annoy and nmslib). This was part of https://github.com/scikit-learn/scikit-learn/pull/10482 if more context is needed. This example takes ~3.5 minutes on my machine so maybe a bit too long to run in the CI …
  • examples/applications/wikipedia_principal_eigenvector.py: needs a closer look at how much time it would take in the CI (the ~2 minute timings in the top post were very likely from a run on my machine) and whether we can afford it
  • examples/model_selection/grid_search_text_feature_extraction.py: needs a closer look at how much time it would take in the CI (the ~2 minute timings in the top post were from a run on my machine) and whether we can afford it
  • examples/applications/svm_gui.py: we should not run it since it opens a GUI as noted above
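For reference, the listing above can also be produced from Python. This is a minimal sketch, assuming it is run from the repository root; the function name `non_plot_examples` is made up for illustration. Unlike `grep -v plot_`, which drops any path containing `plot_` anywhere, it only checks the start of the file name.

```python
from pathlib import Path


def non_plot_examples(root="examples"):
    """Return sorted .py paths under root whose file name lacks the plot_ prefix."""
    return sorted(
        path
        for path in Path(root).rglob("*.py")
        if not path.name.startswith("plot_")
    )


if __name__ == "__main__":
    for path in non_plot_examples():
        print(path)
```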

For more context on why this matters (at least a little bit):

From #8847 (comment):

and we should have a CI test for non-plotted examples or convert as
many as possible to plots

Often, examples are not named “plot_*” because they take a long time to run or require a large download. Back when we created them, we judged that the CI did not have enough horsepower to run them. Maybe we should indeed reconsider this decision, but first we need to evaluate our computing power in the CI.