pandas: DOC: fix EX03 errors in docstrings

pandas has a script for validating docstrings

https://github.com/pandas-dev/pandas/blob/b7e2202459eadc9dd599cbe58251ecc930798b97/ci/code_checks.sh#L72-L172

Currently, some methods fail the EX03 check.

The task here is:

  • take 2-4 methods
  • run: scripts/validate_docstrings.py --format=actions --errors=EX03 method-name
  • check whether docstring validation passes for those methods and, if necessary, fix the docstrings according to the error reported
  • remove those methods from code_checks.sh
  • commit, push, open pull request
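
The validation step above can be scripted for a handful of methods. This is a minimal sketch, assuming it is run from the root of a pandas checkout. The helper names `build_validate_command` and `validate_methods` are hypothetical; only the command line itself comes from the task list.

```python
import subprocess
import sys


def build_validate_command(method_name, error_code="EX03"):
    """Build the validation command from the task list for one method."""
    return [
        sys.executable,
        "scripts/validate_docstrings.py",
        "--format=actions",
        f"--errors={error_code}",
        method_name,
    ]


def validate_methods(method_names):
    """Run the validator once per method and collect its stdout.

    Assumes the current working directory is the root of a pandas checkout.
    """
    results = {}
    for name in method_names:
        completed = subprocess.run(
            build_validate_command(name),
            capture_output=True,
            text=True,
            check=False,  # do not raise on a non-zero exit status
        )
        results[name] = completed.stdout
    return results
```

Calling `validate_methods(["pandas.read_json", "pandas.DataFrame.to_sql"])` would then return one output string per method, which can be read before deciding what to fix.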

Please don’t comment “take”, as multiple people can work on this issue. You also don’t need to ask for permission to work on this; just comment with the methods you are going to work on.

If you’re a new contributor, please check the contributing guide.

thanks @MarcoGorelli for giving me the idea for this issue.

About this issue

  • Original URL
  • State: closed
  • Created 6 months ago
  • Reactions: 1
  • Comments: 66 (54 by maintainers)

Most upvoted comments

yeah that’s the fix - sorry if it wasn’t clear, it was more of an explanation for people that had trouble figuring out the lines affected.

Explanation of what to look for:

EX03 covers errors in the example code blocks of a function’s or method’s documentation.

for pandas.errors.SpecificationError the examples show:

Examples
--------
>>> df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
...                    'B': range(5),
...                    'C': range(5)})
>>> df.groupby('A').B.agg({'foo': 'count'}) # doctest: +SKIP
... # SpecificationError: nested renamer is not supported

>>> df.groupby('A').agg({'B': {'foo': ['sum', 'max']}}) # doctest: +SKIP
... # SpecificationError: nested renamer is not supported

>>> df.groupby('A').agg(['min', 'min']) # doctest: +SKIP
... # SpecificationError: nested renamer is not supported

line 4 here would be the 4th line in the examples which is >>> df.groupby('A').B.agg({'foo': 'count'}) # doctest: +SKIP

line 6 would be >>> df.groupby('A').agg({'B': {'foo': ['sum', 'max']}}) # doctest: +SKIP
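
The line numbering described above can be reproduced with the standard library’s doctest parser. This is a sketch under an assumption: that the reported line numbers count only the example source lines (prompts stripped, blank lines and expected output skipped), which is consistent with the explanation above. Whether the pandas script extracts examples in exactly this way is not confirmed here, and `number_example_lines` is a hypothetical helper.

```python
import doctest

DOCSTRING = r"""
Examples
--------
>>> df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
...                    'B': range(5),
...                    'C': range(5)})
>>> df.groupby('A').B.agg({'foo': 'count'}) # doctest: +SKIP
... # SpecificationError: nested renamer is not supported

>>> df.groupby('A').agg({'B': {'foo': ['sum', 'max']}}) # doctest: +SKIP
... # SpecificationError: nested renamer is not supported

>>> df.groupby('A').agg(['min', 'min']) # doctest: +SKIP
... # SpecificationError: nested renamer is not supported
"""


def number_example_lines(docstring):
    """Map 1-based line numbers to example source lines (prompts stripped)."""
    examples = doctest.DocTestParser().get_examples(docstring)
    source = "".join(example.source for example in examples)
    return dict(enumerate(source.splitlines(), start=1))


numbered = number_example_lines(DOCSTRING)
for line_number in (4, 6):
    print(line_number, numbered[line_number])
```

Under that assumption, lines 4 and 6 are the first two `groupby` calls, matching the lines named above; the blank lines between the examples are not counted.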

@jordan-d-murphy, I agree, seems we fixed all flake8 errors. Thank you for working on this issue with intensity and helping other contributors. Now, we can close this issue.

@jordan-d-murphy great I’ll take those!

I’ll take:

  • pandas.io.formats.style.Styler.to_latex
  • pandas.read_parquet

Okay! Makes sense. Hope the photo might help someone else then 🙂

I’ve opened a PR for the remaining 4 functions. I believe this will close this issue.

  • pandas.Series.plot.line
  • pandas.Series.to_sql
  • pandas.read_json
  • pandas.DataFrame.to_sql

I tried all of that and it still didn’t work. I’m going to stop working on it and find another issue to take on. Thanks for all the help!

hmmm okay, yes your approach seems correct, but when I ran this on the latest branch I’m seeing no EX03 errors for pandas.Series.plot.line

I’ve been using the following approach to set up my dev env and working branch before working on my PRs, which ensures my branch is up to date with the latest version of main.

can you try running these commands, and then try running your script again and see if it helps?


Updating the development environment

git checkout main
git merge upstream/main
mamba activate pandas-dev
mamba env update -f environment.yml --prune

Creating a feature branch

git checkout main
git pull upstream main --ff-only
git checkout -b shiny-new-feature

(NOTE: shiny-new-feature should be your new working branch name)


After running the above commands, running the following script scripts/validate_docstrings.py --format=actions --errors=EX03 pandas.Series.plot.line results in output that ends with this:

################################################################################
################################## Validation ##################################
################################################################################

1 Errors found for `pandas.Series.plot.line`:
	PR02	Unknown parameters {'color'}

script: scripts/validate_docstrings.py --format=actions --errors=EX03 pandas.Series.plot.line

Output:

################################################################################
##################### Docstring (pandas.Series.plot.line)  #####################
################################################################################

Plot Series or DataFrame as lines.

This function is useful to plot lines using DataFrame's values
as coordinates.

Parameters
----------
x : label or position, optional
    Allows plotting of one column versus another. If not specified,
    the index of the DataFrame is used.
y : label or position, optional
    Allows plotting of one column versus another. If not specified,
    all numerical columns are used.
color : str, array-like, or dict, optional
    The color for each of the DataFrame's columns. Possible values are:

    - A single color string referred to by name, RGB or RGBA code,
        for instance 'red' or '#a98d19'.

    - A sequence of color strings referred to by name, RGB or RGBA
        code, which will be used for each column recursively. For
        instance ['green','yellow'] each column's line will be filled in
        green or yellow, alternatively. If there is only a single column to
        be plotted, then only the first color from the color list will be
        used.

    - A dict of the form {column name : color}, so that each column will be
        colored accordingly. For example, if your columns are called `a` and
        `b`, then passing {'a': 'green', 'b': 'red'} will color lines for
        column `a` in green and lines for column `b` in red.

**kwargs
    Additional keyword arguments are documented in
    :meth:`DataFrame.plot`.

Returns
-------
matplotlib.axes.Axes or np.ndarray of them
    An ndarray is returned with one :class:`matplotlib.axes.Axes`
    per column when ``subplots=True``.

        See Also
        --------
        matplotlib.pyplot.plot : Plot y versus x as lines and/or markers.

        Examples
        --------

        .. plot::
            :context: close-figs

            >>> s = pd.Series([1, 3, 2])
            >>> s.plot.line()  # doctest: +SKIP

        .. plot::
            :context: close-figs

            The following example shows the populations for some animals
            over the years.

            >>> df = pd.DataFrame({
            ...    'pig': [20, 18, 489, 675, 1776],
            ...    'horse': [4, 25, 281, 600, 1900]
            ...    }, index=[1990, 1997, 2003, 2009, 2014])
            >>> lines = df.plot.line()

        .. plot::
           :context: close-figs

           An example with subplots, so an array of axes is returned.

           >>> axes = df.plot.line(subplots=True)
           >>> type(axes)
           <class 'numpy.ndarray'>

        .. plot::
           :context: close-figs

           Let's repeat the same example, but specifying colors for
           each column (in this case, for each animal).

           >>> axes = df.plot.line(
           ...     subplots=True, color={"pig": "pink", "horse": "#742802"}
           ... )

        .. plot::
            :context: close-figs

            The following example shows the relationship between both
            populations.

            >>> lines = df.plot.line(x='pig', y='horse')

################################################################################
################################## Validation ##################################
################################################################################

3 Errors found for `pandas.Series.plot.line`:
	PR02	Unknown parameters {'color'}
	EX03	flake8 error: line 4, col 4: E121 continuation line under-indented for hanging indent
	EX03	flake8 error: line 6, col 4: E123 closing bracket does not match indentation of opening bracket's line
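
For the two EX03 errors above, the flagged lines are the continuation lines of the `pd.DataFrame({...})` example, which use a 3-space indent and an indented closing brace. Below is a sketch of one layout that should satisfy both checks. `make_frame` is a hypothetical stand-in for `pd.DataFrame` so that the snippet runs without pandas; the layout is a suggestion, not necessarily the exact fix that was merged.

```python
def make_frame(data, index):
    """Hypothetical stand-in for pd.DataFrame so the sketch runs without pandas."""
    return {"data": data, "index": index}


# Flagged layout (3-space continuation indent, closing brace indented):
#     df = pd.DataFrame({
#        'pig': [20, 18, 489, 675, 1776],
#        'horse': [4, 25, 281, 600, 1900]
#        }, index=[1990, 1997, 2003, 2009, 2014])

# One layout that should avoid both errors: a 4-space hanging indent, and the
# closing brace aligned with the start of the line that opened it.
df = make_frame({
    'pig': [20, 18, 489, 675, 1776],
    'horse': [4, 25, 281, 600, 1900]
}, index=[1990, 1997, 2003, 2009, 2014])
```

In the docstring itself, the same layout would be written after the `>>> ` and `... ` prompts.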

Otherwise I’d take:

  • pandas.errors.SettingWithCopyWarning
  • pandas.errors.SpecificationError
  • pandas.errors.UndefinedVariableError

I cannot work on these since I cannot get numpydoc to work to test for validity. So they are up for grabs.

@natmokval should pandas.arrays.DatetimeArray be added to the list? I see a flake8 error on my PR tests:

Error: /home/runner/work/pandas/pandas/pandas/core/arrays/datetimes.py:179:EX03:pandas.arrays.DatetimeArray:flake8 error: line 2, col 4: E121 continuation line under-indented for hanging indent

Edit: Looks like this is fixed in #56855

I’ll take these:

  • pandas.io.formats.style.Styler.highlight_quantile
  • pandas.io.formats.style.Styler.background_gradient
  • pandas.io.formats.style.Styler.text_gradient

Working on:

  • pandas.DataFrame.plot.hexbin
  • pandas.DataFrame.plot.line

I’ll take these:

  • pandas.io.formats.style.Styler.set_tooltips
  • pandas.io.formats.style.Styler.set_uuid
  • pandas.io.formats.style.Styler.pipe
  • pandas.io.formats.style.Styler.highlight_between

Working on:

  • pandas.DataFrame.groupby
  • pandas.DataFrame.values
  • pandas.DataFrame.sort_values

Working on:

  • pandas.ExcelWriter

I’ll take:

  • pandas.io.formats.style.Styler.format_index
  • pandas.io.formats.style.Styler.relabel_index
  • pandas.io.formats.style.Styler.hide
  • pandas.io.formats.style.Styler.set_td_classes

I’ll take these:

  • pandas.io.json.build_table_schema
  • pandas.read_stata
  • pandas.plotting.scatter_matrix
  • pandas.Index.droplevel
  • pandas.Grouper

I’ll take:

  • pandas.Timestamp.ceil
  • pandas.Timestamp.floor
  • pandas.Timestamp.round

working on:

  • pandas.errors.PossibleDataLossError
  • pandas.errors.PossiblePrecisionLoss
  • pandas.errors.SettingWithCopyError
  • pandas.errors.ValueLabelTypeMismatch

Hi @natmokval, the command scripts/validate_docstrings.py --format=actions --errors=EX03 method-name outputs all kinds of errors, not just the EX03 errors.

https://github.com/pandas-dev/pandas/blob/c778746f2219601ac3c38f4f287f9a4e68905655/scripts/validate_docstrings.py#L444-L458
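
Until the script filters by code for a single method, the EX03 lines can be picked out of its output by hand or with a small filter. This is a sketch, assuming the error lines have the tab-separated layout shown earlier in this thread; `filter_errors` is a hypothetical helper and not part of the pandas script.

```python
def filter_errors(validation_output, code="EX03"):
    """Return only the error lines that start with the given error code."""
    matches = []
    for line in validation_output.splitlines():
        stripped = line.strip()
        # Error lines look like "<CODE><tab><message>".
        if stripped.split(maxsplit=1)[:1] == [code]:
            matches.append(stripped)
    return matches


SAMPLE = (
    "3 Errors found for `pandas.Series.plot.line`:\n"
    "\tPR02\tUnknown parameters {'color'}\n"
    "\tEX03\tflake8 error: line 4, col 4: E121 continuation line "
    "under-indented for hanging indent\n"
    "\tEX03\tflake8 error: line 6, col 4: E123 closing bracket does not "
    "match indentation of opening bracket's line\n"
)

for error in filter_errors(SAMPLE):
    print(error)
```

With the sample above, only the two EX03 lines are kept and the PR02 line is dropped.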

Working on:

  • pandas.DataFrame.hist
  • pandas.read_json
  • pandas.DataFrame.to_sql

PR opened for the following:

  • pandas.DataFrame.var
  • pandas.DatetimeIndex.day_name
  • pandas.core.groupby.DataFrameGroupBy.apply
  • pandas.DatetimeIndex.month_name
  • pandas.core.groupby.DataFrameGroupBy.hist
  • pandas.core.groupby.SeriesGroupBy.apply
  • pandas.core.groupby.SeriesGroupBy.transform
  • pandas.DataFrame.plot.hist
  • pandas.DataFrame.tz_localize
  • pandas.CategoricalIndex.set_categories
  • pandas.core.groupby.DataFrameGroupBy.boxplot
  • pandas.core.groupby.SeriesGroupBy.pipe
  • pandas.DataFrame.plot.bar
  • pandas.DataFrame.tz_convert
  • pandas.core.groupby.DataFrameGroupBy.pipe
  • pandas.DataFrame.skew
  • pandas.core.window.rolling.Rolling.corr

Working on:

  • pandas.MultiIndex.names
  • pandas.MultiIndex.droplevel

I’ll take these:

  • pandas.core.resample.Resampler.interpolate
  • pandas.pivot
  • pandas.merge_asof
  • pandas.wide_to_long
  • pandas.Index.rename
  • pandas.Index.isin
  • pandas.IndexSlice

I’m new to this and may be overlooking something fundamental. I wanted to take on some of these docstrings, but I ran into an issue and am not sure what I am doing wrong.

After executing:

python3 validate_docstrings.py --format=actions --errors=EX03 pandas.errors.SpecificationError

I get a list of flake8 errors:

flake8 error: line 4, col 40: E261 at least two spaces before inline comment
...

However, after adding an extra space and saving the docstring for SpecificationError, in this case in

./pandas/errors/__init__.py

and rerunning the above validate_docstrings.py command, the script returns the same errors, as if the change had no effect.
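
The thread does not state the cause of this. One possibility worth ruling out, offered only as an assumption, is that the interpreter imports a different copy of pandas than the checkout being edited. The sketch below uses the standard library to report where a module would be imported from; `module_origin` is a hypothetical helper.

```python
import importlib.util


def module_origin(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# If this path is not inside the checkout being edited, the validator is
# probably inspecting a different copy of pandas.
print(module_origin("pandas"))
```

Run it with the same interpreter and from the same directory as the validation script, since both affect which copy is found.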

I am also new and facing a similar issue. Even after making the changes, the error logs don’t change.

Working on:

  • pandas.errors.DatabaseError
  • pandas.errors.IndexingError
  • pandas.errors.InvalidColumnName

Thank you!

Working on:

  • pandas.Series.plot.line
  • pandas.Series.to_sql

I will work on:

  • pandas.Series.cat.set_categories
  • pandas.Series.plot.bar
  • pandas.Series.plot.hist

I can take the first two methods.

  • pandas.Series.dt.day_name
  • pandas.Series.str.len