kedro: Avoid Kedro fsspec requirements being mutually incompatible with pandas 1.1.0

Description

Kedro 0.16.4 enforces fsspec<0.7.0,>=0.5.1 and PANDAS = "pandas>=0.24, <1.0.4" in setup.py.

However,

  1. The pandas upper version limit is not enforced, so newer pandas versions can coexist with Kedro after pip install kedro.
  2. The issue that blocked Kedro from allowing higher pandas versions, a problem with parquet reads in pandas (https://github.com/pandas-dev/pandas/issues/34467), is resolved in pandas >1.0.5, so the key blocker on Kedro upgrading pandas versions no longer exists.

If Kedro allows pandas 1.1.0, you will hit an incompatibility with fsspec, because pandas 1.1.0 requires fsspec>=0.7.4.
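The conflict described above can be checked with a few lines of standard-library Python. This is only a sketch: the helper names (parse, kedro_allows, pandas_allows) and the sample version list are assumptions made for illustration, and the bounds are the ones quoted in this issue rather than values read from either package.

```python
# Stdlib-only sketch: is there any fsspec version that satisfies both ranges?
# Helper names and the candidate list are hypothetical, chosen for illustration.

def parse(version: str) -> tuple:
    """Turn '0.7.4' into (0, 7, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def kedro_allows(fsspec_version: str) -> bool:
    # Kedro pin quoted in this issue: fsspec<0.7.0,>=0.5.1
    return parse("0.5.1") <= parse(fsspec_version) < parse("0.7.0")

def pandas_allows(fsspec_version: str) -> bool:
    # pandas 1.1.0 requirement quoted in this issue: fsspec>=0.7.4
    return parse(fsspec_version) >= parse("0.7.4")

# A small sample of versions on both sides of the two bounds.
candidates = ["0.5.1", "0.6.3", "0.7.0", "0.7.4", "0.8.0"]
compatible = [v for v in candidates if kedro_allows(v) and pandas_allows(v)]
print(compatible)  # prints [] because the two ranges do not overlap
```

Since Kedro's upper bound (0.7.0) sits below the pandas lower bound (0.7.4), no version can satisfy both, whichever candidates are tried.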

Context

You cannot run Kedro and pandas>=1.1.0 in the same environment, because pandas needs fsspec>=0.7.4.

There are meaningful improvements in newer pandas, so I would like to be able to run them together out-of-the-box.

The current reason for not allowing higher pandas versions, as per setup.py in the Kedro source code (https://github.com/pandas-dev/pandas/issues/34467), is no longer applicable, so I think it is time to make this change unless fsspec has deal-breaking problems in later versions.

Possible Implementation

Can you just relax the fsspec version requirement (currently fsspec<0.7.0,>=0.5.1) so that it allows fsspec>=0.7.4? I’m not aware of any major problems in the newer versions.

Possible Alternatives

Raise warnings or errors if Kedro co-exists with pandas>=1.1.0.
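A minimal sketch of what such a warning could look like, assuming the thresholds quoted in this issue (pandas 1.1.0 and fsspec 0.7.4). The function name check_pandas_fsspec is hypothetical and is not part of Kedro. The simple parser only handles plain numeric versions such as "1.1.0", not pre-release strings.

```python
import warnings

def parse(version: str) -> tuple:
    """Turn '1.1.0' into (1, 1, 0); assumes purely numeric components."""
    return tuple(int(part) for part in version.split(".")[:3])

def check_pandas_fsspec(pandas_version: str, fsspec_version: str) -> bool:
    """Return True if the pair looks compatible; warn and return False otherwise."""
    # Thresholds are the ones stated in this issue, not read from either package.
    if parse(pandas_version) >= (1, 1, 0) and parse(fsspec_version) < (0, 7, 4):
        warnings.warn(
            f"pandas {pandas_version} expects fsspec>=0.7.4, "
            f"but fsspec {fsspec_version} is installed.",
            UserWarning,
        )
        return False
    return True
```

In practice the arguments would come from the installed packages, for example pandas.__version__ and fsspec.__version__. Raising an error instead of warning would be a one-line change.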

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 17 (9 by maintainers)

Most upvoted comments

Oops. FYI, I’m running into an incompatibility with s3fs 0.5.0 as it requires fsspec>=0.8.0.

Thanks both,

Would I be correct in understanding the official position here as: “all file loads should be handled by .yml in your catalog folder; it is an anti-pattern to directly read files with pandas, therefore we don’t view it as critical for kedro to facilitate pandas s3 functionality, nor maintain it if it is incidentally present in some kedro versions”?

I understand the position if so, although I think there are good reasons to prototype EDA in a project using pandas s3 functionality rather than hopping in and out of the data catalog to create potentially disposable items.

Regardless, looking forward to v0.17.0!

PS. I agree that anyone who doesn’t use virtual environments deserves the problems they encounter

Oops. FYI, I’m running into an incompatibility with s3fs 0.5.0 as it requires fsspec>=0.8.0.

I’m having this same issue. It’s keeping me from creating my environment. A lower version of s3fs is not really a good option for me.

Thanks @lorenabalan, if I’m reading develop: 3218055 correctly then it would make sense to allow broader pandas boundaries as part of the same feature release in 0.17.0?

The reason QuantumBlack pinned pandas (pandas-dev/pandas#34467) is solved, and the fsspec problem that would have been introduced by a pandas update is now fixed.

Would only need to change the appropriate pandas-determining lines in requirements.txt and setup.py.

@fjp I could update fsspec in kedro setup files, but I don’t think I should.

Kedro has a lot of I/O functionality that could intersect with fsspec in a lot of ways. Unless QuantumBlack is very confident in their hooks and tests, I don’t think I should be making that change (particularly as I don’t understand fsspec very deeply).

On the other hand, I am a (minor) pandas contributor and understand that library well enough to know that 1.1.0 is a good upgrade and will not break data pipelines if the fsspec dependency is correctly managed (gcsfs and similar have been stable for long enough that it shouldn’t be an issue for kedro).

Hello again! What happened was that we relaxed the pandas requirements to be just >0.24.0. fsspec is an optional requirement for pandas, so it’s not an impediment to fetching the latest pandas. fsspec is pulled according to Kedro core requirements, <0.7.0. The way Kedro interacts with fsspec means pandas doesn’t have to, so there are actually no compatibility issues on that front. We read the data using fsspec directly and then pass a file-like object to pandas.read_csv. The example you gave, passing a path/string, indeed won’t work, but fortunately that is not how Kedro datasets work, so it should be safe.
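The read path described above (open the file with fsspec, then hand pandas a file-like object rather than a path string) can be sketched roughly as follows. This is an illustration, not Kedro's actual dataset code: the memory:// location and the sample CSV contents are assumptions chosen so that the example needs no real storage.

```python
import fsspec
import pandas as pd

# Write a small CSV to fsspec's in-memory filesystem (hypothetical sample data).
with fsspec.open("memory://cars.csv", mode="w") as f:
    f.write("make,mpg\nfiat,32.4\nvolvo,21.4\n")

# Open with fsspec first, then pass the file-like object (not the path) to pandas.
# pandas only sees an open file here, so it never needs to import fsspec itself.
with fsspec.open("memory://cars.csv", mode="r") as f:
    df = pd.read_csv(f)

print(df)
```

Because pandas receives an already-open file, its own optional fsspec requirement is never exercised. Passing a remote path string directly to pandas.read_csv is the case that would depend on the fsspec version pandas expects.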

We hear you all on the fsspec bump, and as I said this has been implemented on develop and will be made available on 0.17.0. Hopefully this release will come sooner rather than later, thanks to you all raising awareness/visibility on your issues! 😃

Happy to make a pull request with my suggested implementation if desired. Although it is only a 1-liner edit so barely saving any labour…