pandas: BUG: Python operator 'in' behaves in two different ways on Series object

The Python operator ‘in’ behaves in two different ways on a pandas.Series object. When iterating over the series, it acts on the values, but for membership testing it acts on the index. Is this the desired behavior?

import pandas as pd

s = pd.Series([1, 2, 3])

# Iteration: 'in' acts on the values, prints 1 2 3
for v in s:
    print(v)

# Membership: 'in' acts on the index (0, 1, 2), prints True True False
for v in [1, 2, 3]:
    print(v in s)

This behavior also leads to peculiar behavior when working with DataFrames.

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df

#      a  b
#  0  1  4
#  1  2  5
#  2  3  6

for value in df['a']:
    if value in df['b']:  # tests the index of df['b'] (0, 1, 2), not its values
        print(value, 'also in column b')
    else:
        print(value)

#  1 also in column b
#  2 also in column b
#  3

This is not what one would naively expect from that code snippet, and it can lead to errors that are difficult to spot. This behavior is mentioned in the docs (in a pretty hidden way; one has to actively search for it), but I suggest that it should not be the desired behavior. Instead, I’d find it much more natural if ‘in’ always worked on the values and never on the index. If needed, one can always use ‘in s.index’, but the default should be to act on the values.
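For readers who need an unambiguous check today, here is a minimal sketch of the explicit spellings, using the same example Series. The variable names are only illustrative, and the behavior is as seen in recent pandas versions.

```python
import pandas as pd

s = pd.Series([1, 2, 3])  # default index labels are 0, 1, 2

# Plain `in` consults the index labels, not the values.
in_index_implicit = 3 in s            # False: 3 is not an index label
in_index_explicit = 3 in s.index      # False: the same check, spelled out

# Explicit checks against the values.
in_values_array = 3 in s.values               # NumPy membership on the values
in_values_eq = bool((s == 3).any())           # elementwise comparison, then reduce
in_values_isin = bool(s.isin([3]).any())      # isin requires a list-like argument

print(in_index_implicit, in_index_explicit)
print(in_values_array, in_values_eq, in_values_isin)
```

Spelling out `.index` or `.values` makes the intent visible at the call site, whichever default pandas keeps.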

pandas version 0.18.1

About this issue

  • State: open
  • Created 8 years ago
  • Reactions: 2
  • Comments: 23 (20 by maintainers)

Most upvoted comments

We could also consider deprecating DataFrame.__contains__() as well

I think this would be a mistake. I find code like if col in df: do_something(df[col]) to be really natural and useful.
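As a small illustration of the idiom defended here, the sketch below shows that membership on a DataFrame tests the column labels, and that iterating a DataFrame yields those same labels. The frame is the one from the original report.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

col_present = 'a' in df      # 'a' is a column label
col_missing = 'c' in df      # there is no column 'c'
value_as_label = 1 in df     # 1 is a cell value, not a column label

# Iterating a DataFrame yields its column labels, so `for` and `in` agree here.
labels = list(df)

print(col_present, col_missing, value_as_label, labels)
```

For DataFrame, then, `for` and `in` both refer to the column labels; the mismatch discussed in this issue is specific to Series.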

Just got bit by this. It was miserable. Agreed that it needs to be consistent, and IMHO having in operate on the actual values seems much more sensible. Iteration on a Series works on values, and no one really wants to check indices “implicitly”.

I am a scientist in astronomy and this just bit me. I feel lucky that I discovered this before I published my journal paper, so that I did not publish wrong results at the frontier of science. But are others aware of this? I feel worried. I think this is definitely a nightmare for scientists, and even a hidden bomb in science that could explode at any moment.

Pandas and NumPy are the most important tools for researchers. Since a in List and a in numpy.array behave the same, and a Pandas Series or DataFrame can be built from a list or numpy.array, I think most people will believe without a doubt that a in series means the same thing. What’s worse, it raises no error, and the (wrong) result is hidden beneath the huge datasets we usually handle with a DataFrame or its column Series. As a result, no one will be aware of this.
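The contrast described above can be reproduced with a short sketch. The data values are made up for illustration, and the default integer index is assumed.

```python
import numpy as np
import pandas as pd

data = [10, 20, 30]

in_list = 30 in data                      # list membership tests the values
in_array = 30 in np.array(data)           # so does NumPy array membership
in_series = 30 in pd.Series(data)         # Series membership tests the index labels 0, 1, 2
label_in_series = 0 in pd.Series(data)    # 0 is a label here, although it is not a value

print(in_list, in_array, in_series, label_in_series)
```

No exception or warning is raised in any of these cases, which is why the difference is easy to miss inside a larger pipeline.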

I know this is a documented behavior and you may have many reasons why a in series means a in series.index, and of course you can blame users for not reading the doc. BUT, some of you may have heard about this (https://www.nature.com/articles/d41586-021-02211-4): the auto-correction in MS Excel has led to many unnoticed errors in research. That is the consequence. Microsoft is not wrong, but science is harmed.

Considering the leading role of Pandas in all subjects worldwide, I urge the developers to think about this carefully. Simply making a in series equivalent to a in series.values would prevent loads of errors in the future.

(I asked several colleagues and none of them knew this before, and we rushed to check all our Python pipelines… but do the millions of other researchers know this? I am worried…)

Sure, agree in general, it’s the “changing __iter__ being better for users” that I’m really questioning - I would also expect ser.tolist() and list(ser) to give the same result
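To make that expectation concrete, here is a minimal sketch with made-up labels and values. It shows that list(ser) and ser.tolist() agree today, and what list(ser) would return instead if __iter__ were changed to yield keys.

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

via_iter = list(ser)       # iteration currently yields the values
via_tolist = ser.tolist()  # tolist() also returns the values
keys = list(ser.index)     # what list(ser) would return under dict-like key iteration

print(via_iter == via_tolist, keys)
```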

I agree with the OP. There are a few StackOverflow questions about this already (e.g., here, and here from the opposite expectation perspective).

Let me make a few points:

  • x in y, for objects y that are iterable, is generally expected to return True exactly if for z in y: ... sets z to an object equal to x at least once. This is the default behaviour of in when you only implement __iter__, and all other Python libraries I’ve ever encountered adhere to this convention.
  • Since it’s claimed that Series are dict-like: dicts are consistent, with both for ... in iteration and the in operator referring to keys; Series are not (you’ve pointed this out @jorisvandenbossche):
    > d = {'a': 3, 'b': 2}
    > for x in d:
    >   print(x)
    a
    b
    
    > 3 in d
    False
    
    > 'a' in d
    True
    
  • This inconsistency is the main reason for confusion as evidenced by the two linked StackOverflow posts, the first expecting x in y to be true if the value x is in y, the second expecting for x in y to yield keys.
  • While it is documented, it is documented in a gotcha section, a place where users will generally not look for something as “fundamental” as the in operator.
  • The isin function, often presented as an alternative for checking whether a Series contains a value, is neither memory-efficient nor runtime-efficient nor nicely readable (the needle and haystack being ‘inverted’). It also requires a list argument even if one only wants to check for a single value, so for this specific purpose it intuitively “checks the wrong way around”: s.isin([2]) checks, for each value in the Series, whether it is in the list [2].
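The convention from the first point, that in falls back to iteration when a class defines only __iter__, can be checked with a small sketch. ValuesOnly is a hypothetical class made up for this illustration.

```python
class ValuesOnly:
    """Hypothetical container that defines __iter__ but not __contains__."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)


box = ValuesOnly([1, 2, 3])

# With no __contains__, Python evaluates `x in box` by iterating over box
# and comparing each yielded item to x.
found = 3 in box
missing = 7 in box

print(found, missing)
```

Under this fallback, membership and iteration cannot disagree. They only diverge when a class, such as Series, defines __contains__ separately.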

I think making a consistent choice for either values or keys would be most sensible, since I don’t see how this could be put in the docs in an appropriate way: IMHO, the current place (in gotchas) is too hidden, but any other place (e.g. in the Series API) is out of place, for the simple reason that in is a Python keyword/operator and not part of the API of Series objects. IMHO, breaking user code is fine when the breaking change is introduced a few versions into the future, while affected code emits deprecation warnings in the meantime (by issuing warnings in the __contains__ or the __iter__ implementations).
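A rough sketch of what such a transition warning could look like follows. WarningSeries is a hypothetical subclass written only for illustration, not something pandas provides, and the warning text and the choice of FutureWarning are assumptions.

```python
import warnings

import pandas as pd


class WarningSeries(pd.Series):
    """Hypothetical subclass, for illustration only; not part of pandas."""

    def __contains__(self, key):
        # Keep the current index-based answer, but tell the caller to be explicit.
        warnings.warn(
            "`key in series` currently tests the index; use "
            "`key in series.index` or `key in series.values` explicitly.",
            FutureWarning,
            stacklevel=2,
        )
        return super().__contains__(key)


ws = WarningSeries([1, 2, 3])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = 3 in ws  # still evaluated against the index labels 0, 1, 2

print(result, [w.category.__name__ for w in caught])
```

Existing code would keep producing the same results during the transition, but every implicit membership test would be flagged so that users can switch to an explicit spelling before the behavior changes.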