swifter: Slow Performance of Swifter for Text Preprocessing

Hi @jmcarpenter2,

Dear Swifter Folks,

Recently, I found that swifter is 5-10x slower than a vanilla pandas apply when the function is not vectorized (in my case, text preprocessing).

The experiment looks like this:


import pandas as pd
import swifter

def clean_text(text):
    text = text.strip()
    text = text.replace(' ', '_')
    return text

N_rows = 7000000
df_data = pd.DataFrame([["i want to break free"]] * N_rows, columns=["text"])

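# %time is an IPython/Jupyter magic, so run these lines in a notebook or IPython shell.
# swifter apply
%time df_data['text'] = df_data['text'].swifter.apply(clean_text)

# vanilla pandas apply
%time df_data['text'] = df_data['text'].apply(clean_text)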

Is this expected? Let’s discuss it to make sure I’m not missing something. Thank you!
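For readers who want to reproduce the comparison outside IPython, here is a minimal sketch that replaces `%time` with `time.perf_counter`. The `timed` helper and the reduced row count are assumptions made for illustration, and the `.str` variant is only added as a reference point for this particular `clean_text`; it is not part of the original report. The swifter timing runs only if swifter is installed.

```python
import time

import pandas as pd

try:
    import swifter  # noqa: F401  (importing registers the .swifter accessor)
    HAVE_SWIFTER = True
except ImportError:
    HAVE_SWIFTER = False


def clean_text(text):
    text = text.strip()
    text = text.replace(" ", "_")
    return text


def timed(label, fn):
    # Hypothetical helper: run fn once and print its wall-clock time.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result


# Assumption: 100,000 rows instead of the 7,000,000 in the report, to keep the run short.
n_rows = 100_000
df_data = pd.DataFrame({"text": ["i want to break free"] * n_rows})

baseline = timed("pandas apply", lambda: df_data["text"].apply(clean_text))

# Reference point: the same cleaning expressed with pandas string methods.
vectorized = timed(
    "pandas .str",
    lambda: df_data["text"].str.strip().str.replace(" ", "_", regex=False),
)

if HAVE_SWIFTER:
    via_swifter = timed(
        "swifter apply", lambda: df_data["text"].swifter.apply(clean_text)
    )
```

Timings will vary by machine and library version, so treat the printed numbers as relative rather than absolute.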

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 4
  • Comments: 26 (10 by maintainers)

Most upvoted comments

For anyone reading this issue –

If you are doing processing on text data and want to try to increase speed with swifter, you should try adding allow_dask_on_strings() to your command chain.

For example df.swifter.allow_dask_on_strings().apply(foo) will allow swifter to attempt using dask on your text data, which by default is not allowed.

Please see the discussion above for why this is the default. Long story short: it can actually run slower than a pandas apply.

So if swifter is not giving you a performance boost and you have text data in your dataframe, try allow_dask_on_strings(). It is more likely to increase speed if the text column of the dataframe is only used as a lookup rather than being mutated by the function call itself.
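As a rough illustration of the "lookup" case described above, the sketch below uses a text column only as a key into a dictionary. The `PRIORITY` table, the column names, and `lookup_priority` are hypothetical; only `allow_dask_on_strings()` and `apply` come from the comment. The code falls back to a plain pandas apply when swifter is not installed.

```python
import pandas as pd

try:
    import swifter  # noqa: F401  (importing registers the .swifter accessor)
    HAVE_SWIFTER = True
except ImportError:
    HAVE_SWIFTER = False

# Hypothetical lookup table keyed by the text column.
PRIORITY = {"low": 1, "medium": 2, "high": 3}


def lookup_priority(row):
    # The string is only read as a dictionary key; it is never modified.
    return PRIORITY[row["level"]] * row["weight"]


df = pd.DataFrame(
    {
        "level": ["low", "medium", "high"] * 1000,
        "weight": range(3000),
    }
)

if HAVE_SWIFTER:
    # Opt in to letting swifter try dask on a dataframe with string columns.
    result = df.swifter.allow_dask_on_strings().apply(lookup_priority, axis=1)
else:
    result = df.apply(lookup_priority, axis=1)

print(result.head())
```

Whether this is actually faster than a pandas apply depends on the data size and the function, so it is worth timing both.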

I haven’t been using the library for a while but it’s really great to see things resolved with such awesome bug closing notes. Folks like you make the OSS world the magical place that it is. Thank you so much Jason!

Dear swifter community and @hadyan-tvlk ,

I am very pleased to announce that swifter v1.0.1 was just released. With that release comes a fix for this nearly 2 year old issue. By leveraging modin dataframes specifically to perform axis=1 string applies, we are able to achieve a performance boost when using swifter on dataframes containing string dtype columns.

NOTE: This means that using allow_dask_on_strings() is no longer required. In fact, this legacy parameter is likely to slow down your apply performance because it will force swifter to attempt to use dask instead of modin. However, the parameter is kept for (A) backwards compatibility and (B) in case a user prefers the dask functionality.

Please upgrade to swifter 1.0.1 and try it out for yourself. To upgrade: pip install -U swifter
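After upgrading, one way to try this out is a small axis=1 string apply without `allow_dask_on_strings()`. This is only a sketch: `version_tuple`, `full_name`, and the sample data are hypothetical, and the code uses a plain pandas apply when swifter is missing or older than 1.0.1.

```python
import pandas as pd

try:
    import swifter
    HAVE_SWIFTER = True
except ImportError:
    swifter = None
    HAVE_SWIFTER = False


def version_tuple(version):
    # Hypothetical helper: turn "1.0.1" into (1, 0, 1) for a simple comparison.
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def full_name(row):
    # A row-wise (axis=1) string operation across two text columns.
    return f"{row['first'].strip()}_{row['last'].strip()}".lower()


df = pd.DataFrame(
    {
        "first": [" Ada", "Alan "] * 500,
        "last": ["Lovelace ", " Turing"] * 500,
    }
)

if HAVE_SWIFTER and version_tuple(swifter.__version__) >= (1, 0, 1):
    # No allow_dask_on_strings() here, per the release note above.
    names = df.swifter.apply(full_name, axis=1)
else:
    names = df.apply(full_name, axis=1)

print(names.head())
```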

Let me know if (A) it doesn’t work for you or (B) it works well for you. Feedback is desired! I’m going to leave this issue “Open” for a short time to allow for any feedback or issues that arise related to this one.

Thanks so much for your patience, Jason

Hi @c-shreya and all,

The only update I have at this moment is that I have been looking into alternative approaches/back-ends which may help with specific issues that swifter has and features that users strongly desire. For example, this text processing issue has been a longstanding issue for swifter. Additionally, users have been asking for a groupby-apply for a long time as well.

I am optimistic that other approaches will enable these shortcomings of swifter to be solved or at least improved, but I don’t want to promise anything. These are not easy problems to solve and I’ve found that most out-of-the-box solutions don’t help with them.

Thank you all for your patience, Jason

I apologize for the delay in implementing a working solution for this issue. In order to avoid this being a reason not to use swifter in the short term, I have implemented a simple check if the input is a string. If the input is a series/dataframe of dtype string then we simply resort to pandas apply.

To be clear, this is a temporary solution until I can find a solution that will use parallelization to actually increase processing speed of text-based inputs.

Please update to version 0.230 to have the benefits of this temporary fix.
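The string check described above could look roughly like the following. This is a sketch of the idea, not swifter's actual code: `apply_with_string_fallback` and `fake_parallel_apply` are hypothetical names, and `pandas.api.types.is_string_dtype` is one possible way to detect string input.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

calls = []  # records which path was taken, for demonstration only


def fake_parallel_apply(series, func):
    # Hypothetical stand-in for a parallel backend such as dask.
    calls.append("parallel")
    return series.apply(func)


def apply_with_string_fallback(series, func, parallel_apply=fake_parallel_apply):
    # If the series holds strings, skip parallelization and use pandas apply.
    if is_string_dtype(series):
        calls.append("pandas")
        return series.apply(func)
    return parallel_apply(series, func)


texts = pd.Series([" i want to break free ", "under pressure"])
numbers = pd.Series([1, 2, 3])

cleaned = apply_with_string_fallback(texts, lambda s: s.strip().replace(" ", "_"))
doubled = apply_with_string_fallback(numbers, lambda x: x * 2)

print(cleaned.tolist())
print(doubled.tolist())
print(calls)
```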

Using swifter (version 1.0.9) actually decreased speed for me too. I did some string operations on a 100k row dataframe. I even tried allow_dask_on_strings(), but that didn’t change much.

If you are doing processing on text data and want to try to increase speed with swifter, you should try adding allow_dask_on_strings() to your command chain.

For example df.swifter.allow_dask_on_strings().apply(foo) will allow swifter to attempt using dask on your text data, which by default is not allowed.

Dask Apply: 100%|██████████| 24/24 [00:02<00:00, 8.34it/s]
CPU times: user 1.21 s, sys: 126 ms, total: 1.34 s
Wall time: 5.08 s

Earlier it was 10 odd secs. Works great for me!! Thanks

I tried using df.swifter.allow_dask_on_strings().apply(foo) to process text (calculating the edit distance between two columns), but it still seems to use only one core, and there is no speed difference compared to df.swifter.apply(foo).
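A workload like the one in this comment might look like the sketch below. The commenter's `foo` is not shown in the thread, so `edit_distance` (a plain Levenshtein distance), the column names, and the sample data are all assumptions. The swifter call runs only if swifter is installed; otherwise the sketch uses pandas apply.

```python
import pandas as pd

try:
    import swifter  # noqa: F401  (importing registers the .swifter accessor)
    HAVE_SWIFTER = True
except ImportError:
    HAVE_SWIFTER = False


def edit_distance(a, b):
    # Levenshtein distance using a rolling row of the dynamic-programming table.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def row_distance(row):
    return edit_distance(row["left"], row["right"])


df = pd.DataFrame(
    {
        "left": ["kitten", "flaw", ""] * 200,
        "right": ["sitting", "lawn", "abc"] * 200,
    }
)

if HAVE_SWIFTER:
    distances = df.swifter.allow_dask_on_strings().apply(row_distance, axis=1)
else:
    distances = df.apply(row_distance, axis=1)

print(distances.head(3).tolist())
```

Timing this against `df.apply(row_distance, axis=1)` on your own data is a simple way to see whether the parallel path helps in your case.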

2 weeks later update:

I’ve explored several libraries for multithreading and multiprocessing and the best I seem to be able to do on the given function clean_text for 70M rows is to nearly tie the speed of a pandas apply.

Until I come across a better solution, I’m going to leave this issue open so that people are aware of this limitation, and I will retain the functionality that swifter resorts to pandas apply in the case of text processing.

Hey @hadyan-tvlk,

Thanks for identifying this issue! I too am noticing similar issues on non-vectorized string applies. To be perfectly honest, I have no clue why the underlying dask is processing large string applies slower than a regular large pandas apply.

I am currently experimenting with using a different parallel processing module specifically for string applies to see if I can resolve this shortcoming of the package. If you want, you are welcome to help identify other parallel processing packages that might be useful for me to implement in swifter for the case of string applies.

Thanks!