pandas: pd.read_json yields: OSError: [Errno 22] Invalid argument

Code Sample, a copy-pastable example if possible

import pandas as pd

data = '/Users/davidleifer/Desktop/Geog500/thesis/data/merged-file.json'
df = pd.read_json(data, lines=True)

Problem description

The JSON file contains Twitter data scraped using their API. I’ve limited the files to 10,000 tweets per file. I clean the files using this process:

Merge files in directory using: cat * > merged-file.json
Remove blank lines in Sublime Text using Find and Replace: ^\n.
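
The two cleaning steps above can also be done in one pass from Python. This is a minimal sketch, not the method used in the report: the function name `merge_jsonl` and the `*.json` pattern are assumptions made for illustration. It writes a newline after every record, so a source file that lacks a trailing newline cannot glue two tweets onto one line.

```python
import glob
import os


def merge_jsonl(src_dir, out_path, pattern="*.json"):
    """Concatenate every file matching `pattern` in `src_dir` into `out_path`,
    dropping blank lines. Returns the number of records written.
    (Hypothetical helper; name and defaults are illustrative.)"""
    out_abs = os.path.abspath(out_path)
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
            if os.path.abspath(path) == out_abs:
                continue  # never read the output file back into itself
            with open(path, "r", encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:  # skip blank lines
                        out.write(line + "\n")
                        written += 1
    return written
```

Calling `merge_jsonl(data_dir, merged_path)` with two paths of your own produces one tweet per line in the merged file.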

Here is an example Tweet (one tweet per line):

{"created_at":"Thu Nov 02 08:08:01 +0000 2017","id":925997914136002562,"id_str":"925997914136002562","text":"#RussianGate #FollowTheFacts #Resist #FakePresident #GOP #War #Vote #ClimateChange #Peace #Animals #Women https://t.co/xe7AEdod1Y","display_text_range":[0,105],"source":"\u003ca href=\"http://twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":760436942,"id_str":"760436942","name":"Athoughtz","screen_name":"athoughtz","location":"United States","url":null,"description":"#RussianGate #FollowTheFacts #Resist #FakePresident #GOP #War #Vote #ClimateChange #Peace #Animals #Women","translator_type":"none","protected":false,"verified":false,"followers_count":5063,"friends_count":5064,"listed_count":142,"favourites_count":659,"statuses_count":62057,"created_at":"Thu Aug 16 00:11:12 +0000 2012","utc_offset":-25200,"time_zone":"Arizona","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http://pbs.twimg.com/profile_images/378800000835488491/565d1bd43c8b0a615b8a39887e52ef2c_normal.jpeg","profile_image_url_https":"https://pbs.twimg.com/profile_images/378800000835488491/565d1bd43c8b0a615b8a39887e52ef2c_normal.jpeg","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"RussianGate","indices":[0,12]},{"text":"FollowTheFacts","indices":[13,28]},{"text":"Resist","indices":[29,36]},{"text":"FakePresident","indices":[37,51]},{"text":"GOP","indices":[52,56]},{"text":"War","indices":[57,61]},{"text":"Vote","indices":[62,67]},{"text":"ClimateChange","indices":[68,82]},{"text":"Peace","indices":[83,89]},{"text":"Animals","indices":[90,98]},{"text":"Women","indices":[99,105]}],"urls":[],"user_mentions":[],"symbols":[],"media":[{"id":925997885778378752,"id_str":"925997885778378752","indices":[106,129],"media_url":"http://pbs.twimg.com/media/DNnOK8SVQAAUS6Z.jpg","media_url_https":"https://pbs.twimg.com/media/DNnOK8SVQAAUS6Z.jpg","url":"https://t.co/xe7AEdod1Y","display_url":"pic.twitter.com/xe7AEdod1Y","expanded_url":"https://twitter.com/athoughtz/status/925997914136002562/photo/1","type":"photo","sizes":{"medium":{"w":600,"h":585,"resize":"fit"},"small":{"w":600,"h":585,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"large":{"w":600,"h":585,"resize":"fit"}}}]},"extended_entities":{"media":[{"id":925997885778378752,"id_str":"925997885778378752","indices":[106,129],"media_url":"http://pbs.twimg.com/media/DNnOK8SVQAAUS6Z.jpg","media_url_https":"https://pbs.twimg.com/media/DNnOK8SVQAAUS6Z.jpg","url":"https://t.co/xe7AEdod1Y","display_url":"pic.twitter.com/xe7AEdod1Y","expanded_url":"https://twitter.com/athoughtz/status/925997914136002562/photo/1","type":"photo","sizes":{"medium":{"w":600,"h":585,"resize":"fit"},"small":{"w":600,"h":585,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"large":{"w":600,"h":585,"resize":"fit"}}}]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"und","timestamp_ms":"1509610081596"}
{"created_at":"Thu Nov 02 08:08:02 +0000 2017","id":925997918795866113,"id_str":"925997918795866113","text":"RT @CGTNOfficial: Survey released on Chinese public awareness of #climatechange https://t.co/q92jAnobmd","source":"\u003ca href=\"http://nosudo.co\" rel=\"nofollow\"\u003eQxNews-python\u003c/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1664059166,"id_str":"1664059166","name":"Question News","screen_name":"QxNews","location":"USA","url":null,"description":"Interrogare Semper | News bot/humans via retweets | 1 min per retweet","translator_type":"none","protected":false,"verified":false,"followers_count":3254,"friends_count":271,"listed_count":2786,"favourites_count":38,"statuses_count":1018592,"created_at":"Mon Aug 12 03:35:37 +0000 2013","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http://pbs.twimg.com/profile_background_images/514662332492816384/TuhAkn7d.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/514662332492816384/TuhAkn7d.jpeg","profile_background_tile":false,"profile_link_color":"000000","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http://pbs.twimg.com/profile_images/597288578092240896/ePlmSYCH_normal.png","profile_image_url_https":"https://pbs.twimg.com/profile_images/597288578092240896/ePlmSYCH_normal.png","profile_banner_url":"https://pbs.twimg.com/profile_banners/1664059166/1484679111","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Thu Nov 02 07:55:00 +0000 2017","id":925994638019825664,"id_str":"925994638019825664","text":"Survey released on Chinese public awareness of #climatechange https://t.co/q92jAnobmd","source":"\u003ca href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\"\u003eTweetDeck\u003c/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1115874631,"id_str":"1115874631","name":"CGTN","screen_name":"CGTNOfficial","location":"Beijing, China","url":"http://www.CGTN.com","description":"China Global Television Network, or CGTN, is a multi-language, multi-platform media grouping.","translator_type":"none","protected":false,"verified":true,"followers_count":4828619,"friends_count":53,"listed_count":4517,"favourites_count":32,"statuses_count":39079,"created_at":"Thu Jan 24 03:18:59 +0000 2013","utc_offset":28800,"time_zone":"Beijing","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"131516","profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000169084583/SqpyvnvQ.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000169084583/SqpyvnvQ.jpeg","profile_background_tile":true,"profile_link_color":"009999","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"EFEFEF","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http://pbs.twimg.com/profile_images/815049165508112384/wJA8jWZh_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/815049165508112384/wJA8jWZh_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/1115874631/1483157766","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":10,"favorite_count":25,"entities":{"hashtags":[{"text":"climatechange","indices":[47,61]}],"urls":[{"url":"https://t.co/q92jAnobmd","expanded_url":"https://news.cgtn.com/news/794d7a4e33597a6333566d54/share_p.html","display_url":"news.cgtn.com/news/794d7a4e3\u2026","indices":[62,85]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"climatechange","indices":[65,79]}],"urls":[{"url":"https://t.co/q92jAnobmd","expanded_url":"https://news.cgtn.com/news/794d7a4e33597a6333566d54/share_p.html","display_url":"news.cgtn.com/news/794d7a4e3\u2026","indices":[80,103]}],"user_mentions":[{"screen_name":"CGTNOfficial","name":"CGTN","id":1115874631,"id_str":"1115874631","indices":[3,16]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1509610082707"}

I get this error:

OSError                                   Traceback (most recent call last)
<ipython-input-4-5322def5edd5> in <module>()
----> 1 df = pd.read_json(data, lines=True)

/Users/davidleifer/anaconda/lib/python3.5/site-packages/pandas/io/json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
    214         if exists:
    215             with _get_handle(filepath_or_buffer, 'r', encoding=encoding) as fh:
--> 216                 json = fh.read()
    217         else:
    218             json = filepath_or_buffer

OSError: [Errno 22] Invalid argument

Expected Output

Loading the JSON into a pandas dataframe.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 36.2.7
Cython: 0.24
numpy: 1.13.2
scipy: 0.19.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: None
html5lib: 0.999999999
httplib2: 0.9.2
apiclient: 1.5.1
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.48.0
pandas_datareader: None

About this issue

State: closed
Created 7 years ago
Reactions: 2
Comments: 26 (7 by maintainers)

Most upvoted comments

Same bug with pd.to_json from a CSV file. The CSV file is only 700 MB. I can convert it to JSON the long way, but that gives a slightly different format than I would like. Pandas version is 0.23.4.

mariskaas on Aug 8, 2018

Hit the same bug with a proper JSON Lines file of 13 GB on macOS with pandas 0.23.0. Please reopen the issue.

fercook on Aug 7, 2018

I have tried the solution above on multiple files and it works ok. I think the problem arises when the file is about 2GB. Haven’t tried with other solutions like dask to see whether that solves the issue. And this has only happened on OSX, on Linux it loads directly without any issues. I’ve only seen this issue opening json, not other formats like csv or dta, even with data over 20GB.

ozak on Oct 13, 2018
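
The traceback in the report points at a single `fh.read()` call that pulls the whole file into memory, and the comment above places the failure at roughly 2GB on OSX. Assuming the error really is tied to that one large read, a workaround that never issues it is to parse the file line by line with the standard `json` module. The sketch below is an illustration only; `load_json_lines` is a hypothetical helper, not part of pandas.

```python
import json


def load_json_lines(path, encoding="utf-8"):
    """Parse a JSON Lines file one line at a time.

    Returns (records, bad_lines): the decoded objects, plus the 1-based
    line numbers that could not be parsed. Hypothetical helper for
    illustration; it avoids reading the whole file in one call."""
    records, bad_lines = [], []
    with open(path, "r", encoding=encoding) as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines left over from merging
            try:
                records.append(json.loads(line))
            except ValueError:  # json.JSONDecodeError is a ValueError subclass
                bad_lines.append(lineno)
    return records, bad_lines
```

The resulting list of dicts can be passed to `pd.DataFrame(records)`. All records are still held in memory at once, so this sidesteps the large read but not the memory cost.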

On OSX I was able to load the data by doing

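import pandas as pd

max_records = 100000  # rows per chunk; chunksize should be an integer
# `file` is the path to the JSON Lines file; with chunksize set, read_json returns an iterator of DataFrames
reader = pd.read_json(file, lines=True, chunksize=max_records)
chunks = []  # collect the chunks and concatenate once at the end
try:
    for df_chunk in reader:
        chunks.append(df_chunk)
except ValueError:
    print('\nSome messages in the file cannot be parsed')
filtered_data = pd.concat(chunks) if chunks else pd.DataFrame()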

as suggested here. Interestingly, I never got an error message in this case.

ozak on Mar 8, 2018
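
When only a few fields of each tweet are needed, the chunked approach can also drop the unused columns as each chunk arrives, which keeps the final DataFrame smaller. A minimal sketch, assuming a pandas version whose `read_json` accepts `chunksize` together with `lines=True`; the function name `read_tweets_in_chunks` and the choice of columns are illustrative assumptions, with the column names taken from the example tweet in the report.

```python
import pandas as pd


def read_tweets_in_chunks(path, keep_columns, chunksize=100000):
    """Read a JSON Lines file chunk by chunk, keeping only `keep_columns`.

    Hypothetical helper for illustration. Columns missing from a chunk
    are filled with NaN by reindex instead of raising KeyError."""
    parts = []
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        parts.append(chunk.reindex(columns=keep_columns))
    if not parts:
        return pd.DataFrame(columns=keep_columns)
    return pd.concat(parts, ignore_index=True)
```

For the Twitter data above, a call such as `read_tweets_in_chunks(data, ["text", "lang"])` would keep just the tweet text and language.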