google-play-scraper: [BUG] reviews_all doesn't download all reviews of an app with large amount of reviews

Library version 1.2.6

Describe the bug I cannot download all the reviews of an app with large amount of reviews. The number of downloaded reviews is always a multiple of 199.

Code

result = reviews_all("com.google.android.apps.fitness")
print(len(result))
# get 995

Expected behavior Expect to download all the reviews with reviews_all, which should be at least 20k

Additional context No

About this issue

  • Original URL
  • State: open
  • Created 4 months ago
  • Comments: 22

Most upvoted comments

This mod did not work for me either. I tried a different approach that worked for me:

In reviews.py:

        try:
            review_items, token = _fetch_review_items(
                url,
                app_id,
                sort,
                _fetch_count,
                filter_score_with,
                filter_device_with,
                token,
            )
        except (TypeError, IndexError):
            #funnan MOD start
            token = continuation_token.token
            continue
            #MOD end

Here’s my code ( I am fixing the number of reviews I need and break the loop when that number has crossed):

from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import time

# Fetch reviews using google_play_scraper, Replace with ur app-id!
app_id = 'com.XXX'

# Fetch reviews
result = []
continuation_token = None
reviews_count = 25000  # change count here

with tqdm(total=reviews_count, position=0, leave=True) as pbar:
    while len(result) < reviews_count:
        new_result, continuation_token = reviews(
            app_id,
            continuation_token=continuation_token,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            filter_score_with=None,
            count=150
        )
        if not new_result:
            break
        result.extend(new_result)
        pbar.update(len(new_result))

# Create a DataFrame from the reviews & Download the file
df = pd.DataFrame(result)

today = str(datetime.now().strftime("%m-%d-%Y_%H%M%S"))
df.to_csv(f'reviews-{app_id}_{today}.csv', index=False)
print(len(df))
files.download(f'reviews-{app_id}_{today}.csv')

and in reviews.py I added the mod as my original comment.

This is probably a dupe of #208.

The error seems to be the play service intermittently returning an error inside a 200 success code, which then fails to parse as the json the library expects. It seems to contain this ....store.error.PlayDataError message.

)]}'

[["wrb.fr","UsvDTd",null,null,null,[5,null,[["type.googleapis.com/wireless.android.finsky.boq.web.data.store.error.PlayDataError",[1]]]],"generic"],["di",45],["af.httprm",45,"-6355766929392607683",2]]

The error seems to happen frequently but not reliably. Scraping in chunks of 200 reviews, basically every request has a decent chance of crashing, resulting in usually 200-1000 total reviews scraped before it craps out.

Currently, the library swallows this exception silently and quits. Handling this error lets the scraping continue as normal.

We monkey-patched around it like this and seem to have gotten back to workable scraping:

import google_play_scraper
from google_play_scraper.constants.regex import Regex
from google_play_scraper.constants.request import Formats
from google_play_scraper.utils.request import post

def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )

    # MOD error handling
    if "error.PlayDataError" in dom:
        return _fetch_review_items(url, app_id, sort, count, filter_score_with, pagination_token)
    # ENDMOD

    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-1][-1]


google_play_scraper.reviews._fetch_review_items = _fetch_review_items

Im seeing the same issue even when I set the number of reviews (25000 in my case). Im only getting back about 500 and the output number changes each time I run it.

Me too, and I found that the output number is always a multiple of 199. It seems that Google Play randomly block the retrieval of next page of reviews.

Im seeing the same issue even when I set the number of reviews (25000 in my case). Im only getting back about 500 and the output number changes each time I run it.

Ini kode saya (saya memperbaiki jumlah ulasan yang saya perlukan dan memutus perulangan ketika angka itu telah melewatinya):

from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import time

# Fetch reviews using google_play_scraper, Replace with ur app-id!
app_id = 'com.XXX'

# Fetch reviews
result = []
continuation_token = None
reviews_count = 25000  # change count here

with tqdm(total=reviews_count, position=0, leave=True) as pbar:
    while len(result) < reviews_count:
        new_result, continuation_token = reviews(
            app_id,
            continuation_token=continuation_token,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            filter_score_with=None,
            count=150
        )
        if not new_result:
            break
        result.extend(new_result)
        pbar.update(len(new_result))

# Create a DataFrame from the reviews & Download the file
df = pd.DataFrame(result)

today = str(datetime.now().strftime("%m-%d-%Y_%H%M%S"))
df.to_csv(f'reviews-{app_id}_{today}.csv', index=False)
print(len(df))
files.download(f'reviews-{app_id}_{today}.csv')

dan di review.py saya menambahkan mod sebagai komentar asli saya.

Ini kode saya (saya memperbaiki jumlah ulasan yang saya perlukan dan memutus perulangan ketika angka itu telah melewatinya):

from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import time

# Fetch reviews using google_play_scraper, Replace with ur app-id!
app_id = 'com.XXX'

# Fetch reviews
result = []
continuation_token = None
reviews_count = 25000  # change count here

with tqdm(total=reviews_count, position=0, leave=True) as pbar:
    while len(result) < reviews_count:
        new_result, continuation_token = reviews(
            app_id,
            continuation_token=continuation_token,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            filter_score_with=None,
            count=150
        )
        if not new_result:
            break
        result.extend(new_result)
        pbar.update(len(new_result))

# Create a DataFrame from the reviews & Download the file
df = pd.DataFrame(result)

today = str(datetime.now().strftime("%m-%d-%Y_%H%M%S"))
df.to_csv(f'reviews-{app_id}_{today}.csv', index=False)
print(len(df))
files.download(f'reviews-{app_id}_{today}.csv')

dan di review.py saya menambahkan mod sebagai komentar asli saya.

I have tried with your code, and it worked for me running on colab

Thanks @adilosa and @paulolacombe , your posts are worked for me 😃

Hey @Shivam-170103, you need to use the code lines that @adilosa provided to replace the corresponding ones in the reviews.py function file in your environment. Let me know if that helps as I am not that familiar with Google Colab.

Still not able to get more than a few hundred reviews.