facebook-scraper: Scrape does not get full post when there is 2 layers of

@neon-ninja When a post has long text, or post_text with a ‘double layer’ of ‘See more’ links that need to be clicked, the extractor only manages to get the first layer. What I have tested, with facebook-scraper==0.2.42 from git master:

  1. Two different accounts (with two different sets of cookies) in Chrome and in Firefox. I exported the cookies with EditThisCookie in Chrome and Cookie Quick Manager in Firefox.
  2. Both the Windows CLI and a .py script.
  3. With --encoding utf-8 and without an encoding.

For the CLI I used this command: facebook-scraper --filename najibFullPost1.csv --pages 5 najibrazak -c C:\\Users\\insane\\Desktop\\NajibRazak\\cookies.json -v --encoding utf-8

The output for one layer of See more is fine, but if there are two layers it only captures the first layer:

1 Layer output

post click

post link

2 layer output

post click

post link

I have read about others facing this issue, but none of those reports seems to solve the problem.

By using

>>> from facebook_scraper import get_posts, enable_logging
>>> import logging
>>> import pprint
>>> enable_logging(logging.DEBUG)
>>> for post in get_posts(post_urls=[10157944979490952]):
...     print(post['text'])
...

it returns the correct post text, but scraping by username from the CLI does not.

Side note: I had a problem where the output file had an empty line between each record (row). I fixed it by adding newline='' to the open() call:

with open(filename, 'w', encoding=encoding, newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(list_of_posts)
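
For anyone wanting to check the effect of newline='' in isolation, here is a minimal, self-contained sketch. The helper name write_posts_csv and the sample posts are made up for illustration and are not part of facebook-scraper. The csv module already ends each row with \r\n, so without newline='' a text-mode file on Windows translates the \n again and produces \r\r\n, which shows up as a blank row.

```python
import csv
import os
import tempfile


def write_posts_csv(filename, list_of_posts, encoding="utf-8"):
    """Hypothetical helper: write a list of dicts to CSV without blank rows."""
    keys = list(list_of_posts[0].keys())
    # newline="" stops the file object from translating the "\r\n" that
    # the csv module already writes at the end of every row.
    with open(filename, "w", encoding=encoding, newline="") as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(list_of_posts)


# Made-up sample data, only to exercise the helper.
posts = [{"post_id": "1", "text": "first"}, {"post_id": "2", "text": "second"}]
path = os.path.join(tempfile.mkdtemp(), "posts.csv")
write_posts_csv(path, posts)

# Read the raw bytes back to inspect the actual line endings.
with open(path, "rb") as f:
    raw = f.read()
```

Each row in raw should end with a single \r\n on every platform, so spreadsheet tools do not show empty rows between records.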

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 23

Most upvoted comments

Ok, I think I see the problem. For me, the HTML is

<p> 1. i-Sinar dan i-Lestari juga… <a href="/story.php?story_fbid=10157944979490952&amp;id=157851205951&amp;_ft_=mf_story_key.10157944979490952%3Atop_level_post_id.10157944979490952%3Atl_objid.10157944979490952%3Acontent_owner_id_new.157851205951%3Athrowback_story_fbid.10157944979490952%3Apage_id.157851205951%3Astory_location.4%3Astory_attachment_style.photo%3Atds_flgs.3%3Aott.AX-KtQoVMZIEDTeL&amp;__tn__=%2C%3B" data-gt="{&quot;tn&quot;:&quot;,;&quot;}">More</a></p>

but for you, it’s

<p>
       1. i-Sinar dan i-Lestari juga…
       <a data-gt="{&quot;tn&quot;:&quot;,;&quot;}" href="/story.php?story_fbid=10157944979490952&amp;id=157851205951&amp;_ft_=mf_story_key.10157944979490952%3Atop_level_post_id.10157944979490952%3Atl_objid.10157944979490952%3Acontent_owner_id_new.157851205951%3Athrowback_story_fbid.10157944979490952%3Apage_id.157851205951%3Astory_location.4%3Astory_attachment_style.photo%3Atds_flgs.3%3Aott.AX-KtQoVMZIEDTeL&amp;__tn__=%2C%3B">
        More
       </a>
      </p>

which (?<=…\s)<a href="([^"]+) does not match, because data-gt precedes the href. This regex can be simplified. Try this: https://github.com/kevinzg/facebook-scraper/commit/e7b2a50cb39ecccd66d43e0a8ff66b65f9e75311
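
To make the mismatch concrete, here is a small sketch that runs the quoted pattern against shortened versions of the two HTML snippets. The RELAXED pattern is only an illustrative assumption of what a more tolerant regex could look like; it is not necessarily what the linked commit does.

```python
import re

# The pattern quoted above: href must be the first attribute of the <a> tag,
# and exactly one whitespace character must follow the ellipsis.
ORIGINAL = re.compile(r'(?<=…\s)<a href="([^"]+)')

# Hypothetical relaxed pattern (an assumption, not the library's actual fix):
# allow any whitespace after the ellipsis and any attributes before href.
RELAXED = re.compile(r'…\s*<a\b[^>]*?\shref="([^"]+)')

# Shortened versions of the two HTML variants from this thread.
compact = (
    '<p> 1. i-Sinar dan i-Lestari juga… '
    '<a href="/story.php?story_fbid=10157944979490952" data-gt="{}">More</a></p>'
)
reordered = (
    '<p>\n   1. i-Sinar dan i-Lestari juga…\n'
    '   <a data-gt="{}" href="/story.php?story_fbid=10157944979490952">\n'
    '    More\n   </a>\n  </p>'
)

original_compact = ORIGINAL.search(compact)
original_reordered = ORIGINAL.search(reordered)
relaxed_compact = RELAXED.search(compact)
relaxed_reordered = RELAXED.search(reordered)

print(original_compact is not None, original_reordered is not None)
print(relaxed_compact.group(1), relaxed_reordered.group(1))
```

The original pattern only finds the link in the compact variant, while the relaxed one extracts the same story URL from both.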

Git master

Almost, I used print(len(post["text"])) instead of print(post["text"])

Actually, the code block I posted doesn’t explicitly scrape 10157944979490952; it iterates through posts on najibrazak until it hits 10157944979490952 and then stops. The reason I was asking for log messages is that, in order to see the full text of a post, the scraper needs to “click” on it. It doesn’t matter whether there’s one layer or two: as soon as the scraper sees it, it should fire off a request to https://m.facebook.com/10157944979490952. Logs for that should look like this:

Looking for next page URL
Requesting page from: https://m.facebook.com/page_content_list_view/more/?page_id=157851205951&start_cursor={"timeline_cursor":"AQHR5B0gbQ5aj-f59fQIZHuc_M-p9Gnb3aQ7u5V1ji7WFlerTc3HpByNZHy53XBAIU6tHHCl_06gbt-5bR6rjicaYU00R_v_Xj119lon5gamAfBqNSHXOigII7XO2FUNO5Pw","timeline_section_cursor":null,"has_next_page":true}&num_to_fetch=4&surface_type=posts_tab
Parsing page response
Got 4 raw posts from page
Extracting posts from page 7
Fetching 10157944979490952
Fetching https://m.facebook.com/najibrazak/photos/a.294787430951/10157944978690952/?type=3&source=48&__tn__=EH-R
[10157944979490952] Extract method extract_link didn't return anything
[10157944979490952] Extract method extract_video didn't return anything
[10157944979490952] Extract method extract_video_thumbnail didn't return anything
[10157944979490952] Extract method extract_video_id didn't return anything
[10157944979490952] Extract method extract_video_meta didn't return anything
[10157944979490952] Extract method extract_factcheck didn't return anything
[10157944979490952] Extract method extract_share_information didn't return anything
[10157944979490952] Extract method extract_listing didn't return anything
2708

Does your debug log output Fetching 10157944979490952?
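
If the debug output is long, a quick way to answer that is to save it to a file and scan it for the Fetching lines. The sketch below assumes the log lines look like the excerpt above; the function name fetched_post_ids is made up for illustration and is not part of facebook-scraper.

```python
def fetched_post_ids(log_lines):
    """Return the numeric post IDs that appear in 'Fetching <id>' log lines."""
    prefix = "Fetching "
    ids = set()
    for line in log_lines:
        line = line.strip()
        if line.startswith(prefix):
            target = line[len(prefix):]
            # Skip 'Fetching https://...' lines; keep bare numeric post IDs.
            if target.isdigit():
                ids.add(target)
    return ids


# A few lines copied from the log excerpt in this thread.
sample_log = [
    "Got 4 raw posts from page",
    "Extracting posts from page 7",
    "Fetching 10157944979490952",
    "Fetching https://m.facebook.com/najibrazak/photos/a.294787430951/10157944978690952/?type=3&source=48&__tn__=EH-R",
    "[10157944979490952] Extract method extract_link didn't return anything",
]

found = fetched_post_ids(sample_log)
print("10157944979490952" in found)
```

If the post ID is missing from the result, the scraper never requested the full post, which would explain the truncated text.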

This is working fine for me. The code:

from facebook_scraper import *
import logging
enable_logging(logging.DEBUG)

for post in get_posts("najibrazak", cookies="cookies.txt"):
    if post.get("post_id") == "10157944979490952":
        print(len(post["text"]))
        break

outputs 2708. Do you get something different? Do you get any log messages that might indicate why?

I’ve committed your newline fix as https://github.com/kevinzg/facebook-scraper/commit/fb15eb5b745d09bbcfcbd45bf1425e8c349ab03c