spider: Scraped HTML does not match the URL - chrome [with_wait_for_idle_network]

Using 1.82.4, when running the code below, the URL doesn't match the page contents; it seems to mix up the URLs of different pages when inspecting the contents. So far it works fine on https://rsseau.fr, but it has trouble with the URL below. Do I need to use website.subscribe_guard()?

// Imports assumed for this snippet; module paths may differ slightly between spider versions.
use std::fs;
use std::time::{Duration, Instant};
use spider::configuration::WaitForIdleNetwork;
use spider::website::Website;

//    crawl_and_scrape_urls("https://docs.drift.trade").await;
pub async fn crawl_and_scrape_urls(webpage: &str) {
    let mut website: Website = Website::new(webpage)
        .with_chrome_intercept(cfg!(feature = "chrome_intercept"), true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_caching(cfg!(feature = "cache"))
        .with_delay(200)
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    // website.subscribe_guard()
    let start = Instant::now();
    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("found {:?}, size: {}, is_some:{}, status:{:?}, {:?}", page.get_url(), page.get_bytes().map(|b| b.len()).unwrap_or_default(), page.get_bytes().is_some(), page.status_code, start.elapsed());
            fs::write(page.get_url_final().replace("/", "__"), page.get_html()).expect("Unable to write file");
        }
    });

    // crawl the site first
    website.crawl().await;
    // persist links to the next crawl
    website.persist_links();
    // scrape all discovered links
    website.scrape().await;
}
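
On the subscribe_guard() question: a minimal sketch, assuming subscribe_guard() returns a channel guard whose inc() call marks each received page as handled so the crawl can continue. This is illustrative only and may not address the URL/content mismatch itself.

// Sketch only: the guard API shape (subscribe_guard() / inc()) is an assumption here.
pub async fn crawl_with_guard(webpage: &str) {
    let mut website: Website = Website::new(webpage);
    let mut rx2 = website.subscribe(16).unwrap();
    // Assumption: the guard throttles the crawl until each subscriber marks a page done.
    let mut guard = website.subscribe_guard().unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("handled {:?}", page.get_url());
            // Signal that this page has been processed so the crawler can proceed.
            guard.inc();
        }
    });

    website.crawl().await;
}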

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 17 (5 by maintainers)

Most upvoted comments

Works perfectly, thank you very much @j-mendez 😃 😃

Sure, I can test in 1h

Thank you very much 🥇

Hi @esemeniuc, thanks for the issue! It looks like with_wait_for_idle_network was causing the page to lose focus. This is now fixed in v1.82.5.
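
If it helps anyone verifying the fix after upgrading: a minimal sketch that prints each page's <title> next to its final URL so a mismatch between URL and HTML stands out. The function name and the naive <title> scan are illustrative only.

pub async fn spot_check(webpage: &str) {
    let mut website: Website = Website::new(webpage);
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            let html = page.get_html();
            // Naive scan for the <title> tag; real pages may need proper parsing.
            let title = html
                .split("<title>")
                .nth(1)
                .and_then(|rest| rest.split("</title>").next())
                .unwrap_or("<no title>")
                .trim();
            println!("{} -> {}", page.get_url_final(), title);
        }
    });

    website.crawl().await;
}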