spider: Scraped HTML does not match the URL - chrome [with_wait_for_idle_network]

Using 1.82.4, when running the code below, the URL doesn't match the page contents; it seems to mix up the URLs of different pages when inspecting the contents. So far it works fine on https://rsseau.fr, but it has trouble with the URL below. Do I need to use website.subscribe_guard()?

// Imports assumed for this snippet; module paths may differ slightly between spider versions.
use std::fs;
use std::time::{Duration, Instant};
use spider::configuration::WaitForIdleNetwork;
use spider::website::Website;

//    crawl_and_scrape_urls("https://docs.drift.trade").await;
pub async fn crawl_and_scrape_urls(webpage: &str) {
    let mut website: Website = Website::new(webpage)
        .with_chrome_intercept(cfg!(feature = "chrome_intercept"), true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_caching(cfg!(feature = "cache"))
        .with_delay(200)
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    // website.subscribe_guard()
    let start = Instant::now();
    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("found {:?}, size: {}, is_some:{}, status:{:?}, {:?}", page.get_url(), page.get_bytes().map(|b| b.len()).unwrap_or_default(), page.get_bytes().is_some(), page.status_code, start.elapsed());
            fs::write(page.get_url_final().replace("/", "__"), page.get_html()).expect("Unable to write file");
        }
    });

    // crawl the site first
    website.crawl().await;
    // persist links to the next crawl
    website.persist_links();
    // scrape all discovered links
    website.scrape().await;
}
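
On the subscribe_guard() question: a minimal sketch, assuming subscribe_guard() returns a channel guard whose inc() call marks each received page as handled so the crawl can continue. This is illustrative only and may not address the URL/content mismatch itself.

// Sketch only: the guard API shape (subscribe_guard() / inc()) is an assumption here.
pub async fn crawl_with_guard(webpage: &str) {
    let mut website: Website = Website::new(webpage);
    let mut rx2 = website.subscribe(16).unwrap();
    // Assumption: the guard throttles the crawl until each subscriber marks a page done.
    let mut guard = website.subscribe_guard().unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("handled {:?}", page.get_url());
            // Signal that this page has been processed so the crawler can proceed.
            guard.inc();
        }
    });

    website.crawl().await;
}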

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 17 (5 by maintainers)

Most upvoted comments

Works perfectly, thank you very much @j-mendez 😃 😃

Sure, I can test in 1h

Thank you very much 🥇

Hi @esemeniuc, thanks for the issue! It looks like with_wait_for_idle_network was causing the page to lose focus. This is now fixed in v1.82.5.
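
If it helps anyone verifying the fix after upgrading: a minimal sketch that prints each page's <title> next to its final URL so a mismatch between URL and HTML stands out. The function name and the naive <title> scan are illustrative only.

pub async fn spot_check(webpage: &str) {
    let mut website: Website = Website::new(webpage);
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            let html = page.get_html();
            // Naive scan for the <title> tag; real pages may need proper parsing.
            let title = html
                .split("<title>")
                .nth(1)
                .and_then(|rest| rest.split("</title>").next())
                .unwrap_or("<no title>")
                .trim();
            println!("{} -> {}", page.get_url_final(), title);
        }
    });

    website.crawl().await;
}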