spider: Scraped html does not match the url - chrome [with_wait_for_idle_network]
Using 1.82.4, when running the code below, the URL doesn't match the page contents; it seems to mix up URLs for different pages when inspecting the contents. So far it works fine on https://rsseau.fr, but it has trouble on the URL below. Do I need to use website.subscribe_guard()?
// crawl_and_scrape_urls("https://docs.drift.trade").await;

use spider::configuration::WaitForIdleNetwork;
use spider::tokio;
use spider::website::Website;
use std::fs;
use std::time::{Duration, Instant};

pub async fn crawl_and_scrape_urls(webpage: &str) {
    let mut website: Website = Website::new(webpage)
        .with_chrome_intercept(cfg!(feature = "chrome_intercept"), true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_caching(cfg!(feature = "cache"))
        .with_delay(200)
        .build()
        .unwrap();

    let mut rx2 = website.subscribe(16).unwrap();
    // website.subscribe_guard() (see the sketch after this function)
    let start = Instant::now();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!(
                "found {:?}, size: {}, is_some: {}, status: {:?}, {:?}",
                page.get_url(),
                page.get_bytes().map(|b| b.len()).unwrap_or_default(),
                page.get_bytes().is_some(),
                page.status_code,
                start.elapsed()
            );
            fs::write(page.get_url_final().replace("/", "__"), page.get_html())
                .expect("Unable to write file");
        }
    });

    // crawl the site first
    website.crawl().await;
    // persist links to the next crawl
    website.persist_links();
    // scrape all discovered links
    website.scrape().await;
}
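On the subscribe_guard question: below is a minimal sketch of how the guard could be wired in, following the pattern in the spider docs. The exact ChannelGuard API, including the inc() call that releases the crawler after each page is handled, may differ across versions, so treat this as an assumption to verify against your version. It would replace the subscription block inside the function above:

    let mut rx2 = website.subscribe(16).unwrap();
    // The guard makes the crawl wait until each received page is processed.
    let mut guard = website.subscribe_guard().unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // handle the page...
            println!("found {:?}", page.get_url());
            // release the crawler to continue past this page (assumed API)
            guard.inc();
        }
    });

    website.crawl().await;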
About this issue
- State: closed
- Created 4 months ago
- Comments: 17 (5 by maintainers)
Works perfectly, thank you very much @j-mendez!
Sure, I can test in 1h
Thank you very much!
Hi @esemeniuc, thanks for the issue! It looks like with_wait_for_idle_network causes the page to lose focus. Now fixed in v1.82.5.
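For anyone hitting the same problem, upgrading the dependency picks up the fix. A sketch of the Cargo.toml entry, with the feature flags the code above relies on (the exact feature set is an assumption; match it to your build):

    [dependencies]
    spider = { version = "1.82.5", features = ["chrome", "chrome_intercept", "cache"] }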