[question] Missing or timed out dynamic request to resource #664

Open
wsdookadr opened this issue Aug 3, 2024 · 1 comment
Labels
question Further information is requested

Comments


wsdookadr commented Aug 3, 2024

When I crawl a website with mostly static resources, I notice that some resources can be missing from the resulting WARC. The cause is either broken links or timeouts.

I have written tools that collect every WARC-Target-URI, go through all the Content-Type: text/html WARC records, and extract the URLs found in <img src>, <link href>, <a href>, or <script src>. They then compare what is referenced with what is available in the WARC and fetch the missing resources (using wget --warc-file). That works fine.
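For readers who want to reproduce the comparison step described above, here is a minimal sketch using only the Python standard library. It is not the poster's actual tooling: the names `RefCollector`, `referenced_urls`, and `missing_urls` are hypothetical, and it assumes the HTML bodies and the set of WARC-Target-URI values have already been read out of the WARC by some other means.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# (tag, attribute) pairs mentioned in the comment above.
URL_ATTRS = {("img", "src"), ("link", "href"), ("a", "href"), ("script", "src")}


class RefCollector(HTMLParser):
    """Collects absolute http(s) URLs referenced by one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.refs = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in URL_ATTRS:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.refs.add(absolute)


def referenced_urls(html, base_url):
    """Return the set of URLs referenced by a single page."""
    parser = RefCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.refs


def missing_urls(pages, captured):
    """pages: dict of page URL -> HTML text (assumed already extracted).
    captured: iterable of WARC-Target-URI values found in the WARC.
    Returns the referenced URLs that have no capture."""
    wanted = set()
    for page_url, html in pages.items():
        wanted |= referenced_urls(html, page_url)
    return wanted - set(captured)
```

The returned set is what would then be handed to a fetcher such as wget --warc-file. Note that this only sees URLs present in the static markup, which is exactly the limitation the next question is about.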

Is it possible to do the same for dynamic pages, where the requests are made by JS? Does Browsertrix record all attempted XHR requests (not necessarily completed) somewhere?

Thanks!

@ikreymer
Member

Yes, the crawler should already be grabbing all of these resources via dynamic behaviors.
The autofetch behavior should pull all img, img srcset, etc. (see: https://github.com/webrecorder/browsertrix-behaviors/blob/main/src/autofetcher.ts#L12) and is enabled by default, so you shouldn't need to do anything with wget.

Do you have an example where this is failing? You can increase the timeout for behaviors with --behaviorTimeout if the crawler is timing out.

The urn:pageinfo:<url> record should list all resources that were requested on the page, including those that failed.
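As a rough illustration of how one might locate those records, the sketch below walks a WARC file with the Python standard library and yields the payload of every record whose WARC-Target-URI starts with `urn:pageinfo:`. The function names are hypothetical, the reader is a deliberately minimal one (a dedicated WARC library would be more robust), and the payload is returned as raw bytes because its layout is not described in this thread.

```python
import gzip


def open_warc(path):
    """Open a WARC file for reading bytes, transparently handling .gz."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in a decompressed WARC stream.
    Header names are lowercased. Minimal reader, for illustration only."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def pageinfo_records(stream):
    """Yield (page_url, payload_bytes) for every urn:pageinfo:<url> record.
    Assumes the record is identified by its WARC-Target-URI."""
    prefix = "urn:pageinfo:"
    for headers, payload in iter_warc_records(stream):
        uri = headers.get("warc-target-uri", "").strip("<>")
        if uri.startswith(prefix):
            yield uri[len(prefix):], payload


# Example use (the file name is a placeholder):
# with open_warc("crawl.warc.gz") as f:
#     for page_url, payload in pageinfo_records(f):
#         print(page_url, len(payload))
```

Inspecting one payload by hand first is probably the safest way to see which fields describe failed requests before building tooling on top of it.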

Do you have an example where this is happening? Perhaps there's some other issue.

ikreymer added the "question" label (Further information is requested) on Aug 29, 2024