Streaming in-place WACZ creation + CDXJ indexing #673

Merged (76 commits) on Aug 29, 2024

@ikreymer (Member) commented Aug 26, 2024

Fixes #674

This PR supersedes #505. Instead of using js-wacz for optimized WACZ creation, it:

  • generates an 'in-place' or 'streaming' WACZ in the crawler, without having to copy the data again.
  • streams WACZ contents to remote upload (or to disk) from existing files on disk.
  • merges CDXJ indices using the Linux 'sort' command, and compresses them to ZipNum if >50K (or always if using --generateCDX).
  • reads / writes all data in the WACZ once.
  • should result in significant speed / disk usage improvements. Previously, each WARC was written once, read again (for CDXJ indexing), read again (for adding to the new WACZ ZIP), written to disk (into the new WACZ ZIP), and read again (if uploading to a remote endpoint). Now, WARCs are written once along with the per-WARC CDXJ, the CDXJ is then merged on disk, and all data is read once to either generate the WACZ on disk or upload it to the remote endpoint.
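
The CDXJ merge step described above can be illustrated with a minimal sketch. This is not the code in this PR, which shells out to the Linux 'sort' command. It is an in-memory stand-in that shows why per-WARC indices that are each already sorted can be combined in a single pass. The function names, the reading of ">50K" as a line count, and the `generateCDX` parameter are all assumptions made for illustration.

```typescript
// Illustrative sketch only: the PR itself merges CDXJ files with the
// Linux 'sort' command. All names below are hypothetical.

// Assumption: ">50K" is read here as "more than 50,000 CDXJ lines".
const ZIPNUM_MIN_LINES = 50000;

// Merge several per-WARC CDXJ indices, each already sorted, into one
// sorted list. Every input line is visited exactly once.
// Note: JS string comparison uses UTF-16 code units, which matches
// byte-order sorting only for ASCII keys.
function mergeSortedCdxj(indices: string[][]): string[] {
  const positions: number[] = indices.map(() => 0);
  const merged: string[] = [];

  for (;;) {
    // Find the input whose next unread line sorts first.
    let best = -1;
    for (let i = 0; i < indices.length; i++) {
      if (positions[i] >= indices[i].length) {
        continue; // this input is exhausted
      }
      if (
        best === -1 ||
        indices[i][positions[i]] < indices[best][positions[best]]
      ) {
        best = i;
      }
    }
    if (best === -1) {
      break; // all inputs exhausted
    }
    merged.push(indices[best][positions[best]]);
    positions[best]++;
  }
  return merged;
}

// Decide whether the merged index should be compressed to ZipNum,
// following the rule stated in the PR description.
function shouldUseZipNum(lineCount: number, generateCDX: boolean): boolean {
  return generateCDX || lineCount > ZIPNUM_MIN_LINES;
}
```

For real crawls the indices live on disk rather than in arrays, which is where an external tool such as 'sort' helps: the merge can run without loading every line into memory.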

@ikreymer ikreymer requested a review from tw4l August 26, 2024 21:31
@tw4l tw4l mentioned this pull request Aug 26, 2024
@tw4l (Contributor) left a comment

A few initial comments/suggestions. Still testing, but so far it's looking pretty good!

src/util/wacz.ts (outdated, resolved)
src/util/wacz.ts (outdated, resolved)
src/util/wacz.ts (outdated, resolved)
src/util/wacz.ts (resolved)
tests/basic_crawl.test.js (outdated, resolved)
ikreymer and others added 5 commits August 28, 2024 10:22
Co-authored-by: Tessa Walsh <[email protected]>
…ker for clarity

remove timezone offset, since running in docker always in UTC by default
(py-wacz did not have a timezone offset either)
ci: add py-wacz for ci use
remove unused test (was never used)
@ikreymer (Member, Author) commented

Updated to the latest released warcio.js and wabac.js; should be good to merge once everything else is resolved.

@ikreymer ikreymer marked this pull request as ready for review August 28, 2024 22:04
to make it clear the directory is not temporary (unlike tmp-dl) and should
be kept after a crawl is done, to allow further addition to the same collection
src/util/wacz.ts (outdated, resolved)
@ikreymer ikreymer merged commit 85a07af into main Aug 29, 2024
4 checks passed
@ikreymer ikreymer deleted the streaming-wacz branch August 29, 2024 20:21
Linked issue: Use in-place streaming to generate WACZ files
2 participants