Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add WACZ support #770
base: master
Are you sure you want to change the base?
Add WACZ support #770
Changes from 9 commits
779978a
e82a4f7
133a9e4
9436999
a422b47
25e91ad
f9447d2
2566b7c
e4b9998
7e2589d
038d2bb
b6e789a
aae8b39
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big issue, but I think the temporary folders created by the
mkdtemp()
call will continue to exists (until cleaned up by the OS) because only the files inside them are deleted, not the folders themselves.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs say that the creator is responsible for the deletion, so I think we should handle this. Given each WARC gets a new temp directory, it might be better to just retain the copy of this directory path and delete it along with its contents instead of deleting the WARC then the directory, which would require tracking the directory path, too. Which approach would you rather be implemented, @ibnesayeed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really! It is possible to get the path of the directory if the path of a file is known that it contains.
That said, I would perhaps preferred not holding onto the list of WARC files, instead, operate on each WARC as we discover them, whether those are regular WARC files or those extracted from WACz files. I would deal with one file at a time and loop over for the next one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ibnesayeed This seems like it requires a revamp outside of the scope of this GH issue/PR. I agree that dealing with one WARC at a time would likely be more computationally optimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that it would require change in the workflow. When done, it would be more space efficient as not all the WARCs need to be extracted from WACZ files upfront, duplicating them on the disk, before processing them.
It is okay to leave it as things are right now and get back to this when we have a WARC record iterator for the WACZ files, when most of these changes will be rendered useless.