Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement fsspec in place of open #168

Closed
wants to merge 7 commits into from

Conversation

hardingnj
Copy link

Hi, as discussed in #165 here is a PR leveraging fsspec.

The code changes are straightforward. Though I am having trouble getting tests to pass. Any pointers would be appreciated.

The other thing that might be worth considering is the use of Bio.bgzf. I think that reading of gz files can be done directly, as bgz should be fully compatible. It might be a case of using fsspec.open() with the compression set as "auto". Although there might be a use of bgz I am not aware of?

@mdshw5
Copy link
Owner

mdshw5 commented Aug 6, 2020

Thanks for working on this @hardingnj! I'll take a look and see why tests aren't passing. The use of BGZF is meant to allow block-based access to gzip compressed files (https://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html). The Bio.bgzf module is in fact using the python gzip module, but with extra logic to allow construction of virtual offsets into these compressed blocks.

mdshw5 and others added 3 commits August 6, 2020 10:14
deprecate bgzf lib, handled via compression = 'infer'

added explicit __array__ method in FastaRecord
@hardingnj
Copy link
Author

Hi @mdshw5 ,

Have added a bit more work. I still have test failures, but these I think are expected failures that might be handled ok in fsspec.

Also added an __array__ method to FastaRecord, which I think is the preferred way of creating numpy arrays.

@mdshw5
Copy link
Owner

mdshw5 commented Aug 6, 2020

Thanks. The test that is failing is due to python's universal newline support. pyfaidx needs to account for the number of line terminator characters to properly build its index (f314b59). The solution is to revert to open the file in binary mode so we can count the line terminator characters.

@@ -514,8 +488,8 @@ def read_fai(self):

def build_index(self):
try:
with self._fasta_opener(self.filename, 'rb') as fastafile:
with open(self.indexname, 'w') as indexfile:
with self._fasta_opener(self.filename, 'r', compression=self._compression) as fastafile:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hardingnj The mode here should be 'rb' if possible so that later code can see the line terminator characters.

@mdshw5
Copy link
Owner

mdshw5 commented Aug 6, 2020

It seems like the Bio.bgzf module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.

@hardingnj
Copy link
Author

Thanks. The test that is failing is due to python's universal newline support. pyfaidx needs to account for the number of line terminator characters to properly build its index (f314b59). The solution is to revert to open the file in binary mode so we can count the line terminator characters.

Ahah- I thought it was something to do with that... though I didn't understand why you expect different line lengths depending on the presence of windows line endings.

@hardingnj
Copy link
Author

It seems like the Bio.bgzf module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.

I'm not sure why it's necessary though? We're not compressing the file, just constructing the index from blocks, unless I am misunderstanding something?

@@ -1021,6 +999,17 @@ def __getitem__(self, rname):
if isinstance(rname, integer_types):
rname = next(islice(self.records.keys(), rname, None))
try:
#
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cruft- crept in when trying to debug!

@mdshw5
Copy link
Owner

mdshw5 commented Aug 6, 2020

It seems like the Bio.bgzf module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.

I'm not sure why it's necessary though? We're not compressing the file, just constructing the index from blocks, unless I am misunderstanding something?

It's necessary to be able to seek past all of the 64k blocks that don't contain the sequence you need, and also seek to the correct position within the block of interest. @peterjc did a fantastic job documenting his module, and the section on file handle offsets is more than worth your time to read. Thanks for your contributions here - I do think the fsspec module makes sense but I do want to make sure to incorporate BGZF support properly.

I've also looked more closely at the test results and my BGZF tests are indeed failing without the Bio.bgzf module. I've got some work in #164 that will make the BGZF support more robust and improve performance as well.

@hardingnj
Copy link
Author

Thanks @mdshw5 . Of course- very happy to make the effort to integrate properly. I'll need to read a bit further into bgzf first though

@manzt
Copy link

manzt commented Feb 8, 2022

I came across this PR yesterday when exploring if some of our API's could be changed to support fsspec file-like objects. It would be desirable to allow a users to pass their own fsspec.OpenFile if possible and only create a new one in __init__ if a string is provided. E.g.

    def __init__(self, filename, ...):
       if isinstance(filename, str):
         self.file = fsspec.open(filename)
       else:
         self.file = filename # fsspec.OpenFile

The advantage of this choice is that users can configure any file object they'd like (auth, caching, etc). It's important to note that fsspec.OpenFile objects don't actually hold an low-level file object but are containers. Low-level file objects are only created when the with context manger is used:

with self.file as f:
  f # now low-level file object is created

This means it's OK to pass around these objects rather than an opener function. In addition, the Bio.bgzf.BgzfReader can be configured explicitly to wrap a file object via bgzf.BgzfReader(fileobj=fileobj) so useage could look something like:

with self.file as f:
  if self._bgzf:
      f = bgzf.BgzfReader(fileobj=f)
  ...

I tried to implement something on my own yesterday, but ran into some issues with how to handle the indexfile.

@mdshw5
Copy link
Owner

mdshw5 commented Feb 8, 2022

@manzt Thanks for the detailed explanation and the example code. I have to admit that I did not understand fsspec but am now a bit more clear about why this is useful. If you want to contribute any code, even if it doesn't fully work I would be glad to collaborate.

@hardingnj
Copy link
Author

Thanks @manzt and @mdshw5 , I really think this is worthwhile- happy to help develop the PR to address the above issues.

@peterjc
Copy link

peterjc commented Feb 8, 2022

Speaking for the Bio/bgzf.py code in Biopython, I'm open to (hopefully backward compatible) changes, but as yet am unfamiliar with fsspec. It may be only changes here in pyfaidx are needed.

@manzt
Copy link

manzt commented Feb 8, 2022

It may be only changes here in pyfaidx are needed.

On first approximation, I don't think there will be any required changes to Bio/bgzf.py since BgzfReader can take a file-like object instead of a path.

@ap-- ap-- mentioned this pull request May 31, 2022
@mdshw5 mdshw5 closed this Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants