Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for constructing and using GZI format files for BGZF compressed FASTA #164

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

mdshw5
Copy link
Owner

@mdshw5 mdshw5 commented Jun 25, 2020

This is a work-in-progress implementation for #126 and much doesn't work properly.

There is a lot here that doesn't work, but mainly I was trying
to figure out the format of the GZI file and provide methods to
unpack and pack the binary on-disk format. There are also methods
for loading the GZI into an object for use by Faidx.
@@ -603,13 +627,75 @@ def build_index(self):
% self.indexname)
elif isinstance(e, FastaIndexingError):
raise e


def build_gzi(self):
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method should work as-is. The idea is to load the BGZF block boundaries into list that we can bisect to find BGZF virtual offsets that correspond to the closest genomic coordinate we're seeking.

if not eof.empty:
raise IOError("BGZF EOF marker not found. File %s is not a valid BGZF file." % self.filename)


Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The read_gzi and write_gzi methods should be re-written to use the functions from the end of this file (I think).

chunk = start0 + newlines_before + newlines_inside + seq_len
chunk_seq = self.file.read(chunk).decode()
seq = chunk_seq[start0 + newlines_before:]
bstart = i.offset + newlines_before + start0 # uncompressed offset for the start of requested string
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this section was really working, and should be tested. This is where most of the work needs to happen to close out this feature.

Comment on lines +671 to +698
def build_gzi(self):
""" Build the htslib .gzi index format """
from Bio import bgzf
with open(self.filename, 'rb') as bgzf_file:
for i, values in enumerate(bgzf.BgzfBlocks(bgzf_file)):
self.gzi_index[i] = BGZFblock(*values)

def write_gzi(self):
""" Write the on disk format for the htslib .gzi index
https://github.com/samtools/htslib/issues/473"""
with open(self.gzi_indexname, 'wb') as bzi_file:
bzi_file.write(struct.pack('<Q', len(self.gzi_index)))
for block in self.gzi_index.values():
bzi_file.write(block.as_bytes())

def read_gzi(self):
""" Read the on disk format for the htslib .gzi index
https://github.com/samtools/htslib/issues/473"""
from ctypes import c_uint64, sizeof
with open(self.gzi_indexname, 'rb') as bzi_file:
number_of_blocks = struct.unpack('<Q', bzi_file.read(sizeof(c_uint64)))[0]
for i in range(number_of_blocks):
cstart, ustart = struct.unpack('<QQ', bzi_file.read(sizeof(c_uint64) * 2))
if cstart == '' or ustart == '':
raise IndexError("Unexpected end of .gzi file. ")
else:
self.gzi_index[i] = BGZFblock(cstart, None, ustart, None)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a duplicate code block?

@ThomVett
Copy link

ThomVett commented Feb 8, 2023

Hello all - wanted to ask about the progress of this pull request? Any way we can help with testing / contributing? Thanks!

@dennishendriksen
Copy link

@mdshw5: same as @ThomVett I'm interested in the progress of this pull request as a solution to whatshap/whatshap#151 which is included in HKU-BAL/Clair3#163 which in turn is the most popular variant calling library for long-read sequencing data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants