Maybe take a walk? #122

Open
MatthewRalston opened this issue Feb 20, 2024 · 4 comments
Assignees
Labels
bug Something isn't working dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed question Further information is requested wontfix This will not be worked on

Comments

@MatthewRalston
Owner

MatthewRalston commented Feb 20, 2024

After the 0.7.6 release, the kmerdb project will pivot to a modified .kdb format. No backwards compatibility is explicitly planned.

The goal of the refactor/pivot is to introduce networkx and/or cugraph as possible toolkits for implementing an assembly algorithm and/or a .kdbg format specification. The target is either exact assembly (.fasta) or an approximate 'Eulerian' walk (.fastq) through the rows specified in the "Assembly algorithm prototype" GitHub milestone.

@MatthewRalston MatthewRalston self-assigned this Feb 20, 2024
@MatthewRalston MatthewRalston added documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed labels Feb 20, 2024
@MatthewRalston
Owner Author

MatthewRalston commented Mar 5, 2024

Today, progress was made on generating the graph for the Eulerian walk. The metadata format/schema largely remains the same, and so far the main schema consists of three columns: n1, n2, and w. The weight w is specified for the Eulerian path, however that ends up being implemented.

  • Complete the metadata writing subroutine
  • discourage the direct dumping of a jank structure >:)
  • reconstruct the reader class
  • don't commit atomically until the writing and refactoring is completed
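
To make the three-column schema concrete, here is a minimal sketch of how (n1, n2, w) rows could be derived from a single sequence. The 2-bit k-mer id encoding, the function names, and the choice of w as the number of times one k-mer is directly followed by the other are all assumptions for illustration, not the kmerdb implementation.

```python
from collections import Counter

# Assumed 2-bit encoding (A=0, C=1, G=2, T=3); hypothetical, for illustration only.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_id(kmer):
    """Map a k-mer string to an integer id in [0, 4**k)."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + BASE_TO_BITS[base]
    return idx


def edge_rows(seq, k):
    """Return sorted (n1, n2, w) rows for consecutive k-mer pairs in seq.

    n1 and n2 are k-mer ids; w counts how often n2 directly follows n1.
    """
    counts = Counter()
    for i in range(len(seq) - k):
        n1 = kmer_to_id(seq[i:i + k])
        n2 = kmer_to_id(seq[i + 1:i + k + 1])
        counts[(n1, n2)] += 1
    return sorted((n1, n2, w) for (n1, n2), w in counts.items())


rows = edge_rows("ACGTACG", 3)
```

Pairs that never occur are simply absent from the result, so this corresponds to a sparse listing rather than a full matrix.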

Started reading the wiki article on plessy v Ferguson. I know Americans on the left love to shout from the rooftops that kumbaya for all is here. It's '24 after all. And yet, 50% of the populace was not dissuaded by the use of flagrantly racist language and rhetoric by the ex President Donnie grump (voldy 2.0)

And what's more startling I guess is the disparities between celebrities on television, and the reality of many as and other minority and also white persons in different housing districts than the elite. whoa but just on the deep hip hop dose on the st life thing is making my head ache. if you know you know. people get ignorant about waste and ignorant about love. tytgs.

I was fired for missing one ducking email and bc I wrestle about forgiving you for not defending me on that issue. You're privileged and I got dumped. Steve and Deb, you're no different. Doesn't matter what you thought now or then, I stuck up with your group when you needed an extra head. I got the work done.

Go to hell.

That's what I think of the goddak establishment.

Here I am door dashing and begging my parents to cut my interest rates so I can afford to eat. You pos won't ever understand that.and I hope you never have to. But miss me with that kumbaya ish rn.

on the upside, fuggin hate my brand but love the game. different directions both re self study, metrics, profiling, and graphics. still need a more concrete problem to make the feature on the algorithm biorxiv. that's what's got me stuck in loops re money.

that would be the real assembly algo and the future goal, but we might only have time for a networkx cpu strategy. after that, a cugraph assembler could leverage the indexing structure .kdb.gi to produce tuples rapidly for python, which transfers them to the gpu for a cugraph graph traversal after trimming. the networkx assembler would leverage the same thing. this is essentially milestone 2

because id rather take the right whip out into the country and gather field samples, than to get stuck in a wfh situation churning my money on finding better digital samples when i'd rather do something combining field, wet bench, and then maybe some fastq exploration with maybe a model of the graph and the best case scenarios re: known genome (eco, bsub, cdiff, etc) full assembly (n50, ng50, contig count, orf count/gene count, pfam stats, other orthology/paralogy metrics, contig diversity, read diversity), and/or approximate Eulerian walk (after edge and node trimming strategies, followed by like.... idk yet.)
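
As a rough picture of the "trim, then walk" idea on the CPU, here is a pure-Python sketch using Hierholzer's algorithm. It stands in for whatever networkx or cugraph routine ends up being used; the function names, the `min_weight` threshold, and the rule that each (n1, n2, w) row counts as one edge are assumptions, not project decisions.

```python
from collections import defaultdict


def trim_edges(rows, min_weight=1):
    """Keep only (n1, n2, w) rows whose weight is at least min_weight."""
    return [(n1, n2, w) for n1, n2, w in rows if w >= min_weight]


def eulerian_walk(rows):
    """Return a node list that uses every edge exactly once, or None.

    Each row is treated as a single directed edge; weights are ignored here.
    """
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for n1, n2, _w in rows:
        out_edges[n1].append(n2)
        balance[n1] += 1
        balance[n2] -= 1
    if not out_edges:
        return []
    # Start at a node with one surplus outgoing edge, else at any node with edges.
    start = next((n for n, d in balance.items() if d == 1), next(iter(out_edges)))
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            walk.append(stack.pop())
    walk.reverse()
    # If some edges were unreachable, no single Eulerian walk exists.
    return walk if len(walk) == len(rows) + 1 else None


rows = trim_edges([(1, 2, 3), (2, 3, 1), (3, 1, 2), (1, 4, 5)], min_weight=1)
walk = eulerian_walk(rows)
```

Returning None when edges are left over makes it easy to see when trimming has disconnected the graph, which is the case the "approximate" walk would have to handle.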

@MatthewRalston
Owner Author

MatthewRalston commented Mar 5, 2024

Is this where I migrate plans from issue to milestone and/or documentation by modifying obj in comments and then official planning checkmarks?

NEXT:

First block could be '\n'-delimited rows: count vector (n1) and index (n2 == n1) [n2 is the 4**k-dimensional 1-tuple/vector].
Second block could be delimited with uWu: edge 2-tuple vector/array (n1) and weight (n2) as np.array [n1 is the number of possible edges (WOOF); the n2 driver variable may be --sparse or (default) --inclusive]. Inclusive makes the full matrix (don't want) in flattened form and then compressed. Sparse storage may make n2 more reasonable, but the adjacency structure in unstructured form may make indexing worse. If the algorithm for accessing the index rapidly is written in Python (or Cython), then accessing the index table (.kdbgi) should be trivial. If the --sparse option is developed further, an in-memory solver may be next.
Third block also closed with uwu.

  • [ --sparse: Data is in the "adjacency list structure" (collapsed or sparse, and, preferred) ]

Just remove the edges where weight=0

  • [ DEFAULT: Data is the adjacency matrix (full rank of the n1xn1 matrix) ]

Human readability

  • [ Adapt the index function for .kdb.gi or .kdb.i file. ] sike. to backlog if yee dare

  • [ Describe the graph edge list format in the readme, quickstart and website in the github-pages branch ]
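
The difference between the two storage modes can be sketched in a few lines. This assumes a row-major flattened n x n weight matrix for the inclusive form and (n1, n2, w) tuples for the sparse form; the function names are hypothetical and plain lists are used instead of np.array to keep the sketch dependency-free.

```python
def inclusive_to_sparse(flat, n):
    """Convert a row-major flattened n x n weight matrix to (n1, n2, w) rows.

    Edges with weight 0 are dropped, which is all the sparse form requires.
    """
    return [(i // n, i % n, w) for i, w in enumerate(flat) if w != 0]


def sparse_to_inclusive(edges, n):
    """Rebuild the flattened n x n weight matrix from (n1, n2, w) rows."""
    flat = [0] * (n * n)
    for n1, n2, w in edges:
        flat[n1 * n + n2] = w
    return flat


flat = [0, 2, 0,
        0, 0, 1,
        3, 0, 0]
edges = inclusive_to_sparse(flat, 3)
```

With n = 4**k nodes the inclusive form holds 4**(2k) entries regardless of content, while the sparse form grows only with the number of observed edges, which is why the sparse layout is the preferred one here.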

@MatthewRalston
Owner Author

#122 #123 #124 #125 #126

Neighbor construction working out well. A dictionary of dictionaries is being used to focus on local "neighbor" space only: i.e. the 8 adjacent kmers to any id.

This has been spun off in a utility function in kmer.py.
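
For readers unfamiliar with the "8 adjacent kmers" idea, here is a minimal sketch of one way to compute them. It is not the utility in kmer.py: the function name `neighbor_ids`, the 2-bit id encoding (A=0, C=1, G=2, T=3), and the inner dictionary keys are assumptions for illustration.

```python
BASES = "ACGT"  # assumed 2-bit order: A=0, C=1, G=2, T=3


def neighbor_ids(kmer_id, k):
    """Return the ids of the up-to-8 k-mers overlapping kmer_id by k-1 bases.

    'appended' drops the first base and appends one of A/C/G/T;
    'prepended' drops the last base and prepends one of A/C/G/T.
    """
    n = 4 ** k
    shifted_left = (kmer_id * 4) % n   # ids of suffix + new last base
    shifted_right = kmer_id // 4       # ids of new first base + prefix
    return {
        "appended": {b: shifted_left + i for i, b in enumerate(BASES)},
        "prepended": {b: shifted_right + i * (n // 4) for i, b in enumerate(BASES)},
    }


# Dictionary of dictionaries keyed by k-mer id, built only for ids of interest.
neighbors = {kmer_id: neighbor_ids(kmer_id, 3) for kmer_id in (6, 27)}
```

Some of the 8 ids can coincide (for example, a homopolymer such as AAA is its own neighbor in both directions), so the local space is at most 8 distinct k-mers.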

Adding some more documentation to the kmerdb.graph module as it prototypes most of what is required to write and read (some validations) .kdbg files.

Edge list and data structure still in planning.

@MatthewRalston
Owner Author

Need to revive this stale issue. Where I left off, I was looking at NetworkX and visualizers. I got sidetracked on the DOT format and PyDot, and I'd like to add that support.

  • PyDot support OR direct interop
  • NetworkX
  • Cython
  • cugraph routines

cugraph may be needed in the overall assembly algorithm to simplify or accelerate depth-first-search traversals, and to associate inter-node metrics, scores, and an optimizer.

Of course, in order to implement or refine any method of this sort, I need first to be able to check structure and progress made from naive approaches, during the score formulation, weighting, and refinement stage.

  • metrics
  • scores
  • optimizer
  • walkXpath files, intermediary formats.
  • reimplementation of graph algorithm from scratch in Cython
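
As an example of the kind of metric that could be used to check progress from naive approaches, N50 (named earlier in this thread) is straightforward to compute from contig lengths. This is a sketch using the usual definition: the length of the shortest contig in the smallest set of longest contigs covering at least half of the total assembled length. The function name is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths, or 0 if it is empty."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


value = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

NG50 would follow the same loop with `half_total` replaced by half of a known or estimated genome size instead of half of the assembly size.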

@MatthewRalston MatthewRalston added bug Something isn't working good first issue Good for newcomers question Further information is requested wontfix This will not be worked on dependencies Pull requests that update a dependency file labels Jul 31, 2024
@MatthewRalston MatthewRalston changed the title Graph algorithms Maybe take a walk? Jul 31, 2024
Projects
Status: Ready