Expanded row metadata for graph format #130

MatthewRalston · 2024-03-29T01:44:38Z

Key Question

What is needed for working data structure initialization? Why isn't it working?

The node and edge lists, the prioritization or sort strategy for edge
representation, weights, multigraph and combination representation,
orientation of edges, dual strandedness, and the .kdbg row metadata
fields (Boolean rather than int, i.e. fast lookup) are not yet finalized.

[[ walk file ]]

Walk files are just like path files, and primarily contain an ordering
of edges. All walks are paths, but a walk may have a forward and a
reverse direction, and so every walk and its originating context (aka
a .kdbg file) must be either minimal or expanded. Minimal means all edges
and a positioning id (i) only, plus a "retrospective" bool, a
"solutional" bool (set if the walk is said to be solutional from an
assembly process associated with .kdbg version 1.0.1 or greater), a
version number associated with the kmerdb release, and the sha256 of the
git release (on each edge, yes). Expanded adds retrospective,
prospective, previous forking nodes, and previous walks to investigate
and their node IDs.
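
The minimal/expanded distinction above can be pictured as two row types. This is only a sketch under assumptions: the class names and field names (`MinimalWalkRow`, `ExpandedWalkRow`, `forking_node_ids`, `previous_walk_ids`) are hypothetical and are not part of any finalized kmerdb format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MinimalWalkRow:
    """Hypothetical minimal row: one edge plus its position in the walk."""
    i: int                  # positioning id within the walk
    edge: Tuple[int, int]   # (node1_id, node2_id), assumed to be k-mer IDs
    retrospective: bool     # edge was observed in the originating .kdbg
    solutional: bool        # walk is said to come from an assembly process
    kmerdb_version: str     # version number of the kmerdb release
    git_sha256: str         # hash of the git release, repeated on each edge


@dataclass(frozen=True)
class ExpandedWalkRow(MinimalWalkRow):
    """Hypothetical expanded row: adds the prospective flag and fork context."""
    prospective: bool = False
    forking_node_ids: Tuple[int, ...] = ()
    previous_walk_ids: Tuple[int, ...] = ()


row = ExpandedWalkRow(
    i=0, edge=(6, 24), retrospective=True, solutional=False,
    kmerdb_version="x.y.z", git_sha256="0" * 64,
    prospective=True, forking_node_ids=(6,),
)
```

An expanded row is still a valid minimal row here, so a reader that only understands the minimal fields could ignore the rest.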

schema concepts

for format versions of course...

Should be self-referential and contain nodes, edges, and walks and/or paths. Metadata includes relevant references to schema versioning, and specific file references for interpretation.

[ minimal walks ]

A minimal walk file must also include all edges of the original context
(a.k.a. all edges observed from the dataset(s) in the .kdbg header),
marked with a retrospective bool, along with one or more copies of the
same edge with prospective bool = True when representing a specific walk
(as opposed to a minimal path, which is a single linear representation of
edges: a sort order with no presumed or provided source reference).
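
A minimal sketch of that layout, assuming rows are plain dicts and that the prospective copies carry `retrospective = False` (the issue does not say which value they take). The function name `minimal_walk_rows` and the `pos_walk` key are illustrative only.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def minimal_walk_rows(context_edges: List[Edge], walk: List[Edge]) -> List[Dict]:
    """Emit every context edge once (retrospective), then one prospective
    copy of an edge for each step of the specific walk."""
    known = set(context_edges)
    rows = [
        {"edge": e, "retrospective": True, "prospective": False, "pos_walk": None}
        for e in context_edges
    ]
    for pos, e in enumerate(walk):
        if e not in known:
            raise ValueError(f"walk edge {e} is not in the original context")
        # Assumption: the walk copy is flagged prospective only.
        rows.append(
            {"edge": e, "retrospective": False, "prospective": True, "pos_walk": pos}
        )
    return rows


rows = minimal_walk_rows([(0, 1), (1, 2), (2, 0)], [(0, 1), (1, 2)])
```

An edge that the walk visits twice simply appears as two prospective rows with different `pos_walk` values.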

solutional path

a walk, along with all previous walks (in
chronological aka integer id order, by reference),
along with the sha256sum of the git release
that produced the walk, the metadata, etc...

[[ solutional path file ]]

Header metadata will include the source and the parameters, plus a walk id (a sha256 of the walk) for an associated walk file and a walk name (given at "runtime" via the CLI). The walk id may be 0 to represent an unspecific or unqualified walk (origin unclear).
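
As a rough illustration of such a header, the sketch below derives the walk id from the ordered edge list. The serialization (tab-separated node IDs, one edge per line), the function names, and the header keys are assumptions made for this example, not the kmerdb header format.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int]


def walk_id(edges: List[Edge]) -> str:
    """sha256 over the ordered edge list, so edge order changes the id."""
    payload = "\n".join(f"{a}\t{b}" for a, b in edges)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def path_header(source: str, parameters: Dict,
                edges: Optional[List[Edge]] = None,
                walk_name: str = "") -> Dict:
    """Build a hypothetical solutional path header."""
    return {
        "source": source,
        "parameters": parameters,
        # 0 stands for an unspecific or unqualified walk (origin unclear)
        "walk_id": walk_id(edges) if edges else 0,
        "walk_name": walk_name,
    }


header = path_header("example.kdbg", {"k": 3}, [(6, 24), (24, 33)], "demo")
unqualified = path_header("example.kdbg", {"k": 3})
```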

Related issues

#126 #122 #125 #102 #124

sidenote

The neighbor structure 🌪️is manifested by particular kmer IDs🌬️, which may be accessed from kmer arrays loaded alongside the edge list during a path producing process.
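
To make the sidenote concrete: if k-mer IDs are assumed to be 2-bit encodings of the sequence (A=0, C=1, G=2, T=3, an assumption for this sketch), the neighbors of a k-mer can be computed from its ID alone. The helper names below are hypothetical.

```python
from typing import Dict, List

ALPHABET = "ACGT"  # assumed 2-bit encoding: A=0, C=1, G=2, T=3


def kmer_to_id(kmer: str) -> int:
    """Pack a k-mer into an integer, two bits per base."""
    idx = 0
    for base in kmer:
        idx = (idx << 2) | ALPHABET.index(base)
    return idx


def id_to_kmer(idx: int, k: int) -> str:
    """Unpack an integer ID back into its k-mer string."""
    return "".join(ALPHABET[(idx >> (2 * (k - 1 - j))) & 3] for j in range(k))


def neighbor_ids(idx: int, k: int) -> Dict[str, List[int]]:
    """IDs of the four possible successors and predecessors of a k-mer."""
    mask = (1 << (2 * k)) - 1
    # Successors: drop the first base, append one of A/C/G/T.
    successors = [((idx << 2) & mask) | b for b in range(4)]
    # Predecessors: drop the last base, prepend one of A/C/G/T.
    predecessors = [(idx >> 2) | (b << (2 * (k - 1))) for b in range(4)]
    return {"next": successors, "prev": predecessors}


nbrs = neighbor_ids(kmer_to_id("ACG"), 3)
next_kmers = [id_to_kmer(i, 3) for i in nbrs["next"]]  # CGA, CGC, CGG, CGT
```

Only the neighbors that actually occur in the loaded k-mer array would become edges; this sketch enumerates all candidates.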

A working pipeline would carry all components of the workflow on to the next step, but all commands are currently partial. Schemas are in the planning stage for a future release.


MatthewRalston · 2024-03-29T02:15:09Z

Key Question

What is needed for working data structure initialization? Why isn't it working?

Node files

No comment

Edge files

Not applicable

types of walks

Walk files
Path files
Tree files
Contains:

walks from/to "central/incidental" nodes

Forward walk
Reverse walk

[[ node schema (in progress) ]]

node_id
pos_walk (id in walk file), or pos_path (id in path file)
next_edge id (aka edge 2-tuple), next_path id
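
A possible in-memory shape for a node row, given that the schema is still in progress. Everything here is an assumption for illustration: the `NodeRow` name, the optional fields, and the `.` placeholder for missing values.

```python
from typing import NamedTuple, Optional, Tuple


class NodeRow(NamedTuple):
    """Hypothetical node row following the field list above."""
    node_id: int
    pos_walk: Optional[int] = None              # position in a walk file
    pos_path: Optional[int] = None              # position in a path file
    next_edge: Optional[Tuple[int, int]] = None  # edge 2-tuple
    next_path_id: Optional[int] = None

    def to_tsv(self) -> str:
        """Serialize as one tab-separated line, '.' for missing fields."""
        def cell(value) -> str:
            if value is None:
                return "."
            if isinstance(value, tuple):
                return ",".join(str(x) for x in value)
            return str(value)
        return "\t".join(cell(v) for v in self)


line = NodeRow(node_id=6, pos_walk=0, next_edge=(6, 24)).to_tsv()
```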

[[ Edge schema ]] ---------

node1_id
node2_id
pos_path
pos_walk
prospective bool (aka most edges in a walk/path/climb should be retrospective in the destination context...)
preceding walk id
next walk id

Forward schema

Reverse schema
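
The forward and reverse schemas are not specified yet. As one hedged reading, a reverse walk could be derived from a forward walk by reversing the edge order and swapping each edge's endpoints. The `EdgeRow` class and `reverse_walk` function are hypothetical, and this sketch does not handle reverse complements or dual strandedness.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class EdgeRow:
    """Hypothetical edge row following the field list above."""
    node1_id: int
    node2_id: int
    pos_walk: Optional[int] = None
    pos_path: Optional[int] = None
    prospective: bool = False
    preceding_walk_id: Optional[int] = None
    next_walk_id: Optional[int] = None


def reverse_walk(forward: List[EdgeRow]) -> List[EdgeRow]:
    """Reverse edge order, swap endpoints, and renumber pos_walk."""
    n = len(forward)
    return [
        replace(e, node1_id=e.node2_id, node2_id=e.node1_id, pos_walk=n - 1 - i)
        for i, e in reversed(list(enumerate(forward)))
    ]


forward = [EdgeRow(0, 1, pos_walk=0), EdgeRow(1, 2, pos_walk=1)]
backward = reverse_walk(forward)
```

Reversing twice returns the original rows, which is a cheap consistency check for any forward/reverse pair.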

[[ Walk schema (in progress) ]]

path schema
Walk schema
Solution schema

[[ The walk file ]]

Walk files are just like path files, and primarily contain an ordering of edges. All walks are paths, but a walk may have a forward and a reverse direction, and so every walk and its originating context (aka a .kdbg file) must be either minimal or expanded. Minimal means all edges and a positioning id (i) only, plus a "retrospective" bool, a "solutional" bool (set if the walk is said to be solutional from an assembly process associated with .kdbg version 1.0.0 or greater), a version number associated with the kmerdb release, and the sha256 of the git release (on each edge, yes). Expanded adds retrospective, prospective, and previous forks to investigate and their node IDs.

minimal walks

A minimal walk file must also include all edges of the original context (a.k.a. all edges observed from the dataset(s) in the .kdbg header), marked with a retrospective bool, along with one or more copies of the same edge with prospective bool = True when representing a specific walk (as opposed to a minimal path, which is a single linear representation of edges: a sort order with no presumed origin id).

Related issues

Issues #126 #122 #125 #102 #124

@MatthewRalston thinks the path forward towards a graph format is in creating additional structural definitions. If I think through the relationships preserved among different incomplete and completely self-referential formats, they require associated metadata schemas, and the utility function of taking a table or metadata schematic input and generating a consistently hashable representation (the metadata header format, its parser, and the table parsing functionality, as in these modules)...

```
kmerdb.graph
kmerdb.fileutil
kmerdb.parse
```

and references..

i.e. "the format(s)"

And associated schemas...

This utility function wouldn't be part of the algorithm per se, but it would be incident to that which is produced by virtue of the file-metadata-log (and this version-dataset pairing) thingawhosit. That's mostly contained in our __init__ and the associated module files for format access, and in the associated value provided by features and solutions in future versions.

That pairing, tied to a git sha256 hash, should be preserved with all nodes of a given walk or path.
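
One common way to get a "consistently hashable representation" of a metadata header is to serialize it canonically and hash the bytes. The sketch below uses only the standard library; `metadata_digest` is a hypothetical name and canonical JSON is an assumed choice, not something the kmerdb modules above are known to do.

```python
import hashlib
import json
from typing import Any, Dict


def metadata_digest(metadata: Dict[str, Any]) -> str:
    """sha256 of a canonical JSON form: sorted keys, fixed separators."""
    canonical = json.dumps(
        metadata, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = metadata_digest({"version": "x.y.z", "k": 12, "files": ["example.kdbg"]})
b = metadata_digest({"files": ["example.kdbg"], "k": 12, "version": "x.y.z"})
```

Here `a` and `b` are equal because key order no longer affects the serialized form, while any change to a value produces a different digest.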

MatthewRalston · 2024-04-10T21:15:07Z

This issue has been tabled for the time being in favor of a cleaner UI and experience on the user end.

1. Interface overhaul (issue #132)

I want the user to understand the output and even ASCII styling (in absence of a rich.py dependency, which isn't needed)

output_dir

I want the logfile and output directories (required to collect .kdb, .kdbg, .stats.txt, output.log etc)

usage, steps, and features

I want the expanded help and usage statements, including the 'features' and 'steps' developed further.

minimal STDOUT

And finally, I want the STDOUT to be extremely minimal and/or non-existent, in the profile and graph commands. OR the formatting should display the resulting stats clearly apart from the header.

README "2.0" (issue #137)

Finally, readme overhaul

MatthewRalston · 2024-07-31T01:55:02Z

Okay, I've been working on some other features and needed documentation/UI overhauls. Delays pushed the deadline back a few months, and I'm reprioritizing the assembly algorithm and possible numba/Python etc. implementations of D2 metrics. More odds-ratio stuff is on the horizon, along with more literature review, and I'm beginning to write a report and lit review on applications of kmer count matrices and distances to metagenomics and microbiomes.

MatthewRalston added this to the V0.7 stable? milestone Mar 29, 2024
