Bnb/dh refactor #220

Open
wants to merge 362 commits into
base: main

Conversation

bnb32
Collaborator

@bnb32 bnb32 commented Jun 23, 2024

Ok, here we go...

sup3r/preprocessing was previously, essentially, just data handlers and batch handlers. Now we have Loaders, Extracters, Derivers, and Cachers, which are composed in sup3r.preprocessing.data_handlers.factory to build objects similar to the old DataHandlers. These do basically everything the old handlers used to do, except for training / batching related routines like sampling, normalization, etc. Loaders just load netcdf / h5 data into an xr.Dataset-like container. Extracters extract spatiotemporal regions of data. Derivers derive new features from raw feature data. Cachers, well, they cache data to either h5 or netcdf depending on the extension of the output file provided.
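As a rough illustration of the composition idea described above, here is a minimal sketch in plain Python. Every name in it (`load`, `extract`, `derive`, `cache_format`, `data_handler_factory`) is a hypothetical stand-in and not the sup3r API; nested lists stand in for the xr.Dataset-like container, and the cacher step only picks a format from the file extension rather than writing a file.

```python
import math
from pathlib import Path


def load(raw):
    """Loader stand-in: wrap raw arrays in a dict keyed by feature name."""
    return {name: [list(row) for row in rows] for name, rows in raw.items()}


def extract(data, time_slice, sites):
    """Extracter stand-in: keep a time window and a subset of sites."""
    return {
        name: [[row[s] for s in sites] for row in rows[time_slice]]
        for name, rows in data.items()
    }


def derive(data, features):
    """Deriver stand-in: compute requested features missing from the raw data."""
    out = dict(data)
    if "windspeed" in features and "windspeed" not in out:
        out["windspeed"] = [
            [math.hypot(u, v) for u, v in zip(u_row, v_row)]
            for u_row, v_row in zip(out["u"], out["v"])
        ]
    return {name: out[name] for name in features}


def cache_format(out_file):
    """Cacher stand-in: choose the output format from the file extension."""
    formats = {".h5": "h5", ".nc": "netcdf"}
    suffix = Path(out_file).suffix
    if suffix not in formats:
        raise ValueError(f"Unsupported cache extension: {suffix!r}")
    return formats[suffix]


def data_handler_factory(loader, extracter, deriver):
    """Compose the steps into a single data-handler-like callable."""
    def handler(raw, features, time_slice, sites):
        return deriver(extracter(loader(raw), time_slice, sites), features)
    return handler


handler = data_handler_factory(load, extract, derive)
# Rows are time steps, columns are sites (hypothetical toy data).
raw = {"u": [[3.0, 0.0], [6.0, 1.0], [0.0, 2.0]],
       "v": [[4.0, 1.0], [8.0, 1.0], [5.0, 2.0]]}
data = handler(raw, ["windspeed"], slice(0, 2), [0])
```

The point of the sketch is only that each step has one job and the factory wires them together, so any step can be swapped or used on its own.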

In sup3r/preprocessing we additionally have Samplers and BatchQueues. These are composed in sup3r.preprocessing.batch_handlers.factory to build objects similar to the old BatchHandlers. These do basically everything that the old batch handlers used to do, with some exceptions. The most notable exception is probably that they don't split data into training and validation sets. Instead, BatchHandler objects take "collections" of data-handler-like objects (these can be DataHandlers, Extracters, Derivers, etc.) for both training and validation, and a separate batch queue is used for each. Samplers simply contain an xr.Dataset-like object and sample that data as an iterator. BatchQueue objects interface with samplers to keep a queue full of batches / samples while models are training.
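The sampler / batch queue relationship described above can be sketched with the standard library alone. The `Sampler` and `BatchQueue` classes below are hypothetical simplifications, not the sup3r classes: a plain list stands in for the xr.Dataset-like object, and a background thread keeps a bounded queue filled while the consumer (the training loop) pulls batches.

```python
import queue
import random
import threading


class Sampler:
    """Hypothetical sampler: yields random fixed-length windows from a sequence."""

    def __init__(self, data, sample_len, seed=0):
        self.data = data
        self.sample_len = sample_len
        self.rng = random.Random(seed)

    def __iter__(self):
        return self

    def __next__(self):
        start = self.rng.randrange(len(self.data) - self.sample_len + 1)
        return self.data[start:start + self.sample_len]


class BatchQueue:
    """Hypothetical queue: a background thread keeps batches ready for training."""

    def __init__(self, sampler, batch_size, n_batches, max_queued=4):
        self.sampler = sampler
        self.batch_size = batch_size
        self.n_batches = n_batches
        self._queue = queue.Queue(maxsize=max_queued)
        self._thread = threading.Thread(target=self._fill, daemon=True)
        self._thread.start()

    def _fill(self):
        for _ in range(self.n_batches):
            batch = [next(self.sampler) for _ in range(self.batch_size)]
            self._queue.put(batch)  # blocks while the queue is full
        self._queue.put(None)  # sentinel: no more batches

    def __iter__(self):
        # Single pass only: the queue is drained once per BatchQueue instance.
        while (batch := self._queue.get()) is not None:
            yield batch


# Training and validation data are supplied separately, each with its own queue.
train_data = list(range(100))
val_data = list(range(100, 120))
train_batches = list(
    BatchQueue(Sampler(train_data, sample_len=5), batch_size=4, n_batches=3)
)
val_batches = list(
    BatchQueue(Sampler(val_data, sample_len=5, seed=1), batch_size=4, n_batches=2)
)
```

Keeping the train / validation split outside the handler, as in this sketch, means the caller decides what counts as validation data and each queue only ever sees its own source.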

All these smaller objects (loaders, extracters, derivers, samplers) are built on top of xr.Dataset-like objects (sup3r.preprocessing.accessor.Sup3rX and sup3r.preprocessing.base.Sup3rDataset), which serve as the familiar .data attribute for data and batch handlers. Sup3rDataset is wrapped around Sup3rX to provide an interface for the "dual" dataset objects contained by dual handlers, and acts exactly like Sup3rX when datasets are not dual. Sup3rX is an xr.Dataset "accessor" class, which is the recommended way to extend xr.Datasets (as opposed to subclassing). These Sup3rX / Sup3rDataset objects act similarly to xr.Datasets but with extended functionality. The tests in tests/data_wrappers/ show how to interact with these objects.
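For readers unfamiliar with the accessor pattern mentioned above, here is a minimal sketch using xarray's `register_dataset_accessor`. The accessor name `demo` and its `features` / `as_array` members are made up for illustration and are not the Sup3rX interface; the sketch assumes xarray and numpy are installed.

```python
import numpy as np
import xarray as xr


@xr.register_dataset_accessor("demo")  # hypothetical accessor name
class DemoAccessor:
    """Adds extra behavior to xr.Dataset without subclassing it."""

    def __init__(self, ds):
        self._ds = ds

    @property
    def features(self):
        """Names of the data variables, in dataset order."""
        return list(self._ds.data_vars)

    def as_array(self, features=None):
        """Stack the requested features along a trailing feature axis."""
        features = features or self.features
        return np.stack([self._ds[f].values for f in features], axis=-1)


ds = xr.Dataset(
    {
        "u": (("time", "site"), np.ones((3, 2))),
        "v": (("time", "site"), np.zeros((3, 2))),
    }
)
arr = ds.demo.as_array()  # shape: (time, site, feature)
```

Because the accessor lives in its own namespace (`ds.demo` here), the dataset stays a regular xr.Dataset and all normal xarray operations keep working on it.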

Since the fundamental dataset objects are now xr.Dataset-like, they can use dask arrays to store data. This means we don't need to load data into memory until we need the result of a computation. ForwardPassStrategy and ForwardPass have been updated accordingly: we can lazy load the full input dataset and then index the data handler .data attribute to select generator input chunks, all before loading into memory. BatchHandler objects have a mode argument which can be set to either lazy (load batches into memory only when they are sent out for training) or eager (load .data into memory upon handler initialization).
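The difference between the two modes can be sketched as follows. This is not the sup3r BatchHandler and it does not use dask: the `Handler` class and `make_reader` helper are hypothetical, and a logged function call stands in for a read from disk so that it is visible when data is actually touched.

```python
class Handler:
    """Hypothetical handler with a ``mode`` argument of 'lazy' or 'eager'."""

    def __init__(self, read_fn, mode="lazy"):
        if mode not in ("lazy", "eager"):
            raise ValueError(f"mode must be 'lazy' or 'eager', got {mode!r}")
        self.mode = mode
        self._read_fn = read_fn
        # eager: read everything now; lazy: defer until a batch is requested
        self._data = read_fn(slice(None)) if mode == "eager" else None

    def get_batch(self, index):
        if self._data is not None:
            return self._data[index]  # already in memory
        return self._read_fn(index)  # read only the requested slice


def make_reader(log):
    """Return a stand-in for a disk read that logs each requested slice."""
    source = list(range(1000))

    def read(index):
        log.append(index)
        return source[index]

    return read


eager_log, lazy_log = [], []
eager = Handler(make_reader(eager_log), mode="eager")
lazy = Handler(make_reader(lazy_log), mode="lazy")
reads_after_init = (len(eager_log), len(lazy_log))

eager_batch = eager.get_batch(slice(0, 4))
lazy_batch = lazy.get_batch(slice(0, 4))
```

Both handlers return the same batch; they differ only in when the read happens. Eager pays the full load cost once at initialization, while lazy pays a smaller cost on every batch request.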

Tests have been added for all new preprocessing modules, and lots of documentation / notes have been added throughout. The tests should provide good examples of use patterns for these new objects.

@bnb32 bnb32 force-pushed the bnb/dh_refactor branch 10 times, most recently from ebb154c to bfe2f9f Compare June 27, 2024 17:34
@bnb32 bnb32 marked this pull request as ready for review June 27, 2024 18:17
@bnb32 bnb32 force-pushed the bnb/dh_refactor branch 4 times, most recently from 53d1c66 to bbc4af1 Compare July 1, 2024 15:58
@bnb32 bnb32 force-pushed the bnb/dh_refactor branch 4 times, most recently from 59b9817 to a546b27 Compare July 19, 2024 20:07
… with simple call to handler factory. h5 cc handler tests updated and passing
…check for factory classes. solar model training tests all updated and passing
…erent being used anymore after changing spatial agg to use mean over overlapping gids. Dont need exo_resolution input resolution input anymore either.
grantbuster and others added 30 commits August 27, 2024 15:00
…reduced mem use since number of spatial chunks tends to be much lower than number of time chunks.
…ata time index can be shifted to start at the beginning of the day instead of at noon. GCM data frequently stamps daily data at noon instead of the beginning of the day. This caused an issue with the solar module thinking that given gan data had 48 time steps, since the time index had two unique day values, even though there were only 24 time steps from noon to noon on each day.
…ed to start at the zeroth hour, if it has not been shifted already.
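The noon-stamped time index issue mentioned in the commit messages above can be reproduced with the standard library. This is only a sketch of the idea, assuming hourly data that starts at noon; `shift_to_day_start` is a hypothetical helper, not the function added in this PR.

```python
from datetime import datetime, timedelta

# 24 hourly steps running from noon to noon span two calendar dates.
time_index = [datetime(2020, 1, 1, 12) + timedelta(hours=h) for h in range(24)]
n_days_before = len({t.date() for t in time_index})


def shift_to_day_start(times):
    """Shift a time index so it starts at hour zero; no-op if it already does."""
    offset = timedelta(hours=times[0].hour, minutes=times[0].minute)
    return [t - offset for t in times]


shifted = shift_to_day_start(time_index)
n_days_after = len({t.date() for t in shifted})
```

Counting unique dates on the unshifted index gives two days for 24 steps, which is how a day-based step count can come out as 48. After the shift, the same 24 steps fall on a single date.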
4 participants