Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch upload & other performance enhancements #674

Open
wants to merge 22 commits into
base: develop
Choose a base branch
from

Conversation

aufdenkampe
Copy link
Member

We'll consider this step 2 of 2(ish) in addressing the larger issue (and opportunity) raised by @tpwrules with:

This PR pulls the code from https://github.com/tpwrules/ODM2DataSharingPortal/tree/wip/batch that is successfully running a private instance of Monitor My Watershed for the Univ. of Memphis.

We should probably merge this PR only after merging this smaller PR that contains the first 4 commits:

@ptomasula, @tpwrules, and @SRGDamia1, as a followup to our call today, I decided to create this PR to the new tpw_batch feature branch that I just created from develop.

This gives @ptomasula the ability to resolve the merge conflicts, complete the merge, then test it using CDK, without interfering with our current branches.

If merging and testing of the tpw_batch goes perfectly, we can then merge into develop.

If there are challenges with merging or testing, then perhaps we cherry-pick some of the easier commits first.

A few places make Django ORM queries at the module level and thus run
during import of the portal code. This implies that a working connection
to a properly set-up database is required to e.g. use the Django management
command. This greatly complicates packaging, initial setup, and deployment.

This problem is fixed by simple refactorings and caches.
Creation of the SQLAlchemy model caches occurs during import of the
portal code. This implies that a working connection to a properly set-up
database is required to e.g. use the Django management command. This
greatly complicates packaging, initial setup, and deployment.

This problem is fixed by ignoring errors creating the cache and reminding
the user to check the database.
The `iterkeys()` method on a dict has been replaced by the equivalent
`keys()` method in Python 3. Update to use the correct method.
Reject duplicate names in the controlled vocabularies list. It is unknown
why such duplicates are being served, but they are as of 2022-12-27.
352KB file (one column): 148s -> 1.6s
subrepo:
  subdir:   "ODM2DataSharingPortal"
  merged:   "33b2ff2f"
upstream:
  origin:   "https://github.com/ODM2/ODM2DataSharingPortal"
  branch:   "main"
  commit:   "e3dd03c2"
git-subrepo:
  version:  "0.4.1"
  origin:   "???"
  commit:   "???"
@tpwrules
Copy link

Thanks again for the discussion and allocating time for this.

Please note that there are several commits in here that are for my working process and should be dropped before merging. This branch was not intended to be merged directly. If I can get the site working locally with the Cognito integration then I can rebase but otherwise it is probably best to cherry-pick or have @ptomasula implement the changes in spirit rather than merge directly.

@aufdenkampe aufdenkampe deleted the branch ODM2:develop May 7, 2024 15:31
@aufdenkampe aufdenkampe closed this May 7, 2024
@aufdenkampe aufdenkampe reopened this May 7, 2024
@aufdenkampe aufdenkampe changed the base branch from tpw_batch to develop May 7, 2024 15:40
@ptomasula
Copy link
Member

ptomasula commented May 7, 2024

@aufdenkampe , @ScottEnsign @SRGDamia1

As @tpwrules noted, the tpwrules:wip/batch branch (current set for this PR) is a work in progress branch and not intended to merge in directly. I followed Thomas's recommendation and cherry-picked out select commits (see table below) into a new batch_upload. I'll be updating this PR to use this new batch_upload branch and work on testing and closing things out on there.

Commit Summary Commit Id Cherry-pick Notes
ODM2DSP: add nix specific django settings 919bb48 N These look to be configuration edits
avoid accessing ORM during import 267c3a3 N Already resolved in PR #637
turn sql model cache creation failure from error to warning b16224b Y N Already resolved in PR #637
commands/update_controlled_vocabularies: fix py3 compatibility 4b14718 Y N Already resolved in PR #637
commands/update_controlled_vocabularies: reject duplicates 1fe646a Y N Already resolved in PR #637
dataloaderservices: batch insert data from uploaded CSVs 0f80fe5 Y
remove google analytics 832cc38 Y Should address issue #700
git subrepo pull ODM2DataSharingPortal ed59e23 N Should already be included in cherrypick branch because this is a pull from upstream (this repo).
correct schema search path to use public for django stuff by default c8300c9 N setting.nix. not cherry picked. Can make this change to base.py if needed.
set django's timezone to UTC by default-ish so that data reception wo… cdfeba9 Y N excluding from cherry-pick for now and will further consider if necessary
use upsert instead of mysterious trigger function to update latest se… 8f950bc Y
replace thread pool with bundling everything in a single transaction aecd2c7 Y
speed up dataloader table sync by ditching pandas 6dc8eea Y
ditch pandas from uuid lookup 4365f6a Y
avoid extra step to retrieve site sensor ID 484ca0f Y
handle all data queries as batches 434ab81 Y
slightly optimize authentication e90b108 Y
add batch upload support a1e3eaf Y
optimize batch insertion 87a6c53 Y
fix file upload 1002c8d Y
reduce memory usage and fix issues with empty data during file upload 5a24db1 Y
double file upload speed by copying into temporary table fca8b31 Y

@tpwrules
Copy link

tpwrules commented May 7, 2024

Excited to see the progress!

Some notes, though I haven't looked at all this in a long time:

  • the first three cherry-picked commits are a part of the other PR and so have 0 changes in this one and should be dropped
  • the commit that changes the default time zone to UTC may not apply to your set up, please consider carefully
  • the commit which switches to an upsert might require changes in the database
  • feel free and I strongly suggest to expand on some of these commit messages

@ptomasula
Copy link
Member

Thanks @tpwrules! Those are helpful insights. I have updated the cherry-pick list based on your input.

@aufdenkampe aufdenkampe changed the base branch from develop to batch_upload May 8, 2024 16:50
@aufdenkampe aufdenkampe changed the base branch from batch_upload to develop May 8, 2024 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants