Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docstore/awsdynamodb panics on iter.Next() #3405

Open
thisisaaronland opened this issue Mar 18, 2024 · 16 comments · Fixed by #3406
Open

docstore/awsdynamodb panics on iter.Next() #3405

thisisaaronland opened this issue Mar 18, 2024 · 16 comments · Fixed by #3406

Comments

@thisisaaronland
Copy link

Describe the bug

Calling iter.Next on a docstore/awsdynamodb derived Query instance panics with an "index out of range" error along the lines of:

                panic: runtime error: index out of range [0] with length 0

                goroutine 98 [running]:
                gocloud.dev/docstore/awsdynamodb.(*documentIterator).Next(0x14000056050, {0x101ba36b0?, 0x10258bb40?}, {{0x101b386e0, 0x1400019bea0}, 0x0, {0x101b454e0, 0x1400019bea0, 0x199},\
 {0x140004fc808, ...}})
                        /usr/local/sfomuseum/go-activitypub/vendor/gocloud.dev/docstore/awsdynamodb/query.go:492 +0x260

The Query itself has two "Where" statements bracketing a date range:

https://github.com/sfomuseum/go-activitypub/blob/main/accounts_database_docstore.go#L45-L69

Curiously, this problem only seems to affect one table (accounts). The same code (essentially) run against a different table (deliveries) does not panic:

https://github.com/sfomuseum/go-activitypub/blob/main/deliveries_database_docstore.go#L46-L58

To Reproduce

The code in question is here:

https://github.com/sfomuseum/go-activitypub/blob/main/cmd/counts-for-date/main.go

Specifically:

https://github.com/sfomuseum/go-activitypub/blob/main/stats/counts.go#L44

Which is run like this:

go run cmd/counts-for-date/main.go \
   -accounts-database-uri 'awsdynamodb://accounts?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -blocks-database-uri 'awsdynamodb://blocks?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -boosts-database-uri 'awsdynamodb://boosts?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -deliveries-database-uri 'awsdynamodb://deliveries?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -followers-database-uri 'awsdynamodb://followers?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -following-database-uri 'awsdynamodb://following?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -likes-database-uri 'awsdynamodb://likes?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -messages-database-uri 'awsdynamodb://messages?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -notes-database-uri 'awsdynamodb://notes?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -posts-database-uri 'awsdynamodb://ap_posts?partition_key=Id&allow_scans=true&region={REGION}&credentials={CREDENTIALS}' \
   -date {YYYY-MM-DD} \
   -verbose

If you need/want to set up all those tables, the code to do so is here:

https://github.com/sfomuseum/go-activitypub/blob/main/cmd/create-dynamodb-tables/main.go

And:

https://github.com/sfomuseum/go-activitypub/tree/main/app/dynamodb/tables/create

For example:

go run -mod vendor cmd/create-dynamodb-tables/main.go \
	-dynamodb-client-uri 'awsdynamodb://?region={REGION}&credentials={CREDENTIALS}'

Note that all of these URIs are using a custom awsdynamodb:// scheme. Under the hood all of this code is using the aaronland/go-aws-dynamodb and aaronland/go-aws-session packages in to support a wider range of declarative credentialing options than those offered by the default AWS libraries:

There are also Makefile targets for running everything locally against a containerizer DynamoDB instance:

Expected behavior

An iterator whose Next method triggers an io.EOF error.

Version

0.37.0

Additional context

Documentation (albeit incomplete) for the application's database design and specifics about the use of gocloud.dev/docstore are here:

The really long think-y think-y version is here:

@vangent
Copy link
Contributor

vangent commented Mar 21, 2024

This is hard to debug from here, can you use gohack and add some Printfs in the relevant code to see what's happening?

In here: https://github.com/google/go-cloud/blob/master/docstore/awsdynamodb/query.go#L472

Specifically it would be good to know what it.offset, it.limit, it.curr, it.last, it.count, it.limit, and len(it.items) is when this happens.

I think that this line: https://github.com/google/go-cloud/blob/master/docstore/awsdynamodb/query.go#L480 is supposed to handle this, but it's not.

It's possible that this is a regression in #3385.

@bartventer FYI

@bartventer
Copy link
Contributor

I've been digging into this issue and although I was unable to reproduce the panic, I added a preventive check for awsdynamodb here. This check returns io.EOF if there are no more items, which should prevent an index out of range panic when Next is called after it has already returned EOF once.

I also added a test case to verify this behavior. The test repeatedly calls Next until it gets io.EOF, then calls Next again to ensure it still returns io.EOF.

I've pushed these changes in a PR. Please have a look when you get a chance and let me know if there's anything that needs adjusting.

@thisisaaronland
Copy link
Author

Thanks for the fix. I can confirm that I am no longer seeing panics calling iter.Next – 

https://github.com/sfomuseum/go-activitypub/blob/iter-fix/go.mod#L5

Question: In the process of investigating the panics I often noticed that the docstore.Query iterators would return/report no results. For example if I call a simple shell script wrapper around the "counts" code like this:

> sh ./work/counts.sh 2024-03-18
{"time":"2024-03-21T16:16:56.081668-07:00","level":"DEBUG","msg":"Verbose logging enabled"}
{"time":"2024-03-21T16:16:57.021555-07:00","level":"DEBUG","msg":"Get counts for date","date":"2024-03-18","location":"America/Los_Angeles","start":1710720000,"end":1710806400}
{"date":"2024-03-18","location":"America/Los_Angeles","accounts":0,"blocks":0,"boosts":0,"deliveries":0,"followers":0,"following":0,"likes":0,"messages":0,"notes":0,"posts":0}

All the query iterators return nothing. But then if I call the same code immediate afterwards the queries return results:

> sh ./work/counts.sh 2024-03-18
{"time":"2024-03-21T16:17:03.270628-07:00","level":"DEBUG","msg":"Verbose logging enabled"}
{"time":"2024-03-21T16:17:04.100218-07:00","level":"DEBUG","msg":"Get counts for date","date":"2024-03-18","location":"America/Los_Angeles","start":1710720000,"end":1710806400}
{"date":"2024-03-18","location":"America/Los_Angeles","accounts":0,"blocks":0,"boosts":0,"deliveries":9045,"followers":1,"following":0,"likes":0,"messages":0,"notes":0,"posts":13}

The code in question is:

I am uncertain whether this is related to the iter.Next problem or something entirely different. If it's more helpful to create a separate issue I am happy to do that.

@vangent
Copy link
Contributor

vangent commented Mar 21, 2024

Are you using Offset at all?

@vangent
Copy link
Contributor

vangent commented Mar 21, 2024

and/or Limit?

@vangent vangent reopened this Mar 21, 2024
@vangent
Copy link
Contributor

vangent commented Mar 21, 2024

@bartventer I think there's a bug in Offset that may be causing both of these issues.

Example:

  • Let's say Offset is 10.
  • We do the initial it.qr.run in RunGetQuery, let's say it returns 5 items.
  • When Next is called, it will recurse 10 times to move past the offset.
  • On the next Next call, we'll do a new it.qr.run ("Make a new query request at the end of this page"), and then set curr = 0, but we may have just gotten items 6 through 8 (for example).

@thisisaaronland
Copy link
Author

I am not currently using either Offet or Limit.

@vangent
Copy link
Contributor

vangent commented Mar 22, 2024

Hrm that damages my theory, but I still think there's a bug in those.

@thisisaaronland it would be really helpful if you could provide some debugging info as requested above, as it seems like you can reproduce this easily and we can't. Maybe even comment out the new lines that fixed the panic so we can understand what's going on there too.

@thisisaaronland
Copy link
Author

Sure thing. I may not be able to do that until tomorrow but I will get to it as soon as I can. Thanks for all the help so far.

@vangent
Copy link
Contributor

vangent commented Mar 22, 2024

Per AWS, "A Query operation can return an empty result set and a LastEvaluatedKey if all the items read for the page of results are filtered out."

I think we're not handling that correctly, which causes the empty results sometimes, as well as the panic.

@bartventer can you take a look at #3408 ?

@thisisaaronland can you try it out and see if that looks better?

@thisisaaronland
Copy link
Author

thisisaaronland commented Mar 22, 2024

I couldn't figure out what the replace directive for the changes in #3408 should be so I just brute-force cloned the query.go file in my vendor folder. With only limited testing it doesn't seem to have helped and may, in fact, have made things worse. Specifically:

  • Running the code (mentioned above) for the first time the Query iterator continues to return no results.
  • Running the code (mentioned above) a second time simply hung and never returned any results.

I will carve out some time in the morning to add the debugging hooks you suggested (above) to try and get more concrete details to work with.

If it's at all helpful what I have observed is that the Query iterator seems to return no results the first time it is invoked but then subsequently it starts to return results, even though the queries themselves are happening in entirely new processes. So maybe there is something happening at the AWS/Dynamo layer where the iterator returns prematurely on the first invocation and that, by extension, Dynamo queries (in AWS) exist in some kind of "window of time" between invocations ?

@vangent
Copy link
Contributor

vangent commented Mar 22, 2024 via email

@bartventer
Copy link
Contributor

Per AWS, "A Query operation can return an empty result set and a LastEvaluatedKey if all the items read for the page of results are filtered out."

I think we're not handling that correctly, which causes the empty results sometimes, as well as the panic.

@bartventer can you take a look at #3408 ?

@thisisaaronland can you try it out and see if that looks better?

I've had a look at PR #3408 and I agree with the logic there. I've created a new PR (#3409) that builds on that work. @vangent, could you take a look when you get a chance?

@thisisaaronland
Copy link
Author

Just a quick note that I applied github.com/bartventer/go-cloud v0.0.0-20240322082227-9d139c92796d and I am not seeing time-outs but I am still seeing inconsistent behaviour:

https://github.com/sfomuseum/go-activitypub/blob/iter-fix/go.mod#L7

Specifically: The first time the query is run it will sometimes return 0 results but then if the query is run again inside a short window of time (but in an entirely new process) the query will return results. That's all I've been able to do so far but will try to carve out time to get more concrete debugging information soon.

@vangent
Copy link
Contributor

vangent commented Apr 1, 2024

Ping @thisisaaronland

@vangent
Copy link
Contributor

vangent commented Sep 9, 2024

Any updates here @thisisaaronland ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants