Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure clients see their transaction_ids #146

Merged
merged 46 commits into from
Jul 31, 2023

Conversation

DMRobertson
Copy link
Contributor

@DMRobertson DMRobertson commented Jun 8, 2023

Suppose

  • Alice has made an incremental (i.e. with pos) sliding sync.
  • The proxy has nothing to give to Alice right now and is waiting for a live update.
  • Alice /sends a message into a room using a transaction_id.
  • Bob is also in the room and using the proxy.
  • Bob's poller sees Alice's message without a transaction ID.
  • This message is handed off to Alice's /sync handler and sent to Alice without a transaction id.
  • Alice's poller later sees the message with the transaction id.
  • (Not sure if that txn_id ever arrives at Alice's client?)

The bold bullet is bad because it means that the client doesn't realise it has just received the event it sent, meaning that the "local echo" flickers.

We want to fix this by never notifiying a waiting sync handler about Things until its dedicated poller has seen that Thing. Do this by:

  1. Introduce a buffer to the API machinery which holds events which might have a transaction ID, but don't. Have entries leave the buffer (and proceed to conns) after at most a second. Stack up new message events behind it in the bugger.
  2. Change the poller to track which devices have seen an event for their user which lacks a transaction ID.
  3. If a poller sees a transaction ID, or if all pollers see no transaction ID, emit a payload telling the API machinery to stop waiting for a transaction ID.
  4. When the API machinery receives that payload, clear up to its corresponding event in the buffer.

@DMRobertson DMRobertson force-pushed the dmr/prevent-local-echo-flicker branch from 18b5483 to b67ea78 Compare June 12, 2023 17:09
@DMRobertson DMRobertson force-pushed the dmr/prevent-local-echo-flicker branch 2 times, most recently from 1210663 to 7a5335f Compare July 25, 2023 14:34
@DMRobertson DMRobertson force-pushed the dmr/prevent-local-echo-flicker branch from 4bd7531 to 142290f Compare July 25, 2023 18:32
@DMRobertson DMRobertson linked an issue Jul 25, 2023 that may be closed by this pull request
@DMRobertson
Copy link
Contributor Author

I assume everything is broken because I'm now delaying certain messages for up to a second.

Copy link
Member

@kegsay kegsay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sensible. Needs more testing and removal of visiblity checks.

sync3/handler/txn_id_waiter.go Outdated Show resolved Hide resolved
sync3/handler/txn_id_waiter.go Show resolved Hide resolved
sync3/handler/txn_id_waiter.go Show resolved Hide resolved
sync3/lists.go Outdated Show resolved Hide resolved
sync3/handler/connstate_test.go Outdated Show resolved Hide resolved
David Robertson added 3 commits July 27, 2023 15:05
This was the best way for me to make sense of the test log.
@DMRobertson DMRobertson marked this pull request as ready for review July 27, 2023 17:19
@DMRobertson DMRobertson changed the title WIP: ensure clients see their transaction_ids Ensure clients see their transaction_ids Jul 27, 2023
Copy link
Member

@kegsay kegsay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see some e2e tests to try to repro this bug (it'll flakey fail but that's better than not having any e2e tests here). This ensures that we are using the right API (e.g checking the right fields, etc.

Specifically:

  • The alice/bob use case described in a comment.
  • Multiple devices for alice, where other devices see the event without a txn id.

v3.go Show resolved Hide resolved
tests-integration/timeline_test.go Outdated Show resolved Hide resolved
tests-integration/timeline_test.go Show resolved Hide resolved
tests-integration/timeline_test.go Show resolved Hide resolved
@@ -233,17 +233,18 @@ func TestInviteRejection(t *testing.T) {
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We neeed e2e tests for this. We cannot force the race to happen, but we can greatly encourage it by spamming events. I would do something like:

  • Make alice and bob, join to same room.
  • goroutine to repeatedly sliding sync as alice and accumulate the events without making assertions.
  • goroutine to repeatedly sliding sync as bob and accumulate the events without making assertions.
  • for i in N:
    • Alice sends message with txnid i, remember event_id -> txnid mapping.
  • Terminate goroutines when final event id is seen on both alice and bob.
  • Assert that alice has the correct txn IDs for all N events.

Copy link
Contributor Author

@DMRobertson DMRobertson Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went for something simpler, see baa3596.

Running on that commit:

$ SYNCV3_SERVER=http://localhost:8888 SYNCV3_ADDR=http://localhost:8844 go test ./tests-e2e -v -count=10 -v -run TestTransactionIDsAppearWithMultiplePollers | egrep 'PASS|FAIL'
egrep: warning: egrep is obsolescent; using grep -E
--- PASS: TestTransactionIDsAppearWithMultiplePollers (2.53s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (2.67s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (2.67s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (2.49s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (2.71s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (2.63s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (2.80s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (2.70s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (2.66s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (2.65s)
PASS

Running on 595e1b3 (an older main):

$ SYNCV3_SERVER=http://localhost:8888 SYNCV3_ADDR=http://localhost:8844 go test ./tests-e2e -v -count=10 -v -run TestTransactionIDsAppearWithMultiplePollers | egrep 'PASS|FAIL'
egrep: warning: egrep is obsolescent; using grep -E
--- FAIL: TestTransactionIDsAppearWithMultiplePollers (1.65s)
--- FAIL: TestTransactionIDsAppearWithMultiplePollers (1.57s)
--- FAIL: TestTransactionIDsAppearWithMultiplePollers (1.80s)
--- FAIL: TestTransactionIDsAppearWithMultiplePollers (1.63s)
--- FAIL: TestTransactionIDsAppearWithMultiplePollers (1.69s)
--- FAIL: TestTransactionIDsAppearWithMultiplePollers (1.61s)
--- FAIL: TestTransactionIDsAppearWithMultiplePollers (1.69s)
--- FAIL: TestTransactionIDsAppearWithMultiplePollers (1.78s)
--- FAIL: TestTransactionIDsAppearWithMultiplePollers (1.56s)
--- FAIL: TestTransactionIDsAppearWithMultiplePollers (1.70s)
FAIL
FAIL	github.com/matrix-org/sliding-sync/tests-e2e	16.677s
FAIL

In another test of 100 runs, I saw this fail 95 times and pass 5 times.

The 1 second difference in timing makes me wonder if something has gone wrong with the all clear mechanism. I will investigate.

Copy link
Contributor Author

@DMRobertson DMRobertson Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we would not emit the V2TransactionID payloads if there were no new events.

28b11f8 contains the fix. With it:

$ SYNCV3_SERVER=http://localhost:8888 SYNCV3_ADDR=http://localhost:8844 go test ./tests-e2e -v -count=10 -v -run TestTransactionIDsAppearWithMultiplePollers | egrep 'PASS|FAIL'
egrep: warning: egrep is obsolescent; using grep -E
--- PASS: TestTransactionIDsAppearWithMultiplePollers (1.65s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (1.61s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (1.84s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (1.53s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (1.96s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (1.69s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (1.64s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (1.68s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (1.65s)
--- PASS: TestTransactionIDsAppearWithMultiplePollers (1.63s)
PASS

sync3/handler/txn_id_waiter.go Outdated Show resolved Hide resolved
sync3/handler/txn_id_waiter_test.go Show resolved Hide resolved
sync3/handler/txn_id_waiter_test.go Show resolved Hide resolved
@DMRobertson DMRobertson merged commit 2cc58cf into main Jul 31, 2023
6 checks passed
@DMRobertson DMRobertson deleted the dmr/prevent-local-echo-flicker branch July 31, 2023 15:16
DMRobertson pushed a commit that referenced this pull request Aug 1, 2023
More sense conceptually---pollermap doesn't track any cross-poller
state---and removes some indirection.

Followup to #146. No functional changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

txn IDs: can be unset when /sync poller lags
2 participants