Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sender key distribution unreliable when running in embedded mode #2415

Closed
hughns opened this issue Jun 7, 2024 · 12 comments · Fixed by matrix-org/matrix-js-sdk#4343
Closed
Assignees
Labels
T-Defect Something isn't working: bugs, crashes, hangs, vulnerabilities, or other reported problems

Comments

@hughns
Copy link
Member

hughns commented Jun 7, 2024

When running as a widget (embedded mode) within Element X and Element Web, there have been reports of participant sender keys not finding their way to other recipients. This has resulted in those recipients not able to decrypt the media and so shows a grey box instead.

Apparently it is intermittent. This could be in the widget API interface or somewhere else. The keys are distributed via encrypted room events.

Whilst there is discussion about moving key distribution to use to-device messaging instead of encrypted room events for other reasons, it is not known if the root cause of the problem would also affect an alternative implementation

Possible next steps:

  • Add tactical logging around key distribution (there is very little at the moment)
  • Wait for some rageshakes reports
  • Investigate those rageshakes and iterate
@toger5
Copy link
Contributor

toger5 commented Jun 7, 2024

Thanks for the detailed report!

I think it is clear from your comment that you are talking about the first case, but can you confirm:

  • This happens when using calls initiated in EX or EW that run inside the matrix room.
  • This does not happen when clicking on a call.element.io (or .dev) link and using EX as the link handler?

@fkwp
Copy link
Contributor

fkwp commented Jun 7, 2024

maybe matrix-org/matrix-js-sdk#4198 is related

@hughns
Copy link
Member Author

hughns commented Jun 28, 2024

I think it is clear from your comment that you are talking about the first case, but can you confirm:

  • This happens when using calls initiated in EX or EW that run inside the matrix room.

Yes, this is the case.

@hughns
Copy link
Member Author

hughns commented Jun 28, 2024

I have been able to find one reproducible issue:

  1. Start a room call using "new group call experience" on device A and have Joining a room call in embedded mode sometimes fails within Element Web #2458 occur
  2. When the call rings on another device (device B) then hit Join on device B
  3. Complete joining on device B
  4. Then try to join on device A again by hitting the Join banner
  5. Complete joining on device A
  6. Then you find that device A is missing the encryption key for device B

In this video device A is on the left and device B is on the right:

Missing.key.after.failure.to.join.mov

(n.b. device A is running a customised version which shows decryption errors across the top of the media)

@toger5
Copy link
Contributor

toger5 commented Aug 5, 2024

This can be updated to now be reliable in EW but not in EC SPA right?

@hughns
Copy link
Member Author

hughns commented Aug 6, 2024

I locally reverted the fix for #2458 and re-tested to see what the cause of the lost encryption key was.

Adding commentary to the reproduction steps I see:

  1. Start a room call using "new group call experience" on device A and have Joining a room call in embedded mode sometimes fails within Element Web #2458 occur

During this device A sends a org.matrix.msc3401.call.member state event. This state event effectively gets orphaned when the EC gets terminated by EW.

  1. When the call rings on another device (device B) then hit Join on device B
  2. Complete joining on device B

Device B sends a io.element.call.encryption_keys to the room containing it's encryption key.

  1. Then try to join on device A again by hitting the Join banner
  2. Complete joining on device A

Device A starts EC again which sends a new org.matrix.msc3401.call.member state event which is not orphaned.

However, when device B receives the state event update from A it sees that the memberships/participants haven't changed and so does not rotate its encryption key or resend the encryption key.

  1. Then you find that device A is missing the encryption key for device B

Because device B doesn't send another io.element.call.encryption_keys message the EC on device A does not have the encryption key for B.

Some initial thoughts on how this could be addressed:

  • have embedded EC go through the timeline on startup to load encryption keys for participants.
  • do the above but on a lazy basis for missing keys
  • have each device (e.g. device B) resend the existing encryption keys each time there is an org.matrix.msc3401.call.member change even if the participants haven't changed.
  • allow a device (e.g. A) to request another device (e.g. B) to resend their keys.

@hughns
Copy link
Member Author

hughns commented Aug 7, 2024

During discussion today with @toger5 it became clear that a similar case could also happen with EC running in SPA mode with per participant keys.

Specifically, in SPA mode EC doesn't attempt to fetch encryption keys from room history and assumes that it will receive them all after it joins.

A hypothetical (unvalidated) scenario would be: device A is using SPA on a call; the user reloads the SPA browser tab; device A reconnects but doesn't have any encryption keys; the other participants (e.g. device B) don't think that the participants have changed (so no rotation) nor do they resend their keys.

What this suggests is that we should address this behaviour even though #2458 is in place.

In choosing an approach we should be mindful that we expect the distribution of encryption keys to be migrated to to-device messages.

It feels unlikely that it makes sense for a Matrix client to allow for previous to-device messages to be searched and processed.

As such, this suggests that the third (have the devices pro-actively resend their keys) or fourth (allow a device to request the resend) approaches are most suitable.

@toger5
Copy link
Contributor

toger5 commented Aug 7, 2024

During the meeting I forgot to mention that we basically have a "request key" mechanism in place.
The user with missing keys could send a leave and a following join call.member state event. This could be the default behavior if a device tries to join the call and detects, that the same device is already part of the call.
A reload as described by Hugh would then look like an actual disconnect/reconnect to other call members.

(Really good we tackle this now since reliable call memberships would have made this even more hidden. If users take more then 5s to reload and go through the lobby this issue would not occur. Whcih is also a "almost" fix but I think the logic: on call initialization, check if the server state thinks you are already connected and if so disconnect and reconnect since you abviously just started up the connection with this device -> should be in not connected state. Sounds like sth we should do anyways)

@hughns
Copy link
Member Author

hughns commented Aug 9, 2024

During the meeting I forgot to mention that we basically have a "request key" mechanism in place. The user with missing keys could send a leave and a following join call.member state event. This could be the default behavior if a device tries to join the call and detects, that the same device is already part of the call. A reload as described by Hugh would then look like an actual disconnect/reconnect to other call members.

This feels like an elegant approach but I think it will have unintended consequences such as in embedded mode Element Web will think that the call has ended and terminate the iframe containing Element Call.

Instead I have made matrix-org/matrix-js-sdk#4343 which will cause a resend of the current encryption keys if the membershipID or created_ts on a call membership state event is changed.

@hughns hughns self-assigned this Aug 9, 2024
@hughns hughns added the T-Defect Something isn't working: bugs, crashes, hangs, vulnerabilities, or other reported problems label Aug 9, 2024
@hughns
Copy link
Member Author

hughns commented Aug 12, 2024

This will need a js-sdk bump once matrix-org/matrix-js-sdk#4343 is merged.

@hughns
Copy link
Member Author

hughns commented Aug 14, 2024

Needs js-sdk bump.

@hughns
Copy link
Member Author

hughns commented Aug 16, 2024

Bump was done in #2567

@hughns hughns closed this as completed Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-Defect Something isn't working: bugs, crashes, hangs, vulnerabilities, or other reported problems
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants