Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joining a room call in embedded mode sometimes fails within Element Web #2458

Closed
hughns opened this issue Jun 28, 2024 · 6 comments
Closed
Labels
T-Defect Something isn't working: bugs, crashes, hangs, vulnerabilities, or other reported problems

Comments

@hughns
Copy link
Member

hughns commented Jun 28, 2024

When running Element Call embedded within Element Web using the "New group call experience" labs feature I sometimes see a behaviour where I attempt to start a room call and after pressing "Join call" I get returned to the timeline rather than actually entering the call.

In the console you see The widget died; treating this as a user hangup logged

@toger5 had spotted this previously and sent messages in #matrix-dev about it. The message describing when this was observed matches my own observations:

The reason I am looking into this is that Calls sometimes just stop when clicking the join button in the lobby.
I noticed that this makes sense since the condition for them to stop is that the state changes to no memberships memberships=[].

Here is how it looks from a UX point of view:

Immediate.hangup.after.re-starting.call.mov

When looking at it I also came to the conclusion that the sequence of room state events being (re-)emitted and (re-)processed by https://github.com/matrix-org/matrix-js-sdk/blob/d90292bff5ed026fba72b1ffa8b5284b1fb76577/src/matrixrtc/MatrixRTCSessionManager.ts#L133-L148 that causes an issue.

How I originally came on to this issue was that I was looking at #2415 and found that the key distribution failed after encountering this behaviour. I will add some context on that issue shortly too.

Possibly related issues:

Thoughts:

  • is it expected and "correct" that js-sdk would be emitting these state events twice?
  • if not, then what is the root cause of the reemissions?
  • if it is expected that room state events might be emitted twice and out of order then the current approach in MatrixRTCSessionManager is not suitable
@hughns hughns added the T-Defect Something isn't working: bugs, crashes, hangs, vulnerabilities, or other reported problems label Jun 28, 2024
@erikjohnston
Copy link
Member

erikjohnston commented Jul 2, 2024

The HAR of the sync requests
https://call-unstable.ems.host/_matrix/client/v3/sync?filter=0&timeout=30000&set_presence=online&since=s1002136_1_12_1333_4049_1_426_5717_0_1
{
  "next_batch": "s1002138_1_12_1333_4050_1_426_5717_0_1",
        "timeline": {
          "events": [
            {
              "type": "org.matrix.msc3401.call.member",
              "sender": "@hughns2:call-unstable.ems.host",
              "state_key": "@hughns2:call-unstable.ems.host",
              "unsigned": {
                "replaces_state": "$91MdSs1OcBO5xguVL2zeNhBQlo3hAt_DyFaEZkW8SfQ",
              },
              "event_id": "$q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM"
            }

---------------

https://call-unstable.ems.host/_matrix/client/v3/sync?filter=0&timeout=30000&set_presence=online&since=s1002138_1_12_1333_4050_1_426_5717_0_1
{
  "next_batch": "s1002140_1_12_1334_4051_1_426_5717_0_1",
        "state": {
          "events": [
            {
              "type": "org.matrix.msc3401.call.member",
              "sender": "@hughns2:call-unstable.ems.host",
              "state_key": "@hughns2:call-unstable.ems.host",
              "unsigned": {
                "replaces_state": "$9mFjm_Ox58jeAvdSL8OWlSShagn65niqyLLAHX23QGs",
              },
              "event_id": "$91MdSs1OcBO5xguVL2zeNhBQlo3hAt_DyFaEZkW8SfQ"
            }
          ]
        },
        "ephemeral": {
          "events": [
            {
              "type": "m.receipt",
              "content": {
                "$q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM": {
                  "m.read": {
                    "@hughns2:call-unstable.ems.host": {
                      "ts": 1719851861973,
                      "thread_id": "main"
                    }
                  }


----------------

https://call-unstable.ems.host/_matrix/client/v3/sync?filter=0&timeout=30000&set_presence=online&since=s1002143_1_12_1335_4052_1_426_5717_0_1
{
  "next_batch": "s1002146_1_12_1335_4052_1_426_5717_0_1",
        "timeline": {
          "events": [
            {
              "type": "org.matrix.msc3401.call.member",
              "sender": "@hughns2:call-unstable.ems.host",
              "state_key": "@hughns2:call-unstable.ems.host",
              "unsigned": {
                "replaces_state": "$q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM",
                }
              "event_id": "$aemXSHbzKzePfTDzuX00mRFZJmJIJFdQ3I55EPllOUY"
            }

-----------------------------------

https://call-unstable.ems.host/_matrix/client/v3/sync?filter=0&timeout=30000&set_presence=online&since=s1002146_1_12_1335_4052_1_426_5717_0_1
{
  "next_batch": "s1002148_1_12_1335_4052_1_426_5717_0_1",
        "state": {
          "events": [
            {
              "type": "org.matrix.msc3401.call.member",
              "sender": "@hughns2:call-unstable.ems.host",
              "state_key": "@hughns2:call-unstable.ems.host",
              "origin_server_ts": 1719851861110,
              "unsigned": {
                "replaces_state": "$91MdSs1OcBO5xguVL2zeNhBQlo3hAt_DyFaEZkW8SfQ",
              },
              "event_id": "$q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM"
            }


----------------

https://call-unstable.ems.host/_matrix/client/v3/sync?filter=0&timeout=30000&set_presence=online&since=s1002148_1_12_1335_4052_1_426_5717_0_1
{
  "next_batch": "s1002149_1_12_1335_4052_1_426_5717_0_1",
        "timeline": {
          "events": [
            {
              "type": "org.matrix.msc3401.call.member",
              "sender": "@hughns2:call-unstable.ems.host",
              "state_key": "@hughns2:call-unstable.ems.host",
              "unsigned": {
                "replaces_state": "$aemXSHbzKzePfTDzuX00mRFZJmJIJFdQ3I55EPllOUY",
              },
              "event_id": "$8LDSegFL8fkNUaNI_Jc2pgXla17t2FUmsE6JusnYCyU"
            }
          ],

--------------------

This can be summarised as the following changes:

timeline: $91M -> $q5m
state:    $9mF -> $91M
timeline: $q5m -> $aem
state:    $91M -> $q5m
timeline: $aem -> $8LD
Looking at the DB we get
select stream_ordering, event_id, type, state_key, (select array_agg(prev_event_id) from event_edges where events.event_id = event_edges.event_id), (select state_group from event_to_state_groups as sg where events.event_id = sg.event_id)  from events where room_id = '!totKtECNNKyhzLOiNl:call-unstable.ems.host' and 1002136 < stream_ordering and stream_ordering <= 1002149 order by stream_ordering;

 stream_ordering |                   event_id                   |              type              |            state_key            |                                          array_agg                                          | state_group 
-----------------+----------------------------------------------+--------------------------------+---------------------------------+---------------------------------------------------------------------------------------------+-------------
         1002136 | $YPhOcZiyzEJyMHsrPFtJKxCljsZzQZOFjOE2k7_9am0 | org.matrix.msc3401.call.member | @hughns:call-unstable.ems.host  | {$89_MkkYqF-ViNH0qGkqmQyb5LPnslb4VqQarURn0iUA}                                              |       37615
         1002138 | $q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM | org.matrix.msc3401.call.member | @hughns2:call-unstable.ems.host | {$YPhOcZiyzEJyMHsrPFtJKxCljsZzQZOFjOE2k7_9am0}                                              |       37616
         1002139 | $tIhui6kBfzQDGV5FaZZJn_yKsPqeUwM22JeUIikdFRw | m.room.encrypted               |                                 | {$YPhOcZiyzEJyMHsrPFtJKxCljsZzQZOFjOE2k7_9am0}                                              |       37615
         1002140 | $bTqD6YwK-cwSv4-F6G7_BlWgItaUUssQ5bRFAET-ReA | m.room.encrypted               |                                 | {$q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM,$tIhui6kBfzQDGV5FaZZJn_yKsPqeUwM22JeUIikdFRw} |       37616
         1002146 | $aemXSHbzKzePfTDzuX00mRFZJmJIJFdQ3I55EPllOUY | org.matrix.msc3401.call.member | @hughns2:call-unstable.ems.host | {$bTqD6YwK-cwSv4-F6G7_BlWgItaUUssQ5bRFAET-ReA}                                              |       37617
         1002147 | $ugzpWM2F6VRnEcpnp0dYSzfXnKKRyTCHH-TQc9dcwcM | m.room.encrypted               |                                 | {$bTqD6YwK-cwSv4-F6G7_BlWgItaUUssQ5bRFAET-ReA}                                              |       37616
         1002148 | $xH5t5qKv0tQO7_wUdxoatp9pz-YsV_wnNSLxntLYuqg | m.room.encrypted               |                                 | {$aemXSHbzKzePfTDzuX00mRFZJmJIJFdQ3I55EPllOUY,$ugzpWM2F6VRnEcpnp0dYSzfXnKKRyTCHH-TQc9dcwcM} |       37617
         1002149 | $8LDSegFL8fkNUaNI_Jc2pgXla17t2FUmsE6JusnYCyU | org.matrix.msc3401.call.member | @hughns2:call-unstable.ems.host | {$xH5t5qKv0tQO7_wUdxoatp9pz-YsV_wnNSLxntLYuqg}                                              |       37618
(8 rows)

Which is roughly:

weird_sync dot

Where bottom is most recent and background colour is the state groups.

If you look at the DB output, which is ordered by stream_ordering you can see that the state keeps bouncing back and forth. This is likely confusing the algorithm for computing the state deltas to send to clients, due to poor handling of these cases in Synapse. The fix for this is probably to move to tracking state changes via current state, but that is not a small change.

@hughns
Copy link
Member Author

hughns commented Jul 10, 2024

My latest understanding of this issue is as follows, working through the layers:

Synapse/Homeserver
The is a bug in Synapse (tracked by element-hq/synapse#17430) which is causing a sequence of inconsistent/invalid /sync responses to be returned. This is described in the HAR extract referenced by #2458 (comment).

There is some suggestion (TODO: where?) that the might be circumstances were a sequence like this is valid within the spec. If this is the case then it possibly suggests that the client should be better at handling this.

Element Web - js-sdk
The js-sdk is processing the state events received in the state and timeline portions of the /sync response "as seen". i.e. it assumes that they are consistent.

This explains the sequence of memberships emissions described in #2458 (comment).

As a workaround, @toger5 has implemented matrix-org/matrix-js-sdk#4242 which implements a check to ignore some repeated room state events. The repeated case that is identified is where there is an immediate repeat (e.g. A -> B -> A) rather than a general case at arbitrary depth (e.g. A -> B -> C -> A).

This doesn't address the general case of inconsistent data coming back from Synapse or indeed if there are spec compliant cases which need handling.

Element Web - react-sdk implementation of embedded calls

EW appears to process the end of embedded calls at two levels.

Firstly, when the widget explicitly indicates a hangup (listener registered at https://github.com/matrix-org/matrix-react-sdk/blob/19f9f9856451a8e4cce6d313d19ca8aed4b5d6b4/src/models/Call.ts#L894) which marks the call as disconnected.

Secondly, via the MatrixRTCSessionManager with the SessionEnded event (listener registered at https://github.com/matrix-org/matrix-react-sdk/blob/19f9f9856451a8e4cce6d313d19ca8aed4b5d6b4/src/models/Call.ts#L805) which is based on room state.

In the case of the room state coming out of sequence, the second method is being invoked on stale events.

If the end call handling was based purely on the explicit hangup (or use interaction) then might we not need to process the state events in this way?

Element Call
The doesn't appear to be anything technically wrong at this layer.

@hughns
Copy link
Member Author

hughns commented Jul 10, 2024

One more thing, there was a suggestion that the behaviour in Synapse was triggered by EW/EC sending the events simultaneously to Synapse causing a race.

I tried modifying EW/js-sdk et al to ensure that the events were sent sequentially but this didn't appear to help. I might not have done it correctly though.

@hughns
Copy link
Member Author

hughns commented Jul 10, 2024

One more thing, there was a suggestion that the behaviour in Synapse was triggered by EW/EC sending the events simultaneously to Synapse causing a race.

I tried modifying EW/js-sdk et al to ensure that the events were sent sequentially but this didn't appear to help. I might not have done it correctly though.

For some more context on this, I think the relevant points that need to be serialised are:

I think these are being triggered across different execution paths. In some places Promises are also being discarded rather than awaited.

@toger5
Copy link
Contributor

toger5 commented Jul 10, 2024

If the end call handling was based purely on the explicit hangup (or use interaction) then might we not need to process the state events in this way?

A Call is the object that also represents not connected calls. Whenever the join button is rendered a call object is allocated and stored in the CallStore so the second condition using the MatrixRTCSessionManager with the SessionEnded event is needed to destroy the call -> remove the join button.
A possible solution would be to ignore SessionEnded events if the user is connected or in the lobby of a call (is currently looking at the widget).
But from a maintainability point of view I am not a huge fan of the custom hangup event.

  • Tracking calls by those events is unreliable since an additional call connect state needs to be tracked on the client (we need to keep this in sync with the room state that is available to the client anyways)
  • It is redundant to the information we get from state changes.
  • Things like jitsi can easily be mapped to matrixRTC (they use a custom infrastructure/focus type and are could insert them as spec conform MatrixRTC sessions. This would greatly simplify handling jitsi and EC on clients without all the special Call logic we currently have. (jitsi needs the hangup event that is why it is there in the first place)

TLDR:
I would much rather try to change all the related bits to use the state change as the one source for call connects/disconnects then using the hangup widget action and some conditional logic (even though that would reacquire us to refactor the jitsi calls similar to how we do it with BigBlueButton: matrix-org/matrix-react-sdk#12452)

@hughns
Copy link
Member Author

hughns commented Aug 6, 2024

The original behaviour in EW/EC is addressed by matrix-org/matrix-js-sdk#4242 in that we no longer rely on the homeserver sending the events in the correct/consistent order.

The bug for the underlying issue in Synapse that causes it to return the inconsistent /sync response is tracked by element-hq/synapse#17430.

@hughns hughns closed this as completed Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-Defect Something isn't working: bugs, crashes, hangs, vulnerabilities, or other reported problems
Projects
None yet
Development

No branches or pull requests

3 participants