Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

E2BE-enabled AppServices crash a while (IO error: could not acquire lock) #293

Open
jaller94 opened this issue Jan 10, 2023 · 11 comments
Open
Labels
bug Something isn't working type:appservices Appservice APIs

Comments

@jaller94
Copy link
Contributor

jaller94 commented Jan 10, 2023

Describe the bug
An appservice that enables E2BE-capabilities, may crash after a few hours. It seems to be based on a timer.

matrix-org/matrix-hookshot#609

To Reproduce

  1. Write an encrypted appservice, e.g. Hookshot
  2. Run it for a while (1 to 4 hours)

Expected behavior
Don't crash.

Log snippet
This Hookshot crash happened exactly one hour after starting the bot, making me think that it is a timeout or other internal event, instead of an incoming network event.

WARN 13:07:44:863 [Config] The `widgets.openIdOverrides` config value SHOULD NOT be used in a production environment.
INFO 13:07:44:873 [Appservice] Initialising Redis storage (on localhost:6379)
INFO 13:07:44:901 [Bridge] Starting up
INFO 13:07:44:901 [UserTokenStore] Loading token key file passkey.pem
INFO 13:07:44:920 [RedisASProvider] Successfully connected
INFO 13:07:44:942 [Bridge] Connecting to homeserver and fetching joined rooms..
INFO 13:07:44:966 [Config] Prefilling room membership for permissions for 0 rooms
INFO 13:07:45:174 [FeedConnection] Connection 0e343504299411063c7324cc5166baa8 created for !MBKReTGbuldfQteUbW:localhost, {"url":"https://chrpaul.de/feed.xml","label":"Christian Paul"}
INFO 13:07:45:176 [FeedConnection] Connection e125d0ada12e1f980c8dfdbaa4020344 created for !MBKReTGbuldfQteUbW:localhost, {"url":"https://www.tagesschau.de/xml/rss2/","label":"Tagesschau"}
INFO 13:07:45:232 [Bridge] Bridge is now ready. Found 2 connections
INFO 13:07:45:235 [ListenerService] Listening on http://0.0.0.0:9000 for webhooks
INFO 13:07:45:235 [ListenerService] Listening on http://127.0.0.1:9001 for metrics, provisioning
INFO 13:07:45:236 [ListenerService] Listening on http://0.0.0.0:9002 for widgets
INFO 13:09:15:146 [Appservice] Processing transaction 7935
INFO 13:09:15:147 [Appservice] Processing event of type m.room.message
INFO 13:09:15:160 [Appservice] 127.0.0.1 - - [10/Jan/2023:12:09:15 +0000] "PUT /transactions/7935?access_token=%3Credacted%3E HTTP/1.1" 200 2 "-" "Synapse/1.74.0"
…
INFO 13:35:15:521 [Appservice] Processing transaction 8038
INFO 13:35:15:521 [Appservice] Processing ephemeral event of type m.presence
INFO 13:35:15:522 [Appservice] 127.0.0.1 - - [10/Jan/2023:12:35:15 +0000] "PUT /transactions/8038?access_token=%3Credacted%3E HTTP/1.1" 200 2 "-" "Synapse/1.74.0"

ERROR 14:07:45:815 [Appservice] Failed to set up crypto on intent @hookshot:localhost [Error: IO error: could not acquire lock on "/home/jaller94/Git/matrix-hookshot/data/encryption/2d03b2ce09b8e551150e45d1418338aa0f43c531f79e8a34a1758bcbbbc850d0/matrix-sdk-crypto/db": Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }] {
  code: 'GenericFailure'
}
Error: IO error: could not acquire lock on "/home/jaller94/Git/matrix-hookshot/data/encryption/2d03b2ce09b8e551150e45d1418338aa0f43c531f79e8a34a1758bcbbbc850d0/matrix-sdk-crypto/db": Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
error Command failed with exit code 1.

Additional context
I've seen the same issue in a project different to Hookshot. I don't have access to their code, but as Element employees we can request the code of this customer appservice.

@jaller94 jaller94 added the bug Something isn't working label Jan 10, 2023
@jaller94 jaller94 changed the title E2EE-eanbled AppServices crash a while (IO error: could not acquire lock) E2EE-enabled AppServices crash a while (IO error: could not acquire lock) Jan 10, 2023
@jaller94 jaller94 changed the title E2EE-enabled AppServices crash a while (IO error: could not acquire lock) E2BE-enabled AppServices crash a while (IO error: could not acquire lock) Jan 10, 2023
@turt2live turt2live added the type:appservices Appservice APIs label Jan 10, 2023
@turt2live
Copy link
Owner

my hunch is this would be the Intent object expiring, but not cleaning up crypto properly, so when it gets recreated it has a conflicting lock deep within the rust-sdk.

@Half-Shot
Copy link
Contributor

Having looked at this, I'm unsure. The only expiry I can see for intents is the LRU cache, which defaults to 10000 entries. I'd certainly never expect a inactive hookshot to get close to that number. LRU doesn't expire objects until the maximum object threshold is reached, so my gut feeling is it's not expiry that's causing it.

That said: @jaller94 mentioned that the timeout was exactly an hour, and the intents have a TTL of a hour.

The sled database is the thing that's crying out here (spacejam/sled#1234). Sounds like Rust would impliclty cause the DB to be dropped when it falls out of scope, but I don't really know how that works in a JS world. I presume the act of the Intent dropping would cause the OlmMachine to drop, which would cause the Sled DB to drop but maybe there is something missing here.

@Half-Shot
Copy link
Contributor

Right, so. Lots of digging led me to realise that the LRU store will evict things when their ttl is up, regardless of how many items are in the store. This leads to Intents being deleted and immediately recreated, which leads to the Sled DB conflicting with itself as it hasn't been GC'd.

Ultimately, Sled is a bit poor for this use case and we should be using something like Redis or Postgres (both of which appear to be in progress for the rust-sdk). In the meantime, my proposals would be to either disable the LRU, or check the lock status on the Sled DB prior to loading crypto.

@Half-Shot
Copy link
Contributor

Checking the lock status is horrible in Node. I'm going to push for matrix-org/matrix-rust-sdk#1256 support to ultimately fix our DB woes, but in the meantime look at disabling the LRU for encrypted bridges.

@turt2live
Copy link
Owner

Disabling the LRU has memory exhaustion concerns, fwiw. It currently patches over a memory leak that'd be ideal to fix, but so does just throwing away objects.

The rust-sdk's destructor should be able to handle the lock status internally?

@turt2live
Copy link
Owner

from oob: hookshot's usecase is covered by disabling the LRU, with consequences being known due to urgency.

It's not clear why the rust-sdk isn't destructing itself properly, but the hope is we can just abandon Sled instead of thinking about it.

@AndrewFerr
Copy link
Contributor

This is fixed by using the SQLite-based crypto store that was recently added to the rust-sdk.

@Half-Shot
Copy link
Contributor

I think we can close this now @turt2live

@turt2live
Copy link
Owner

I'm still seeing some crashes come up, though agreed that the majority of cases appear to be fixed.

@AndrewFerr
Copy link
Contributor

@turt2live are the crashes you saw from t2bot bridges? Do you have any stack traces / error logs available?

@turt2live
Copy link
Owner

They were t2bot.io bridges and local development. I don't have stacktraces on hand, sorry (there's other problems with the bridges which are taking my attention).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working type:appservices Appservice APIs
Projects
None yet
Development

No branches or pull requests

4 participants