dendrite: Dendrite 0.6.2 fails to sync/federate
Background information
Dendrite version or git SHA: v0.6.2 (last known good was 0.6.0)
Monolith or Polylith?: monolith
SQLite3 or Postgres?: postgresql
Running in Docker?: no
go version: 1.16.13 and 1.18beta1
Client used (if applicable): app.element.io, Hydrogen and Fluffychat
Description
Dendrite fails to receive new events for any room and fails to sync existing events to some clients as of version 0.6.2. In the clients, this shows up as either frozen rooms and disconnection messages (Element) or an initial sync that never finishes (Fluffychat and Hydrogen). Rolling back to 0.6.0 resolves the issue.
The following logs may be relevant:
level=debug msg="Transaction: Failed to query room version for room!" error="context canceled" req.id= req.method=PUT req.path=/_matrix/federation/v1/send/<event 1>
level=debug msg="Transaction: Failed to parse event JSON of event {\"auth_events\":[\"\",\"\",\"\"],\"content\":{\"algorithm\":\"m.megolm.v1.aes-sha2\",\"ciphertext\":\"\",\"device_id\":\"\",\"sender_key\":\"\",\"session_id\":\"\"},\"depth\":20,\"hashes\":{\"sha256\":\"h0hdF/eV9J+\"},\"origin\":\"matrix.org\",\"origin_server_ts\":,\"prev_events\":[\"$\"],\"prev_state\":[],\"room_id\":\"!:.\",\"sender\":\"@\",\"signatures\":{\"matrix.org\":{\"ed25519:a_RXGa\":\"h2jmBBg\"}},\"type\":\"m.room.encrypted\",\"unsigned\":{\"age_ts\":}}" error="gomatrixserverlib: unsupported room version ''" req.id=KzqABiJjMlbU req.method=PUT req.path=/_matrix/federation/v1/send/<event 1>
http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.(*responseWriterDelegator).WriteHeader (delegator.go:65)
NOTE: user data from the logs has been stripped.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 5
- Comments: 40 (14 by maintainers)
@grisu48 Yep, sometime this week.
I just updated to 0.6.3 and federation is still broken. When grepping for level=error in the docker container output, I still see a lot of “failed to query device keys for some users” entries, same as above. The instance was working just fine prior to updating to 0.6.0, now internal messaging works but federation is still broken.
@imyxh Glad to hear that’s helped. If you run into any more problems, please capture and chuck up some new profiles and we can look again. 😃
You’ve also got headroom of 20 unused database connections so you could increase the roomserver’s max_open_conns by another 10 or 15 to make better use of those resources. That should help with processing data for more rooms in parallel if you aren’t already limited by CPU or RAM.
I deleted the jetstream dir (https://github.com/matrix-org/dendrite/issues/2181) and now it appears to be working after waiting awhile. I also bumped my max connections up to try and avoid https://github.com/matrix-org/dendrite/issues/2173
I’m using the NATS built into Dendrite
I could finally solve this issue by deleting the old jetstream directory. Before that, I still had some ongoing issues with messages not going out and not being received. Now dendrite 0.6.5 has been rocking again since Saturday 🙂
@alistair23 maybe that’s also worth a shot for you? I just renamed the jetstream directory, and once everything worked, dumped it completely.
@imyxh Please follow the instructions a couple posts up and if you can supply profiles from the next time it happens, that’d be amazing.
Deleting the entire JetStream folder is not ideal and doing so is a very good way for downstream components to get in an out-of-sync state with the roomserver, so I can’t recommend that as a fix. A much much safer approach if absolutely necessary is to delete just the jetstream/$G/streams/DendriteInputRoomEvent sub-folder only rather than the whole JetStream folder.
OK, so to understand what’s really going on, I could use a goroutine trace and a profile from Dendrites that are experiencing these issues.
To do this, you need to start Dendrite with the PPROFLISTEN=localhost:65432 environment variable set and then leave it running with that so that the profiler is accessible when the problem occurs. You should see a level=warning msg="Starting pprof on ..." line at startup if the profiler starts successfully.
Then the next time you run into problems, capture the following profiles:
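As a rough sketch, assuming the profiler serves Go's standard net/http/pprof handlers under /debug/pprof/ on the PPROFLISTEN address, the three downloads could look like this. The output file names and the debug levels on the goroutine dumps are assumptions for illustration, not taken from the thread:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// target pairs a local output file (hypothetical name) with the
// net/http/pprof path it is downloaded from.
type target struct {
	file string
	path string
}

// profileTargets lists two goroutine dumps and one 30-second CPU profile.
// The paths are the standard net/http/pprof ones; whether Dendrite exposes
// exactly these is an assumption.
func profileTargets() []target {
	return []target{
		{"goroutine-summary.txt", "/debug/pprof/goroutine?debug=1"},
		{"goroutine-stacks.txt", "/debug/pprof/goroutine?debug=2"},
		{"cpu-profile.pb.gz", "/debug/pprof/profile?seconds=30"},
	}
}

// fetch downloads url and writes the response body to file.
func fetch(client *http.Client, url, file string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s for %s", resp.Status, url)
	}
	out, err := os.Create(file)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// Must match the address given in PPROFLISTEN when Dendrite was started.
	addr := "localhost:65432"
	// The CPU profile blocks for 30 seconds, so allow a generous timeout.
	client := &http.Client{Timeout: 2 * time.Minute}
	for _, t := range profileTargets() {
		if err := fetch(client, "http://"+addr+t.path, t.file); err != nil {
			fmt.Fprintln(os.Stderr, "failed to fetch", t.file+":", err)
			os.Exit(1)
		}
		fmt.Println("saved", t.file)
	}
}
```

Run it on the same host as Dendrite while the problem is occurring, since the profiler in this setup only listens on localhost.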
… and then upload all three files along with the commit ID that you are running. They don’t contain configuration or anything sensitive (apart from possibly the folder names that Dendrite was built in) so should be safe to share. The two goroutine profiles should download pretty much instantly, the profile one will take 30 seconds to complete.
FWIW “failed to query device keys for some users” is specific to E2EE and ultimately a separate issue to failing to federate in public rooms.
Out of curiosity, are you all running the internal NATS deployment built into Dendrite or standalone NATS Server? If any of you are running a standalone NATS Server, which options are you running with?
@grisu48 A lot of those issues will be genuine connection errors or bad keys so I wouldn’t worry about those log lines unless you are having problems with E2EE specifically; in that case best to open a separate issue.