December 04, 2024 [Matrix, Tech]
After lots of conversations with Element colleagues about message order in
Matrix, and lots of surprises for me, I wanted to write down what I had learned
before I forgot, and also write down some principles I think we should try to
follow. A lot of this is just my half-formed opinions, and while I am very
grateful to everyone who helped educate me about all of this, it in no way
represents any kind of policy or consensus from Element or Matrix or anyone else
:-)
Finding messages
If you’re writing a Matrix client (e.g. a chat app), you need to ask the server
for messages that have been sent in a room. To do this, you need to download the
“events”, which are just messages plus other things you might need to know
about. Messages are one type of “timeline” event, meaning that they appear in
the main display area of a room, showing you what people said.
Via /sync
The first and most common way to do this is to ask for the latest stuff that’s
happened, by hitting the /sync API:
    GET /sync

    {
      "rooms": {"join": {"!roomid:example.com": {"timeline": {
        "events": [
          { "content": {"body": "How many roads?", ... }, ... },
          { "content": {"body": "Forty-two.", ... }, ... }
        ],
        "prev_batch": "s2222",
        ...
      }}}},
      ...
    }
(Note: we’re not talking about “state” events here, which deal with e.g. who is
a member of this room. Things get even more interesting when you start thinking
about them, because the order in which they happen is critical to deciding who
is banned, and similar issues.)
Via /messages or similar
The second way to get events is via one of the other APIs such as /messages,
/context, or /relations:
    GET /messages

    {
      "chunk": [
        { "content": {"body": "How many roads?", ... }, ... },
        { "content": {"body": "Forty-two.", ... }, ... }
      ],
      "start": "s2222",
      "end": "t1111",
      ...
    }
This seems unremarkable at first glance: the /sync response even contains a
prev_batch token which we can use as the from query parameter to /messages
so we can page back through messages to find older ones.
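As a rough sketch of that paging step, here is how a client might build the request. I am assuming the standard client-server path /_matrix/client/v3/rooms/{roomId}/messages with its from, dir and limit query parameters. The homeserver URL and the function name are made up for illustration.

```python
from urllib.parse import quote, urlencode


def messages_url(homeserver: str, room_id: str, prev_batch: str, limit: int = 20) -> str:
    """Build a /messages URL that pages backwards from a /sync prev_batch token."""
    # dir=b asks for events before the token, i.e. older ones.
    query = urlencode({"from": prev_batch, "dir": "b", "limit": limit})
    # Room IDs contain characters such as "!" and ":" that must be escaped in a path.
    path = f"/_matrix/client/v3/rooms/{quote(room_id, safe='')}/messages"
    return f"{homeserver}{path}?{query}"


# Hypothetical homeserver; the room ID and token are the ones from the examples above.
url = messages_url("https://matrix.example.com", "!roomid:example.com", "s2222")
print(url)
```

Each response carries an end token, which becomes the from of the next request as the user keeps scrolling up.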
So what is the problem?
These APIs return messages in a different order.
- The /sync API returns events in an order “according to the arrival time of the
  event on the homeserver”.
- The spec for /messages says it returns events “in chronological order. (The
  exact definition of chronological is dependent on the server
  implementation.)”.
- For /context it also mentions chronological order.
- For /relations it contradicts itself, stating both “events will be returned
  in chronological order” (when talking about the dir parameter) and events
  will be “ordered topologically” (in the chunk section). My guess is that the
  dir parameter docs were erroneously copied from elsewhere, and topological
  ordering was intended.
Topological ordering: events in a Matrix room are stored in a mathematical
structure known as a directed acyclic graph. Topological ordering means using
this graph structure (which is independent of the timing of when messages
arrived on the server) to decide an order. This order is easy to calculate
consistently, but it can be illogical from a common-sense point of view.
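To make the difference concrete, here is a toy sketch. It is not how Synapse or any other homeserver actually computes its order: the event names, the graph and the "depth, then arrival" tie-break are all assumptions of mine, chosen only to show that a graph-based order can disagree with arrival order.

```python
from functools import lru_cache

# Event IDs in the order this hypothetical homeserver received them.
ARRIVAL_ORDER = ["A", "B", "C", "X", "Y", "D"]

# Each event lists the events it points back to (its parents in the graph).
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "X": ["A"],       # sent on the other side of a netsplit, arrived late
    "Y": ["X"],
    "D": ["C", "Y"],  # first event sent after the split healed
}


@lru_cache(maxsize=None)
def depth(event_id: str) -> int:
    """Length of the longest path from this event back to the start of the room."""
    parents = PARENTS[event_id]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


def topological_order(event_ids):
    # Sort by depth in the graph, breaking ties by arrival position.
    return sorted(event_ids, key=lambda e: (depth(e), ARRIVAL_ORDER.index(e)))


print(ARRIVAL_ORDER)                     # ['A', 'B', 'C', 'X', 'Y', 'D']
print(topological_order(ARRIVAL_ORDER))  # ['A', 'B', 'X', 'C', 'Y', 'D']
```

In the second list, X and Y are slotted in among B and C, even though a user watching at the time saw them arrive after C.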
Synapse, and (I think) other homeservers actually use topological order for
/messages and /context as well as /relations. I am not convinced that this
actually complies with the spec, since topological order is very much not
chronological, by my understanding of the word.
Why is this a problem?
Imagine I have two Matrix clients, both logged in as me. I leave the first
client open, polling the /sync API and fetching events in order of their
arrival on the homeserver. I close the second client, and open it later. It will
run a /sync, but it will only receive the latest messages. If I scroll the
room upwards, it will fetch more messages using the /messages API.
The two clients will show me the messages in a different order. Normally, the
orders are similar or identical, but if two homeservers were disconnected for a
while (a “netsplit” occurred) they can be very different.
Which order is correct? I would generally argue that the first client is most
likely to fit with your intuition (because messages that you saw later are
further down the screen), but it’s definitely arguable. In actual fact, when
messages were sent effectively in parallel, there is no correct order. What I
am hoping for is a consistent order, as far as possible.
I would strongly argue that these two clients should show messages (and other
events) in the same order.
I do feel honour-bound at this point to say that I spoke to a colleague
recently who disagreed with this principle, and said that because of the
different usage of these two clients, it was OK, or even useful, that they
showed different results. I definitely disagree, but it’s worth pointing out
that this is a debatable point.
It’s also worth saying that even a lone client can exhibit this inconsistency,
if it doesn’t store all messages forever. If it deletes some messages to save
storage space, when the user pages up to read those messages, they will be
fetched using /messages, so will appear in a different order from what the
user saw originally when they were fetched via /sync. There is currently no
API that can re-fetch messages in the same order they were first received over
/sync.
How big a problem is this?
Does it really matter if a few messages are in a different order? On the face of
it, maybe not. In most cases, the differences are minor, and when they are more
significant this is the result of a significant problem like a netsplit or
malicious behaviour.
I will admit that, even though I said I was not talking about state events (the
important events that define e.g. who is a member of the room), part of my
motivation here is that I want the order of state events to be consistent,
because there are times when a user really wants to examine the history of what
happened, and doesn’t want that to change under them.
However, I personally think that even if we ignore state events, we should do
our level best to order messages consistently. How should a user interpret a
change in order? What does it mean to them? Most likely, if they notice, they
will figure that Matrix is just a bit flaky. In the worst case, we might
“gaslight” them: they remember things happening in a particular order, but
when they check back they find that the evidence contradicts them.
If we accept that clients should display a linear view of what happened (despite
the fact that in reality things may have happened in parallel) then I think we
should work hard to make that view consistent.
How to fix it?
Use topological order everywhere?
One way that a client could “fix” this problem without any spec changes at all
would be to ignore /sync timeline responses completely, and repeatedly call
/messages to get messages to display. This would ensure that messages are
displayed in a consistent order, but it has several critical disadvantages.
Firstly, this is clearly not the intention of the spec authors, and is
inefficient since it involves throwing away information that the server worked
to produce.
Secondly, assuming that messages appear in topological order, if some old
messages arrive late (e.g. due to a netsplit), this will mean that messages
appear “in the past”, high up the timeline, even though the user has not read
them. To make this happen, the client would need to keep repeating old calls to
/messages, to check whether the past has changed, and the client would need to
find a way to display these late-arriving changes to the user.
Persistent sync order
I believe that the order messages arrive from /sync
is the correct order for a
client: as soon as the homeserver has a message, it should hand it over to the
client, and the client should show that it arrived by rendering it at the bottom
of the timeline.
I also believe that all of my clients should see the same order of messages.
So the logical conclusion is that the homeserver should be able to provide a
back-paginatable view of messages in the order they were provided via /sync
and by extension, if no client happened to be syncing at the time, the order in
which they would have been provided, which is essentially the order in which
they arrived at the homeserver.
One way to implement this would be to change the /messages and other APIs to
return messages in this order. In the case of the /messages and /context
APIs, I even think this would comply with the spec as it is now. One possible
implementation for homeservers would be to mark each message with a timestamp
when they first saw it, and sort their responses based on this timestamp. Spec
issue #852 actually proposes this change for /messages.
Note: this might make it difficult for homeservers that process incoming
events in parallel, requiring some kind of synchronisation mechanism to assign
timestamps, or some other mechanism to provide a consistent order. The exact
order of events that arrive very close to each other is not important though,
so long as the order is consistent.
It is worth noting that whenever a client is syncing, the homeserver already
chooses an order for the events it provides over /sync, proving that a
linear order is possible in principle.
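Here is a minimal sketch of what "persistent sync order" could look like inside a homeserver. It swaps the timestamp suggested above for a plain counter guarded by a lock, which is one possible synchronisation mechanism. The class and method names are invented for illustration and are not taken from any real homeserver.

```python
import threading


class ArrivalLog:
    """Toy store that records events in the order they were first seen."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []  # event IDs, in arrival order

    def receive(self, event_id: str) -> int:
        # The lock means that even if events are processed in parallel,
        # each one gets a unique, increasing position in the log.
        with self._lock:
            self._events.append(event_id)
            return len(self._events) - 1

    def page_back(self, before: int, limit: int):
        """Return up to `limit` events that arrived before position `before`, oldest first."""
        with self._lock:
            start = max(0, before - limit)
            return self._events[start:before]


log = ArrivalLog()
for event_id in ["A", "B", "C", "X", "Y", "D"]:
    log.receive(event_id)

latest = log.page_back(before=6, limit=2)  # ['Y', 'D']
older = log.page_back(before=4, limit=2)   # ['C', 'X']
```

Paging back through this log always yields the same sequence, which is the property I am asking for. Two events that arrive at almost the same moment get an arbitrary relative order, but it never changes afterwards.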
An alternative is to continue providing events in any order, but add some kind
of order number that allows clients to sort events into /sync order.
MSC4033 proposes this.
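On the client side, an order number would make merging events from different APIs straightforward. The sketch below assumes each event carries a numeric field called order. That field name is hypothetical and is not taken from the text of MSC4033.

```python
def merge_into_timeline(timeline, new_events):
    """Merge events from any API into one list sorted by the server-provided order."""
    by_id = {e["event_id"]: e for e in timeline}
    for e in new_events:
        by_id[e["event_id"]] = e  # de-duplicate events that were fetched twice
    return sorted(by_id.values(), key=lambda e: e["order"])


# Events as they might arrive from /sync and from /messages (order values invented).
from_sync = [{"event_id": "$c", "order": 3}, {"event_id": "$d", "order": 4}]
from_messages = [{"event_id": "$b", "order": 2}, {"event_id": "$a", "order": 1},
                 {"event_id": "$c", "order": 3}]

timeline = merge_into_timeline(from_sync, from_messages)
print([e["event_id"] for e in timeline])  # ['$a', '$b', '$c', '$d']
```

The client no longer cares which API an event came from, or in what order the responses listed them.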
Of course, it’s much worse than this
Feel free to stop reading here. I’ve made my main point.
But if you want to know how difficult this problem really is, read on!
State resolution can change the past
So far, we’ve been assuming that if the homeserver just passes on messages to
the client as soon as it receives them, this will result in a reasonably
sensible order and an accurate reflection of the timeline. In fact, this is not
really true, because whether we like it or not, history as we understand it can
change.
In particular, when the homeserver performs “state resolution” (an evaluation of
which state events are considered valid based on who is a member of the room and
what permissions they have), some events can be effectively removed because it
turns out the person who created them didn’t have permission to do so. Because
Matrix is distributed, it definitely does happen that a homeserver evaluates an
event as valid at one point (and passes it on to clients) and then later has to
change its mind and decide it is not valid.
Currently, we don’t have a way of telling clients that messages should be
removed because of a change like this. (For state events, there are mechanisms
to tell clients they need to update their state, but not actually to delete
state events, as I understand it.)
I think we need to allow homeservers to send “deny” items to clients, which tell
a client to delete events, to cover this case. (Note: I don’t use the word
“event” here since these items would by necessity be created by the homeserver, not a
client, and would not be signed by a client device.)
I also think these deny items should be part of the linear history of a room,
as opposed to being signals to edit that history retrospectively. This way,
clients can show clearly the history of what they displayed to the user, and why
it changed when it did. The alternative is to “lie” to the user, “pretending”
that these events never existed, when the user actually saw them. How the client
presents this to the user would certainly need to be explored. For example, the
events might disappear from the timeline but some kind of “detailed history”
view might show that they used to exist and were later denied.
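To illustrate the idea, here is a sketch of a client-side history where deny items sit in the same linear list as events. Nothing like this exists in the spec today: the Deny type and both function names are my own inventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    body: str


@dataclass(frozen=True)
class Deny:
    denied_event_id: str  # the event the homeserver no longer considers valid


def visible_timeline(history):
    """What the main room view shows: events that have not been denied."""
    denied = {item.denied_event_id for item in history if isinstance(item, Deny)}
    return [i for i in history if isinstance(i, Event) and i.event_id not in denied]


def detailed_history(history):
    """A fuller view: every item, in the order the client received it."""
    lines = []
    for item in history:
        if isinstance(item, Event):
            lines.append(f"{item.event_id}: {item.body}")
        else:
            lines.append(f"{item.denied_event_id} was denied by the homeserver")
    return lines


history = [
    Event("$1", "How many roads?"),
    Event("$2", "Forty-two."),
    Deny("$2"),  # arrives later, after state resolution changed its mind
]
print([e.event_id for e in visible_timeline(history)])  # ['$1']
```

Because the deny item is appended rather than applied retroactively, the detailed view can still show that $2 was once displayed and when it was withdrawn.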
I think it’s really important that we show the user what really happened from
their perspective (in a persistent form). Anything else is confusing and
betrays users’ trust.
When the homeserver backpaginates
So far, we’ve been assuming that the homeserver has access to all the events,
but that is not always true. If a client asks for some events that a homeserver
does not have, the homeserver can ask another homeserver for them.
Now, when these new, old events arrive, they are very strange: they are new to
this homeserver, but we only had to fetch them from the other server because
they are old! Clients don’t want to display them at the bottom of the timeline,
because they were only requested when a user scrolled a long way up to the
top of the timeline.
So these new, old events need to be inserted further back in the linear history
that we are building, not at the end.
This is a difficult problem because we need to figure out where they need
inserting – it may not be at the start because we may have multiple gaps.
However, I still argue that we should solve this difficult problem on the
server, and present it as straightforward to the client i.e. the server should
respond to a request for these events by returning them as if they already
existed in the linear timeline, and from that moment on always returning them at
the same point when asked.
So from a client’s point of view, these events should be indistinguishable from
events that were on the server already.
Different people just are going to see things in different orders
It would be nice if everyone had the same view. But from different people’s
point of view, things genuinely happened in different orders. If I typed and
sent my message while someone else’s message was travelling to me over the
Internet, I will think that my message was first, and they will think theirs was
first.
With two users on the same homeserver, we could choose to make the homeserver
the arbiter, picking one message to come first. (Note: we don’t currently do
this, but we could.)
But, when two users are on different homeservers, this problem is unsolvable.
An important part of the design of Matrix is that two homeservers can disagree
on the exact order of messages, and still interoperate with each other. This is
what the long words are for: “directed acyclic graph”, “eventual consistency”
and “state resolution”.
So we can never give everyone a consistent view of the order of messages.
In this article I am arguing that a single person should always get a consistent
order when they come back to a room or look via a different client.
I also think I would like to argue that users on the same homeserver should see a
view consistent with each other, but I have not developed that argument here.
Addendum: receipts
The spec for read receipts states that a receipt means that
“the user has read up to a given event”.
In order to understand this, it is, of course, critically important to know
which events are before or after this event i.e. what their order is. In
practice, when existing homeservers report the read status of a room they use
the order in which they received the message as the order for receipts, which I
believe is a good order for this purpose.
As it stands, for an arbitrary event, the spec does not provide a way for a
client to determine which events are before or after it in this ordering, making
it essentially impossible for clients to handle read receipts in a fully-correct
way, or to resolve receipts consistently with the homeserver. The information is
hidden from the client! See “Deciding whether a room or thread is unread”
for more detail.
Ordering events in server-arrival order would improve this situation, but to
make it easy for client authors to get receipts right, I believe we need an
order number for each event, making it easy to compare any two events’ order,
without constructing a timeline and placing them on it. This is
MSC4033.
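With such a number, deciding what is unread reduces to a comparison. This is only a sketch: the order field is a hypothetical name, and real clients also have to deal with threads and with events that should not count towards unread state.

```python
def is_read(event, receipt_event):
    """An event counts as read if it is at or before the event named in the receipt."""
    return event["order"] <= receipt_event["order"]


def unread_count(events, receipt_event):
    return sum(1 for e in events if not is_read(e, receipt_event))


events = [{"event_id": "$a", "order": 1},
          {"event_id": "$b", "order": 2},
          {"event_id": "$c", "order": 3}]
receipt = events[1]  # the user has read up to $b

print(unread_count(events, receipt))  # 1
```

Note that the comparison works even if the client holds only the two events involved and none of the timeline between them.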
Addendum: linear timeline
After I wrote the first draft of this post, some colleagues and I discussed it,
and thought up some reasons why a “persistent sync order” or “linearised
timeline” has problems we would need to solve.
The problems come when the homeserver has a gap in its timeline, and then uses
backpagination (or “backfilling”) to fetch events within that gap.
Quoting Rich vdH’s summary of our conversation:
Suppose Homeserver A has been participating in a busy room for a long time (10 years, for argument’s sake).
A netsplit happens; during the netsplit, homeserver B sends 200 messages.
Eventually, the netsplit heals. Homeserver A ends up receiving some, but
definitely not all, of those 200 messages. We now have a “gap” in the DAG.
Time passes, and another 50 messages or so get sent.
Now, a user comes online. First of all, they only see the recent 50 messages.
But they scroll up, and get to the point of the “gap”. Assuming we backfill
homeserver B’s messages, where do they fit in the timeline?
1. at the beginning (ie, 10 years ago)?
2. at the end (ie, the user doesn’t actually see the messages until they scroll back down to “now”)
3. or do we try and slot them in at the right point in the timeline?
I think the only plausible answer is 3, but it means that you could scroll
past the same point in the timeline twice and see different messages (because
we might not have been able to backfill the message the first time the user
scrolled past).
… Which I think really means that we have to say to the client “<some
messages missing here>” and the client needs to reflect that in the UI.
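One way to picture option 3 with explicit "messages missing here" markers is sketched below. The Gap type, its gap_id handle and the fill_gap function are assumptions for illustration only, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    gap_id: str  # hypothetical handle the client would use to ask for backfill


def fill_gap(timeline, gap_id, backfilled, still_missing=False):
    """Replace a gap marker with backfilled events, leaving everything else in place."""
    result = []
    for item in timeline:
        if isinstance(item, Gap) and item.gap_id == gap_id:
            if still_missing:
                # Older events could not be fetched yet, so keep a marker before the new ones.
                result.append(item)
            result.extend(backfilled)
        else:
            result.append(item)
    return result


# Homeserver A's own events, with a marker where B's netsplit messages belong.
timeline = ["A1", "A2", Gap("g1"), "A3", "A4"]
after = fill_gap(timeline, "g1", ["B1", "B2"])
print(after)  # ['A1', 'A2', 'B1', 'B2', 'A3', 'A4']
```

The events on either side of the gap never move. Only the marker is replaced, so a client that showed the marker on the first scroll can explain why new messages appear there on the second.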
Thanks
Thank you to Erik J, Rich vdH, Kegan D, Florian H and others for discussions
leading up to this article, and for help with writing it. All mistakes,
misunderstandings and naiveties are my own.