May 24, 2025

Consistency is Beautiful

I designed and advised on a fair number of distributed systems back in the 2010s, so I figured it would be fun to speculate on some of the core challenges going on inside of Twitter/X right now. Distributed consistency is the single largest challenge in this space, and it lies at the heart of deliver-once guarantees.

Have you ever received a message or notification twice?

Kind of annoying, right?

Well, the reality is that almost all systems at scale are a sort of funnel: data comes in from hundreds of millions of endpoints and gets ordered on a distributed queue. Now, some operations are idempotent, which means we can process them five times and the system of record ends up in the same state as if we had processed them once. This is the ideal case for a distributed system.
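Here's a minimal sketch of that distinction, using a plain in-memory dict as a stand-in for the system of record (the keys and values are made up for illustration): setting a field to an absolute value is idempotent, while incrementing a counter is not.

```python
# Hypothetical in-memory "system of record" to illustrate idempotency.
store = {"user:42:display_name": "Ada", "user:42:unread_count": 0}

def set_display_name(name: str) -> None:
    # Idempotent: applying this once or five times leaves the same state.
    store["user:42:display_name"] = name

def increment_unread() -> None:
    # Not idempotent: every replay changes the system of record again.
    store["user:42:unread_count"] += 1

for _ in range(5):
    set_display_name("Grace")   # display_name ends up "Grace" either way
    increment_unread()          # unread_count ends up 5, not 1

print(store)  # {'user:42:display_name': 'Grace', 'user:42:unread_count': 5}
```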

Unfortunately, this is not acceptable for a certain class of problems. Ironically, messages, notifications, and inventory updates are all examples of this.

We can't insert the same message into the DB multiple times, we can't notify the user three times, and we can't sell more things than we have in stock.
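To make the inventory case concrete, here's a hedged sketch of one common approach, with made-up table names and ids: dedupe on an order id and only decrement while stock remains, all in one transaction, so a replayed event can't double-sell and a burst of buyers can't oversell.

```python
# Sketch of a "don't oversell" guard using SQLite; the schema is illustrative,
# not from any real ticketing system.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, stock INTEGER)")
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, sku TEXT)")
db.execute("INSERT INTO inventory VALUES ('taylor-swift-ga', 2)")

def sell(order_id: str, sku: str) -> bool:
    try:
        with db:  # one transaction: dedupe on order_id, then guarded decrement
            db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, sku))
            cur = db.execute(
                "UPDATE inventory SET stock = stock - 1 "
                "WHERE sku = ? AND stock > 0", (sku,))
            if cur.rowcount == 0:
                raise RuntimeError("sold out")  # rolls back the order row too
        return True
    except (sqlite3.IntegrityError, RuntimeError):
        return False  # duplicate order_id (a replay) or no stock left

print(sell("o-1", "taylor-swift-ga"))  # True
print(sell("o-1", "taylor-swift-ga"))  # False: replayed event, deduped
print(sell("o-2", "taylor-swift-ga"))  # True
print(sell("o-3", "taylor-swift-ga"))  # False: sold out
```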

It kind of sucks to sit in a queue on Ticketmaster for those Taylor Swift tickets, right? Well, it would be even worse if they sold you a ticket that didn't exist.

So we build distributed queuing systems that guarantee a particular event will only be processed once. This lets us keep the systems that process the data, such as inserting into a DB or triggering another system, stateless. I.e., you can have 50 copies of the same container, but the queue ensures an event is only handed out once.
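In practice that guarantee usually means "delivered at least once, applied effectively once": the broker may redeliver, and the stateless workers dedupe on an event id against shared state. A minimal sketch, with plain Python objects standing in for the queue, the dedupe store, and the DB rather than any particular broker's API:

```python
import queue

events = queue.Queue()    # stand-in for the distributed queue
processed_ids = set()     # stand-in for shared dedupe state (e.g. a DB table)
messages_table = []       # stand-in for the system of record

def handle(event: dict) -> None:
    """Stateless worker: any of the 50 identical containers could run this."""
    if event["id"] in processed_ids:
        return                      # redelivered event: drop it, don't double-insert
    messages_table.append(event["body"])
    processed_ids.add(event["id"])  # in a real system this and the insert commit together

# The broker redelivers event "e1" after a timeout; the user still sees one message.
for ev in [{"id": "e1", "body": "hi"}, {"id": "e1", "body": "hi"}, {"id": "e2", "body": "yo"}]:
    events.put(ev)
while not events.empty():
    handle(events.get())

print(messages_table)  # ['hi', 'yo']
```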

In a normal state, these systems run at a base load. They hum along without much concern, but every second they are offline adds to the peak load the system will face when it tries to come back online. This creates what is known as a thundering herd: a system is designed and scaled for a certain level of base load, but when the floodgates open it's suddenly exposed to a massive and sustained peak load.
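Rough numbers make the shape of the problem obvious. All of the figures below are illustrative: the backlog you come back to is downtime times the incoming rate, and it only drains at the margin between your capacity and the new traffic that never stopped arriving.

```python
# Back-of-the-envelope backlog math; every number here is made up.
incoming_rate = 10_000          # events/sec arriving whether we're up or not
capacity      = 12_000          # events/sec we can actually process
downtime_sec  = 10 * 60         # 10 minutes offline

backlog = downtime_sec * incoming_rate   # 6,000,000 events queued up
drain_rate = capacity - incoming_rate    # only 2,000 events/sec of headroom
catch_up_sec = backlog / drain_rate      # 3,000 sec = 50 minutes to recover

print(f"backlog={backlog:,} events, catch-up={catch_up_sec / 60:.0f} minutes")
```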

There are mechanisms called circuit breakers that deal with this by throttling load and protecting downstream systems from the new peak. However, you now have a new problem: if you can process 10,000 records/sec and you're facing a 100,000-record backlog on top of a steady 10,000 records/sec of new traffic, you're going to have to scale the system up just to catch up.
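A circuit breaker in this context is just a guard in front of the downstream call: after too many consecutive failures it "opens" and sheds load for a cooldown period instead of hammering a system that's already drowning. A minimal sketch, with hypothetical thresholds and a hypothetical protected call:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; probe again after `cooldown` seconds."""

    def __init__(self, max_failures: int = 5, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: shedding load")
            self.opened_at = None   # half-open: let one request probe downstream
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker()
# breaker.call(send_notification, user_id, payload)  # hypothetical downstream call
```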

Kafka came out of LinkedIn back in the day and Twitter runs this kind of queuing infrastructure at enormous scale, and while the downstream systems are likely isolated, there is probably a large amount of shared queuing infrastructure powering both notifications and messages.

Group chats are especially problematic here, because a single message can simultaneously trigger notifications for hundreds of people, and when the system comes back online everyone starts sending stupid messages to each other, which amplifies load even further. This sort of knock-on effect is called write amplification: one message can cause up to N writes.
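A sketch of why one group-chat message becomes N writes, with made-up names and sizes: the fan-out loop below turns a single inbound event into one notification write per member, so a 500-person group amplifies one message into roughly 500 writes.

```python
# Illustrative fan-out: one inbound message -> N notification writes.
group_members = [f"user_{i}" for i in range(500)]   # hypothetical 500-person group chat
notification_queue = []                              # stand-in for the notification topic

def post_message(sender: str, body: str) -> int:
    writes = 0
    for member in group_members:
        if member == sender:
            continue                                 # don't notify the sender
        notification_queue.append({"to": member, "from": sender, "body": body})
        writes += 1
    return writes

print(post_message("user_0", "we're back?"))  # 499 notification writes from 1 message
```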

This likely put strain on the notification infrastructure in a typical noisy-neighbor situation, and once it went offline you have a cold-start problem on notifications, because notifications are core to every interaction on the site.

You can't just restart a power grid, because refrigerators and ACs draw significantly more power at startup and the surge trips the breakers all over again. In the same way, it's incredibly hard to cold-start a distributed, consistent system, because you have to absorb the pent-up demand and account for a peak load that keeps growing as people unknowingly amplify the very problems that took it offline.

I'm pouring one out for the SREs on call right now and I'll leave you with a story.

There exists a rumor that the Google SRE team that worked on Spanner maintained such high-quality atomic clocks to power their distributed consistency algorithm that they once called up the US Air Force to notify them that the atomic clocks used for GPS synchronization were wrong.

Connect with me