How Discord Indexes Trillions of Messages: A Major Technical Evolution

10 minute read

Discord, the communication platform favored by millions of users, has shared fascinating details about the evolution of its search infrastructure. This infrastructure now allows it to index trillions of messages while introducing new features, such as cross-server search. This article delves into the technical workings of this transformation, drawing inspiration from the original article published on Discord’s blog.


Context and initial challenges

In 2017, Discord deployed its first search system to index billions of messages. At that time, the infrastructure relied on Elasticsearch, a powerful and flexible open-source search engine. Messages were organized – or sharded – by server (called a guild in Discord terminology) or by direct message (DM). Each shard was stored in indices distributed across two Elasticsearch clusters. This organization made it possible to group messages from the same server to speed up searches while maintaining manageable cluster sizes.

To save resources, Discord used lazy indexing. Concretely, messages were only indexed when a user first ran a search on a server. A queue based on Redis served as an intermediary: it stored messages to be indexed, and workers (background processes) retrieved them in batches to send them to Elasticsearch. This approach took advantage of Elasticsearch’s bulk indexing capabilities, which allow multiple documents to be processed in a single operation.
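To make the batching step concrete, here is a minimal Python sketch of a worker draining a queue in fixed-size batches before issuing a single bulk request. The queue contents, field names, and batch size are illustrative, not Discord’s actual values.

```python
from collections import deque

# Hypothetical in-memory stand-in for the Redis queue described above.
queue = deque({"id": i, "guild_id": 42, "content": f"msg {i}"} for i in range(250))

BATCH_SIZE = 100

def drain_in_batches(queue, batch_size=BATCH_SIZE):
    """Pull up to `batch_size` messages at a time, as a worker would
    before sending a single Elasticsearch _bulk request."""
    while queue:
        yield [queue.popleft() for _ in range(min(batch_size, len(queue)))]

batches = list(drain_in_batches(queue))
# 250 queued messages become batches of 100, 100, and 50
```

The point of the batching is that each bulk request amortizes the per-request overhead across many documents, which is exactly what made the later overload problem possible when batches hit many nodes at once.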

But with Discord’s explosive growth – millions of new users and ever-larger servers – this architecture revealed its limitations:

  • Message loss in Redis: Redis, while extremely fast for in-memory operations, was not designed to guarantee message delivery in high-availability scenarios. In case of overload or failure, messages could be lost.
  • Elasticsearch node overload: Bulk indexing, while efficient, could overwhelm Elasticsearch nodes, especially when thousands of messages needed to be processed simultaneously.
  • Hotspots from large servers: Popular servers, with millions of members and a constant stream of messages, generated disproportionate traffic. This created hotspots – points where a node or shard was overloaded – slowing down the entire system.

These challenges prompted Discord to rethink its infrastructure to make it more robust, scalable, and capable of supporting continuous growth.


Migration of the queue to Google Cloud PubSub

The first problem to solve was queue reliability. Redis, while suited to simple and fast tasks, could not guarantee message delivery in an environment as demanding as Discord’s. To address this, the team migrated the queue to Google Cloud PubSub, a cloud messaging service designed to handle massive volumes of data with maximum availability.

Why PubSub?

  • Guaranteed delivery: Unlike Redis, PubSub stores messages persistently and redelivers them in case of failure, eliminating data loss.
  • Automatic scalability: PubSub adapts to traffic spikes without manual intervention, a crucial asset for a rapidly growing platform like Discord.
  • Seamless integration: The team was able to integrate PubSub into its existing architecture with minimal changes, accelerating the transition.

Concrete example

Imagine a queue like a mailbox. With Redis, if the mail carrier (the worker) arrives when the box is overflowing, letters (messages) can be lost. With PubSub, each letter is registered and kept until the mail carrier picks it up, even in case of delay or failure.
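The mailbox analogy can be sketched in code. The toy class below imitates the at-least-once semantics PubSub provides: a message is only removed once the consumer acknowledges it, so a message pulled by a worker that crashes before acking is simply redelivered. This is an in-memory illustration of the semantics, not the PubSub API.

```python
import uuid

class AckQueue:
    """Toy at-least-once queue: messages stay pending until acked.
    (Pub/Sub implements this durably; this version is in-memory.)"""

    def __init__(self):
        self._pending = {}  # ack_id -> message

    def publish(self, message):
        ack_id = str(uuid.uuid4())
        self._pending[ack_id] = message
        return ack_id

    def pull(self):
        # Deliver every message that has not been acknowledged yet.
        return list(self._pending.items())

    def ack(self, ack_id):
        self._pending.pop(ack_id, None)

q = AckQueue()
q.publish({"content": "hello"})
delivered = q.pull()    # worker crashes before acking...
redelivered = q.pull()  # ...so the message is delivered again
```

A fire-and-forget queue would have dropped the message at the first pull; here it survives until an explicit ack, which is the property that eliminated Discord’s message loss.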

This migration resolved message loss and laid the foundation for other improvements, while opening the door to new use cases, such as scheduling complex tasks.


Improvement of bulk indexing

Bulk indexing is a key feature of Elasticsearch: it allows processing thousands of messages in a single request, thus reducing system load. However, in Discord’s initial approach, workers retrieved batches of messages without regard for their final destination – that is, the Elasticsearch cluster and index where they should be stored. As a result, an indexing operation could touch multiple nodes simultaneously, increasing the risk of overload or failure.

To optimize this process, Discord introduced a PubSub router, an intelligent intermediate layer that:

  1. Retrieves messages from PubSub.
  2. Groups them by destination cluster and index.
  3. Sends coherent batches to the correct Elasticsearch cluster.
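A minimal sketch of the grouping step, with illustrative field names for the cluster and index (how Discord actually derives a message’s destination is internal to their system):

```python
from collections import defaultdict

def route(messages):
    """Group messages by their (cluster, index) destination so each
    bulk request touches a single cluster, as the router above does."""
    batches = defaultdict(list)
    for msg in messages:
        dest = (msg["cluster"], msg["index"])  # destination of this message
        batches[dest].append(msg)
    return dict(batches)

msgs = [
    {"cluster": "cell-1-a", "index": "guild-100", "content": "hi"},
    {"cluster": "cell-1-a", "index": "guild-100", "content": "yo"},
    {"cluster": "cell-2-b", "index": "guild-200", "content": "ok"},
]
batches = route(msgs)
# two batches: 2 messages for cell-1-a, 1 for cell-2-b
```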

Benefits

  • Reduced load: Each indexing operation concerns only a single node, avoiding unnecessary overloads.
  • Fault isolation: If a node fails, only operations destined for it are affected, not the entire system.
  • Increased performance: Batches are more targeted, which speeds up processing.

An analogy

Think of a restaurant kitchen. Before, servers brought mixed orders to all chefs at the same time, creating chaos. Now, a maître d’ (the router) sorts the orders and distributes them to the right chef (node), making preparation smoother.

This improvement was crucial for managing a growing number of clusters and indices, while maintaining fast and reliable indexing.


Introduction of Elasticsearch cells

To push scalability further, Discord introduced the concept of cells. A cell is a logical grouping of multiple small Elasticsearch clusters, each dedicated to a specific portion of the data.

Why cells?

  • Load distribution: Messages from different servers are distributed across multiple clusters, avoiding hotspots.
  • Isolation of large servers: A server with millions of messages can be isolated in its own cell, thus protecting the performance of other users.
  • Simplified maintenance: A cell can be updated or repaired without affecting the entire system.

Impact

This structure enabled Discord to handle massive data volumes while unlocking advanced features. For example, cross-server search – which allows searching across multiple servers at once – relies on this ability to efficiently distribute and coordinate data.

Imagine a library: instead of having one giant bookshelf for all books, you have multiple small sections (cells). If a section becomes too popular, it can be moved or enlarged without disrupting the rest.
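One simple way to implement such a placement policy is a stable hash with explicit overrides for very large servers. The sketch below is purely illustrative; Discord’s actual cell-assignment logic is not public, and the cell names and pinned guild are hypothetical.

```python
import hashlib

CELLS = ["cell-0", "cell-1", "cell-2"]

# Hypothetical overrides: very large guilds pinned to a dedicated cell.
PINNED = {999: "cell-big-guilds"}

def cell_for_guild(guild_id):
    """Pick a cell for a guild: pinned large guilds get their own cell,
    everyone else is spread evenly by a stable hash of the guild id."""
    if guild_id in PINNED:
        return PINNED[guild_id]
    digest = hashlib.sha256(str(guild_id).encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]
```

Because the hash is stable, a guild always maps to the same cell, while the override table gives operators the "isolation of large servers" lever described above.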


Architecture of clusters and nodes

Within each cluster of a cell, Discord adopted an architecture with dedicated nodes, each optimized for a specific task:

  • Ingest nodes: They preprocess and route messages. Stateless, they can be easily multiplied to absorb traffic spikes.
  • Master-eligible nodes: Responsible for cluster coordination, they have dedicated resources to ensure stability.
  • Data nodes: They store indexed messages and execute queries. Configured with sufficient memory (heap), they support intensive operations.

Why this separation?

This specialization allows resources to be allocated where they are needed. For example, an ingest node doesn’t need to store data, so it can be lightweight and fast, while a data node must be robust to handle large indices.
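In Elasticsearch itself (7.9 and later), this role separation is expressed with the `node.roles` setting in each node’s elasticsearch.yml. A minimal sketch, one line per node type:

```yaml
# elasticsearch.yml on a dedicated ingest node:
node.roles: [ ingest ]

# On a master-eligible node:
# node.roles: [ master ]

# On a data node:
# node.roles: [ data ]
```

A node with an empty or narrow role list does only that job, which is what lets ingest nodes stay lightweight while data nodes get the heap they need.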

The pods (instances) of these nodes are deployed on separate machines and node pools in Kubernetes, further optimizing resource utilization.


Deployment in Kubernetes

To orchestrate this complex infrastructure, Discord migrated its Elasticsearch clusters to Kubernetes, a container orchestration system. Kubernetes provides:

  • Automation: Pod deployment, scaling, and management are automated.
  • Resilience: In case of failure, Kubernetes restarts or replaces failing pods.
  • Optimization: Resources are allocated precisely through node pools.
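In Kubernetes, pinning a pod to a dedicated node pool is typically done with a nodeSelector. The fragment below uses GKE’s standard node-pool label; the pool name and resource figure are hypothetical, not Discord’s actual configuration.

```yaml
# Pod spec fragment placing Elasticsearch data nodes on their own pool:
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: es-data-pool   # hypothetical pool name
  containers:
    - name: elasticsearch
      resources:
        requests:
          memory: "31Gi"   # data nodes need large heap headroom
```

Ingest and master-eligible pods would carry their own selectors, so each role lands on machines sized for its workload.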

Example

If a cluster is like a team of workers, Kubernetes is the orchestrator who assigns tasks, replaces absent members, and ensures everything runs harmoniously.

This transition reduced operational burden for Discord teams while improving infrastructure reliability.


Benefits and new features

These developments brought significant improvements:

  • Reliable indexing: With PubSub and optimized routing, messages are indexed faster and without loss.
  • Management of large servers: Cells isolate high-traffic servers, eliminating hotspots.
  • Scalability: The system supports trillions of messages without compromising performance.
  • New features: Cross-server search, for example, is now possible thanks to this flexible architecture.

These changes go beyond technical gains: they directly improve user experience, making search faster and more powerful.


The evolution of Discord’s search infrastructure perfectly illustrates how a technology company can meet the challenges of growth. By migrating to Google Cloud PubSub, optimizing indexing, introducing Elasticsearch cells, and adopting Kubernetes, Discord transformed a limited system into an architecture capable of handling trillions of messages.

This transformation demonstrates the importance of constantly adapting technologies to changing needs. For Discord, this means delivering a seamless experience to millions of users while laying the groundwork for future innovations.

Source: This article is based on “How Discord Indexes Trillions of Messages,” published on Discord’s blog.
