Instant Messaging at LinkedIn: Scaling to Hundreds of Thousands of Persistent Connections on One Machine

Coauthor: Cliff Snyder

We recently introduced Instant Messaging on LinkedIn, complete with typing indicators and read receipts. To make this happen, we needed a way to push data from the server to mobile and web clients over persistent connections instead of the traditional request-response paradigm that most modern applications are built on. In this post, we’ll describe the mechanisms we use to instantly send messages, typing indicators, and read receipts to clients as soon as they arrive. We’ll explain how we use the Play Framework and the Akka Actor Model to manage persistent connections based on Server-sent events. We’ll also provide insights into how we load-tested our server to handle hundreds of thousands of concurrent persistent connections in production. Finally, we’ll share optimization techniques that we picked up along the way.

Server-sent events

Server-sent events (SSE) is a technology where a client establishes a normal HTTP connection with a server and the server pushes a continuous stream of data on the same connection as events happen, without the need for the client to make subsequent requests. The EventSource interface is used to receive server-sent events or chunks in text/event-stream format without closing the connection. Every modern web browser supports the EventSource interface, and there are readily-available libraries for iOS and Android.
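
On the wire, each event is just a short block of text in the text/event-stream format, terminated by a blank line. For example, a chunk like the following (the event name and JSON payload here are purely illustrative) delivers a single event to the client over the open connection:

event: typing-indicator
data: {"conversationId": 42, "fromMember": "alice"}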

In our initial implementation, we chose SSE over Websockets because it works over traditional HTTP, and we wanted to start with a protocol that would provide the maximum compatibility with LinkedIn’s widespread member base, which accesses our products from a variety of networks. Having said that, Websockets is a much more powerful technology to perform bi-directional, full-duplex communication, and we will be upgrading to that as the protocol of choice when possible.

Play Framework and Server-sent events

At LinkedIn, we use the Play Framework for our server applications. Play is an open source, lightweight, fully asynchronous framework for Java and Scala applications. It provides out-of-the-box support for EventSource and Websockets. To maintain hundreds of thousands of persistent SSE connections in a scalable fashion, we use Play’s integration with Akka. Akka allows us to raise the abstraction model and use the Actor Model to assign an Actor to each connection that the server accepts.
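
Here is a simplified sketch of that idea in a Play controller. The names (MessagingController, ConnectionActor, connect) are illustrative, and because the exact EventSource helpers differ across Play versions, this sketch frames the server-sent events by hand on top of Akka Streams; treat it as an outline of the pattern rather than production code.

import javax.inject.Inject;

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.stream.OverflowStrategy;
import akka.stream.javadsl.Source;
import akka.util.ByteString;
import play.mvc.Controller;
import play.mvc.Result;

public class MessagingController extends Controller {

    private final ActorSystem actorSystem;

    @Inject
    public MessagingController(ActorSystem actorSystem) {
        this.actorSystem = actorSystem;
    }

    // The client opens this endpoint with EventSource and keeps the connection open.
    public Result connect(String memberId) {
        Source<ByteString, ActorRef> stream =
            Source.<String>actorRef(64, OverflowStrategy.dropHead())
                  // frame each payload (assumed to be a serialized event) as a server-sent event
                  .map(payload -> ByteString.fromString("data: " + payload + "\n\n"))
                  // the materialized ActorRef is the outbound side of this connection;
                  // hand it to a dedicated actor that owns the connection from here on
                  .mapMaterializedValue(outbound ->
                      actorSystem.actorOf(ConnectionActor.props(memberId, outbound)));

        return ok().chunked(stream).as("text/event-stream");
    }
}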

The code snippet above demonstrates how to accept an incoming EventSource connection in the application controller and assign it to be managed by an Akka Actor. The Actor is now responsible for the lifecycle of this connection, so sending a chunk of data to the client as soon as an event happens is as simple as sending a message to the Akka Actor.

Notice how the only way to interact with the connection is to send a message to the Akka Actor managing that connection. This is fundamental to what makes Akka asynchronous, non-blocking, highly performant, and designed for a distributed environment. The Akka Actor in turn handles the incoming message by forwarding it to the EventSource connection that it manages.
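
A sketch of that per-connection actor might look like the following (again, the class and message types are illustrative):

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.Props;

public class ConnectionActor extends AbstractActor {

    private final String memberId;    // kept for logging and cleanup on disconnect
    private final ActorRef outbound;  // stream-side actor backing the open SSE connection

    public static Props props(String memberId, ActorRef outbound) {
        return Props.create(ConnectionActor.class, () -> new ConnectionActor(memberId, outbound));
    }

    private ConnectionActor(String memberId, ActorRef outbound) {
        this.memberId = memberId;
        this.outbound = outbound;
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            // every event for this member (message, typing indicator, read receipt)
            // arrives here as an actor message and is pushed out over the open connection
            .match(String.class, payload -> outbound.tell(payload, getSelf()))
            .build();
    }
}

With this in place, publishing an event to a connected member from anywhere in the application is just connectionActor.tell(payload, ActorRef.noSender()).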

This is it. It’s this simple to manage concurrent EventSource connections using the Play Framework and the Akka Actor model.

How do we know that this works well at scale? Read the next few sections to find out.

Load-testing with real production traffic

There is only so much that one can simulate with load-testing tools. Ultimately, a system needs to be tested against not-easily-replicable production traffic patterns. But how do we test against real production traffic before we actually launch our product? For this, we used a technique that we like to call a “dark launch.” This will be discussed in more detail in a later post.

For the purposes of this post, let’s say that we are able to generate real production traffic on a cluster of machines running our server. An effective way to test the limits of the system is to direct increasing amounts of traffic to a single node to uncover problems that you would have faced if traffic had increased manifold on the entire cluster.

As with anything else, we hit some limits, and the following sections are a fun story of how we eventually reached a hundred thousand connections per machine with simple optimizations.

Limit I: Maximum number of pending connections on a socket

During some of our initial load testing, we ran into a strange problem where we were unable to open more than approximately 128 concurrent connections at once. Please note that the server could easily hold thousands of concurrent connections, but we could not add more than about 128 connections simultaneously to that pool of connections. In the real world, this would be the equivalent of having 128 members initiate a connection to the same machine at the same time.

After some investigation, we learned about the following kernel parameter.

net.core.somaxconn

This kernel parameter is the size of the backlog of TCP connections waiting to be accepted by the application. If a connection request arrives when the queue is full, the connection is refused. The default value for this parameter is 128 on most modern operating systems.

Bumping up this limit in /etc/sysctl.conf helped us get rid of the “connection refused” issues on our Linux machines.
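
For example, adding a line like the following to /etc/sysctl.conf (the value 1024 is just an illustration) and reloading with sysctl -p raises the backlog:

net.core.somaxconn = 1024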

Please note that Netty 4.x and above automatically picks up the OS value for this parameter and uses it when creating the Java ServerSocket. However, if you would like to configure this at the application level as well, you can set the following configuration parameter in your Play application.

play.server.netty.option.backlog=1024

Limit II: JVM thread count

A few hours after we allowed a significant percentage of production traffic to hit our server for the first time, we were alerted to the fact that the load balancer was unable to connect to a few of our machines. On further investigation, we saw the following all over our server logs.

java.lang.OutOfMemoryError: unable to create new native thread

The following graph for the JVM thread count on our machines corroborated the fact that we were dealing with a thread leak and running out of memory.

[Graph: JVM thread count per machine, climbing steadily over time]

We took a thread dump of the JVM process and saw that the vast majority of threads were asleep and doing no useful work.
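
If you need to do the same, the jstack tool that ships with the JDK will dump the stacks of all threads in a running JVM (replace <pid> with the server's process ID):

$ jstack <pid> > thread_dump.txt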

On further investigation, we found a bug in Netty’s idle timeout support on LinkedIn’s fork of the Play Framework: a new HashedWheelTimer instance, each of which spawns its own timer thread, was being created for every incoming connection.

If you hit the JVM thread limit, chances are that there is a thread leak in your code that needs to be fixed. However, if you find that all your threads are actually doing useful work, is there a way to tweak the system to let you create more threads and accept more connections?

The answer is an interesting exercise in how available memory limits the number of threads that can be created on a JVM. Each thread reserves its own stack outside the JVM heap, so the absolute theoretical maximum number of threads is a process’s user address space divided by the thread stack size. In reality, though, the JVM also needs memory for dynamic allocation on the heap. With a few quick tests on a small Java process, we could verify that as more memory is allocated to the heap, less is left for thread stacks. Thus, the limit on the number of threads decreases with increasing heap size.
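
A quick way to see this for yourself is a tiny program along the following lines; the class name is illustrative, and depending on the OS you may hit a process or thread ulimit before you hit memory, so run it on a scratch machine:

public class ThreadLimitTest {
    public static void main(String[] args) {
        long count = 0;
        try {
            while (true) {
                Thread t = new Thread(() -> {
                    try {
                        Thread.sleep(Long.MAX_VALUE);  // park forever, holding on to its stack
                    } catch (InterruptedException ignored) {
                    }
                });
                t.setDaemon(true);
                t.start();
                count++;
            }
        } catch (OutOfMemoryError e) {
            System.out.println("Created " + count + " threads before failing: " + e.getMessage());
        }
    }
}

Running it with different -Xss and -Xmx values shows how the ceiling moves on a given machine.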

To summarize, you can increase the thread count limit by decreasing the stack size per thread (-Xss) or by decreasing the memory allocated to the heap (-Xms, -Xmx).

Limit III: Ephemeral port exhaustion

We did not actually hit this limit, but we wanted to mention it here because it’s a common limit that one can hit when maintaining hundreds of thousands of persistent connections on a single node. Every time a load balancer connects to a server node, it uses an ephemeral port. The port is associated with the connection only for the duration of the connection and thus referred to as “ephemeral.” When the connection is terminated, the ephemeral port is available to be reused. Since persistent connections don’t terminate like usual HTTP connections, the pool of available ephemeral ports on the load balancer can get exhausted. It’s a condition where new connections cannot be created because the OS has run out of the port numbers allocated to establish new local sockets. There are various techniques to overcome ephemeral port exhaustion on modern load balancers, but those are outside the scope of this post.
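
For reference, the ephemeral port range that a Linux host will use for outgoing connections can be inspected as follows; the size of this range caps the number of concurrent connections from a single source IP to a single destination IP and port:

$ cat /proc/sys/net/ipv4/ip_local_port_range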

We were lucky that our load balancer allowed a very high limit of 250K connections per host. However, if you run into this limit, work with the team managing your load balancers to increase the limit on the number of open connections between the load balancer and your server nodes.

Limit IV: File descriptors

Once we had significant production traffic flowing to about 16 nodes in one data center, we decided to test the limit for the number of persistent connections each node could hold. We did this by shutting down a couple of nodes at a time so that the load balancer directed more and more traffic to the remaining nodes. This produced the following beautiful graph for the number of file descriptors used by our server process on each machine, which we internally dubbed the “caterpillar graph.”

[Graph: file descriptors in use by the server process on each machine, stepping up as nodes were shut down (the "caterpillar graph")]

A file descriptor is an abstract handle in Unix-based operating systems that is used to access a network socket, among other things. As expected, more persistent connections per node meant more allocated file descriptors. As you can see, when only 2 of the 16 nodes were live, each of them was using 20K file descriptors. When we shut down one of those two, we saw the following error in the logs of the remaining one.

java.net.SocketException: Too many open files

We had hit the per-process file descriptor limit when all connections were directed to one node. The file descriptor limit for a running process can be seen in the following file under Max open files.

$ cat /proc/<pid>/limits

Max open files            30000

This can be bumped to 200K (as an example) by adding the following lines to /etc/security/limits.conf:

<process username>  soft nofile 200000

<process username>  hard nofile 200000

Note that there is also a system-wide file descriptor limit, a kernel parameter that can be tweaked in /etc/sysctl.conf.

fs.file-max
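
As with the per-process limit, it can be raised with a line like the following (an illustrative value):

fs.file-max = 500000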

We bumped the per-process file descriptor limit on all our machines and, voila, we could easily throw more than 30K connections to each node now. What limit do we hit next…?

Limit V: JVM heap

Next up, we repeated the above process with about 60K connections directed to each of two nodes, and things started to go south again. The number of allocated file descriptors, and correspondingly, the number of active persistent connections, suddenly tanked and latencies spiked to unacceptable levels.

[Graph: allocated file descriptors and active persistent connections dropping sharply as latencies spiked]

On further investigation, we found that we had run out of our 4GB JVM heap space. This produced another beautiful graph demonstrating that each GC run was able to reclaim less and less heap space till we spiraled toward being maxed out.

[Graph: JVM heap usage over time, with each GC run reclaiming less and less space]

We use TLS for all internal communication inside our data centers for our instant messaging services. In practice, each TLS connection seems to consume about 20KB of memory on the JVM, and that adds up quickly as active persistent connections grow: at 60K connections, that is more than 1GB of heap for the connections alone, leading to an out-of-memory situation like the one above.

We bumped up the JVM heap space to 8GB (-Xms8g, -Xmx8g) and re-ran our tests to direct more and more connections to a single node, till we ran out of memory again at about 90K connections on one node and connections started to get dropped.

[Graph: active persistent connections dropping off at about 90K connections on a single node]

Indeed, we had run out of heap space once again, this time at 8GB.

[Graph: JVM heap usage maxing out at 8GB]

We had not run out of processing power, as CPU utilization was still well under 80%.

[Graph: CPU utilization holding well under 80%]

What did we do next? Well, our server nodes have the luxury of 64GB of RAM, so we bumped up the JVM heap space to 16GB. Since then, we haven’t hit the memory limit in our perf tests and have successfully held 100K concurrent persistent connections on each node in production. However, as you have seen in the sections above, we will hit some limit again as traffic increases. What do you think it will be? Memory? CPU? Tell us what you think by tweeting to me at @agupta03.

Conclusion

In this post, we presented an overview of how we used Server-sent events for maintaining persistent connections with LinkedIn’s Instant Messaging clients. We also showed how Akka’s Actor Model can be a powerful tool for managing these connections on the Play Framework.

Pushing the boundaries of what we can do with our systems in production is something we love to do here at LinkedIn. We shared some of the most interesting limits that we hit during our quest to hold hundreds of thousands of connections per node on our Instant Messaging servers. We shared details to help you understand the reasons behind each limit and techniques for extracting the maximum performance out of your systems. We hope that you will be able to apply some of these learnings to your own systems.

Acknowledgements

The development of Instant Messaging technology at LinkedIn was a huge team effort involving a number of exceptional engineers. Swapnil Ghike, Zaheer Mohiuddin, Aditya Modi, Jingjing Sun, and Jacek Suliga, along with us, pioneered the development of much of the technology discussed in this post.