Open Source Hits Scale in 2015

Igor Perisic

VP Engineering; AI, Privacy and Data

December 22, 2015

In 2015 we made our biggest contributions yet to the open source community by open sourcing more than 10 original projects, including Pinot, Burrow and Gobblin, and pushing significant updates to Samza, Rest.li, Kafka and Voldemort, four of our most broadly adopted open source projects.

We’ve worked to scale our infrastructure as we reached 400 million LinkedIn members, so it’s no surprise that many of our open source projects this year focus on building out our data pipelines and the tools that help us make sense of our data. The infrastructure improvements we’ve made in Kafka have allowed us to handle 1.3 trillion messages per day, and Espresso now serves 2.2 million rows per second. This trend of scaling, of course, runs parallel with some of the larger trends among tech companies. Producing and acting on insights built from data is a foundational element of today’s technology companies, and having the correct data is critical to getting those insights. There are dozens of complex issues around this single problem: sourcing the correct data, getting it into a data warehouse, and using the right tools and schemas to sort it and create those insights.

Open source is no longer just a community or hobby for engineers. It’s part of a core business strategy that we are fully invested in, and one that produces projects that will become, or have already become, key pieces of LinkedIn’s infrastructure. This strategy helps push our engineers to write better software, develop faster, and become part of a larger community of many of the world’s most talented engineers. While 2015 was a great year for our open source contributions, we also have a strong pipeline of projects we plan to open source in 2016 that we believe will be valuable for companies to adopt.

Here’s a look at some of the highlights from LinkedIn’s year in open source.

2015 Milestones

January: Apache Samza Graduates from Apache Incubator
February: Open sourcing Spyglass - a flexible library for implementing mentions on Android
March: Rest.li 2.x and a Protocol Upgrade Story
April: Developing Play Applications using Gradle
April: Optimizing Java CMS garbage collections, its difficulties, and using JTune as a solution
June: Open Sourcing Pinot: Scaling the Wall of Real-Time Analytics
June: Burrow: Kafka Consumer Monitoring Reinvented
August: Open-Sourcing the LinkedIn Gradle Plugin and DSL for Apache Hadoop
August: Introducing QARK: An open-source tool to improve Android application security
September: How We’re Improving and Advancing Kafka at LinkedIn
September: FeatureFu: Building Featureful Machine Learning Models
September: Bridging Batch and Streaming Data Ingestion with Gobblin
October: Voldemort 1.10 Open-Source Release
October: Open-sourcing PalDB, a lightweight companion for storing side data

This data-at-scale revolution is only getting started, of course, and the future holds a lot of really exciting work for data-focused open source projects, both from LinkedIn and the larger community. In the coming years, I think we’ll see great advances at the streaming data layer, which hasn’t yet reached maturity, as well as a continued focus on cloud computing through the entire stack, whether at the container level (Docker, etc.) or the platform level (Hadoop, Spark, etc.).

It will be interesting to track the impact open source will make on more consumer-facing industries like drone technology and the Internet of Things. It is bound to make a massive impact on healthcare – the explosion of digital health information can only work correctly on a common platform, and that will most likely be open source. Open source is already affecting every single domain that traffics in code, and that impact will only continue to grow.
