DARWIN: Data Science and Artificial Intelligence Workbench at LinkedIn
January 28, 2022
LinkedIn is the largest global professional network and generates massive amounts of high-quality data. Our data infrastructure scales to store exabytes of data; data analysts, data scientists, and AI engineers then use this data to power several LinkedIn products and the platform as a whole, ranging from job recommendations to each member’s personalized feed. Over the last few years, data scientists and AI engineers at LinkedIn have been using various tools for interacting with data, via different query and storage engines, for exploratory data analysis, experimentation, and visualization. But we soon realized a need for building a unified “one-stop” data science platform that would centralize and serve the various needs of data scientists and AI engineers.
Our solution to this challenge was to build DARWIN: the Data Science and Artificial Intelligence Workbench at LinkedIn. DARWIN addresses similar use-cases as various popular data science platforms in the industry, and while it leverages the Jupyter ecosystem, we go beyond merely Jupyter notebooks to support the whole gamut of the needs of data scientists and AI engineers at LinkedIn. This blog post covers the motivation behind building DARWIN and the key capabilities we wanted to include in the platform. We also touch upon the foundations and key concepts, which serve as the base for various features provided by DARWIN. Lastly, we introduce the features we have added in DARWIN thus far, powering different use cases.
Motivation for building a unified data science platform
Data scientists and AI engineers at LinkedIn have historically used various tools to leverage the power of data. However, this posed several challenges for productivity. Two principal themes broadly characterize these challenges:
Developer experience and ease of use: Before DARWIN, context switching across multiple tools was required and collaboration was difficult to achieve, thereby hampering productivity.
Fragmentation and variation in tooling: Another pre-DARWIN problem was fragmentation in tooling due to historical usage and personal preferences, which led to knowledge fragmentation, lack of easy discoverability of prior work, and difficulty in sharing results with partners. Moreover, making each tool compliant with LinkedIn's privacy and security policies, especially when tools were used locally, resulted in an ever-increasing overhead.
To solve these challenges, we needed to unify our scattered tooling for data science and AI workflows, a consolidation that industry research also indicated the field was moving towards.
DARWIN was built to address the needs of not only data scientists of different skill sets and AI engineers, but all data producers and consumers at LinkedIn. It was important to identify the various personas DARWIN had to cater to, and then address their use cases. These included: expert data scientists and AI engineers; data analysts, product managers, and business analysts; metrics developers who use LinkedIn’s Unified Metrics Platform (UMP) to create and publish metrics; and data developers.
To truly make DARWIN a unified, one-stop tool for these personas, it was important to support the various phases of their development workflows, as well as the tools that they used. These included:
Data exploration and transformation: While many expert data scientists use Jupyter notebooks extensively at this step, others, such as citizen data scientists, product managers, and business analysts, use SQL extensively via user-interface-based tools such as Alation and Aqua Data Studio. Some business users also use Excel at this step.
Data visualization and evaluation: This work is often done primarily on Jupyter notebooks. AI engineers also use various machine learning libraries, such as GDMix, XGBoost, and TensorFlow, to train and evaluate different ML algorithms. For data visualization and delivering insights, we use products such as Tableau and other internal tools, targeted at different user personas ranging from engineers to sales reps.
Productionizing: This step may take the form of scheduling a production flow in Azkaban or using various other tools and frameworks, like Frame or Pro-ML, to perform everything from feature engineering to model deployment. Moreover, the code written has to be reviewed and checked into a Git repository.
Building DARWIN, LinkedIn’s data science platform
Once we had established the need for a unified data science platform, recognized the personas, and identified the tools used at various steps of development, we worked closely with key stakeholders to identify the capabilities DARWIN should support to make it the platform of choice. We also wanted to position DARWIN as a platform that partner teams could leverage and build on. Based on the above, we identified a list of key requirements for DARWIN:
Be a hosted platform for exploratory data analysis: DARWIN should be a hosted platform acting as a single window for all data engines, fulfilling the exploratory data analysis needs such as data analysis, data visualization, and model development.
Act as a knowledge repository and enable collaboration: Engineers should be able to share their work and review the work of others within DARWIN. We wanted to build the ability to discover others’ work, datasets, and insights about datasets and articles, and to create a data catalog. We also wanted DARWIN to allow users to consolidate artifacts through tags and to version their artifacts.
Include code support: We wanted to enable users to develop code in DARWIN as they would on an IDE, with support for multiple languages, and to give users the ability to commit their code directly into their project repositories.
Ensure governance, trust, safety, and compliance: We wanted DARWIN to provide secure and compliant access to the hosted platform, in line with LinkedIn’s principle of building trusted solutions.
Manage scheduling, publishing, and distribution: Users should be able to schedule executable DARWIN resources and generate results for repeatable analysis based on different parameters. Users should also be able to productionize their work by publishing the final results of their analysis in various formats and distributing them to stakeholders.
Integrate with other tools and platforms: We wanted DARWIN to leverage the power of other tools in the ecosystem and integrate with them to enable different user personas to have a unified experience of building ML pipelines, metric authoring, and data catalog in a single tool.
Be a scalable and performant hosted solution: Our goal was to move users away from standalone tools, many of which were used on their personal machines, while ensuring that our new solution was horizontally scalable and provided a similar experience as those tools, along with resource and environment isolation.
Be extensible: We wanted DARWIN to support different environments with different libraries, multiple languages for development, integration with various query engines and data sources, custom extensions, and kernels. We aimed to go beyond merely creating a notebook and to allow users to bring their own app (BYOA) and onboard it to DARWIN. We wanted to democratize the platform by enabling users to extend it and build solutions independently.
We cover the platform capabilities in greater detail in the next section. Below is a platform view of DARWIN:
DARWIN platform view
A key principle we decided to abide by was to leverage open source projects and contribute to the open source community while keeping the platform extensible for accommodating rapid innovations in this space. Some of the key open source technologies we chose were JupyterHub, Kubernetes, and Docker.
To realize these requirements, we came up with the following high-level architecture for the DARWIN platform.
We will explain many of these components in more detail throughout the rest of this post. At a high level, the architecture above provides platform foundations such as scale, extensibility, governance, and management of concurrent user environments, on top of which various features are built. We also introduced the notion of DARWIN resources and separated metadata from storage, which helps us easily evolve DARWIN into a knowledge repository. Moreover, access to various data sources and compute engines makes DARWIN a unified window into data platforms.
DARWIN: Unified window to data platforms
DARWIN supports multiple engines to query datasets across LinkedIn. Spark is supported using languages such as Python, R, Scala, and Spark SQL; access to Trino and MySQL is also supported, with Pinot available soon. DARWIN also provides direct access to data on HDFS, which is helpful when using platforms such as TensorFlow. While this is the current set of platforms we support, our objective is to provide access to data across LinkedIn, irrespective of the platform it's stored in.
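To make the idea of a "unified window" concrete, here is a minimal sketch of how a single entry point might dispatch a statement to whichever engine the user selects. The registry class, engine names, and `run_query` interface are illustrative assumptions for this post, not DARWIN's actual API.

```python
# Hypothetical sketch: one query entry point, many registered engines.
class EngineRegistry:
    def __init__(self):
        self._engines = {}

    def register(self, name, runner):
        """Map an engine name (e.g. 'trino', 'spark-sql') to a callable."""
        self._engines[name] = runner

    def run_query(self, engine, statement):
        """Dispatch the statement to the named engine."""
        if engine not in self._engines:
            raise ValueError(f"unknown engine: {engine}")
        return self._engines[engine](statement)

registry = EngineRegistry()
# Toy runners stand in for real engine clients.
registry.register("trino", lambda sql: f"trino results for: {sql}")
registry.register("spark-sql", lambda sql: f"spark results for: {sql}")

print(registry.run_query("trino", "SELECT 1"))  # trino results for: SELECT 1
```

The point of the indirection is that adding a new engine (say, Pinot) only means registering one more runner; user-facing code is unchanged.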
DARWIN platform foundations
Scale and isolation using Kubernetes
DARWIN needed to be horizontally scalable to accommodate the increasing number of users who leverage the power of data. Moreover, users having a dedicated and isolated environment of their own was equally important. Kubernetes helped us achieve both, and its support for long-running services, along with security features, made it an obvious choice. The rich feature set provided off-the-shelf in Kubernetes helped us focus on building differentiating features in DARWIN quickly and to go beyond merely notebooks, without investing in building some of these foundations.
Extensibility through Docker images
Docker was chosen not only to launch user notebook containers on Kubernetes, but also as a means of true democratization of the DARWIN platform, allowing other users and teams to extend and build on top of it. Furthermore, Docker enables users to package different libraries and applications due to its ability to isolate environments. Hence, it became an excellent fit for our vision of “Bring Your Own Application” (BYOA) to DARWIN. App developers can focus on packaging their app code and deploying to DARWIN, instead of worrying about scaling, site reliability support, compliance and governance, discovery, sharing, etc.
Partner teams build custom Docker images on top of base DARWIN images, encapsulating their apps or libraries and hosting them on the DARWIN platform. An independent Docker registry serves as an app marketplace for DARWIN.
We have supported various use cases from teams across LinkedIn, allowing them to build their solutions on top of the DARWIN platform. Notable examples include:
An on-call dashboard with a custom front-end, developed by the Abuse Incident Response and Prevention (AIRP) team.
Support for Greykite, an end-to-end forecasting library, including input data visualization, model configuration, time-series cross-validation, and forecast visualization/interpretation, exposed to users via the Jupyter Notebook interface.
Management of concurrent user environments using JupyterHub
JupyterHub is highly customizable and can serve multiple environments with pluggable authentication. JupyterHub also provides a Kubernetes spawner to launch independent user servers on Kubernetes, thereby providing users with their own isolated environment. The flexibility which JupyterHub provides helps us integrate it with the LinkedIn authentication stack and ensures support for a wide variety of applications in the DARWIN ecosystem. JupyterHub also manages the user server lifecycle, with an inherent ability to cull user servers on inactivity, in addition to explicit logout, thereby providing some of the key capabilities we wanted.
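The inactivity-based culling mentioned above can be sketched simply: track each user server's last activity and reclaim servers idle past a timeout. This is similar in spirit to JupyterHub's idle culler, but the timeout value and data structures below are assumptions for illustration, not DARWIN's actual configuration.

```python
import time

CULL_AFTER_SECONDS = 3600  # assumed idle timeout, not DARWIN's real value

def servers_to_cull(last_activity, now=None):
    """Return the users whose servers have been idle longer than the timeout.

    last_activity maps a username to the timestamp of their last request.
    """
    now = time.time() if now is None else now
    return [user for user, ts in last_activity.items()
            if now - ts > CULL_AFTER_SECONDS]

activity = {"alice": 1000.0, "bob": 4000.0}
# alice has been idle for 4000s (> 3600), bob only 1000s.
print(servers_to_cull(activity, now=5000.0))  # ['alice']
```

Culling on inactivity, combined with explicit logout, keeps cluster resources from being held by abandoned sessions.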
Governance: Safety, trust, and compliance
At LinkedIn, we take data privacy very seriously, and DARWIN, in line with that, ensures security and compliance. We maintain an audit trail for every operation performed, and we encrypt execution results and store them securely to prevent leaks. Moreover, access to DARWIN resources is controlled using fine-grained access control, preventing any unauthorized access.
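As a minimal sketch of the audit-trail idea, each operation can be recorded as an append-only entry chained to its predecessor by a digest, so tampering with history is detectable. The field names, in-memory log, and hash chaining are illustrative assumptions; this post does not describe DARWIN's actual audit pipeline.

```python
import hashlib
import json
import time

audit_log = []  # stand-in for a durable, append-only audit store

def record(user, action, resource, ts=None):
    """Append one audited operation, chained to the previous entry."""
    entry = {
        "user": user,
        "action": action,
        "resource": resource,
        "ts": ts if ts is not None else time.time(),
        # Chaining each entry to the previous digest makes tampering evident.
        "prev": audit_log[-1]["digest"] if audit_log else None,
    }
    entry["digest"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    audit_log.append(entry)
    return entry

record("alice", "READ", "notebook:123", ts=1.0)
record("alice", "SHARE", "notebook:123", ts=2.0)
print(len(audit_log))  # 2
```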
DARWIN: A knowledge repository
DARWIN was designed to act as a means for accessing and sharing knowledge amongst users to enhance collaboration and learning. We envisaged DARWIN to be the one-stop place for all the knowledge related to working with data, without having to leave the platform, be it accessing data, understanding it, analyzing it, finding references to build context, or generating reports. Next, we cover some of the work we have done towards achieving this vision.
Modeling as resources
Every top-level knowledge artifact that users work on, collaborate on, or store in DARWIN is modeled as a resource. Examples of resources include notebooks, SQL workbooks, outputs, markdown files, reports, articles, projects that encapsulate these artifacts, and any other artifact we may support in the future, each being a different resource type. Resources can also be linked to each other, with the ability to define a hierarchy, so that operations invoked on a top-level resource can cascade down to its secondary resources.
Modeling these artifacts as resources gives us a powerful capability: new resource types can be added to DARWIN seamlessly while we focus only on their frontend and unique backing functionality, because common operations that apply to every resource type, such as CRUD operations, storage, collaboration features, search, and versioning, are provided off the shelf.
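A hedged sketch of the resource abstraction: every artifact type shares a common shape, and an operation invoked on a top-level resource cascades to its linked secondary resources. The class, field names, and delete-cascade below are illustrative, not DARWIN's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Resource:
    resource_id: str
    resource_type: str            # e.g. "notebook", "sql_workbook", "project"
    children: list = field(default_factory=list)

    def link(self, child):
        """Link a secondary resource under this one, forming a hierarchy."""
        self.children.append(child)

    def delete(self, deleted=None):
        """A common operation that cascades down before applying to the parent."""
        deleted = [] if deleted is None else deleted
        for child in self.children:
            child.delete(deleted)
        deleted.append(self.resource_id)
        return deleted

project = Resource("proj-1", "project")
notebook = Resource("nb-1", "notebook")
output = Resource("out-1", "output")
notebook.link(output)
project.link(notebook)
print(project.delete())  # ['out-1', 'nb-1', 'proj-1']
```

Because operations like this are defined once on the base abstraction, a new resource type inherits them for free.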
DARWIN resource metadata and storage
Platform service: Platform service manages the DARWIN resource metadata and is effectively the entry point for DARWIN, providing support for authentication and authorization, managing the launching of user containers via JupyterHub, and mapping resources to file blobs, for storing actual content, by interacting with storage service. We also store DARWIN resource metadata in DataHub for centralized metadata management and establishing relationships with other metadata entities.
Storage service: Storage service stores the backing content for a DARWIN resource abstracted away as file blobs in a persistent backend store. Having a separate service that handles storage allows us to evolve the storage layer and choice of backend storage. The user content being transferred from the user container to storage service is managed by a client-side DARWIN storage library, which can be plugged into the content manager of any app. For Jupyter notebooks, this is achieved by plugging it into a custom implementation of the Notebook Contents API.
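The metadata/content split described above can be sketched as follows: a client-side contents layer keeps a mapping from resource ids to opaque blob ids, while a separate blob store holds the actual bytes, so the backend can evolve independently. All class and method names here are assumptions for illustration.

```python
import uuid

class BlobStore:
    """Stand-in for the persistent backend behind the storage service."""
    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        blob_id = str(uuid.uuid4())
        self._blobs[blob_id] = content
        return blob_id

    def get(self, blob_id: str) -> bytes:
        return self._blobs[blob_id]

class DarwinContents:
    """Plugs into an app's content manager: maps resources to file blobs."""
    def __init__(self, store):
        self.store = store
        self.resource_to_blob = {}   # metadata kept separate from content

    def save(self, resource_id: str, content: bytes):
        self.resource_to_blob[resource_id] = self.store.put(content)

    def load(self, resource_id: str) -> bytes:
        return self.store.get(self.resource_to_blob[resource_id])

contents = DarwinContents(BlobStore())
contents.save("nb-1", b'{"cells": []}')
print(contents.load("nb-1"))  # b'{"cells": []}'
```

Swapping `BlobStore` for a different backend would not change anything the resource layer sees, which is the point of isolating storage behind its own service.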
Collaboration is vital in an enterprise setting. DARWIN enables collaboration amongst engineers and data scientists through two key features, explained below.
Sharing resources with others: In the spirit of distributing knowledge, DARWIN allows users to share resources with other users. Sharing enables developers to learn from and reuse each other’s code, serves as a means to share analysis for review, and allows data scientists to share their final analysis with its consumers, be it end users, product managers, or executives, all in a single place, without switching across tools. As discussed earlier, LinkedIn allows published work to be visible to other users so that others can leverage it. DARWIN works on the same principle and, by default, allows any user to view the work of others, i.e., the code without the results, ensuring data is not shared with unauthorized users. However, resource owners can also explicitly share resources with results to users authorized to view that data, with security audits tracking such shares.
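The default "code without results" view can be sketched as stripping outputs from a notebook document before showing it to viewers who are not authorized to see the underlying data. The notebook dictionary below follows the standard nbformat layout; the function itself is an illustration, not DARWIN's implementation.

```python
import copy

def code_only_view(notebook: dict) -> dict:
    """Return a copy of the notebook with all code-cell outputs removed."""
    view = copy.deepcopy(notebook)
    for cell in view.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []          # drop results, keep the code
            cell["execution_count"] = None
    return view

nb = {"cells": [{"cell_type": "code",
                 "source": "df.head()",
                 "execution_count": 3,
                 "outputs": [{"data": {"text/plain": "...rows..."}}]}]}
shared = code_only_view(nb)
print(shared["cells"][0]["outputs"])  # [] -- code is visible, results are not
```

The owner's copy is untouched; only the shared view is sanitized.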
Search and discovery of resources: DARWIN also enables search and discovery of the metadata of DARWIN’s resources, with users able to search resources using various attributes. All of this capability is powered by DataHub; the complementary capability, discovering DARWIN resources from within DataHub, is currently in the works. A surfaced resource defaults to the “code”-only view unless the resource owner has explicitly shared it with results.
All of this capability is served through a user interface in DARWIN, where we rely heavily on React.js, building React-based JupyterLab extensions to support the frontend of most of our features. React.js, with its vibrant community, rich plugin support, and excellent performance, has become the framework of choice for DARWIN.
The DARWIN frontend also supports browsing resources, performing CRUD operations on them, and switching execution environments.
JupyterLab in DARWIN
Key features provided by the DARWIN platform
In this section, we talk about the features we built in DARWIN catering to the needs of different personas discussed earlier.
Support for multiple languages
DARWIN provides an authoring experience to end-users in various languages, including Python, SQL, R, and Scala for Spark, covering all the languages used by data scientists and AI engineers across LinkedIn. Support for so many languages gives users the flexibility to focus on their analysis or experimentation and use the libraries of their choice, without worrying about learning a new language.
Intellisense refers to capabilities such as code completion, documentation help, and function signatures, which are some of the most important features of an IDE. Having a similar experience in DARWIN helps developers unify code development and testing with data in a single place. Intellisense in DARWIN spans several languages, including SQL, Python, R, and Scala; SQL autocomplete is backed by a data catalog built from metadata stored in DataHub.
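A minimal sketch of catalog-backed SQL autocomplete: given a table name and a column prefix, suggest the matching columns. The toy catalog dictionary stands in for metadata sourced from DataHub; the table and column names are invented for the example.

```python
# Toy stand-in for a DataHub-backed catalog of table schemas.
CATALOG = {
    "members": ["member_id", "member_name", "member_country"],
    "jobs": ["job_id", "job_title"],
}

def complete_column(table: str, prefix: str) -> list:
    """Return the columns of `table` that start with `prefix`, sorted."""
    return sorted(c for c in CATALOG.get(table, []) if c.startswith(prefix))

print(complete_column("members", "member_c"))  # ['member_country']
```

A real completer would also rank suggestions and resolve table aliases from the query being edited, but the catalog lookup is the core of the feature.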
To address the needs of citizen data scientists, business analysts, or anyone comfortable working with SQL, we built SQL workbooks in DARWIN. SQL workbooks provide a SQL editor and display results in a tabular format with the ability to perform spreadsheet operations such as searching, filtering, sorting, pivoting, etc. The eventual aim is to support built-in visualizations for the queried data, the ability to publish reports, and a data catalog view with dataset profiles. These additional features would allow business analysts and citizen data scientists who do not perform complex analysis (such as building models) to quickly analyze and understand data by querying it and looking through visualizations and dataset profiles.
SQL workbook querying Trino
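The spreadsheet-style operations a SQL workbook exposes over its result grid can be illustrated with plain rows and two small helpers. The data and helper functions below are toy stand-ins, not DARWIN's result-grid implementation.

```python
# Toy query results, as a list of row dictionaries.
rows = [
    {"country": "US", "signups": 120},
    {"country": "IN", "signups": 340},
    {"country": "BR", "signups": 95},
]

def filter_rows(rows, column, predicate):
    """Spreadsheet-style filter: keep rows whose column satisfies predicate."""
    return [r for r in rows if predicate(r[column])]

def sort_rows(rows, column, descending=False):
    """Spreadsheet-style sort on one column."""
    return sorted(rows, key=lambda r: r[column], reverse=descending)

top = sort_rows(filter_rows(rows, "signups", lambda v: v >= 100),
                "signups", descending=True)
print([r["country"] for r in top])  # ['IN', 'US']
```

Pivoting and searching compose in the same way: each operation takes and returns a table, so the grid can chain them interactively.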
Scheduling of notebooks and workbooks
For data scientists, a critical step in productionizing is being able to perform repeatable analysis as new data is generated continuously. The ability to schedule notebooks and workbooks from DARWIN addressed this important need for our users. Scheduling in DARWIN leverages Azkaban and allows the specification of parameters that can then be used in code.
Sidebar to manage schedules and executions
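The parameter mechanism can be sketched as injecting per-run values into the code being executed, so the same analysis repeats over new data. The templating approach shown here (parameters substituted into a SQL string) is an illustrative assumption, not the actual injection mechanism DARWIN uses.

```python
from string import Template

# A parameterized query, as an author might write it in a workbook.
QUERY_TEMPLATE = Template(
    "SELECT country, COUNT(*) AS signups "
    "FROM signups WHERE ds = '$run_date'"
)

def render_scheduled_query(params: dict) -> str:
    """Materialize one scheduled execution from its parameter values."""
    return QUERY_TEMPLATE.substitute(params)

# Each scheduled run supplies its own parameters, e.g. the partition date.
print(render_scheduled_query({"run_date": "2022-01-28"}))
```

A scheduler then only needs to compute the parameter values for each run (such as the current date) and hand them to the executable resource.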
Integration with other products and tools
While we built several features in DARWIN to address the needs of different user personas, we also leverage the capabilities of other tools by integrating them with DARWIN to create a unified experience for users of DARWIN.
To address the needs of expert data scientists and AI engineers, DARWIN added support for both Frame, an internal feature management tool we use for ML applications, and TensorFlow. Additionally, we are actively pursuing tight integration with LinkedIn's productive machine learning (Pro-ML) framework.
Similarly, to address the needs of metrics developers, DARWIN integrates with internal LinkedIn tools that provide capabilities such as validation and error checking, building metric templates, testing, reviewing, and code submission, all in a single place.
The DARWIN customization capability has also enabled Greykite, a forecasting, anomaly detection, and root cause analysis framework, to leverage DARWIN.
Adoption within LinkedIn
Building DARWIN was just the first step to making DARWIN the tool of choice for the targeted user base. After the initial launch, we formed a product user council, which serves as a voice of the customer for each organization that uses DARWIN at LinkedIn. This paved the way for us to prioritize features in our roadmap and accommodate council feedback before releasing each feature, thereby helping us co-create with our users.
All this effort paved the way for DARWIN being adopted by over 1,400 active users across a wide range of orgs, including Data Science, Artificial Intelligence, SRE, Trust, business analysts, and key product teams. Our user base has grown by over 70% in the past year alone, and with upcoming features in the pipeline, prioritized based on user feedback, we expect this number to keep increasing.
DARWIN continues to evolve, with the vision to be the default and one-stop platform for data scientists, AI engineers, and data analysts at LinkedIn. Below are some features we plan to support in DARWIN in the near future.
Publishing dashboards and apps
Once a notebook or workbook is ready to be published for production use, the author typically shares only part of the results with end users, hiding many of the details. To support this, we intend to let the author manage the view by hiding sections of code and outputs, refining the end-user view to what’s essential. Extending this functionality, we also plan to host always-running applications like Voilà, Dash, Shiny, or other custom applications.
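One simple way to realize such a curated view, sketched here as an assumption rather than DARWIN's planned mechanism, is to let authors tag cells and drop the tagged ones before rendering the published dashboard.

```python
def published_view(cells: list) -> list:
    """Keep only the cells the author has not tagged as hidden."""
    return [c for c in cells
            if "hide" not in c.get("metadata", {}).get("tags", [])]

# The 'hide' tag convention below is invented for this example.
cells = [
    {"source": "setup_credentials()", "metadata": {"tags": ["hide"]}},
    {"source": "plot_weekly_signups()", "metadata": {"tags": []}},
]
print([c["source"] for c in published_view(cells)])  # ['plot_weekly_signups()']
```

The author's editable document keeps every cell; only the published rendering applies the filter.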
We plan to provide rich code-free visualization capabilities in DARWIN to enable citizen data scientists to quickly visualize data with familiar features present in applications like Microsoft Excel, Google Sheets, etc.
Projects, user workspaces, and version control support
In DARWIN, we have the notion of projects that act as namespaces for users. While projects are publicly available now, we plan to let users manage their projects on Git and enable version control. We also plan to introduce workspaces that would let users clone their projects and work on them until they commit their changes to Git. These workspaces will be backed by network attached storage, mounted as a volume.
Exploratory data analysis
DARWIN will leverage DataHub to power search and discovery for datasets, along with their metadata (including dataset schema), without users having to leave DARWIN. DataHub also allows us to surface lineage and relationships of datasets with other entities such as users, flows, and metrics.
Open sourcing DARWIN
We eventually plan to open source DARWIN so that other organizations looking for similar capabilities can leverage it.
Our eventual vision for DARWIN is to realize all the use cases that support the development lifecycles of various user personas and reach a state where either the functionalities of surrounding tools are supported in DARWIN or we integrate with external apps and frameworks.
In this blog post, we touched upon the motivations behind building DARWIN, the platform’s foundations, and described some of the key features of DARWIN. Our goal is that the DARWIN platform continues to evolve to best meet the growing (and changing) needs of our users.
It takes a village to build a product that impacts so many users across LinkedIn. We’re thankful to teams across Analytics Platform & Apps, Data Science, and Artificial Intelligence that we have worked closely with.
A big note of thanks to (in alphabetical order):
Core Engineering Team: Anushika Gupta, Debasree Mitra, John Sushant Sundaram, Manohar M, Manu Ram Pandit, Navneet Verma, Sakthivel Elango, Sarthak Jindal, Shubham Kharose, Somnath Pal, Swasti Kakker, and Yatin Arora.
User Council: Andy Edmonds, Arun Swami, Joojay Huyn, Song Lin, and Xiaofeng Wang, for helping us bring the user view, representing data scientists, product managers, AI engineers, and several others across the engineering team, and for their enthusiastic, continuous inputs and contributions towards DARWIN.
Leadership: We would also like to thank Chid K, Kapil Surlaker and Tai Tran from the leadership team for the continued support and investment in the project, as well as Ya Xu, Head of Data, for the help to enable adoption within the Data Science org.