Open Source Articles

  • gif-showing-smarg-arg-coding-in-action

    Smart Argument Suite: Seamlessly connecting Python jobs

    January 25, 2021

    Co-authors: Jun Jia and Alice Wu Introduction It’s a very common scenario that an AI solution involves composing different jobs, such as data processing and model training or evaluation, into workflows and then submitting them to an orchestration engine for execution. At large companies such as LinkedIn, there may be hundreds of thousands of such executions per...

  • FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format

    January 6, 2021

    Co-authors: Zihan Li, Sudarshan Vasudevan, Lei Sun, and Shirshanka Das Data analytics and AI power many business-critical use cases at LinkedIn. We need to ingest data in a timely and reliable way from a variety of sources, including Kafka, Oracle, and Espresso, bringing it into our Hadoop data lake for subsequent processing by AI and data science pipelines. We...

  • coral-a-sql-translation-analysis-and-rewrite-engine

    Coral: A SQL translation, analysis, and rewrite engine for modern data lakehouses

    December 10, 2020

    Co-authors: Walaa Eldin Moustafa, Wenye Zhang, Sushant Raikar, Raymond Lam, Ron Hu, Shardul Mahadik, Laura Chen, Khai Tran, Chris Chen, and Nagarathnam Muthusamy Introduction At LinkedIn, our big data compute infrastructure continually grows over time, not only to keep pace with the growth in the number of data applications, or their domains spanning data...

  • explaining-metadata-architectures

    DataHub: Popular metadata architectures explained

    December 7, 2020

    When I started my journey at LinkedIn ten years ago, the company was just beginning to experience extreme growth in the volume,...

  • pegasus-data-language

    Pegasus Data Language: Evolving schema definitions for data...

    November 19, 2020

    Pegasus Data Schema (PDSC) is a Pegasus schema definition language that has been used for data modeling with Rest.li services for...

  • architecture-diagram-of-magnet

    Magnet: A scalable and performant shuffle architecture for ...

    October 21, 2020

    Co-authors: Min Shen, Chandni Singh, Ye Zhou, and Sunitha Beeram At LinkedIn, we rely heavily on offline data analytics for...