Spark Articles

  • Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing Time by 94% with Apache Beam

    March 23, 2023

    Co-authors: Yuhong Cheng, Shangjin Zhang, Xinyu Liu, and Yi Pan

    Efficient data processing is crucial in reducing learning curves, simplifying maintenance efforts, and decreasing operational complexity. This, in turn, helps engineers to develop and deploy data processing applications quickly and easily, powering various business requirements, and enhancing member...

  • Reducing Apache Spark Application Dependencies Upload by 99%

    March 9, 2023

    Co-authors: Shu Wang, Biao He, and Minchu Yang

    At LinkedIn, Apache Spark is our primary compute engine for offline data analytics such as data warehousing, data science, machine learning, A/B testing, and metrics reporting. We execute nearly 100,000 Spark applications daily in our Apache Hadoop YARN (more on how we scaled YARN clusters here). These applications...

  • Project Magnet, providing push-based shuffle, now available in Apache Spark 3.2

    October 20, 2021

    Co-authors: Venkata Krishnan Sowrirajan and Min Shen

    We are excited to announce that push-based shuffle (codenamed Project Magnet) is now available in Apache Spark as part of the 3.2 release. Since the SPIP vote on Project Magnet passed in September 2020, there has been a lot of interest in getting it into Apache Spark. As of March 2021, 100% of LinkedIn’s Spark...

  • Distributed tier merge: How LinkedIn tackles stragglers in ...

    September 27, 2021

    Co-authors: Andy Li and Hongbin Wu

    Indexing plays the key role in modern search engines for fast and accurate information retrieval,...

  • Using the LinkedIn Fairness Toolkit in large-scale AI...

    February 8, 2021

    Co-authors: Preetam Nandy, Yunsong Meng, Cyrus DiCiccio, Heloise Logan, Amir Sepehri, Divya Venugopalan, Kinjal Basu, and Noureddine...

  • Magnet: A scalable and performant shuffle architecture for ...

    October 21, 2020

    Co-authors: Min Shen, Chandni Singh, Ye Zhou, and Sunitha Beeram

    At LinkedIn, we rely heavily on offline data analytics for...