Co-Authors: Yuhong Cheng, Shangjin Zhang, Xinyu Liu, and Yi Pan Efficient data processing is crucial in reducing learning curves, simplifying maintenance efforts, and decreasing operational complexity. This, in turn, helps engineers to develop and deploy data processing applications quickly and easily, powering various business requirements, and enhancing member...
Spark Articles
-
- Topics:
- Apache Samza,
- Spark,
- Stream Processing,
- apache,
- Data Streams
-
Co-authors: Shu Wang, Biao He, and Minchu Yang At LinkedIn, Apache Spark is our primary compute engine for offline data analytics such as data warehousing, data science, machine learning, A/B testing, and metrics reporting. We execute nearly 100,000 Spark applications daily in our Apache Hadoop YARN (more on how we scaled YARN clusters here). These applications...
- Topics:
- Spark,
- infrastructure
-
Co-authors: Venkata Krishnan Sowrirajan and Min Shen We are excited to announce that push-based shuffle (codenamed Project Magnet) is now available in Apache Spark as part of the 3.2 release. Since the SPIP vote on Project Magnet passed in September 2020, there has been a lot of interest in getting it into Apache Spark. As of March 2021, 100% of LinkedIn’s Spark...
- Topics:
- Spark,
- Open Source
-
Co-authors: Andy Li and Hongbin Wu Indexing plays the key role in modern search engines for fast and accurate information retrieval,...
- Topics:
- Spark,
- Data,
- Distributed Systems
-
Co-authors: Preetam Nandy, Yunsong Meng, Cyrus DiCiccio, Heloise Logan, Amir Sepehri, Divya Venugopalan, Kinjal Basu, and Noureddine...
-
Co-authors: Min Shen, Chandni Singh, Ye Zhou, and Sunitha Beeram At LinkedIn, we rely heavily on offline data analytics for...
- Topics:
- Spark,
- infrastructure,
- Data,
- Open Source