LinkedIn is a data-driven company. We operate on massive datasets produced both internally by our flagship products and externally by partners. To better serve our members and enterprise customers, we constantly need new data from outside vendors. The internet-scale volume of data and our industry-leading standards for privacy, compliance, and schema evolution add engineering complexity to every integration.
The mission of the LinkedIn Data Integration team is to unify and leverage the Gobblin ecosystem for seamless inter-company and intra-company data exchange. The team builds reusable libraries, plugins, and extensions for Gobblin to streamline and standardize data integration between LinkedIn and third-party vendors' systems.
As more and more of our partners offer their data via APIs, we have contended with increasing diversity and velocity in our integrations. Our main objective in the near term is to efficiently integrate data from a vast number of sources, formats, and protocols through a well-defined interface backed by robust metadata. We believe the answer is a configuration-driven shared service (Generic Connectors) built around a common object model, with the ability to expand the suite with distinct connector patterns, as sketched below.
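To make the idea concrete, here is a minimal Java sketch of what a configuration-driven connector could look like. The interface, the record envelope, and the property names (`connector.protocol`, `connector.rest.url`, `connector.rest.pages`) are all hypothetical illustrations, not Gobblin's actual API; the point is that onboarding a new dataset means writing configuration, and adding a new pattern means one new implementation of the shared contract.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Iterator;
import java.util.Properties;

/** Common record envelope shared by every connector (hypothetical object model). */
record SourceRecord(String source, long offset, String payload) {}

/** The contract each protocol-specific connector pattern fulfills. */
interface GenericConnector extends Iterable<SourceRecord> {
    static GenericConnector fromConfig(Properties conf) {
        // Dispatch purely on configuration: new datasets need config, not code.
        switch (conf.getProperty("connector.protocol", "rest")) {
            case "rest": return new RestConnector(conf);
            default: throw new IllegalArgumentException("unknown protocol");
        }
    }
}

/** One distinct pattern in the suite: paginated REST pulls. */
class RestConnector implements GenericConnector {
    private final String baseUrl;
    private final int pages;
    private final HttpClient http = HttpClient.newHttpClient();

    RestConnector(Properties conf) {
        this.baseUrl = conf.getProperty("connector.rest.url");
        this.pages = Integer.parseInt(conf.getProperty("connector.rest.pages", "1"));
    }

    @Override
    public Iterator<SourceRecord> iterator() {
        return new Iterator<>() {
            private int page = 0;

            @Override public boolean hasNext() { return page < pages; }

            @Override public SourceRecord next() {
                try {
                    // Pull one page and wrap it in the common envelope.
                    HttpRequest req = HttpRequest.newBuilder(
                            URI.create(baseUrl + "?page=" + page)).GET().build();
                    String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
                    return new SourceRecord(baseUrl, page++, body);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        };
    }
}
```

Because every connector yields the same `SourceRecord` envelope, downstream conversion, quality checks, and publishing stay identical no matter which protocol produced the data.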
Currently, a large share of posts created on LinkedIn involve media such as videos, images, and documents, and this data is growing rapidly on the platform. Our team's most recent integration is a media ETL pipeline that extracts audio and video frames to Hadoop. It meets strict throttling requirements while downloading assets from encrypted URLs with a prescribed TTL (sketched below). The next chapter of this integration is to extend the system to extract media assets from LinkedIn's Ambry, a distributed object store for media.
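The download stage can be sketched as follows, assuming a simple interval-based throttle and an `ExpiringUrl` model for the prescribed TTL. Both names are hypothetical, and the real pipeline runs inside ETL tasks rather than a standalone class; the sketch only shows the two constraints interacting: never exceed the permitted request rate, and never fetch a URL past its expiry.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;

/** A media asset URL that expires after a prescribed TTL (hypothetical model). */
record ExpiringUrl(String url, Instant expiresAt) {}

class ThrottledMediaDownloader {
    private final HttpClient http = HttpClient.newHttpClient();
    private final long minIntervalMillis;  // derived from the permitted request rate
    private long lastRequestAt = 0;

    ThrottledMediaDownloader(double requestsPerSecond) {
        this.minIntervalMillis = (long) (1000 / requestsPerSecond);
    }

    /** Fetch each asset to outDir, skipping URLs whose TTL has already elapsed. */
    void fetchAll(List<ExpiringUrl> assets, Path outDir) throws Exception {
        for (ExpiringUrl asset : assets) {
            if (Instant.now().isAfter(asset.expiresAt())) {
                // The URL is stale; the job must request a fresh one upstream.
                System.err.println("skipping expired URL: " + asset.url());
                continue;
            }
            throttle();  // stay within the partner's rate limit
            HttpRequest req = HttpRequest.newBuilder(URI.create(asset.url())).GET().build();
            Path target = outDir.resolve(Integer.toHexString(asset.url().hashCode()));
            http.send(req, HttpResponse.BodyHandlers.ofFile(target));
        }
    }

    /** Space requests at least minIntervalMillis apart. */
    private synchronized void throttle() throws InterruptedException {
        long wait = lastRequestAt + minIntervalMillis - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);
        lastRequestAt = System.currentTimeMillis();
    }
}
```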
Other impactful integrations are with CRM systems, which serve a larger business community and enable it to make key business decisions. We are currently building a privacy store, which includes standardizing member opt-outs for third-party integrations. This service is part of a larger ecosystem comprising Kafka event creation, a Venice key-value store, and a REST interface; a minimal sketch of the opt-out flow follows.
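Here is a hedged sketch of the write path, using the standard Kafka producer API. The topic name, the key scheme, and the in-memory map standing in for the Venice-backed read view are all hypothetical; in the real system the Kafka topic would be materialized into Venice by an ingestion job, and the REST layer would query that store.

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Emits standardized opt-out events and serves reads from a key-value view. */
class MemberOptOutService {
    // Hypothetical topic for standardized third-party opt-out events.
    private static final String TOPIC = "member-thirdparty-optout";

    private final Producer<String, String> producer;
    // In-memory stand-in for the Venice store the Kafka topic materializes into.
    private final Map<String, Boolean> optOutView = new ConcurrentHashMap<>();

    /** kafkaConf must supply bootstrap.servers and key/value serializers. */
    MemberOptOutService(Properties kafkaConf) {
        this.producer = new KafkaProducer<>(kafkaConf);
    }

    /** Called by the REST layer when a member opts out of a third-party integration. */
    public void optOut(String memberId, String partnerId) {
        String key = memberId + ":" + partnerId;
        // Publish the event; downstream ingestion updates the serving store.
        producer.send(new ProducerRecord<>(TOPIC, key, "OPT_OUT"));
        optOutView.put(key, true);  // local view, for illustration only
    }

    /** Called by the REST layer before any data is shared with a partner. */
    public boolean isOptedOut(String memberId, String partnerId) {
        return optOutView.getOrDefault(memberId + ":" + partnerId, false);
    }
}
```

Routing every opt-out through a Kafka event gives one durable, replayable source of truth, while the key-value view keeps the read path fast enough to consult on every third-party data exchange.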