New Analytics for Sharing on LinkedIn: See Who’s Viewed Your Post
November 1, 2016
Coauthor: Bharath Kumarasubramanian
If you’re sharing content on LinkedIn, you’re positioning yourself as a thought leader among the largest group of professionals assembled on the web. With more than 460 million members, LinkedIn is the place you want to be to share your expertise and discuss issues that matter to you with other like-minded professionals. But, how do you know what is resonating and who is listening to what you have to say? How are you reaching them?
To help answer these questions, we’re excited to introduce new insights on mobile that make it easier to understand who is seeing your posts. These analytics provide insights into who’s reading and engaging with the posts you’ve shared or articles you’ve written. With this knowledge, you’ll be able to evaluate whether you’re reaching the right audience, which shares are resonating with readers, and more.
Content analytics detail page
Overview and challenges
At the most basic level, the client fires tracking events based on member actions: an impression event when content appears in the member’s feed, a gesture event when the member comments or shares, and so on. But before we dive into the details of what we do with these events, it’s important to have a high-level understanding of the challenges of implementing this kind of a content analytics system.
One of the fundamental issues with any analytics system is creating a scalable, low-latency pipeline for capturing events. Fortunately for us, LinkedIn has already embraced the principle of tracking as a product requirement right from feature inception. This allows us to take advantage of prior data infrastructure investments to support our content analytics pipeline.
Similar challenges arise when accessing these events and displaying them to content authors at scale, enriched with contextual information from the LinkedIn Knowledge Graph.
We’ll discuss how we solved both of these challenges later in the blog post.
As you can see from the figure above, the analytics detail page is divided into multiple modules: header, highlights carousel, network reach module, and suggested articles (those last three modules are called “social update analytics”). For the Share Analytics page, we created two new endpoints in our frontend API service, which we call Voyager-API: one for fetching only the header, and one for fetching the rest of the modules. The header contains the article title, article image, and counts for views, likes, and comments.
The client does a Rest.li multiplexer call to fetch both the header and collection of social update analytics.
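To make the batching idea concrete, here is a minimal Python sketch of a multiplexed fetch: the two sub-requests travel together and come back keyed by request id, instead of costing two round trips. The endpoint payloads and the `multiplex` helper are illustrative assumptions, not the actual Voyager-API or Rest.li multiplexer implementation.

```python
# Hypothetical sketch of a multiplexer call: batch several sub-requests
# into one call and return their responses keyed by request id.
def multiplex(requests):
    """Execute each sub-request handler and collect responses by id."""
    return {req_id: handler() for req_id, handler in requests.items()}

def fetch_header():
    # Stand-in for the header endpoint: title, image, and counts.
    return {"title": "My Article", "views": 120, "likes": 8, "comments": 2}

def fetch_social_update_analytics():
    # Stand-in for the endpoint serving the remaining modules.
    return {"modules": ["highlights", "networkReach", "suggestedArticles"]}

responses = multiplex({
    "header": fetch_header,
    "analytics": fetch_social_update_analytics,
})
```

The client then renders the header as soon as its sub-response is available, without waiting on the heavier analytics modules.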
To analyze members’ behavior and the success metrics (such as Engaged Feed Sessions, EFS), we also added custom tracking. On the API side, we fire a data-served event through Kafka every time the API gets a client request. Later, this event is joined with the impression and action events fired from clients.
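The join described above can be sketched as matching server-side and client-side events on a shared tracking id. The field names (`trackingId`, `memberId`, `durationMs`) are assumptions for illustration, not LinkedIn’s actual event schemas.

```python
# Illustrative join of server-side "data served" events with client-side
# impression events, keyed on a shared tracking id.
served = [
    {"trackingId": "t1", "memberId": 101},
    {"trackingId": "t2", "memberId": 102},  # served but never seen
]
impressions = [
    {"trackingId": "t1", "durationMs": 1800},
]

served_by_id = {e["trackingId"]: e for e in served}

# Only impressions that match a served event survive the join.
joined = [
    {**served_by_id[imp["trackingId"]], **imp}
    for imp in impressions
    if imp["trackingId"] in served_by_id
]
```

Served events with no matching impression (like `t2` above) are exactly the signal for content that was delivered but not actually viewed.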
As an analytics platform, providing real-time analytics is one of our foremost tenets. With the data available in real time from Kafka, we need a system that can process streams of data and perform aggregation and decoration before making the data available for clients to consume. In order to handle the stream of data, we leverage one of our own in-house systems, Samza, which is a distributed, scalable stream processing framework.
The Samza job performs the following steps:
- Consume the relevant input Kafka events: ImpressionEvents, LikeEvents, CommentEvents, ShareEvents.
- Decoration phase: For each input event, we make server calls to fetch demographic data, e.g., the actor’s company and industry.
- Joining phase: In some scenarios, we need to associate the input tracking event with the actual entity; for example, a like event is usually associated with an activity, which could represent various entities (a share, an article, etc.). Since our system is not the source of truth for this data, we either make remote calls or store some minimal state within our system in order to perform the join.
- Output: Once the system has all the data in place, we generate the appropriate output Kafka message, which is in turn consumed by the data store.
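The phases above can be sketched as a small pipeline of pure functions. The event shapes, the in-memory demographic lookup, and the activity table are invented for illustration; the real job runs as a Samza task over Kafka streams.

```python
# Stand-in for a remote demographic lookup used in the decoration phase.
DEMOGRAPHICS = {101: {"company": "Acme", "industry": "Software"}}

def decorate(event):
    # Decoration phase: attach the actor's demographic data.
    return {**event, **DEMOGRAPHICS.get(event["actorId"], {})}

def join_entity(event, activities):
    # Joining phase: resolve the activity id into its underlying entity.
    return {**event, "entity": activities.get(event["activityId"])}

def process(event, activities):
    # Output: the message that would be written to the output Kafka topic.
    return join_entity(decorate(event), activities)

out = process(
    {"type": "LikeEvent", "actorId": 101, "activityId": "a1"},
    activities={"a1": {"kind": "article", "id": "article-42"}},
)
```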
We had some unique challenges with the joining phase operating in real time. For certain use cases, we had to make downstream calls to fetch entity information, which added to processing time and put significant load on the downstream service. Instead, we chose to persist entity state inside each container using RocksDB. With this approach, we could split the data across different Samza containers and have the stream route events based on the relevant partitioning key.
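The core idea is that events and entity state are partitioned by the same key, so each container can join against its own local (RocksDB-backed) store without a remote call. A minimal sketch, with the partition count and plain dicts standing in for the real stores:

```python
NUM_PARTITIONS = 4
# One local key-value store per "container" (RocksDB in the real system).
stores = [dict() for _ in range(NUM_PARTITIONS)]

def partition_for(key):
    # The stream routes events by this key, so an event and the entity it
    # references always land on the same container.
    return hash(key) % NUM_PARTITIONS

def put_entity(key, entity):
    stores[partition_for(key)][key] = entity

def lookup(key):
    # The join reads only the store co-located with the event's partition.
    return stores[partition_for(key)].get(key)

put_entity("a1", {"kind": "share"})
```

Because the routing key is deterministic, a like event for activity `a1` is guaranteed to be processed by the container that holds `a1`’s entity state.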
In the offline flow, we consume the related events from the Hadoop Distributed File System (HDFS) and perform processing similar to what occurs in real time. The offline system is analogous to the batch layer in a lambda architecture. Typically, it handles deduplication, spam filtering, and other logic that is expensive to implement in real time. We also restructure the data to make slicing and dicing efficient for the data store. Another benefit of these jobs is that, in the case of data loss or bugs in the data processing pipeline, they come in handy for bootstrapping the data with ease.
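Deduplication is a good example of logic that is cheap in batch but expensive in streaming. A sketch of the batch version, keeping the first occurrence per key; the choice of key fields is an assumption for illustration:

```python
def dedupe(events):
    # Keep one event per (memberId, itemId, trackingId), dropping
    # duplicate deliveries; trivial over a full day's data in batch.
    seen, out = set(), []
    for e in events:
        key = (e["memberId"], e["itemId"], e["trackingId"])
        if key not in seen:
            seen.add(key)
            out.append(e)
    return out

raw = [
    {"memberId": 1, "itemId": "a1", "trackingId": "t1"},
    {"memberId": 1, "itemId": "a1", "trackingId": "t1"},  # duplicate delivery
    {"memberId": 2, "itemId": "a1", "trackingId": "t2"},
]
deduped = dedupe(raw)
```

In the streaming path the same guarantee would require keeping all seen keys in state indefinitely, which is why it lands in the offline flow.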
During our initial design, we chose to have independent business logic in both the offline and real-time systems. Over the course of development, however, data consistency, code duplication, and maintenance became concerns. The revised approach of sharing code between systems using a library helped us overcome the maintenance overhead and also avoid code duplication.
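The shared-library approach can be sketched as putting the aggregation logic in one function that both flows call, so the two paths cannot drift apart. The event fields and the aggregation itself are illustrative assumptions:

```python
from collections import Counter

def count_views_by_industry(events):
    # Shared business logic: aggregate view events by the actor's industry.
    # Living in one library, it is imported by both flows.
    return Counter(e["industry"] for e in events if e["type"] == "ImpressionEvent")

def realtime_flow(stream_events):
    # In practice, invoked incrementally from the Samza task.
    return count_views_by_industry(stream_events)

def offline_flow(hdfs_rows):
    # In practice, invoked over the full day's data in the Hadoop job.
    return count_views_by_industry(hdfs_rows)

events = [
    {"type": "ImpressionEvent", "industry": "Software"},
    {"type": "ImpressionEvent", "industry": "Finance"},
    {"type": "LikeEvent", "industry": "Software"},  # not a view; excluded
    {"type": "ImpressionEvent", "industry": "Software"},
]
```

Given the same input, both flows now produce identical results by construction, which was the consistency property the original two-codebase design kept violating.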
Our mid-tier API system is responsible for serving data to clients. The service is built using Rest.li, our in-house REST + JSON framework. It offers a richly typed interface for clients to query analytics data based on various facets and filters. Typical client use cases slice the data across time and faceting dimensions like the actor’s industry or company. The service consists of an API layer and a core layer.
The API layer is responsible for handling specific business logic and defines strongly typed APIs that are leveraged by clients. It hosts business logic for constructing the query based on the client request and dispatching it to the core layer. It is also responsible for processing the results and building the appropriate data projections.
The core layer abstracts the underlying data store components. It offers a type-safe, fluent interface to construct a query that follows SQL-like grammar. The generic query from the data type layer is translated to a data store-specific query and then dispatched to the data store-specific executor. Adding a new data store support involves the one-time effort of adding executors and specific query formatters.
API layer: Handles the incoming request from clients; builds the generic query and dispatches it to the core layer.
Core layer: Forwards the generic query to the appropriate data store-specific formatter and delegates the query execution to the data store-specific executor.
Core layer: Translates the results from the data store into generic results and hands them off.
API layer: Processes the results and translates them to strongly-typed models that are returned to clients.
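The request flow above can be sketched with a tiny fluent query builder standing in for the core layer’s generic query, plus a store-specific formatter and executor. The SQL rendering, the in-memory executor, and all field names are illustrative stand-ins, not the actual framework:

```python
class Query:
    # Core layer: generic, fluent query builder (SQL-like grammar).
    def __init__(self):
        self.metric, self.facet, self.filters = None, None, {}
    def select(self, metric):
        self.metric = metric; return self
    def group_by(self, facet):
        self.facet = facet; return self
    def where(self, **kv):
        self.filters.update(kv); return self

def to_sql(q):
    # Store-specific formatter: render the generic query as SQL text.
    where = " AND ".join(f"{k} = '{v}'" for k, v in q.filters.items())
    return (f"SELECT {q.facet}, SUM({q.metric}) FROM events "
            f"WHERE {where} GROUP BY {q.facet}")

def execute(q, rows):
    # Store-specific executor stand-in: run the query over in-memory rows.
    out = {}
    for r in rows:
        if all(r.get(k) == v for k, v in q.filters.items()):
            out[r[q.facet]] = out.get(r[q.facet], 0) + r[q.metric]
    return out

# API layer: builds the generic query from the client request...
q = Query().select("views").group_by("industry").where(itemId="article-42")

rows = [
    {"itemId": "article-42", "industry": "Software", "views": 3},
    {"itemId": "article-42", "industry": "Finance", "views": 1},
    {"itemId": "other", "industry": "Software", "views": 5},
]
# ...and the core layer formats and executes it against the store.
result = execute(q, rows)
```

Because only `to_sql` and `execute` know about the store, supporting a new data store is the one-time cost of writing a new formatter/executor pair, as described above.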
LinkedIn audience demographics provide a way for all of our members to gather insights about the posts they have shared and articles they have written on the platform. The analytics platform we have created also opens up the avenues for tracking various other user-generated content like articles, videos, etc. We envision a future where there is a single centralized hub for the members to gather insights about their content. As engineers, we are constantly working on improvements to our architecture, iterating and evolving the platform to support broader classes of analytics.
We would like to thank each and every one of our team members for making this project a huge success, with a special callout to management for their continual support.