Privacy-preserving analytics and reporting at LinkedIn
April 10, 2019
Preserving privacy of users is a key requirement of web-scale data mining applications and systems such as web search, recommender systems, crowdsourced platforms, and analytics applications. With the growing appreciation of the impact of data breaches and comprehensive data regulations, such as GDPR, user privacy has witnessed a renewed focus. At LinkedIn, we focus on the problem of computing robust, reliable analytics in a privacy-protective manner while satisfying analytics feature requirements. In this post, we share more information on PriPeARL, a framework for privacy-preserving analytics and reporting. We describe the overall design and architecture of the framework and the key modeling components, focusing on the unique challenges associated with privacy, coverage, utility, and consistency.
Our approach to protecting member privacy through PriPeARL makes use of random noise addition inspired by differential privacy, wherein the underlying intuition is that the addition of a small amount of appropriate noise makes it harder for a bad actor to reliably infer whether any specific member performed a private action or not. Our system incorporates techniques such as deterministic pseudorandom noise generation to address certain limitations of standard differential privacy and performs post-processing to achieve data consistency. In this post, we’ll present an experimental study in the context of analytics and reporting at LinkedIn that showcases the tradeoffs between privacy and utility needs, and the applicability of privacy-preserving mechanisms to real-world data. Additionally, we’ll highlight the lessons learned from the production deployment of our system at LinkedIn.
Privacy and security “by design” is a practice adopted by many companies who need to process personal data of individuals. As part of their products, online social media and web platforms typically provide different types of analytics and reporting to their users. For example, LinkedIn provides several such applications for our members as well as customers, such as content analytics (aggregated demographics of members that viewed a content creator's article or post), profile view statistics (statistics of who viewed a member's profile, aggregated along dimensions such as profession and industry), ad analytics (key campaign performance metrics along different demographic dimensions), and InMail insights (performance statistics on InMails sent by recruiters).
These analytics applications are clear cases for practicing privacy and security by design since they are areas where members are taking “private actions” that could be considered sensitive if attributed to a specific individual. We use the term “private actions” for those online actions a user takes (or that are observed by the service provider) that should not be made public or visible to others in most cases (e.g., reading or searching for specific content, clicking on an article or an ad). In social media, these private actions are in contrast to “social actions” such as posting content, liking content, sharing content, or commenting on content, because these social actions are understood by the user as being visible to other users. We strive to ensure that any one private action (e.g., clicking on an article or an ad) will not be attributable or inferred to an individual person by anyone who is observing the results of the analytics system. At the same time, we work to make the analytics and reporting features for the associated product viable and usable.
How we describe “privacy by design” in this post
We employ a variety of privacy by design and data minimization techniques at LinkedIn. It starts with allowing members to review and delete or change “social actions” and access their own “private action” data through download tools (providing members this control has been a core feature of LinkedIn’s platform for many years). It also includes internal access controls to data, de-identification of data, and deletion of data. Where we refer to privacy protecting techniques applied to data held by us in this blog post, we refer to data that has not been deleted by the member, is accessible for this use under our policies, and has not been purged in accordance with LinkedIn deletion policies.
Analytics and reporting at LinkedIn
A key characteristic of analytics and reporting platforms at LinkedIn is that they admit only a small number of predetermined query types as part of their user interface and associated APIs. Our analytics platforms allow querying for the number of member actions, for a specified time period, together with the top demographic breakdowns. We can abstractly represent the underlying database query as “select the number of member actions (e.g., clicks on a given article) performed by members from a demographic sub-group (e.g., Senior Directors) over a specific time range (e.g., from Oct. 2018 to Dec. 2018).”
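Concretely, such a query can be represented as a small structured object. The sketch below is purely illustrative (the class and field names are ours, not LinkedIn's API):

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class StatsQuery:
    """Hypothetical representation of a predetermined analytics query."""
    action: str       # member action type, e.g. "click"
    entity_id: str    # e.g. an article or ad id
    attribute: str    # demographic dimension, e.g. "title"
    value: str        # dimension value, e.g. "Senior Director"
    start: date       # inclusive start of the time range
    end: date         # inclusive end of the time range

# "Clicks on article 42 by Senior Directors from Oct. 2018 to Dec. 2018."
query = StatsQuery("click", "article:42", "title", "Senior Director",
                   date(2018, 10, 1), date(2018, 12, 31))
```

Making the query object immutable and hashable is convenient later, when the same query must deterministically map to the same noise value.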
We take a privacy by design approach to this reporting. Our goal is to ensure that a bad actor cannot infer whether a member performed an action (e.g., clicked on an article or an ad) by observing the results shown by the analytics and reporting system, possibly over time. We assume that a bad actor may have knowledge of attributes associated with the target member (e.g., obtained from this member's LinkedIn profile or even from sources outside of LinkedIn services) as well as knowledge of all other members that performed similar actions. With these considerations in mind, we would like to apply rigorous techniques for protecting member privacy in analytics applications. However, we still desire utility and data consistency, which we discuss next.
Coverage and utility: It is desirable for the aggregate statistics to be available and reasonably accurate for as many action types, entities, demographic attribute/value combinations, and time ranges as possible for the analytics and reporting applications to be viable and useful.
Data consistency: Ideally, an analytics platform needs to be consistent across the data it reports, even if it can't display true counts due to privacy requirements. Consistency is valuable for the end user in numerous situations: across repeated queries, over time, between totals and breakdowns, across entity/action hierarchies, and for top-k queries. Please refer to our ACM CIKM 2018 paper for a discussion.
Our problem can thus be stated as follows: How do we compute robust, reliable analytics in a privacy-preserving manner, while addressing the desiderata such as coverage, utility, and consistency? How do we design the analytics computation system to meet the needs of LinkedIn products?
Privacy model and algorithms
Our approach modifies the reported aggregate counts using a random noise addition mechanism, inspired by differential privacy. Differential privacy is a formal guarantee for preserving the privacy of any individual when releasing aggregate statistical information about a set of people. In a nutshell, the differential privacy definition requires that the probability distribution of the released results be nearly the same irrespective of whether an individual's data is included as part of the dataset. As a result, upon seeing a published statistic, a bad actor would gain very little additional knowledge about any specific individual.
Formally, this guarantee is achieved by adding appropriate noise (e.g., from Laplace distribution) to the true answer of a statistical query function (e.g., the number of members that clicked on an article, or the histogram of titles of members that clicked on an article), and releasing the noisy answer. The magnitude of the noise to be added depends on the sensitivity of the query (namely, the upper bound on the extent to which the query output can change, e.g., when a member is added to or removed from the dataset), and the desired level of privacy guarantee. For our application setting, we adopt event-level differential privacy, in which the privacy goal is to hide the presence or absence of a single event, that is, any one action of any member.
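As a minimal sketch (not LinkedIn's implementation), the standard randomized Laplace mechanism for a counting query looks like the following. Here the Laplace noise is sampled as the difference of two exponential variates, and a sensitivity of 1 reflects that adding or removing one event changes a count by at most one:

```python
import random

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Release true_answer plus Laplace(0, b) noise, with scale b = sensitivity / epsilon."""
    b = sensitivity / epsilon
    # The difference of two independent Exp(1) variates is Laplace(0, 1).
    noise = b * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_answer + noise

# Event-level privacy for a count: sensitivity = 1.
noisy_clicks = laplace_mechanism(1042, sensitivity=1.0, epsilon=0.5)
```

A smaller epsilon (a stronger privacy guarantee) means a larger scale b and therefore noisier answers.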
We next describe our approach for adding appropriate random noise to demographic-level analytics, and for performing post-processing to achieve different levels of consistency. First, we’ll look at an algorithm for generating pseudorandom rounded noise from Laplace distribution for a given query, followed by an algorithm for computing noisy counts for certain canonical queries, and finally the main algorithm for privacy-preserving analytics computation, which builds on the first two algorithms.
Pseudorandom Laplace noise generation
A key limitation of the standard differential privacy approach is that the random noise can be removed by issuing the same query many times and computing the average of the answers (an "averaging attack"). For this reason, and also to ensure consistency of the answer when the same query is repeated (e.g., the user returning to check the analytics dashboard with the same filtering criteria), we chose to use a deterministic, pseudorandom noise generation algorithm. The idea is that the noise value chosen for a query is fixed to that query: the same noise is assigned whenever the same query is repeated.
Given the statistical query, the desired privacy parameter, and a fixed secret seed parameter, we generate a (fixed) pseudorandom rounded noise from the appropriate Laplace distribution as follows. First, the secret and the query parameters are given as input to a deterministic function that returns a pseudorandom fraction between 0 and 1. Treating this obtained fraction as sampled from the uniform distribution on (0,1), we apply the inverse cumulative distribution function (cdf) for the appropriate Laplace distribution to get the pseudorandom noise. Finally, we round the noise to the nearest integer, since it is desirable for the reported noisy counts to be integers.
Canonical noisy count computation
We compute the noisy count for a canonical statistical query as follows. A canonical query takes the form shown earlier, wherein the startTime and the endTime are constrained to map to an atomic time range (discussed below). The algorithm computes the true answer for the query by probing the underlying database, and then adds fixed, rounded, pseudorandom Laplace noise. If the noisy answer is negative, a count of zero is reported instead. This ensures consistency over time: a query over a longer time range is broken into canonical queries whose results are summed, so the combined action counts can never decrease as the time range grows.
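The two steps above, deterministic pseudorandom noise generation and the clamped canonical count, can be sketched as follows. This is an illustrative reconstruction under our own assumptions (HMAC-SHA256 as the deterministic function, sensitivity 1), not the production code:

```python
import hashlib
import hmac
import math

def pseudorandom_laplace_noise(secret: bytes, query_key: str, epsilon: float) -> int:
    """Map (secret, query) deterministically to a rounded Laplace(0, 1/epsilon) sample."""
    digest = hmac.new(secret, query_key.encode("utf-8"), hashlib.sha256).digest()
    # Treat the first 8 bytes as a pseudorandom fraction strictly inside (0, 1).
    u = (int.from_bytes(digest[:8], "big") + 0.5) / 2.0**64
    # Inverse CDF of Laplace(0, b), with b = sensitivity / epsilon (sensitivity 1 here).
    b = 1.0 / epsilon
    noise = -b * math.copysign(1.0, u - 0.5) * math.log(1.0 - 2.0 * abs(u - 0.5))
    return round(noise)

def canonical_noisy_count(true_count: int, secret: bytes, query_key: str,
                          epsilon: float) -> int:
    """Noisy count for an atomic time range, clamped at zero."""
    return max(0, true_count + pseudorandom_laplace_noise(secret, query_key, epsilon))
```

Because the noise is a pure function of the secret and the query, repeating the same query returns the same answer, which defeats the averaging attack.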
Privacy-preserving analytics computation
We next discuss our main algorithm for computing privacy-preserving analytics. This algorithm computes the answer to an arbitrary statistical query of the form shown earlier, achieving a balance between privacy, consistency, and utility needs.
If the entity is at a broader level and consists of very few children entities (e.g., an ad campaign id that corresponds to just one or two ads), we would like to provide consistency across entity hierarchy. The underlying rationale is that discrepancy between the reported statistics for parent and children entities would cause a poor experience in extreme cases of the parent containing just one child or very few children. For instance, a user who creates a campaign with just one ad may even perceive such discrepancy as a bug. However, this difference becomes less pertinent as the number of children increases. Hence, our algorithm recursively sums up the noisy counts for the children entities when the number of children is less than or equal to the entity hierarchy consistency parameter (given as input). As the value of this parameter increases, the extent of consistency increases, although at the cost of reduced utility (noise with larger variance is added) and increased latency (due to a fan-out of recursive calls, possibly across multiple levels in the entity hierarchy).
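A sketch of this recursion, under our own simplified assumptions (a dict-based hierarchy and a pluggable per-entity noisy-count function; all names here are hypothetical):

```python
def hierarchical_noisy_count(entity, children_of, entity_noisy_count, max_children):
    """Sum the children's noisy counts when there are few children (consistency);
    otherwise add noise directly at this entity's level (utility)."""
    children = children_of.get(entity, [])
    if children and len(children) <= max_children:
        return sum(hierarchical_noisy_count(child, children_of,
                                            entity_noisy_count, max_children)
                   for child in children)
    return entity_noisy_count(entity)

# A campaign with a single ad: the reported campaign total equals the ad's
# noisy count, so a customer never sees a parent/child discrepancy.
children_of = {"campaign:1": ["ad:1"]}
noisy = {"campaign:1": 9, "ad:1": 7}
total = hierarchical_noisy_count("campaign:1", children_of, noisy.get, max_children=3)
```

With `max_children=0` the recursion never fires, and each entity gets its own independently noised count; larger values buy more consistency at the cost of summed noise variance and recursive fan-out.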
Our algorithm similarly provides consistency across various time granularities as follows. Given a fixed hierarchy of time ranges (e.g., day, month, quarter), we define an atomic time range to be a range that exactly maps to some level in the hierarchy. The algorithm first partitions the input time range into a minimal set of atomic time ranges, obtains the noisy counts for each atomic time range using the canonical noisy count computation algorithm, and then computes their sum.
The above partition-based approach is chosen for the following reasons. First, it provides better utility for longer time range queries (by adding noise with smaller variance), although it partially sacrifices consistency over time. Second, it helps with privacy by bounding the number of time range queries involving a given member action event. For a given demographic attribute/value predicate and entity, each event would be part of at most as many queries as levels in the time range hierarchy. Finally, our partitioning approach protects against the time range split-based averaging privacy attack in which a bad actor could average out the noise for a given time range query by splitting the time range into two halves at different intermediate points, obtaining the sum of the noisy counts for each such pair of queries, and then averaging these sums to reliably estimate the true count for the original time range query.
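As an illustration of the partitioning step, here is a simplified two-level hierarchy (whole calendar months, then days); the real system uses its own fixed hierarchy of time ranges:

```python
from datetime import date, timedelta

def partition_into_atomic_ranges(start: date, end: date):
    """Greedily cover [start, end] with whole months, falling back to single days."""
    ranges = []
    cur = start
    while cur <= end:
        if cur.day == 1:
            # Last day of cur's month: jump into the next month, then step back a day.
            month_end = (cur.replace(day=28) + timedelta(days=4)).replace(day=1) \
                        - timedelta(days=1)
            if month_end <= end:
                ranges.append(("month", cur, month_end))
                cur = month_end + timedelta(days=1)
                continue
        ranges.append(("day", cur, cur))
        cur += timedelta(days=1)
    return ranges

# Q4 2018 splits into exactly three month-level atomic ranges, so the noisy
# answer is the sum of three canonical noisy counts.
parts = partition_into_atomic_ranges(date(2018, 10, 1), date(2018, 12, 31))
```

Under this scheme, splitting a range at an arbitrary intermediate point reuses the same atomic sub-ranges and hence the same noise, so pairs of split queries cannot be averaged to cancel it out.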
Please refer to our ACM CIKM 2018 paper for a detailed technical description of the above algorithms.
Privacy-preserving analytics system design and architecture
Our system consists of an online component that provides an interactive interface (or API) for presenting different types of analytics about member actions, and an offline/nearline component for generating the necessary tables containing the granular member actions (events).
Analytics computation workflows: The tables needed for computing different analytics about member actions are stored as part of LinkedIn's Pinot system. Pinot is a real-time distributed OLAP datastore, designed to deliver real-time analytics with low latency in a scalable manner. Two workflows are set up to process raw tracking data (in the form of various Kafka events generated from the member-facing application, such as impressions, as well as actions, including clicks) and compute partially aggregated analytics, which are combined and ingested into the Pinot system. The first workflow is run daily in an offline fashion to compute the analytics for the previous day and persist them to the datastore. The second workflow is run every few hours in an "online" fashion to compute intra-day incremental changes. The data pushed to Pinot contains several key fields, including the granular time range, entity (and its hierarchy, if applicable), demographic attribute and value, and impression counts and different types of action counts. This data is fine-grained and is aggregated according to the analytics request arising from the web interface or the API.
Online system for presenting analytics: Our online system uses a service-oriented architecture for retrieving and presenting privacy-preserving analytics about member actions in response to a request from the user-facing product (web interface) or API. The request is first issued to the "Query Interface" component, which processes and transforms it into underlying database queries and passes them to the "Privacy Mechanism" component. This component obtains the true answers to these queries by probing the Pinot system, generates the appropriate noise, performs post-processing consistency checks following the algorithm described earlier, and returns the set of noisy counts, which are presented to the calling service.
Please refer to our ACM CIKM 2018 paper for a study of the empirical tradeoffs between privacy and utility needs over a web-scale ad analytics dataset.
Lessons learned in practice
We next present the challenges encountered and the lessons learned through the production deployment of our privacy-preserving analytics computation system as part of the LinkedIn Ad Analytics and Reporting platform for more than one year.
We needed to take into account key business and usability factors when deploying the privacy-preserving mechanisms to the ad analytics application. These included:
Semantic consistency vs. unbiased, unrounded noise: The Laplace mechanism for satisfying differential privacy involves noise drawn from a continuous distribution with a mean of zero, so that the expectation of the noisy count would be the same as the true count. However, without any post-processing, the noisy counts could be negative, while the users would expect to see cardinal numbers for counts. Although it would be ideal to add unbiased (i.e., with mean of zero) and unrounded noise to the true counts, we chose to round the noise, and also cap the reported noisy count to be at least zero, to avoid confusion for the end users.
Consistency vs. utility trade-off: In the initial version, we computed daily noisy counts as part of an offline workflow and persisted them to storage. We then aggregated these counts whenever the analytics were requested over a multi-day range. This implementation has some advantages, including simplicity, ease of debugging and maintenance, and consistency. For instance, we could retrieve the stored noisy count values for debugging any frontend issues or investigating problems reported by users. The computation cost is incurred once as part of the offline workflow, instead of adding potential latency at query time. Also, the query results are consistent across different time ranges. However, the main disadvantage of this approach is that the variance of the added noise increases with the time range, making the noisy count less reliable for queries over a longer time range, such as one month or one quarter. In our application, users are typically interested in the performance of their campaigns over their lifetime, rather than just on individual days. Given such usage behavior, we decided to move to an online implementation, with a hierarchical approach for time range queries (e.g., day, month, quarter, ...), trading consistency for better utility. Further, the analytics for the current day are computed over completed, discrete time epochs to prevent averaging attacks.
Suppression of small counts: The analytics and reporting applications at LinkedIn involve reporting of aggregate statistics over pre-determined and known sets of attribute values such as the set of standardized titles, the set of companies, and the set of regions. In other words, revealing the attribute value (e.g., “Research Scientist”) as part of the reported analytics does not violate the privacy of a member. This is in contrast to applications such as releasing search queries, wherein the search query itself may be private, and hence it is not desirable to reveal queries with small counts.
However, we chose to suppress small counts for the following reason. The relative distortion due to the added noise will be large for attribute values with small counts, and hence the reported counts may not have sufficient validity for the purposes of observing patterns and making inferences.
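A minimal sketch of such suppression (the threshold value and function name are hypothetical):

```python
def suppress_small_counts(noisy_breakdown: dict, min_count: int) -> dict:
    """Drop attribute values whose noisy count is too distorted to be meaningful."""
    return {value: count for value, count in noisy_breakdown.items()
            if count >= min_count}

# A breakdown by (hypothetical) standardized title, after noise addition.
breakdown = {"Research Scientist": 42, "Data Engineer": 2, "Product Manager": 17}
visible = suppress_small_counts(breakdown, min_count=5)
```

Here the threshold is a utility judgment, not a privacy requirement: the attribute values themselves come from known, public taxonomies, so suppression only hides counts whose relative noise distortion is too large to support inference.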
Online computation and performance requirements
The implementation of the online pipeline to compute noisy analytics on the fly imposes strict latency requirements. We went through a few iterations to optimize the noise computation code for performance. For instance, we implemented the function for generating pseudorandom noise by applying an efficient hash function and then scaling to the (0, 1) range using optimized floating point computations (e.g., using the double type instead of BigDecimal). These modifications reduced latency significantly, resulting in a responsive user experience when interacting with the analytics web interface.
Scaling across analytics applications
We first developed this privacy-preserving approach in the context of the LinkedIn Ad Analytics and Reporting platform. This platform is intended for ad customers to see how their ad campaigns are performing in terms of response. In the course of refining our approaches to suit the need to protect private actions and requirements to provide customers a useful service, we chose to design our system to be broadly applicable and scalable across various analytics applications at LinkedIn. Consequently, we abstracted the application-independent functionality into a stand-alone library so that this library could be invoked as part of other analytics or reporting features across LinkedIn’s business. For example, the functions for generating pseudorandom noise given query parameters are not specific to any one application, and hence are included as part of this library.
A broad direction for future work would be to create a taxonomy of web-scale analytics and reporting applications, and study the applicability of different privacy approaches (e.g., interactive querying mechanisms vs. data perturbation and publishing methods) for each class of applications in the taxonomy. Another direction would be to associate a notion of utility with the availability/correctness of different types of statistics, and to formulate this as a utility maximization problem given constraints on the "privacy loss budget" per user. For example, we could explore adding more noise (i.e., noise with larger variance) to impressions but less noise to clicks (or conversions), since the number of impressions is at least an order of magnitude larger than, say, the number of clicks. Similarly, we could explore adding more noise to broader time range sub-queries and less noise to granular time range sub-queries, with an eye towards maximizing utility. More broadly, we are investigating hard problems such as how to preserve privacy for machine learning and data applications.
This blog post is based on the ACM CIKM 2018 paper (slides) co-authored by Krishnaram Kenthapadi and Thanh Tran. We would like to thank all members of the LinkedIn Ad Analytics and Reporting team for their collaboration in deploying our system as part of the launched product, and Deepak Agarwal, Taylor Greason, Sara Harrington, Sharon Lee, Igor Perisic, Rohit Pitke, Kalinda Raina, Arun Swami, Ya Xu, and Yang Zhou for insightful feedback and discussions.