Privacy Preserving Single Post Analytics

Authors: Ryan Rogers, Subbu Subramaniam, Lin Xu

Contributors: Mark Cesar, Praveen Chaganlal, Xinlin Zhou, Jefferson Lai, Jennifer Li, Stephanie Chung, Margaret Taormina, Gavin Uathavikul, Laura Chen, Rahul Tandra, Siyao Sun, Vinyas Maddi, Shuai Zhang

Content creators post on LinkedIn with the goal of reaching and engaging specific audiences. Post analytics helps creators measure their post performance overall and with specific viewer demographics, so they can better understand what resonates and refine their content strategies. The number of impressions on each post shows how many total views the post has received. Demographic information helps members understand their audience and content performance. However, because the identities of post viewers are not made visible to post authors, there is a need to ensure that demographic information about viewers does not allow post authors to identify specific members that have viewed the post.

A natural approach to ensuring analytics do not reveal the identity of a viewer is to only show the post author aggregates by certain demographics: company, job title, location, industry, and company size. For example, the post author can only see the top job titles of viewers along with the corresponding percentage of unique viewers. However, a post author attempting to re-identify post viewers might monitor post analytics such as top job titles, companies, and locations as they are updated in real time and attempt to deduce the identity of a member who viewed the post. Hence, after each new view on a post, the post author might be able to deduce the job title, company, and location of the member that just viewed the post.

We wanted to better understand whether it was possible to identify members from these analytics, based on changes in the demographic view counts. We estimate that it is possible for a bad actor to uniquely identify more than a third of weekly active members with three attributes: company, job title, and location. While we are not aware of such an attack occurring on LinkedIn, we are constantly looking for ways to protect our members from increasingly sophisticated attacks by bad actors. Furthermore, when providing only top-20 results in each demographic, we found that it is possible to identify roughly 9% of the initial viewers on a sample of posts. Changing top-20 to only top-5 results dropped this identifiability risk from 9% to less than 2%, but we wanted to reduce this risk even further. This then posed a challenging question: how can we protect the privacy of the viewers of posts while still providing useful post analytics to the post author in real time?

We will detail our approach to adding even more safeguards for viewer privacy on post analytics, which is the result of a collaboration across multiple teams at LinkedIn. We are excited to announce the various contributions we have made to provide a privacy-by-design approach to measuring and mitigating reidentification risks. This problem poses an interesting juxtaposition between post viewers and post authors that view post analytics: members can view posts that are not theirs, and those same members can then view analytics on their own posts. Hence, we want to safeguard the privacy of members when they view posts while also providing useful analytics about viewers on their own posts. We will start with a general overview of differential privacy, the gold standard for enforcing privacy in data analytics, which we adopt in post analytics. Next, we introduce our privacy service, PEDAL, which works in conjunction with existing infrastructure, such as Pinot, to enable private analytics at scale. Lastly, we discuss the analytics platform at LinkedIn, LEIA, which enables PEDAL to scale across multiple analytics products.

Differential Privacy

To provide a way to safeguard the privacy of viewers while maintaining utility for post authors, we wanted to introduce differential privacy to these analytics. Differential privacy has emerged as the go-to privacy enhancement solution for analytics and machine learning tasks. We say that an algorithm is differentially private if any result of the algorithm cannot depend too much on any single data record in a dataset. Hence, differential privacy provides uncertainty about whether or not an individual data record is in the dataset. To achieve this uncertainty, differential privacy introduces carefully calibrated noise to the results. Differential privacy also quantifies the privacy risk with a parameter, typically referred to as epsilon: the higher the epsilon, the more privacy risk the result can carry. Hence, differential privacy treats privacy risk as a spectrum, rather than a binary choice between there being privacy risk or not.
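
As a minimal illustration of how epsilon calibrates the noise (a sketch, not PEDAL's production code), a single count with sensitivity 1 can be released by adding Laplace noise with scale 1/epsilon. Our deployment uses Gaussian noise, as described later, but Laplace makes the epsilon calibration easiest to see:

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count with Laplace noise calibrated to a sensitivity of 1
    (any single member changes the count by at most 1). Smaller epsilon means
    more noise and a stronger privacy guarantee."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
print(noisy_count(42, epsilon=0.1, rng=rng))   # heavily noised
print(noisy_count(42, epsilon=10.0, rng=rng))  # close to 42
```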

The privacy risk (epsilon) parameter takes into account multiple decisions that must be made when introducing differential privacy to any task, which includes:

  • The level of privacy protection - granularity of privacy
  • The amount of noise added to each result
  • The number of results returned

The first is the granularity of privacy, which typically aims to protect all data records belonging to a user. A user might have contributed multiple data records to a dataset, as in the case of total views, because a member can view the same post multiple times. In this case, we would need to consider how the analytics change when we remove all views from any single member, which can range from 1 to hundreds, although strict limits can be placed on how many views any member can contribute. Less stringent privacy baselines have been proposed in the literature, such as event-level privacy, which in our setting would consider the privacy of each new view. In our case, we provide top demographic breakdowns based on distinct viewers, where each viewer can modify the count of at most one demographic. Note that it is possible for a user to view a post, change locations, then view the same post again and contribute a distinct view to both locations; but as long as the location remains the same, that user contributes one distinct view despite viewing the post multiple times. However, we treat the number of total views and distinct viewers as public quantities that do not require noise.

The next consideration for determining the overall privacy parameter for differential privacy is how much noise is added to each result. Intuitively, adding more noise ensures more privacy, but we need to understand how noise will impact the overall product – too much noise would produce strange results and lead to an untrustworthy product. There are two scenarios we need to consider when introducing differentially private algorithms. In one case, the data domain is known in advance, which we refer to as the “known domain” setting, and we need only add noise to the counts of these domain elements. This means that we would need to add noise to counts even if the true count is zero. As (non-differentially private) analytics do not typically return results with zero counts, we need to fill in the missing zeros prior to adding noise to them. For company size breakdowns, there are only a few possible categories, so we can easily fill in the missing values and treat this case as a known domain. However, adding noise to zero counts might result in positive values, leading to false positives in the analytics, meaning that we might show a viewer came from a small company despite no viewer actually being from a small company. Hence, we introduce thresholds to ensure we show results that are likely to be from true viewers, in order to provide better utility.
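
As a concrete sketch of the known-domain approach, the snippet below fills in missing zero counts for an enumerable domain, adds noise to every bucket, and suppresses noisy values below a threshold; the bucket labels and parameter values are illustrative, not the production configuration.

```python
import numpy as np

# Illustrative company-size buckets; the real categories may differ.
COMPANY_SIZE_BUCKETS = ["1-10", "11-50", "51-200", "201-500", "501-1000", "1000+"]

def known_domain_release(true_counts: dict, sigma: float, threshold: float,
                         rng=None) -> dict:
    """Add noise to every bucket in the known domain (including zeros) and
    suppress noisy values below a threshold to limit false positives."""
    rng = rng or np.random.default_rng()
    released = {}
    for bucket in COMPANY_SIZE_BUCKETS:
        count = true_counts.get(bucket, 0)       # fill in the missing zeros
        noisy = count + rng.normal(0.0, sigma)   # noise even when count is 0
        if noisy >= threshold:                   # drop likely-noise rows
            released[bucket] = noisy
    return released

print(known_domain_release({"11-50": 7, "1000+": 3}, sigma=2.0, threshold=4.0))
```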

In another case, the data domain is not known in advance or can be very large, so that adding all possible elements as zero counts in the known domain setting would drastically slow down computation. The “unknown domain” setting then only adds noise to counts that are at least 1, which could reveal, for example, that someone with a particular job title must have viewed the post, despite the noisy count. To prevent this privacy risk, unknown domain algorithms from differential privacy add a threshold so that only noisy counts above the threshold will be shown. The choice of threshold typically depends on the scale of noise that is added as well as an additional privacy parameter, referred to as delta: the smaller the delta, the larger the threshold, but also the better the privacy. The benefit of unknown domain algorithms is that, for example, only companies of actual viewers on a post will be shown, resulting in no false positives, but at the cost of returning fewer results. Some previous applications that used unknown domain algorithms include Wilson, Zhang, Lam, Desfontaines, Simmons-Marengo, Gipson ’20, which used algorithms from Korolova, Kenthapadi, Mishra, Ntoulas ’09, as well as LinkedIn’s Audience Engagements API and Labor Market Insights, which used algorithms from Durfee and Rogers ’19.
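
The unknown-domain approach can be sketched as follows; the threshold here is one simple Gaussian tail-bound choice tied to delta, shown purely for illustration rather than as the exact rule used in our systems.

```python
import numpy as np

def unknown_domain_release(observed_counts: dict, sigma: float, delta: float,
                           rng=None) -> dict:
    """Noise is only added to demographics that actually appear (count >= 1),
    so a threshold is required to hide whether a single viewer with a rare
    attribute is present. This threshold is a simple tail-bound choice."""
    rng = rng or np.random.default_rng()
    # With Gaussian noise, Pr[N(0, sigma^2) > sigma * sqrt(2 ln(1/delta))] <= delta.
    threshold = 1.0 + sigma * np.sqrt(2.0 * np.log(1.0 / delta))
    released = {}
    for demographic, count in observed_counts.items():   # only counts >= 1
        noisy = count + rng.normal(0.0, sigma)
        if noisy >= threshold:
            released[demographic] = noisy
    return released

print(unknown_domain_release({"Data Scientist": 12, "Economist": 1},
                             sigma=2.0, delta=1e-4))
```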

The last consideration for computing the overall privacy risk is determining how many results will be shown from a given dataset – in our case, viewers on a post. Although a differentially private algorithm might add a lot of noise to each result, returning fresh noise on repeated queries of the same result would allow someone to average out the noise and determine the true result, making the noise pointless. This is why the privacy parameter, epsilon, is also commonly called the privacy budget, and why there should be a limit on the number of results any differentially private algorithm provides. We want to provide real-time data analytics on posts, so that after each new viewer the analytics are updated, and if a post has received no new views we show the same analytics. To handle this with noise, we introduce a seed to our randomized algorithms and use the number of distinct viewers on the post as part of the seed. This ensures consistent results when there are no new viewers and, importantly, prevents someone from repeatedly drawing fresh noise on the same analytics despite no new viewers and averaging the noise out.
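
A hedged sketch of this seeding idea, where the secret key and identifiers are hypothetical: the noise is derived deterministically from the key, the post, and the number of distinct viewers, so repeated queries without new viewers return identical results.

```python
import hashlib
import numpy as np

def seeded_gaussian(secret_key: str, post_id: str, num_distinct_viewers: int,
                    sigma: float) -> float:
    """Deterministic noise: the same (key, post, viewer-count) triple always
    yields the same draw, so repeating a query with no new viewers returns the
    same result and the noise cannot be averaged out."""
    digest = hashlib.sha256(
        f"{secret_key}|{post_id}|{num_distinct_viewers}".encode()
    ).digest()
    seed = int.from_bytes(digest[:8], "big")
    return np.random.default_rng(seed).normal(0.0, sigma)

# Same inputs -> identical noise; a new distinct viewer -> a fresh draw.
print(seeded_gaussian("key", "post:123", 57, sigma=2.0))
print(seeded_gaussian("key", "post:123", 57, sigma=2.0))  # same value
print(seeded_gaussian("key", "post:123", 58, sigma=2.0))  # different value
```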

Although we can ensure that analytics will remain the same if there are no new viewers, we still want to update the results after each new viewer and there can be thousands of viewers, or more. Hence, the first viewer in the post would appear in potentially thousands of different analytics, updated with each new viewer. Applying differentially private algorithms independently after each new viewer would result in a very large overall privacy budget. We refer to differentially private algorithms that add independent noise to each result as the “one-shot” setting. To help reduce the overall privacy loss, also referred to as the epsilon in differential privacy, we can instead add correlated noise after each new viewer. Luckily there is a line of work in differential privacy on “continual observation” algorithms, originating with Dwork, Naor, Pitassi, and Rothblum ’10 and Chan, Shi, Song ’11 which considers a stream of events, in our case distinct views from members, and returns a stream of counts, in our case demographic breakdowns, with privacy loss which only scales logarithmically with the length of the stream.  

Consider the following example where we want to provide a running counter for the number of Data Scientists that have viewed a particular post. Say the stream of views looks like the following where 1 is a Data Scientist viewer and 0 is a different viewer:

Viewer number:        1  2  3  4  5  6  7  8
Data Scientist view:  1  0  0  0  1  0  0  0

The stream of counts would then show 1 Data Scientist after one, two, three, and four viewers, and 2 Data Scientists after five, six, seven, and eight viewers. Concisely, the stream of counts would be 1, 1, 1, 1, 2, 2, 2, 2. Note that the first viewer is present in all the subsequent counts, so changing the first viewer from a Data Scientist to something else would modify all the counts by one. If we apply fresh noise to the running count after each new viewer, as in the “one-shot” case, the overall epsilon parameter for differential privacy will scale with the number of viewers, which can get quite large, while each count will have noise with a constant standard deviation.

Instead of adding independent noise to each count, we can use the Binary Mechanism from Chan, Shi, Song ’11 to form intermediate partial sums. We then write the stream of data scientist viewers in the following way, where each row adds up the corresponding cells in the row above:

Row 1 (individual views):  1  0  0  0  1  0  0  0
Row 2 (sums of pairs):     1     0     1     0
Row 3 (sums of fours):     1           1
Row 4 (sum of all eight):  2

Note that to compute the number of Data Scientist viewers after, say, 8 viewers, we could either sum up all the entries in the top row or simply use the bottom row value. To apply differential privacy, we can now add independent noise to each cell in the table. Changing the first viewer from a 1 to a 0 will modify one cell in each row of the table, and there are only logarithmically many rows in the number of viewers. Furthermore, to compute the count after t viewers, we need to add up at most logarithmically many (in t) noisy cells from the table. For example, after 6 viewers, you need only add up the first cell in the 3rd row and the 3rd cell of the 2nd row. Hence, for the Binary Mechanism, the overall privacy loss will only scale logarithmically with the number of viewers and each count will have at most logarithmically many noise terms added to it. This provides a much better epsilon versus accuracy tradeoff than the one-shot algorithms.
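
The partial-sum idea can be sketched in code as follows (an illustration of the Binary Mechanism, not our production implementation): each dyadic block of the stream gets one reusable noise draw, and each prefix count is assembled from at most logarithmically many noisy blocks.

```python
import numpy as np

def binary_mechanism(stream, sigma, rng=None):
    """Each dyadic block of the stream receives one cached Gaussian noise draw;
    the private prefix count after t viewers sums at most ~log2(t) noisy blocks,
    and each stream element appears in only logarithmically many blocks."""
    rng = rng or np.random.default_rng()
    block_noise = {}  # (start, length) -> cached noise for that dyadic block

    def noisy_block(start, length):
        if (start, length) not in block_noise:
            block_noise[(start, length)] = rng.normal(0.0, sigma)
        return sum(stream[start:start + length]) + block_noise[(start, length)]

    prefix_counts = []
    for t in range(1, len(stream) + 1):
        total, start = 0.0, 0
        bits = bin(t)[2:]                       # binary digits of t, MSB first
        for i, bit in enumerate(bits):
            length = 1 << (len(bits) - 1 - i)   # dyadic block size for this bit
            if bit == "1":
                total += noisy_block(start, length)
                start += length
        prefix_counts.append(total)
    return prefix_counts

# Stream from the example: viewers 1 and 5 are Data Scientists.
print(binary_mechanism([1, 0, 0, 0, 1, 0, 0, 0], sigma=1.0))
```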

The original algorithms for the continual observation setting worked only in the known domain setting, where we would create such a partial sum table for each possible domain element in the stream. Recent work from Cardoso and Rogers ’22 shows how to design algorithms for the unknown domain setting, which allows for only adding noise to positive count elements yet introduces a threshold so that only noisy counts above the threshold can be shown. We add Gaussian noise, rather than the Laplace noise traditionally used in differential privacy, because it has better privacy loss composition properties under Concentrated Differential Privacy from Bun and Steinke ’16, which can be converted back to differential privacy guarantees.
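
To make the composition benefit concrete, here is a small sketch of the standard zCDP accounting from Bun and Steinke ’16; the parameter values are arbitrary examples, not our configured settings.

```python
import math

def gaussian_zcdp(sensitivity: float, sigma: float) -> float:
    """The Gaussian mechanism with l2-sensitivity `sensitivity` and standard
    deviation `sigma` satisfies rho-zCDP with rho = sensitivity^2 / (2 sigma^2)."""
    return sensitivity ** 2 / (2 * sigma ** 2)

def compose_zcdp(rhos) -> float:
    """zCDP composes by simply adding the rho parameters."""
    return sum(rhos)

def zcdp_to_dp(rho: float, delta: float) -> float:
    """Convert rho-zCDP to (epsilon, delta)-DP via
    epsilon = rho + 2 * sqrt(rho * ln(1/delta))."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# Example: 20 Gaussian releases with sigma = 2 and sensitivity 1.
rho_total = compose_zcdp([gaussian_zcdp(1.0, 2.0)] * 20)
print(zcdp_to_dp(rho_total, delta=1e-6))
```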

Although we can prescribe a way to achieve an overall privacy guarantee with differential privacy with various algorithms, there is still the challenge of integrating differential privacy into an existing system that can return analytics for streaming data in real-time under intense query loads.

Pinot

Without differential privacy, there is still the challenge of delivering analytics in real time. We use Apache Pinot, open sourced by LinkedIn, as the backend OLAP store to serve our queries. Pinot is widely used within LinkedIn for serving site-facing traffic (like Who Viewed My Profile, Talent Insights, and more). Pinot is a columnar OLAP store that serves analytics queries on data ingested from real-time streams and is capable of serving SQL queries at high throughput and low latencies (tens of milliseconds). Over time, Pinot has become a store of choice for OLAP across the industry.

We have integrated differential privacy with Pinot queries before, in the Audience Engagements API from Rogers, Subramaniam, Peng, Durfee, Lee, Kancha, Sahay, and Ahammad ’21. The method we used there was good for a minimum viable product (MVP), and helped us get to our real goal: a scalable, general-purpose service for differential privacy positioned between the querying application and Pinot. The application would issue SQL queries to the service like it would to any database, and the new service would return differentially private results.

In order to serve multiple applications, the new service would need to be configurable with the various parameters that different applications may need. It should also be smart in terms of choosing the right algorithms for the queries issued by different applications. 

Privacy Enhanced Data Analytics Layer (PEDAL)

We are happy to introduce our Privacy Enhanced Data Analytics Layer (PEDAL), which serves as a mid-tier service between an application and any backend service, such as Pinot, to provide differential privacy. PEDAL consists of multiple components, which together can be used to return differentially private results to analytics products at LinkedIn:

  • Differentially private algorithms
  • Metadata store
  • Privacy loss tracker

The suite of differentially private algorithms covers both “one-shot” algorithms, which treat each result independently of the others, as used in the Audience Engagements API, and “continual observation” algorithms, which are used for post analytics. PEDAL also contains a metadata store that holds various algorithmic parameters, including the scale of noise that we introduce and whether we should use one-shot or continual observation algorithms. This metadata is determined as part of an onboarding process with each new application, where we determine which algorithms should be used for each type of query.

PEDAL is meant for real-time data analytics, so it is generally applicable to event-level or pre-aggregated data and to SQL queries that do not contain joins or subqueries. PEDAL takes a SQL query that would run against any backend database and parses it; using the query and the configured metadata for the particular application, it selects the algorithm to run and modifies the original query if needed. The modified query or queries are then sent directly to Pinot to fetch the true results. With the true results and the selected differentially private algorithm, PEDAL uses the configured metadata parameters to compute the private result that is then shared with the application. The application may then choose to further post-process the result, but PEDAL ensures the result is private. PEDAL does not cache results, so it cannot compare against previous results that were already provided. Lastly, PEDAL also has a Privacy Loss Tracker (PLT), backed by an external database, that can track the privacy loss for an entity (such as a member, a post ID, a recruiter, an advertiser, etc.) across all the results that have been provided. The PLT updates the quantified privacy loss with each new result that is returned and, if enabled, blocks new results that would exceed an overall privacy budget configured in the metadata.
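
The end-to-end flow can be sketched roughly as below; all names, interfaces, and metadata fields are hypothetical simplifications (for instance, the noise here is unseeded and the privacy loss accounting is reduced to a fixed charge per result), intended only to show how the metadata store, Pinot, the noise addition, and the Privacy Loss Tracker fit together.

```python
# Hypothetical, heavily simplified sketch of PEDAL's request flow.
import numpy as np

METADATA_STORE = {  # configured during application onboarding (illustrative)
    "post-analytics": {"algorithm": "continual_observation", "sigma": 2.0,
                       "epsilon_budget": 100.0, "loss_per_result": 0.5}
}
PRIVACY_LOSS = {}   # stand-in for the external database behind the PLT

def run_pinot(sql: str) -> dict:
    # Stand-in for the Pinot broker call: returns exact per-demographic counts.
    return {"Data Scientist": 12, "Software Engineer": 9, "Economist": 1}

def handle_query(sql: str, app_id: str, entity_id: str) -> dict:
    meta = METADATA_STORE[app_id]
    spent = PRIVACY_LOSS.get(entity_id, 0.0)
    if spent + meta["loss_per_result"] > meta["epsilon_budget"]:
        raise RuntimeError("privacy budget exceeded for this entity")
    true_counts = run_pinot(sql)                    # exact results from Pinot
    rng = np.random.default_rng()                   # in practice, seeded
    noisy = {k: v + rng.normal(0.0, meta["sigma"]) for k, v in true_counts.items()}
    PRIVACY_LOSS[entity_id] = spent + meta["loss_per_result"]   # PLT update
    return noisy

print(handle_query("SELECT title, COUNT(DISTINCT viewer) FROM views GROUP BY title",
                   "post-analytics", "member:42"))
```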

Diagram of PEDAL

For post analytics, when a post author selects a post to view the top companies of viewers on that post, the SQL query from post analytics is parsed by PEDAL, where it matches metadata for continual observation, as opposed to one-shot, with a configured noise scale. PEDAL then takes the original SQL query and removes the LIMIT parameter so that it does not just return the top-5 results. Based on the metadata, it selects a known domain algorithm, as for company size breakdowns, which have only a few possible values, or an unknown domain algorithm, as for job title, company, or location, which it can determine from the GROUP BY column in the original SQL query. The corresponding algorithm then uses the number of distinct views – obtained by summing up the distinct counts in the breakdown – as the seed in the noise that is added, along with a secure key, so that if no new views have occurred, the same result will be shown, ensuring consistency. From the total number of distinct views, PEDAL computes the binary representation of that number to determine which corresponding noise terms to use for the Binary Mechanism.
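
A rough illustration of this query rewriting and algorithm selection follows; the column names and algorithm labels are assumptions, not the actual configuration.

```python
import re

KNOWN_DOMAIN_COLUMNS = {"company_size"}               # small, enumerable domain
UNKNOWN_DOMAIN_COLUMNS = {"title", "company", "geo", "industry"}

def plan_query(sql: str):
    """Strip the LIMIT and pick an algorithm family from the GROUP BY column."""
    rewritten = re.sub(r"\s+LIMIT\s+\d+\s*$", "", sql, flags=re.IGNORECASE)
    match = re.search(r"GROUP\s+BY\s+(\w+)", sql, flags=re.IGNORECASE)
    column = match.group(1) if match else None
    if column in KNOWN_DOMAIN_COLUMNS:
        algorithm = "known_domain_continual_observation"
    elif column in UNKNOWN_DOMAIN_COLUMNS:
        algorithm = "unknown_domain_continual_observation"
    else:
        raise ValueError(f"no metadata configured for column {column!r}")
    return rewritten, algorithm

print(plan_query("SELECT company, COUNT(DISTINCT viewer) FROM views "
                 "GROUP BY company ORDER BY 2 DESC LIMIT 5"))
```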

For example, say we want to know how many Data Scientists have viewed a post, given that there have been 128 viewers. We can write this aggregate as Y1:128 = Y1 + Y2 + ⋯ + Y128, with each term being 0 if the viewer was not a Data Scientist and 1 if the viewer was a Data Scientist. Because 128 is a power of two, the Binary Mechanism need only add a single noise term, so we return Y1:128 + Z where Z ~ N(0, 2). After one additional viewer, we have the sum Y1:129 = Y1:128 + Y129 that we want to release with the Binary Mechanism. In this case, we reuse the noise from the partial sum Y1:128 and then add fresh noise to Y129. Hence, we release (Y1:128 + Z) + (Y129 + Z'), where Z is from before and Z' ~ N(0, 2). This sum can be written equivalently as Y1:129 + Z + Z', so we do not need to add noise separately to each partial sum, Y1:128 and Y129, but rather to the total count. We will not know the exact order of views in the total counts returned by Pinot, but we only need to know the length of the sequence, i.e., the number of distinct viewers, to determine the noise Z used at the previous round and the new noise Z'.
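
The dyadic decomposition that drives this reuse can be computed directly from the binary representation of the number of distinct viewers; the helper below is illustrative and, combined with the seeded noise described earlier, determines which noise terms are reused at each round.

```python
def dyadic_blocks(t: int):
    """Return the dyadic viewer ranges whose (reusable) noise terms the Binary
    Mechanism combines for a prefix count over t distinct viewers."""
    blocks, start = [], 0
    bits = bin(t)[2:]
    for i, bit in enumerate(bits):
        length = 1 << (len(bits) - 1 - i)
        if bit == "1":
            blocks.append((start + 1, start + length))  # 1-indexed viewer range
            start += length
    return blocks

print(dyadic_blocks(128))  # [(1, 128)]             -> single noise term Z
print(dyadic_blocks(129))  # [(1, 128), (129, 129)] -> reuses Z, adds fresh Z'
print(dyadic_blocks(6))    # [(1, 4), (5, 6)]       -> matches the earlier example
```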

After the algorithm has been executed on the true results from Pinot, it returns the top-k (k = 5 in the case of post analytics) or fewer (due to a threshold) results back to post analytics. Using differentially private algorithms introduces some latency compared to the non-private version, which bypasses PEDAL entirely. However, we found the additional latency to be minimal, adding tens of milliseconds.

Privacy and Utility Metrics

Although we have ensured that differentially private algorithms are applied, we still want to empirically show that privacy has improved. Differential privacy provides a strong theoretical guarantee on privacy risk, and practical deployments still provide better privacy than not using it at all, even with epsilon in the tens or even hundreds, as is the case for this application in post analytics. Hence, it is important to answer the question: is PEDAL working to ensure privacy with the selected parameters?

For post analytics, we developed a privacy metric that creates a dataset of all uniquely identifiable members from a few attributes: company, job title, and location. We then take a sample of posts and recreate the post analytics on those posts after each new viewer. After each new view, we can determine the company, job title, and location of the viewer and check whether that combination corresponds to an identifiable member. Note that we only show the top analytics for each breakdown, so after some views we might not be able to determine some attribute of the viewer.
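
A hedged sketch of this replay-style metric (the data structures and attribute names are illustrative): after each view, we diff the displayed top-k breakdowns, infer the viewer's attributes when exactly one displayed count increases, and check that triple against the set of uniquely identifiable members.

```python
from collections import Counter

def inferred_attribute(before: Counter, after: Counter, k: int = 5):
    """Return the attribute whose displayed top-k count increased, if there is
    exactly one such attribute; otherwise return None."""
    top_after = dict(after.most_common(k))
    top_before = dict(before.most_common(k))
    increased = [a for a, c in top_after.items() if c > top_before.get(a, 0)]
    return increased[0] if len(increased) == 1 else None

def reidentification_rate(views, unique_member_index, k: int = 5) -> float:
    """Replay the views; count how often the inferred (company, title, geo)
    triple uniquely identifies a member in the index."""
    breakdowns = {dim: Counter() for dim in ("company", "title", "geo")}
    hits = 0
    for view in views:  # each view: dict with the viewer's company/title/geo
        inferred = {}
        for dim, counter in breakdowns.items():
            before = counter.copy()
            counter[view[dim]] += 1
            inferred[dim] = inferred_attribute(before, counter, k)
        key = (inferred["company"], inferred["title"], inferred["geo"])
        if None not in key and key in unique_member_index:
            hits += 1
    return hits / len(views)

views = [{"company": "Acme", "title": "Data Scientist", "geo": "Bay Area"}]
print(reidentification_rate(views, {("Acme", "Data Scientist", "Bay Area")}))
```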

After we introduce PEDAL to single post analytics, we can run the same privacy metrics to see how often we can identify someone. Note that we can only determine the demographic information of the viewer if the noisy count changes substantially from one view to the next. We estimated that, with PEDAL, a bad actor attempting a reidentification attack would be able to reidentify fewer than one tenth as many members as before. PEDAL also introduces some false discoveries, meaning that a post author might think they have identified the post viewer but will in fact be mistaken, so a bad actor can never be confident that their attack was successful.

We acknowledge that this may not be the only privacy attack that can be carried out on post analytics, and the post author might have additional context beyond what is shown in post analytics. If other attacks should be considered, we can reduce the overall privacy parameter for the application without any significant change to PEDAL. However, we want to point out that it is important to provide accompanying privacy metrics to show relevant stakeholders that PEDAL is providing protections, without having to discuss a theoretical quantity like epsilon. These metrics are also helpful in demonstrating that other privacy approaches, like strict thresholds, can still fail to fully mitigate privacy risks while incurring a significant drop in utility.

To determine the amount of noise that would be appropriate, we looked at the precision and recall of the demographic breakdowns. For precision, we only wanted to show top results for each breakdown, so we wanted to ensure that when we show top-5 noisy results, they are almost always within the true top-10. We also recognized that several items can have the same counts, so the 10th result might have the same count as the 5th result, making the ranking essentially an arbitrary tie-break.

In order to show precise results, we needed to limit the results that are shown. The recall metric counts the number of results that are returned, which would ideally be 5 but can be lower due to thresholds. We then aimed to provide a target level of precision while returning as many results as possible (maximizing recall). As a result, posts with very few distinct viewers, or posts where a majority of viewers have different companies, job titles, locations, etc., may not show these breakdowns in post analytics. We then worked to balance the number of identified members and false discoveries against the precision and recall utility metrics to achieve the right balance of privacy and utility.
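
These two metrics can be sketched as follows, with illustrative data: precision checks that the shown results fall within the true top-10, and recall measures how many of the 5 slots survive the threshold.

```python
from collections import Counter

def precision_and_recall(true_counts: dict, shown: list, k: int = 5, m: int = 10):
    """Precision: fraction of shown results within the true top-m.
    Recall: fraction of the k display slots that were actually filled."""
    true_top_m = {item for item, _ in Counter(true_counts).most_common(m)}
    in_top_m = sum(1 for item in shown if item in true_top_m)
    precision = in_top_m / len(shown) if shown else 1.0
    recall = len(shown) / k
    return precision, recall

true_counts = {"Acme": 9, "Globex": 7, "Initech": 6, "Hooli": 2, "Umbrella": 1}
print(precision_and_recall(true_counts, shown=["Acme", "Globex", "Initech"]))
```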

The LinkedIn Edge and Insights Analytics (LEIA) Platform

The LinkedIn Edge and Insights Analytics (LEIA) Platform powers multiple member-facing analytics products at LinkedIn. In this blog, we focus on the use case of single post impression analytics, where LEIA provides demographic analytics on members who viewed a post across different dimensions like company, job title, location, industry, and company size. At a high level, the LEIA platform can be divided into two parts, data ingestion and data retrieval, while leveraging Pinot as the OLAP datastore. On the data ingestion side, we set up Lambda or Lambda-less architectures to populate Pinot tables with similar data processing needs and Pinot table schemas across different use cases. For data retrieval, we expose multiple Rest.li APIs for LEIA clients to consume data, abstracting similar business requirements across multiple use cases, such as fetching analytics results for specific time windows or demographic dimensions with aggregation at different time granularities like day, week, or month.

Diagram of LEIA system

By leveraging PEDAL, we make it easy to integrate differential privacy into analytics use cases served from LEIA. LEIA does not add differential privacy by default to all analytics use cases as it depends on the privacy design and business needs. For those cases where noise is deemed necessary, such as impression analytics, LEIA makes a request to PEDAL before the data is served to LEIA. For those where noise is not needed, such as public actions like reactions and comment analytics, LEIA sends requests to the Pinot service directly and no noise is added to the results from Pinot. 

Unlike the raw results retrieved directly from the Pinot broker service, noisy results from the PEDAL service need to be sanitized from a business logic perspective. For example, the counts retrieved from PEDAL could be non-positive in breakdowns for known-domain algorithm use cases, due to the nature of the noise added (positive, zero, or negative). The LEIA platform includes logic that post-processes the results from PEDAL to eliminate any nonsensical values introduced by the noise before the data is served to LEIA clients.
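
As a small illustration (the exact rules are product decisions, not shown here), such sanitization might round the noisy values and drop non-positive or negligible counts:

```python
def sanitize_breakdown(noisy_counts: dict) -> dict:
    """Drop noisy counts that round to zero or below, then round the rest."""
    return {demo: round(count) for demo, count in noisy_counts.items()
            if count >= 0.5}

print(sanitize_breakdown({"1-10": 3.7, "11-50": -1.2, "51-200": 0.1}))
```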

Conclusion

Member trust is paramount at LinkedIn. Post analytics provides rich demographic information about who is viewing each post, yet we wanted to preserve and enhance member privacy with such a product. PEDAL provides an easy way to integrate state-of-the-art privacy protections into data analytics at scale. We also point out the importance of privacy metrics and how they should accompany any private analytics service, especially when the theoretical privacy budget becomes very large, as is the case with many practical deployments, see for example the 2020 U.S. Census and applications of federated learning with differential privacy. As private algorithms improve and new advances are made, PEDAL allows us to easily adopt these new private algorithms with better accuracy and privacy tradeoffs. Even if these algorithms give us stronger theoretical guarantees, we should still have privacy metrics to show the relative improvement and to catch unintentional implementation errors. Communicating privacy and utility tradeoffs is just as important as designing new algorithms. We hope that this case study will help other privacy practitioners communicate privacy risks and mitigations to all stakeholders. So go ahead, make a LinkedIn post to view differentially private post analytics using PEDAL, and know that LinkedIn is safeguarding your privacy as you view others’ posts!

Acknowledgements

We would like to thank our leadership for helping us with the collaboration: Neha Jain, Shraddha Sahay, Parvez Ahammad, and Souvik Ghosh.