LinkedIn Sales Insights: Quality data foundations for smarter sales planning

February 17, 2021

Co-authors: Sabeer Thajudeen, Dan Melamed, Sashikanth Damaraju, Jean Baptiste Chery, Tai Bendit, and Ajay Singh

Having reliable and trusted data is essential for Sales Operations and sales leaders. Sales Ops professionals need to optimize for results by evaluating and defining territories, while sales leaders and CXOs need to understand the opportunity landscape to help set strategy. However, these groups often lack visibility into their total addressable market and need tools to strategically allocate resources and books of business. Above all, they need trustworthy data to power their decisions.

With the launch of LinkedIn Sales Insights, now generally available, sales leaders have access to one of the most powerful datasets in the world, the LinkedIn Economic Graph—a source of information that is both real time and people-powered. Understanding a company means knowing who works there and what they’re trying to accomplish, including how many people at that company could benefit from a product or service and how that number is changing over time. LinkedIn Sales Insights (LSI), powered by the Economic Graph, can help answer those questions because it maps the relationships between people, companies, skills, jobs, and schools.

First introduced to the market late last year, LSI is a data enrichment and analytics platform powered by over 722 million members, allowing sales leaders to be more strategic while preserving member privacy. This new tool helps Sales Ops focus their teams on the right accounts—the ones with the most opportunity—through a foundation of real-time market, account, and relationship strength data and insights. In this blog post, we will talk about the advantages of using our member-powered data and how we leverage AI and data mining to further improve the quality of the insights we serve to our customers.

Intelligent systems to ensure data quality

The raw data that we start with to deliver Sales Insights is huge, noisy, and mostly unstructured. The best way to corral such data into reliable structures and useful insights is to use artificial intelligence (AI). LinkedIn’s Company Standardization team uses AI to power LinkedIn Sales Insights in a few different ways.

First, we define what counts as a “company.” Because anyone can create a Company Page on LinkedIn, we use AI, including character-level language models, to help us decide whether a company page represents a real company or another entity, such as a blog or spam page. 
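As a rough illustration of the idea (and not the production model), a character-level n-gram model trained on the names of known-legitimate companies will assign low scores to gibberish or spammy page names. The corpus and functions below are hypothetical:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Extract character n-grams, padded so name boundaries are modeled."""
    padded = f"^{text.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_ngram_model(names, n=3):
    """Count n-gram frequencies from a corpus of known-real company names."""
    counts = Counter()
    for name in names:
        counts.update(char_ngrams(name, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}, n

def avg_ngram_score(model, name):
    """Average corpus frequency of the name's n-grams; near zero for gibberish."""
    model_probs, n = model
    grams = char_ngrams(name, n)
    if not grams:
        return 0.0
    return sum(model_probs.get(g, 0.0) for g in grams) / len(grams)
```

A page name whose character patterns look nothing like real company names scores near zero and can be routed for further review.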

We also use AI to detect and connect duplicate records. For example, we might have separate records for “Morgan Chase Bank,” “JP Morgan Chase,” and “JPM,” each with a different HQ address. Our AI-powered record linkage model uses machine-learned measures of similarity to help us determine whether such records refer to the same company or to distinct entities.
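A toy version of such a similarity measure, using only token overlap and HQ-country agreement—far simpler than the machine-learned features in the real model; the rule and threshold below are illustrative:

```python
def name_tokens(name):
    """Lowercase a company name and split it into punctuation-free tokens."""
    return set(name.lower().replace(".", "").replace(",", "").split())

def jaccard(a, b):
    """Token-set Jaccard similarity between two company names."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def likely_same_company(rec_a, rec_b, name_threshold=0.5):
    """Toy linkage rule: similar names, and known HQ countries do not conflict."""
    name_sim = jaccard(rec_a["name"], rec_b["name"])
    countries_conflict = (
        rec_a.get("hq_country") and rec_b.get("hq_country")
        and rec_a["hq_country"] != rec_b["hq_country"]
    )
    return name_sim >= name_threshold and not countries_conflict
```

Under this rule, “JP Morgan Chase” and “Morgan Chase Bank” score 0.5 and link, while an abbreviation like “JPM” shares no tokens at all—one reason the production model relies on learned similarity rather than surface overlap.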

Although a great deal of information about companies is publicly available, most of it is formatted for human consumption, not for convenient packaging into a database. For example, most companies publish their addresses somewhere on their websites. We use state-of-the-art AI techniques, such as deep learning and distant supervision, to find those addresses, and to parse them into components, such as the country, the city, and the postal code. Having this information parsed correctly into structured, database-friendly components helps us to incorporate the information into the product and improve discoverability of companies.
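For one narrow, US-style address format, the parsing step can be approximated with a pattern like the one below. The production system uses learned sequence models rather than hand-written rules, so treat this purely as a sketch of the input/output contract:

```python
import re

# A toy pattern for US-style addresses of the form
# "<street>, <city>, <state> <zip>, <country>".
ADDRESS_RE = re.compile(
    r"^(?P<street>[^,]+),\s*"
    r"(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})\s+(?P<postal_code>\d{5}),\s*"
    r"(?P<country>.+)$"
)

def parse_address(raw):
    """Split a free-text address into database-friendly components, or None."""
    m = ADDRESS_RE.match(raw.strip())
    return m.groupdict() if m else None
```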

One of the most powerful features of the LSI platform is that customers can use it to enhance their existing company databases. To do so, LSI must match customers’ existing company records to records in our company database. This kind of matching can be very difficult to do accurately, due to the many different ways to express most company attributes—just think of the “Chase Bank” example above, or the many ways to write an address. We use AI in the form of entity resolution models to find the best matches, thus maximizing the value of our data for our customers.
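Conceptually, the matching step takes each customer record and searches our database for the best-scoring candidate. A deliberately simplified, non-ML sketch—the normalization rules and scoring are illustrative stand-ins for the entity resolution models:

```python
def normalize(name):
    """Crude normalization: lowercase and drop common legal suffixes."""
    tokens = name.lower().replace(",", "").replace(".", "").split()
    suffixes = {"inc", "llc", "ltd", "corp", "corporation", "co"}
    return [t for t in tokens if t not in suffixes]

def best_match(customer_name, candidates):
    """Return the candidate with the highest token overlap, or None."""
    query = set(normalize(customer_name))
    best, best_score = None, 0.0
    for cand in candidates:
        cset = set(normalize(cand))
        if not query or not cset:
            continue
        score = len(query & cset) / len(query | cset)
        if score > best_score:
            best, best_score = cand, score
    return best if best_score > 0 else None
```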

Maintaining ongoing data quality

Given the quantity and significance of our data, we must ensure quality is built into our processing pipeline. We’ve added validations at every stage to enforce intrinsic and contextual aspects of data quality. The pipeline is built using our Hadoop infrastructure and reads several terabytes of data from various sources (e.g., standardized company, member, jobs, etc.). We leverage LinkedIn’s Data Sentinel platform to validate both our source and target data through advanced data mining, data management, and software engineering techniques. Some examples of these validations include ensuring that:

  • Upstream data sources for members and jobs are refreshed in a timely manner
  • The percentage of companies with a valid geographic location exceeds a specified threshold
  • The distribution of members among titles (Analyst, Manager, etc.) has not changed significantly

High-level depiction of the ETL (Extract, Transform, Load) pipeline

A simplified depiction of our Extract, Transform, Load (ETL) pipeline is shown above. The pipeline runs daily to refresh metrics using the latest data from upstream sources.
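Validations like the geographic-coverage check listed above reduce to simple assertions over the output dataset. A minimal sketch of that idea—the actual checks run in Data Sentinel, and the field names and threshold here are illustrative:

```python
def geo_coverage(companies):
    """Fraction of company records that carry a valid HQ location."""
    if not companies:
        return 0.0
    with_geo = sum(1 for c in companies if c.get("hq_city") and c.get("hq_country"))
    return with_geo / len(companies)

def assert_geo_coverage(companies, threshold=0.9):
    """Fail the pipeline run when location coverage drops below threshold."""
    coverage = geo_coverage(companies)
    if coverage < threshold:
        raise AssertionError(
            f"geo coverage {coverage:.2%} below threshold {threshold:.2%}"
        )
    return coverage
```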

Our source data for Company, Members, Jobs, Sales Navigator, and CRM needs to be consistent, complete, and fresh to make sure that LSI is as useful as possible for sales leaders. There are several processes and systems we use to help ensure these qualities in our data at LinkedIn.

  • Consistent: To ensure consistency across LinkedIn products, we’ve established a single source of truth for each shared metric. For example, “Headcount” is a metric shown in multiple locations, like on a Company Page and in Sales Navigator, but is consistent because it uses a shared standardized source of truth.
  • Complete: Source systems also ensure accuracy and completeness by implementing their own Data Sentinel alerting while producing these source datasets. These checks help to verify the validity of upstream data and prevent computing incorrect metrics.
  • Fresh: Our pipeline has set thresholds for freshness for all input data sources. Before metric computation begins, freshness is recorded and asserted against the threshold. Alerts are triggered if the pipeline finds that the input data is stale.
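The freshness gate above might be sketched as follows, with per-source staleness budgets checked before metric computation begins; the source names and budgets are illustrative, not the real thresholds:

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-source staleness budgets.
FRESHNESS_THRESHOLDS = {
    "company": timedelta(hours=24),
    "member": timedelta(hours=24),
    "jobs": timedelta(hours=48),
}

def check_freshness(source, last_updated, now=None):
    """Return True if the source is fresh enough for metric computation."""
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) <= FRESHNESS_THRESHOLDS[source]

def stale_sources(last_updated_by_source, now=None):
    """List the sources that should trigger an alert and pause the run."""
    return [
        s for s, ts in last_updated_by_source.items()
        if not check_freshness(s, ts, now=now)
    ]
```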

Our data is also refreshed in near real time, which means we need to monitor changes daily through metric change assertions and domain data assertions.

  • Metric change assertions: We detect unusual changes in the distribution of data by leveraging statistical data assertions. For example, we use total variation distance-based checks in Data Sentinel to detect any significant shift in the companies mapped to cities or members mapped to titles. Statistical data assertions can help detect situations like the ingestion of several new companies that are missing geographic location, an upstream company data source change causing several companies to be mapped to new geographic locations, or an upstream title taxonomy change causing several members to be mapped to new titles.
  • Domain data assertions: Datasets must also meet certain expectations for column values. For example, company IDs must be unique, and we expect to have a certain level of coverage for dimensions like HQ country and city.
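Total variation distance, the statistic behind the metric change assertions above, is half the sum of absolute differences between two discrete distributions; a small sketch, with an illustrative alerting threshold:

```python
def total_variation_distance(dist_a, dist_b):
    """TVD between two discrete distributions given as {category: probability}.

    Ranges from 0.0 (identical) to 1.0 (disjoint support).
    """
    categories = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(c, 0.0) - dist_b.get(c, 0.0)) for c in categories)

def distribution_shifted(yesterday, today, threshold=0.05):
    """Flag a run when the title (or city) distribution moves too much."""
    return total_variation_distance(yesterday, today) > threshold
```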

Assertions are assigned different levels of severity, ranging from silent to notify to severe. An example of an assertion failure, which helped detect a change in HQ city for several companies, is shown below. The pipeline was paused, and an investigation revealed that an upstream geo taxonomy change had caused several companies in Australia to lose city information. We worked with the geo data owners to fix the issue and prevent bad data from being pushed to the product.


UI in Data Sentinel displaying details for the failed validation

Troubleshooting an alert can be a time-consuming process because our upstream data sources are constantly evolving to ingest and map more information. To augment our alerting system and aid in our troubleshooting, we have built internal dashboards that summarize our data and daily changes. These dashboards help us to quickly verify a change that’s flagged in an alert and to determine if a corrective action is required. A few example situations where these dashboards can come in handy are provided below. 

  • If we receive an alert indicating a drop in company geographic location coverage, we can quickly identify the affected locations and work with the Company Standardization team to reassign the correct locations to the impacted companies.
  • If we receive an alert indicating a significant increase in the number of companies, we can quickly assess if the newly added companies meet our required thresholds for geographic location coverage and industry coverage.

Example of an internal dashboard that is used to verify firmographics data

CRM data enrichment

We can multiply the value of our data by delivering it to our customers where they need it most. Today, getting clean data into a CRM is a complicated, manual process: customers must join data from a variety of sources, clean and reshape it, and then manually re-join it and bulk upload it into the CRM. In addition, poor data quality in the CRM inhibits a sales team’s ability to work with accounts, prioritize, and integrate with other sales tools.

To assist with these challenges, we built an LSI-CRM integration for Salesforce and Dynamics that allows customers to easily update their CRM with LinkedIn company data and keep it up to date. We use entity resolution models to match our records to the customer’s internal database. Then, we allow customers to bulk export the data from a given report to their CRM instance.

To simplify pushing data from the LinkedIn side to Salesforce and Dynamics, we create custom objects that contain LinkedIn firmographic and demographic data. We use the upsert method (Salesforce, Dynamics), keyed on an externalID with a lookup, so we can insert or update records without first querying for each unique object ID or storing those IDs on our side. This lets us push millions of Account record updates (company profiles and personas) every day. With quality firmographic and demographic data for every account, LSI can be the key input to CRM health and sales planning processes.
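To sketch the upsert call: Salesforce exposes upsert-by-external-ID as an HTTP PATCH on the sObject endpoint, so a sender never needs the Salesforce record ID. The object and field names below (LinkedIn_Company__c, LinkedIn_Id__c, Headcount__c) are hypothetical, and the function only builds the request rather than sending it:

```python
import json

API_VERSION = "v50.0"  # illustrative; use the version your org supports

def build_upsert_request(instance_url, sobject, ext_id_field, ext_id_value, fields):
    """Build a Salesforce REST upsert call keyed on an external ID.

    A PATCH to /services/data/<version>/sobjects/<SObject>/<ExternalIdField>/<value>
    creates the record if the external ID is new and updates it otherwise.
    """
    url = (
        f"{instance_url}/services/data/{API_VERSION}/sobjects/"
        f"{sobject}/{ext_id_field}/{ext_id_value}"
    )
    return {"method": "PATCH", "url": url, "body": json.dumps(fields)}
```

An HTTP client would send this with an OAuth bearer token; Dynamics offers an analogous pattern via alternate keys on its Web API.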

Next steps and acknowledgments

For those who work in analytical functions, data is everything. We’re excited to democratize access to one of the most powerful datasets in the world—a source of information that is real time, people-powered, and augmented by advanced AI and data mining techniques to serve Sales Ops professionals with the most reliable insights.

We’re making this data available to sales organizations across the globe, and there’s much more to come. We’ll help our customers understand how to get the most value out of our data features, integrate LSI into additional workflows, add insights, and continue to improve our data foundations.

These efforts are truly collaborative and require many cross-functional partners. We would like to extend our thanks to the R&D teams that contributed their expertise to LSI.

LinkedIn Sales Insights: Ajay Singh, Jean Baptiste Chery, Sashikanth Damaraju, Matteo Palvarini, Chad Krsek, Clark Rasmussen, Haowen Cao, Kevin Liou, Regina Galieva, Nicole Gkerpini, Tiffany Sukamtoh, Vidit Aggarwal, Vinicius Santana, Xiaonan Duan, Jeff Tang, Sabeer Thajudeen, Siddhartha Sengupta, Christine Cho, and Thomas Lee

Company Standardization Team: Sammy Hansen, Deirdre Hogan, Fancheng Kong, Jimmy Kuo, Michael Han, Yuliang Li, Tianhao Lu, Xiaoqiang Luo, Dan Melamed, Remi Mir, Tao Xiong, Yunpeng Xu, and Yao Zhang.