Post Inspector: A Tool to Optimize Content Sharing
June 19, 2018
The Content Ingestion team at LinkedIn primarily focuses on discovering content across the web and ingesting it into the LinkedIn content ecosystem. Not only do we ingest content whenever a member shares a URL, but we also proactively search for interesting content that our members could enjoy.
Given the team’s focus, we’ve created a tool—called Post Inspector—for external content providers and internal teams at LinkedIn that provides insight into how we extract metadata so that content providers can easily optimize the sharing experience of their content on the LinkedIn platform. For instance, when members share a link to post on the platform, the Content Ingestion team’s services are tasked with finding the metadata to populate the shared post’s title, image, and content provider. Metadata is essentially a bird’s-eye view of the content that gives you an idea of what the content is about, but is not the content itself.
The internet and unstructured data
The great thing about the internet is that there is no standardized format that people must follow when publishing content online. This has allowed for a lot of innovation, and new forms of content have popped up over the years. However, since web pages have unpredictable formats, large players in the internet industry have had to come up with ways to understand pages across the web in order to build helpful search engines, show content previews across platforms, and so on.
There have been several efforts in the past decade to introduce some structure to the web. For instance, schema.org and the Open Graph protocol are two of the major initiatives to allow web content providers to add helpful markup to their pages so that search engines and other web companies can better interpret the content.
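To make the idea of structured markup concrete, here is a minimal sketch of reading Open Graph tags from a page using only Python’s standard library. The sample page, the `OpenGraphParser` class, and the `parse_open_graph` helper are hypothetical names made up for illustration; this is not how LinkedIn’s services are implemented.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the og:* property names come from the Open Graph protocol.
SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="An Example Article" />
<meta property="og:type" content="article" />
<meta property="og:image" content="https://example.com/cover.png" />
<meta property="og:description" content="A short summary of the article." />
<title>Fallback title</title>
</head><body></body></html>
"""


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop, content)


def parse_open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


if __name__ == "__main__":
    for name, value in parse_open_graph(SAMPLE_PAGE).items():
        print(f"{name}: {value}")
```

A page that provides these tags tells a consumer directly which title, image, and description to show, instead of leaving them to be guessed from the page body.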
Why Post Inspector?
When possible, our services use the content’s structured metadata, which can come in several forms, including Open Graph tags or oEmbed data. However, we can’t expect that every piece of content on the web adheres to these protocols, nor can we expect that the provided metadata meets the protocols’ specifications for each metadata property. For instance, the provided image could be too large or too small, a title could be too long, and so on. Consequently, we need backup ways to interpret pages based on what’s available when extracting the metadata from the content, and we need to validate the data we are given to pick the best candidate for each metadata property (title, image, description, and so on).
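As a rough illustration of validating structured metadata and falling back when it doesn’t qualify, the sketch below prefers a structured title and falls back to the page’s `<title>` text. The `MAX_TITLE_LENGTH` limit and the function names are assumptions for the example, not LinkedIn’s actual criteria.

```python
from typing import Optional

# Hypothetical limit, chosen only for illustration.
MAX_TITLE_LENGTH = 150


def is_valid_title(title: Optional[str]) -> bool:
    """A title qualifies if it is non-empty and not longer than the limit."""
    if title is None:
        return False
    stripped = title.strip()
    return bool(stripped) and len(stripped) <= MAX_TITLE_LENGTH


def choose_title(og_title: Optional[str], html_title: Optional[str]) -> Optional[str]:
    """Prefer the structured title; fall back to the <title> element's text."""
    for candidate in (og_title, html_title):
        if is_valid_title(candidate):
            return candidate.strip()
    return None  # No usable title was found.


if __name__ == "__main__":
    too_long = "x" * 500
    print(choose_title(too_long, "A reasonable page title"))
```

Here the structured title is rejected for being too long, so the fallback candidate is used instead.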
This additional complexity of handling unstructured data and validating existing structured metadata makes it harder to reason about why we are extracting certain values. Our team had to answer a lot of questions regarding content metadata, such as why we chose a certain image over the one that was specified in the structured metadata. Often, the reason was that the specified image didn’t meet the image property’s criteria, so our service picked a better candidate. Until now, the reason why certain values were chosen over others was largely a mystery to most people outside our team. That had to change, because we want to help create the best content sharing experience for our members, in a highly scalable way.
At LinkedIn, we care a lot about both our members’ and publishers’ success, so we have teams across the company that work closely with publishers to help us provide our members with a variety of content, displayed in a way that makes the content fit seamlessly into the feed experience.
Without Post Inspector, people didn’t have clear visibility into our content metadata requirements. Each time a client had an issue with how their content was being displayed, our team had to respond with implementation details explaining exactly why certain values were picked.
We believed that the required knowledge to solve a problem should match the problem’s complexity. This effectively means that if a publisher, or any member for that matter, wants to improve how their content looks on LinkedIn, it should be extremely straightforward. Through Post Inspector, we want to empower our members and publishers alike, so that anyone can easily know what they can do to make improvements for their content to gain more traction on LinkedIn. No one should have to know the fine details of the LinkedIn architecture to figure out why their image is smaller than expected, or is not showing up.
How to use Post Inspector
Post Inspector is designed to provide members with three main benefits: (1) guidance for content providers on how to optimize their content for better engagement on LinkedIn; (2) visibility into when we last updated our information about their content, and what values we extracted for the metadata; and (3) a starting point for investigating any issues that came up when we visited their page.
You can use Post Inspector to optimize the sharing experience for anything: an article, image, video, personal website, resume—you name it. As long as you have a URL to the content, all you need to do is enter the URL, and we will do all the magic to teach you exactly what needs to be done in order for your content to have a fully-enriched sharing card.
How Post Inspector works
Whenever a link is shared, or our content discovery services find new content, we store a high-level view of the content. Post Inspector is essentially a visualization of how our metadata extraction process works. In order to build Post Inspector to achieve its purpose, we added new logic to the extraction process that adds annotations and ingestion feedback when we are retrieving the content metadata.
How Post Inspector works: The annotation process
Our service can receive a URL to content that it needs to extract through several means. For instance, someone could share a link, embed a content preview in a Pulse article, or our content discovery tool could send an extraction request. We augmented the extraction process to add annotations to the metadata and ingestion feedback for scenarios in which we couldn’t extract any content.
As shown in the diagram, first our service receives the content’s URL and visits the content to get the HTTP response indicating whether the content is available. If we get a successful response, which indicates that the content is available, we can use the content’s response to extract the metadata properties that we need: title, description, images, and so on. As we are looking through all potential metadata values for each property, we annotate them according to whether they fit our criteria.
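The flow described above can be sketched roughly as follows: check whether the response was successful, and if so, annotate each candidate with the criteria it fails. The minimum dimensions, the annotation labels, and all class and function names are hypothetical; the sketch also assumes each image’s dimensions are already known.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical criteria, not LinkedIn's actual requirements.
MIN_IMAGE_WIDTH = 200
MIN_IMAGE_HEIGHT = 200


@dataclass
class ImageCandidate:
    url: str
    width: int
    height: int
    source: str  # where the candidate was found, e.g. "og:image" or "img tag"
    annotations: List[str] = field(default_factory=list)


def annotate_image(candidate: ImageCandidate) -> ImageCandidate:
    """Record every criterion the candidate fails; an empty list means it fits."""
    candidate.annotations.clear()
    if candidate.width < MIN_IMAGE_WIDTH:
        candidate.annotations.append("width_below_minimum")
    if candidate.height < MIN_IMAGE_HEIGHT:
        candidate.annotations.append("height_below_minimum")
    return candidate


def inspect(status_code: int,
            candidates: List[ImageCandidate]) -> Tuple[List[ImageCandidate], List[str]]:
    """Return (annotated candidates, ingestion feedback) for one extraction."""
    if not 200 <= status_code < 300:
        # The content was not available, so there is nothing to extract.
        return [], [f"Content unavailable: HTTP status {status_code}"]
    return [annotate_image(c) for c in candidates], []


if __name__ == "__main__":
    found = [
        ImageCandidate("https://example.com/tiny.png", 50, 50, "og:image"),
        ImageCandidate("https://example.com/hero.png", 1200, 630, "img tag"),
    ]
    annotated, feedback = inspect(200, found)
    for c in annotated:
        print(c.source, c.annotations)
```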
We synchronize our criteria with other teams at LinkedIn to ensure that we are prioritizing potential metadata properties in accordance with the end goal: extracted metadata values that fit the qualifications of the places in which they will be displayed, such as shared posts or Pulse articles.
Once we have annotated each potential value with whether it fits the criteria, we rank the values by which ones better fit the criteria, and select the winner as the metadata property value. We do that for each metadata property, and then pass the annotated content metadata to Post Inspector’s API, which provides data to the Post Inspector web client. For each annotation, we generate corresponding ingestion feedback with specific details on which criterion wasn’t met, and what the criterion actually is, in order to give content providers actionable feedback to improve their content. Once that is done, we surface the annotated content metadata to the Post Inspector web client so that it can display the data and clients can get a visualization of what occurred when we tried to extract the content’s metadata.
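One simple way to picture the ranking and feedback steps is shown below. It assumes that a candidate with fewer failed criteria ranks higher and that ties are broken by a source preference; the real ranking logic isn’t described here, so the ordering rule, the criterion descriptions, and all names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical criterion descriptions used to make feedback actionable.
CRITERIA = {
    "width_below_minimum": "image width must be at least 200 pixels",
    "height_below_minimum": "image height must be at least 200 pixels",
}

# Hypothetical tie-breaker: prefer structured sources over guessed ones.
SOURCE_PRIORITY = {"og:image": 0, "img tag": 1}


@dataclass
class AnnotatedValue:
    value: str
    source: str
    annotations: List[str]  # criteria this value failed


def select_winner(candidates: List[AnnotatedValue]) -> Optional[AnnotatedValue]:
    """Pick the candidate with the fewest failed criteria, then by source priority."""
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: (len(c.annotations),
                       SOURCE_PRIORITY.get(c.source, len(SOURCE_PRIORITY))),
    )


def build_feedback(candidates: List[AnnotatedValue]) -> List[str]:
    """Turn each annotation into a message naming the failed criterion."""
    messages = []
    for c in candidates:
        for name in c.annotations:
            rule = CRITERIA.get(name, "unknown criterion")
            messages.append(f"{c.source} ({c.value}): failed {name}; {rule}")
    return messages


if __name__ == "__main__":
    values = [
        AnnotatedValue("https://example.com/tiny.png", "og:image",
                       ["width_below_minimum", "height_below_minimum"]),
        AnnotatedValue("https://example.com/hero.png", "img tag", []),
    ]
    print(select_winner(values).value)
    for line in build_feedback(values):
        print(line)
```

In this example the image from the page body wins over the structured one because the structured image fails both size criteria, and the feedback lists exactly which criteria it failed.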
Today, and onwards
Now that we’ve released Post Inspector, teams across LinkedIn are empowered to answer questions about the metadata our services extract, since they can see exactly why each value was chosen. As a result, there is a shorter turnaround time for troubleshooting, since they can use Post Inspector to get the answers they need when helping content providers improve their presence on LinkedIn. Since the question of “what” and “from where” the metadata was extracted can be answered through the Post Inspector tool, both the Content Ingestion team and the teams we work with have more bandwidth to work on improving our ability to guess what the content’s metadata is, and to propose metadata criteria that better fit the core use cases for content metadata on LinkedIn.
Over time, LinkedIn’s content experience will evolve. As we find increasingly better ways to extract metadata, and as we transform to provide our members with new and richer experiences on LinkedIn, the requirements for the metadata may change. It’s crucial to have a tool like Post Inspector that can quickly give both external and internal clients insights into what our criteria are and what could have gone wrong. Post Inspector is intended to act as the source of truth as requirements change.
Thank you to Nicholas Lee, Chris Ng, and Dru Pollini for playing crucial roles in reviewing the code as we worked on building out Post Inspector. I also want to give a huge thanks to Jessica Tsuei for providing us with feedback on how we could improve the tool, and for highlighting any issues that needed to be solved. Lucy Cheng and Jessica also provided us with invaluable insight into the importance of this tool for the rest of the organization. And, of course, thanks to the rest of the Content Ingestion team at LinkedIn.