Making LinkedIn media more inclusive with alternative text descriptions
October 10, 2019
Co-authors: Vipin Gupta, Ananth Sankar and Jyotsna Thapliyal
As part of our vision to provide economic opportunity for every member of the global workforce, LinkedIn creates a unique environment for members to network, learn, share knowledge, and find jobs. In many ways, the LinkedIn feed has become the core of this effort as the preeminent way to share information and participate in conversations on our site. Alongside text, rich media has become an important component of the feed as well. But the addition of rich media within the LinkedIn feed raises a question: is the feed fully inclusive for all LinkedIn members?
For instance, can a member who has a vision disability still enjoy rich media on the feed? Can a member in an area with limited bandwidth, which could stop an image from fully loading, still have the complete feed experience? To uphold our vision, we must make rich media accessible for all of our members.
One way to improve the accessibility of rich media is to provide an alternative text description when uploading an image. A good alternative text description describes an image thoroughly while drawing the viewer's attention to the important details. All the major elements or objects of the image should be identified and presented in a single, unbiased statement. Currently, LinkedIn allows members to manually add an alternative text description when uploading images via the web interface, but not all members take advantage of this feature. To improve site accessibility, our team has begun work on a tool that adds a suggested alternative text description for images uploaded to LinkedIn. Although computer vision science has made great strides in recent years, automatic text description generation is still a difficult task, compounded by the fact that images on LinkedIn tend to fall into a professional or work-oriented category rather than being more generic.
This blog post provides a brief overview of the technologies we are exploring to help us improve content accessibility at LinkedIn, using existing solutions through Microsoft Cognitive Services while also breaking ground to customize our models for LinkedIn’s unique dataset.
Why alternative text descriptions?
There are several ways that alternative text descriptions for images can improve the accessibility of rich media in the feed. For members using assistive technology like a screen reader, alternative text descriptions provide a textual description of image content. Similarly, in areas where bandwidth may be limited, such descriptions allow members to understand the key features of an image, even if the image itself cannot be loaded.
If a member doesn't provide an alternative text description at the time of image upload, we can turn to multiple methodologies for generating alternative text descriptions at scale, including deep learning, neural networks, and other machine learning techniques.
Examples of alternative text descriptions (from the LinkedIn feed)
What are the challenges of generating automatic text descriptions?
Describing an image or scene is more art than science: there is no single "right" description, and judgments are always subjective. Generating a good description requires subject expertise and knowledge of various physical objects and their attributes. Additionally, an image is only a two-dimensional projection of our three-dimensional world at a given moment in time; the temporal information that could help identify activities more accurately is missing, which makes writing alternative text descriptions even more difficult.
In view of these challenges, automatic image alternative text description generation models require large sets of training images, annotated by humans, to capture subjective variations and diverse objects. With advances in deep learning and natural language processing, state-of-the-art models can generate a description for an image; however, the accuracy of these models lies between 50% and 70% [1,2]. Furthermore, if we apply these techniques to more specific types of data (e.g., the professional-themed rich media typically shared on the LinkedIn feed), accuracy decreases further. Hence, these models need to be trained on specific types of data to be most effective for a specific use case, and need to be applied carefully.
Leveraging Microsoft Cognitive Services
Microsoft Cognitive Services offers many computer vision capabilities, including the Analyze API, which can generate alternative text image descriptions. The service documentation describes this feature as one that "describes the image content with a complete sentence in supported languages." The description is based on a collection of content tags that are also returned by the operation. More than one alternative text description can be generated for each image, and descriptions are ordered by their confidence score. We decided to leverage this Microsoft service as we began exploring automated image description, and started integration within the LinkedIn stack.
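To make the API's output shape concrete, here is a minimal sketch of selecting the best caption from an Analyze API response. The field names (`description`, `captions`, `tags`, `confidence`) follow the documented response format, but the sample response itself is illustrative, not real LinkedIn data.

```python
import json

# Illustrative sample of an Analyze API JSON response; field names follow
# the documented "Description" output, but the values are made up.
sample_response = json.loads("""
{
  "description": {
    "tags": ["person", "indoor", "standing"],
    "captions": [
      {"text": "a group of people standing in a room", "confidence": 0.91},
      {"text": "a group of people in a room", "confidence": 0.67}
    ]
  },
  "tags": [{"name": "person", "confidence": 0.99}]
}
""")

def best_caption(response):
    """Return the highest-confidence caption and its score, or None."""
    captions = response.get("description", {}).get("captions", [])
    if not captions:
        return None
    # Captions arrive ordered by confidence; sort defensively anyway.
    top = max(captions, key=lambda c: c["confidence"])
    return top["text"], top["confidence"]

print(best_caption(sample_response))
```

The top caption would then be the candidate alternative text passed downstream for quality assessment.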
Analyze API was trained on a fairly “general” body of data, so one of the first things we needed to do was to assess how the image alternative text description feature performed with LinkedIn-specific data. To do this, we created four categories that human evaluators could use to score performance.
Table 1: Summary of labels used for verifying alternative text descriptions
The Microsoft API returns a confidence score along with the image's alternative text descriptions, categories (detailed definitions of these categories can be found here), and tags. The confidence score predicts the quality of the generated image text descriptions. To assess how the API performed on LinkedIn data, our evaluators compared their labels for each result against the automatically generated confidence score. Below are some manually evaluated examples; each label is grouped into one of three buckets depending on the confidence score.
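Bucketing by confidence can be sketched as below. The post does not state the actual thresholds used in the evaluation, so the cut-offs here are hypothetical placeholders.

```python
def confidence_bucket(score):
    """Map an API confidence score to one of three evaluation buckets.

    The 0.7 and 0.4 thresholds are hypothetical, chosen only to
    illustrate the three-bucket grouping described in the post.
    """
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"
```

Evaluators' manual labels could then be compared bucket by bucket against these automatically derived groups.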
The Microsoft API does a very good job, as shown in row 1, capturing groups of people, objects like newspapers, and places like a subway. In AI models, confidence is tightly correlated with the training data distribution. Because the Microsoft Analyze API is not trained on LinkedIn data, we expected less accurate confidence scores for LinkedIn rich media, which tends to contain images with professional context (e.g., row 2 in the table above shows images with slides and a projector in the background, where the descriptions are inaccurate). Images shared on the LinkedIn platform are often captured in professional settings such as exhibitions, conferences, and seminars. Members also share pictures containing a lot of text, like product posters and certificates, as well as graphic images such as charts. Because of the specific nature of LinkedIn data, Microsoft's confidence scores were less accurate, so we needed to devise a different way to assess description quality, one that took our unique dataset into account.
To understand the Microsoft Cognitive Services Analyze API's functionality and output, look at Table 2 below, which shows some examples from the MS-COCO dataset along with the API's output in the tags, descriptions, and confidence columns. Note that Microsoft returned tags recognizing foreground objects like "person" and "dog," background objects like "grass," "fence," and "desk," activities, and scene categories like "outdoor" and "indoor." The description is a single sentence briefly describing the image.
Table 2: Examples of Microsoft Cognitive Services 'Analyze Image' feature on public dataset images
In the last section, we described the Microsoft Analyze API and how it performs qualitatively on professional images. Though it did a good job for the majority of images, the problem remained of assigning a posterior probability of correctness to a description. To solve this, we decided to dig deeper into the correctness of alt text on LinkedIn data. The idea was to find frequent patterns specific to the quality of the image descriptions. To surface these patterns, we processed metadata like the tags and categories returned by Microsoft to generate two word clouds: one from images whose descriptions were marked as good by human evaluators, and another from images whose descriptions were marked as bad.
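The tag-frequency analysis behind those word clouds can be sketched with simple counting. The labeled examples below are hypothetical stand-ins for the human-evaluated LinkedIn images; only the overall approach (contrasting tag frequencies between "good" and "bad" groups) comes from the post.

```python
from collections import Counter

# Hypothetical (tags, human label) pairs standing in for evaluated images.
labeled = [
    (["person", "indoor", "standing"], "good"),
    (["person", "group", "indoor"], "good"),
    (["screenshot", "text"], "bad"),
    (["projector", "screenshot", "text"], "bad"),
]

# Count tag frequencies separately for good and bad descriptions; the
# most common tags in each group are what a word cloud would visualize.
good_tags = Counter(t for tags, label in labeled if label == "good" for t in tags)
bad_tags = Counter(t for tags, label in labeled if label == "bad" for t in tags)

print(good_tags.most_common(3))
print(bad_tags.most_common(3))
```

Tags that dominate one group but not the other (e.g., "person" versus "screenshot" in this toy data) are the patterns that informed which image types needed LinkedIn-specific training.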
Based on this analysis, we concluded that the Microsoft solution works well for images featuring groups of people, indoor locations, people standing, many people, etc. For other specific types of images, we realized that we needed to further train the model on LinkedIn-specific data in order to obtain more accurate results.
LinkedIn meta classifier solution
The Microsoft Cognitive Services Analyze API is designed for a general use case suitable for widespread adoption. Since the LinkedIn feed is focused on professional networking, and we have additional sources of data about our members and their content, we wanted to enhance the API's results by tuning the system for the LinkedIn use case. Our goal was to generate an improved confidence score to make sure quality alternative text is passed to the LinkedIn feed. The meta classifier solution is depicted in Figure 2. It improved the proportion of "Good" captions passed to the feed, at higher precision.
Figure 2: Flow diagram of the proposed system to improve alternative text description quality
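A minimal sketch of the re-scoring idea is below. The production meta classifier is trained on LinkedIn-labeled data; the hand-set tag weights and the exact way the Microsoft confidence is combined with tag evidence here are purely illustrative assumptions.

```python
import math

# Hypothetical feature weights: positive for tags the word-cloud analysis
# associated with good descriptions, negative for bad ones. A trained
# model would learn these from human-labeled LinkedIn images.
TAG_WEIGHTS = {"person": 0.6, "group": 0.4, "screenshot": -1.2, "text": -0.5}

def meta_score(ms_confidence, tags):
    """Combine the Microsoft confidence score with tag-based evidence
    to produce an improved confidence score in [0, 1]."""
    z = 2.0 * ms_confidence - 1.0  # center the API confidence around 0
    z += sum(TAG_WEIGHTS.get(t, 0.0) for t in tags)
    return 1.0 / (1.0 + math.exp(-z))  # logistic squash back to [0, 1]

# Only captions whose meta score clears a quality threshold would be
# passed on to the feed as suggested alternative text.
```

Under this sketch, a caption tagged "screenshot" is scored down even when the API itself reports high confidence, which is the behavior the word-cloud analysis motivated.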
Handling inappropriate image descriptions
Incorrect image descriptions could harm our member experience. The meta classifier we developed helps filter out such text descriptions (Table 3 below). We also developed an image description correction module to replace misidentified gender terms and frequently incorrect words such as "screenshot."
Table 3: Inappropriate alt-text examples and how the meta classifier helps improve scoring
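The correction module's behavior can be sketched as a small set of pattern substitutions. The post names gender terms and "screenshot" as frequent error sources; the specific replacement rules below are hypothetical illustrations, not the production rule set.

```python
import re

# Hypothetical correction rules: each pattern targets a term the
# evaluation found frequently wrong in generated descriptions.
REPLACEMENTS = [
    # Generated captions often mislabel photos of text as screenshots.
    (re.compile(r"\ba screenshot\b", re.IGNORECASE), "an image"),
    # Replace misidentified gender terms with neutral wording.
    (re.compile(r"\b(?:he|she)\b", re.IGNORECASE), "they"),
    (re.compile(r"\b(?:man|woman)\b", re.IGNORECASE), "person"),
]

def correct_description(text):
    """Apply each correction rule in order to a generated description."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text

print(correct_description("a screenshot of a man standing"))
# -> "an image of a person standing"
```

In practice such rules would sit after the meta classifier, cleaning up descriptions that score well enough to keep but contain a known-problematic term.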
In this post, we’ve provided a brief overview of how we are exploring ways to improve content accessibility at LinkedIn. As discussed, we are currently leveraging existing solutions from Microsoft, combined with specially-trained models, to generate automatic image text descriptions. LinkedIn’s AI teams are also building image description models for rich media content specific to the LinkedIn platform to help improve overall image description accuracy. In the near term, we plan to continue to improve our meta classifier model by collaborating closely with Microsoft on tags taxonomy and an associated dictionary, as well as experimenting with additional text associated with feed posts.
We’re always open to feedback and would love to hear from you as to how we can make LinkedIn even better. For any thoughts or questions, we encourage our members to email our accessibility team through our Disability Answer Desk.
Many thanks to our product partner Peter Roybal; to Shane Afsar, Siva Sooriyan, and Peter Zhang from our Engineering teams in NYC; and to our designers Sara Remi Fields and Kevin Arcara. We would also like to thank the AI engineers working on this effort, including Subhash Gali, Jyotsna Thapliyal, Aman Gupta, Bhargav Patel, and Bharat Jain. Thanks to the infrastructure and platform teams, such as Vector, who worked with us to make this a reality, including Mary Hu and Karthiek Chandrasekaran. Our sincere thanks to Rushi Bhatt, Liang Zhang, and Ananth Sankar for their valuable input, and to our AI head, Deepak Agarwal, for supporting this effort and inspiring us to build technology for the future.