Under the hood: Learning with documents
April 24, 2019
LinkedIn Learning aims to be the world’s most engaging learning platform. Learning content comes in many different shapes and sizes, and as such, our team is continually expanding the types of content that we support.
Today, educational videos make up the majority of the content on LinkedIn Learning. However, we’ve taken a step beyond hosting video content by bringing one of the most popular content types to the platform: documents! We recently launched the ability for organizations to upload documents directly to LinkedIn Learning. To give organizations more flexibility in sharing learning content, we now support some of the most popular document formats, such as PDF, Word, and PowerPoint.
In this post, we will walk through the engineering that went on behind the scenes. From the backend point of view, we will focus on our architecture and publishing flow. On the frontend side of things, we will explore our web rendering infrastructure, specifically highlighting our accessibility considerations using the open source library PDF.js.
The publishing flow
LinkedIn Learning’s document publishing ecosystem
Before an uploaded document is published for learners, it goes through a series of processing steps. Our flow starts when a learning administrator uses the Learning Admin application to upload a document. As soon as the upload completes, the file is handed off to LinkedIn’s media infrastructure platform, Vector, for processing and storage. After an initial round of validation, we use Microsoft’s Binary Conversion Service to convert the file to PDF. Converting the file to PDF preserves the core structure of the uploaded document. It also lets us compress the high-quality file into a relatively small file size, streamlines our workflow for further processing, and enables accessibility support within our client’s document viewer.
After successfully generating a PDF from the original document, we convert each page of the document to images at a range of resolutions. We added this extra processing step to simplify and optimize client-side rendering of the document: the viewer incrementally loads individual lightweight images instead of one substantially heavier PDF file. As a bonus, this step helps in low network bandwidth areas, as it gives us the flexibility to serve lower-resolution images for faster loading. To build the document conversion pipeline, we evaluated several open source libraries, such as Apache PDFBox, OpenCV, Apache Tika, and ImageMagick. After weighing which libraries best fit our needs and simplified the processing flow, we ended up leveraging a subset of these libraries in our codebase.
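The per-page, per-resolution conversion step can be sketched as a simple render plan. This is an illustrative sketch only: the function name, target widths, and file layout are assumptions for demonstration, not LinkedIn's actual pipeline.

```python
# Hypothetical sketch of the page-to-image conversion step: for each page of
# the converted PDF, we plan one output image per target resolution, so the
# client can later pick the lightest file that still looks sharp.
# Widths and naming are illustrative, not the real pipeline's values.

TARGET_WIDTHS = (480, 800, 1200, 1920)  # low-bandwidth through high-DPI

def plan_page_renditions(doc_id, page_count, widths=TARGET_WIDTHS):
    """Build a render plan: one entry per (page, width) pair."""
    plan = []
    for page in range(1, page_count + 1):
        for width in widths:
            plan.append({
                "page": page,
                "width": width,
                "output": "%s/page-%d-w%d.png" % (doc_id, page, width),
            })
    return plan
```

Each entry in the plan would then be handed to a rasterizer (e.g., something PDFBox-like) to produce the actual image file.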
While the document processing is happening in the background, the learning administrator can still make updates to the uploaded content by adding richer metadata to the uploaded file. Some of the metadata that can be updated includes the document’s name, description, tags, and associated skills, all of which improve discovery for learners via search and relevance. This metadata is passed to LinkedIn Learning’s content backend system, where it is stored and then forwarded to LinkedIn’s Universal Content Filtering (UCF) system for spam detection. In parallel, we also trigger UCF to perform virus checks and scans on the document's transcripts, embedded URLs, and embedded images to identify and automatically block inappropriate content from being published on our platform. Upon a successful scan of the document and all its metadata, the content is ready for publishing.
As soon as the learning administrator publishes the document, it is ready to be served. All documents are served using member-specific signed URLs with a short TTL to ensure access control. Once a learner signs into LinkedIn Learning and accesses an uploaded document, the LinkedIn Learning application fetches data from various systems via REST APIs and passes it to our frontend web framework for rendering.
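A common way to implement member-specific, short-lived URLs is HMAC signing. The sketch below, assuming a hypothetical secret and parameter layout (the post does not describe LinkedIn's actual scheme), shows the general idea: the server signs the path, member, and expiry; the media server verifies the signature and rejects expired requests.

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # illustrative; real key management differs

def sign_url(path, member_id, ttl_seconds=300, now=None):
    """Return a per-member URL that expires after ttl_seconds."""
    expires = (now if now is not None else int(time.time())) + ttl_seconds
    payload = ("%s|%s|%d" % (path, member_id, expires)).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return "%s?member=%s&expires=%d&sig=%s" % (path, member_id, expires, sig)

def verify_url(path, member_id, expires, sig, now=None):
    """Check the signature and that the TTL has not elapsed."""
    current = now if now is not None else int(time.time())
    if current > expires:
        return False  # link has expired
    payload = ("%s|%s|%d" % (path, member_id, expires)).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

Because the signature covers the member ID, a URL issued to one learner cannot be reused by another, and the short expiry limits how long a leaked link stays valid.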
The web document viewer
Breakdown of the document viewer’s image and text overlay usage
Let’s break the viewer down. Our document viewer renders documents for learners using both the processed PDF and the converted images. The converted images of the document are used for the core document viewing experience. We choose which image resolution to use based on the learner’s screen size to ensure we deliver the best possible resolution for their device. On top of these images, we place a selectable text layer rendered straight from the converted PDF for a seamless experience.
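The resolution choice described above reduces to picking the smallest available rendition that still covers the device's physical pixels. This is a hedged sketch with assumed widths, not the viewer's actual selection logic:

```python
def pick_rendition(css_width, device_pixel_ratio,
                   available_widths=(480, 800, 1200, 1920)):
    """Choose the smallest rendition at least as wide as the device needs,
    falling back to the largest one on very dense screens."""
    needed = css_width * device_pixel_ratio  # physical pixels to cover
    for width in sorted(available_widths):
        if width >= needed:
            return width
    return max(available_widths)
```

On a low-bandwidth connection, the same function could be given a reduced set of widths to bias toward lighter images.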
PDF.js handles the placement of the text layer using CSS positioning and scales the text layer to the converted PDF image using a CSS transform, so that the text layer aligns precisely with our underlying page images. We make this text layer transparent to avoid any visual conflicts; only in high contrast mode do we add a font color to match the user’s system high contrast settings. Our selectable text layer is generated using a refactored version of Mozilla’s open source PDF.js library, which we tweaked for accessibility. With both the images and the text layer lazily loading in, the resulting document viewer is both performant and accessible!
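The coordinate math behind that alignment can be illustrated with a small sketch. The assumptions here are generic PDF facts, not LinkedIn's code: PDF pages are measured in points (72 per inch) with the origin at the bottom-left, while CSS positions from the top-left, so the text layer must be scaled to the rendered image and vertically flipped.

```python
def text_layer_scale(page_width_pts, rendered_width_px):
    """Scale factor mapping PDF points onto the rendered image's pixels."""
    return rendered_width_px / page_width_pts

def position_text_item(x_pts, y_pts, page_height_pts, scale):
    """Convert a PDF text position (bottom-left origin) to CSS absolute
    positioning (top-left origin), illustrative of what PDF.js computes."""
    left = x_pts * scale
    top = (page_height_pts - y_pts) * scale  # flip the vertical axis
    return "position:absolute; left:%.1fpx; top:%.1fpx;" % (left, top)
```

For example, a US Letter page (612x792 points) rendered at 1224 pixels wide gives a scale of 2.0, and every text item's PDF coordinates are mapped through that factor.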
Accessibility at the forefront
With an incredible accessibility team at our side, we strove to make this document viewer as accessible as possible in the browser so that all learners have an equally delightful and enjoyable experience. The core accessibility infrastructure piece that we included in this document viewer was our text overlay. Our text overlay allows for selectable and searchable text, screen reader navigation, and high contrast mode for Windows machines.
In creating a semantic text layer, we decided to use PDF.js as our starting point. PDF.js is not accessible out of the box because it does not carry over the document’s tagged markup to the DOM; PDF.js only uses `<span>` and `<div>` elements to render the text. This is primarily because PDFs have historically been presentational documents; document tagging, a structural element, wasn’t added until over a decade after the birth of the PDF format and, as such, rendering this metadata is not a straightforward process. We found that PDF.js had the best foundation for us to build our accessible document viewing experience on, and we took on the challenge of adding tagged markup support ourselves.
PDF is an incredibly complex file format, especially since a PDF can be generated in a hundred different ways, all of which a renderer needs to handle gracefully. We dug deep into Adobe’s PDF specification and engineered our way to surfacing tagged documents as semantic HTML in the DOM. Thus, our text overlay became a semantic text overlay. So when someone uploads a Word file where deliberate steps have been taken to make it accessible to people with disabilities (such as adding ALT text to images and using proper semantic markup), we surface all of that accessibility metadata right in the browser using valid markup.
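Conceptually, surfacing tagged PDF structure means mapping the spec's standard structure-element types onto semantic HTML. The mapping below is a deliberately tiny, illustrative subset (the real renderer handles far more cases and nesting); the function names are assumptions for this sketch.

```python
# Illustrative mapping from PDF standard structure types (per the PDF spec)
# to semantic HTML tags. A real renderer handles many more types, nesting,
# and attributes; this sketch shows only the core idea.
PDF_TAG_TO_HTML = {
    "H1": "h1", "H2": "h2", "H3": "h3",
    "P": "p", "L": "ul", "LI": "li",
    "Table": "table", "TR": "tr", "TD": "td", "TH": "th",
}

def element_to_html(pdf_tag, text, alt=None):
    """Render one structure element as semantic HTML."""
    if pdf_tag == "Figure":
        # Surface author-provided ALT text so screen readers can announce it.
        return '<img alt="%s">' % (alt or "")
    html_tag = PDF_TAG_TO_HTML.get(pdf_tag, "span")  # untagged fallback
    return "<%s>%s</%s>" % (html_tag, text, html_tag)
```

With a mapping like this, a heading tagged `H1` in the source Word file stays a heading in the browser, rather than degrading into an anonymous `<span>`.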
Engagement and analytics
There are dozens of reasons why a learning administrator would like to know how learners are interacting with the content that they have uploaded. Additionally, learners consuming the content should be able to see which pages they have already read in a document, and where they last left off. That’s where our engagement metrics come into play.
To help guide both learning administrators and learners, the viewing progress per page and per learner is observed and stored in Espresso-based databases in real time as learners interact with the document. This data is consumed by our Samza-based reporting and analytics system to provide content engagement metrics back to the content uploader. The resulting learner activity and engagement information can be viewed by learning administrators in the form of downloadable reports, as well as easy-to-follow charts showing the most popular content, time spent on learning, etc. Additionally, learners can see their own viewing progress through the document viewer within the Learning application, where we show a green tick mark for viewed pages.
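The per-learner state behind that experience can be modeled very simply: a set of viewed pages plus the last page visited. This is a minimal sketch of the concept, not LinkedIn's Espresso schema or API:

```python
class DocumentProgress:
    """Track which pages a learner has viewed and where they left off.
    Illustrative in-memory model; the real system persists this state."""

    def __init__(self, page_count):
        self.page_count = page_count
        self.viewed = set()   # pages already seen (green tick marks)
        self.last_page = 1    # where the learner should resume

    def mark_viewed(self, page):
        if 1 <= page <= self.page_count:
            self.viewed.add(page)
            self.last_page = page

    def percent_complete(self):
        return 100.0 * len(self.viewed) / self.page_count
```

Aggregating these records across learners is what feeds the administrator-facing engagement reports.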
The document viewer with viewing progress in the Document Thumbnails sidebar
Building a document processing infrastructure and an accessible web document viewer was just a starting point for LinkedIn; it allowed us to get our foot in the door and open new opportunities for other LinkedIn products, including Flagship, Talent Solutions, Sales Solutions, and Marketing Solutions. LinkedIn's document uploads for feed was launched side-by-side with document uploads for Learning, with both teams working together to build some core infrastructure components while, at the same time, leveraging each other's work. A few other teams at LinkedIn have already started to leverage our document infrastructure, and we are excited to see what they build with it!
Bringing documents to LinkedIn Learning was an enormous endeavor that took over six months and was a result of the hard work of about 50 individuals across various teams. We would like to thank the following teams for their invaluable contributions: Learning Content Platform, L&D Experience, LEX Native Mobile, Vector, Ignite Formats, Learning Enterprise Reporting, Universal Content Filtering (India), Learning Search (India), Accessibility, Learning Relevance, House Security, Trust and Safety, Learning SRE, and Media SRE. And last but certainly not least, we would like to give a big thank you to our partners in design, product, and product marketing for all the guidance and inspiration along the way.