Introducing Commute Time for Jobs
June 7, 2018
A recent LinkedIn survey found that more than 85% of respondents would take a pay cut for a shorter commute. Given the importance of commute time in the context of job searching, we decided to focus on this issue to make it easier for our members to find the right job for them. Members are now able to see their estimated commute time while viewing jobs for which this data is available, save their commute preferences, and see at a glance which jobs have shorter commutes. The integration of this valuable information into the job-seeking experience arose from exciting new opportunities to collaborate with the Bing Maps team and leverage their APIs. But there was no shortage of groundwork needed to support these features, from collecting accurate locations of jobs and companies to processing complex geospatial data.
Commute time module on Job Detail page
Now that we are collecting exact addresses and inferring addresses for job postings, members can find out their prospective commute time on each Job Detail page by specifying their starting location, mode of transportation, and departure time. Jobs with an exact or inferred address will also have a pin icon next to the location indicating that an estimated commute time is available for that job.
We’ve also built a feature that leverages a member’s saved commute preference data to decorate jobs with an insight, or “Job Flavor,” informing the member that a specific job is reachable within their commute preference.
By having a gradient of isochrones for every member that has a saved commute preference, we can surface jobs with the more compelling, shorter commute times when applicable. Luckily, it was easy to leverage the Job Flavors platform to serve our new Commute Job Flavor at scale in everything from job search to job recommendations.
Collecting exact addresses for jobs
The biggest challenge we faced when we started this project was that we had a limited understanding of where jobs were located. At the start of the project, we had only collected low-granularity information about the location of each job from job posters, such as the city or the region where the job was located. In order to help job seekers understand their potential commute time, we needed precise locations of jobs, so our first step was figuring out how to get this data. We knew that collecting more detail would be difficult, so we did some research to find examples of other companies that collect this level of detail at scale. We quickly found that few do; this is a challenge that many other companies grapple with as well.
In an ideal world, every job poster would be happy to provide the address of the job so that job seekers could see their potential commute, but we understand that this isn’t realistic. For instance, some companies have job postings for general requisitions that don’t correspond to a specific office but instead apply to an entire campus. We sought to provide a solution that could work for all types of job posters. We collaborated with the Bing Maps team to customize the Bing AutoSuggest API and limit the types of entities to addresses and cities. This solution is flexible and lets a job poster quickly enter an address, while keeping the previous functionality intact. For example, job posters with an exact address in mind can post the job to a specific address, but a general requisition in any one of LinkedIn’s Sunnyvale offices can still be posted simply to “Sunnyvale, California,” without an address (although this may restrict the availability of Commute Time features).
LinkedIn job posting UI with an address typeahead
Inferring addresses for jobs
Even after allowing job posters to explicitly add an exact address, we only had addresses for a small percentage of jobs on our site. Not all job posters will put in an address when they post a job, and the majority of jobs posted on LinkedIn aren’t posted via the web UI (they are instead typically ingested through an API). We decided to tackle this data problem from a very different angle to increase our coverage of jobs with addresses: collecting the addresses of companies and then using that data to infer the addresses of jobs.
This was an effort that spanned three engineering teams: the Company Team, the Ingestion Team, and the Careers Team. The Company Team owns a service that gives us access to all of the company data we have at LinkedIn, so they added the ability to edit and add company addresses on the company admin page. This means that company admins can now update their address information, and LinkedIn can use that when inferring addresses. The Ingestion Team built out an email campaign to ask company admins to confirm addresses that had been manually discovered by LinkedIn so that they could be added to our company database. The Careers Team extended an existing Samza pipeline for standardizing jobs to allow for inferring the exact address.
The Samza processor we built has two major functions. The first is to call the Bing Geocoding API to get a standardized address with latitude and longitude for use in downstream systems like job search and job recommendations. The second is to infer the address for a job that wasn’t posted to a specific address. Since we can never be entirely sure that the inferred address is correct, we allow multiple addresses to match the job. We look for any addresses associated with the company that posted the job which match the postal code of the job posting itself. If none are found, we check to see if the company has any addresses associated with it within the same city as the job posting. If either search returns results, these are saved as inferred addresses. A major benefit of building this processor into our nearline job standardization pipeline is that this pipeline has built-in reprocessing capabilities. As we refine our inference algorithm and our data becomes more complete, we have the ability to easily reprocess existing jobs to take advantage of new data and improved algorithms.
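The two-pass matching described above can be sketched in Python. The field names and record shapes here are hypothetical stand-ins; the real processor runs inside the Samza pipeline against standardized Bing geocodes.

```python
def infer_addresses(job, company_addresses):
    """Return candidate inferred addresses for a job posted without one.

    job: dict with hypothetical 'postal_code' and 'city' fields.
    company_addresses: list of dicts with the same fields, representing
    the addresses on file for the company that posted the job.
    """
    # First pass: company addresses sharing the posting's postal code.
    matches = [a for a in company_addresses
               if a.get("postal_code") == job.get("postal_code")]
    if matches:
        return matches
    # Fallback: company addresses in the same city as the posting.
    return [a for a in company_addresses
            if a.get("city") == job.get("city")]
```

Because several company addresses can satisfy either pass, the result is a list: all matches are saved as inferred addresses rather than forcing a single guess.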
Job address standardization and inference flow
For a job seeker, being able to quickly identify jobs that are within their commute time preference can greatly improve their job searching efficiency, especially when location is a primary decision-making criterion. To do this, we allow users to give us their commute preference so we can use this data to better cater jobs to them. Collecting the preference as text is straightforward, but transforming this data into something that is more flexible to work with in different parts of the LinkedIn job ecosystem can be challenging.
The Bing Maps team recently introduced their Isochrone API that allows clients to find isochrones, which are borders of equal travel time that contain the area reachable within a specified time duration, starting location, and mode of transportation. An isochrone is represented as a multipolygon of lat-long vertices; a multipolygon is appropriate because transit commutes often result in “forests” of polygons (imagine a polygon rooted around each stop of a train route, for example).
We leveraged the Isochrone API from Bing Maps to generate these polygons from a member’s commute preference with traffic information factored in. We generate a gradient of isochrones around the specified commute preference to allow for finer granularity in recommendations. For example, if a commute preference is specified to be 30 minutes of driving, we’ll request and store the isochrones for 30 minutes, 15 minutes, and 10 minutes of driving; we’ll show later that this is helpful in efficiently generating more valuable insights.
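As a sketch, the gradient tiers could be derived as simple fractions of the preferred duration. The fraction set below is an assumption chosen to reproduce the post's 30/15/10-minute example; the actual tiers used in production may differ.

```python
def gradient_durations(preference_minutes):
    """Expand one commute preference into a descending list of isochrone
    durations to request from the Isochrone API.

    The tiers (1, 1/2, 1/3 of the preference) are illustrative: for a
    30-minute preference they yield 30, 15, and 10 minutes.
    """
    tiers = [1.0, 0.5, 1 / 3]
    return sorted({round(preference_minutes * t) for t in tiers},
                  reverse=True)
```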
Collecting commute preferences and isochrones
Calculating a gradient of isochrones from a location is an expensive computation, so we process them in an asynchronous pipeline. Members’ commute preferences are saved into an Espresso datastore, LinkedIn's scalable and elastic data-as-a-service infrastructure.
Two requirements shaped this design. First, because the latency of asynchronous requests to the Bing Isochrone API varies greatly with the commute duration, we needed to handle out-of-order responses. Second, because isochrones are computed from third-party geodata that can change over time, we needed an easy way to re-bootstrap all stored isochrones on demand.
Capturing Espresso updates
Brooklin, LinkedIn’s stream ingestion service, offers change-capture services from Espresso and Oracle data sources, and has a simple integration with Samza. In our pipeline, Samza listens to change-capture events from Brooklin for the Espresso commute preference updates and sends out Kafka messages for the isochrone requests.
Getting and persisting isochrones
The requests to the Bing Isochrone API go through GaaP, LinkedIn’s Gateway-as-a-Platform. For each commute preference, we make several requests to Bing with different commute times. The individual isochrone responses are later aggregated to form an isochrone gradient.
The high-level Samza fluent API is used to combine multiple Samza tasks into one Samza application. The application handles merging messages from different sources, aggregating the individual isochrone responses to create the isochrone gradient, repartitioning the complete messages by member ID, and finally, determining whether the update is the latest and eligible to be persisted. To determine the order of the responses, the timestamp from the online Espresso update is propagated throughout the process and saved locally in the Samza application.
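The ordering check amounts to a last-write-wins rule keyed on the propagated Espresso timestamp. Here is a minimal sketch; the class and method names are hypothetical, and the real check keeps its state in the Samza application's local store rather than an in-memory dict.

```python
class IsochronePersister:
    """Decide whether an aggregated isochrone gradient should be persisted,
    dropping responses that a newer preference update has superseded."""

    def __init__(self):
        # member_id -> newest Espresso update timestamp seen so far
        self._latest = {}

    def should_persist(self, member_id, espresso_ts):
        newest = self._latest.get(member_id)
        if newest is not None and espresso_ts < newest:
            # This response belongs to an older preference update; skip it.
            return False
        self._latest[member_id] = espresso_ts
        return True
```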
We chose Venice as the storage system for our derived data because it offers sub-millisecond latency and supports very high QPS, which allows us to use the data in different latency-sensitive job services.
Solving delayed updates in the Venice store
The asynchronous pipeline can take a couple of minutes to complete, starting from a complete Espresso update to a new isochrone gradient being saved in Venice. To prevent users from seeing stale data while waiting for the async response, we invalidate all reads to the isochrones during this time. For every isochrone read, we query Venice for the isochrone and Espresso for the latest timestamp. We only return the isochrones if the timestamps in the Venice record and Espresso record match.
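The read-time staleness check can be sketched as follows. The `venice` and `espresso` lookups are hypothetical stand-ins for the actual store clients; the essential point is the timestamp comparison.

```python
def read_isochrones(member_id, venice, espresso):
    """Return the stored isochrone gradient only if it reflects the
    member's latest commute preference.

    venice(member_id)   -> (record, timestamp) from the derived store
    espresso(member_id) -> timestamp of the latest preference update
    """
    record, venice_ts = venice(member_id)
    espresso_ts = espresso(member_id)
    if record is None or venice_ts != espresso_ts:
        # The async pipeline hasn't caught up; treat as unavailable
        # rather than serving stale isochrones.
        return None
    return record
```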
Having Brooklin and Samza work together facilitates the re-bootstrapping process. Espresso regularly publishes a snapshot of the entire database. Brooklin supports bootstrapping by having two streams, one reading from the snapshot and one reading from the online change-capture stream. When the bootstrapping instance starts, our Samza processor first consumes from the snapshot bootstrap stream and later switches to the change-capture stream. The Brooklin/Samza system consumer seamlessly makes the switch without the client’s knowledge.
In addition, the Venice team is building support for a hybrid store to simultaneously consume from a finite reprocessing job and a non-finite nearline job, similar to how they supported Venice Hybrid for batch push and nearline jobs. Once this work is done, re-bootstrapping becomes an automatic process that can be run anytime on demand to regenerate the isochrones without interfering with the flow for new updates from members.
Working with isochrones
Thanks to cutting-edge optimizations from Bing, we can generate large isochrones for 2-hour drives in under 10 seconds. However, depending on the number of vertices of the multipolygon returned, isochrones can be unwieldy and impractical. Since the public Isochrone API from Bing Maps must serve many different use cases outside of our own, it was important for us to be able to limit the size of these isochrones ourselves to meet storage and performance requirements, all while maintaining a reasonable tradeoff in accuracy. After investigating the literature on computational geometry and polygon simplification, we narrowed our experimentation down to two popular choices for polyline simplification: the Douglas-Peucker algorithm and the Visvalingam-Whyatt algorithm. After we experimented with and benchmarked the two using the Java Microbenchmark Harness (JMH), we chose Visvalingam-Whyatt for its slightly faster performance, tendency to produce smoother edges, and preservation of gradual changes over longer distances when compared to Douglas-Peucker. Some of these characteristics are noticeable even when comparing the simplification of a small example isochrone in San Francisco’s Golden Gate Park, as illustrated below:
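For illustration, here is a minimal O(n²) version of Visvalingam-Whyatt for an open polyline: repeatedly drop the interior vertex whose triangle with its two neighbors has the least area. A production implementation would use a heap for efficiency and could stop at an area threshold instead of a fixed vertex budget.

```python
def triangle_area(a, b, c):
    """Area of the triangle a-b-c via the cross product; this is the
    'effective area' a vertex contributes in Visvalingam-Whyatt."""
    return abs((b[0] - a[0]) * (c[1] - a[1])
               - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def visvalingam_whyatt(points, keep):
    """Simplify an open polyline down to `keep` vertices by repeatedly
    removing the interior vertex with the smallest effective area.
    Endpoints are always preserved."""
    pts = list(points)
    while len(pts) > max(keep, 2):
        # Find the interior vertex contributing the least area.
        idx = min(range(1, len(pts) - 1),
                  key=lambda i: triangle_area(pts[i - 1], pts[i], pts[i + 1]))
        pts.pop(idx)
    return pts
```

Dropping low-area vertices first is what gives the algorithm its tendency to preserve gradual, large-scale shape while discarding small wiggles.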
With the ability to generate manageably-sized isochrones, one of the essential questions we needed to answer was whether a given job or company is reachable within the commute preference. In other words, is a given lat-long point located within an isochrone?
For this we looked to the most well-known algorithm for point-in-polygon determination, the ray-casting algorithm. The idea is that a point is only in a polygon if a ray drawn starting from the point intersects the edges of the polygon an odd number of times. This ray drawing can be simplified to a horizontal ray starting from the point and extending semi-infinitely to the right, parallel to the x-axis. The runtime complexity of this algorithm is linear in the number of vertices, and benchmarking confirmed that it would meet our performance requirements.
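A minimal version of the ray-casting test looks like this. For a multipolygon such as an isochrone, a point is inside if it falls inside any member polygon (edge cases on boundaries and holes are glossed over in this sketch).

```python
def point_in_polygon(point, polygon):
    """Ray-casting point-in-polygon test.

    Casts a horizontal ray from `point` to the right and counts how many
    polygon edges it crosses; an odd count means the point is inside.
    `polygon` is a list of (x, y) vertices; the closing edge from the
    last vertex back to the first is implied.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the ray's y-coordinate?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the ray's line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def point_in_multipolygon(point, multipolygon):
    """A point is inside a multipolygon if any member polygon contains it."""
    return any(point_in_polygon(point, poly) for poly in multipolygon)
```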
Looking ahead: Commute-based search and recommendations
We plan to incorporate the exact location of jobs with addresses into the search index to further improve the precision of location-based searches and recommendations. We’re also planning on using the isochrones when creating our search query to improve the relevance of job search and recommendations. This will allow members to search for jobs within a short commute of their home, and be notified of newly-posted jobs along their existing commute.
This project would not be possible without the hard work of our team: Andrew Dye, Dan Li, David You, Jessica Fung, Kedar Kulkarni, and Xiaoping Li, and the leadership of Caleb Johnson. We'd also like to extend a special thank you to our partners in the Company Team (Hao Liu), Ingestion Team (Jean Baptiste Chery and Oliver Juang), Samza Team, Venice Team, Brooklin Team, and Standardization Team (Amit Yadav and William Kang), GaaP Team (Kunal Kandekar), and at Bing Maps (Ching Wang, Erik Lindeman, Hua Li, Simon Shapiro, and Zhihong Zhang).