New Recruiter & Jobs: The largest enterprise data migration at LinkedIn
July 23, 2021
In August 2019, we introduced our members and customers to the idea of moving LinkedIn’s two core talent products—Jobs and Recruiter—onto a single platform to help talent professionals be even more productive. This single platform is called the New Recruiter & Jobs.
Figure 1: New Recruiter & Jobs experience
While the idea is simple to explain, updating and migrating the backend systems was far less straightforward, largely because of the difficulty of moving these systems’ existing data from the legacy database to the new database. To put it plainly, as an engineering team, we needed to move our customers from one plane to another while in flight.
The biggest challenge was migrating data to a different storage system consistently, at scale, without disrupting business. In order to ensure a smooth transition experience for all existing customers, we established the following two success criteria for the data migration effort:
No data discrepancy: Customers should not see any data discrepancies after onboarding to the New Recruiter & Jobs experience. Examples of data discrepancies include not only data loss and data difference, but also data duplication and rebirth of deleted data.
No downtime: LinkedIn offers enterprise products with high reliability and availability for customers all over the world, within our SLA. Recruiter is an essential tool for talent professionals to complete their day-to-day jobs. Therefore, no downtime due to an outage or maintenance should be introduced to our customers for data migration.
In the rest of the post, we will discuss our unique challenges, the general architecture for a successful data migration, and the thought process we followed. Then, we’ll apply the general architecture and thought process to our unique circumstances and explain in more detail our specific system architecture. In the “Ramp process” section, we’ll explain the inherent difficulties in satisfying our success criteria and describe how we overcame these difficulties and fulfilled the success criteria practically.
Eventual consistency
The storage system for New Recruiter & Jobs changed from a single-shard RDBMS to LinkedIn’s distributed key-value database, Espresso. Implementing dual writes with distributed transactions across two such systems, or maintaining data constraints between them, is extremely hard. Without distributed transactions, eventual consistency is the best the migration system can achieve.
Complex many-to-many entity mapping
The New Recruiter & Jobs backend data models are significantly different from the legacy Recruiter application data models. The new data models were designed to make the new platform flexible enough to incorporate future product use cases. The entity mappings between the old and new data models were not direct one-to-one mappings; most were many-to-many. These complex entity mappings were very specific to the requirements of the application and could not be automated. A high level of craftsmanship, attention to detail, and uncompromising diligence were required to ensure high code quality with few bugs in this area, and to secure the success of the data migration.
Complex entity dependency
Most entities that need to be migrated have complex, interrelated dependencies. For example, a candidate can only be added to a project after the project is created; a candidate can only be moved to a state after the state is created in the hiring pipeline. Therefore, the order of data synchronization of different entities must strictly follow the dependency order.
Figure 2: Sample entity dependency graph
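Ordering requirements like these are naturally expressed as a topological sort over the entity dependency graph. The sketch below illustrates the idea with a small, hypothetical graph (the entity names are illustrative, not LinkedIn's actual schema):

```python
from graphlib import TopologicalSorter

# Hypothetical entity dependency graph for illustration: each entity
# maps to the set of entities that must be migrated before it.
ENTITY_DEPS = {
    "project": set(),
    "hiring_state": {"project"},
    "candidate": {"project", "hiring_state"},
    "candidate_note": {"candidate"},
}

def migration_order(deps):
    """Return an entity ordering that respects every dependency edge."""
    return list(TopologicalSorter(deps).static_order())
```

Any valid ordering produced this way guarantees, for example, that projects exist before candidates are added to them.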
Bursty traffic from bulk operations
LinkedIn Recruiter allows customers to perform bulk operations, including candidate imports from spreadsheets, seat transfers, and removing a state from the hiring pipeline. These bulk actions usually produce bursty write traffic to the source database and are therefore hard to migrate while keeping “time to converge” under a few minutes.
The design principles
A data migration is a process that replicates data from a source database to a destination database within a given time frame, where the source and destination are two independent systems. These are the three basic principles we used to assess our data migration:
Convergence in data: the difference between data at the source and destination diminishes
Convergence in time: data convergence happens within the expected timeframe
Convergence in the engineering process: the data migration system converges to being bug-free
The principles behind a successful data migration
The following figure demonstrates the high-level architecture of a proper data migration system. To successfully migrate data from the legacy database to the new database, several building blocks are needed: dual write components, online/nearline validation and fix components, a validation queue, and offline validation systems. In the following paragraphs, we will explain the rationale behind these building blocks.
Since the source and the destination databases are two separate transactional systems, there is no guarantee that each data sync transaction always succeeds, nor is it practical to expect proper ordering of such transactions. To ensure data convergence, a feedback loop is introduced so that all data are validated against the source of truth and fixed as needed. The feedback loop is a critical part of the system and guarantees both convergence in data and convergence in time: the dual write, validation, and fix parts of the loop ensure data consistency, while the online and nearline parts ensure the data are replicated in a timely manner. The nearline components need to revalidate after their own fixes to ensure that data is converging, not diverging, after the fix operation.

Convergence in time is guaranteed by the convergence of a geometric series: if system availability is better than 99%, then at most 1% of operations fail on the first attempt, 1% of that 1% fail again on the next pass, and so on. The discrepancies diminish exponentially. The offline part is needed for the initial bootstrap and any subsequent full data scan (in case of sustained infrastructure failures). Due to the convergence of the geometric series, the amount of inconsistent data in the feedback loop should diminish to little more than the incoming data volume, with a small time delay. That delay is only a function of the incoming data volume and the system throughput limit, which you can provision to keep the delay minimal.
Figure 3: The feedback loop for convergence in data and time
As with any software engineering process, a good outcome is only guaranteed with a very high level of craftsmanship. To achieve that, we built an engineering feedback loop on top of the data migration feedback loop: whenever the data migration feedback loop did not converge, we used the inconsistent data stuck in the loop for manual debugging, and repeated this process until convergence was achieved. In addition to direct data comparison for validation, we also added shadow reads closer to the frontend to simulate a “customer’s-eye view” of the data as an additional safety net to ensure true data consistency.
In order to ensure the convergence of the data migration, we needed to be able to tell whether the convergence was happening within the desired parameters. These were some important metrics to measure:
Data consistency overall as a general quality indicator
Data consistency among data not recently updated, which should stay at 100% as an indication that data that should have converged did actually converge
Active data in the feedback loop as an indicator of time to converge
Time in queue, which should remain low
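As a sketch, the second metric (consistency among data not recently updated) might be computed as below; the one-hour settle window is an arbitrary illustrative choice, not the actual threshold used:

```python
def consistency_metrics(records, now, settle_window_secs=3600):
    """Compute overall consistency and consistency among 'settled'
    records, i.e., records old enough that the feedback loop should
    already have converged them. `records` yields tuples of
    (updated_at_epoch_secs, is_consistent)."""
    total = consistent = settled = settled_ok = 0
    for updated_at, ok in records:
        total += 1
        consistent += ok
        if now - updated_at > settle_window_secs:
            settled += 1
            settled_ok += ok
    return {
        "overall": consistent / total if total else 1.0,
        "settled": settled_ok / settled if settled else 1.0,
    }
```

A healthy migration keeps the "settled" figure pinned at 100% while the "overall" figure dips only with fresh, still-converging writes.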
Design and implementation choices
Combining the success criteria we set up and the design principles we established, here are some critical choices we made in designing and implementing the New Recruiter & Jobs data migration system.
Data migration should cover all existing customers, regardless of when they are rolled out to the New Recruiter & Jobs experience. LinkedIn's hiring platform is a multi-tenant environment, but there should be no difference in how different customers’ data are migrated. This simplifies engineering operations, ensures early detection of problems, and allows customers more flexibility on when to adopt the new experience.
Historical data bootstrap
All historical data needs to be backfilled into the new system at a controlled rate by leveraging spare capacity of the system. While bootstrapping can take several days to complete, it will be a one-time effort that guarantees all historical data are correctly migrated. Bootstrap logic should be shared with the continuous data synchronization wherever possible to establish a consistent pattern of data migration.
Continuous data synchronization
For the data migration to work, all we strictly need is for the data to converge within the correct time frame when the source-of-truth switchover happens. We decided to run the data synchronization feedback loop continuously to keep the system load proportional to real traffic. If data inconsistency is continuously held at a very low level, there is no last-minute scramble to validate or fix large amounts of data, and the operation is much smoother and more predictable. When all underlying infrastructures are operating normally, the final switchover becomes a no-op beyond flipping the switch at a predetermined time.
As a consequence of the continuous feedback loop, the data migration system should be capable of reacting to any malfunctions automatically. It should have the ability to examine any data discrepancies caused by system failures and take appropriate corrections in a timely way. Both data discrepancies and time to discover and fix any discrepancies should converge to 0 when all systems are operating normally.
When there are persistent failures in the underlying infrastructures, convergence can be violated and the source-of-truth database cannot be switched. After the systems recover from the failures, the feedback loop detects and fixes all the resulting data discrepancies, and convergence resumes. In particular, we scheduled the offline data validation to run regularly so that convergence was recovered within 24 hours of infrastructure recovery.
One-way data sync
The New Recruiter & Jobs is a fully redesigned experience for LinkedIn Recruiter, with many new product features based on deliberate customer research. The functionality and data in the two systems are not one-to-one mapped, so it is not practical to go back and forth between the two systems.
Full situational awareness
Monitoring and alerting are critical in every aspect of the migration process. Consistency, scalability, and system health metrics are crucial to the success of a self-healing data migration system. Ultimately, all the metrics in place should help answer whether data discrepancy has converged to 0 at the time of the ramp switchover. We also introduced internal alert emails on the entity level with more specific information during nearline validation and fix. This was essential for supporting the convergence process by enabling engineers to investigate and troubleshoot any potential issue in the feedback loop.
In the New Recruiter & Jobs data migration system, we implemented the aforementioned feedback loop into the system with the following key parts:
Offline bulk loads for historical data bootstrap
Online dual writes for continuous data synchronization
Nearline change stream verification for self-healing capability driven by data change capture
Online shadow read verification for self-healing capability driven by user interaction
Offline bulk verification for self-healing capability with full coverage, but with some delay
Figure 4: Migration system architecture overview
Offline bulk loads
LinkedIn’s ETL infrastructure allows data ingestion from online data storage to HDFS for offline processing. It applies incremental data changes on top of a database snapshot and publishes merged incremental snapshots every 8 to 12 hours.
For each type of entity to migrate, the historical data bootstrap pipeline compares the ETL copy of the source data and target data with predefined entity mapping logic. The bootstrap workflow runs Spark and Pig jobs on Hadoop to prepare the backfill patch for further processing. Due to the entity dependency graph, multiple rounds of backfill are required and they have to happen in the correct order.
For entities that contain billions of records to migrate and require only idempotent operations, bulk loading data from offline directly into the target database is preferred. Bulk jobs give priority to live traffic to reduce impact on customers and run on the spare capacity of the shared multi-tenant database cluster. Note that certain records backfilled this way might need further corrections due to the staleness of the ETL data.
Figure 5: QPS for bulk loading data into target database can be as large as 15.9K
Other entities are fed into the online system for backfill via a Kafka push job from an offline pipeline. The online system is able to get the latest copy of the source data and write it to the target database at a controlled rate. Similar to offline bulk jobs, the rate limiter is capacity-aware and always gives higher priority to live traffic to reduce impact on customers. The verification and fix components are shared across other online and nearline flows so that different triggers converge to one consistent way of migrating data.
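The capacity-aware rate limiting described above can be sketched in a few lines; the 20% headroom is a hypothetical safety margin, not LinkedIn's actual configuration:

```python
def backfill_budget(cluster_max_qps, live_qps, headroom=0.2):
    """QPS available to backfill writers after reserving capacity for
    all live traffic plus a safety headroom, so live customer traffic
    always has priority over migration writes."""
    spare = cluster_max_qps - live_qps - headroom * cluster_max_qps
    return max(0.0, spare)
```

When live traffic spikes, the budget shrinks toward zero and the backfill simply pauses rather than competing with customers.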
Online dual writes
For continuous data synchronization, any update in the legacy Recruiter application needs to be dual written to the New Recruiter & Jobs platform. Due to the high complexity of implementing distributed transactions across two separate application stacks with different storage solutions, the write to the new platform happens asynchronously, only upon a successful commit to the legacy system.
Online dual writes are not meant to impact customers, so the legacy Recruiter application should be agnostic to the result of dual writes. At the same time, any failures during dual writes need to be tracked and fed into a self-healing queue to trigger validation and fixes.
Online dual writes may fail and cause data inconsistencies under certain circumstances. In most scenarios, however, this approach ensures that different types of entities are written to the new platform in the desired order, following the entity dependency graph, almost instantly after the initial write to the legacy system. This is an important part of keeping convergence time low, because it keeps the number of items requiring a fix at a very low level.
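A minimal sketch of the asynchronous dual write, with hypothetical `legacy_db`/`new_db` interfaces; failures never propagate to the caller and are instead queued for validation and fix:

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

log = logging.getLogger("dual_write")
executor = ThreadPoolExecutor(max_workers=4)
self_healing_queue = Queue()  # consumed by the validate-and-fix loop

def dual_write(legacy_db, new_db, entity):
    """Commit to the legacy system first; mirror the write to the new
    platform asynchronously so customers never see dual-write failures."""
    legacy_db.write(entity)  # legacy remains the source of truth until switchover

    def mirror():
        try:
            new_db.write(entity)
        except Exception:
            log.exception("dual write failed for entity %s", entity["id"])
            self_healing_queue.put(entity["id"])  # trigger validation and fix

    executor.submit(mirror)
```

Because the mirror write runs off the request path, a flaky new-platform write costs nothing to the customer and simply feeds the self-healing queue.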
Nearline stream verification
Brooklin is the default change-capture service for many data sources at LinkedIn. Any update to the source database will be streamed via Brooklin. This nearline processing flow provides an additional opportunity for the migration system to systematically validate whether the source update in the legacy system is correctly reflected in the new database. Any validation failures during Brooklin stream processing are fed into the self-healing queue for corrections.
The nearline stream validation success rate is closely monitored as an indicator of online dual writes reliability. It is a critical factor to determine whether the time to converge is close to zero, i.e., whether last-minute changes in the legacy database can be replicated in the target database before flipping the final switch.
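The per-event validation step can be sketched as follows; `map_entity`, `target_db`, and the queue interface are hypothetical stand-ins for the real components:

```python
def validate_change_event(event, map_entity, target_db, self_healing_queue):
    """Check that a source-database change is correctly reflected in the
    target database; enqueue the key for correction if it is not."""
    expected = map_entity(event["record"])       # legacy -> new data model
    actual = target_db.read(expected["key"])
    if actual != expected:
        self_healing_queue.put(expected["key"])  # validate and fix later
        return False
    return True
```

The returned success/failure signal is exactly what feeds the validation success-rate metric described above.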
Online shadow read verification
The online shadow read flow verifies data consistency upon every successful read. The shadow read happens close to the API level and compares the data as shown to customers, so it indicates not only database data quality but also business-logic correctness. Shadow reads are driven by real user interactions with the system, and thus work like a dry run of switching the data source: they give the closest-to-reality metric of data consistency, as if the user had already been migrated to the new platform.
Similar to online dual writes, shadow reads happen asynchronously upon a successful read from the legacy system. Shadow read results should not impact user behavior, and any discrepancies detected via shadow reads should be tracked via Kafka events and fed into a self-healing queue for corrections.
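A sketch of the shadow read path; `read_new_system` and `emit_discrepancy_event` are hypothetical hooks (the latter would publish a Kafka event feeding the self-healing queue in the real system):

```python
def shadow_read(legacy_result, read_new_system, emit_discrepancy_event):
    """Compare what the customer saw from the legacy system with what the
    new system would have returned, without ever affecting the response."""
    try:
        new_result = read_new_system()
        if new_result != legacy_result:
            emit_discrepancy_event(legacy_result, new_result)
    except Exception:
        pass  # shadow reads are best-effort: failures must not reach users
    return legacy_result  # the customer always gets the legacy response
```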
Offline bulk verification
The offline bulk load workflows are reused for an offline bulk validation job that runs on every incremental ETL snapshot published after the initial bootstrap. The offline bulk verification flow scans all records in the legacy database that were updated more than 24 hours before the job run time and detects discrepancies in the target database. Discrepancies are sent to the self-healing queue for corrections via a Kafka push job, leveraging the same validation and fix pipeline shared among the online, nearline, and offline verification flows.
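The scan-and-diff step can be sketched as follows, with the snapshots modeled as plain dictionaries for illustration:

```python
def offline_verify(source_snapshot, target_snapshot, cutoff_ts):
    """Return keys of source records last updated before `cutoff_ts`
    (e.g., 24 hours before the job run) whose migrated value disagrees
    with the target database. Recent writes are skipped because the
    online feedback loop may still be settling them."""
    discrepancies = []
    for key, (updated_at, value) in source_snapshot.items():
        if updated_at >= cutoff_ts:
            continue
        if target_snapshot.get(key) != value:
            discrepancies.append(key)  # pushed to the self-healing queue
    return discrepancies
```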
Feedback loop queue
A queue is an essential part of the feedback loop that buffers events for data validation and only de-queues events if either the validation is successful or a correction has been applied successfully.
The queue processors are storage-capacity-aware and rate limited to prevent negative impact on the online system. On top of the basic metrics, we built in a comprehensive set of metrics monitoring system health and the status of different stages of data flow, with breakdowns to the level of entities and use cases.
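A minimal sketch of the de-queue discipline: an item leaves the loop only after it validates, either directly or after a fix followed by revalidation. `validate` and `fix` are hypothetical callbacks, and the bounded retry count is an illustrative simplification:

```python
import queue

def drain_self_healing_queue(q, validate, fix, max_attempts=3):
    """Process the feedback-loop queue: de-queue an item only when
    validation succeeds (possibly after a fix); re-enqueue otherwise,
    up to `max_attempts`, so transient failures keep converging."""
    while True:
        try:
            key, attempts = q.get_nowait()
        except queue.Empty:
            return
        if validate(key):
            continue                    # already consistent: leave the loop
        if fix(key) and validate(key):
            continue                    # fixed and revalidated: leave the loop
        if attempts + 1 < max_attempts:
            q.put((key, attempts + 1))  # stay in the feedback loop
```

Revalidating after the fix is what ensures the loop drives data toward convergence rather than away from it.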
The CAP theorem and the ramp process
By the nature of the data migration system, the legacy system and the new system form a dependent system connected via the network. We can model the data consistency problem between the two systems as the consensus problem in distributed systems. Before a customer ramp, all read operations read only from the legacy system (and its databases). After the ramp, all read operations read only from the new system (and its databases). In this model, the network is therefore effectively always partitioned on the read path between the legacy and the new system.
With such a two-sided distributed system, at most one of the two sides can form a quorum on its own. By the CAP theorem, consistency and availability cannot both be achieved across the switchover, due to the artificially partitioned network. This is a physical barrier we cannot cross, so we have to trade off between data consistency and availability when switching over to the new system. Since data inconsistency can have longer-term ramifications, we chose to guarantee data consistency by introducing a small window of unavailability to allow data sync to settle in the new system.
In order to minimize the unavailability window and minimize the customers exposed to it, we established and meticulously followed these processes:
Monitor all infrastructure systems to ensure that all data migration systems and dependent systems are operating normally and performing at the expected level.
Run data migration continuously so that all the data is kept consistent at all times, as long as the underlying infrastructures are operating normally, to avoid last-moment scrambles to get large amounts of data into consistency before a deadline.
Stop all bulk operations three days before the planned ramp to ensure consistent low data-synchronization latency.
Run the ramps on Sunday mornings to further minimize the amount of data changes in the system. This also has the benefit of allowing 100% focus in case any issues arise.
Redirect our customers into the new UI, which introduces a small, hidden window of unavailability.
The data migration system does not exist in isolation. Instead, the system lives in a dynamic environment where things do change. Establishing and following these processes meticulously was key to detecting and addressing potential problems early. The consistency in adhering to the processes over time was instrumental to our success.
The New Recruiter & Jobs data migration is one of the largest and most complicated enterprise data migration challenges at LinkedIn. To ensure the success of the data migration and customer satisfaction, we formulated the Convergence Principles for data migration, designed for success by following a principled approach, held a high bar of craftsmanship, moved fast with caution, and established and followed sound engineering processes meticulously. We were able to reach a steady-state data consistency rate well above 99.999% (most often 100%) and bring the steady-state window of potentially broken data down to no more than minutes (limited only by infrastructure SLA). The biggest obstacle to massively rolling out the New Recruiter & Jobs experience was removed.
We found the Convergence Principles formulated above to be particularly illuminating during our design and development phases. They enabled us to systematically analyze what needed to be built and to what spec they needed to be built in order for us to achieve the quality target for data migration. The principles are broadly applicable for all data migration efforts, and we hope other efforts can benefit from us sharing our experience here.