Improving Recruiting Efficiency with a Hybrid Bulk Data Processing Framework

2022 was a year of change for the Talent Acquisition industry, with nearly 50K company mergers and acquisitions completed worldwide. As of November 2023, more than 150K recruiters had switched jobs in the previous 12 months, as shown in Figure 1. These changes – whether at the organization level or the user level – result in ownership transfers of hiring entities.

Image of Talent pool report for recruiters - LinkedIn Talent Insights
Figure 1: Talent pool report for recruiters - LinkedIn Talent Insights

During mergers and acquisitions, the source company’s user licenses and data are transferred to the acquiring company. Similarly, when a recruiter transitions to their next opportunity, internally or externally, all their work is transferred to another recruiter. This includes ownership transfer of more than 15 entity types (jobs, applications, notes, saved searches, etc.). This multi-entity handover involves updating and cloning large volumes of data.

Data consistency, feature reliability, processing scalability, and end-to-end observability are key drivers to ensuring business as usual (zero disruptions) and a cohesive customer experience. To address these business requirements, we developed a system that brings together offline and nearline components, essentially making it a hybrid bulk data processing framework. When a transfer request is placed, the company administrator expects all data and licenses to be moved from a source recruiter to a different account or reassigned to another recipient. Any disruption in the transfer blocks the recruiter from carrying out the day-to-day recruiting process. Part of our mission is to make the recruiting process more effective and efficient, which includes being able to carry out transfers reliably within a determined time frame.

With our new data processing framework, we were able to observe a multitude of benefits, including 99.9% request success rates, a 78% reduction in customer escalations, and automatic recovery from transient errors. In this post, we will cover the unique challenges we faced, our solution's design and architecture, the tech stack used, and the performance results we achieved.

Concepts

There are four types of data ownership requests:

  1. Transfer: The licenses and data are cloned for the same recruiter from one account boundary to another. A typical merger & acquisition scenario.
  2. Undo Transfer: Reverting a previous transfer.
  3. Reassign: The licenses and data are transferred to a different recruiter within an account boundary. A typical next opportunity scenario.
  4. Undo Reassign: Reverting a previous reassign.
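
For illustration only, here is a minimal sketch of how these four request types might be modeled; the enum name and values are our own and do not reflect the internal schema.

// Hypothetical model of the four data ownership request types (names are illustrative).
public enum OwnershipRequestType {
  TRANSFER,       // clone licenses and data for the same recruiter across account boundaries
  UNDO_TRANSFER,  // revert a previous transfer
  REASSIGN,       // move licenses and data to a different recruiter within an account boundary
  UNDO_REASSIGN   // revert a previous reassign
}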

Unique challenges

Complex interdependence between entities

The entities that change ownership have complex relationships. For example, an application cannot be cloned to the new owner unless the corresponding job is cloned and remapped. Notes cannot be remapped unless the parent entity (job or applicant) is remapped, so it becomes important to maintain the strict dependency ordering that exists between various entities. Figure 2 captures the dependency of all the major entities.

Diagram of Entity dependency graph (hierarchically from top to bottom)
Figure 2: Entity dependency graph (hierarchically from top to bottom)

Bursty write traffic 

The data ownership changes result in bursty write traffic to all the entity Source-of-Truth tables of the database. The system must support significant write QPS without introducing data inconsistencies.

Idempotence 

With multiple moving parts and different service-level agreements (SLAs) offered by different infrastructure components, transient issues are inevitable. The system must retry requests, which can result in duplicate processing, so all operations must be idempotent.
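
To illustrate one common way to satisfy this requirement, the sketch below checks an entity transfer record before cloning, so a retried or duplicated message becomes a no-op. The class, interfaces, and method names are hypothetical and simplified; they are not the actual entity processors.

// Hypothetical collaborators; the real interfaces are not part of this post.
interface EntityTransferRecordStore {
  boolean exists(String requestId, String entityId);
  void save(String requestId, String sourceEntityId, String clonedEntityId);
}

interface EntityClient {
  String cloneToDestination(String sourceEntityId); // returns the ID of the cloned entity
}

// Illustrative sketch of an idempotent entity processor.
public class IdempotentEntityProcessor {
  private final EntityTransferRecordStore transferRecords;
  private final EntityClient entityClient;

  public IdempotentEntityProcessor(EntityTransferRecordStore transferRecords, EntityClient entityClient) {
    this.transferRecords = transferRecords;
    this.entityClient = entityClient;
  }

  public void process(String requestId, String entityId) {
    // If a transfer record already exists for this (request, entity) pair,
    // a previous attempt has already succeeded and this message is a duplicate.
    if (transferRecords.exists(requestId, entityId)) {
      return;
    }
    String clonedEntityId = entityClient.cloneToDestination(entityId);
    // Persist the mapping so future duplicates are skipped and reverts remain possible.
    transferRecords.save(requestId, entityId, clonedEntityId);
  }
}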

Support revert 

The underlying system should accurately track all the data ownership updates to support the customer requests to revert operations.

Entity cardinality 

The number of records to be processed in a request varies significantly across entity types, leading to different processing needs. For example, a recruiter might have one or two inboxes, while the number of jobs they manage could range from a few tens to hundreds, and the volume of job applications and stored notes can be considerably higher still.

Guiding principles

To achieve the business requirements and to address the unique challenges, we followed these five principles:

  1. Consistent data - The system should make sure no data is left behind. All the source data should transition over to the destination user.
  2. Observable - The system should track all data movement requests and clearly reason about failures and inconsistencies.
  3. Durable - The system should auto-recover from transient failures and push for eventual success of the request.
  4. Configurable - Enable plug and play. The entity owners should be able to implement the processing interfaces and easily plug into the framework.
  5. Scalable - The system should be able to scale to all the entity types irrespective of their cardinality values.

Solution

Let's explore the solution, use cases, and architecture in greater detail.

Why Hybrid

As discussed in the unique challenges section, every entity is different and has its own requirements. Some entities are easier to find and process than others. The framework should be able to accommodate the processing requirements of present and future entities. A hybrid approach, with offline and nearline components working in tandem, can meet these requirements in a highly reliable, scalable, and observable manner. Entities that are orders of magnitude larger, or that require heavy processing, scale better with offline components. Therefore, we recommend that entities that typically take more than 100 seconds to process onboard to the offline path, and the rest to the nearline path.
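
As a rough illustration of this guideline, the snippet below sketches how an entity's handler type might be chosen from its estimated processing time. The threshold mirrors the 100-second guidance above; the class and method names are hypothetical.

// Illustrative sketch of the offline-vs-nearline routing guideline described above.
public final class WorkflowHandlerRouter {
  public enum WorkflowHandlerType { OFFLINE, NEARLINE }

  private static final long OFFLINE_THRESHOLD_SECONDS = 100; // from the guidance above

  public static WorkflowHandlerType route(long estimatedProcessingSeconds) {
    // Large or heavy entities scale better through the offline path.
    return estimatedProcessingSeconds > OFFLINE_THRESHOLD_SECONDS
        ? WorkflowHandlerType.OFFLINE
        : WorkflowHandlerType.NEARLINE;
  }
}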

Actors and use cases

Figure 3 below illustrates the data ownership transfer process. To start, customer administrators and LinkedIn support representatives submit data ownership transfers. On failure, the representatives validate and submit customer escalation requests to Engineering. Engineering teams then work on new features, new entity onboarding, and troubleshooting issues. The bulk data processing framework, which runs on LinkedIn infrastructure, executes the workflows and transfers the data ownership of entities as part of LinkedIn Recruiter.

Diagram of Actors and use cases in a data ownership transfer
Figure 3: Actors and use cases in a data ownership transfer

Tech stack

Various internal and open source components work together to achieve bulk data processing. Some of the key components include:

  1. Rest.li - An open source REST framework for building robust, scalable RESTful architectures using type-safe bindings and asynchronous, non-blocking IO. The request endpoint and entity processors are exposed over Rest.li.
  2. Apache Kafka - A distributed event streaming platform. The requests and processing metadata are streamed over Kafka.
  3. Apache Samza - A distributed stream processing framework. Takes care of invoking online interfaces and interacting with the caching layer.
  4. Couchbase Cache - A caching solution to track active transfer requests, entity processing metrics, etc.
  5. Azkaban - A distributed workflow manager to orchestrate offline workflows.
  6. Scheduler - An in-memory scheduler to schedule a batch of requests.

Our framework’s architecture

Diagram of hybrid bulk data processing framework architecture
Figure 4: Hybrid bulk data processing framework architecture
  1. The requests are submitted to the online endpoint which makes an entry in the DB and schedules the batch for processing. Note: Multiple requests are grouped into hourly batches and then executed.
  2. The Workflow Manager reads an entity dependency config and accordingly processes the batch of requests. It also takes status input from the Status Monitoring component and decides if the failed requests should be retried or the successful requests should be moved to the next step in the workflow.
  3. The entities can be processed partly offline and partly nearline or completely nearline based on the entity processing requirements.
    1. Offline + Nearline + Entity Clients - The offline job identifies all the entity records to be processed and leverages a nearline single entity processor along with entity clients for processing all the records (one record per Kafka message). The nearline system hosts rate limiters per entity to control bursty write traffic. 
    2. Nearline + Entity Clients - Through a nearline aggregate processor the entity clients identify and process all the records (multiple requests per Kafka message).
  4. The requests in the batch and the status of each request, on a per-entity basis, are tracked via the cache. The cache captures all the transient metrics that the Workflow Manager uses for decision making. The same information is surfaced in the monitoring dashboard.
  5. The data store maintains two types of records:
    1. Request Records - These are the ownership transfer requests as submitted by the end user. They track the source, destination, type of request, status of the request, etc.
    2. Entity Transfer Records - These are granular details of the exact entity that was transferred during the process. Entity ID, entity type, and other entity-specific information are tracked here. Entity processors can leverage this data store to support revert operations and entity look-ups to support idempotency (see the sketch after this list).
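
To make the two record types concrete, here is a minimal sketch of the fields they might track; the field names are hypothetical and simplified, not the real schemas.

// Illustrative sketches of the two record types kept in the data store (field names are hypothetical).

// One ownership transfer request as submitted by the end user.
record RequestRecord(
    String requestId,
    String sourceRecruiterId,
    String destinationRecruiterId,
    String requestType,   // TRANSFER, UNDO_TRANSFER, REASSIGN, or UNDO_REASSIGN
    String status) {      // e.g., SCHEDULED, IN_PROGRESS, SUCCEEDED, FAILED
}

// One granular entity-level transfer performed while serving a request.
record EntityTransferRecord(
    String requestId,
    String entityType,    // e.g., JOB, APPLICATION, NOTE
    String sourceEntityId,
    String clonedEntityId) {
}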

Config File

Our framework leverages a config file to understand the entities to be processed, the dependencies between them, the nature of processing (offline vs. nearline), etc. The stages config groups entities such that entities in the same stage are processed in parallel, while entities in Stage_j are processed only after all entities in Stage_i (i < j) have been processed, addressing the complex entity interdependencies (see the stage-ordered execution sketch after the config examples below).

  1. transferEntityType - Represents the entity to be processed
  2. workflowHandlerType - It can be Offline or Nearline, and determines how the entity is processed
  3. flowName - This key is required for offline entities to specify which flow should be triggered in the offline component
  4. lixKey - This key is used to test and ramp newly onboarding entities

Entities

<property name="SeatTransferWorkflowManagerService.workflow">
 <list>
   <map>
     <entry key="transferEntityType" value="ENTITY1"/>
     <entry key="workflowHandlerType" value="NEARLINE"/>
   </map>
   <map>
     <entry key="transferEntityType" value="ENTITY2"/>
     <entry key="workflowHandlerType" value="OFFLINE"/>
     <entry key="flowName" value="$[seatTransfer.job.flowName}"/>
   </map>
   <map>
     <entry key="transferEntityType" value="ENTITY3"/>
     <entry key="workflowHandlerType" value="NEARLINE"/>
   </map>
   …………………………
   …………………………
   <map>
     <entry key="transferEntityType" value="ENTITY(N-1)"/>
     <entry key="workflowHandlerType" value="NEARLINE"/>
   </map>
   <map>
     <entry key="transferEntityType" value="ENTITY(N)"/>
     <entry key="workflowHandlerType" value="OFFLINE"/>
     <entry key="flowName" value="$[seatTransfer.tag.flowName}"/>
     <entry key="lixKey" value="lix.key.identifier"/>
   </map>
 </list>
</property>

Stages

<property name="SeatTransferWorkflowManagerService.stages">
  <map>
    <entry key="0" value="ENTITY1,ENTITY2"/>
    <entry key="1" value="ENTITY3,ENTITY4"/>
    <entry key="2" value="ENTITY5, Entity6"/>
    …………………………
    …………………………
    <entry key="M" value="Entity(N-2), Entity(N-1), Entity(N)"/>
  </map>
</property> 
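
For illustration, the sketch below shows how a workflow manager might interpret the stages config above: entities within the same stage are processed in parallel, and a stage starts only after the previous stage completes. The class, interface, and method names here are hypothetical and simplified.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of stage-ordered processing; the stagesConfig map mirrors the
// stages property above (stage index -> comma-separated entity types).
public class StageOrderedExecutor {
  public interface EntityHandler {
    CompletableFuture<Void> process(String entityType, String batchId);
  }

  private final EntityHandler handler;

  public StageOrderedExecutor(EntityHandler handler) {
    this.handler = handler;
  }

  public void execute(Map<Integer, String> stagesConfig, String batchId) {
    // A TreeMap guarantees ascending stage order: Stage_j runs only after Stage_i (i < j) completes.
    for (Map.Entry<Integer, String> stage : new TreeMap<>(stagesConfig).entrySet()) {
      List<CompletableFuture<Void>> inFlight = Arrays.stream(stage.getValue().split(","))
          .map(String::trim)
          .map(entityType -> handler.process(entityType, batchId))
          .toList();
      // Entities within the same stage are kicked off in parallel; wait for all before the next stage.
      CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }
  }
}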

Performance

Durability

Extract-Transform-Load (ETL) delay is a unique attribute of offline datasets that might affect our consistent data principle. Online data is ETL’ed into offline storage on a regular cadence to be available for offline processing, which results in a potential delay between when data is produced and when it is available for processing.

Our framework is highly durable and pushes for eventual success. It executes the workflow across two iterations and two retries, with the iterations spaced apart, and tries to reach the end state within a well-defined SLA. There are two durability parameters: iteration and retry (a simplified sketch follows the list below).

  1. Iteration - The entire workflow is processed twice i.e., as part of two iterations spaced by a few hours. This ensures no data is left behind either because of ETL delays in the offline component or because of transient failures.
  2. Retry - Every step of the workflow is retried if one of the requests in the batch fails. The retry picks only failed requests and re-processes them to ensure the system recovers from transient failures.
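
A minimal sketch of how the retry parameter might work within a single iteration is shown below; the constants, interface, and method names are illustrative assumptions, not the production values.

import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the per-step retry loop: only the still-failing requests are re-processed,
// and anything left over is picked up again by the next iteration of the workflow.
public class DurableWorkflowStep {
  public interface StepProcessor {
    // Processes the given requests and returns the subset that failed.
    Set<String> process(Set<String> requestIds);
  }

  private static final int MAX_RETRIES = 2; // illustrative value

  public static Set<String> runWithRetries(StepProcessor step, Set<String> batchRequestIds) {
    Set<String> pending = new HashSet<>(batchRequestIds);
    for (int attempt = 0; attempt <= MAX_RETRIES && !pending.isEmpty(); attempt++) {
      // Each attempt re-processes only the failed requests, so successful work is never redone
      // and transient failures are absorbed.
      pending = step.process(pending);
    }
    return pending; // remaining failures are retried in the next iteration, hours later
  }
}
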
Image of Durability - success on retries
Figure 5: Durability - success on retries

As shown in Figure 5, there were 23 failures during Iteration1 - Retry0, which were mitigated during Iteration1 - Retry1. Further, no such errors were observed during Iteration2 - Retry0, confirming that the failures were transient.

Observability

Monitoring is one of the most important components of this framework because we have long-running workflows. At any point in time, we want to track how many batches are running and the exact number of requests per batch, as well as which requests have failed to process certain entities while many others have succeeded.

For greater observability, we have enabled email notifications and developed a monitoring dashboard. For email notifications, successfully executed batches and failed batches (with the specific failed request IDs) are reported to the group via email, and a daily summary is posted to the entire group with the overall health status of the framework.

Every batch is spaced hourly, so in the monitoring dashboard one can retrieve all the requests and their statuses by entering the batch ID or the hour timestamp of the batch. Additionally, there is a provision to retrigger specific requests of a batch.

Image of Seat Transfer V2 Search
Images of Observability - all metrics available in a dashboard
Figure 6: Observability - all metrics available in a dashboard

Conclusion

In this blog post, we have discussed the salient features of the hybrid bulk data processing framework. It is highly durable, observable, and configurable, and it scales easily to various entity requirements. We’ve deployed this framework to production, where it has been running successfully for more than five months, processing 4K+ requests per week. We have onboarded 15+ entity types, roughly six of them offline and the rest nearline.

We’ve achieved the following:

  1. 99.9% request success rate
  2. Reduced customer escalations by ~78% in the last 6 months
  3. Expected weekly reduction in Customer Support Effort of 10K hours
  4. Mean Time To Detect (MTTD) request failures, including identifying the failing process and the request ID, is now in minutes
  5. Automatic recovery from transient errors

Figure 7 presents the seat transfer success rate aggregated over various weeks of 2023. The results are captured from the point of feature ramp, i.e., week 9 (early March). The system achieved the SLA of a 99.9% success rate during the majority of the weeks, and it was able to scale and handle all the traffic coming its way.

Image of Weekly aggregate of the completion rate
Figure 7: Weekly aggregate of the completion rate

Acknowledgements

It takes a village to build something significant! This project was possible with the tremendous effort and collaboration from the following members:

  1. Engineering Team: Abhishek Agrawal, Aditya Hegde, Krunal Rank, Piyush Masrani, Rahul Sule, Rajesh Palakurthi, Saumi Bandyopadhyay, Shen Shen, Si Chang, Xie Lu
  2. Partner Team: Abhinav Gosavi, Chunnan Yao, Rohini Bhimpure, Weitong Di, Yezhong Xu, Zhaokang Li, Nanda Kishore Krishna, Jeremy Chuang, Luke Flesch, Yi Zhao