Scaling Salt for Remote Execution to support LinkedIn Infra growth
April 18, 2023
At LinkedIn, site engineers like to automate operational tasks at various infrastructure layers to minimize manual intervention, so that operations scale well and stay easy to run. Some of these automations are performed via on-demand job executions.
LinkedIn engineers have been using Salt, a Python-based, open source tool, to automate tasks at various infrastructure layers for more than a decade, owing to its high performance and pluggability. Salt comes with a rich set of execution modules which can be used directly or within custom modules. It works well for tasks such as OS upgrades, auto remediation/triage of issues, application profiling, traffic shifts, firmware upgrades, switch management and more.
Salt uses a master-minion architecture to execute actions. A basic master and minion flow in Salt is shown in Figure 1. The minion (an agent on the host) sees jobs and results by subscribing to events that the master service publishes on the event bus. Salt uses ZMQ (ZeroMQ) for high-speed, asynchronous communication between connected systems. Targeted minions execute the job on the host and return responses to the master. The master and minions secure their communication by encrypting it with AES keys.
Figure 1: Salt master minion flow via event bus
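The publish, target and return flow described above can be illustrated with a small in-process simulation. This is only a sketch: the Master and Minion classes and the job dictionary layout are hypothetical stand-ins, and it uses plain function calls where Salt uses ZeroMQ sockets and AES-encrypted payloads.

```python
import fnmatch
from dataclasses import dataclass, field


@dataclass
class Minion:
    """Hypothetical stand-in for the minion agent on one host."""
    minion_id: str

    def handle(self, job):
        # Every minion sees the published job, but only runs it when the
        # target pattern matches its own ID.
        if not fnmatch.fnmatchcase(self.minion_id, job["target"]):
            return None
        return {"jid": job["jid"], "id": self.minion_id, "ret": f"ran {job['fun']}"}


@dataclass
class Master:
    """Hypothetical stand-in for the master's publish/collect loop."""
    minions: list = field(default_factory=list)
    next_jid: int = 1

    def publish(self, target, fun):
        job = {"jid": self.next_jid, "target": target, "fun": fun}
        self.next_jid += 1
        # Publish to all subscribers, then keep only the returns that came back.
        returns = (minion.handle(job) for minion in self.minions)
        return {r["id"]: r["ret"] for r in returns if r is not None}


master = Master([Minion("web-01"), Minion("web-02"), Minion("db-01")])
result = master.publish("web-*", "test.ping")
print(result)
```

Only the two hosts matching "web-*" answer; "db-01" receives the job but ignores it, which mirrors how targeting is resolved on the minion side.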
In this post, we will share how we scaled Salt by adding layers and integrating it with LinkedIn infrastructure to achieve 10x more remote execution jobs, with more reliability than ever before.
LinkedIn has seen massive growth in the last decade, and to support this growth LinkedIn engineers expanded the infrastructure from one datacenter fabric to multiple datacenter fabrics. With thousands of microservices and growing, hardware demand has been on the order of hundreds of thousands of servers in recent years. This growth resulted in higher numbers of builds, releases, deployments, configuration pushes, network policies, system configurations, etc., and demanded significantly more from our infrastructure tooling ecosystem.
As a result, we’ve built deployment systems for thousands of applications, containerization, continuous integration and delivery, service discovery, secret management for applications, RBAC systems for data/apps/users authorization, PKI infra etc.
As we began to unravel the intricacies of our state and configuration management tooling and to scale it, we discovered multiple similar tools in use, such as CFEngine, Puppet, and Salt. This overlap was leading to poor use of engineering time, so we settled on the most useful tool for each job. The low latency and high throughput that Salt provides at large scale for task execution over a REST API made it our tool of choice for parallel task execution at the server level.
Salt at LinkedIn
Prior to 2019, we had a single Salt master setup, as illustrated in Figure 2, in which a single master host orchestrated all minions in a fabric. As the infrastructure grew, the number of minions communicating with each master increased, causing the masters to fall over often under load. This resulted in downtime and affected the various automations that depended on Salt.
Figure 2: Old Salt setup in a fabric
Besides the reliability and performance impact of single masters handling in excess of 60k minions, our Salt setup posed various operational challenges as well:
Salt development had almost no code coverage and faced various challenges related to building and maintaining multiple products: the 6 RPMs required regular upgrades, the client libraries were buggy, the CLI was non-intuitive, and monitoring was challenging.
Managing Salt master failover manually.
Managing minion and master configs via a complex set of CFEngine policies.
Managing Salt API SSL certs for each data center fabric setup.
Managing Salt masters cnames for master discovery for each data center fabric.
Lack of security checks for clients’ Salt modules, and lack of module ownership.
Salt components were managed via self generated RPMs, making upgrades challenging and time consuming.
Single master reaching almost 65K minions per master in production, resulting in poor performance at times.
Remote Execution only via REST APIs
Until 2019, we were using Salt for use cases such as config management, state management, log management, remote execution jobs, artifact distribution, application deployments, application performance analysis tools, auto remediation, user account management, network device maintenance, and traffic shifts. As it became obvious that we had multiple tools at LinkedIn for the same job, we began to isolate the right tool for each job and restricted the use of Salt to remote execution use cases via the REST API. Use cases related to config management, state management, etc. were ported over to Puppet and other pieces of infrastructure built for specific purposes. Our plan was mainly inspired by one of the 19 aphorisms in The Zen of Python by Tim Peters:
“There should be one-- and preferably only one --obvious way to do it.”
Our purpose in rebuilding the Salt infrastructure was not just to scale it but also to simplify all of its operational aspects and improve the client experience. In mid-2019 we began to re-architect the LinkedIn Salt ecosystem by integrating it with our development and deployment infrastructure to leverage the numerous benefits it provides, like CI/CD, deployment workflows, service discovery, application config management, secret management, containerization, managed service certifications, etc. The benefits were immense.
We restructured the development of Salt by creating five new Python multiproducts, of which li-salt-master and li-minion use upstream Python Salt as their main dependency. A multiproduct template is a development framework that helps with various aspects of application development: package building using pygradle, dependency management, application config management (which can also capture the application's secrets), code coverage via unit tests, mypy for type checking, and flake8 for style and syntax checking. The purpose of each multiproduct we created is explained briefly in the following bullets:
li-salt-master: Deployable master and API service which orchestrates minions and exposes new Salt REST API endpoints for clients.
li-minion: Installable Python agent which gets installed on all hosts/servers. It is wrapped and packaged as an RPM with customized code which automatically discovers relevant master hosts and generates the minion config on every start.
lipy-lisaltmaster: Python library for clients. For non-Python clients, e.g., Java or Go, simple curl examples are documented.
lisaltmaster-fileroot: Contains all Salt client ACLs and custom modules. This product enforces security checks on clients’ modules to ensure clients are following safe coding practices.
salt-execute: This CLI command allows teams to execute their modules via the /execute API endpoint exposed by our new design.
The flexibility built into the upstream Salt project allowed us to create custom plugins/modules to set up the flow of our new architecture.
li-salt-master is a Python application which acts as a monolithic service and pins the upstream Salt library as a direct dependency. On deployment it starts three services: Salt-master, Salt-api and Nginx.
Figure 3: Individual li-salt-master setup with few li-minion
- The Salt-master authenticates and authorizes approved clients, publishes job instructions to minions and collects job responses from minions.
- The Salt-api is written on top of the Salt netapi, uses the Python CherryPy framework, and exposes REST APIs for clients. It also exposes an endpoint which is used for service health and discovery.
- Nginx is used primarily as a reverse proxy for achieving mTLS authentication.
We overrode some Salt functions by plugging in custom modules and utilized knowledge provided in the multimaster tutorial.
Config: Salt and Nginx configs are generated dynamically from Jinja templates. Placeholders in the templates are filled with values from application configs and secrets such as master private keys, paths, MySQL database passwords, etc. To keep the masters redundant, all master hosts share the same private/public key pair, so that any minion can connect to any master host in the deployed cluster.
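A minimal sketch of that render step follows. It swaps in Python's standard-library string.Template for Jinja so that it runs without third-party packages. The option names publish_port, ret_port and pki_dir are ordinary Salt master settings, but the template itself, the token_store section and the mysql_password key are hypothetical and are not taken from the actual li-salt-master templates.

```python
from string import Template

# Illustrative template only; a real deployment would keep this in a file.
MASTER_TEMPLATE = Template(
    "publish_port: $publish_port\n"
    "ret_port: $ret_port\n"
    "pki_dir: $pki_dir\n"
    "# hypothetical custom section for the auth token store\n"
    "token_store:\n"
    "  password: $mysql_password\n"
)


def render_master_config(app_config: dict, secrets: dict) -> str:
    """Fill template placeholders from application config and secrets."""
    values = {**app_config, **secrets}
    # substitute() raises KeyError for a missing placeholder, so an
    # incomplete config fails at render time rather than at service start.
    return MASTER_TEMPLATE.substitute(values)


rendered = render_master_config(
    {"publish_port": 4505, "ret_port": 4506, "pki_dir": "/etc/salt/pki/master"},
    {"mysql_password": "s3cret"},
)
print(rendered)
```

Keeping secrets in a separate mapping from the application config means they can be sourced from a secret manager at deploy time and never need to live in the template or the repository.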
Auth module: We plugged in our own Salt PKI auth module which allows clients/services to use their client certs for identification and authentication.
Netapi module: We modified Salt’s existing rest_cherrypy code and added or changed the following API endpoints:
- /execute wraps the existing Salt rest_cherrypy endpoints /minions and /jobs to execute jobs on targeted minions and aggregate their responses.
- /connected reports the number of minions connected to each master in the cluster.
- /login is modified to rely on mTLS at the Nginx level.
- /stats, an existing Salt API endpoint, is extended with new metrics for the Salt master and API, such as Salt auth QPS and failures, requests per second, bytes per request, and many more.
- /admin was added to expose the overall health check of the cluster and to make the service discoverable via DNS. Nginx acts as the reverse proxy in front of it and enforces mTLS.
- All other API endpoints are made inaccessible to clients.
Figure 4: Li-salt-master auth and execute api flow in a fabric
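The wrap-and-aggregate behavior described for /execute can be sketched as a submit step followed by polling for returns. This is an assumption-laden illustration, not the li-salt-master code: FakeBackend, the execute function, its parameters and the shape of the result are all hypothetical, and the backend stands in for the upstream /minions (start job) and /jobs (look up returns) endpoints.

```python
import time


class FakeBackend:
    """Hypothetical in-memory stand-in for the /minions and /jobs endpoints."""

    def __init__(self, returns_per_poll):
        # Each entry is the set of minion returns visible at that poll.
        self._polls = list(returns_per_poll)
        self._calls = 0

    def submit(self, target, fun):
        return "20230418000000000001"  # made-up job ID

    def lookup(self, jid):
        index = min(self._calls, len(self._polls) - 1)
        self._calls += 1
        return dict(self._polls[index])


def execute(backend, target, fun, expected, attempts=5, delay=0.0):
    """Submit a job, poll for returns, and report minions that never answered."""
    jid = backend.submit(target, fun)
    returns = {}
    for _ in range(attempts):
        returns = backend.lookup(jid)
        if set(expected) <= set(returns):
            break  # every expected minion has answered
        time.sleep(delay)
    missing = sorted(set(expected) - set(returns))
    return {"jid": jid, "returns": returns, "missing": missing}


backend = FakeBackend([{"web-01": True}, {"web-01": True, "web-02": True}])
outcome = execute(backend, "web-*", "test.ping", ["web-01", "web-02", "web-03"], attempts=3)
print(outcome["missing"])
```

Reporting the missing minions explicitly lets a client distinguish "the job failed on a host" from "the host never responded", which a raw job lookup does not make obvious.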
Tokens: We plugged in custom modules for managing auth tokens, which define token creation, retrieval, listing and deletion from the store (a MySQL database serves as the store).
Engines: We plugged in custom engines for monitoring executions and for syncing Salt ACLs and modules from the lisaltmaster-fileroot datapack. The engines also emit various custom Salt master and API metrics, such as the number of connected minions, memory and CPU usage of all Salt sub-processes, disk I/O, 2xx/4xx/5xx API responses, QPS, etc.
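The four token operations can be sketched against a small SQL table. This sketch uses the standard-library sqlite3 module in place of MySQL so that it runs anywhere. The function names echo those used by Salt's built-in token backends (mk_token, get_token, rm_token, list_tokens), but the signatures here are simplified to take a database connection, and the table layout and TTL handling are assumptions.

```python
import json
import secrets
import sqlite3
import time


def connect(path=":memory:"):
    """Open the store and create the (hypothetical) tokens table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        "token TEXT PRIMARY KEY, data TEXT NOT NULL, expire REAL NOT NULL)"
    )
    return conn


def mk_token(conn, tdata, ttl=3600):
    """Create a token for the given client data and persist it."""
    token = secrets.token_hex(32)
    record = dict(tdata, token=token, expire=time.time() + ttl)
    conn.execute(
        "INSERT INTO tokens VALUES (?, ?, ?)",
        (token, json.dumps(record), record["expire"]),
    )
    conn.commit()
    return record


def get_token(conn, token):
    """Return the token data, or an empty dict if unknown or expired."""
    row = conn.execute(
        "SELECT data, expire FROM tokens WHERE token = ?", (token,)
    ).fetchone()
    if row is None or row[1] < time.time():
        return {}
    return json.loads(row[0])


def rm_token(conn, token):
    """Delete a token from the store."""
    conn.execute("DELETE FROM tokens WHERE token = ?", (token,))
    conn.commit()


def list_tokens(conn):
    """List all stored token IDs."""
    return [row[0] for row in conn.execute("SELECT token FROM tokens ORDER BY token")]
```

A shared database store, unlike per-host token files, lets any master in the cluster validate a token issued by another master, which matters once clients are load-balanced across masters.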
Reactor & Runner: We also added a custom runner module which is triggered whenever a minion fails to authenticate with the master. This handles the use case where authentication between master and minion fails after a host reimage because new key pairs were generated on the minion. The runner validates the minion before accepting its new public key on the master host.
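The decision such a runner has to make can be sketched as a pure function. The post does not say how the minion is validated, so everything here is an assumption made for illustration: the event field names, the idea of checking against a set of recently reimaged hosts, and the function names are all hypothetical.

```python
import hashlib


def fingerprint(public_key_pem: str) -> str:
    """SHA-256 fingerprint of a public key, used here only for audit logging."""
    return hashlib.sha256(public_key_pem.encode()).hexdigest()


def should_accept_new_key(event: dict, reimaged_hosts: set) -> bool:
    """Accept a new key only for a known host whose authentication just failed.

    `event` is assumed to carry the minion ID under "id", the outcome under
    "result" and the presented key under "pub" (assumed field names).
    """
    if event.get("result") is not False:
        return False  # authentication did not fail, nothing to repair
    if not event.get("pub"):
        return False  # no key presented, nothing to accept
    # Hypothetical validation rule: only hosts known to have been reimaged
    # recently may replace their stored public key automatically.
    return event.get("id") in reimaged_hosts


event = {"id": "web-01", "result": False, "pub": "-----BEGIN PUBLIC KEY-----..."}
print(should_accept_new_key(event, {"web-01"}), fingerprint(event["pub"])[:12])
```

Whatever the real validation rule is, keeping it separate from the key-acceptance side effect makes it easy to unit test and to audit.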
Li-minion is a Python product built with LinkedIn's Python Gradle tooling, which produces the RPM package that is installed on all LinkedIn hosts. It wraps the default salt-minion agent and generates the minion config before starting it (the config can differ from fabric to fabric). It also generates the systemd service definition for running the li-minion agent on a host and the logrotate configs for managing minion logs, records minion metrics locally for offline analysis, and schedules a fetch of new or modified modules from the respective master fileroot every 10 minutes. To avoid disturbing co-hosted services, li-minion defines its own resource limits, such as the number of file operations and active memory. It discovers master hosts using facts available on the host (such as fabric and tags) and integrates with LinkedIn service discovery.
Figure 6: Li-salt-masters ←→ Li-minions flow architecture, with MySQL as the auth token store (minions discover masters using the service DNS record)
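The discover-then-generate step that li-minion performs on every start can be sketched as follows. The id and master options are standard Salt minion settings, with master given as a list for a multi-master setup. The DNS naming scheme, the injected resolver and both function names are hypothetical, chosen so that the example runs without network access.

```python
def discover_masters(fabric, resolve):
    """Look up master hosts for a fabric through an injected resolver.

    The record name below is a made-up naming scheme, not LinkedIn's.
    """
    return sorted(resolve(f"salt-master.{fabric}.example.com"))


def render_minion_config(minion_id, masters):
    """Render a minimal minion config as YAML text."""
    if not masters:
        raise ValueError("no masters discovered; refusing to write config")
    lines = [f"id: {minion_id}", "master:"]
    lines += [f"  - {host}" for host in masters]
    return "\n".join(lines) + "\n"


# A fake resolver standing in for service discovery / DNS.
records = {"salt-master.fabric-a.example.com": ["master-2", "master-1"]}
masters = discover_masters("fabric-a", lambda name: records.get(name, []))
config_text = render_minion_config("web-01", masters)
print(config_text)
```

Regenerating the config on every start means a host picks up changes to the master set after a restart without any configuration push.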
lisaltmaster-fileroot is a Python product that generates datapacks (a datapack is a deployable package consisting only of static files to be dropped in some location on a host) containing client Salt ACLs and Salt modules. A Salt ACL defines who is authorized to execute which module on which hosts. Each code change is followed by a Python Bandit run during continuous integration to find common security issues in clients’ modules. Any change to the ACLs or modules builds a new datapack, which LinkedIn deployment workflows push to all Salt master hosts automatically within a few minutes, so no manual intervention is needed.
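The "who may execute which module on which hosts" rule can be sketched as a glob-based lookup. The ACL layout below is hypothetical and is not the format used in lisaltmaster-fileroot; the client names, host patterns and module names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical ACL: client -> list of (host pattern, allowed function patterns).
ACL = {
    "team-a": [("web-*", ["team_a.*", "test.ping"])],
    "team-b": [("*", ["team_b.restart"])],
}


def is_authorized(acl, client, host, function):
    """Return True if the client may run the function on the host."""
    for host_pattern, function_patterns in acl.get(client, []):
        if not fnmatchcase(host, host_pattern):
            continue
        if any(fnmatchcase(function, pattern) for pattern in function_patterns):
            return True
    # Default deny: unknown clients and unmatched rules are rejected.
    return False


print(is_authorized(ACL, "team-a", "web-01", "team_a.profile"))
print(is_authorized(ACL, "team-a", "db-01", "team_a.profile"))
```

Keeping the ACL as data in the same repository as the modules means an authorization change goes through the same review, CI and deployment path as a code change.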
Monitoring and Log Analysis
All metrics for our Salt ecosystem are emitted to the Kafka pipeline and visualized on inGraphs, LinkedIn’s internal graphing tool. Anomalies trigger alerts via Autoalerts, and on-call engineers are notified via Iris. All Salt master and API logs are streamed via Apache Kafka to Azure Data Explorer. Logs are analyzed using Kusto Query Language queries and visualized in Azure Data Explorer for real-time analysis.
Figure 7: Li-salt-master events dashboard screenshot
The new setup is much easier to scale and operate than the one we started with. The new Salt architecture supports the execution of more than 15,000 remote jobs every day across LinkedIn’s fleet of hundreds of thousands of servers, with more reliability and scalability than ever before.
Major thanks go out to all our colleagues from the infrastructure tools SRE team. In addition to contributions from Himanshu Chandwani, the project benefited greatly from the work of Aastha Nandwani, Bhavik Patel, Ritish Verma, and Sergii Shchypa. We would also like to thank Nidhi Mehta for being a TPM partner for this project. Furthermore, we highly appreciate our partner teams across LinkedIn, i.e., Production SRE, Traffic SREs, Espresso SRE, LPS SRE, OSUA team, Monitoring infra, On-demand Profiling team, InfoSec team, and other Site Engineering teams within LinkedIn who supported our efforts. Finally, this project would not have been possible without the support of our engineering leaders, Senthilkumar Eswaran and Xi Chen.