Costwiz: Saving cost for LinkedIn enterprise on Azure

Authors: Deven Walia, Vivek Subramaniam, Simon Desowza, and Karthik Subramanian

Cloud services have completely changed the way we approach infrastructure management. It’s now much easier to manage large infra requirements that have traditionally demanded an amalgamation of teams like DBA, Infra-SRE, Onprem-SMEs, network managers, and access control managers working together. However, the ease of these processes can lead to over-provisioning and under-utilization of cloud resources, resulting in increased operating expenses. Without careful monitoring and accountability in place, organizations risk getting swept away by soaring costs, compromising their ability to enhance the member and customer experience.

That’s why we built Costwiz, a tool that allows us to reduce costs by helping teams keep an eye on budgets and over-provisioned or under-utilized resources. Costwiz provides a unified experience that helps leaders drive more accurate forecasting of Azure budgets at LinkedIn with resource ownership detection, accountability, expedited remedies, and holistic data visibility (via custom dashboards). In this blog post, we will share our progress, challenges, and lessons learned from our Costwiz journey.

How Costwiz works

Costwiz detects and stops cloud cost anomalies as they occur to avoid unpleasant billing surprises. To identify where suboptimal spending is occurring, it ingests cost-cutting recommendations from Azure Advisor, an Azure service that continuously analyzes resource utilization and other metrics to help ensure an optimized Azure deployment. Costwiz automates this process, alerting teams to cost-saving options so they can save money proactively while gaining deep visibility into their cloud costs.

Costwiz creates accountability by notifying organization owners to assign these recommendations to engineers or SREs and by tracking each recommendation’s workflow. If a recommendation is not remediated within a set timeframe, Costwiz escalates the issue to the assigned person’s team and sends a summary email to organization leaders. It also gives decision-makers the information they need in a unified UI: resource utilization details, the resource’s current cost in Azure, the recommended action, potential savings, assigned engineers, and more.

Our approach to building Costwiz

For dashboards and alerts through regular emails, we decided to use native reporting tools available to us like Power BI to help us scale quickly. These tools expose the right set of data without us having to worry about engineering efforts for dedicated reporting UI or a workflow for email alerts and reporting dashboards. This allowed us to concentrate on other business problems and optimization efforts that needed attention.

Costwiz application and workflow management system

For the initial rollout, we built a single-page app for users to perform actions on their recommendations (as seen in Figure 1). The landing page lists all the resource recommendations along with metadata around resource owners (Azure security groups), recommendation message, current lifecycle status of the recommendation, due date, assigned engineer, last action message in terms of comments, and a history modal option to check the timeline of actions taken. Furthermore, engineers can access more information about subscription details, the total number of escalations, and the last escalation date.

Managers will have a custom view based on their login token, giving extended visibility into all recommendations assigned to their organization.

Figure 1: Costwiz application view 

While the application gives customers a way to perform relevant actions, under the hood the lifecycle of a Costwiz recommendation is maintained as a state grammar, and allowed state movements are described in the following diagram (Figure 2). 

Figure 2: State grammar for recommendation lifecycle management
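A state grammar like this can be modeled as a table of allowed transitions. The sketch below is illustrative only: the state names (`NEW`, `ASSIGNED`, `IN_PROGRESS`, `RESOLVED`, `DISMISSED`) and the transitions between them are hypothetical placeholders, not the actual states shown in Figure 2.

```python
from enum import Enum


class State(Enum):
    # Hypothetical lifecycle states; the real ones are defined in Figure 2.
    NEW = "new"
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    RESOLVED = "resolved"
    DISMISSED = "dismissed"


# Allowed state movements: current state -> set of permitted next states.
ALLOWED = {
    State.NEW: {State.ASSIGNED, State.DISMISSED},
    State.ASSIGNED: {State.IN_PROGRESS, State.DISMISSED},
    State.IN_PROGRESS: {State.RESOLVED, State.ASSIGNED},
    State.RESOLVED: set(),
    State.DISMISSED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the target state if the move is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Keeping the grammar in a single table means the UI and the backend can validate actions against the same source of truth.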

Costwiz pluggable framework design

Costwiz’s modular framework can be easily integrated into different systems with minimal code and configurations. The main components that drive the Costwiz workflow are as follows:

  1. App configurations: The static configurations must be fed to the Costwiz workflow by the client system utilizing the pluggable framework.

  2. Data providers: Data providers feed the system the core input data on which the workflows run, along with any supporting data. Currently, data is fetched from a database, but this can easily be extended to files, external caches, API call results, etc.

  3. Workflow execution: Once a pluggable workflow receives input from a data provider, it executes the stages of the workflow and produces outcomes such as storing results in the database, escalating, alerting owners, etc.

  4. Notification providers: This component takes the responsibility of sending out notifications and alerts via emails, Slack, etc., based on particular workflow requirements.

These components can be classified into providers and core workflow methods. To support this, we implemented a Python SDK library, which provides abstractions for the relevant providers. The client system can utilize this package and invoke the workflow methods with provider implementations. An example of a pluggable workflow can be seen in Figure 3.

Figure 3: Pluggable Framework Design
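To make the provider/workflow split concrete, here is a minimal sketch of what such abstractions could look like. It is not the actual Costwiz SDK: the class names (`DataProvider`, `NotificationProvider`), the `run_workflow` function, and the record fields (`owner`, `resource_id`, `potential_savings`) are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class DataProvider(ABC):
    """Supplies the input records a workflow runs over."""

    @abstractmethod
    def fetch(self) -> list[dict[str, Any]]: ...


class NotificationProvider(ABC):
    """Delivers alerts over some channel (email, Slack, ...)."""

    @abstractmethod
    def send(self, recipient: str, message: str) -> None: ...


def run_workflow(config: dict[str, Any], data: DataProvider,
                 notifier: NotificationProvider) -> int:
    """Notify the owner of every record at or above the configured savings threshold."""
    sent = 0
    for rec in data.fetch():
        if rec["potential_savings"] >= config["min_savings"]:
            notifier.send(rec["owner"], f"Review resource {rec['resource_id']}")
            sent += 1
    return sent


# Client-side provider implementations plugged into the workflow (hypothetical).
class InMemoryData(DataProvider):
    def __init__(self, rows):
        self.rows = rows

    def fetch(self):
        return self.rows


class ListNotifier(NotificationProvider):
    def __init__(self):
        self.outbox = []

    def send(self, recipient, message):
        self.outbox.append((recipient, message))
```

A client would swap `InMemoryData` for a database-backed provider and `ListNotifier` for an email or Slack provider without touching `run_workflow`.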

The SDK library-based implementation aims to evolve into a decentralized, platform-agnostic workflow framework. It is an installable service with storage, deployment infrastructure templates, configuration management, monitoring, and a default user interface for the Costwiz portal. The goal is to enable consumer onboarding with minimal technical and operational dependencies.

Costwiz data platform

Costwiz relies on a robust and high-performing central data platform for its operations. Instead of point-to-point integrations, the platform is built on Extract, Transform, Load (ETL) principles to handle data from various source systems. 

The Extract phase utilizes Azure Data Factory to manage data ingestion from sources like Azure Kusto Clusters, Delta Live Tables in Azure Databricks, LinkedIn's internal REST endpoints, and Azure Data Lake. Data connections are secured through Azure Key Vaults and network connectivity is protected by LinkedIn's NACL control. This helped us quickly ramp various source-sink combinations because the data factory integrates with hundreds of data storage systems via linked services. Also, the data factory provides scheduling, pipeline dependency management, and alerts out of the box.

Transformations take place in Azure Databricks, where data undergoes quality checks, processing, and reformatting into Parquet format for efficient storage and adherence to ACID principles. 

In the Load phase, business views are created from raw source data and loaded into storage systems such as Azure SQL Server, Azure Cosmos DB, and Azure Data Lake. This data is then ready to be utilized by different actors in the next phase, including Power BI for dashboards and reports, the Costwiz application UI, and Costwiz processors in the form of Azure Functions.

The entire data platform (as shown in Figure 4) is monitored for errors, and alerts are configured to be sent to Costwiz administrators using Azure App Insights, Azure Monitor, and Azure Log Workspaces. This ensures the reliability and stability of the platform.

Figure 4: Costwiz data platform overview

Data watermarking

Watermarking is a method used to track the point up to which data has been ingested into a table, so that ETL processes can determine where to pick up the next batch of data. We evaluated various watermarking strategies and decided on auxiliary metadata table driven watermarking. The following section provides a comparative analysis of these strategies:

1.  Aux metadata table driven watermarking:

  • Involves obtaining two values: left watermark and right watermark.

  • The left watermark is retrieved from an auxiliary database, representing the watermark value from the previous pipeline run.

  • The right watermark is extracted from the data source, indicating the latest data record at the time of the current pipeline's run.

  • Copy activity in Azure Data Factory is configured to copy records between these two watermarks, using where clauses in the source data query.

  • The right watermark is saved in the auxiliary database.

2.  Sink driven watermarking:

  • Similar to aux metadata table driven watermarking, except that the left watermark is obtained directly from the sink table itself.

  • The right watermark is extracted from the data source, representing the latest data record at the time of the current pipeline's run.

  • Copy activity is provisioned to copy records between these two watermarks.

3.  Change tracking information driven watermarking:

  • Relies on change tracking information available at the source.

  • The source table is configured to enable change tracking with appropriate change retention policies.

  • Change tracking versions (CHANGE_TRACKING_CURRENT_VERSION) are utilized.

  • These tracking versions are passed to the copy activity to filter data at the source, and the corresponding tracking version is updated for subsequent pipeline runs.

Pros and Cons of each strategy:

1.  Aux metadata table driven watermarking:

  • Pros: Independent of the sink datastore type, pipeline idempotent.

  • Cons: Dependency on an auxiliary store (e.g., SQL DB), additional operational cost due to the auxiliary component.

2.  Sink driven watermarking:

  • Pros: No requirement for an auxiliary store, performant as the source provides the delta records to ingest.

  • Cons: Tightly coupled to the sink datastore type, may be complex and costly for certain sink datastores like file-based storage.

3.  Change tracking information driven watermarking:

  • Pros: Pipeline idempotent, relies on source-provided delta records.

  • Cons: Limited to data sources that support change tracking, dependence on change tracking information.
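The auxiliary metadata table strategy can be sketched in a few lines. This is a simplified illustration using in-memory SQLite in place of Azure Data Factory and the real stores; the table names (`events`, `watermark`), the integer `id` watermark column, and the function names are assumptions, not the actual pipeline.

```python
import sqlite3


def incremental_copy(source, sink, aux, pipeline: str) -> int:
    """Copy rows with left < id <= right from source to sink; return rows copied."""
    # Left watermark: value saved by the previous run (0 if this is the first run).
    row = aux.execute(
        "SELECT value FROM watermark WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    left = row[0] if row else 0

    # Right watermark: latest record in the source at the time of this run.
    right = source.execute("SELECT COALESCE(MAX(id), 0) FROM events").fetchone()[0]

    # Copy only the records between the two watermarks (WHERE clause on the source).
    rows = source.execute(
        "SELECT id, payload FROM events WHERE id > ? AND id <= ?", (left, right)
    ).fetchall()
    sink.executemany("INSERT OR REPLACE INTO events (id, payload) VALUES (?, ?)", rows)

    # Persist the right watermark in the auxiliary store for the next run.
    aux.execute(
        "INSERT OR REPLACE INTO watermark (pipeline, value) VALUES (?, ?)",
        (pipeline, right),
    )
    return len(rows)


def make_db(schema: str) -> sqlite3.Connection:
    """Create an in-memory database with a single table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    return conn


source = make_db("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
sink = make_db("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
aux = make_db("CREATE TABLE watermark (pipeline TEXT PRIMARY KEY, value INTEGER)")

source.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])
first = incremental_copy(source, sink, aux, "demo")   # copies ids 1 and 2
source.execute("INSERT INTO events VALUES (3, 'c')")
second = incremental_copy(source, sink, aux, "demo")  # copies only id 3
```

Because the left watermark lives in the auxiliary store rather than the sink, the same logic works whether the sink is a database, a file store, or anything else.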

Integration with Azure resource provisioners

LinkedIn utilizes provisioners in its Azure deployments to manage resource provisioning, network infrastructure, and data plane deployment. Provisioners offer benefits such as standardized resource tags, ownership identification, version-controlled infrastructure code files, and simplified developer experience. 

Costwiz integrates with the provisioners, enabling the application to display these details. Assigned engineers can access IaC repo links to refer to the code used for resource deployment and incorporate recommended configurations for optimal sizing. 

In the future, we may introduce deeper integrations involving deployment gating, where provisioners can leverage Costwiz's cost trend data to evaluate cost quota usage and enforce constraints or approval processes before resource deployment.

Escalation mechanism

Figure 5: Escalation mechanism phases

Making people aware of cost optimizations is one thing, but ensuring they act on them is a different challenge. We saw the need for accountability from day one, so we designed an escalation engine: a workflow that ensures cost optimizations are acted on. It consists of three phases: Collect, Process, and Act (as described in Figure 5).

In the Collect phase, relevant datasets are gathered, including Azure resource recommendations, internal organization hierarchy data from LinkedIn, and subscription scoped configurations.

The escalation process includes the following key points:

  1. Notification Cadence: Notifications are sent at specified intervals. For example, assigned engineers receive daily notifications excluding weekends, while level 1 managers can be notified twice a week.

  2. Escalation Hierarchy: The maximum management level in the hierarchy is defined for incrementally looping in individuals. This ensures that higher-level managers are involved as needed.

  3. Maximum Escalations: The maximum number of escalations attempted is determined. Once this limit is reached, the recommendation's lifecycle ends, and it is no longer considered.

  4. Default CC: Organizations have the option to delegate all Costwiz notifications to specific individuals, such as an on-call team. This configuration allows those designated persons to be copied on all notifications related to their subscriptions.
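The hierarchy, escalation limit, and default CC settings above can be pictured as a small policy object. The sketch below is hypothetical: the `EscalationPolicy` fields, their defaults, and the rule of looping in one more management level per escalation are assumptions for illustration, not the actual Costwiz configuration schema.

```python
from dataclasses import dataclass, field


@dataclass
class EscalationPolicy:
    # All field names and defaults are illustrative assumptions.
    max_management_level: int = 2   # highest manager level to loop in
    max_escalations: int = 5        # after this, the recommendation's lifecycle ends
    default_cc: list[str] = field(default_factory=list)


def recipients(policy, management_chain, escalation_count):
    """Return (to, cc) for the next notification, or None if the lifecycle has ended.

    management_chain[0] is the assigned engineer, [1] their manager, and so on.
    """
    if escalation_count >= policy.max_escalations:
        return None
    # Loop in one more management level per escalation, up to the configured cap.
    level = min(escalation_count, policy.max_management_level,
                len(management_chain) - 1)
    return management_chain[: level + 1], list(policy.default_cc)
```

For example, with `max_management_level=1`, the third notification still goes only to the engineer and their level 1 manager, with the on-call alias copied if one is configured.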

During the Process phase, policies are evaluated, and exceptions are made for certain executives who are excluded from escalation notifications, such as the CEO and their direct reports. An email allow list is dynamically updated daily to account for any recent changes in the organization hierarchy.

In the Act phase, notifications are aggregated by recipient ID to prevent excessive emails. A person can receive at most three notifications per day, even when they have their own recommendations, escalations for their direct reports, and escalations for their skip-level reports.

The engine writes an audit record for every decision it makes. These records track whether a notification was attempted, succeeded, or failed, or whether the maximum number of escalations has been reached. The engine uses these audit records to evaluate whether enough time has passed since the last notification, based on the configured cadence for that subscription.
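A cadence check driven by audit records might look like the following sketch. The record shape (a `(date, status)` tuple), the status strings, and the choice to skip weekends for every recipient are assumptions made for this example.

```python
from datetime import date, timedelta


def is_notification_due(audit_records, today, cadence_days, max_escalations):
    """Decide from audit records whether another notification should go out today.

    audit_records: list of (date, status) tuples, with status being
    "succeeded" or "failed" (hypothetical values).
    """
    if today.weekday() >= 5:  # Saturday or Sunday: skip (assumed for all recipients)
        return False
    delivered = sorted(d for d, status in audit_records if status == "succeeded")
    if len(delivered) >= max_escalations:
        return False  # limit reached; the recommendation's lifecycle has ended
    if not delivered:
        return True   # nothing delivered yet
    # Enough time must have passed since the last successful notification.
    return today - delivered[-1] >= timedelta(days=cadence_days)
```

Failed attempts are ignored when measuring elapsed time here, so a delivery failure does not delay the next attempt; that is a design choice of this sketch rather than a documented Costwiz behavior.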

Resource ownership identification

The ultimate factor that defines the success of Costwiz workflows is their ability to hold resource owners accountable and drive them to take appropriate actions. This depends directly on the ability of Costwiz to determine resource ownership and notify the right owners. Initially, we relied on ownership data from an internal team, but that covered merely 10% of the relevant resources with recommendations, so we decided to explore other options and build our own system. Further, the public cloud itself doesn't handle this use case particularly well (for example, audit logs only go back 90 days).

In an ideal scenario, anyone provisioning a resource would add standard, organization-wide tags that identify its main owners, but this is not easily achievable from the first day of cloud adoption. Until the process for managing resources across the organization is streamlined, we rely on the following sources to retrieve ownership information.

  1. Resource tags

  2. Resource group tags

  3. Provisioner (internal resource management teams) details

  4. Resource activity logs

  5. RBAC (Role assignments in Azure resource groups and subscriptions)

We implemented the base ownership system in a modular way which can be extended with multiple processors for each source mentioned above. The flow diagram of the ownership system is described in Figure 6.

Figure 6: Resource ownership identification
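One way to picture a modular, extensible ownership system is as an ordered chain of processors, each looking at one source. The sketch below is an assumption-laden illustration: the resource dictionary shape, the `owner` and `team` keys, and the processor names are hypothetical, and only three of the five sources are shown.

```python
from typing import Callable, Optional

# Hypothetical resource shape: {"tags": {...}, "rg_tags": {...}, "provisioner": {...}}
Resource = dict
Processor = Callable[[Resource], Optional[str]]


def from_provisioner(res: Resource) -> Optional[str]:
    return res.get("provisioner", {}).get("team")


def from_resource_tags(res: Resource) -> Optional[str]:
    return res.get("tags", {}).get("owner")


def from_resource_group_tags(res: Resource) -> Optional[str]:
    return res.get("rg_tags", {}).get("owner")


# Processors run in priority order; new sources are added by appending here.
PROCESSORS: list[Processor] = [
    from_provisioner,
    from_resource_tags,
    from_resource_group_tags,
]


def resolve_owner(res: Resource, processors=PROCESSORS):
    """Run processors in order; return (owner, source_name) from the first hit."""
    for proc in processors:
        owner = proc(res)
        if owner:
            return owner, proc.__name__
    return None, None
```

Returning the source name alongside the owner makes it possible to measure coverage per source and to weigh less precise sources (such as RBAC) accordingly.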

We incrementally added the processors for each source based on the coverage and fine tuned them to achieve accuracy. 

  • We started with parsing provisioner details in resources and then processed the tags in resources and resource groups. This covered less than 40% of the total resources.

  • Scanning the activity logs was challenging, as scanning the last 30 days of logs for a single resource easily took more than 30 seconds. Our initial idea was to scan the last three months, but to increase accuracy by relying on recent data, we settled on a 30-day window.

  • Retrieving the role assignments for RBAC processors using Azure’s SDKs was quicker than scanning the activity logs, but the roles were assigned at the resource group and subscription level, which diluted the accuracy of ownership at the resource level.

This ownership system was primarily built for Costwiz consumption but after realizing its practical utility, we extended the system to a centralized ownership service and exposed REST APIs to compute/get owners which can be utilized by external teams as per their requirements.

Some of the metrics we observed in the process are summarized below in the Metrics and Impact section.

Why clean up Costwiz sandbox resources?

When our Productivity Engineering Group started its Azure journey, many small ad-hoc proofs of concept (POCs) and experiments were conducted on Azure around resource scalability, feasibility, learning, and more. These experiments created many resources, most of which were not cleaned up when the POCs were completed. This drove up net cost across all the Azure subscriptions we owned, to the point where the sandbox subscriptions alone accounted for around 45% of the group's allocated Azure budget.

The solution

We implemented automated cost optimization on Azure by cleaning up unused resources in our Sandbox subscriptions. Resources are cleaned up based on a time-to-live (TTL) value, and users need to extend the TTL if they want to keep the resource beyond the expiry date. The focus is on sandbox resources, which should be short-lived and removed once their purpose is served.

To onboard a subscription, we gather information from owners regarding TTL settings, exclusions, and notification preferences. Users are expected to create a resource group for their POCs under their sandbox subscription and place resources within it.

Expiry dates are added as Azure tags to resource groups, with a default value of two weeks. The users of those resource groups are identified through Azure activity logs and notified at regular intervals before the expiry date, with notifications consolidated to avoid spam. These notifications are sent at T-14, T-7, T-2, and T-1 days, and the schedule can be adjusted in the onboarding configuration.
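The TTL arithmetic can be sketched as follows. The tag name `expiry-date`, its ISO date format, and the function names are hypothetical; only the two-week default and the T-14/T-7/T-2/T-1 schedule come from the description above.

```python
from datetime import date, timedelta

DEFAULT_TTL = timedelta(weeks=2)
NOTIFY_DAYS_BEFORE = (14, 7, 2, 1)  # T-14, T-7, T-2, T-1


def expiry_for(created: date, ttl: timedelta = DEFAULT_TTL) -> date:
    """Expiry date to write into the resource group's tag."""
    return created + ttl


def notification_dates(expiry: date, created: date, offsets=NOTIFY_DAYS_BEFORE):
    """Dates on which users are reminded, skipping any that precede creation."""
    return [expiry - timedelta(days=d) for d in offsets
            if expiry - timedelta(days=d) >= created]


def is_expired(tags: dict, today: date) -> bool:
    """True once the (hypothetical) 'expiry-date' tag is on or before today."""
    return date.fromisoformat(tags["expiry-date"]) <= today
```

Extending the TTL then amounts to rewriting the tag with a later date, after which the reminder schedule is recomputed from the new expiry.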

On the expiry date, resource groups are deleted, excluding certain default groups used for setting up monitoring and networking infra. Azure Key Vaults found under a resource group are first moved to a common resource group before the resource group is deleted, because with soft delete enabled we would not be able to create another key vault with the same name for 60 days from the date of deletion.

Azure Resource Manager APIs are used for resource management, and alerts are sent for any deviations or errors.

The process is divided into three jobs, providing flexibility in execution as described in Figure 7.

Figure 7: Sandbox resources clean up process

Metrics and Impact

Cost metrics

So far, Costwiz has shown recommendations for 12K Azure resources and 4.29K (approx 36%) of those resources have been reclaimed (data shown in Figure 8).

Another way to read this metric: if these 4.29K resources were still running today, they would add a considerable amount to our opex every single month.

Further, splitting these resources into their resource types, we see the following matrix.

Figure 8: Cost metrics

The following are some of our highlights and learnings:

  • VMSS resources had the highest closure rate for optimization actions (82.7%).

  • Relatively few database resources (SQL Server: 18%, Cosmos DB database accounts: 28.3%) were reclaimed compared to other resource types. This suggests that owner teams are much more likely to downsize their database accounts and keep them functional than to delete them.

  • Also worth noting are the Cosmos DB database accounts (Azure resource type microsoft.documentdb/databaseaccounts): 28.3% of them were acted upon, yet they contributed only 4% of the potential savings. This implies that a relatively large number of resources with very little savings potential were reclaimed, typically pointing to unused database accounts.

How has this process created the craft for continuous optimizations in cloud deployments?

The cost of Azure sandbox subscriptions fell from 45% to 5% of our total Azure net spend for the group and has remained consistently around that mark. Overall Azure spend growth became organic as the footprint increased, while Costwiz continued to identify and eliminate unutilized sandbox subscriptions.

Conclusion

Costwiz has had a profound impact on the way we see Azure costs. We have learned what engineers expect when they are tasked with rightsizing their resources. In fact, the Costwiz-Azure provisioner integration was conceived to provide the most exact information about a resource. It is intended to let engineers quickly check the IaC code for a resource and make the suggested changes there.

Another major learning is that surfacing recommendations is a pure engineering challenge and is relatively easy compared to actually getting people to act on their assigned tasks, which requires creating accountability. Here, our escalation mechanism has proved very efficient at creating the required traction by looping in business owners and engineering leaders to bring the right attention to inefficiencies.

Acknowledgements

Thanks to the team of Vivek Subramaniam, Deven Walia, Karthikeyan Subramanian, Simon Desowza, Brajesh Jaishwal, Vikram Mandyam, Sharath Channamallappa, and Alagarsamy A, who ideated, designed, and implemented this, understanding the urgency of controlling cloud costs; thanks to our advisors Brandon Duncan and Brent Cochran, who helped us understand the complex cloud costing model during this journey; and thanks to our leaders Balaji R and Balaji Vappala for the opportunity, and for encouraging and guiding us throughout this effort and continuously helping us realize its impact. Special thanks to Ritesh Kini and team, our partners from the Cloudfit team at Microsoft.