Health Score Metrics as a Software Craftsmanship Enabler
October 3, 2017
The notion of software craftsmanship is sometimes a muddy one. On the one hand, engineers find craftsmanship hard to grasp and put into practice: it is an abstract objective that, by itself, provides little guidance for day-to-day software engineering. On the other hand, craftsmanship is often narrowed down to a handful of "best practices" that engineers are expected to follow. Neither of these limited views helps much in improving software quality. During 2016, an R&D initiative for software craftsmanship was one of the technical priorities across the engineering organization at LinkedIn. As part of this initiative, we decided to quantify some components of software craftsmanship to help guide our engineers toward better software development practices and software quality. The ultimate goal is to provide concrete and actionable guidelines for every piece of software created at LinkedIn. With this goal in mind, we built a health score platform that collects and presents craftsmanship-elevating metrics and provides a framework for convenient extension. We also proposed and implemented an initial set of metrics and plugged them into this platform.
This health score platform is built on top of LinkedIn's in-house development framework. Under this framework, the entire LinkedIn software ecosystem is broken down into releasable logical units of software known as "multiproducts." Each release of a multiproduct is associated with a semantic version and may depend on specified versions of other multiproducts. Thus, the multiproduct framework centers around the core concepts of dependency management, version control, and build systems. The multiproduct framework unifies various underlying technologies for software development, makes iterations of development faster, and is the backbone of LinkedIn's software delivery pipeline. Accordingly, our health score platform, which we refer to as “multiproduct health score,” tries to realize a model for measuring software craftsmanship at LinkedIn by considering factors such as quality of source code, state of dependencies, and semantic fidelity of product versions. Based on these aspects, we identified some quantifiable metrics (referred to as “multiproduct health metrics”), and synthesized them into a score by a configurable function (referred to as the “multiproduct health model”). Each multiproduct health metric is scaled to the standard value range of [0%, 100%]. The extensible nature of the platform enables us to plug in custom multiproduct health metrics and models in a convenient manner.
The system has three major components:
Quality build controller: In order to obtain multiproduct health metrics without affecting commit-to-release time, we introduced a build phase after each release, referred to as a "quality build." Developers also have the option of invoking a quality build at any time. The quality build controller glues the multiproduct health score plugin to the rest of LinkedIn’s development framework.
Multiproduct health score plugin: At LinkedIn, Gradle powers the build pipeline for all mainstream products. During the quality build phase, the quality build controller invokes the Gradle plugin for the multiproduct health score, which returns a result that is sent to the multiproduct health score server. The multiproduct health score plugin provides APIs through which multiproduct health models are configured and multiproduct health metrics are implemented. Note that before the quality build controller invokes the multiproduct health score plugin, it checks out the product in question. As a result, a product can associate itself with product-specific multiproduct health models and metrics by implementing them either in the build script of the product, or as a separate Gradle plugin, and then declaring that plugin as a dependency of the product.
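The actual plugin API is Gradle-based and not shown in the post. As a purely hypothetical analogue, the following Python sketch illustrates the general shape of a pluggable metric registry: products register generic or product-specific metrics, and the build invokes all of them. All names here (`HealthMetric`, `MetricRegistry`, `CodeCoverageMetric`) are illustrative, not the real API.

```python
from abc import ABC, abstractmethod

class HealthMetric(ABC):
    """A health metric maps a product to a score in [0.0, 100.0]."""
    name: str

    @abstractmethod
    def measure(self, product: dict) -> float:
        ...

class MetricRegistry:
    """Generic and product-specific metrics are registered here."""
    def __init__(self):
        self._metrics = {}

    def register(self, metric: HealthMetric) -> None:
        self._metrics[metric.name] = metric

    def measure_all(self, product: dict) -> dict:
        # Invoked during the quality build to collect every metric.
        return {name: m.measure(product) for name, m in self._metrics.items()}

class CodeCoverageMetric(HealthMetric):
    name = "code_coverage"

    def measure(self, product: dict) -> float:
        # In a real system this would parse coverage reports;
        # here the counts are passed in directly for illustration.
        return 100.0 * product["covered_lines"] / product["total_lines"]

registry = MetricRegistry()
registry.register(CodeCoverageMetric())
scores = registry.measure_all({"covered_lines": 80, "total_lines": 100})
# scores -> {"code_coverage": 80.0}
```

A product-specific metric would simply be another `HealthMetric` subclass registered by that product's build configuration.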
Multiproduct health score server: Provides APIs for accessing multiproduct health metrics and models, calculating multiproduct health scores, and sending data to the data analytics infrastructure at LinkedIn. These APIs are used by various analysis and presentation tools for LinkedIn developers.
A simplified system architecture is illustrated by the following diagram:
Examples of multiproduct health metrics
Along with the health score platform, a set of multiproduct health metrics was proposed and implemented. In this section, we’ll describe a set of example health metrics that are relevant to all underlying languages and technologies and not heavily dependent on LinkedIn’s specific technical stack. They focus on source code, dependency management, and version control.
Source code metrics: Code coverage and code style are two example metrics in this category. Code coverage is defined as the percentage of source code (in lines) covered by tests, and, similarly, code style is the percentage of source code (in lines) without style issues. The raw percentage is then scaled linearly against a range: values at or above the upper bound of the range map to a metric of 100%, and values at or below the lower bound map to 0%. These ranges are determined empirically for code coverage and code style.
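The linear scaling described above can be sketched as follows; the bounds of 40% and 90% used in the example are assumed for illustration, not LinkedIn's actual empirical values.

```python
def scale_to_range(raw_pct: float, lower: float, upper: float) -> float:
    """Scale a raw percentage against an empirical [lower, upper] range.

    Values at or above `upper` map to 100%, values at or below `lower`
    map to 0%, and values in between are interpolated linearly.
    """
    if raw_pct >= upper:
        return 100.0
    if raw_pct <= lower:
        return 0.0
    return (raw_pct - lower) / (upper - lower) * 100.0

# Example with assumed bounds of 40% and 90% for code coverage:
coverage_metric = scale_to_range(65.0, 40.0, 90.0)  # midpoint of the range -> 50.0
```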
Dependency management metrics: Dependency freshness and dependency consistency are two metrics for dependency management.
Dependency freshness is defined as the percentage of dependencies in a product that are not stale. A versioned dependency is stale if: 1) at the time of the quality build, the depended-on version has been out for longer than an empirical threshold period since its release; and 2) the dependency package has more recent releases. Past software development practice at LinkedIn has shown that stale dependencies are technical debt that needs to be paid eventually.
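The two staleness conditions can be sketched as below; the 90-day threshold is an assumed value for illustration, not LinkedIn's actual empirical threshold.

```python
from datetime import date, timedelta

def is_stale(dep: dict, build_date: date,
             threshold: timedelta = timedelta(days=90)) -> bool:
    """A dependency is stale if 1) its pinned version has been out longer
    than the threshold at quality-build time, AND 2) newer releases exist."""
    aged_out = build_date - dep["released"] > threshold
    return aged_out and dep["has_newer_release"]

def dependency_freshness(deps: list, build_date: date) -> float:
    """Percentage of dependencies that are not stale."""
    if not deps:
        return 100.0
    fresh = sum(1 for d in deps if not is_stale(d, build_date))
    return fresh / len(deps) * 100.0

deps = [
    {"released": date(2017, 1, 1), "has_newer_release": True},   # old AND superseded -> stale
    {"released": date(2017, 9, 1), "has_newer_release": True},   # superseded but recent -> fresh
    {"released": date(2016, 6, 1), "has_newer_release": False},  # old but still latest -> fresh
]
dependency_freshness(deps, date(2017, 10, 3))  # 2 of 3 fresh -> ~66.7
```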
Dependency consistency is concerned with correctly managing dependencies using Gradle. At LinkedIn, dependencies and their versions are defined by a dependency specification, which is used by various components of the LinkedIn development framework. At the same time, Gradle uses build scripts to define its own view of dependencies. Inconsistencies between these two definitions may cause problems down the road: first, a dependency may be defined in the dependency spec but not in any build script, yet still be pulled in transitively by another dependency; second, a dependency may be declared directly in a build script (for instance, via Maven coordinates) without appearing in the dependency spec. These inconsistencies create blind spots, making builds behave in unexpected ways without developers' awareness. Dependency consistency is the percentage of dependencies that are declared consistently in both the build scripts and the dependency spec.
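Modeling the two declaration sources as sets, the metric can be sketched as follows; the set-based formulation and the example coordinates are illustrative assumptions, since the post does not give the exact formula.

```python
def dependency_consistency(spec_deps: set, build_script_deps: set) -> float:
    """Percentage of all declared dependencies that appear consistently
    in BOTH the dependency spec and the build scripts."""
    all_deps = spec_deps | build_script_deps
    if not all_deps:
        return 100.0
    consistent = spec_deps & build_script_deps
    return len(consistent) / len(all_deps) * 100.0

spec = {"foo:1.2", "bar:2.0", "baz:0.9"}          # baz: in spec only (blind spot 1)
scripts = {"foo:1.2", "bar:2.0", "qux:3.1"}       # qux: build script only (blind spot 2)
dependency_consistency(spec, scripts)             # 2 consistent of 4 total -> 50.0
```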
Version control: This is a simple metric defined around one of the rules of the semantic versioning scheme used at LinkedIn; namely, if a product is another product's dependency, it should be on a non-zero major version. This is a binary metric.
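The binary rule above can be sketched directly; the function name and inputs are illustrative.

```python
def version_control_metric(version: str, has_dependents: bool) -> float:
    """Binary metric: a product that other products depend on should be
    on a non-zero major version (a rule of semantic versioning)."""
    major = int(version.split(".")[0])
    if has_dependents and major == 0:
        return 0.0
    return 100.0

version_control_metric("0.3.1", has_dependents=True)   # depended-on at 0.x -> 0.0
version_control_metric("2.1.0", has_dependents=True)   # non-zero major -> 100.0
```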
Multiproduct health score at LinkedIn
Based on our goals for the multiproduct health score system, we will discuss the following two health score models:
To provide incentives to improve health metrics, the multiproduct health score is presented as a weighted combination of health metrics. With the help of system-generated suggestions, engineering teams work on improving the health metrics of their multiproducts. The coefficients for the health metrics are configurable, enabling both data-driven and strategic configurations that direct engineering efforts to the most important health metrics. We refer to this model as the “combination model” below.
To help teams see the most critical area for improvement in their multiproducts, another option is to use the minimum of health metrics as the health score, assuming that the value ranges are standardized across different metrics. We refer to this model as “Critical Health Metric” below.
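The two models can be sketched as follows. The post says only that the score is a configurable "combination" of metrics; the weighted average below is an assumed concrete form, and the weights are illustrative.

```python
def combination_score(metrics: dict, weights: dict) -> float:
    """Combination model: weighted average of health metrics,
    with configurable per-metric coefficients (assumed form)."""
    total_weight = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight

def critical_health_metric(metrics: dict) -> float:
    """Critical Health Metric model: the minimum metric, highlighting
    the most critical area for improvement (metrics share a 0-100 scale)."""
    return min(metrics.values())

metrics = {"coverage": 80.0, "freshness": 60.0, "consistency": 100.0}
weights = {"coverage": 2, "freshness": 1, "consistency": 1}
combination_score(metrics, weights)   # (160 + 60 + 100) / 4 = 80.0
critical_health_metric(metrics)       # freshness is weakest -> 60.0
```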
Multiproduct health score for any given multiproduct is presented to all LinkedIn developers through a LinkedIn internal development and deployment portal. The long-term trend of the multiproduct health score is also tracked for every product. In this section, we share some of the impact of multiproduct health score at LinkedIn, based on the data from a team where every member was confirmed to be aware of the multiproduct health score platform.
The following plots show multiproduct health score data during a six-month period starting from the launch date of the multiproduct health score. The plot on the left shows the “combination model,” while the plot on the right shows the “critical health metric.” In both plots, the green line is the score at the end of the observation period, while the blue line is the improvement in percentage; i.e., if a product has a score of s at the beginning of the period and a score of s’ at the end, then the improvement is (s’ - s) / s’. The product sequence in both charts consists of products sorted by increasing multiproduct health score.
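As a worked instance of the improvement formula above: a product that starts at 50 and ends at 80 shows a 37.5% improvement.

```python
def improvement(s: float, s_prime: float) -> float:
    """Relative improvement over the observation period: (s' - s) / s'."""
    return (s_prime - s) / s_prime

improvement(50.0, 80.0)  # (80 - 50) / 80 -> 0.375
```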
From the plots, we can see that most products improved their multiproduct health scores during this period. Given that there are outlier products (hack day products, experimental products, and even abandoned products), and that many of the products with zero improvement already had reasonable scores on day one, these are promising results. Also, with a few exceptions, improving the overall health score usually entails improving the critical health metric: along with improvements to other health metrics, a significant improvement in health score is often achieved by a game-changing fix to the critical health metric.
Our platform also helps teams that believe in, and would like to enforce, team-specific best practices: they use our platform to define custom health models and implement custom health metrics. These custom models and metrics are presented and managed by the platform just like the generic multiproduct health scores and metrics.
Below is another example of such an effort to improve multiproduct health score, where an engineering team keeps track of the evolution of its aggregated multiproduct health score: in the dashboard, each point depicts the average multiproduct health score across all of the team's products on that date. The score is not attended to every single day, but the general trend is that developers pay attention to the metrics and try to improve them over time.
Our goal for the multiproduct health score platform is to make it an ecosystem where developers inside LinkedIn have access to the data, can contribute metrics, and can conduct research on how software craftsmanship is materialized into concrete guidelines for developers. While we look forward to progress in these areas, we hope that our work on the multiproduct health score to date provides development teams with one quantifiable approach to measuring trends in the level of software craftsmanship, both in any given product and for a team as a whole.
Thanks to Szczepan Faber and the 2016 Craftsmanship working group for their guidance on this project, and many thanks to Tim Worboys, Joshua Lawrence, Timothy Lindvall, and Haizhou Liu for making the UI portion of this project possible, and Rupa Shanbhag for providing valuable feedback and helping with adoption of the platform.