Greykite: A flexible, intuitive, and fast forecasting library
May 13, 2021
In this blog post, we introduce the Greykite library, an open source Python library developed to support LinkedIn’s forecasting needs. Its main forecasting algorithm, called Silverkite, is fast, accurate, and intuitive, making it suitable for interactive and automated forecasting at scale. We will start by describing a few applications, and then walk through the algorithm design and user experience. For more technical details, please refer to this paper.
Accurate knowledge about the future is helpful to any business. Time series forecasts can provide future expectations for metrics and other quantities that are measurable over time.
While domain knowledge and expert judgment can sometimes produce accurate forecasts, algorithmic automation enables scalability and reproducibility, and may improve accuracy. Algorithmic forecasts can be consumed by additional algorithms downstream to make decisions or derive insights.
To support LinkedIn’s forecasting needs, we developed the Greykite Python library. Greykite contains a simple modeling interface that facilitates data exploration and model tuning. Its flagship algorithm, Silverkite, is highly customizable, with tuning parameters to capture diverse time series characteristics. The output is interpretable, allowing visualizations of the trend, seasonality, and other effects, along with their statistical significance.
The Silverkite algorithm works well on time series with (potentially time-varying) trends and seasonality, repeated events/holidays, and/or short-range effects. At LinkedIn, we’ve successfully applied it to a wide variety of metrics in different time frequencies (hourly, daily, weekly, etc.), as well as various forecast horizons, e.g., 1 day ahead (short-term) or 1 year ahead (long-term).
Some key benefits:
Flexible: provides time series regressors for trend, seasonality, holidays, changepoints, and autoregression; users select the ones they need and fit the machine learning model of their choice.
Intuitive: provides exploratory plots, templates for tuning, and explainable forecasts with clear assumptions.
Fast: allows for quick prototyping and deployment at scale.
In the remainder of the blog post, we’ll discuss forecasting applications at LinkedIn, the algorithm’s design, the user experience, case studies, and an evaluation of Greykite’s performance. The “algorithm design” section provides an overview of how Greykite’s flagship algorithm (Silverkite) works.
At LinkedIn, we use time series forecasts for resource planning, performance management, optimization, and ecosystem insight generation. For example:
To provision sufficient infrastructure to handle peak traffic.
To set business metric targets and track progress for operational success.
To optimize budget decisions by forecasting growth of various markets.
To understand which countries are recovering faster or slower after a shock like the COVID-19 pandemic.
The Greykite library is designed to solve these types of problems. To develop Greykite, we identified a few champion use cases. These helped us refine our models and prove success. In partnership with the domain experts, we demonstrated that Greykite helps LinkedIn confidently manage the business and make better decisions.
For example, LinkedIn Marketing Solutions is a fast-growing business, with a dynamic ecosystem of advertisers and potential customers. Forecasts are essential to managing this business. Short-term forecasts of budgets, clicks, revenue, and other metrics feed into an ecosystem health dashboard that is regularly refreshed to flag potential issues. The forecasts indicate when a metric deviates from expectations, and provide additional context about which metric dimension or related metric may help explain anomalies. Long-term forecasts help us set metric targets and check whether we are on track to meet them.
On the infrastructure side, forecasts help LinkedIn maintain site availability in a cost-effective manner. We forecast peak minute-level site queries per second (QPS) and service QPS over the next year in order to provision sufficient capacity without adding excessive buffers. Better information about future traffic, combined with accurate site capacity measurements, enables confident decision making. Since even a small percentage of savings translates to a large reduction in total cost, accurate forecasts have a big business impact. Forecasts enable continued, sustainable growth through right-sized applications.
Figure 1. Architecture diagram for Greykite library's main forecasting algorithm, Silverkite
The Silverkite algorithm architecture is shown in Figure 1. The green parallelograms represent model inputs, and the orange ovals represent model outputs. The user provides the input time series and any known anomalies, events, regressors, or changepoint dates. The model returns forecasts, prediction intervals, and diagnostics.
The blue rectangles represent computation steps of the algorithm, decomposed into two phases:
Phase (1): the conditional mean model.
Phase (2): the volatility/error model.
In (1), a model is fit to predict the metric of interest, and in (2), a volatility model is fit to the residuals of that model. Separating the two phases helps with flexibility and speed, because integrated models are often more susceptible to poor tractability (convergence issues for parameter estimates) or divergence issues in the simulated future values (predictions).
Phase (1) can be broken down into these steps:
(1.a) Extract raw features from timestamps, events data, and history of the series (e.g., hour, day of week);
(1.b) Transform the features to appropriate basis functions (e.g., Fourier series terms for various time scales);
(1.c) Apply a changepoint detection algorithm to the data to discover changes in the trend and seasonality over time;
(1.d) Apply an appropriate machine learning algorithm to fit the features from (1.b) and (1.c) (depending on the objective).
The purpose of Step (1.b) is to transform the features into a space which can be used in “additive” models when interpretability is needed. For Step (1.d), we recommend explicit regularization-based models such as Ridge or Lasso. Note that if the objective is to predict peaks, quantile regression or its regularized versions are desirable choices. In the next sections, we provide more details on how these features are built to capture various properties of the series.
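As a minimal sketch of Step (1.b), the snippet below builds Fourier series terms for a single time scale using only the Python standard library. The function name `fourier_terms` and its parameters are illustrative assumptions, not part of the Greykite API.

```python
import math


def fourier_terms(t, period, order):
    """Return [sin, cos] pairs of 2*pi*k*t/period for k = 1..order."""
    terms = []
    for k in range(1, order + 1):
        angle = 2.0 * math.pi * k * t / period
        terms.append(math.sin(angle))
        terms.append(math.cos(angle))
    return terms


# Weekly seasonality for daily data: t is a day index and the period is 7 days.
features_day0 = fourier_terms(t=0, period=7, order=2)
features_day7 = fourier_terms(t=7, period=7, order=2)  # one full period later
```

Because each term is just another numeric column, a linear model such as Ridge can use these features as ordinary regressors, which keeps the fit additive.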
In Phase (2), a simple conditional variance model can be fitted to the residuals, which allows for the volatility to be a function of specified factors, e.g., day of the week.
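One simple way to picture Phase (2) is to estimate a separate residual standard deviation for each level of a factor such as day of week. The sketch below does that with the standard library. The function names are hypothetical, and the normal-quantile multiplier `z=1.96` is our own assumption for illustration rather than a description of how Silverkite builds its intervals.

```python
from collections import defaultdict
from statistics import pstdev


def fit_volatility(residuals, day_of_week):
    """Estimate one residual standard deviation per day of week."""
    groups = defaultdict(list)
    for residual, day in zip(residuals, day_of_week):
        groups[day].append(residual)
    return {day: pstdev(values) for day, values in groups.items()}


def prediction_interval(point_forecast, day, sigma_by_day, z=1.96):
    """Symmetric interval around the forecast (assumes roughly normal errors)."""
    half_width = z * sigma_by_day[day]
    return point_forecast - half_width, point_forecast + half_width


# Toy residuals: small errors on Mondays (0), larger errors on Saturdays (5).
residuals = [1.0, -1.0, 4.0, -4.0]
days = [0, 0, 5, 5]
sigma_by_day = fit_volatility(residuals, days)
lower, upper = prediction_interval(100.0, day=5, sigma_by_day=sigma_by_day)
```

With this toy input, the interval is wider on Saturdays than on Mondays, which is the kind of factor-dependent volatility described above.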
To make the benefits of this design concrete, we illustrate with a few examples from public datasets.
Time-varying trend and seasonality
It is very common to see time series with trend or seasonality patterns that change over time.
Figure 2. Trend and seasonality changepoints in bike-sharing data
For example, Figure 2 shows the number of shared bike rides in Washington, D.C., from 2011 to 2019 at an hourly frequency. By inspecting this figure and other aggregated trend plots, you can observe some indications of changes in trend patterns. Silverkite allows the user to specify the changepoint locations or request automatic trend changepoint and seasonality changepoint detection.
Automatic changepoint detection works as follows: For trend changepoints, we first aggregate the time series into a coarser time series to eliminate short-term fluctuations (which are captured by other features, such as holidays). For example, daily data can be aggregated into weekly data to weaken the effect of holidays. In the next step, a large number of potential trend changepoints are placed evenly over the whole time period, except a time window at the end of the time series. We avoid placing changepoints near the end of the time series to prevent extrapolating trend changes based on limited data (an example of over-fitting in time series context). Then we apply the Adaptive Lasso (Zou 2006) algorithm to shrink insignificant potential trend changepoints. In the Adaptive Lasso regression problem, we fit the aggregated time series with the potential trend changepoints and yearly seasonality Fourier series, which is sufficient for capturing the long-term trend pattern:
aggregated_timeseries ~ growth + potential_trend_changepoints + yearly seasonality
The reason for choosing Adaptive Lasso over Lasso (Tibshirani 1996) is that, in our experience, the latter over-shrinks significant changepoints’ coefficients to reach the desired sparsity level.
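The setup for this regression can be sketched as follows: place candidate changepoints evenly while leaving a window at the end of the series free, then build one regressor per candidate. The function names and the hinge form `max(0, t - t_k)` are illustrative assumptions for a linear growth term. The Adaptive Lasso step that selects among the candidates is not implemented here.

```python
def candidate_changepoints(n_points, n_candidates, tail_without_changepoints):
    """Evenly space candidate changepoint indices, leaving the tail free."""
    usable = n_points - tail_without_changepoints
    step = usable / (n_candidates + 1)
    return [round(step * (k + 1)) for k in range(n_candidates)]


def growth_basis(t, changepoints):
    """Baseline linear growth plus one hinge term per candidate changepoint."""
    return [float(t)] + [max(0.0, float(t - c)) for c in changepoints]


# 100 aggregated observations, 4 candidates, none in the last 20 observations.
cps = candidate_changepoints(n_points=100, n_candidates=4,
                             tail_without_changepoints=20)
row = growth_basis(40, cps)  # regressors for time index 40
```

A sparse regression over these columns would then shrink most hinge coefficients to zero, and the candidates with nonzero coefficients are the detected changepoints.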
Finally, the detected changepoints are used to construct piecewise growth basis functions. For example, instead of a single linear trend, the growth can be piecewise linear by allowing the slope to change at the detected points. Mathematically, if the original growth term is f(t), e.g., f(t)=t, after detecting changepoints t1, …, tK, the growth function will be of the form
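A form consistent with this description, written here as a sketch in which the coefficients α_0, …, α_K are our own notation for values estimated by the regression, is:

```latex
g(t) = \alpha_0 \, f(t) + \sum_{k=1}^{K} \alpha_k \, \mathbf{1}\{t > t_k\} \, \bigl( f(t) - f(t_k) \bigr)
```

Each term in the sum is zero up to t_k and grows from zero afterward, so with f(t) = t the curve stays continuous and only its slope changes, by α_k, at each detected changepoint.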
To allow for changes in seasonality, Silverkite constructs basis functions that allow the seasonal effect to change in both shape and magnitude with time. Silverkite also allows for automatically detecting seasonality changepoints with the following regression model:
de-trended_timeseries ~ Σ_{component} seasonality + potential_seasonality_changepoints
It is important to capture the effect of repeated events such as holidays. Silverkite constructs indicator variables to use as basis functions for the holidays. These variables take the value 1 during the event, and 0 otherwise.
Seasonal effects can be disrupted by events, especially important holidays with a large impact on the time series of interest. For example, in Figure 3 below, the daily seasonality pattern is different on weekends (unimodal) and weekdays (bimodal). The reason is that, while people tend to use bikes for commuting during weekdays, they tend to use bikes for leisure on weekends. Silverkite uses interactions to handle these in the model. These allow for any arbitrary deviation from the “regular” times, not only simple mean-shifts during the event.
Figure 3. Silverkite captures different daily seasonality patterns on weekdays and weekends
Some holidays have extended impact over several days in their proximity, whereas others are more localized. Silverkite allows the user to customize the number of days before and after the event where the impact is non-negligible. By default, the effect on each day is modeled as a separate effect. Silverkite also allows for modeling less impactful holidays together for model sparsity.
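To make the "separate effect per day" default concrete, the sketch below creates one 0/1 indicator for each day in a window around a holiday. The function name, feature names, and window parameters are hypothetical and are not the Greykite API.

```python
from datetime import date, timedelta


def holiday_features(day, holiday, days_before=1, days_after=1):
    """One 0/1 indicator per offset in [-days_before, +days_after]."""
    features = {}
    for offset in range(-days_before, days_after + 1):
        name = f"holiday_{offset:+d}"
        features[name] = 1 if day == holiday + timedelta(days=offset) else 0
    return features


christmas = date(2021, 12, 25)
eve = holiday_features(date(2021, 12, 24), christmas)
ordinary_day = holiday_features(date(2021, 12, 20), christmas)
```

Each indicator gets its own coefficient, so the day before a holiday can have a different effect from the holiday itself. Widening `days_before` and `days_after` extends the window for holidays with a longer impact.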
Remaining temporal dependence (auto-regression)
After accounting for trends, seasonality, events, changepoints, and other important features, the residuals may still show a temporal dependence (albeit often orders of magnitude smaller than the original series). The remaining temporal correlation can be exploited to improve the forecast, especially for short-term forecast horizons. To that end, we allow for an auto-regressive structure in the model. While auto-regression (AR) can account for the remaining correlation in the series, for many applications a large lag order might be needed to capture long-term dependence in the series. To remedy this issue, Hosseini et al. (2011) suggested a technique to develop “parsimonious models” (models that use few parameters to achieve high predictive power) by aggregating the lags. As an example, for a daily series with value Y(t) on day t, consider the averaged lag series
(Y(t−1) + Y(t−2) + … + Y(t−7)) / 7
This covariate represents the average value of the series over the past week. As another example, consider
(Y(t−7) + Y(t−14) + Y(t−21)) / 3
This covariate represents the average value of the series on the same day of week in the past three weeks. Similar series can be defined for other frequencies as well.
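The two covariates described above can be computed with a few lines of Python. The helper name `average_of_lags` is an illustrative assumption.

```python
def average_of_lags(series, t, lags):
    """Average of series[t - lag] over the given lags."""
    return sum(series[t - lag] for lag in lags) / len(lags)


# 28 daily observations with values 1..28; we build covariates for day t = 28.
series = list(range(1, 29))
t = 28

past_week = average_of_lags(series, t, lags=range(1, 8))       # lags 1..7
same_weekday = average_of_lags(series, t, lags=[7, 14, 21])    # last 3 weeks
```

Each covariate adds a single parameter to the model while summarizing several lags, which is what makes the resulting model parsimonious.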
When AR is used in the models, the future is predicted by simulating several time series into the future with the fitted models and aggregating them. Note that if the minimum lag used in the models is at least as large as the forecast horizon, simulations are not necessary, because every lagged value the forecast needs has already been observed. Silverkite takes advantage of this to speed up predictions when possible.
Sometimes, the modeler has expert knowledge about the future that cannot be encoded in the tuning parameters. For example, they may have good forecasts of macroeconomic indicators that are predictive of the metric to forecast. These external forecasts may come from a human or algorithm.
Silverkite allows explicit information about the future to be provided in the form of a regressor with past and future values. This regressor is used directly when fitting the model and for prediction. Silverkite also allows lagged regressors to capture temporal dependence, similar to the auto-regressive lags of the original series.
A positive user experience depends on a few factors: intuitive controls, fast forecast speed, and interpretability. These are essential to library adoption within LinkedIn, because they help users develop accurate models they can trust.
To take advantage of the powerful modeling features above, the Greykite library offers model templates, tuning, and grid search.
Model templates define regressors based on data characteristics and forecast requirements (e.g., hourly short-term forecast, daily long-term forecast). The user is encouraged to try a few relevant templates to find the best one. The high-level tuning knobs provided by model templates drastically reduce the search space to find a satisfactory forecast.
Fine-tuning is important for key business metrics with high visibility and strict accuracy requirements. Therefore, the Greykite library also provides full flexibility to customize a model template for any algorithm. For example, the Silverkite algorithm offers automatic changepoint detection, but also allows the user to add known changepoint dates. Silverkite automatically captures the effect of common holidays and interactions, but allows custom events and specification of model terms even down to the patsy model formula. And while the Greykite library provides outlier detection, the user may label known anomalies and specify whether to ignore or adjust them in the training data.
This fine-tuning often requires a deeper understanding of data characteristics. The Greykite library provides diagnostic plots to assess seasonality, trend, and holiday effects, as shown in Figure 4. It also provides component plots and model summaries, which help with tuning as well as interpretability.
Figure 4. The Greykite library provides diagnostic plots to assess seasonality effects and interactions. In the top plot, we see that the daily seasonality depends on the day of week. In the bottom plot, we see how seasonality magnitude increased between 2010-2018, then decreased in 2019.
Finally, when there are hundreds of metrics to forecast, it is not feasible to manually tune the model for each one. The Greykite library allows for hyperparameter grid search to select the optimal model from multiple candidates, using performance on past data. This makes it easy to forecast many metrics in a semi-automatic fashion: instead of tuning each forecast separately, the user can define a set of candidate forecast configurations that capture different types of patterns, and use grid search to find the best model for each one.
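The selection idea can be illustrated without any forecasting library: score each candidate on held-out past data and keep the one with the lowest error. The two candidate forecasters below are deliberately simple stand-ins that we chose for illustration. They are not Greykite model templates, and all function names are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def last_value_forecast(history, horizon):
    """Repeat the most recent observation."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, period=7):
    """Repeat the most recent full period."""
    return [history[-period + (i % period)] for i in range(horizon)]


def select_best(series, candidates, horizon):
    """Hold out the last `horizon` points and pick the lowest-MAPE candidate."""
    train, holdout = series[:-horizon], series[-horizon:]
    scores = {name: mape(holdout, forecast(train, horizon))
              for name, forecast in candidates.items()}
    return min(scores, key=scores.get), scores


series = [10, 12, 14, 13, 11, 20, 22] * 4  # four identical weeks
candidates = {"last_value": last_value_forecast,
              "seasonal_naive": seasonal_naive_forecast}
best_name, scores = select_best(series, candidates, horizon=7)
```

Running the same `select_best` call over many series gives each one its own winning configuration, which mirrors the semi-automatic workflow described above.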
A next 7-day forecast trained on 8+ years of daily data takes only a few seconds (to fit the mean and volatility model, and produce forecasts) with Greykite. Our whole pipeline, including automatic changepoint detection, cross-validation, backtest, and evaluation, takes under 45 seconds.
Speed facilitates interactive tuning and grid search. Modelers can quickly iterate in an interactive environment, such as a Jupyter notebook, to arrive at a satisfactory model (Figure 5). Speed also enables forecasting at scale across many time series and dimensions, a common requirement in business settings.
Figure 5. The Greykite library provides interactive data exploration tools and fast forecasts, suitable for prototyping in a Jupyter notebook environment
Understanding forecasts is important to business decisions. Silverkite typically performs well when an additive model like Ridge or Lasso is used to fit the forecast. In these cases, we can plot components to interpret the model:
Forecast = Seasonality + Growth + Holidays + Changepoints + ...
Figure 6. The component plot shows the contribution of drivers (e.g., trend, seasonality, holidays) to the forecasted value. It can be used for interpretability and debugging.
For example, in Figure 6, the fitted trend increases first then slightly decreases after a few detected changepoints. The yearly seasonality shows a higher number of rides during warm seasons and a lower number of rides during colder months.
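For an additive linear model, a component's contribution is the sum of coefficient times feature value over the features that belong to it. The sketch below shows this bookkeeping with made-up feature names, coefficients, and groupings. None of them come from a fitted Silverkite model.

```python
def component_contributions(features, coefficients, component_of):
    """Sum coefficient * feature value within each named component."""
    totals = {}
    for name, value in features.items():
        component = component_of[name]
        totals[component] = totals.get(component, 0.0) + coefficients[name] * value
    return totals


# Hypothetical feature values for one forecast date.
features = {"t": 10.0, "sin_week": 0.5, "cos_week": -0.5, "is_holiday": 1.0}
coefficients = {"t": 2.0, "sin_week": 4.0, "cos_week": 2.0, "is_holiday": -6.0}
component_of = {"t": "growth", "sin_week": "seasonality",
                "cos_week": "seasonality", "is_holiday": "holidays"}

contributions = component_contributions(features, coefficients, component_of)
forecast = sum(contributions.values())
```

Plotting each entry of `contributions` over time yields a component plot, and the components always add up to the forecast.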
Figure 7. Model summary shows the effect of individual factors in the model for interpretability, model tuning, and debugging
The model summary can be used to assess the effect of individual factors, as shown in Figure 7. For example, we may want to check the magnitude of a holiday effect or see how much a changepoint affected the trend. The model summary provides some evidence of whether a certain feature is beneficial to the model and whether its coefficient aligns with our expectations.
These interpretations also help explain how a forecast changes over time. If the forecasted value for a particular target date changes after training on new data, we can check which components or individual regressors have changed to understand why. When strategic business decisions depend on the forecasted value, it is particularly important to understand the factors behind changes and assess whether the explanation is reasonable (perhaps verify against known events that may have caused the change) before deciding to take action.
The Greykite library comes with excellent out-of-the-box capabilities for model assessment, validation, and testing, which are appropriate for the time series context. Below, we first discuss the methodology for assessment and benchmarking, and then compare Silverkite’s current out-of-the-box performance to some existing open source packages.
In practice, a relative error metric is commonly used to evaluate forecast quality. We benchmark forecast accuracy using Mean Absolute Percentage Error (MAPE) rather than Root Mean Square Error (RMSE). MAPE is scale free and can be directly compared across datasets. The following analysis highlights the flexibility of the Silverkite algorithm.
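The scale-free property is easy to check numerically. In the sketch below, with toy numbers of our own choosing, multiplying a series and its forecast by 1,000 leaves MAPE unchanged while RMSE grows by the same factor.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the units of the series."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

# The same data expressed in units that are 1,000 times smaller.
actual_scaled = [1000.0 * a for a in actual]
predicted_scaled = [1000.0 * p for p in predicted]

mape_original, mape_scaled = mape(actual, predicted), mape(actual_scaled, predicted_scaled)
rmse_original, rmse_scaled = rmse(actual, predicted), rmse(actual_scaled, predicted_scaled)
```

One caveat follows directly from the formula: because the actual value is in the denominator, MAPE is undefined when an actual value is 0 and becomes very large when actual values are close to 0.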
We use a rolling window Cross-Validation (CV) for our benchmarking, which closely resembles the well-known K-fold CV method. In K-fold CV, the original data are randomly partitioned into K equal-sized subsamples. A single subsample is held out as the validation data, and the model is trained on the remaining (K-1) subsamples. The trained model is used to predict on the held-out validation set. This process is repeated K times so that each of the K subsamples is used exactly once as the validation set. Average test error across all the K iterations provides an approximately unbiased estimate of the true test error of the machine learning model on the data.
Due to the temporal dependency in time-series data, the standard K-fold CV is not appropriate. Choosing a hold-out set randomly has two fundamental issues in the time series context:
Future data is utilized to predict the past.
Some time series models cannot be trained realistically with a random sample, e.g., the auto-regressive models (due to missing lag values).
Figure 8. Benchmark test sets are created using rolling window CV, with internal cross-validation in each benchmark (BM) split for model selection
Rolling window CV addresses this by first creating a series of K test sets, as illustrated in Figure 8. For each test set, the observations prior to the test set are used for training. This creates K benchmark (BM)-folds. Within each training set, a series of CV folds is created, each containing a validation set. The number of datapoints in every test and validation set equals the forecast horizon. Observations that occur prior to that of the validation set are used to train the models for the corresponding CV fold. Thus, no future observations can be used in constructing the forecast, either in the validation or testing phase. The parameters minimizing average error on the validation sets are chosen. This model is then retrained on the training data for the corresponding BM-fold. The average error across all test sets provides a robust estimate of the model performance with this forecast horizon.
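The outer benchmark splits can be sketched in a few lines. The function below is a hypothetical helper, not Greykite code. It returns index ranges in which each test set has `horizon` points and the training set contains every observation before it.

```python
def rolling_window_splits(n_points, horizon, n_splits, periods_between_splits):
    """Return (train_range, test_range) pairs, oldest split first."""
    splits = []
    for k in range(n_splits):
        test_end = n_points - k * periods_between_splits
        test_start = test_end - horizon
        if test_start <= 0:
            break  # not enough history left for a training set
        splits.append((range(0, test_start), range(test_start, test_end)))
    return list(reversed(splits))


splits = rolling_window_splits(n_points=30, horizon=7, n_splits=3,
                               periods_between_splits=5)
for train, test in splits:
    # Training data always ends strictly before the test set begins.
    print(f"train 0..{train[-1]}, test {test[0]}..{test[-1]}")
```

Applying the same function to each training range would produce the inner CV folds used for model selection, so that neither validation nor testing ever sees future observations.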
For our benchmark, we ran the models on two different forecast horizons, 1-day ahead and 7-day ahead. We chose consistent benchmark settings suitable for all algorithms, including the slower ones, and used a single CV fold to speed up runtime across algorithms.
We chose datasets with at least two years of training data so that the models could accurately estimate yearly seasonality patterns. The models were run on the following datasets:
Daily Australia Temperature Dataset, Temperature column
Beijing PM2.5 Dataset, pm2.5 concentration column
The number of periods between successive test sets and total number of splits are chosen to ensure the following:
The predictive performance of the models is measured over a year to ensure that the test sets provide a representative sample across time properties, e.g., seasonality, holidays.
The test sets are completely randomized in terms of time features. For daily data, we avoid setting "periods between splits" to any multiple of 7, because that would result in the training and test sets always ending on the same day of the week.
The total computation time is minimized while maintaining the previous points. For daily data, setting “periods between successive test sets” to 1 and number of splits to 365 is a more thorough CV procedure. But it massively increases the total computation time. Hence, we set periods between successive test sets to 16 and the number of splits to 25.
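A quick arithmetic check shows why 16 and 25 satisfy these criteria: the test sets span more than a year, and because 16 is not a multiple of 7, the split end dates cycle through every day of the week. The variable names below are our own.

```python
periods_between_splits = 16
n_splits = 25

# Days between the end of the first test set and the end of the last one.
span_days = (n_splits - 1) * periods_between_splits

# Day-of-week offsets (relative to the last split) on which the splits end.
weekday_offsets = sorted({(k * periods_between_splits) % 7 for k in range(n_splits)})

# For comparison, a multiple of 7 always lands on the same day of the week.
offsets_with_14 = sorted({(k * 14) % 7 for k in range(n_splits)})
```

Here `span_days` is 384, `weekday_offsets` contains all seven values from 0 to 6, and `offsets_with_14` contains only 0.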
We used out-of-the-box configuration for Auto-Arima (pmdarima) and Facebook Prophet (fbprophet). The Silverkite out-of-the-box configuration was also chosen prior to running the benchmark. It uses ridge regression to fit the model and contains linear growth, appropriate seasonalities (e.g., monthly, quarterly, and yearly seasonality for daily data), automatic changepoint detection, holiday effects, autoregression, and seasonality interaction terms with the trend and changepoints.
As shown in Table 1, Silverkite performs better out-of-the-box for 1-day and 7-day forecast horizons. On average, Silverkite and Auto-Arima run 4 times faster than Prophet. Note that the average test MAPE values are high due to values close to 0 in the Beijing PM2.5 dataset.
Table 1. Benchmark comparison of Silverkite against Auto-Arima and Prophet
This is an initial benchmark on a few public datasets. We hope to benchmark additional datasets, forecast horizons, and data frequencies that better match our industry applications. (Please reach out to us on GitHub if you have a public dataset to recommend!) For example, unlike the weather datasets used above, LinkedIn’s metrics tend to show strong changepoint and event/holiday effects with temporal dependencies. Thus, the benefits of Silverkite are more apparent for our internal datasets. For (i) Revenue forecasts 1 day ahead and 7 days ahead and (ii) Weekly Active User forecasts 2 weeks ahead, Silverkite decreased the MAPE by more than 50% and 30%, respectively.
The Greykite library provides a fast, accurate, and highly customizable algorithm (Silverkite) for forecasting. Greykite also provides intuitive tuning options and diagnostics for model interpretation. It is extensible to multiple algorithms, and facilitates benchmarking them through a single interface. Currently, Greykite also supports Facebook Prophet (fbprophet), and we plan to add other useful open-source algorithms in the future to give users more options to choose from, through a unified interface.
The Greykite library is developed by the Data Science Research and Productivity team at LinkedIn. Special thanks to Rachit Kumar and Saad Eddin Al Orjany for their contributions to this project, and to our close collaborators in Data Science, Engineering, SRE, FP&A, and BizOps for adopting the library. In particular, Ashok Sridhar, Mingyuan Zhong, and Jerry Shan provided valuable ideas and feedback. We also thank our management team Ya Xu and Sofus Macskássy for their continued encouragement and support.