Dagli: Faster and easier machine learning on the JVM, without the tech debt
November 10, 2020
In recent years, we’ve been fortunate to see a growing number of excellent machine learning tools, such as TensorFlow, PyTorch, DeepLearning4J, and CNTK for neural networks, Spark and Kubeflow for very-large-scale pipelines, and scikit-learn, ML.NET, and the recent Tribuo for a wide variety of common models. However, models are typically part of an integrated pipeline (including feature transformers), and constructing, training, and deploying these pipelines to production remains more cumbersome than it should be. Duplicated or extraneous work is often required to accommodate both training and inference, engendering brittle “glue” code that complicates future evolution and maintenance of the model and creating long-term technical debt.
Consequently, we’re pleased to announce the release of Dagli, an open source machine learning library for Java (and other JVM languages) that makes it easy to write bug-resistant, readable, modifiable, maintainable, and trivially deployable model pipelines without incurring technical debt. Dagli takes full advantage of modern, highly-multicore CPUs and increasingly powerful GPUs for effective single-machine training of real-world models.
With Dagli, we hope to provide three main contributions to the machine learning community:
- An easy-to-use, bug-resistant, JVM-based machine learning framework
- A comprehensive collection of statistical models and feature transformers ready to use “out of the box”
- A simple-but-powerful new abstraction of machine learning pipelines as directed acyclic graphs that allows for extensive optimization while still keeping the ease-of-implementation for each component comparable to a traditional “black box”
For experienced machine learning engineers, Dagli offers an easy path to a performant, production-ready model that is maintainable and extensible in the long term and can leverage an existing JVM-based technology stack. For software engineers with less prior machine learning experience, Dagli provides an intuitive, fluent API that can be used with their favorite JVM language and tooling and is designed to avoid many common logic bugs. Specific advantages of Dagli include:
- Single pipeline: The entire model pipeline is defined as a directed acyclic graph (DAG) once, for both training and inference. There’s no need to implement a pipeline for training and a separate pipeline for inference.
- Bug-resistance: Easy-to-read pipeline definitions, ubiquitous static typing, near-ubiquitous immutability, and many other design features prevent the large majority of potential logic errors.
- Portability: Works on a server, in Hadoop, in a CLI, in your IDE, or any other JVM context.
- Ease of deployment: The entire pipeline is serialized and deserialized as a single object.
- Speed: Highly parallel multithreaded training and inference, graph (pipeline) optimizations, and mini-batching.
- Batteries included: Plenty of pipeline components are ready to use right out of the box, including neural networks, logistic regression, gradient boosted decision trees, FastText, cross-validation, cross-training, feature selection, data readers, evaluation, and feature transformations.
- Java ecosystem: Leverage the code completion, type hints, inline documentation, debugger, and other features of the IDE you already use with your preferred JVM language.
How Dagli works
To get a concrete sense of how Dagli works, let’s start with a text classifier that uses the active leaves of a Gradient Boosted Decision Tree model (XGBoost) as well as a high-dimensional set of ngrams as features in a logistic regression classifier:
Pipelined model using XGBoost leaves and ngrams as logistic regression features
Of course, we can easily try other variants of this model: for example, we can substitute out our logistic regressor for a multilayer perceptron neural network:
Pipelined model using XGBoost leaves and ngrams as multilayer perceptron features
The models above are fairly minimal, and Dagli additionally provides mechanisms to more elegantly encapsulate example data (@Structs), read in data (e.g., from delimiter-separated value or Avro files), evaluate model performance, and much more. A list of more comprehensive examples may be found here.
The DAG abstraction
Dagli represents machine learning pipelines as directed acyclic graphs (DAGs). The root nodes of the DAG represent the input to the pipeline, which may be either “placeholders,” representing each example’s values as provided during training and inference, or “generators” (such as Constant, ExampleIndex, and RandomDouble), which automatically produce a value for each example.
Transformers are the child nodes of the graph, accepting one or more input values for each example and producing an output value. These include both feature transformations (such as Tokens, BucketIndex, Rank, and Index) and learned models (such as XGBoostRegression, LiblinearClassification, and NeuralNetwork).
Transformers may be “preparable” or “prepared.” Dagli uses the word "preparation" rather than "training" because many preparable transformers are not statistical models; e.g., BucketIndex examines values across all preparation examples to find the optimal bucket boundaries with the most even distribution of these values amongst the buckets.
During the preparation/training of the DAG, the preparable transformers (like BucketIndex or XGBoostRegression) become prepared transformers (like BucketIndex.Prepared or XGBoostRegression.Prepared). These prepared transformers then transform the input values, both during DAG preparation (so the results may be fed to downstream transformers) and later, during inference in the prepared DAG.
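The preparable/prepared split can be illustrated in plain Java. The sketch below is not Dagli's BucketIndex; the class and method names (BucketizerSketch, prepare, Prepared.apply) are hypothetical, and the even-split rule is a simplifying assumption. It only shows the shape of the idea: preparation examines all values once, and the resulting prepared object then maps individual values to buckets.

```java
import java.util.Arrays;

/** Hypothetical sketch, not Dagli's BucketIndex: a "preparable" bucketizer and its "prepared" form. */
final class BucketizerSketch {

  /** Prepared form: holds the learned boundaries and maps each value to a bucket index. */
  static final class Prepared {
    private final double[] upperBounds; // ascending; bucket i holds values <= upperBounds[i]

    Prepared(double[] upperBounds) {
      this.upperBounds = upperBounds;
    }

    int apply(double value) {
      for (int i = 0; i < upperBounds.length; i++) {
        if (value <= upperBounds[i]) {
          return i;
        }
      }
      return upperBounds.length; // final bucket: everything above the last boundary
    }
  }

  /** Preparation: examines every value, then picks boundaries that split them roughly evenly. */
  static Prepared prepare(double[] values, int bucketCount) {
    if (bucketCount < 1 || values.length < bucketCount) {
      throw new IllegalArgumentException("need at least one value per bucket");
    }
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    double[] upperBounds = new double[bucketCount - 1];
    for (int i = 1; i < bucketCount; i++) {
      // largest value that falls in bucket i - 1 under an even split of the sorted values
      upperBounds[i - 1] = sorted[i * sorted.length / bucketCount - 1];
    }
    return new Prepared(upperBounds);
  }

  public static void main(String[] args) {
    double[] ages = {18, 21, 25, 33, 40, 45, 52, 67, 71};
    Prepared prepared = prepare(ages, 3); // learned boundaries: 25 and 45
    System.out.println(prepared.apply(30)); // prints 1 (the middle bucket)
  }
}
```

The same prepared object serves both phases: it transforms the preparation examples for downstream nodes and, later, new examples at inference time.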
Consider a simple preparable DAG for predicting a person’s musical preference from their age and ZIP Code:
Each node in the DAG produces an output value for each example, and values for a hypothetical training example have been included along each edge in the graph.
Age, Music Preference, and ZIP Code are placeholders whose values are provided for each example; later, during inference, Music Preference (the training label) will be ignored and can simply be null or elided from the graph entirely. A Constant generator node outputs 120 for every example to serve as an argument to a prepared transformer calling Math::min, enforcing a ceiling on putative ages. The preparable BucketIndex transformer finds optimal bucket boundaries by examining its input values for all training examples, which are then used to assign each input value to a bucket (e.g., the “33 to 45” age range). Finally, LiblinearClassification trains a logistic regression model that ultimately predicts a probability for a person preferring each possible type of music, given that person’s features.
Understanding the DAG and optimization
From a seemingly simple abstraction, Dagli can actually infer quite a bit about what the DAG is doing: for instance, knowing which specific values are required by each transformer allows multiple transformers to process a single example in parallel (so long as no transformer is an ancestor of another). Node implementations can also optionally declare a few other key attributes; e.g., preparable transformers can be “streaming” (only requiring a single pass over their inputs) and idempotent (seeing the same set of input values multiple times does not change the prepared result). Transparency into the DAG must be carefully selective to avoid encumbering the freedom and brevity of node implementations, but Dagli knows enough for many key optimizations:
- Deduplicating semantically identical nodes eliminates redundant work.
- Eliding nodes not required to prepare the DAG or to infer its outputs similarly avoids unnecessary computation.
- Avoiding caching intermediate results wherever possible saves memory and, in some cases, serialization and I/O costs.
- When Dagli can prove the output of a node is independent of the example data, pre-computing its output reduces DAG execution time even further (akin to constant folding in compilers).
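The first of these optimizations, deduplication, can be sketched with ordinary value equality. The code below is an illustration under assumed names (DedupSketch, Node, intern), not Dagli's implementation: two nodes are treated as semantically identical when they have the same operation and the same parents, and the graph keeps only one canonical instance of each.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch, not Dagli's classes: deduplicating semantically identical DAG nodes. */
final class DedupSketch {

  /** A node is identified by its operation name and its (already canonical) parents. */
  static final class Node {
    final String op;
    final List<Node> parents;

    Node(String op, Node... parents) {
      this.op = op;
      this.parents = Arrays.asList(parents);
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof Node
          && op.equals(((Node) other).op)
          && parents.equals(((Node) other).parents);
    }

    @Override
    public int hashCode() {
      return Objects.hash(op, parents);
    }
  }

  private final Map<Node, Node> canonical = new HashMap<>();

  /** Returns the first node seen that equals the given one, so equal nodes are shared. */
  Node intern(Node node) {
    Node existing = canonical.putIfAbsent(node, node);
    return existing != null ? existing : node;
  }

  int uniqueNodeCount() {
    return canonical.size();
  }

  public static void main(String[] args) {
    DedupSketch graph = new DedupSketch();
    Node text = graph.intern(new Node("placeholder:text"));
    Node tokensA = graph.intern(new Node("tokens", text));
    Node tokensB = graph.intern(new Node("tokens", text)); // same op, same parent
    System.out.println(tokensA == tokensB); // prints true: the work is done only once
  }
}
```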
Beyond these intrinsic optimizations, nodes may also provide their own “graph reducers,” which can rewrite the graph to simplify it further. For instance, a DAG nested as a transformer within another DAG will be “flattened” and replaced with its corresponding subgraph; a ConditionalValue node (which acts like a ternary operator ?: among its three inputs) will elide itself if the conditional input has a pre-computable constant value; and several nodes will eliminate both themselves and their parent from the graph if their parent node is an inverse operation (e.g., a Tupled2 creating a tuple followed by a Value1FromTuple which extracts a tuple field). Many of these reductions would very rarely apply in the DAG as originally written by a human author, but become very useful in creating a cascade of simplifications following other graph optimizations.
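A reducer of the kind described for ConditionalValue might look roughly like the following. This is a simplified sketch with hypothetical types (Constant, Input, Conditional) rather than Dagli's reducer interface; it rewrites the graph bottom-up so that one simplification can enable another, which is the cascade effect mentioned above.

```java
/** Hypothetical sketch, not Dagli's API: a reducer that elides conditionals with a constant test. */
final class ReducerSketch {

  interface Node {}

  /** A node whose output is the same for every example. */
  static final class Constant implements Node {
    final Object value;

    Constant(Object value) {
      this.value = value;
    }
  }

  /** A node whose output depends on the example data. */
  static final class Input implements Node {
    final String name;

    Input(String name) {
      this.name = name;
    }
  }

  /** Acts like (condition ? ifTrue : ifFalse), evaluated per example. */
  static final class Conditional implements Node {
    final Node condition;
    final Node ifTrue;
    final Node ifFalse;

    Conditional(Node condition, Node ifTrue, Node ifFalse) {
      this.condition = condition;
      this.ifTrue = ifTrue;
      this.ifFalse = ifFalse;
    }
  }

  /** Rewrites bottom-up, replacing any conditional whose test reduces to a constant Boolean. */
  static Node reduce(Node node) {
    if (!(node instanceof Conditional)) {
      return node; // constants and inputs cannot be simplified further
    }
    Conditional conditional = (Conditional) node;
    Node condition = reduce(conditional.condition);
    Node ifTrue = reduce(conditional.ifTrue);
    Node ifFalse = reduce(conditional.ifFalse);
    if (condition instanceof Constant && ((Constant) condition).value instanceof Boolean) {
      boolean test = (Boolean) ((Constant) condition).value;
      return test ? ifTrue : ifFalse; // the conditional elides itself
    }
    return new Conditional(condition, ifTrue, ifFalse);
  }

  public static void main(String[] args) {
    Node a = new Input("a");
    Node b = new Input("b");
    // The inner conditional reduces to Constant(false), which lets the outer one reduce to b.
    Node inner = new Conditional(new Constant(true), new Constant(false), new Input("flag"));
    System.out.println(reduce(new Conditional(inner, a, b)) == b); // prints true
  }
}
```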
Mitigating overfitting in pipelined models
Pipelines frequently overfit on the training data when one model consumes the output of another. Consider a degenerate “Label Memorizer” model that simply memorizes the labels from the training data and uses them to trivially “predict” the label for each training example. If we then use this prediction as a feature in a logistic regressor, our pipeline would look something like this:
When we train this pipeline, the “predicted” feature from the Label Memorizer will perfectly match the target label of the Logistic Regressor, which will then (in the absence of regularization) give this feature an arbitrarily high weight and achieve perfect accuracy on the training data. However, this pipeline will (horribly) fail to generalize to a new example during inference: since the Label Memorizer will have no memorized label for the example, it will have to provide an arbitrary feature value to the Logistic Regressor, resulting in an arbitrary, uninformed prediction from the pipeline as a whole.
This is obviously an extreme case, but we still face the same problem, to a lesser degree, if we replace the Label Memorizer with something more realistic, such as a boosted decision forest or a neural network, as statistical models will almost invariably memorize their training data to some extent (i.e., their accuracy on training data will exceed that on test data).
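To make the extreme case concrete, here is a minimal sketch of such a Label Memorizer. It is purely illustrative (LabelMemorizerSketch and its fallback label are hypothetical, not part of Dagli): its predictions are perfect on training examples and uninformative on anything else, so a downstream model that learns to trust them has nothing to rely on at inference time.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the degenerate "Label Memorizer"; not a Dagli class. */
final class LabelMemorizerSketch {
  private final Map<String, String> memorized = new HashMap<>();
  private final String fallbackLabel;

  LabelMemorizerSketch(String fallbackLabel) {
    this.fallbackLabel = fallbackLabel;
  }

  /** "Training" simply records the label of every example. */
  void train(String[] examples, String[] labels) {
    for (int i = 0; i < examples.length; i++) {
      memorized.put(examples[i], labels[i]);
    }
  }

  /** Perfect on training examples; an arbitrary fallback for anything unseen. */
  String predict(String example) {
    return memorized.getOrDefault(example, fallbackLabel);
  }

  public static void main(String[] args) {
    LabelMemorizerSketch memorizer = new LabelMemorizerSketch("jazz");
    memorizer.train(new String[] {"alice", "bob"}, new String[] {"rock", "folk"});
    System.out.println(memorizer.predict("alice")); // prints rock (memorized)
    System.out.println(memorizer.predict("carol")); // prints jazz (arbitrary fallback)
  }
}
```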
Dagli solves this by allowing transformers to produce different outputs during training and during inference, which makes strategies like cross-training possible. For instance, if we use XGBoost as our upstream model together with Dagli’s KFoldCrossTrained node with K = 3, Dagli will train three different XGBoost models, each on two-thirds of the training data, and use each model to predict the labels for the remaining one-third (the data is partitioned such that, for each training example, exactly one version of the model did not include it in its training subset). Later, at inference time (when confronted with new examples), the XGBoost model to use is chosen arbitrarily. The benefit of this rather convoluted scheme is that the distribution of the XGBoost predictions is consistent between training and inference. This allows the downstream Logistic Regressor, which depends in part on the XGBoost feature, and the pipeline as a whole to generalize and make successful predictions on new examples.
Example of a pipeline using K-Fold cross-training to prevent the Logistic Regressor from overfitting on the predictions made by an upstream XGBoost model
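The partitioning logic behind K-fold cross-training can be sketched in a few lines of plain Java. This is not Dagli's KFoldCrossTrained node: the names are hypothetical, the fold assignment (example i goes to fold i mod K) is an assumption made for brevity, and a trivial mean-of-labels "model" stands in for XGBoost. What it demonstrates is that every training example receives its prediction from the one model that never saw it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not Dagli's KFoldCrossTrained: K-fold cross-training in plain Java. */
final class CrossTrainingSketch {

  /** Trivial stand-in "model": predicts the mean of the labels it was trained on. */
  static double trainMeanModel(double[] labels, List<Integer> trainingIndices) {
    double sum = 0;
    for (int index : trainingIndices) {
      sum += labels[index];
    }
    return sum / trainingIndices.size();
  }

  /**
   * Returns, for each training example, the prediction of the one model whose training
   * subset excluded it. Example i is assigned to fold (i % k); assumes 2 <= k <= labels.length.
   */
  static double[] crossTrainedPredictions(double[] labels, int k) {
    double[] predictions = new double[labels.length];
    for (int fold = 0; fold < k; fold++) {
      List<Integer> trainingIndices = new ArrayList<>();
      for (int i = 0; i < labels.length; i++) {
        if (i % k != fold) {
          trainingIndices.add(i); // every example outside the held-out fold
        }
      }
      double model = trainMeanModel(labels, trainingIndices);
      for (int i = fold; i < labels.length; i += k) {
        predictions[i] = model; // held-out example: this model never saw labels[i]
      }
    }
    return predictions;
  }

  public static void main(String[] args) {
    double[] labels = {1, 2, 3, 4, 5, 6};
    // prints [4.0, 3.5, 3.0, 4.0, 3.5, 3.0]
    System.out.println(Arrays.toString(crossTrainedPredictions(labels, 3)));
  }
}
```

Because none of these predictions were produced by a model that saw the corresponding label, they resemble what the upstream model will produce for genuinely new examples, which is what the downstream model needs to learn from.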
Dagli uses DAGExecutors to prepare and apply (do inference with) DAGs; different executors process DAGs in different ways (and Dagli may add additional executors for very-large-scale cross-machine learning in the future), but normally clients can just stick with the defaults.
The ideal executor will of course vary depending on the task: for preparation, the goal is to maximize throughput, which (within a single machine) is essentially maximizing parallelism while minimizing overhead (e.g., synchronization amongst threads). Dagli’s MultithreadedDAGExecutor achieves efficient parallelism in numerous ways:
- Examples are conceptually split into small, sequential, fixed-size “blocks.” Processing values in blocks (rather than individually) reduces overhead, especially for transformers that support minibatching (this is critical for efficient inference in, e.g., neural networks).
- Transformers can execute on a block of examples as soon as the corresponding values are available from their parents, rather than waiting for these input nodes to process all examples first. Bounded buffers limit the intermediate data stored in memory at any given time.
- Preparable transformers need to see all examples before they can yield a prepared transformer that will ultimately produce output values, but they can still do much (or all) of their work as input values stream in (on the first pass), and the resulting prepared transformer will process many blocks in parallel.
- Node implementations such as NeuralNetwork, FastText, and XGBoostClassification are themselves heavily multithreaded and can also achieve parallelized computation via SIMD or GPUs; when possible, they can share the executor’s ForkJoinPool to avoid excessive concurrent threads (and the attendant cost of contention).
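The block-oriented strategy in the list above can be approximated with standard Java concurrency utilities. The sketch below is not MultithreadedDAGExecutor, and its names are hypothetical; it only shows examples being split into fixed-size blocks, each block being handed to a minibatch-style function, and the work running on a shared ForkJoinPool. It relies on the widely used (though not formally specified) behavior that a parallel stream started from inside a ForkJoinPool task runs on that pool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch, not Dagli's executor: applying a transformer to blocks in parallel. */
final class BlockExecutionSketch {

  /** Splits examples into sequential blocks of at most blockSize elements. */
  static <T> List<List<T>> toBlocks(List<T> examples, int blockSize) {
    List<List<T>> blocks = new ArrayList<>();
    for (int start = 0; start < examples.size(); start += blockSize) {
      blocks.add(examples.subList(start, Math.min(start + blockSize, examples.size())));
    }
    return blocks;
  }

  /** Applies a block-at-a-time function to all blocks in parallel, preserving example order. */
  static <T, R> List<R> applyInBlocks(List<T> examples, int blockSize,
      Function<List<T>, List<R>> blockTransformer, ForkJoinPool pool) {
    List<List<T>> blocks = toBlocks(examples, blockSize);
    Callable<List<R>> task = () -> blocks.parallelStream()
        .map(blockTransformer)
        .flatMap(List::stream)
        .collect(Collectors.toList());
    return pool.submit(task).join();
  }

  /** Stand-in for a transformer that handles a whole minibatch at once. */
  static List<Integer> doubleAll(List<Integer> block) {
    List<Integer> result = new ArrayList<>(block.size());
    for (int value : block) {
      result.add(value * 2);
    }
    return result;
  }

  public static void main(String[] args) {
    ForkJoinPool pool = new ForkJoinPool(4);
    List<Integer> doubled =
        applyInBlocks(Arrays.asList(1, 2, 3, 4, 5, 6, 7), 3, BlockExecutionSketch::doubleAll, pool);
    System.out.println(doubled); // prints [2, 4, 6, 8, 10, 12, 14]
    pool.shutdown();
  }
}
```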
For inference, Dagli’s default FastPreparedDAGExecutor adopts a different, specialized strategy, arranging examples into blocks as before, but processing each in a single executor thread, minimizing overhead (especially inter-thread synchronization). For batch inference, this results in higher throughput, but it’s even more important for online inference, where it also minimizes latency. Of course, in the rare case where the prepared DAG is sufficiently large and computationally expensive, a MultithreadedDAGExecutor might offer even lower latency at the cost of lower overall throughput, and clients are free to use the executor that best fits their needs.
What’s in the box
Dagli ships with a rather extensive collection of components; however, it’s also trivially easy to use existing methods as transformers in your DAG with FunctionResult nodes, e.g.:
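As a rough illustration of the idea, using hypothetical names rather than Dagli's actual FunctionResult classes, an existing method can be wrapped as a node by capturing a method reference behind a functional interface that is also Serializable, so that the node can be serialized along with the rest of the pipeline:

```java
import java.io.Serializable;
import java.util.function.Function;

/** Hypothetical sketch, not Dagli's FunctionResult: wrapping an existing method as a node. */
final class FunctionNodeSketch {

  /** A function type that is also Serializable, so a node holding it can be serialized. */
  interface SerializableFunction<A, R> extends Function<A, R>, Serializable {}

  /** A minimal "transformer" that applies the wrapped function to each example's input. */
  static final class FunctionNode<A, R> implements Serializable {
    private static final long serialVersionUID = 1L;
    private final SerializableFunction<A, R> function;

    FunctionNode(SerializableFunction<A, R> function) {
      this.function = function;
    }

    R apply(A input) {
      return function.apply(input);
    }
  }

  public static void main(String[] args) {
    // An existing JDK method, String::length, reused as a node without writing a new class
    FunctionNode<String, Integer> length = new FunctionNode<>(String::length);
    System.out.println(length.apply("Dagli")); // prints 5
  }
}
```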
And, of course, it’s straightforward to create new transformers, too.
Many common models are included: K-means Clustering, Gradient Boosted Decision Trees (XGBoost), Logistic Regression (liblinear), Isotonic Regression, FastText (an enhanced Java port), and Neural Networks.
Neural networks may be assembled seamlessly as part of the encompassing DAG definition using Dagli’s layer-oriented API, with the architecture specified as a directed acyclic graph of layer nodes. Many types of layers are provided, with more planned in the future; an even wider range of model architectures is supported by using CustomNeuralNetwork to wrap an arbitrary DeepLearning4J model. And, of course, if you have an existing, already-trained TensorFlow or PyTorch model, you can use their respective Java bindings to implement a new transformer that wraps them, too (although unfortunately, defining and training new models from Java is not yet well-supported by either framework).
Dagli provides “meta transformers” for model selection (choosing the best of a set of candidate models), cross-training (used to avoid overfitting when one model’s prediction is an input to another), and other, more specialized uses (like training independent model variants on arbitrary “groups” of examples, as might be done for per-cohort residual modeling).
Dagli offers a diverse set of transformers, including those for text (e.g., tokenization), bucketization, statistics (e.g., order statistics), lists (e.g., ngrams), feature vectorization, manipulating discrete distributions, and many others. The list of Dagli modules can provide a good starting point for finding the transformer you need.
Evaluation algorithms for several types of problems are included as transformers that can either be used independently or as part of a DAG. Input data can be provided to Dagli in any form (example values just need to be available from some arbitrary Iterable), but we do include classes for conveniently reading delimiter-separated value (DSV) and Avro files, or writing and reading back Kryo-serialized objects. @Structs provide an easy-to-use, highly bug-resistant way to represent examples. Finally, visualizers for rendering DAGs as ASCII art or Mermaid markdown are provided (clients are, of course, free to add others), which can be especially helpful when documenting and explaining your model.
Dagli also contains several “sublibraries” that are useful independently of Dagli:
- com.linkedin.dagli.tuple: Provides tuples, sequences of typed fields such as Tuple3<String, Integer, Boolean> (a triplet of a String, an Integer and a Boolean).
- com.linkedin.dagli.util.function: Functional interfaces for a wide range of arities and all primitive return types, including support for creating “safely serializable” function objects from method references.
- com.linkedin.dagli.util.*: A collection of data structures (BigHashMap, LinkedNode, LazyMap…) and many other utility classes (Iterables, ArraysEx, ValueEqualityChecker…) too extensive to adequately document here.
- com.linkedin.dagli.math: Vectors, discrete distributions, and hashing.
With Dagli, we hope to make efficient, production-ready models easier to write, revise, and deploy, avoiding the technical debt and long-term maintenance challenges that so often accompany them. If you’re interested in using Dagli in your own projects, please learn more at our GitHub page, or jump straight to the list of extensively-commented code examples.
Thanks to Romer Rosales and Dan Bikel for their support of this project, David Golland for contributing the Isotonic Regression model, Juan Bottaro for his tokenizer implementation, and Haowen Ning, Vita Markman, Diego Buthay, Andris Birkmanis, Mohit Wadhwa, Rajeev Kumar, Phaneendra Angara, Deirdre Hogan, and many others for their extensive feedback and suggestions.