FeatureFu: Building Featureful Machine Learning Models
September 4, 2015
LinkedIn's FeatureFu project is a new open source toolkit designed to enable creative and agile feature engineering for machine learning tasks such as statistical modeling (classification, clustering, and regression) and rule-based decision engines. In this blog post, we will detail the design and implementation of Expr in FeatureFu, show how feature engineering becomes more powerful with this toolkit, and demonstrate how the technique blurs the boundary between modeling and feature engineering. We share our practices here and encourage you to do the same by sharing your own experience with FeatureFu.
In many large-scale recommendation systems, offline modeling and online feature serving/model scoring are often handled by different teams and/or different code bases. Such a system is brittle and vulnerable to online/offline parity issues, because subtle implementation discrepancies and dependencies can cause the generated features to differ. Additionally, a small change in feature generation (e.g. binning a continuous numeric feature into a few discrete bucketized features) requires a significant amount of work, likely a full online code change, with a long turnaround period. This is typically a roadblock to experimenting with feature and model techniques. As another example, serving a decision tree in an online recommendation system that mostly supports only logistic regression would be challenging and time consuming, while serving the decision tree as transformed features consumed by the logistic regression framework is as easy as a model configuration change.
To unify the feature engineering process and remove the above inconsistencies, we use Expr, a lightweight Java library that can transform and build features on top of an existing feature pool with great flexibility. Once deployed to an online feature generation framework, it eliminates any further need for code changes when shipping models with a wide range of derived features. For example, in homepage feed ranking for a professional social network, we often want to capture a member's preference for different feed types (e.g. a news article from an influencer, a recent job change of a connection) by counting the member's historical likes and impressions of each feed type.
The raw counts usually need to be combined into a like-per-impression ratio with smoothing before they can be used as a stable feature, via a mathematical formula like (1+likes)/(10+impressions). Normally, the formula has to be coded into an online feature serving system, and any change to the formula requires a code change and deployment, with significant operational overhead. With Expr and FeatureFu, we only need to write the formula as an s-expression, "(/ (+ 1 likes) (+ 10 impressions))", and include it in the model configuration file. Any further change to the formula, such as additional smoothing by taking the logarithm of the counts, becomes just a configuration change to the s-expression itself: "(- (log2 (+ 10 impressions)) (log2 (+ 1 likes)))". This is much more flexible and agile.
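To make the smoothing concrete, here is a minimal sketch in plain Java of the same computation the s-expression "(/ (+ 1 likes) (+ 10 impressions))" encodes; the class name and sample counts are illustrative, not part of FeatureFu:

```java
// Smoothed like-per-impression ratio, equivalent to the s-expression
// "(/ (+ 1 likes) (+ 10 impressions))". Counts below are made up.
public class SmoothedRatio {
    static double smoothedRatio(double likes, double impressions) {
        return (1 + likes) / (10 + impressions);
    }

    public static void main(String[] args) {
        // A member with 4 likes out of 90 impressions: (1+4)/(10+90) = 0.05
        System.out.println(smoothedRatio(4, 90));
        // With zero history the feature degrades gracefully to 1/10 = 0.1
        System.out.println(smoothedRatio(0, 0));
    }
}
```

The additive constants (1 and 10) keep the ratio well defined and stable for members with little or no history, which is exactly why the smoothed form is preferred over raw likes/impressions.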
Expr: S-Expression Parser and Evaluator
The s-expression is a very powerful building block for languages like Lisp. It strikes a nice balance between expressive power and rigor (it is less ambiguous), and it can be used to define new features by writing s-expressions directly in a model configuration file. We implemented an s-expression parser and evaluator in Java as a standalone artifact; it is lightweight and has no external dependencies.
An s-expression is classically defined as either:
1. an atom (constant or variable), or
2. an expression of the form (operator x y) where x and y are s-expressions
With Expr, an s-expression string needs to be parsed only once into an in-memory Java object; any further evaluation of the same s-expression requires only parameter substitution. More implementation details and sample client code can be found in our open source project, FeatureFu.
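The parse-once, evaluate-many design can be sketched in a few dozen lines of plain Java. This is not the FeatureFu API, just a self-contained toy evaluator (binary operators only, whereas Expr also supports unary and ternary operators) that shows the recursive definition above in action:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Toy s-expression evaluator (not the FeatureFu API): parse a string like
// "(/ (+ 1 likes) (+ 10 impressions))" into a tree once, then evaluate it
// repeatedly against different variable bindings.
public class MiniExpr {
    interface Node { double eval(Map<String, Double> vars); }

    static Node parse(String s) {
        Deque<String> tokens = new ArrayDeque<>();
        for (String t : s.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+")) {
            tokens.add(t);
        }
        return read(tokens);
    }

    private static Node read(Deque<String> tokens) {
        String t = tokens.poll();
        if ("(".equals(t)) {
            String op = tokens.poll();          // operator, e.g. "+", "/", "max"
            Node left = read(tokens);           // x and y are s-expressions themselves
            Node right = read(tokens);
            tokens.poll();                      // consume ")"
            return vars -> apply(op, left.eval(vars), right.eval(vars));
        }
        try {
            double c = Double.parseDouble(t);   // atom: constant
            return vars -> c;
        } catch (NumberFormatException e) {
            return vars -> vars.get(t);         // atom: variable, bound at eval time
        }
    }

    private static double apply(String op, double x, double y) {
        switch (op) {
            case "+":   return x + y;
            case "-":   return x - y;
            case "*":   return x * y;
            case "/":   return x / y;
            case "max": return Math.max(x, y);
            case "min": return Math.min(x, y);
            default: throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        Node expr = parse("(/ (+ 1 likes) (+ 10 impressions))"); // parse once
        // Re-evaluate with different bindings, no re-parsing needed:
        System.out.println(expr.eval(Map.of("likes", 4.0, "impressions", 90.0))); // 0.05
        System.out.println(expr.eval(Map.of("likes", 0.0, "impressions", 0.0)));  // 0.1
    }
}
```

Because the parsed tree is just nested closures, evaluation per scoring request is a cheap tree walk, which is what makes this approach viable inside an online feature serving path.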
Use Cases in Machine Learning Modeling
There are many possible use cases for this simple tool in machine learning. Listed below are selected examples in feature normalization, feature transformation, feature binding, model featurization, model calibration, and model cascading (e.g. two-pass modeling). The full potential of this simple technique is limited only by your imagination.
- Feature normalization
"(min 1 (max (+ (* slope x) intercept) 0))" : scale feature x by slope and intercept, then clamp to [0,1]; here min and max are operators
- Feature binding
"(- (log2 (+ 5 impressions)) (log2 (+ 1 clicks)))" : combine #impressions and #clicks into a smoothed CTR-style feature
- Nonlinear featurization
"(if (> query_doc_matches 0) 0 1)" : negation of a query/document matching feature; this also represents a small decision tree
- Cascading modeling
"(sigmoid (+ (* x1 w1) w0))" : convert a simple logistic regression model into a feature; this is also Platt scaling for model score calibration
- Model combination (e.g. combine decision tree and linear regression)
"(+ (* model1_score w1) (* model2_score w2))" : combine two model scores into one final score; as you may guess, model1_score and model2_score can themselves be substituted with s-expressions (such as the two above)
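Two of the use cases above can be checked numerically with a small sketch in plain Java (the weights and feature values below are made up for illustration): the clamp-style normalization and the sigmoid/Platt-scaling featurization.

```java
// Numeric illustration of two use cases above, with made-up values:
// feature normalization and turning a logistic-regression score into
// a feature via the sigmoid (Platt scaling).
public class UseCaseDemo {
    // "(sigmoid (+ (* x1 w1) w0))"
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // "(min 1 (max (+ (* slope x) intercept) 0))"
    static double normalize(double x, double slope, double intercept) {
        return Math.min(1.0, Math.max(slope * x + intercept, 0.0));
    }

    public static void main(String[] args) {
        System.out.println(normalize(2.0, 0.25, 0.1)); // 0.25*2+0.1 = 0.6
        System.out.println(normalize(9.0, 0.25, 0.1)); // 2.35, clamped to 1.0
        System.out.println(sigmoid(0.0));              // 0.5 at the decision boundary
    }
}
```

In the model-combination use case, the same functions compose: sigmoid(w1*x1 + w0) can serve as model1_score inside the weighted sum, which is exactly the nesting that s-expressions make a pure configuration concern.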
- S-Expression validation and visualization
Last but not least, Expr can also be used from the command line as a Java console application for s-expression validation and visualization, for example:
$java -cp expr-1.0.jar com.linkedin.featurefu.expr.Expression \
"(+ 0.5 (* (/ 15 1000) (ln (- 55 12))))"
=(0.5+((15.0/1000.0)*ln((55.0-12.0))))
=0.5564180017354035
tree
└── +
    ├── 0.5
    └── *
        ├── /
        |   ├── 15.0
        |   └── 1000.0
        └── ln
            └── -
                ├── 55.0
                └── 12.0
In future versions, we will introduce more feature generation and analysis tools. We also welcome contributions of all kinds, including pull requests, code contributions, bug reports, documentation enhancements, and new ideas or feedback!
Thanks to the LinkedIn Online Relevance Team for productionizing it, the LinkedIn Jobs Relevance Team for being the first customer, and Leo Tang for all the feedback and suggestions.