DataFu 1.0

September 4, 2013

DataFu is an open-source collection of user-defined functions for working with large-scale data in Hadoop and Pig.

About two years ago, we recognized a need for a stable, well-tested library of Pig UDFs that could assist in common data mining and statistics tasks. Over the years, we had developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.” The unfortunate part, and this is true of many such efforts, is that the UDFs were ill-documented, ill-organized, and easily got broken when someone made a change. Along came PigUnit, which allowed UDF testing, so we spent the time to clean up these routines by adding documentation and rigorous unit tests. From this “datafoo” package, we thought this would help the community at large, and there you have the initial release of DataFu.

Since then, the project has continued to evolve. We have accepted contributions from a number of sources, improved the style and quality of testing, and adapted to the changing features and versions of Pig. During this time DataFu has been used extensively at LinkedIn for many of our data driven products like "People You May Known" and "Skills and Endorsements." The library is used at numerous companies, and it has also been included in Cloudera's Hadoop distribution (CDH) as well as the Apache BigTop project. DataFu has matured, and we are proud to announce the 1.0 release.

Download DataFu 1.0

This release of DataFu has a number of new features that can make writing Pig easier, cleaner, and more efficient. In this post, we are going to highlight some of these new features by walking through a large number of examples. Think of this as a HowTo Pig + DataFu guide.

Counting events

Let's consider a hypothetical recommendation system. In this system, a user will be recommended an item (an impression). The user can then accept that recommendation, explicitly reject that recommendation, or just simply ignore it. A common task in such a system would be to count how many times users have seen and acted on items. How could we construct a pig script to implement this task?


Before we start, it's best to define what exactly we want to do, our inputs and our outputs. The task is to generate, for each user, a list of all items that user has seen with a count of how many impressions were seen, how many were accepted, and how many were rejected.

In summary, our desired output schema is:

features: {user_id:int, items:{(item_id:int, impression_count:int, accept_count:int, reject_count:int)}}

For input, we can load a record for each event:

A naive approach

The straight-forward approach to this problem generates each of the counts that we want, joins all of these counts together, and then groups them up by the user to produce the desired output:

Unfortunately, this approach is not very efficient. It generates six mapreduce jobs during execution and streams a lot of the same data through these jobs.

A better approach

Recognizing that we can combine the outer joins and group operations into a single cogroup allows us to reduce the number of mapreduce jobs.

However, we still have to perform an extra group operation to bring everything together by user_id for a total of two mapreduce jobs.

The best approach: DataFu

The two grouping operations in the last example operate on the same set of data. It would be great if we could just get rid of one of them somehow.

One thing that we have noticed is that even very big data will frequently get reasonably small once you segment it sufficiently. In this case, we have to segment down to the user level for our output. That's small enough to fit in memory. So, with a little bit of DataFu, we can group up all of the data for that user, and process it in one pass:

So, let's step through this example and see how it works and what our data looks like along the way.

Group the features

First we group all of the data together by the user, getting a few bags with all of the respective event data in the bag.

features_grouped = COGROUP impressions BY user_id, accepts BY user_id, rejects BY user_id;

--features_grouped: {group: int,impressions: {(user_id: int,item_id: int,timestamp: long)},accepts: {(user_id: int,item_id: int,timestamp: long)},rejects: {(user_id: int,item_id: int,timestamp: long)}}


Next we count the occurences of each item in the impression, accept and reject bag.

DEFINE CountEach datafu.pig.bags.CountEach('flatten');

features_counted = FOREACH features_grouped GENERATE 
    group as user_id,
    CountEach(impressions.item_id) as impressions,
    CountEach(accepts.item_id) as accepts,
    CountEach(rejects.item_id) as rejects;

--features_counted: {user_id: int,impressions: {(item_id: int,count: int)},accepts: {(item_id: int,count: int)},rejects: {(item_id: int,count: int)}}

CountEach is a new UDF in DataFu that iterates through a bag counting the number of occurrences of each distinct tuple. In this case, we want to count occurrences of items, so we project the inner tuples of the bag to contain just the item_id. Since we specified the optional 'flatten' argument in the constructor, the output of the UDF will be a bag of each distinct input tuple (item_id) with a count field appended.


Now, we want to combine all of the separate counts for each type of event together into one tuple per item.

DEFINE BagLeftOuterJoin datafu.pig.bags.BagLeftOuterJoin();

features_joined = FOREACH features_counted GENERATE
        impressions, 'item_id',
        accepts, 'item_id',
        rejects, 'item_id'
    ) as items;

--features_joined: {user_id: int,items: {(impressions::item_id: int,impressions::count: int,accepts::item_id: int,accepts::count: int,rejects::item_id: int,rejects::count: int)}}

This is a join operation, but unfortunately, the only join operation that pig allows on bags (in a nested foreach) is CROSS. DataFu provides the BagLeftOuterJoin UDF to make up for this limitation. This UDF performs an in-memory hash join of each bag using the specified field as the join key. The output of this UDF mimics what you would expect from this bit of not (yet) valid Pig:

features_joined = FOREACH features_counted {
  items = JOIN impressions BY item_id LEFT OUTER, accepts BY item_id, rejects BY item_id;
    user_id, items;

Because BagLeftOuterJoin is a UDF and works in memory, a separate map-reduce job is not launched. This fact will save us some time as we'll see later on in the analysis.


Finally, we have our data in about the right shape. We just need to clean up the schema and put some default values in place.

DEFINE Coalesce datafu.pig.util.Coalesce();

features = FOREACH features_joined {
    projected = FOREACH items GENERATE
        impressions::item_id as item_id,
        impressions::count as impression_count,
        Coalesce(accepts::count, 0) as accept_count,
        Coalesce(rejects::count, 0) as reject_count;
  GENERATE user_id, projected as items;

--features: {user_id: int,items: {(item_id: int,impression_count: int,accept_count: int,reject_count: int)}}

The various counts were joined together using an outer join in the previous step because a user has not necessarily performed an accept or reject action on each item that he or she has seen. If they have not acted, those fields will be null. Coalesce returns its first non-null parameter, which allows us to cleanly replace that null with a zero, avoiding the need for a bincond operator and maintaining the correct schema. Done!


Ok great, we now have three ways to write the same script. We know that the naive way will trigger six mapreduce jobs, the better way two, and the DataFu way one, but does that really equate to a difference in performance?

Since we happened to have a dataset with a few billion records in it lying around, we decided to compare the three. We looked at two different metrics for evaluation. One is the best case wall clock time. This metric is basically the sum of the slowest map and reduce task for each job (using pig default parallelism estimates). The other is total compute time which is the sum of all map and reduce task durations.

VersionWall clock time %Total compute time %naive100.0%100.0%better30.2%62.6%datafu10.5%43.0%

As we can see, the DataFu version provides a noticable improvement in both metrics. Glad to know that work wasn't all for naught.

Creating a custom purpose UDF

Many UDFs, such as those presented in the previous section, are general purpose. DataFu serves to collect these UDFs and make sure they are tested and easily available. If you are writing such a UDF, then we will happily accept contributions. However, frequently when you sit down to write a UDF, it is because you need to insert some sort of custom business logic or calculation into your pig script. These types of UDFs can easily become complex, involving a large number of parameters or nested structures.

Positional notation is bad

Even once the code is written, you are not done. You have to maintain it.

One of the difficult parts about this maintenance is that, as the pig script that uses the UDF changes, a developer has to be sure not to change the parameters to the UDF. Worse, because a standard UDF references fields by positions, it's very easy to introduce a subtle change that has an unintended side effect that does not trigger any errors during runtime, for example, when two fields of the same type swap positions.

Aliases can be better

Using aliases instead of positions makes it easier to maintain a consistent mapping between the UDF and the pig script. If an alias is removed, the UDF will fail with an error. If an alias changes position in a tuple, the UDF does not need to care. The alias also has some semantic meaning to the developer which can aid in the maintenance proces.


Unfortunately, there is a problem using aliases. As of Pig 11.1 they are not available when the UDF is exec'ing on the back-end; they are only available on the front-end. The solution to this is to capture a mapping of alias to position on the front-end, store that mapping into the UDF context, retreive it on the back-end, and use it to look up each position by alias. You also need to handle a few issues with complex schemas (nested tuples and bags), keeping track of UDF instances, etc. To make this process simpler, DataFu provides AliasableEvalFunc, an extension to the standard EvalFunc with all of this behavior included.

Mortgage payment example

Using AliasableEvalFunc is pretty simple; the primary difference is that you need to override getOutputSchema instead of outputSchema and have access to the alias, position map through a number of convenience methods. Consider the following example:

In this script we retrieve by alias from the input tuple a couple of different types of fields. One of these fields is a bag, and we also want to get values from the tuples in that bag. To avoid having namespace collisions among the different levels of nested tuples, AliasableEvalFunc prepends the name of the enclosing bag or tuple. Thus, we use getPrefixedAliasName to find the field interest_rate inside the bag named interest_rates. That's all there is to using aliases in a UDF. As an added benefit, being able to dump schema information on errors helps in developing and debugging the UDF (see datafu.pig.util.DataFuException).

LinearRegression example

Having access to the schema opens up UDF development possibilities. Let's look back at the recommendation system example from the first part. The script in that part generated a bunch of features about the items that users saw and clicked. That's a good start to a recommendation workflow, but the end goal is to select which items to recommend. A common way to do this is to assign a score to each item based on some sort of machine learning algorithm. A simple algorithm for this task is linear regression. Ok, let's say we've trained our first linear regression model and are ready to plug it in to our workflow to produce our scores.

We could develop a custom UDF for this model that computes the score. It is just a weighted sum of the features. So, using AliasableEvalFunc we could retrieve each field that we need, multiply by the correct coefficient, and then sum these together. But, then every time we change the model, we are going to have to change the UDF to update the fields and coefficients. We know that our first model is not going to be very good and want to make it easy to plug in new models.

The model for a linear regression is pretty simple; it's just a mapping of fields to coefficient values. The only things that will change between models are which fields we are interested in and what the coefficient for those fields will be. So, let's just pass in a string representation of the model and then let the UDF do the work of figuring out how to apply it.

Nice, that's clean, and we could even pass that model string in as a parameter so we don't have to change the pig script to change the model either -- very reusable.

Now, the hard work, writing the UDF:

Ok, maybe not that hard... The UDF parses out the mapping of field to coeffcient in the constructor and then looks up the specified fields by name in the exec function. So, what happens when we change the model? If we decide to drop a field from our model, it just gets ignored, even if it is in the input tuple. If we add a new feature that's already available in the data it will just work. If we try and use a model with a new feature and forget to update the pig script, it will throw an error and tell us the feature that does not exist (as part of the behavior of getDouble()).

Combining this example with the feature counting example presented earlier, we have the basis for a recommendation system that was easy to write, will execute quickly, and will be simple to maintain.

Sampling the data

Working with big data can be a bit overwhelming and time consuming. Sometimes you want to avoid some of this hassle and just look at a portion of this data. Pig has built-in support for random sampling with the Sample operator. But sometimes a random percentage of the records is not quite what you need. Fortunately, DataFu has a few sampling UDFs that will help in some situations, and as always, we would be happy to accept any contributions of additional sampling UDFs, if you happen to have some lying around.

These things always are easier to understand with a bit of code, so let's go back to our recommendation system context and look at a few more examples.

Example 1. Generate training data

We had mentioned previously that we were going to use a machine learning algorithm, linear regression, to generate scores for our items. We waived our hands and it happened previously, but generally this task involves some work. One of the first steps is to generate the training data set for the learning algorithm. In order to make this training efficient, we only want to use a sample of all of our raw data.


Given impression, accepts, rejects and some pre-computed features about a user and items, we'd like to generate a training set, which will have all of this information for each user_id, item_id pair, for some sample of users.

So, from this input:

We want to produce this type of output:

{user_id, item_id, is_impressed, is_accepted, is_rejected, feature_1, feature_2}

One key point on sampling here: We want the sampling to be done by user_id. This means that if we choose one user_id to be included in the sample, all the data for that user_id should be included in the sample. This requirement is needed to preserve the original characteristics of raw data in the sampled data as well.

Naive approach

The staright-foward solution for this task will be group the tracking data for each user, item pair, then group it by user_id, sample this grouped data, and then flatten it all out again again:

This job includes two group operations, which translates to two map-reduce jobs. Also, the group operation is being done on the full data even though we will sample it down later. Can we do any better than this?

A sample of DataFu -- SampleByKey


We can use the SampleByKey FilterFunc to do this with only one group operation. And, since the group is operating on the already sampled (significantly smaller) data this job will be far more efficient.

SampleByKey lets you designate which fields you want to use as keys for the sampling, and guarantees that for each selected key, all other records with that key will also be selected, which is exactly what we want. Another charasteritic of SampleByKey is that it is deterministic, as long as the same salt is given on initialization. Thanks to this charastristic, we were able to sample the data seperately before we join them from the above example.

Example 2. Recommending your output

Ok, we've now created some training data that we used to create a model which will produce a score for each recommendation. So now we've got to pick which items to show the user. But, we've got a bit of a problem, we only have limited real-estate on the screen to present our recommendations, so how do we select which ones to show? We've got a score from our model so we could just always pick the top scoring items. But then we might be showing the same recommendations all the time, and we want to shake things up a bit so things aren't so static (OK, yes, I admit this is a contrived example; you wouldn't do it this way in real life). So let's take a sample of the output.


With this input:

We want to produce the exact same output, but with fewer items per user -- let's say no more than 10.

Naive approach

We can randomize using Pig's default Sample command.

The problem of this approach is that results are sampled from the population in a uniformly random fashion. The score you created with your learning algorithm does not have any effect on generating final results.

The DataFu you most likely need -- WeightedSample

We should use that score we generated to help bias our sample.

Fortunately, WeightedSample can do exactly that. It will randomly select from the candidates, but the scores of each candidate will be used as the probability of whether the candidate will be seleceted or not. So, the tuples with higher weight will have a higher chance to be included in sample - perfect.

Additional Examples

If you've made it this far into the post, you deserve an encore. So here are two more examples of how DataFu can make writing pig a bit simpler for you:

Filtering with In

One case where conditional logic can be painful is filtering based on a set of values. Suppose you want to filter tuples based on a field equalling one of a list of values. In Pig this can be achieved by joining a list of conditional checks with OR:

However as the number of items to check for grows this becomes very verbose. The In filter function solves this and makes the resulting code very concise:

Left Outer Join of three or more relations with EmptyBagToNullFields

Pig's JOIN operator supports performing left outer joins on two relations only. If you want to perform a join on more than two relations you have two options. One is to perform a sequence of joins.

However this can be inefficient as it requires multiple MapReduce jobs. For many situations, a better option is to use a single COGROUP which requires only a single MapReduce job. However the code gets pretty ugly.

This code uses the insight that the input1 bag will be empty when there is no match, and flattening this will remove the entire record. If the input2 or input3 bags are empty we don't want flattening them to remove the record though, so we replace them with a bag having a single tuple with null elements. When these are flattened we get a single tuple with null elements. But, we want our output to have the correct schema, so we have to specify it manually. Once, we do all of this, the approach successfully replicates the left join behavior. It's more efficient, and it's really ugly to type and read.

To clean up this code we have created EmptyBagToNullFields, which replicates the same logic as in the example above, but in a much more concise and readable fashion.

Notice that we do not need to specify the schema as with the previous COGROUP example. The reason is that EmptyBagToNullFields produces the same schema as the input bag. So much cleaner.

Final Example

Ok, a second encore, but no more. If you are doing a lot of these, you can turn this into a macro:

Then all you need to do is call your macro

features = left_outer_join(input1, val1, input2, val2, input3, val3);


So, that's a lot to digest, but it's just a highlight into a few interesting pieces of DataFu. Check out the DataFu 1.0 release as there's even more in store.

We hope that it proves valuable to you and as always welcome any contributions. Please let us know how you're using the library — we would love to hear from you.

Additional Credits

Thanks to Matt Hayes, Evion Kim, and Sam Shah for contributions to this post.

Update: DataFu is now an Apache Incubator project. All links in this blog post have been updated to point to the new Apache DataFu homepage. This blog post has also been cross-posted on the Apache DataFu Blog.