The project was created in January 2012 and was initially positioned as a collection of user-defined functions (UDFs) for the Apache Pig project. Compared with more general UDF collections such as Piggybank, DataFu focuses more on data mining and statistical functions, such as quantile computation and sampling methods. A new library named DataFu Hourglass was added to the project in October 2013. Hourglass is a library for MapReduce that supports incremental data processing, typically by saving the state of the previous job in HDFS and using it when processing new input. Both libraries are now part of the incubator.
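The incremental pattern described above (keep the previous job's state, then combine it with only the new input) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hourglass's actual API: the function name `incremental_count` and the file `state.json` are hypothetical, and a local JSON file stands in for state stored in HDFS.

```python
import json
import os
import tempfile
from collections import Counter


def incremental_count(state_path, new_events):
    """Merge counts from new_events into the totals saved by the previous run.

    Hypothetical helper for illustration only; a local JSON file stands in
    for job state that a real pipeline would keep in HDFS.
    """
    totals = Counter()
    if os.path.exists(state_path):
        # Reuse the previous run's output instead of rescanning old input.
        with open(state_path) as f:
            totals.update(json.load(f))
    # Only the new input is processed in this run.
    totals.update(new_events)
    # Persist the merged state so the next run can start from it.
    with open(state_path, "w") as f:
        json.dump(totals, f)
    return dict(totals)


if __name__ == "__main__":
    state_file = os.path.join(tempfile.mkdtemp(), "state.json")
    print(incremental_count(state_file, ["a", "b", "a"]))  # first batch
    print(incremental_count(state_file, ["b", "c"]))       # second batch reuses saved state
```

The second call never sees the first batch's raw events. It reads only the saved totals, which is what lets each run's cost scale with the new input rather than with the full history.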
DataFu's entry into Apache is a big step forward for the project. Any project must pass a rigorous review and a vote before entering the incubator. DataFu, created in early 2012, successfully entered the incubator at the beginning of 2014. Typically, an Apache project is incubated for a period of time, and once the project's supporting infrastructure (wiki, mailing lists, tutorials, etc.) is in place, DataFu will graduate from incubation and become a top-level ASF project or a Hadoop subproject.
Having recently entered the Apache incubator, DataFu has a number of near-term development plans. One of the most important is to provide the same UDFs for Hive and Crunch so that they can be used in a wider range of applications. The plans also include porting the project's build system to Gradle, work that the DataFu community is currently doing. The benefit of moving the build system from Ant to Gradle is a simpler process for the community to add new functionality.
The DataFu community is still relatively small but has grown steadily. A recent contribution from Russell Jurney makes Apache OpenNLP support part of DataFu 1.3.0. The focus on the mailing list is adding more UDFs, in order to make DataFu what project contributors Matthew Hayes and Sam Shah describe as the "WD-40 of big data".