Cascading is a data processing API for Hadoop clusters: rather than implementing Hadoop MapReduce algorithms directly, developers build complex processing workflows through an expressive API.
The API allows developers to quickly assemble complex distributed processes without having to "think in" MapReduce, and to schedule those processes efficiently based on the dependencies between them and other metadata. The core concepts of the Cascading API are pipes and flows. A pipe is a series of processing steps (parsing, looping, filtering, and so on) that defines the data processing to be performed; a flow is the union of a pipe with data sources and data sinks. In other words, a flow is the channel through which data passes. A cascade, in turn, links, branches, and groups multiple flows. The API provides several key features:
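To make the pipe/flow relationship concrete, here is a minimal sketch in plain Java. The class names (`Pipe`, `Flow`) and methods (`each`, `complete`) mirror Cascading's vocabulary but are illustrative stand-ins, not the real Cascading API: a pipe is a chain of processing steps, and a flow binds that pipe to a source and a sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// A "pipe" as a chain of per-record processing steps (hypothetical sketch).
class Pipe {
    private final List<Function<String, String>> steps = new ArrayList<>();

    // Add one processing step (parse, filter, transform, ...) to the chain.
    Pipe each(Function<String, String> step) {
        steps.add(step);
        return this;
    }

    // Run every step, in order, over a single record.
    String apply(String record) {
        String out = record;
        for (Function<String, String> step : steps) {
            out = step.apply(out);
        }
        return out;
    }
}

// A "flow" unites a pipe with a data source and a data sink.
class Flow {
    private final List<String> source;                    // stand-in for a source tap
    private final Pipe pipe;
    private final List<String> sink = new ArrayList<>();  // stand-in for a sink tap

    Flow(List<String> source, Pipe pipe) {
        this.source = source;
        this.pipe = pipe;
    }

    // Pull each record from the source, push it through the pipe, write it to the sink.
    List<String> complete() {
        for (String record : source) {
            sink.add(pipe.apply(record));
        }
        return sink;
    }
}

public class PipeFlowSketch {
    public static void main(String[] args) {
        Pipe pipe = new Pipe()
                .each(String::trim)           // a "parsing" step
                .each(String::toUpperCase);   // a "transform" step
        Flow flow = new Flow(List.of("  hadoop ", " cascading "), pipe);
        System.out.println(flow.complete()); // [HADOOP, CASCADING]
    }
}
```

The point of the separation is the same as in Cascading: the pipe describes *what* processing happens, while the flow decides *where* the data comes from and goes to, so the same pipe can be reused against different sources and sinks.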
- Dependency-based topological scheduling (Topological Scheduler) and MapReduce planning: these two key components of the Cascading API schedule execution based on the dependencies between flows rather than the order in which they were constructed, which allows parts of flows and cascades to be invoked concurrently. In addition, the steps of each flow are intelligently translated into the corresponding map and reduce calls on the Hadoop cluster.
- Event notification: the steps of a flow can issue notifications via callbacks, letting the host application report on and respond to the progress of data processing.
- Scripting: the Cascading API has scripting interfaces for Jython, Groovy, and JRuby, which makes it accessible from common dynamic JVM languages.
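The topological-scheduling idea above can be sketched with a standard dependency sort. This is a generic illustration in plain Java (the names `CascadeScheduler` and `schedule` are hypothetical, not Cascading's API): each flow runs only after the flows it depends on have completed, and every flow sitting in the ready queue at the same moment could in principle run concurrently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of how a cascade might order its flows by dependency
// (Kahn's topological sort), independent of construction order.
public class CascadeScheduler {

    // deps maps each flow name to the list of flows it depends on.
    static List<String> schedule(Map<String, List<String>> deps) {
        Map<String, Integer> pending = new HashMap<>();          // unmet dependency counts
        Map<String, List<String>> dependents = new HashMap<>();  // reverse edges
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            pending.put(e.getKey(), e.getValue().size());
            for (String d : e.getValue()) {
                dependents.computeIfAbsent(d, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            // Every flow currently in `ready` has all dependencies satisfied,
            // so a real scheduler could launch them concurrently.
            String flow = ready.poll();
            order.add(flow);
            for (String next : dependents.getOrDefault(flow, List.of())) {
                if (pending.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("parse", List.of());
        deps.put("filter", List.of("parse"));
        deps.put("report", List.of("filter", "parse"));
        System.out.println(schedule(deps)); // [parse, filter, report]
    }
}
```

Because the order is derived from the dependency graph rather than from the order in which the flows were declared, independent branches fall into the ready queue together, which is exactly what makes concurrent invocation of partial flows and cascades possible.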