What is a big data processing framework?
Processing frameworks and processing engines are responsible for computing over the data in a data system. Although there is no authoritative definition of the difference between an "engine" and a "framework", the former can usually be defined as the component actually responsible for operating on data, and the latter as a set of components designed to perform the same kind of work.
For example, Apache Hadoop can be regarded as a processing framework with MapReduce as its default processing engine. Engines and frameworks can often be swapped out or used together; for example, another framework, Apache Spark, can be hooked into Hadoop to replace MapReduce. This interoperability between components is one of the reasons big data systems are so flexible.
Although the systems responsible for processing data at this stage of the life cycle are usually complex, their goals are broadly the same: to improve understanding by performing operations on the data, surface the patterns the data contains, and gain insight by interacting with complex systems.
To simplify the discussion of these components, we will classify processing frameworks by the state of the data they are designed to handle. Some systems process data in batch mode, some continuously process data as it streams into the system, and some can handle both kinds of data.
Before diving into the specifics and trade-offs of the different implementations, we first need to briefly introduce the concepts behind these processing types.
Batch Processing System
Batch processing has a long history in the big data world. Batch processing mainly operates on large-capacity static data sets, and returns the results after the calculation process is completed.
Data sets used in batch processing typically have the following characteristics:
· Bounded: a batch data set represents a finite collection of data
· Persistent: the data is usually backed by some type of permanent storage
· Large: batch operations are often the only practical way to process extremely large data sets
Batch processing is ideal for calculations that require access to the full set of records. For example, when calculating totals and averages, the data set must be treated as a whole rather than as a collection of individual records. These operations require that state be maintained for the duration of the calculation.
Tasks that need to process large amounts of data are usually most suitable for processing with batch operations. Whether directly processing the data set from the persistent storage device or loading the data set into the memory first, the batch processing system fully considers the amount of data in the design process, and can provide sufficient processing resources. Because batch processing is extremely good at dealing with large amounts of persistent data, it is often used to analyze historical data.
Processing large amounts of data takes considerable time, so batch processing is not suitable for scenarios where processing time is critical.
Apache Hadoop
Apache Hadoop is a processing framework dedicated to batch processing. Hadoop was the first big data framework to gain significant traction in the open source community. Based on Google's papers and experience with massive data processing, Hadoop reimplemented the relevant algorithms and component stack, making large-scale batch processing technology easier to use.
Modern versions of Hadoop are composed of several components, or layers, that work together to process batch data:
· HDFS: HDFS is the distributed file system layer that coordinates storage and replication across cluster nodes. HDFS ensures that data remains available despite inevitable node failures. It can serve as the source of data, store intermediate processing results, and persist the final results of a calculation (a brief usage sketch follows this list).
· YARN: YARN, which stands for Yet Another Resource Negotiator, acts as the cluster coordination component of the Hadoop stack. It is responsible for coordinating and managing the underlying resources and scheduling jobs. By acting as an interface to the cluster's resources, YARN makes it possible to run many more types of workloads on a Hadoop cluster than in earlier iterations.
· MapReduce: MapReduce is the native batch engine of Hadoop.
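To make the storage layer more concrete, the following is a minimal sketch using HDFS's Java FileSystem API to write a small result file into the cluster's file system. The NameNode address and the output path are assumptions chosen for the example, not values from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; a real cluster's configuration would supply this.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/data/example/results.txt"); // hypothetical output path

        // Create (or overwrite) the file and write a small payload; HDFS handles replication.
        try (FSDataOutputStream stream = fs.create(out, true)) {
            stream.writeUTF("hello hdfs");
        }
        fs.close();
    }
}
```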
Batch mode
Hadoop's processing functionality comes from the MapReduce engine. MapReduce's processing technique follows the map, shuffle, and reduce algorithm using key-value pairs. The basic procedure involves the following steps (a minimal sketch follows the list):
· Read the data set from the HDFS file system
· Split the data set into chunks and distribute them among the available nodes
· Apply the computation to the data subset on each node (intermediate results are written back to HDFS)
· Redistribute the intermediate results and group them by key
· "Reduce" the value of each key by summarizing and combining the results computed by the individual nodes
· Write the final calculated results back to HDFS
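As a hedged illustration of these steps, the sketch below is the canonical word-count job written against Hadoop's Java MapReduce API: the mapper emits (word, 1) pairs, the shuffle groups them by key, and the reducer sums the counts for each word. The input and output HDFS paths are placeholders, not paths referenced by this article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: after the shuffle groups values by key, sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical HDFS path
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // hypothetical HDFS path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```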
Advantages and limitations
Because this method relies heavily on persistent storage, with each task requiring multiple read and write operations, it is relatively slow. On the other hand, since disk space is usually the most abundant resource on a server, MapReduce can handle very large data sets. It also means that, compared with similar technologies, Hadoop's MapReduce can usually run on inexpensive hardware, because it does not need to keep everything in memory. MapReduce has tremendous scaling potential, and clusters of tens of thousands of nodes have been run in production.
MapReduce has a steep learning curve. Although other peripheral technologies in the Hadoop ecosystem can greatly reduce the impact of this problem, it is still necessary to pay attention to this problem when quickly implementing certain applications through a Hadoop cluster.
A vast ecosystem has been formed around Hadoop, and the Hadoop cluster itself is often used as a component of other software. Many other processing frameworks and engines can also use HDFS and YARN resource managers through integration with Hadoop.
Summary
Apache Hadoop and its MapReduce processing engine provide a tried-and-tested batch processing model, best suited to very large data sets that are not time-sensitive. A full-featured Hadoop cluster can be built from very low-cost components, making this cheap and effective processing technology applicable in many cases. Its compatibility and integration with other frameworks and engines make Hadoop the underlying foundation for processing platforms that combine multiple workloads and technologies.
Stream Processing System
Stream processing systems compute over data as it enters the system. Compared with the batch model, this is a completely different approach: instead of operating on the entire data set, stream processing operates on each individual data item as it passes through the system.
Data sets in stream processing are considered "unbounded", which has several important implications:
· The complete data set can only represent the total amount of data that has entered the system so far.
· The working data set is perhaps more relevant, and at any given time it is limited to a single data item.
Processing is event-based, and there is no "end" unless it is explicitly stopped. The processing results are immediately available and will continue to be updated as new data arrives.
A stream processing system can handle an almost unlimited amount of data, but it processes only one item (true stream processing) or very few items (micro-batch processing) at a time, and only minimal state is maintained between records. Although most systems provide ways to maintain some state, stream processing is primarily optimized for more functional processing with few side effects.
Functional operations focus on discrete steps with limited state or side effects. Performing the same operation on the same data produces the same result, independent of other factors. This kind of processing fits streams well, because state between items is typically some combination of difficult, limited, and in some cases undesirable. So although certain types of state management are usually possible, these frameworks are simpler and more efficient without them.
This type of processing is well suited to certain kinds of workloads. Tasks with near real-time requirements are a natural fit for the stream processing model. Analytics, server or application error logging, and other time-based metrics are among the best candidates, because reacting to data changes in these areas is critical to business functions. Stream processing suits data where you must respond to changes or spikes and where you care about trends over time.
Apache Storm
Apache Storm is a stream processing framework that focuses on extremely low latency, and may be the best choice for workloads that require near real-time processing. This technology can handle very large amounts of data and provide results with lower latency than other solutions.
Stream processing mode
Storm's stream processing works by orchestrating DAGs (Directed Acyclic Graphs), called topologies, within the framework. These topologies describe the different transformations or steps to be applied to each incoming piece of data as it enters the system.
Topologies are composed of:
· Stream: an ordinary data stream, i.e. unbounded data that continuously arrives at the system.
· Spout: a source of data streams at the edge of the topology, such as an API or a queue, from which the data to be processed is produced.
· Bolt: Bolts represent processing steps that consume stream data, apply an operation to it, and emit the result as a new stream. Bolts connect to each spout and then to one another to compose all of the necessary processing. At the end of the topology, the output of the final bolts can be used as input to other interconnected systems.
The idea behind Storm is to define many small, discrete operations using the above components and then compose them into the desired topology. By default, Storm provides an "at least once" processing guarantee, meaning each message is processed at least once but may be processed more than once in some failure scenarios. Storm does not guarantee that messages will be processed in any particular order.
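The following is a minimal sketch of such a topology, written against the Storm 2.x Java API (org.apache.storm packages, an assumption about the version in use): a hypothetical spout emits sentences, and a single bolt consumes that stream and emits sentence lengths. It illustrates the spout/bolt wiring rather than a production configuration.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SentenceLengthTopology {

    // Spout: the data source at the edge of the topology, emitting a stream of sentences.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] sentences = {"storm processes streams", "one tuple at a time"};
        private int index = 0;

        public void open(Map<String, Object> conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            collector.emit(new Values(sentences[index]));
            index = (index + 1) % sentences.length;
            Utils.sleep(100);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: consumes the stream, applies an operation, and emits the result as a new stream.
    public static class LengthBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            String sentence = input.getStringByField("sentence");
            collector.emit(new Values(sentence, sentence.length()));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence", "length"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        builder.setBolt("lengths", new LengthBolt()).shuffleGrouping("sentences");

        // Run locally for demonstration; a production topology would be submitted to a cluster.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
        Utils.sleep(5000);
        cluster.shutdown();
    }
}
```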
To achieve exactly-once, stateful processing, an abstraction called Trident can be used. Strictly speaking, Storm without Trident is often referred to as Core Storm. Trident significantly alters Storm's processing characteristics: it increases latency, adds state to the processing, and uses a micro-batch model in place of the pure, item-by-item stream processing model.
To avoid these problems, it is generally recommended that Storm users stick to Core Storm whenever possible. However, Trident's exactly-once guarantee is useful in certain situations, for example when the system cannot intelligently handle duplicate messages. Trident is also the only choice within Storm when you need to maintain state between items, such as counting how many users have clicked a link within an hour. Although it does not play to the framework's inherent strengths, Trident increases Storm's flexibility.
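As a rough sketch of that click-counting use case, the fragment below uses Trident's Java API to maintain a running per-link count with its micro-batch, stateful semantics. The spout data, field names, and in-memory state backend are assumptions for illustration; a real deployment would use a durable state store and add the hourly time window, which is omitted here.

```java
import org.apache.storm.generated.StormTopology;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class ClickCountTopology {
    public static StormTopology build() {
        // Hypothetical source of click events; each tuple carries the link that was clicked.
        FixedBatchSpout clicks = new FixedBatchSpout(new Fields("link"), 3,
                new Values("/home"), new Values("/pricing"), new Values("/home"));
        clicks.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("clicks", clicks)
                .groupBy(new Fields("link"))
                // persistentAggregate keeps state across micro-batches, giving running per-link counts.
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        return topology.build();
    }
}
```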
Trident topologies include:
· Stream batches: micro-batches of stream data that provide batch processing semantics through chunking.
· Operations: batch procedures that can be performed on the data.