Recently I came across a very well-written article about big data frameworks, so I'd like to draw some conclusions based on my own experience. Choosing a big data framework is genuinely confusing for those who are new to distributed computing, and I hope this article is helpful.
Introduction:
Big data is a general term for the non-traditional strategies required to collect, organize, and process large-scale data sets and extract insights from them. A common scenario is a recommendation system, which suggests information, products, and so on based on user behavior.
Classification:
Batch-only frameworks: Apache Hadoop
Stream-only frameworks: Apache Storm, Apache Samza
Hybrid frameworks: Apache Spark, Apache Flink
Hadoop is a dedicated batch processing system. Modern versions of Hadoop combine several components (HDFS, YARN, MapReduce) to process batch data. Because disk space is usually the most abundant resource on a server, Hadoop can process very large data sets. This approach relies heavily on persistent storage, with multiple read and write passes per job, so it is relatively slow. A MapReduce job proceeds roughly as follows (see the code sketch after this list):
1. Read data from HDFS file system
2. Split the data into small pieces and distribute them to all available nodes
3. Compute over the data subset on each node (intermediate results are written back to HDFS)
4. Redistribute the intermediate results and group them by key
5. "Reduce" each key by summarizing and combining the per-node computation results
6. Write the calculated final result to HDFS
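To make these steps concrete, here is the canonical word-count job written against Hadoop's MapReduce API. It is a minimal sketch; the input and output HDFS paths are assumed to arrive as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Steps 2-3: each mapper receives a split of the input and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Steps 4-5: the framework groups intermediate pairs by key; the reducer sums each group.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // step 1: read from HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // step 6: write result to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Notice that every stage reads from and writes to persistent storage; this is exactly why the flow above is disk-heavy and relatively slow.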
Storm is a dedicated stream processing system, a framework focused on extremely low latency, and it is the best choice for workloads that require near-real-time processing. Storm workloads are called topologies, arranged as DAGs (directed acyclic graphs) that describe the processing steps each piece of data passes through. Storm can also work together with Trident, letting users process micro-batches instead of a pure stream. In addition, Storm supports multiple languages, giving users more choices when building topologies. Core Storm cannot guarantee the order in which messages are processed; it provides an "at least once" processing guarantee, meaning every message is processed, but some may be processed more than once.
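Here is a minimal sketch of what a topology looks like, assuming Storm 2.x APIs: a spout feeds a stream into a bolt, and the builder wires them into a DAG. The component names and sample data are invented for illustration.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class ExclaimTopology {
  // The spout is the source of the stream; here it emits a test word every second.
  public static class WordSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector collector) {
      this.collector = collector;
    }
    public void nextTuple() {
      Utils.sleep(1000);
      collector.emit(new Values("storm"));
    }
    public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("word")); }
  }

  // A bolt is one processing step in the DAG; this one appends "!" and re-emits.
  public static class ExclaimBolt extends BaseBasicBolt {
    public void execute(Tuple input, BasicOutputCollector collector) {
      collector.emit(new Values(input.getString(0) + "!"));
    }
    public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("word")); }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new WordSpout(), 1);
    // shuffleGrouping wires the bolt to the spout, forming an edge of the DAG.
    builder.setBolt("exclaim", new ExclaimBolt(), 2).shuffleGrouping("words");

    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("exclaim-topology", new Config(), builder.createTopology());
    Utils.sleep(10_000);
    cluster.shutdown();
  }
}
```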
Samza is a dedicated stream processing system, a framework tightly bound to the Kafka messaging system. Samza's dependence on Kafka may look like a limitation, but it also gives the system some unique guarantees and features that other stream processing systems do not offer. Currently, Samza only supports JVM languages.
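For a sense of the programming model, here is a minimal sketch using Samza's low-level StreamTask API. The output system and topic names are assumptions; in a real job, the input Kafka topic is wired up in the job's configuration.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class UppercaseTask implements StreamTask {
  // Output goes back to Kafka; "kafka" is the system name, "uppercased" a hypothetical topic.
  private static final SystemStream OUTPUT = new SystemStream("kafka", "uppercased");

  // Samza calls process() once for every message delivered from the input Kafka topic.
  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String message = (String) envelope.getMessage();
    // Transform each message and send it to the output stream.
    collector.send(new OutgoingMessageEnvelope(OUTPUT, message.toUpperCase()));
  }
}
```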
Spark hybrid processing system (batch and stream processing)
The main reason to use Spark instead of Hadoop MapReduce is speed: with an in-memory computing strategy and an advanced DAG scheduling mechanism, Spark can process the same data set much faster. Its other strength is versatility. Beyond the engine itself, an ecosystem of libraries has grown up around Spark, providing good support for tasks such as machine learning and interactive queries.
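A small sketch of the in-memory strategy using Spark's Java API: cache() keeps a derived data set in memory, so several actions can reuse it without re-reading from disk. The log path and filter conditions are invented for illustration, and the master URL is assumed to be supplied by spark-submit.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("cache-example");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> errors = sc.textFile("hdfs:///logs/app.log")
          .filter(line -> line.contains("ERROR"))
          .cache(); // keep the filtered data in memory across actions

      // Both actions reuse the cached data; with MapReduce each would be
      // a separate job re-reading the input from disk.
      long total = errors.count();
      long timeouts = errors.filter(line -> line.contains("timeout")).count();
      System.out.println(total + " errors, " + timeouts + " timeouts");
    }
  }
}
```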
For stream processing, Spark uses a micro-batch approach: data entering the system is buffered, which improves overall throughput while waiting for the buffer to drain, but also increases latency. This means Spark's streaming mode is not well suited to workloads that demand very low latency.
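Here is a minimal Spark Streaming sketch showing the micro-batch model. The one-second batch interval is the buffer window described above, and the socket source host and port are assumptions.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MicroBatchExample {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("micro-batch-example");
    // The batch interval is the buffering window: larger batches raise throughput,
    // but every record can wait up to a full interval, raising latency.
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Each one-second slice of incoming lines is processed as a small batch job.
    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
    lines.filter(line -> line.contains("ERROR")).print();

    ssc.start();
    ssc.awaitTermination();
  }
}
```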
Flink hybrid processing system (batch processing and stream processing): Flink's stream-first approach provides low latency, high throughput, and true item-by-item (event-at-a-time) processing. Its biggest limitation is that it is still a relatively young project, and large-scale production deployments are not yet as common as with the other frameworks.
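For contrast with the micro-batch example above, here is a minimal Flink DataStream sketch in which each event flows through the operators as soon as it arrives; the socket source is again an assumption.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventAtATimeExample {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Each incoming line is an individual event; it is processed immediately
    // rather than being buffered into a batch.
    DataStream<String> lines = env.socketTextStream("localhost", 9999);
    lines.map(String::toUpperCase).print();

    env.execute("event-at-a-time-example");
  }
}
```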
Summary:
Hadoop: Hadoop and its MapReduce engine provide a tried-and-tested batch processing model, best suited to very large data sets where processing time is not critical. A fully functional Hadoop cluster can be built from very low-cost components, making this cheap and effective technology applicable to many scenarios. Its compatibility and integration with other frameworks and engines also make Hadoop the underlying foundation of many processing platforms that combine multiple workloads and technologies.
Storm: Storm is probably the most suitable technology for pure stream processing workloads with strict low-latency requirements, and it can be used with multiple languages. Because Storm cannot do batch processing, other software may be needed if those capabilities are required.
Samza: For environments where Hadoop and Kafka are already available or easy to introduce, Samza is a good choice for stream processing workloads. Samza itself suits organizations in which many teams (which do not necessarily coordinate closely with one another) consume multiple data streams at different stages of processing. Samza can greatly simplify many stream processing tasks and achieve low-latency performance. It may not be a good fit if your deployment requirements are incompatible with your current systems, if you need extremely low-latency processing, or if you have strong requirements for exactly-once processing semantics.
Spark: Spark is the best choice for diverse processing workloads. Spark's batch processing offers unmatched speed at the cost of a higher memory footprint, and for workloads that value throughput over latency, Spark is also well suited as a stream processing solution.
Flink: Flink delivers low-latency stream processing while also supporting traditional batch processing tasks.
Glossary:
Batch processing: mainly operates on large volumes of bounded, static data, and returns results once the computation is complete.
HDFS: a distributed file system layer that coordinates storage and replication across the nodes of the cluster. HDFS ensures that data remains available despite the node failures that inevitably occur. It serves as the source of input data, a place to store intermediate results, and a store for the final computed results.
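As an aside, here is a minimal sketch of writing a result file through Hadoop's Java FileSystem API; the output path is an assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    // Picks up cluster settings from core-site.xml / hdfs-site.xml on the classpath.
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf);
         FSDataOutputStream out = fs.create(new Path("/results/output.txt"))) {
      // HDFS replicates the blocks of this file across nodes automatically.
      out.writeUTF("final calculation results");
    }
  }
}
```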
YARN: the cluster coordination component of the Hadoop stack, responsible for coordinating and managing underlying resources and scheduling jobs.
MapReduce: Hadoop's native batch processing engine.
Stream processing:
Data is processed as it enters the system. Compared with the batch model this is a completely different paradigm: instead of operating on an entire data set, a stream processor operates on each data item as it passes through the system. Processing is event-based and has no defined end unless explicitly stopped; results are available immediately and are continuously updated as new data arrives. A stream processor can handle an almost unlimited amount of data over time, but it processes only one item (true stream processing) or a small batch of items (micro-batch processing) at a time, and only a minimal amount of state is maintained between items. Although most systems provide ways to maintain some state, stream processing is optimized for more functional processing with fewer side effects. This model is a natural fit for certain workloads: tasks with near-real-time requirements, such as analytics and the processing of server or application error logs and other time-based metrics, as well as data where you must respond to changes or spikes and attend to trends over time.
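As a framework-agnostic illustration, here is a plain-Java sketch of the model described above: events are handled one at a time, a single counter is the only state maintained, and the current result is available after every event. The sample log lines are invented.

```java
import java.util.stream.Stream;

public class RunningCount {
  private long errorCount = 0; // the minimal state maintained across events

  // Called once per event; there is no "end of the data set".
  void onEvent(String logLine) {
    if (logLine.contains("ERROR")) {
      errorCount++;
    }
    // The current result is always available and updates as data arrives.
    System.out.println("errors so far: " + errorCount);
  }

  public static void main(String[] args) {
    RunningCount processor = new RunningCount();
    // Simulate an unbounded stream with a few sample events.
    Stream.of("INFO start", "ERROR timeout", "ERROR disk full")
          .forEach(processor::onEvent);
  }
}
```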
Hybrid processing system: batch and stream processing
Some processing frameworks can handle both batch and stream processing workloads, processing both types of data with the same or related components and APIs, which simplifies meeting diverse processing requirements.