[Samza series] Real-time computing Samza Chinese tutorial (III) - Architecture

This article follows the concepts article. From a macro perspective, let's take a look at the architecture of Samza's real-time computing service. Samza consists of the following three layers:
1. A streaming layer
2. An execution layer
3. A processing layer
Which technologies does Samza rely on to tie these three layers together?
1. Streaming: the distributed messaging system Kafka
2. Execution: Hadoop's resource scheduling and management system, YARN
3. Processing: the Samza API
Those who work with big data on Hadoop will recognize the analogous layered architecture there (HDFS is responsible for storage, YARN for the execution layer, and MapReduce for the processing layer). Before examining each of the three layers, note that Samza is not limited to Kafka and YARN; choose the supporting frameworks and tools that fit your business scenario. In particular, Samza's execution layer and streaming layer are pluggable, allowing developers to plug in better alternatives of their own.

Let's take a deeper look at Kafka, a solution for the streaming layer (you can skip this part if you are already familiar with it). Kafka is a distributed publish/subscribe message queue system that provides an at-least-once delivery guarantee (the system ensures no message is lost, but in some cases a consumer may receive the same message more than once) and highly available partitions (even if a machine goes down, its partitions remain available).

In Kafka, each data stream is called a topic. Each topic is partitioned and replicated across multiple machines called brokers. When a producer sends a message to a topic, it provides a key, which determines the partition the message is sent to. The producer sends messages, and the Kafka brokers receive and store them. Kafka consumers subscribe to a topic and read messages from all of its partitions. It is worth adding that Kafka has some interesting properties:

* All messages with the same key go to the same partition. This means that if you want to read all the messages for a specific user, you only need to read from the partition containing that user's ID, rather than the entire topic (assuming the user ID is used as the key).
* A topic partition is a sequence of messages in arrival order, so you can reference any message by a monotonically increasing offset (like an index into an array). This also means the broker does not need to track which messages a specific consumer has read. Why? Because the consumer saves the offset itself and uses it to keep track: every message with an offset smaller than the current one has been processed, and every message with a larger offset has not.

There is much more to know about Kafka; if you are interested, see my Kafka articles for more detail (http://blog.csdn.net/yangchao228/article/details/40583765).
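To make the key-to-partition and offset ideas concrete, here is a minimal sketch using the Kafka Java client. The topic name `page-views`, the key `user-42`, and the broker address are assumptions for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class KafkaKeyOffsetSketch {
    public static void main(String[] args) {
        // Producer side: messages with the same key always land in the same partition.
        Properties prodProps = new Properties();
        prodProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        prodProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prodProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prodProps)) {
            // Both messages for user-42 go to one partition, so a reader interested
            // in this user only needs to consume that single partition.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
            producer.send(new ProducerRecord<>("page-views", "user-42", "/settings"));
        }

        // Consumer side: the broker does not track positions; the consumer does, via offsets.
        Properties consProps = new Properties();
        consProps.put("bootstrap.servers", "localhost:9092");
        consProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consProps)) {
            TopicPartition partition = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 0L); // resume from a saved offset (0 here for the demo)
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
            }
        }
    }
}
```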
Now let's look at the execution layer: YARN (Yet Another Resource Negotiator), the next-generation Hadoop cluster scheduler. It lets you allocate a container in a cluster and execute arbitrary commands in it. When an application interacts with YARN, the conversation looks something like this:

1. Application: Hi YARN! I want to run command X on two machines with 512 MB of memory each.
2. YARN: Cool. Where is your code?
3. Application: The code is here: http://path.to.host/jobs/download/my.tgz
4. YARN: I'm executing your job on node1 and node2 in the grid.

Samza uses YARN to manage deployment, fault tolerance, logging, resource isolation, security, and locality. For a fuller introduction, see http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview; to save everyone's time, here we will only look at the YARN architecture at a macro level.

YARN consists of three parts: a ResourceManager, NodeManagers, and an ApplicationMaster. In a YARN grid, every machine runs a NodeManager, which is responsible for starting processes on that machine. The ResourceManager talks to all the NodeManagers to tell them what applications to run; the NodeManagers, in turn, tell the ResourceManager when they want to run those tasks on the cluster. The third part, the ApplicationMaster, is application-specific code that runs on the YARN cluster; it manages the application's workload and its containers (usually UNIX processes), and sends a notification when one of the containers fails.

Samza and YARN: Samza provides a YARN ApplicationMaster and a YARN job runner out of the box. If this seems unintuitive, the integration of Samza and YARN can be summarized as follows (in the original figure, different colors indicate different machines): the Samza client tells YARN's ResourceManager (RM) when it wants to start a new Samza job. The RM tells a YARN NodeManager (NM) to allocate space in the cluster for Samza's ApplicationMaster (AM). Once the NM has allocated space, the Samza AM starts. After the AM starts, it asks the RM for more YARN containers in which to run SamzaContainers. Again, the RM works with the NMs to allocate space for the containers. Once the space is allocated, the NMs start the Samza containers.
That briefly covers YARN; now for our focus, Samza itself. Samza uses YARN and Kafka to provide a framework for staged stream processing and partitioning. Roughly, this is how the pieces fit together: the Samza client uses YARN to run a Samza job; YARN starts and monitors one or more SamzaContainers, and your processing logic (written against the StreamTask API) runs inside those containers. The input and output of these Samza stream tasks come from Kafka brokers (which are usually co-located on the same machines as the YARN NMs).
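To make this concrete, here is a minimal job configuration sketch in the style of Samza's properties files, showing how a job is pointed at YARN and Kafka. The job name, package path, task class, and topic name are illustrative assumptions, and exact configuration keys vary across Samza versions:

```properties
# Run the job on YARN via Samza's YARN job factory
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=page-view-counter

# Where YARN can download the packaged job (assumed path, echoing the dialogue above)
yarn.package.path=http://path.to.host/jobs/download/my.tgz

# The StreamTask implementation and the input stream it consumes (assumed names)
task.class=samza.examples.PageViewCountTask
task.inputs=kafka.page-views

# Kafka as the system providing the input/output streams (assumed local addresses)
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.bootstrap.servers=localhost:9092
```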
For example, suppose we want to count the number of page views per user. In SQL, you might write:

SELECT user_id, COUNT(*) FROM PageViewEvent GROUP BY user_id

Although Samza does not currently support SQL, the idea is the same. Computing this requires two jobs: one that groups messages by user_id, and one that does the counting. The first job sends messages with the same user_id to the same partition of an intermediate topic: it uses the user_id of each incoming message as the key of the message it emits, and that key is mapped to a partition of the intermediate topic (typically by taking the key's hash modulo the number of partitions). The second job consumes the intermediate topic; each of its tasks processes one partition of that topic. Each task keeps a counter for every user_id in its partition, and every time it receives a message with a given user_id, it increments the corresponding counter by one. The flow is sketched in the code below:
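Since the original figure did not survive extraction, here is a minimal sketch of the two jobs as Samza StreamTasks. The class names, topic name, event shape, and the plain in-memory HashMap for counters are illustrative assumptions; a real job would typically use Samza's state management for the counters.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

/** Assumed event shape; in a real job this is your deserialized message type. */
class PageViewEvent {
    String userId;
    String page;
}

/** Job 1: re-partition page views by user_id into an intermediate topic. */
class GroupByUserTask implements StreamTask {
    // Assumed intermediate topic on the "kafka" system
    private static final SystemStream OUTPUT =
        new SystemStream("kafka", "page-views-by-user");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        PageViewEvent event = (PageViewEvent) envelope.getMessage();
        // Emitting with user_id as the key routes all of a user's views
        // to the same partition of the intermediate topic.
        collector.send(new OutgoingMessageEnvelope(OUTPUT, event.userId, event));
    }
}

/** Job 2: each task owns one partition of the intermediate topic and counts per user. */
class CountByUserTask implements StreamTask {
    // One counter per user_id seen in this task's partition
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String userId = (String) envelope.getKey();
        counts.merge(userId, 1, Integer::sum); // increment this user's counter
    }
}
```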
Does this flow look familiar? It resembles Hadoop's MapReduce: every record reaches the mapper with a specific key, records with the same key are grouped together by the framework, and the reducer then computes the statistics. But Hadoop and Samza differ in an important way: Hadoop computes over a fixed input, while Samza deals with unbounded data streams. Another major difference between a stream computing framework and MapReduce is that a MapReduce job stops when it finishes, while a Samza job keeps on processing.

Kafka receives the messages sent by the first job, buffers them on disk, and distributes them across multiple machines. This helps the system's fault tolerance: if one machine crashes, no messages are lost, because they are stored on other machines. And if the second job consumes messages slowly, or stops for some reason, the first job is unaffected: the disk buffer can absorb the backlog of messages until the second job speeds up again. By partitioning topics, and by splitting stream processing into tasks that run concurrently on many machines, Samza achieves high message throughput. By combining YARN and Kafka, Samza achieves high fault tolerance: if a process or machine fails, it is automatically restarted on another machine and resumes processing from its last message offset, all of it automatic.

In closing, I would say that Samza does not fight alone: it enlists powerful partners and makes everything described above look natural. I believe this is also the direction in which excellent software will develop in the future. Having covered the architecture, the official documentation would next give some comparisons with other systems; considering that some readers may not have encountered those systems, we will skip the comparisons with the other frameworks (the translation will be completed later) and go straight to the point. Next, we will look at the Samza API, so stay tuned!
