Real-time computing with Samza, Chinese tutorial (2) -- Concepts

Source: Internet
Author: User

I hope the previous background article gave you a broad understanding of stream computing. This article introduces Samza's core concepts, following the official documentation. Let's look at what they are.

Concept 1: streams

Samza processes streams. A stream is a sequence of immutable messages of a similar type. For example, a stream might be all the clicks on a website, all the updates to a particular database table, or all the events produced by a service. Messages can be appended to a stream or read from it. A stream can have many consumers, and reading a message from a stream does not delete it, so a message can effectively be broadcast to many consumers. In addition, a message can carry an associated key used for partitioning, which is described below.
Samza supports a pluggable system layer for streams: in Kafka, a stream is a topic; in a database, we might produce a stream by consuming updates to a table; in Hadoop, we might tail files in a directory on HDFS.
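As a concrete illustration, a Samza job's properties file declares such a pluggable system by naming a factory class. The sketch below follows the style of the Samza documentation's Kafka examples; the system name `kafka` and the host addresses are placeholders:

```properties
# Declare a system named "kafka", backed by Samza's Kafka implementation.
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
# Where the Kafka consumer and producer find the cluster (placeholder hosts).
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.bootstrap.servers=localhost:9092
```

Swapping in a different `samza.factory` class is all it takes to read streams from a different backing system.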
Concept 2: jobs
A Samza job is code that performs a logical transformation on a set of input streams, appending the results to a set of output streams. To scale the throughput of a stream processor beyond a single job, we break things down into two smaller units of parallelism: partitions and tasks.
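In practice a job is defined by a small configuration file that names the job, the task class that implements its logic, and its input streams. This is a minimal sketch in the style of the Samza configuration format; the job name, class name, and stream name are illustrative, not from the original article:

```properties
# Job identity.
job.name=click-counter
# The class containing the stream-processing logic (hypothetical).
task.class=samza.examples.ClickCounterTask
# Input streams, written as system.stream pairs.
task.inputs=kafka.clicks
```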
Concept 3: partitions

Each stream is broken into one or more partitions, and each partition is a totally ordered sequence of messages. Each message's position in that sequence is called its offset, which is unique within the partition. The offset can be a sequential integer, a byte offset, or a string, depending on the underlying system implementation. When a message is appended to a stream, it is written to exactly one of the stream's partitions; the writer chooses the partition using a key taken from the message. For example, if the user ID is used as the key, all messages for a given user ID end up in the same partition.
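The key-to-partition mapping described above is typically a hash of the key modulo the partition count (this is, for instance, how Kafka's default partitioner behaves). A minimal sketch, with an illustrative class name that is not part of the Samza API:

```java
// Sketch of key-based partitioning: hash the key, take it modulo the
// partition count. Illustrative only -- not Samza or Kafka API.
public class KeyPartitioner {
    private final int numPartitions;

    public KeyPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** Returns the partition a message with this key is appended to. */
    public int partitionFor(String key) {
        // Mask the sign bit so the result is always non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        KeyPartitioner p = new KeyPartitioner(4);
        // The same user ID always maps to the same partition, so all of
        // that user's messages form one ordered sequence.
        System.out.println(p.partitionFor("user-42") == p.partitionFor("user-42"));
    }
}
```

Because the mapping is deterministic, every writer agrees on where a given key's messages go, which is what makes per-key ordering possible.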
Concept 4: tasks

A job is scaled by breaking it into multiple tasks. The task is a job's unit of parallelism, just as the partition is for a stream. Each task consumes data from one partition of each of the job's input streams, and processes the messages from its input partitions sequentially, in offset order. There is no defined ordering across partitions, which allows each task to execute independently. The YARN scheduler assigns each task to a machine, so a job as a whole can be spread across several machines and executed in parallel. The number of tasks in a job is determined by the number of input partitions (there cannot be more tasks than partitions, or some tasks would have no input). However, you can change the computing resources assigned to the job (the amount of memory, the number of CPU cores, and so on) to satisfy the job's needs; see the introduction of containers below. One more thing worth noting: the partitions assigned to a task never change. If the machine a task is running on fails, the task is restarted elsewhere, still consuming the same stream partitions.
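The "one partition of each input stream per task" rule can be sketched as follows. The class and the `stream[partition]` notation are illustrative, not Samza API; the point is only that the task count is forced to equal the partition count:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch (not Samza API): task i consumes partition i of every input
// stream, so a job with N input partitions has exactly N tasks.
public class TaskAssignment {
    /** Maps each task ID to the partitions it consumes. */
    public static Map<Integer, List<String>> assign(List<String> streams,
                                                    int numPartitions) {
        Map<Integer, List<String>> tasks = new TreeMap<>();
        for (int taskId = 0; taskId < numPartitions; taskId++) {
            List<String> partitions = new ArrayList<>();
            for (String stream : streams) {
                // Task taskId reads partition taskId of this stream.
                partitions.add(stream + "[" + taskId + "]");
            }
            tasks.put(taskId, partitions);
        }
        return tasks;
    }

    public static void main(String[] args) {
        // Two input streams, each with 3 partitions -> 3 tasks.
        System.out.println(assign(List.of("clicks", "updates"), 3));
    }
}
```

Since each task's set of partitions is fixed up front, a restarted task picks up exactly the partitions it held before, which is the fault-tolerance property noted above.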

Concept 5: dataflow graphs

We can compose multiple jobs to create a dataflow graph, where the nodes are streams containing data and the edges are jobs performing transformations. This composition is done simply through the streams the jobs take as input and output. The jobs are otherwise fully decoupled: they need not share a code base, and adding, removing, or restarting a downstream job does not affect an upstream job.
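This composition needs no special wiring: a downstream job simply lists an upstream job's output stream among its inputs. A hedged sketch in the Samza configuration style, with all job, class, and stream names invented for illustration:

```properties
# Downstream job, deployed and restarted independently of the job that
# produces "page-views-enriched"; the only coupling is the stream name.
job.name=view-aggregator
task.class=samza.examples.ViewAggregatorTask
task.inputs=kafka.page-views-enriched
```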

Concept 6: containers

Partitions and tasks are both logical units of parallelism; they do not correspond to any particular assignment of computing resources (CPU, memory, disk space, and so on). Containers are the unit of physical parallelism: a container is essentially a Unix process. Each container runs one or more tasks. The number of tasks is determined automatically and fixed by the number of input partitions, but the number of containers (and the CPU and memory resources behind them) is chosen by the user at run time and can be changed at any time.
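The relationship between the fixed task count and the user-chosen container count can be sketched as a simple grouping, here round-robin. This is an illustration of the idea, not Samza's actual grouper implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch (not Samza API): the task count is fixed by the partitions,
// but tasks can be packed into however many containers the user asks for.
public class ContainerGrouper {
    /** Distributes task IDs 0..numTasks-1 across containers round-robin. */
    public static Map<Integer, List<Integer>> group(int numTasks,
                                                    int numContainers) {
        Map<Integer, List<Integer>> containers = new TreeMap<>();
        for (int c = 0; c < numContainers; c++) {
            containers.put(c, new ArrayList<>());
        }
        for (int t = 0; t < numTasks; t++) {
            containers.get(t % numContainers).add(t);
        }
        return containers;
    }

    public static void main(String[] args) {
        // 6 tasks (fixed by partition count) over 2 containers (user-chosen).
        System.out.println(group(6, 2)); // {0=[0, 2, 4], 1=[1, 3, 5]}
    }
}
```

Raising the container count spreads the same tasks over more processes (and hence more CPU and memory) without changing the tasks themselves.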
That covers the major concepts of Samza. We can now look at Samza from a macro perspective: the next article examines its architecture.
