Stream computing (Storm) and Kafka knowledge points

Source: Internet
Author: User
Tags: ack, commit, manual ack, lock-free queue, message queue, sendfile, zookeeper
Enterprise message queuing (Kafka)

What is Kafka, and why have a message queue at all? A message queue buys you decoupling, heterogeneous integration, and parallelism.

Data flow: producer --> Kafka (persists to local disk) --> consumer (actively pulls data).

Kafka core concepts

Producer: responsible for sending messages without losing them and for the data distribution (partitioning) policy.

Broker: a Kafka server node.

Topic: a category of messages, for example user information, commodity information, or order information.

Partition: why partition? Because the data is too large for a single node to store. Physically, a partition is a folder. How many partitions to design: if the cluster is small (around 10 brokers or fewer), set the partition count to the number of brokers; if the cluster is larger than 10, design the count based on the data volume.

Replication: guarantees data safety and fault tolerance. How many replicas to set: the more important the data, the more replicas, but higher replication means more data redundancy and a longer ACK time, hence lower efficiency. The usual setting is 2.

Segment: why segment a partition? If the whole partition were one huge file, deleting expired data and looking up records would be troublesome. The segment size is 1 GB by default and can be configured. Expired data is deleted after 168 hours (7 days). Note: when planning a Kafka cluster, carefully consider how many days of data must be stored; 3-5 machines are recommended for the cluster, e.g. 24 TB * 5 = 120 TB. Physically, a segment consists of a log file and an index file: the log file stores the raw data, and the index file stores offsets and physical address values. Data lookup uses binary search over the index (worth mastering).

PageCache: modern operating systems provide caching, and Kafka writes freshly produced data into the page cache. Because the time gap between production and consumption is usually tiny, consumers mostly read data straight from memory.

Sendfile: when a consumer wants to consume historical data that is no longer cached, sendfile is applied at the operating-system level: after the data is read from disk, it is handed directly to the network card, without copying through user space.

Partition ISR: if a partition has multiple replicas, a leader must be elected, and the leader is responsible for all reads and writes. The leader may crash under pressure; holding an election at that moment would cost considerable time, so an in-sync replica (ISR) set is kept ready as a standby. To stay in the ISR, a replica must keep synchronizing the leader's data within certain time and lag thresholds; replicas that fail the conditions are kicked out of the ISR.

Partition leader: the leader replica of a partition, responsible for reading and writing data.

Consumer: consumers always consume data as members of a consumer group. Within one group, consumption does not overlap; when a consumer dies and is judged unable to resume, a rebalance is triggered. Two different consumer groups consuming the same topic each receive the complete data. Note: in actual development, design your own consumer group ID to be unique.

Consumer offset management: in version 0.8, offsets were managed by ZooKeeper; after 0.8 you can choose to manage them in Kafka's internal __consumer_offsets topic.

Kafka frequently asked questions

Why is Kafka so fast? PageCache and sendfile.
How does Kafka avoid losing messages? Through guarantees at the producer, broker, and consumer levels.
Is Kafka consumption globally ordered? Each individual partition is ordered; global ordering would violate the original design intent.
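As a concrete illustration of the producer-, broker-, and consumer-side guarantees above, here is a minimal Java sketch using the standard Kafka client, assuming a broker at localhost:9092 and a topic named "orders" (both hypothetical, not from the original notes):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class NoLossDemo {
        public static void main(String[] args) {
            // Producer side: wait for the leader and all ISR replicas to acknowledge,
            // and retry on transient failures. More replicas = longer ACK time.
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("acks", "all");
            p.put("retries", 3);
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("orders", "order-1", "payload"));
            }

            // Consumer side: a unique group id, and offsets committed manually only
            // after processing, so a crash cannot silently skip records.
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "order-processing-group");  // design this to be unique
            c.put("enable.auto.commit", "false");
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(Collections.singletonList("orders"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
                consumer.commitSync();  // progress stored in the __consumer_offsets topic
            }
        }
    }

The commitSync() call is what ties consumer progress to the __consumer_offsets topic mentioned above; with auto-commit enabled instead, offsets could be committed before processing finishes.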
Streaming computation framework (Storm)

Composition of a typical streaming pipeline: Flume + Kafka + Storm + Redis, where each component can be replaced.

What Storm is: a stream-based computing framework; once started, it never stops.

Storm architecture

Client: used to create a StormTopology, serialize it, and submit it to Nimbus via RPC.
Nimbus: publishes the RPC service, accepts task submissions from clients, and verifies them. It puts each task into a task queue (a blocking queue); a background thread reads the queue, obtains the task information, and assigns tasks: it gathers the idle worker resources in the current cluster and the number of tasks the topology needs. The number of tasks is the sum of the parallelism of all components, plus one acker bolt started by each worker. Nimbus saves the assignment information to ZooKeeper.
ZooKeeper: stores the task assignment information and various other node information.
Supervisor: through the watch mechanism, obtains its task information and then starts its own workers.
Worker: started by a supervisor; responsible for the actual task execution.
Task: essentially a thread (an executor). There are three kinds: SpoutTask, BoltTask, and AckerTask.

Storm's programming model (a minimal topology sketch follows below)

Spout extends BaseRichSpout.
  open(): initialization method.
  nextTuple(): called continuously in a loop; each call emits data once.
  declareOutputFields(): declares the names and number of the output fields.
Bolt1 extends BaseRichBolt (manual ack).
  prepare(): initialization method.
  execute(): execution method.
  declareOutputFields(): declares the names and number of the output fields.
Bolt2 extends BaseBasicBolt (automatic ack).
  execute(): execution method.
  declareOutputFields(): declares the names and number of the output fields.
Driver class: TopologyBuilder. Run modes: local mode and cluster mode.

Setting the parallelism of Storm components: the spout parallelism is set according to the number of partitions of the upstream Kafka topic. Bolt1's parallelism follows from the amount of data the spout emits and how much Bolt1 can process per unit of time (1 s); Bolt2's parallelism likewise follows from Bolt1's output rate and Bolt2's per-second throughput.

How to set the number of workers: based on the sum of the parallelism of all components. One worker can host two or more spouts. If there are many bolts downstream of the spout and the computing pressure is high, consider keeping the number of workers consistent with the number of spouts. If the pressure is greater still, only the number of Kafka partitions can be increased; if the partition count cannot be changed, the only option is to add workers and let the data spread across the network.

Stream grouping between upstream and downstream (streamGrouping): the localOrShuffle grouping policy is the first choice at any time; fieldsGrouping groups tuples by field.
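Here is a minimal runnable sketch of this programming model, assuming the Storm 2.x API (names such as LineSpout, PrintBolt, and the "demo" topology are illustrative, not from the original notes):

    import java.util.Map;
    import java.util.UUID;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class DemoTopology {

        // Spout: open() initializes, nextTuple() is called in a loop and emits once per call.
        public static class LineSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;

            @Override
            public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector collector) {
                this.collector = collector;  // an external data source would be opened here
            }

            @Override
            public void nextTuple() {
                // attaching a messageId turns on the message no-loss mechanism
                collector.emit(new Values("hello storm"), UUID.randomUUID().toString());
            }

            @Override
            public void ack(Object msgId) { /* called when the whole tuple tree succeeds */ }

            @Override
            public void fail(Object msgId) { /* called on failure or 30 s timeout; resend here */ }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("line"));  // name and number of output fields
            }
        }

        // Bolt: BaseRichBolt requires a manual ack after processing.
        public static class PrintBolt extends BaseRichBolt {
            private OutputCollector collector;

            @Override
            public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void execute(Tuple input) {
                System.out.println(input.getStringByField("line"));
                collector.ack(input);  // BaseBasicBolt would ack automatically instead
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt: declares no output fields
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new LineSpout(), 3);  // e.g. match the Kafka partition count
            builder.setBolt("print", new PrintBolt(), 3)
                   .localOrShuffleGrouping("spout");        // the preferred grouping policy

            Config conf = new Config();
            conf.setNumWorkers(2);
            conf.setNumAckers(2);  // ackers track tuple trees for the no-loss mechanism

            try (LocalCluster cluster = new LocalCluster()) {  // local mode; cluster mode uses StormSubmitter
                cluster.submitTopology("demo", conf, builder.createTopology());
                Thread.sleep(10_000);  // let the topology run briefly before shutdown
            }
        }
    }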
Storm principles

Task submission process: the client creates a StormTopology, serializes it, and submits it to Nimbus via RPC. Related extension points: the RPC framework, dynamic proxies plus reflection, and network communication technology.

Cluster start process: Java process basics apply (java -jar, java -server, java -client). Manual start: Nimbus and the Supervisors. Automatic start: each Supervisor starts workers based on the task assignment information.

Task execution process: this has nothing to do with Nimbus or the Supervisors; everything happens inside the workers. SpoutTask.open() is typically used to open an external data source, and then nextTuple() is called continuously. When data is sent, the grouping strategy of the downstream components must be considered.

An emitted tuple carries the taskId it is being sent to. Based on the task assignment information, the worker hosting that taskId is located, and the tuple is sent to the remote worker over a network request. The remote worker has a receiver thread which, according to the taskId, finds the input queue of the corresponding bolt (a lock-free queue, capable of processing some 6 million orders per second) and puts the tuple into it. Each task is a thread that continuously consumes its input queue in the background; once a message is taken off the queue, the bolt's execute() method is called with the data. The bolt's execute() receives the tuple, processes it, and then emits the result downstream, again following the downstream grouping policy. For example, with localOrShuffle the corresponding task inside the current worker is found directly and the data is placed straight into that bolt's input queue; that task's background thread consumes it and calls execute() in turn. And so the loop continues.

The message no-loss mechanism. How to turn it on:
- When emitting data on the spout side, attach a messageId.
- In the spout, override the ack() and fail() methods.
- In the Config passed along with the TopologyBuilder, set the number of ackers (setNumAckers) to 1 or more; the default is 1.
- In every downstream bolt, anchor each emitted tuple to its input tuple.

What the observable behavior is: when a message is processed successfully, ack() is called with the messageId. When processing fails, or times out (30 s by default), fail() is called with the messageId. If the failure is a real exception, the message needs to be re-sent, and resending must be handled by hand. For resending, it is best to pass the emitted tuple itself as the messageId when emitting in the spout; then on failure, the messageId can simply be emitted again.

Implementation mechanism: XOR. XOR of two equal values is 0; of two different values, 1. When an upstream component emits a tuple, it reports a state (an anchor id) to the acker; when the downstream finishes processing that tuple, it reports the same state, and the two identical values cancel out. Each level produces a new anchor id, a 64-bit long integer, and when the running XOR for a tuple tree returns to 0, the tree is known to be fully processed, as the sketch below shows.
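A toy Java sketch of that XOR bookkeeping (an illustration of the idea only, not Storm's actual acker code):

    import java.util.concurrent.ThreadLocalRandom;

    public class AckerXorDemo {
        public static void main(String[] args) {
            long ackVal = 0L;  // the acker's running XOR for one tuple tree

            // The spout emits the root tuple: its 64-bit anchor id is XORed in.
            long root = ThreadLocalRandom.current().nextLong();
            ackVal ^= root;

            // Bolt 1 acks the root and emits two anchored child tuples; it reports
            // the root id (XORed out, since x ^ x == 0) plus the two new edge ids.
            long child1 = ThreadLocalRandom.current().nextLong();
            long child2 = ThreadLocalRandom.current().nextLong();
            ackVal ^= root ^ child1 ^ child2;

            // Terminal bolts ack the children without emitting further tuples.
            ackVal ^= child1;
            ackVal ^= child2;

            // Every emit has been matched by an ack, so the XOR collapses to zero.
            System.out.println("tuple tree complete: " + (ackVal == 0));
        }
    }

Because XOR is associative and commutative, the order in which acks arrive does not matter; the accumulator reaches 0 exactly when every anchor id has been reported twice.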
