After the preface, you are probably eager to find out what Samza actually is. Let's start with Samza's background, which is essential reading (it is, at least, the first section on the official website). What technical background do we need?

What is messaging? A messaging system is a popular way to implement near-real-time asynchronous computation. When a message is produced, it can be put into a message queue (ActiveMQ, RabbitMQ), a publish-subscribe system (Kestrel, Kafka), or a log aggregation system (Flume, Scribe). Downstream consumers read messages from these systems and process them, or take further action based on their content. Suppose you run a website, and every time someone loads a page you send a "user viewed page" event to a messaging system. You might have consumers that do the following:

* store the messages in Hadoop for future data analysis;
* count page views and update a dashboard;
* trigger an alert if a page view fails;
* send an email notification to another user;
* join the page-view event with information about the user, and send the enriched event back to the messaging system.

In short, a messaging system lets you decouple all of this work from the actual web service.
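The decoupling described above can be sketched with a tiny in-memory publish-subscribe bus. This is only an illustration: `MessageBus`, the `page-views` topic, and the event fields are hypothetical names, and a real system such as Kafka delivers messages asynchronously and durably rather than through direct function calls.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MessageBus:
    """Hypothetical in-memory stand-in for a publish-subscribe system."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # The producer does not know (or care) who is listening.
        for handler in self._subscribers[topic]:
            handler(message)


bus = MessageBus()
page_counts: Dict[str, int] = defaultdict(int)
alerts: List[str] = []


def count_page_view(event: dict) -> None:
    """Consumer 1: count page views per URL (for a dashboard)."""
    page_counts[event["url"]] += 1


def alert_on_failure(event: dict) -> None:
    """Consumer 2: record an alert when a page view failed."""
    if event["status"] >= 500:
        alerts.append(event["url"])


bus.subscribe("page-views", count_page_view)
bus.subscribe("page-views", alert_on_failure)

# The web service only publishes events; it contains no consumer logic.
bus.publish("page-views", {"user": "alice", "url": "/home", "status": 200})
bus.publish("page-views", {"user": "bob", "url": "/cart", "status": 500})
```

New consumers can be added with another `subscribe` call without touching the code that publishes the event, which is the point of the decoupling.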
So what is stream processing? A messaging system is fairly low-level infrastructure: it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find many tricky problems in the processing layer that you have to solve yourself. Samza's goal is to take these problems off our hands.

Take the example above (counting page views and updating a dashboard). What happens when the machine running your consumer crashes and the current counts are lost? How do you recover them? Where should processing resume when the machine restarts? What if the underlying messaging system delivers a message twice, or loses one? What if you want to group the page-view counts by URL? What if the load is too much for one machine, and you want to distribute the counting across several machines and aggregate the results?

Stream processing is a higher-level abstraction built on top of messaging systems, and it is meant to address exactly these problems.
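To make a few of these questions concrete, here is a minimal sketch of a per-URL page-view counter that checkpoints its counts together with the next offset to read, so that it can resume after a crash and ignore redelivered messages. Everything here is a hypothetical illustration, not Samza code: the `log` list stands in for one ordered partition, and list indexes stand in for message offsets.

```python
from collections import Counter

# Hypothetical ordered partition: the list index acts as the message offset.
log = [{"url": "/home"}, {"url": "/cart"}, {"url": "/home"}, {"url": "/home"}]


class PageViewCounter:
    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.next_offset = 0  # first offset that has not been processed yet

    def process(self, offset: int, message: dict) -> None:
        if offset < self.next_offset:
            return  # duplicate delivery: already counted, so skip it
        self.counts[message["url"]] += 1
        self.next_offset = offset + 1

    def checkpoint(self) -> dict:
        # Counts and offset are saved together so they stay consistent.
        return {"counts": dict(self.counts), "next_offset": self.next_offset}

    @classmethod
    def restore(cls, snapshot: dict) -> "PageViewCounter":
        counter = cls()
        counter.counts = Counter(snapshot["counts"])
        counter.next_offset = snapshot["next_offset"]
        return counter


counter = PageViewCounter()
for offset in range(2):
    counter.process(offset, log[offset])
snapshot = counter.checkpoint()

# Simulated crash: one more message is processed, then the instance is lost
# before another checkpoint could be taken.
counter.process(2, log[2])
del counter

# After restart: restore the snapshot and resume from the saved offset.
recovered = PageViewCounter.restore(snapshot)
for offset in range(recovered.next_offset, len(log)):
    recovered.process(offset, log[offset])
recovered.process(3, log[3])  # a redelivered message is ignored
```

The offset-based duplicate check only works because messages within the partition arrive in order; this is one reason ordered, replayable streams matter for the questions raised above.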
Samza is a stream processing framework with the following features:

* Simple API: unlike most low-level messaging system APIs, Samza provides a very simple, callback-based message processing API.
* Managed state: Samza manages snapshotting and restoration of a stream processor's state. When the processor restarts, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state.
* Fault tolerance: when a machine in the cluster goes down, Samza works with YARN to move your tasks to another machine.
* Durability: Samza relies on Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are lost.
* Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza to run in.
* Pluggable: although Samza works out of the box with Kafka and YARN, it provides a pluggable API that lets you run it with other messaging systems and execution environments.
* Processor isolation: running on YARN, Samza also supports Hadoop's security model and resource isolation through Linux cgroups.

The currently popular open-source stream processing systems are all quite young, and no single system offers a complete solution. Open problems in this field include:

1. how a stream processor's state should be managed;
2. whether a stream should be buffered on the disk of a remote machine;
3. what to do when duplicate messages are received or messages are lost;
4. how to model the underlying messaging system.

Samza's main differences are:

* Samza supports fault-tolerant local state. The state itself is modeled as a stream. If the local state is lost because a machine goes down, the state stream is replayed to restore it.
* Streams are ordered, partitioned, replayable, and fault-tolerant.
* YARN is used to handle isolation, security, and fault tolerance.
* Jobs are decoupled: if one job is slow and builds up a backlog of messages, the rest of the system is not affected.
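The idea of "state modeled as a stream" can be sketched as follows. This is a conceptual illustration only and does not use Samza's actual API: `ChangelogBackedStore`, `process`, and the plain Python list used as a changelog are all hypothetical. In a real deployment the changelog would be a durable stream (for example a Kafka topic) that survives the loss of the machine.

```python
from typing import Any, List, Tuple


class ChangelogBackedStore:
    """Hypothetical local key-value store that logs every write to a changelog."""

    def __init__(self, changelog: List[Tuple[str, Any]]) -> None:
        self.changelog = changelog  # stands in for a durable stream
        self.data: dict = {}        # local state, lost if the machine dies

    def put(self, key: str, value: Any) -> None:
        self.changelog.append((key, value))  # record the change first
        self.data[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        return self.data.get(key, default)

    @classmethod
    def restore(cls, changelog: List[Tuple[str, Any]]) -> "ChangelogBackedStore":
        # Replay the changelog in order; the last write for each key wins.
        store = cls(changelog)
        for key, value in changelog:
            store.data[key] = value
        return store


def process(message: dict, store: ChangelogBackedStore) -> None:
    """Callback invoked once per incoming message (callback-style API)."""
    url = message["url"]
    store.put(url, store.get(url, 0) + 1)


changelog: List[Tuple[str, Any]] = []
store = ChangelogBackedStore(changelog)
for message in [{"url": "/home"}, {"url": "/cart"}, {"url": "/home"}]:
    process(message, store)

# Simulated machine failure: the local store is gone, the changelog survives.
del store
restored = ChangelogBackedStore.restore(changelog)
```

Because the processing logic only ever sees a `process` callback and a store, the recovery mechanics stay out of the application code, which is the appeal of having the framework manage state.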
Okay, that is the background. In the next article, let's take a look at some concepts to facilitate further study. Let's keep at it.
Real-time computing: Samza Chinese tutorial (I): Background