1. Introduction
Twitter relies heavily on real-time stream processing. For years, Twitter has used Storm internally, but at Twitter's current scale, operating Storm has become increasingly challenging. The problems include scalability, debuggability, manageability, and sharing cluster resources efficiently with other data services.
A big challenge is debuggability. When a topology misbehaves, the cause of the performance degradation needs to be located quickly. But in Storm, many components of a topology are bundled into a single process, which makes debugging hard. We need a clear mapping from the logical units of computation to the physical processes; this mapping is especially important for business-critical topologies.
In addition, Storm needs dedicated cluster resources, requiring specific hardware to be set aside for topologies. This leads to inefficient use of valuable cluster resources and limits scalability. What is needed is a more flexible way of scheduling resources, so that different data processing systems can share them. Within Twitter, that sharing is managed by Aurora.
Provisioning a new Storm topology requires manually isolating machines, which then have to be decommissioned when the topology is taken down. Managing machine allocation this way is cumbersome, and we wanted something more efficient. At Twitter's scale, any improvement translates into large hardware savings and higher productivity.
We wanted to solve the problems above without rewriting the many applications already running on Storm, so staying compatible with the Storm API was essential.
With these considerations in mind, we designed Heron to meet these goals. It is Storm-compatible, so migration is easy, and all of Twitter's production topologies now run on Heron. Besides significant performance gains and lower resource consumption, Heron also brings big advantages in debuggability, scalability, and manageability.
This paper mainly describes Heron's design and gives a performance comparison.
3. About Storm
3.1 Storm Background
A Storm topology is a directed graph of spouts and bolts. Spouts are the sources of input data, and bolts are the abstraction for stream computation. A spout typically pulls data from a queue, such as Kafka or Kestrel, and produces a stream of tuples, which are then handed to bolts for processing. A real-time count of active users is a typical example of such a topology.
At runtime, spouts and bolts run as tasks. Multiple tasks are grouped into an executor, and multiple executors run inside a worker; each worker is a JVM process. A host can run several workers, and each worker can belong to a different topology.
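To make the model concrete, here is a minimal word-count topology written against the Storm topology API, which Heron keeps compatible. This is only an illustrative sketch: the class names, field names, and parallelism numbers are made up, and the exact package names vary between Storm/Heron versions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A toy word-count topology: the spout emits words, the bolt keeps a running count per word.
public class WordCountTopologySketch {

  public static class WordSpout extends BaseRichSpout {
    private final String[] words = {"heron", "storm", "stream", "tuple"};
    private final Random random = new Random();
    private SpoutOutputCollector collector;

    @Override public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
      this.collector = collector;
    }
    @Override public void nextTuple() {
      // In production this would pull from Kafka/Kestrel; here we just emit random words.
      collector.emit(new Values(words[random.nextInt(words.length)]));
    }
    @Override public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static class CountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<>();

    @Override public void execute(Tuple input, BasicOutputCollector collector) {
      String word = input.getStringByField("word");
      counts.merge(word, 1L, Long::sum);
      collector.emit(new Values(word, counts.get(word)));
    }
    @Override public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new WordSpout(), 2);                  // 2 spout tasks
    builder.setBolt("counts", new CountBolt(), 4)
           .fieldsGrouping("words", new Fields("word"));             // same word -> same task
    // builder.createTopology() would then be submitted with StormSubmitter or run locally.
  }
}
```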
3.2 Limitations of the Storm Worker Architecture
The worker design is complex. Many instances are packed into one worker process, and each executor maps to two threads. Scheduling these threads relies on the JVM's preemptive, priority-based scheduling, and since each executor thread runs several tasks, the executor implements yet another scheduling layer to invoke the right task based on the data it receives. This multi-level scheduling and its complex interactions make it hard to predict when a tuple will actually be processed.
Each worker can also run a mix of unrelated tasks in the same JVM, for example a Kafka spout, a bolt that calls an external service, and a bolt that writes to a store. Because these tasks are scheduled together, resources cannot be isolated per task, and when something goes wrong the only remedy is to restart the worker. After a restart, the problematic task may be scheduled elsewhere, which makes the problem very hard to track down and resolve.
The logs of multiple tasks are mixed into one file, so errors are hard to trace. If a single task throws an exception, the whole worker process is brought down, so a problem in one part of the topology hurts the topology's overall performance. And GC problems caused by different tasks in the same JVM are extremely hard to attribute.
For resource allocation, Storm assumes that every worker is homogeneous. This assumption leads to inefficient allocation, and often to over-provisioning. For example, consider scheduling 3 spouts and 1 bolt onto 2 workers, where the bolt needs 10 GB and each spout needs 5 GB. Each worker has to reserve 15 GB, because one of them must run both a bolt and a spout, so the two workers reserve 30 GB in total while only 25 GB is actually needed. The gap gets much worse as the number of components packed into each worker grows, and it happens often when complex topologies are generated from higher layers of abstraction.
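Just to make that arithmetic concrete, here is the same example worked out in a few lines, using the numbers from the text above:

```java
// The homogeneity assumption worked out: 3 spouts needing 5 GB each and
// 1 bolt needing 10 GB, packed onto 2 workers.
public class OverAllocationExample {
  public static void main(String[] args) {
    int spoutGb = 5, boltGb = 10;
    int perWorkerGb = spoutGb + boltGb;      // one worker must host a bolt AND a spout: 15 GB
    int reservedGb = 2 * perWorkerGb;        // homogeneous workers -> 2 * 15 = 30 GB reserved
    int neededGb = 3 * spoutGb + boltGb;     // actual requirement: 25 GB
    System.out.println("over-allocated: " + (reservedGb - neededGb) + " GB"); // 5 GB wasted
  }
}
```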
Debugging is harder too. A consequence of giving a worker a large heap is that taking a jstack or heap dump becomes unwieldy, and while a worker is writing a heap dump it tends to miss heartbeats and get killed by Storm's supervision.
Could Storm simply be restructured as one worker per task? That would waste resources and also limit parallelism: every topology would need a huge number of workers, each reserving its own chunk of memory, leading to significant over-allocation.
And as the number of components grows, the number of other workers each worker has to connect to grows as well, to the point of running out of ports. This reduces scalability.
Storm workers use multiple threads and queues to move data: one global receive thread for upstream data and one global send thread for downstream data per worker, plus a user-logic thread and a local send thread per executor. So each tuple passes through four threads, which adds noticeable overhead and queue contention.
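A rough model of those four hops, purely for illustration (this is not Storm's actual code): each stage below is a thread, and every tuple crosses a bounded queue between consecutive stages.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration only: models the four thread hops inside one Storm worker
// (global receive -> executor user logic -> executor local send -> global send),
// each hand-off going through a bounded queue.
public class WorkerDataPathSketch {
  private static final BlockingQueue<String> executorIn  = new ArrayBlockingQueue<>(1024);
  private static final BlockingQueue<String> executorOut = new ArrayBlockingQueue<>(1024);
  private static final BlockingQueue<String> workerOut   = new ArrayBlockingQueue<>(1024);

  public static void main(String[] args) {
    start("worker-receive", () -> { Thread.sleep(100); executorIn.put("tuple-from-network"); });
    start("executor-logic", () -> executorOut.put("processed:" + executorIn.take()));
    start("executor-send",  () -> workerOut.put(executorOut.take()));
    start("worker-send",    () -> System.out.println("to network: " + workerOut.take()));
  }

  private interface Step { void run() throws InterruptedException; }

  private static void start(String name, Step step) {
    new Thread(() -> {
      try {
        while (true) step.run();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, name).start();
  }
}
```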
3.3 Problems with Nimbus
Nimbus is a bottleneck: it has too many responsibilities, including scheduling, monitoring, distributing JARs, and managing topologies.
It does not support fine-grained resource scheduling or isolation.
In production, topologies therefore end up running on dedicated machines, which is wasteful; resources are hard to use fully, and running Storm on YARN does not completely solve this either.
ZooKeeper tracks the heartbeats and becomes a bottleneck at scale. There are workarounds, but they add operational burden.
Finally, Nimbus is a single point of failure.
3.4 Missing back pressure
Storm has no back pressure mechanism. If a downstream receiver cannot keep up, the sender simply drops tuples; when that happens in the middle of a topology, all the work already done on those tuples is wasted, and the results become hard to reason about.
3.5 Efficiency
In practice Storm was not efficient enough, and several issues commonly dragged performance down:
a tuple failure anywhere causes the whole tuple tree to be replayed;
long garbage collection pauses;
queues along the tuple transfer path filling up.
To cope with these issues, topologies had to be over-provisioned, wasting resources.
For example, one topology used only 20-30% of the cores allocated to it, even though the expectation was that the allocated cores would be close to fully used.
4. Design Considerations
Modifying Storm's core to fix all of these problems would have been too costly.
Alternatives such as Spark Streaming have a different API, so migrating the existing topologies would have been expensive, and they had not been proven at the scale Twitter needed.
So the decision was to build a new system that keeps the Storm API: Heron.
5. Heron
5.1 Data Model and API
The API is compatible with Storm: topologies, spouts, and bolts work the same way, and the same groupings are supported. Heron provides at-most-once and at-least-once delivery semantics; a sketch of how the two modes are selected follows.
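As a sketch of how the two modes are typically chosen with the Storm-style Config object (assuming Heron's compatibility layer honors the same knobs, which is how Storm itself exposes them):

```java
import org.apache.storm.Config;

// Illustrative sketch: delivery semantics are controlled through acking.
public class DeliverySemanticsSketch {
  public static void main(String[] args) {
    Config conf = new Config();
    conf.setNumAckers(0);            // no acker tasks: fire-and-forget, at-most-once
    // conf.setNumAckers(2);         // acker tasks track tuple trees: at-least-once
    conf.setMessageTimeoutSecs(30);  // unacked tuples are replayed by the spout after 30s
  }
}
```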
5.2 Architecture
Aurora is a scheduling framework that runs on top of Mesos. Thanks to a scheduler abstraction, Heron can also run on YARN, Mesos, or ECS.
Topologies are deployed through the Aurora scheduler, which manages all of them.
Each topology has a Topology Master (TM) running in its first container. Every other container runs a Stream Manager (SM), a Metrics Manager (MM), and a number of Heron Instances (HIs), which are the actual spouts and bolts. Multiple containers can be scheduled onto one physical machine, and Aurora uses cgroups to isolate them. Topology metadata is kept in ZooKeeper, each Heron Instance is a JVM process, and Heron uses protocol buffers for all internal communication.
5.3 Topology Master
The Topology Master is similar to YARN's Application Master.
On startup it makes itself discoverable by writing an ephemeral node in ZooKeeper. It also serves as the gateway for the topology's metrics; since it takes no part in data processing, it does not become a bottleneck.
5.4 Stream Manager
The Stream Manager's job is to route tuples efficiently. Each Heron Instance (HI) sends and receives data through its local SM, over a local short-circuit connection, rather than connecting to remote SMs directly. The SMs of a topology form a fully connected network of O(k^2) connections, where k is the number of SMs. Since the number of HIs, n, is far larger than k, this design shrinks the overlay network from O(n^2) connections down to O(k^2); for example, 1000 instances spread over 50 containers need on the order of 50^2 = 2500 connections instead of 1000^2.
5.4.1 Topology Back Pressure
Back pressure dynamically adjusts the rate at which data flows through a topology. It matters when components process at different speeds: if a downstream stage is slow and the upstream does not slow down, queues build up and tuples eventually get dropped by the system. Dropping tuples in the middle of the topology wastes all the work already done on them. A back pressure mechanism slows the upstream stages down instead. Some possible strategies follow.
TCP Back pressure
TCP back pressure uses the TCP windowing mechanism to propagate back pressure from an HI to its upstream components. Because an HI and its local SM communicate over a TCP socket, their send and consume rates are coupled: if the HI is slow, its receive buffer fills up, and the SM notices because its own send buffer fills up. The pressure then propagates to the upstream SMs and HIs, and is not relieved until the slow HI catches up.
TCP back pressure is easy to implement, but in practice it works poorly, because the many logical channels between HIs are multiplexed onto the physical connections between SMs. This coupling slows down upstream and downstream HIs indiscriminately, so recovering from back pressure is very slow and the whole topology stays degraded for a long time.
Spout Back pressure
In spout back pressure, SMs throttle their local spouts to reduce the injection of new data. It is used in conjunction with TCP back pressure between SMs and HIs. When an SM notices one of its local HIs slowing down, it stops reading data from its local spouts, whose buffers fill up until they block. The SM also sends a "start back pressure" message to the other SMs, asking them to throttle their spouts; on receiving it, they stop reading from their local spouts as well. When the slow HI catches up, the local SM sends a "stop back pressure" message, and the other SMs resume consuming data from their local spouts.
This approach throttles the spouts directly, even when the slow component's immediate upstream producers are intermediate bolts, so it is not optimal, and it adds some message-passing overhead between SMs. Its advantage is a reaction time that is fast and independent of the depth of the topology.
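A rough sketch of that control flow; the class and method names are made up for illustration and are not Heron's internals.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the spout back pressure protocol described above.
class StreamManagerSketch {
  private final List<StreamManagerSketch> peerManagers = new ArrayList<>();
  private boolean throttlingLocalSpouts = false;

  void addPeer(StreamManagerSketch peer) { peerManagers.add(peer); }

  // A local HI's buffer crossed its high water mark: throttle everyone's spouts.
  void onLocalInstanceSlow() {
    stopReadingLocalSpouts();
    for (StreamManagerSketch peer : peerManagers) peer.onStartBackPressureMessage();
  }

  // The slow HI drained below the low water mark: let everyone resume.
  void onLocalInstanceCaughtUp() {
    resumeReadingLocalSpouts();
    for (StreamManagerSketch peer : peerManagers) peer.onStopBackPressureMessage();
  }

  void onStartBackPressureMessage() { stopReadingLocalSpouts(); }
  void onStopBackPressureMessage()  { resumeReadingLocalSpouts(); }

  private void stopReadingLocalSpouts()   { throttlingLocalSpouts = true;  /* stop consuming spout sockets */ }
  private void resumeReadingLocalSpouts() { throttlingLocalSpouts = false; /* resume consuming spout sockets */ }
}
```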
Stage-by-stage Back pressure
A topology can be viewed as a pipeline of stages. In this approach, back pressure is propagated stage by stage upstream until it reaches the spouts.
As with spout back pressure, it is combined with TCP back pressure, but only between each SM and its local HIs; no control messages need to be exchanged between SMs.
5.4.2 Implementation
In Heron we implemented spout back pressure, because it is easier to get right and it works well in practice: when processing skew occurs it is easy to debug, since the source of the back pressure is easy to identify. Every socket channel is associated with an application-level buffer that has a high and a low water mark. Back pressure is triggered when the buffer size crosses the high water mark, and is only relieved once it drops back below the low water mark. The rationale for this design is to keep the topology from oscillating rapidly in and out of back pressure.
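Here is the water-mark rule expressed as a small sketch (illustrative only, not Heron's implementation):

```java
// Back pressure starts when the application-level buffer crosses the high water mark
// and is only relieved once it drains below the low water mark, so the topology
// does not flap rapidly in and out of back pressure.
class WaterMarkBufferSketch {
  private final long highWaterMarkBytes;
  private final long lowWaterMarkBytes;
  private long bufferedBytes = 0;
  private boolean backPressureActive = false;

  WaterMarkBufferSketch(long highWaterMarkBytes, long lowWaterMarkBytes) {
    this.highWaterMarkBytes = highWaterMarkBytes;
    this.lowWaterMarkBytes = lowWaterMarkBytes;
  }

  // Called whenever the amount buffered on the socket channel changes.
  boolean onBufferSizeChanged(long deltaBytes) {
    bufferedBytes += deltaBytes;
    if (!backPressureActive && bufferedBytes > highWaterMarkBytes) {
      backPressureActive = true;     // trigger back pressure
    } else if (backPressureActive && bufferedBytes < lowWaterMarkBytes) {
      backPressureActive = false;    // relieve only after draining below the low mark
    }
    return backPressureActive;
  }
}
```

The gap between the two marks is what prevents the rapid on/off flapping mentioned above.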
A consequence of this design is that once a tuple is emitted by a spout, Heron never drops it, unless the process or the machine fails. This makes tuple failures more deterministic.
When a topology is in back pressure mode, it runs only as fast as its slowest component. If that persists for a while, data piles up at the sources; depending on the topology's requirements, spouts can be configured to drop old data.
5.5 Heron Instance
Spouts and bolts both run inside Heron Instances (HIs). Unlike a Storm worker, an HI is a JVM process that runs just one task of one spout or bolt. This makes it easy to debug and profile an individual spout or bolt, because a developer only has to look at that one HI's logs.
Because all complex data movement is handled by the SM, it would be possible to write HIs in languages other than Java in the future.
An HI can be implemented with a single thread or with two threads; both designs are discussed below.
5.5.1 Single-threaded Mode
A single thread holds a TCP connection to the local SM and waits for tuples. When a tuple arrives, the user logic is executed on it. Any output tuples the user logic produces are buffered, and once the buffer exceeds a threshold they are flushed to the local SM.
This approach is simple, but it has a drawback: the user code can block, for example on:
sleep calls;
read/write system calls;
thread synchronization primitives.
Such blocking is undesirable and leads to unexpected behavior: if metrics cannot be collected and reported in time, there is no way to tell from the outside whether the HI is in a bad state.
5.5.2 Two-threaded Mode
In this design there are two threads: a Gateway thread and a Task Execution thread. The Gateway thread is responsible for all of the HI's data movement and communication: it maintains the connections to the local SM and the Metrics Manager, receives tuples from the SM, and hands them to the Task Execution thread.
The Task Execution thread runs the user code. On startup it calls the "open" (spout) or "prepare" (bolt) method. For a bolt it then invokes "execute" on every incoming tuple; for a spout it invokes "nextTuple", and the resulting data is emitted into the topology as tuples. Emitted tuples are passed to the Gateway thread, which forwards them to the SM. The Task Execution thread also collects statistics such as the number of tuples executed, emitted, and acknowledged, and the processing latency.
The Gateway and Task Execution threads communicate through three one-way queues: data-in, data-out, and metrics-out.
The data-in and data-out queues are bounded. When data-in is full, the Gateway thread stops reading from the local SM, which triggers the SM's back pressure mechanism. When data-out is full, the Gateway assumes the SM cannot accept more data, and the Task Execution thread stops executing and emitting tuples.
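A sketch of the two loops and the three queues (illustrative; the private placeholder methods stand in for the real SM/MM sockets and the user code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch of the two-threaded Heron Instance: the Gateway thread owns all
// communication with the Stream Manager (SM) and Metrics Manager (MM); the Task
// Execution thread runs user code. They exchange data over three one-way bounded queues.
class HeronInstanceSketch {
  private final BlockingQueue<Object> dataIn     = new ArrayBlockingQueue<>(1024);
  private final BlockingQueue<Object> dataOut    = new ArrayBlockingQueue<>(1024);
  private final BlockingQueue<Object> metricsOut = new ArrayBlockingQueue<>(1024);

  void gatewayLoop() throws InterruptedException {
    while (true) {
      if (dataIn.remainingCapacity() > 0) {
        dataIn.put(readFromStreamManager());   // hand incoming tuple to the task thread
      }                                        // data-in full -> stop reading: SM applies back pressure
      Object out;
      while ((out = dataOut.poll()) != null) sendToStreamManager(out);
      while ((out = metricsOut.poll()) != null) sendToMetricsManager(out);
    }
  }

  void taskExecutionLoop() throws InterruptedException {
    while (true) {
      if (dataOut.remainingCapacity() == 0) {  // data-out full -> assume SM can't take more
        Thread.sleep(1);                       // stop emitting/executing until it drains
        continue;
      }
      Object tuple = dataIn.take();
      dataOut.put(executeUserLogic(tuple));    // emitted tuples go back through the Gateway
      metricsOut.offer("executed+1");          // execute/emit/ack counts, processing latency, ...
    }
  }

  // Placeholders for the real socket I/O and spout/bolt code.
  private Object readFromStreamManager() { return new Object(); }
  private void sendToStreamManager(Object tuple) { }
  private void sendToMetricsManager(Object metric) { }
  private Object executeUserLogic(Object tuple) { return tuple; }
}
```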
When running a large number of topologies in production, we ran into unexpected GC trouble. If the network is briefly interrupted, the Gateway can no longer drain the data-out queue, tuples pile up in it and cannot be reclaimed, and the HI's heap fills up. When the network recovers, the Gateway starts reading data from the SM again; if it reads before it manages to send, the already-full heap triggers GC, which degrades performance even further.
To avoid this, we periodically examine the two queues and grow or shrink their capacities. If a queue's size exceeds its current bound, the bound is halved; repeated halving either brings the queue down to a stable level or drives the bound to 0, at which point no data is accepted or sent, so it becomes much easier to recover from the GC pressure. Conversely, if the queue stays below its bound, the bound is gradually increased until it returns to the configured limit.
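A sketch of that periodic adjustment, with made-up numbers and names:

```java
// Periodically invoked check on a queue's bound: shrink fast when the queue outgrows it
// (possibly down to 0, which pauses all traffic so backlog and GC pressure can drain),
// then grow the bound back gradually once the queue stays small. Illustrative sketch only.
class AdaptiveQueueBoundSketch {
  private final int maxCapacity;
  private int capacity;

  AdaptiveQueueBoundSketch(int maxCapacity) {
    this.maxCapacity = maxCapacity;
    this.capacity = maxCapacity;
  }

  // 'outstanding' is the current number of items sitting in the queue.
  void periodicCheck(int outstanding) {
    if (outstanding > capacity) {
      capacity /= 2;                                                             // halve the bound
    } else {
      capacity = Math.min(maxCapacity, capacity + Math.max(1, maxCapacity / 10)); // grow back slowly
    }
  }

  int currentCapacity() { return capacity; }
}
```

Halving shrinks the bound quickly under pressure, while the gradual growth brings it back up once things stabilize.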
5.6 Metrics Manager
The Metrics Manager collects metrics, both system metrics and user topology metrics; there is one per container.
Metrics are exported to a monitoring system and also sent to the Topology Master.
5.7 Startup and Failure scenarios
A topology is submitted to the scheduler, which allocates resources and schedules containers onto the cluster's machines. The first container becomes the Topology Master and registers itself in ZooKeeper.
The Stream Managers discover the TM through ZooKeeper, connect to it, and send it periodic heartbeats.
Once all the SMs are connected, the TM assigns the spouts and bolts to the different containers; this assignment is called the physical plan. When it is complete, the SMs fetch the whole plan from the TM, discover each other, and connect to one another to form the network. The HIs then start, download the part of the physical plan that concerns them, and begin executing; tuples start to flow. The TM also writes the physical plan to ZooKeeper so it can survive failures.
A topology can be disrupted when a process dies, a container fails, or a machine has problems.
When the TM process dies, its container restarts it and the TM recovers its state from ZooKeeper; if a standby TM is configured, it takes over as master. The SMs discover the new master TM and connect to it.
Similarly, if an SM dies, its container restarts it; it rediscovers the TM, fetches the physical plan, and verifies its state. The other SMs also receive an updated physical plan telling them where the new SM is. When an HI dies, it is restarted, reconnects to its local SM, downloads its part of the physical plan, determines whether it is a spout or a bolt, and resumes executing the user logic.
When a container is rescheduled onto a new machine, its new SM discovers the TM and goes through the same recovery steps described above.
(The paper includes a flowchart of this startup sequence.)
5.8 Architecture Summary
The important points of the design:
1. Resource allocation is abstracted away to the cluster manager, which lets Heron integrate cleanly with the rest of the infrastructure.
2. Each HI runs only one task (a single spout or bolt), which makes debugging, jstack, heap dumps, and so on much easier.
3. A topology's execution is more transparent: it is easy to see which component is slow or failing, and metrics collection is fine-grained enough to pinpoint a problem to a specific process.
4. Resources are allocated at the component level, avoiding over-allocation.
5. Each topology has its own TM, so topologies are independent of one another; a failure in one does not affect the others.
6. The back pressure mechanism yields a consistent processing rate, which makes the system easier to reason about. It is also an important mechanism for moving a topology from one set of containers to another.
7. The system no longer has a single point of failure.
6. Heron in Production
To run Heron in production, Twitter built Heron Tracker, Heron UI, and Heron Viz. They are used for interacting with topologies, observing topology metrics and trends, tracing problems in HIs, and viewing logs.
6.1 Heron Tracker
The Tracker is a gateway for accessing topology information. It relies on the metadata kept in ZooKeeper and uses ZooKeeper watches to notice new topologies, running topologies, killed topologies, and any change to a physical plan. It also uses the ZooKeeper metadata to reach the TMs and collect additional information.
It exposes a REST API that provides the logical and physical plans, assorted metrics, links to the HI logs, and the Aurora job pages. It runs as an Aurora service itself, with failover and load balancing.
6.2 Heron UI
The UI uses the Tracker API to display topology information, including the logical plan and the physical plan.
The physical plan is drawn as concentric rings: the inner ring is the machines, the middle ring the containers, and the outer ring the HIs. You can drill down to see tuple counts, latencies, ack counts, and fail counts.
Logs can also be viewed from the UI.
6.3 Heron Viz
Viz is a dashboard that periodically collects metrics from the topologies.
It covers health metrics, resource metrics, component metrics, and stream manager metrics, for example:
fail counts;
CPU allocation, memory allocation, and GC time;
tuple counts (emitted, failed, acked) and end-to-end tuple latency;
per-HI processed and dropped tuple counts, back pressure propagation time, and so on.
7. Performance Comparison
Twitter has completely replaced Storm with Heron, which now processes tens of terabytes of data and generates several billion output tuples. In the standard word-count test, throughput increased 6 to 14 times, tuple latency dropped to between a fifth and a tenth of Storm's, and the hardware footprint shrank by about two thirds.
Only a brief summary is given here; the paper has much more detail.
It is also worth mentioning that Heron can be deployed entirely with Docker; the Docker setup is described in the repository:
https://github.com/twitter/heron/tree/master/docker
Paper Address:
http://dl.acm.org/citation.cfm?id=2742788
Open Source Address:
https://github.com/twitter/heron
Twitter Heron: a large-scale stream processing system