Editor's note: Hadoop's drawbacks are as stark as its virtues: high latency, slow response, and complex operation. It is widely criticized, yet demand for it keeps growing; Hadoop has essentially established its hegemony over big data, and many open-source projects were created with the goal of making up for Hadoop's lack of real-time processing. Storm emerged at just this moment: a free, open-source, distributed, highly fault-tolerant real-time computing system. Storm makes continuous stream computation easier, meeting the real-time requirements that Hadoop batch processing cannot.
Background

UAE (UC App Engine) is a PaaS platform within UC, with an overall architecture somewhat similar to CloudFoundry, including:

- Rapid deployment: supports Node.js, Play!, PHP, and other frameworks
- Information transparency: process operation, system state, business status
- Grayscale trial and error: gray release by IP, gray release by region
- Basic services: key-value storage, MySQL high availability, image platform, etc.
UAE is not the protagonist here, so we will not introduce it in detail.
Hundreds of web applications run on UAE, and all requests are routed through it; the Nginx access logs amount to terabytes per day. How do we monitor each business's access trends, advertising data, page load time, access quality, custom reports, and exception alarms in real time?
Hadoop can meet the statistical needs, but it cannot satisfy second-level real-time requirements; Spark Streaming seemed somewhat like overkill, and besides, we had no Spark engineering experience; writing our own distributed program would make scheduling troublesome, and we would have to handle scaling and message flow ourselves.
In the end, our technology selection was Storm: relatively lightweight, flexible enough for us, and with convenient message passing.
In addition, since UC has many clusters, transmitting logs across clusters is also a fairly big problem.
Technical preparation

Cardinality counting
In big-data distributed computing, PV (page views) can simply be added together, but UV (unique visitors) cannot.
In a distributed setting, computing UV simultaneously for hundreds of businesses and hundreds of thousands of URLs, sliced by time (per minute / merged every 5 minutes / merged hourly / merged daily), makes the memory consumption of exact counting unacceptable.
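A tiny illustration of why PV adds but UV does not (the visitor IDs here are hypothetical): any overlap between two servers' visitor sets makes the naive sum of per-server UVs overcount.

```python
# Two shards of traffic, e.g. logs from two servers (hypothetical IDs).
shard_a = ["u1", "u2", "u3"]          # visitors seen by server A
shard_b = ["u2", "u3", "u4"]          # visitors seen by server B

pv = len(shard_a) + len(shard_b)                  # PV: a plain sum is correct -> 6
naive_uv = len(set(shard_a)) + len(set(shard_b))  # 3 + 3 = 6: overcounts u2, u3
true_uv = len(set(shard_a) | set(shard_b))        # set union -> 4

print(pv, naive_uv, true_uv)
```

Keeping the full visitor sets around just to take unions is what blows up memory at hundreds of thousands of URLs; cardinality counting replaces each set with a small mergeable sketch.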
This is where the power of probability shows. As Probabilistic Data Structures for Web Analytics and Data Mining illustrates, the memory needed by an exact hash-table UV count and by a cardinality estimator are not even in the same order of magnitude. Cardinality counting lets you merge UVs with minimal memory consumption, and the error stays entirely within acceptable limits.
You can first read up on LogLog Counting; understanding the premise of uniform hashing is enough, and the derivation of the rough estimate can be skipped.
The specific algorithm used is Adaptive Counting, and the computation library is stream-2.7.0.jar (stream-lib).
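As an illustration of the idea (not the stream-lib Adaptive Counting implementation the system actually uses), here is a minimal HyperLogLog-style sketch in Python. The key property it demonstrates is the one the architecture relies on: two sketches merge by register-wise max, which is what makes the minute / 5-minute / hour / day UV merges cheap.

```python
import hashlib

class HLL:
    """Minimal HyperLogLog-style cardinality sketch (illustration only;
    the real system uses Adaptive Counting from stream-lib)."""

    def __init__(self, b: int = 10):
        self.b = b                      # 2**b registers
        self.m = 1 << b
        self.reg = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # HLL bias constant

    def offer(self, item: str) -> None:
        # 64-bit hash: first b bits select a register, the rest give a "rank".
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.b)
        rest = h & ((1 << (64 - self.b)) - 1)
        rank = (64 - self.b) - rest.bit_length() + 1  # leading zeros + 1
        self.reg[idx] = max(self.reg[idx], rank)

    def merge(self, other: "HLL") -> None:
        # UV merge = register-wise max; sketches for each minute can be
        # combined into hours and days without revisiting raw logs.
        self.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]

    def cardinality(self) -> float:
        # Raw HLL estimator (small-range correction omitted for brevity).
        return self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)
```

With b=10 (1024 registers, ~1 KB per URL) the typical relative error is around 3%, versus megabytes for an exact visitor set.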
Real-time log transfer
Real-time computing must rely on second-level real-time log transmission; an added benefit is avoiding the network congestion caused by transmitting in periodic batches.
Real-time log transfer is a lightweight log-transmission tool provided by UAE. It is mature and stable, so we use it directly; it includes a client (MCA) and a server side (MCS).
The client listens for changes in the log files on each cluster and transmits them to the specified machines in the Storm cluster, where they are stored as ordinary log files.
We tuned the transmission strategy so that the log files on each Storm machine are roughly the same size; that way, a spout only needs to read local data.
Data Source Queue
We do not use the queues commonly paired with Storm, such as Kafka and MetaQ, mainly because they are too heavy...
Fqueue is a lightweight queue that speaks the memcached protocol. It turns ordinary log files into a memcached service, so a Storm spout can read them directly via the memcached protocol.
This data source is on the simple side: it does not support replay. Once a record is fetched it is removed, so if a tuple fails or times out, that data is lost.
It is relatively lightweight: it reads from local files with a thin layer of cache on top, rather than being a pure in-memory queue, so its performance bottleneck is disk I/O and its per-second throughput matches the disk read speed. For our system this is sufficient; we plan to switch to a pure in-memory queue later.
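Because fqueue speaks the memcached text protocol, a spout only needs a plain `get` over a socket. The sketch below assumes a hypothetical host, port, and key name; the response parser follows the standard memcached text-protocol format (`VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n`).

```python
import socket

def parse_get_response(raw: bytes):
    """Parse a memcached text-protocol GET response into (key, value),
    or None when the server answers a bare END (queue empty)."""
    if raw.startswith(b"END"):
        return None
    header, rest = raw.split(b"\r\n", 1)
    _, key, _flags, length = header.split()
    value = rest[: int(length)]
    return key.decode(), value

def poll_fqueue(host: str, port: int, key: str = "log"):
    """One destructive read from fqueue (host/port/key are hypothetical).
    A record is removed as it is read, so a failed or timed-out tuple
    cannot be replayed -- matching the trade-off described above."""
    with socket.create_connection((host, port)) as s:
        s.sendall(f"get {key}\r\n".encode())
        return parse_get_response(s.recv(65536))
```

In a real spout, `poll_fqueue` would be called in `nextTuple()` and the returned log line emitted downstream.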
Architecture
With the technical preparation above, we can obtain a user's log within a few seconds of that user's visit.
The overall architecture is also fairly simple. The reason there are two kinds of computation bolts is to distribute the computation evenly: business volumes differ greatly, and if we did fieldsGrouping on the business ID alone, computational resources would be unbalanced.
- The spout standardizes each raw log line and distributes it by URL group (fieldsGrouping, to keep the computation balanced across servers) to the corresponding stat_bolt.
- The stat_bolt is the main computation bolt. It sorts out and computes per-URL metrics for each business, such as PV, UV, total response time, back-end response time, HTTP status code statistics, URL sorting, traffic statistics, and so on.
- The merge_bolt merges the data of each business, such as PV and UV counts. Of course, the UV merge here uses the cardinality counting mentioned above.
- A simple Coordinator class coordinates on a stream with streamid "coordinator". Its functions: time coordination (segmenting batches), checking task completion, and timeout handling. The principle is similar to Storm's transactional topology.
- We implemented a scheduler that obtains parameters through an API and dynamically adjusts the distribution of spouts and bolts across servers, so server resources can be allocated flexibly.
- Smooth topology upgrades are supported: when a topology is upgraded, the new and old topologies run at the same time and coordinate the switch-over moment; once the new topology takes over fqueue, the old topology is killed.
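A small simulation of the load-balancing point above, with hypothetical traffic numbers: when one business dwarfs the rest, fieldsGrouping on the business ID sends almost everything to one task, while grouping on the URL spreads the load. Here `zlib.crc32` stands in for the hash Storm's fieldsGrouping applies to the chosen field.

```python
import zlib
from collections import Counter

def fields_grouping(key: str, n_tasks: int) -> int:
    # fieldsGrouping routes all tuples with the same field value to the
    # same task; crc32 is a stand-in for the real hash function.
    return zlib.crc32(key.encode()) % n_tasks

# Hypothetical skewed traffic: one business produces 90% of the logs,
# but its requests are spread over 50 distinct URLs.
logs = [("big_biz", f"/big/url{i % 50}") for i in range(9000)]
logs += [(f"biz{i % 20}", f"/u{i}") for i in range(1000)]

by_biz = Counter(fields_grouping(biz, 4) for biz, url in logs)
by_url = Counter(fields_grouping(url, 4) for biz, url in logs)
print("group by business id:", dict(by_biz))  # one task receives all 9000 big_biz tuples
print("group by url:", dict(by_url))          # load is spread across the 4 tasks
```

This is why the spout groups by URL and a separate merge_bolt re-aggregates per business afterwards.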
Points to note:

- Keep the Storm machines in the same rack as far as possible, so cluster bandwidth is not affected.
- Our Nginx logs are split by the hour. If the split time is inaccurate, an obvious fluctuation shows up in the data at minute 00. So cut the logs with the Nginx module wherever possible; sending a signal from a crontab introduces a delay. That roughly 10-second delay in log cutting is no problem for coarse-grained statistics, but in second-level statistics the fluctuation is very obvious.
- A heap that is too small causes workers to be forcibly killed, so configure the -Xmx parameter.
- Custom static resources: a static-resource filtering option filters specific static resources by Content-Type or suffix.
- Resource merging: URL merging, for example for RESTful resources, makes them easier to display after merging.
- Dimensions and metrics: ANTLR v3 does the syntactic and lexical analysis for custom dimensions and metrics; later, alarms will also support custom expressions.

Other
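A minimal sketch of the custom static-resource filter described above. The suffix and Content-Type lists here are hypothetical defaults; the real option is configurable per business.

```python
# Hypothetical defaults; in the real system these are configurable.
STATIC_SUFFIXES = (".js", ".css", ".png", ".jpg", ".gif", ".ico")
STATIC_TYPES = ("image/", "text/css", "application/javascript")

def is_static(url: str, content_type: str = "") -> bool:
    """Filter static-resource requests by URL suffix or Content-Type,
    so they are excluded from the business statistics."""
    path = url.split("?", 1)[0]          # ignore the query string
    return path.endswith(STATIC_SUFFIXES) or content_type.startswith(STATIC_TYPES)
```

A stat_bolt could call `is_static` on each standardized log line and drop matches before counting PV/UV.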
In addition, we have also implemented the following in other ways:
- Business process-level (CPU/mem/port) monitoring
- Monitoring of services the business depends on, such as MySQL and memcached
- Monitoring of the server itself: disk/memory/IO/kernel parameters/language environment/environment variables/compilation environment

Link: Real-time Nginx log monitoring system based on Storm (editor: Wei)