Background
UAE (UC App Engine) is a UC internal PaaS platform with an overall architecture similar to Cloud Foundry, including:
- Rapid deployment: supports Node.js, Play! Framework, PHP, and other frameworks
- Information transparency: O&M process, system status, and business status
- Gray-release trial and error: IP-level and region-level gray release
- Basic services: key-value storage, MySQL high availability, image platform, etc.
UAE itself is not the focus here, so it will not be described in detail.
Hundreds of web applications run on UAE, and all of their requests are routed through it. The Nginx access logs amount to terabytes per day. How can we monitor, in real time, each business's access trends, ad data, page latency, access quality, custom reports, and exception alerts?
Hadoop can meet the statistical requirements, but not second-level real-time performance. Spark Streaming was not yet widely used, and we had no Spark engineering experience. Writing our own distributed program would make scheduling troublesome, and scaling and message flow would have to be handled ourselves.
In the end we settled on Storm: relatively lightweight, with flexible and convenient message passing and flexible scaling.
In addition, because UC has many clusters in different regions, cross-cluster log transmission is also a big problem.
Cardinality Counting
In distributed big-data computing, PVs (page views) can be merged easily, but UVs (unique visitors) cannot.
In a distributed setting, with hundreds of businesses and hundreds of thousands of URLs whose UVs are computed simultaneously, and with time-sliced statistics on top (merging per minute, per 5 minutes, per hour, and per day), the memory consumption of exact counting is unacceptable.
This is where the power of probability shows. As Probabilistic Data Structures for Web Analytics and Data Mining illustrates, the memory needed for exact UV counting with a hash table and for cardinality estimation is not even on the same order of magnitude. Cardinality estimation lets you merge UVs with minimal memory consumption, with error well within the acceptable range.
To get started, read up on LogLog Counting to understand how a rough estimate is obtained under the assumption of a uniform hash function; the formula derivation can be skipped.
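For a sense of the estimation error involved (a standard result for LogLog Counting, not a figure from this project): with $m$ registers, the standard error is approximately

$$\mathrm{SE} \approx \frac{1.30}{\sqrt{m}}$$

so $m = 2^{16}$ registers, costing only tens of kilobytes, already bring the error down to about 0.5%.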
The specific algorithm we use is Adaptive Counting, via the stream-lib library (stream-2.7.0.jar).
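As an illustration, here is a minimal sketch of counting and merging UVs with stream-lib's AdaptiveCounting. The register parameter of 16 and the visitor-id scheme are illustrative assumptions, not values from our production setup:

```java
import com.clearspring.analytics.stream.cardinality.AdaptiveCounting;
import com.clearspring.analytics.stream.cardinality.ICardinality;

public class UvSketch {
    public static void main(String[] args) throws Exception {
        // One estimator per URL/time slice; 16 -> 2^16 registers (~64 KB),
        // a small, fixed cost no matter how many visitors are offered.
        ICardinality urlA = new AdaptiveCounting(16);
        ICardinality urlB = new AdaptiveCounting(16);

        // offer() each visitor identifier (cookie id, IP, etc.) as seen.
        for (int i = 0; i < 100_000; i++) {
            urlA.offer("user-" + i);
            urlB.offer("user-" + (i + 50_000)); // half overlap with urlA
        }

        System.out.println("urlA UV ~ " + urlA.cardinality()); // ~100,000

        // The merge is what exact hash tables cannot do cheaply: the
        // union's cardinality (~150,000 here) comes from two small sketches.
        ICardinality merged = urlA.merge(urlB);
        System.out.println("merged UV ~ " + merged.cardinality());
    }
}
```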
Real-time Log Transmission
Real-time computation depends on log transmission with second-level latency. A side benefit is that it avoids the network congestion caused by transmitting logs in periodic batches.
The real-time log transmission component is a lightweight log transport tool provided by UAE. It is mature and stable enough to be used directly, and consists of a client (mca) and a server (mcs).
The client watches for changes to the log files in each cluster, transmits the changes to designated machines in the Storm cluster, and stores them there as ordinary log files.
We adjusted the transmission policy so that the log files on each Storm machine are roughly the same size, so each Spout only has to read local data.
Data source queue
We did not use the queues commonly paired with Storm, such as Kafka and MetaQ, mainly because they are too heavyweight...
fqueue is a lightweight queue speaking the memcached protocol. It exposes ordinary log files as a memcached service, so Storm spouts can read records one by one via the memcached protocol.
This data source is relatively simple: it does not support replay. Once a record is fetched, it no longer exists, so if a tuple fails or times out, that data is lost.
It is, however, lightweight: it reads local files with only a thin cache rather than acting as a pure in-memory queue, so its performance bottleneck is disk I/O and its per-second throughput matches the disk read speed. That is sufficient for this system; it may be replaced with a pure in-memory queue in the future.
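To make the read path concrete, here is a sketch of a polling loop using the spymemcached client. It assumes, per the description above, that a memcached get on the queue name dequeues one record; the host, port, and queue name are hypothetical:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class FqueuePoller {
    public static void main(String[] args) throws Exception {
        // Spouts read their local fqueue; address and queue name are made up.
        MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
        while (true) {
            // Assumed semantics: each get() pops one raw log line; null
            // means the queue is currently empty.
            Object line = client.get("nginx_access_log");
            if (line == null) {
                Thread.sleep(50); // back off briefly when drained
                continue;
            }
            // Hand the line to the spout's emit path. Once fetched, the
            // record is gone -- there is no replay if the tuple later fails.
            System.out.println(line);
        }
    }
}
```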
Architecture
With the technical pieces above in place, a user's log can reach us within a few seconds of the access.
The overall architecture is also fairly simple. The split into two computing bolts is driven by load balancing: business volume varies enormously, so doing fieldsGrouping by business ID alone would leave computing resources unbalanced (a wiring sketch follows this list).
- The Spout normalizes each raw log line and distributes it to the corresponding stat_bolt grouped by URL (fieldsGrouping), keeping the computation load on each server roughly uniform;
- stat_bolt is the main computing bolt: it sorts and computes each business's URLs, producing PV, UV, total response time, backend response time, HTTP status code statistics, URL rankings, traffic statistics, and so on;
- merge_bolt merges each business's data, such as PV and UV. The UV merge, of course, uses the cardinality counting described above;
- We wrote a simple Coordinator class, with streamId marked as "coordinator". It handles time coordination (batch splitting), checking task completion, and timeout handling. The principle is similar to Storm's built-in Transactional Topology;
- A schedbolt API is used to fetch parameters and dynamically adjust the distribution of spouts and bolts across servers, so server resources can be allocated flexibly;
- Smooth Topology upgrades are supported: during an upgrade, the new and old Topologies run simultaneously and coordinate the switchover time; once the new Topology has taken over the fqueue, the old Topology is killed.
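Below is a minimal wiring sketch of the grouping scheme just described, using Storm's classic API. LogSpout, StatBolt, and MergeBolt are illustrative stubs standing in for the real components, and the parallelism numbers are made up:

```java
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class MonitorTopology {

    // Stub spout: the real one reads the local fqueue and normalizes logs.
    public static class LogSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) {
            collector = c;
        }
        public void nextTuple() {
            Utils.sleep(10); // placeholder record instead of a real log line
            collector.emit(new Values("biz1", "/index", 200, 12L));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("business_id", "url", "status", "cost_ms"));
        }
    }

    // Stub stat bolt: the real one computes per-URL PV/UV/latency etc.
    public static class StatBolt extends BaseBasicBolt {
        public void execute(Tuple t, BasicOutputCollector c) {
            c.emit(new Values(t.getStringByField("business_id"), 1L));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("business_id", "partial_pv"));
        }
    }

    // Stub merge bolt: the real one merges per-business PV and UV sketches.
    public static class MergeBolt extends BaseBasicBolt {
        public void execute(Tuple t, BasicOutputCollector c) { /* merge */ }
        public void declareOutputFields(OutputFieldsDeclarer d) { }
    }

    public static void main(String[] args) {
        TopologyBuilder b = new TopologyBuilder();
        b.setSpout("log_spout", new LogSpout(), 4);
        // Group by URL, not business id: a large business then spreads
        // across many stat_bolt tasks instead of hot-spotting one.
        b.setBolt("stat_bolt", new StatBolt(), 16)
         .fieldsGrouping("log_spout", new Fields("url"));
        // Merging is per business, so group by business id here.
        b.setBolt("merge_bolt", new MergeBolt(), 4)
         .fieldsGrouping("stat_bolt", new Fields("business_id"));
        new LocalCluster().submitTopology("nginx-monitor", new Config(),
                b.createTopology());
    }
}
```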
Note:
- Storm machines should be deployed in the same rack whenever possible, so that intra-cluster bandwidth is not affected;
- Our Nginx logs are rotated hourly. If the rotation time is not precise, obvious data fluctuations appear, so rotate logs with an Nginx module where possible; sending a signal from crontab introduces delay. A log-rotation delay of around 10 seconds is negligible in coarse-grained statistics, but the fluctuation is obvious at second-level granularity;
- If the heap is too small, the worker is forcibly killed, so the -Xmx parameter must be configured (a configuration sketch follows this list);
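For example, the worker heap can be raised per topology through the worker JVM options (or cluster-wide via worker.childopts in storm.yaml); the 2048m value below is illustrative, not our production setting:

```java
import backtype.storm.Config;

public class WorkerHeap {
    public static Config withLargerHeap() {
        Config conf = new Config();
        // Without an explicit -Xmx, an undersized heap gets the worker
        // forcibly killed; 2048m here is an illustrative value.
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2048m");
        return conf;
    }
}
```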
Customizable Items
- Static resources: static resource filtering options; specific static resources can be filtered by Content-Type or file suffix (a filter sketch follows this list);
- Resource merging: URL merging, e.g. for RESTful resources, which are easier to present once merged;
- Dimensions and metrics: syntax and lexical analysis via ANTLR v3 allow custom dimensions and metrics, and custom expressions are supported for subsequent alerting.
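A minimal sketch of such a static-resource filter; the suffix list and the image/ Content-Type rule are illustrative assumptions, not our actual configuration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StaticFilter {
    // Illustrative suffix list; a real deployment would make this per-business.
    private static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("js", "css", "png", "gif", "ico"));

    // Treat a request as static if its Content-Type starts with "image/"
    // or its URL path ends with a known static suffix.
    public static boolean isStatic(String url, String contentType) {
        if (contentType != null && contentType.startsWith("image/")) {
            return true;
        }
        int q = url.indexOf('?');
        String path = q >= 0 ? url.substring(0, q) : url;
        int dot = path.lastIndexOf('.');
        return dot >= 0
                && SUFFIXES.contains(path.substring(dot + 1).toLowerCase());
    }
}
```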
Others
We have also implemented the following:
- Business process-level monitoring (CPU/memory/ports)
- Monitoring of services the business depends on, such as MySQL/memcached
- Monitoring of server disk/memory/IO/kernel parameters/language environment/environment variables/build environment, etc.
Original article: Storm-based Nginx Log Real-time Monitoring System. Thanks to the author for sharing.