Design and Implementation of a Monitoring System for Warehouse Operation Machines

Last Update:2014-11-05 Source: Internet

Author: User

Tags rrdtool sqlite db



I recently worked on a monitoring project for warehouse operation machines. The background is as follows: the company's e-commerce business runs many warehouses, large and small, across the country, and warehouse operators (who have no IT background) frequently report or complain about problems with their operation machines, such as dropped network connections, power loss, or being unable to reach services. The actual situation is often inconsistent with these reports, but the operations side has had no data to prove it either way, hence the need for this project.

The goal of the first phase is only to collect and display some monitoring metrics for the machines, so that problems can be located quickly, or at least so that there is data to check against.

To keep the reporting of large volumes of monitoring data from affecting the production system's network services, the system uses the following structure:

An agent runs on each warehouse operation PC or working PDA and collects that machine's monitoring data;
A data collection and processing service runs on the warehouse's local server. It provides an API through which agents upload monitoring data, persists the received data to the database, and supplies data to the WebApp on the same local server for display;
The central server can call the data query interface provided by the WebApp on each warehouse's local server (the data is used to discover and locate problems), and periodically archives the data held on those local servers as needed.

In this way, the main work is concentrated on the agent that runs on the operation machines, the data collection and processing service, and the WebApp. The most critical of these is the data collection and processing service. Considering that these components have to be deployed on the local servers of many warehouses, and that some large warehouses currently have as many as 800-1000 operation machines, we made the following technology choices:

Golang to implement the agent, the data collection and processing service, and the WebApp;
SQLite as the database that stores all data reported by the agents;
NSQ as the asynchronous message queue middleware.

The reason for choosing Golang: it compiles statically and is simple to deploy. You just copy the compiled executable binary to the server and run it.

The reason for choosing SQLite is that, unlike MySQL, there is no server program to install and no additional deployment or maintenance. Of course, SQLite's file locking can greatly affect database read and write performance. We therefore split the database as much as possible: different metrics are stored in different SQLite db files, and even each day's data for each metric of each operation machine can be stored in its own db file, to minimize the performance impact of file locks. So far this seems to work well.

The reasons for choosing NSQ: it is implemented in Golang, distributed, scalable, and high-performance; it supports both HTTP and TCP protocols; and it comes with its own web management interface.

The detailed system structure diagram is as follows:

NSQ supports multiple topics (different topics carry different data), and a topic can have multiple channels (every channel of a topic receives the same data, in a multicast-like fashion). Each channel has a corresponding handler in the client that processes the data in that channel. We send the different monitoring metrics of the operation machines to NSQ as different topics. Most of the metric data only needs to be persisted to the database for later use, so these topics need only one channel.

The WebApp is based on the Beego framework, which avoids reinventing the wheel and keeps the workload low. Data display in the WebApp is implemented with Highcharts and Raphael, which offer good compatibility.

Machine metrics data is not well suited to a relational database, because it has the following characteristics: it is effectively read-only once written, it is time-series data, it is almost never read relationally, and it is read in continuous bulk ranges. This is why open-source monitoring systems such as Cacti and Ganglia use RRDtool to read and write metric data. For the same reason, as mentioned above, we split the storage of metric data into as many files as possible, which improves read and write performance without causing other problems.

The workflow of the system is described as follows:

After the agent on an operation machine starts, it sends a registration message to NSQ's register topic. On receiving it, the NSQ client sets the machine's status to "normal operation" in the register data table;
The agent then reports monitoring data to NSQ at regular intervals, and the NSQ client's handlers for the various kinds of data persist it to the SQLite database files;
When a user accesses the WebApp, or the central server invokes its API, the WebApp reads the SQLite databases;
A goroutine periodically checks, for each registered operation machine, whether heartbeat data has been received from it within the last 3 minutes. If not, the machine's status is changed from "normal operation" to "run exception"; if so, a status of "run exception" is changed back to "normal operation";
When an operation machine shuts down normally, it sends a shutdown message to NSQ's register topic. When the NSQ client reads the message, it changes the machine's status in the register data table to "gracefully shut down".

At present, the system works well. Next we will load-test it; if a bottleneck appears, it will probably be in data storage, in which case we may try RRDtool or InfluxDB.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
