A major part of my current work is participating in, or being responsible for, the maintenance, application development, and optimization of several clusters in our department, including an HBase cluster, a Storm cluster, a Hadoop cluster, and a Super Mario cluster (a real-time streaming system developed in-house). As the business has expanded, the number of machines in these clusters has grown from small to large.
What follows is a summary of what I have done over the past year in cluster applications and operations. The points below are somewhat fragmented, so you can read this as a loosely organized technical article.
1) Automate the installation and deployment process as much as possible.
Script every step of cluster deployment so that you can deploy to many nodes at once and quickly bring a node online or take it offline. When a cluster has many nodes, or nodes are constantly being added and removed, this saves a great deal of time.
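As a minimal sketch of what "scripting every step" can look like, the snippet below turns the manual steps for bringing one node online into generated shell commands. The package name, install directory, and start command are placeholder assumptions; substitute your own.

```python
# Sketch of scripted node deployment. The package name, install path, and
# start command below are hypothetical examples, not real artifact names.

def deploy_commands(host, package="storm-0.9.0.tar.gz", install_dir="/opt/storm"):
    """Return the shell commands needed to bring one node online."""
    return [
        f"scp {package} {host}:/tmp/",
        f"ssh {host} 'mkdir -p {install_dir}'",
        f"ssh {host} 'tar xzf /tmp/{package} -C {install_dir} --strip-components=1'",
        f"ssh {host} 'nohup {install_dir}/bin/storm supervisor > /dev/null 2>&1 &'",
    ]

if __name__ == "__main__":
    # In practice the host list would come from a file or inventory system.
    for host in ["node1", "node2"]:
        for cmd in deploy_commands(host):
            print(cmd)  # replace print with subprocess.run(cmd, shell=True) to execute
```

Keeping the steps as data (a list of commands per host) also makes it easy to add a dry-run mode before touching real machines.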
2) Build and make full use of the cluster's monitoring systems.
First, and most important, are the monitoring systems that ship with the cluster software itself: the HBase Master and RegionServer monitoring pages, the Hadoop JobTracker/TaskTracker and NameNode/DataNode pages, the Storm UI, and so on. This kind of monitoring focuses on the cluster's jobs and resources and contains comprehensive information, including exception logs from running jobs, which makes it very timely and effective for investigating and locating problems.
Second, since it is a cluster, you need a unified monitoring endpoint that collects and displays the working state of every node; the cluster should be neither too idle nor too heavily loaded. Therefore, we need to monitor the CPU, memory, disk, and network of each node in the cluster. Ganglia is a very good tool for this: it is simple to install and configure, collects a rich set of metrics, and supports customization; for example, both Hadoop and HBase ship with Ganglia extensions.
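For reference, a minimal Ganglia agent configuration on each node looks roughly like the fragment below. The cluster name, collector host, and port are placeholder assumptions; the section names follow the standard gmond.conf format.

```
/* gmond.conf fragment on every node (names and hosts are examples) */
cluster {
  name = "hadoop-cluster"
  owner = "ops"
}
/* send metrics to the central gmetad/collector host */
udp_send_channel {
  host = gmetad.example.com
  port = 8649
  ttl = 1
}
udp_recv_channel {
  port = 8649
}
```

With this in place, enabling the Hadoop/HBase Ganglia metrics sinks makes their internal counters show up alongside the per-node CPU, memory, disk, and network graphs.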
3) Add the necessary operations scripts for the nodes in the cluster.
Delete expired, useless log files regularly, because a full disk will cause nodes to stop working or even fail. Examples include the Supervisor and Nimbus process logs in a Storm cluster and the various process logs in a Hadoop cluster.
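A cleanup script of this kind can be very small. The sketch below deletes log files older than a retention period; the directory, suffix, and retention window are assumptions to adapt, and in practice it would run from cron against the real log directories.

```python
# Minimal sketch of an expired-log cleanup script; point log_dir at your
# real log directories (e.g. the Storm supervisor/nimbus log dirs) and run
# it periodically from cron. Retention period and suffix are assumptions.
import os
import time

def clean_old_logs(log_dir, max_age_days=7, suffix=".log"):
    """Delete files under log_dir older than max_age_days; return deleted paths."""
    cutoff = time.time() - max_age_days * 86400
    deleted = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            if not name.endswith(suffix):
                continue
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                deleted.append(path)
    return deleted
```

Returning the deleted paths makes it easy to log what was reclaimed, which helps when diagnosing a disk that keeps filling up.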
Add boot-time startup scripts for the daemons on the cluster, to avoid manual intervention after a machine recovers from downtime as much as possible. For example, CDH's RPM installation adds startup scripts for Hadoop, Hive, HBase, and so on, so the services start automatically after the machine reboots.
At the same time, monitor whether the daemons on the cluster are alive and restart them directly if they are not. This approach applies only to stateless processes: Storm's Nimbus and Supervisor processes, ZooKeeper processes, and the like should all be paired with such a monitoring script so that a terminated service process can be restarted as soon as possible, for example by a crontab check that runs every minute.
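The crontab check can be sketched as below. The process pattern and restart command are placeholders for your own daemons; the script assumes a Unix host with pgrep available.

```python
# Sketch of a cron-driven liveness check for a stateless daemon. The pgrep
# pattern and restart command are placeholders; a crontab entry such as
#   * * * * * /usr/bin/python /opt/scripts/check_daemon.py
# would run it every minute (path is an example).
import subprocess

def is_running(pattern):
    """Return True if any process command line matches the pgrep pattern."""
    return subprocess.call(["pgrep", "-f", pattern],
                           stdout=subprocess.DEVNULL) == 0

def ensure_running(pattern, start_cmd):
    """Restart the daemon if it is not alive; return True if a restart happened."""
    if is_running(pattern):
        return False
    subprocess.call(start_cmd, shell=True)
    return True
```

Because the check is idempotent (it does nothing when the process is alive), running it every minute is safe even if two runs overlap briefly.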
4) Add monitoring and alerting at the application layer according to business characteristics.
For batch computing tasks at the business layer, you can monitor the size and timing of the daily output data and raise an alert when something is abnormal (for example, the data file is unexpectedly small, or the output is delayed).
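A sketch of such a daily-output sanity check, with the thresholds and the output path as assumptions to tune per job; wiring the returned messages to a mail or IM gateway is left out.

```python
# Sketch of a daily batch-output check; thresholds are per-job assumptions.
import os
import time

def check_daily_output(path, min_bytes, max_age_hours):
    """Return a list of alert messages; an empty list means the output looks healthy."""
    if not os.path.exists(path):
        return [f"output missing: {path}"]
    alerts = []
    size = os.path.getsize(path)
    if size < min_bytes:
        alerts.append(f"output too small: {path} is {size} bytes (< {min_bytes})")
    age_hours = (time.time() - os.path.getmtime(path)) / 3600
    if age_hours > max_age_hours:
        alerts.append(f"output delayed: {path} is {age_hours:.1f}h old (> {max_age_hours}h)")
    return alerts
```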
For real-time computing applications, the most important question is whether data processing is significantly delayed (minute-level delay, second-level delay, and so on). Based on this, you can define a series of rules that trigger alerts of different severities, so that problems are found and solved as early as possible.
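The tiered rules can be as simple as mapping the observed delay to a severity. The thresholds and severity names below are assumptions to tune per application.

```python
# Sketch of tiered delay rules for a real-time pipeline; the thresholds and
# severity names are assumptions, not a fixed standard.
def delay_severity(delay_seconds):
    """Map the observed processing delay to an alert severity (or None if on time)."""
    if delay_seconds >= 600:   # ten minutes behind: page someone
        return "critical"
    if delay_seconds >= 60:    # minute-level delay: warn
        return "warning"
    if delay_seconds >= 5:     # second-level delay: just record it
        return "notice"
    return None                # on time
```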
5) Enable multiple users to share the cluster's computing and storage resources.
Use the cluster's quota mechanism to restrict resource usage per user. Hadoop supports this, but for Storm and HBase we have not yet found a way to enforce such limits.
Restrict and isolate cluster resources through multi-user queues. For example, to resolve contention for computing resources among multiple users, Hadoop uses the Capacity Scheduler or the Fair Scheduler to queue jobs submitted by different users; you can deploy them as-is or customize them to business requirements, which is very convenient.
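As a hedged illustration, a Fair Scheduler allocation file (Hadoop 1.x style) defining per-team pools might look like the fragment below; the pool names and numbers are examples only.

```xml
<?xml version="1.0"?>
<!-- fair-scheduler.xml sketch: one pool per user/team (names and limits are examples) -->
<allocations>
  <pool name="etl">
    <minMaps>20</minMaps>
    <minReduces>10</minReduces>
    <maxRunningJobs>5</maxRunningJobs>
    <weight>2.0</weight>
  </pool>
  <pool name="adhoc">
    <maxRunningJobs>2</maxRunningJobs>
    <weight>1.0</weight>
  </pool>
  <userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>
```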
For a Storm cluster, computing resources are divided into slots, so you could consider adding a resource-control layer on top of Storm that records each user's maximum number of slots and the number currently occupied, thereby implementing per-user resource quotas (although given the current size of our Storm cluster and its small number of internal users, this need is not particularly urgent).
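The bookkeeping of such a layer is straightforward. Storm itself does not provide this; the class below is a hypothetical sketch of the idea.

```python
# Hypothetical per-user slot-quota layer on top of Storm (Storm provides no
# such module; this only shows the bookkeeping the text describes).
class SlotQuota:
    def __init__(self):
        self.max_slots = {}   # user -> quota
        self.used_slots = {}  # user -> slots currently occupied

    def set_quota(self, user, slots):
        self.max_slots[user] = slots

    def try_allocate(self, user, slots):
        """Grant the request only if it stays within the user's quota."""
        used = self.used_slots.get(user, 0)
        if used + slots > self.max_slots.get(user, 0):
            return False  # would exceed quota: reject the topology submission
        self.used_slots[user] = used + slots
        return True

    def release(self, user, slots):
        """Return slots when a topology is killed."""
        self.used_slots[user] = max(0, self.used_slots.get(user, 0) - slots)
```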
In addition, access control for different users is essential: whether a user may submit a job, kill a job, view the cluster's resources, and so on. This is a basic guarantee of safe cluster operation.
6) Real-time computing applications should find ways to cope with peak traffic pressure.
Real pressure tests: for example, to prepare for the November 11 (Singles' Day) traffic peak, run load tests simulating 3-5x normal traffic, so that problems are found and solved early and system stability is assured.
Operations switches: add operations switches to soften the impact of traffic peaks on the system. For example, use ZooKeeper to attach switches to real-time computing applications so that processing speed can be adjusted online; by allowing a certain amount of delay, the traffic can be processed smoothly.
Fault-tolerance mechanisms: as traffic changes, real-time computing can run into all kinds of unexpected situations, so system design and implementation must fully consider every possible error condition (data delays, lost data, dirty data, network disconnections, and so on).
Trade stability against accuracy: it is not advisable to pursue result accuracy too aggressively in real-time computing. To keep the system running stably, some accuracy can be sacrificed; keeping the application "alive" is more important.
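The operations-switch idea above can be sketched as a processing loop that honors a runtime-adjustable rate limit. In production the limit would live in ZooKeeper and be pushed to every worker; here it is a plain object so the flow is easy to see, and all names are illustrative.

```python
# Sketch of an ops switch throttling a processing loop. In production the
# max_per_sec value would be read from ZooKeeper and updated by operators;
# here it is held in a plain object for clarity.
import time

class OpsSwitch:
    """Holds the current maximum processing rate, changeable at runtime."""
    def __init__(self, max_per_sec):
        self.max_per_sec = max_per_sec

def process_batch(records, switch, clock=time.monotonic, sleep=time.sleep):
    """Process records without exceeding switch.max_per_sec; return the count."""
    done = 0
    window_start = clock()
    for _record in records:
        done += 1  # real work would happen here
        if done % switch.max_per_sec == 0:
            elapsed = clock() - window_start
            if elapsed < 1.0:
                sleep(1.0 - elapsed)  # smooth the peak instead of falling over
            window_start = clock()
    return done
```

Injecting `clock` and `sleep` keeps the throttle testable; lowering `max_per_sec` during a peak trades latency for stability, exactly the compromise the section advocates.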
7) Use multiple approaches to track, locate, and solve problems in the cluster.
With the help of the cluster monitoring system, locate the specific machine where a problem lies. After logging in to the problem machine, you can use common commands such as top, free, sar, iostat, and nmon to further inspect and confirm system resource usage and pin down the problem.
Also check the logs on the cluster (both cluster-level and business-level) for exception entries and their corresponding causes.
In addition, you can trace the worker process with strace, JVM tools, and the like to find the cause at the scene of the problem.
8) Some tuning ideas for tasks running on the cluster.
Consider overall system resource load: combine cluster monitoring with the resource usage (CPU, memory, disk, network) of each task instance on each node, locate the system bottleneck, and then optimize, making maximum use of each node's resources, especially CPU and memory.
Parallelize task instances: work that parallelizes directly can use multiple shards or multiple processes/threads; complex tasks should first be decomposed and then parallelized.
Treat different task types differently: for CPU-intensive tasks, consider using multiple cores to keep the CPU as busy as possible; for memory-intensive tasks, consider choosing appropriate data structures, compressing in-memory data (with a suitable choice of compression algorithm), persisting data to disk, and so on.
Use caches wisely: cache the frequently accessed, expensive-to-reach links in the pipeline to reduce network and disk overhead, and keep the cache size under sensible control to avoid performance thrashing caused by the cache itself.
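The "control the cache size" advice above can be sketched as a size-bounded LRU cache, so the cache cannot grow without limit and eat the memory the task itself needs; the capacity is an assumption to tune per workload.

```python
# Size-bounded LRU cache sketch using OrderedDict; capacity is a per-workload
# assumption that keeps the cache from causing the very memory pressure it
# is meant to relieve.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)      # mark as recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry
```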
Author: great Circle those things
URL: http://www.cnblogs.com/panfeng412/archive/2013/06/27/cluster-use-and-maintain-experience-summary.html