Cluster Management with C3 and PBS

Managing a cluster is not like managing a PC: powering machines on and off and organizing disks are the easy parts. The real work of cluster management is dividing resources, allocating tasks, and monitoring node performance and load. The resources of each node (memory, disk, processors, and network) are combined through network message passing, much as an operating system combines processor, memory, and disk over internal buses into the single computer we perceive. Another important task is monitoring node performance, analogous to the way an operating system monitors the hardware of a PC. Here we briefly introduce the management tool C3.

1. Cluster command-line tool C3

We introduced the C3 tool in detail when installing OSCAR. C3 mainly provides the following commands:

• cexec: runs any Linux command on all nodes.

• cget: retrieves files from any node in the cluster.

• ckill: kills a specific user process on the specified nodes.

• cpush: distributes files and directories to the cluster nodes.

• cpushimage: updates the image on all nodes using the SystemImager tool.

• crm: deletes files or directories from all nodes.

• cshutdown: shuts down or restarts all nodes.

• cnum: returns the node numbers within a range, given the node base name.

• cname: returns the node names for a range of numbers.

• clist: lists information about all nodes in the configuration file.

The cexec command runs a command on all nodes in parallel, as if you were operating a single machine. A serial version is also provided for command execution and debugging; to avoid confusing the two, it is named cexecs rather than cexec.

The node information is defined in the /etc/c3.conf file. For example, the configuration of a 64-node cluster looks like this:

cluster cartman {
    cartman-head:node0    # head node
    node[1-64]            # compute nodes
}

The first line gives the cluster name, cartman, followed by the cluster definition enclosed in braces. The second line says that the cluster's head node is the machine node0; the remaining lines list the client (compute) nodes. Here is another example configuration file:

cluster kenny {
    node0                 # head node
    dead placeholder      # shifts command-line indexing to start at 1
    node[1-32]            # first set of nodes
    exclude 30            # offline node within the range
    exclude [5-10]        # offline range of nodes
    node100               # single node definition
    dead node101          # offline node
    dead node102
    node103
}

There are two offline states: exclude and dead. exclude marks offline machines within a node range, while dead marks an individual machine as out of service. Ranges can be combined: for example, 1-5, 9, 11 denotes nodes 1, 2, 3, 4, 5, 9, and 11. You can specify the machine range when running a command, or omit it to use all nodes by default. For example, to run the ls -l command on all nodes:

$ cexec ls -l

To run ls -l on nodes 1-5 only, the command is:

$ cexec 1-5 ls -l

Other node ranges can be specified as well. For example, the following command covers nodes 1-5 plus nodes 9 and 11:

$ cexec 1-5,9,11 ls -l

You can also name the cluster explicitly and run the command on all of its nodes:

$ cexec cartman: ls -l

To combine a cluster name with a node range, the format is:

$ cexec cartman:1-5 ls -l
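To make the range notation concrete, here is a small standalone helper (illustrative only, not part of C3) that expands a range string such as 1-5,9,11 into node names:

```shell
#!/bin/sh
# expand_range: illustrative helper (NOT a C3 tool) that expands a
# C3-style range string into node names built from a base name.
#   expand_range node "1-5,9,11"  ->  node1 ... node5 node9 node11
expand_range() {
    base=$1
    # Split the range string on commas, then expand each piece.
    for piece in $(printf '%s' "$2" | tr ',' ' '); do
        case $piece in
            *-*)  # an a-b span: expand with seq
                for i in $(seq "${piece%-*}" "${piece#*-}"); do
                    printf '%s%s\n' "$base" "$i"
                done ;;
            *)    # a single number
                printf '%s%s\n' "$base" "$piece" ;;
        esac
    done
}

expand_range node "1-5,9,11"
```

Running it prints node1 through node5, then node9 and node11, one per line, which is exactly the set of nodes the cexec examples above would address.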

C3 can run most Linux commands. It connects to each node over SSH and executes the command there. With C3 we can easily control our cluster without having to log in to each machine separately to run our commands.
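Conceptually, cexec amounts to the loop below. This is a simplified serial sketch of the idea, not the real C3 implementation (which runs in parallel and reads its node list from /etc/c3.conf); echo is used here as a dry-run stand-in, so remove it to actually execute over ssh:

```shell
#!/bin/sh
# Simplified sketch of what cexec does conceptually: run one command
# on every node over ssh. NOT the real C3 code.
NODES="node1 node2 node3"   # stand-in for the /etc/c3.conf node list
CMD="ls -l"

for node in $NODES; do
    # Dry run: prints the ssh invocation instead of executing it.
    echo ssh "$node" "$CMD"
done
```

The serial cexecs behaves much like this loop; the parallel cexec launches the per-node connections concurrently and collects their output.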

2. Use PBS to schedule your jobs

After building your OSCAR cluster, scheduling jobs is as important as management and monitoring. A large cluster system may have thousands of nodes running multiple jobs at the same time to accomplish different tasks. Without proper planning, jobs compete for cluster resources, and in the end those resources cannot be fully or promptly utilized. On the cluster we installed, a tool named PBS is used to schedule cluster tasks.

PBS (Portable Batch System) is a flexible batch-processing system developed by NASA. It is used on cluster systems, supercomputers, and large-scale parallel systems. Besides scheduling jobs, PBS can also manage cluster resources. Briefly summarized, PBS has the following features:

• Ease of use: provides a unified interface to all resources and is easy to configure to meet the needs of different systems. A flexible job scheduler allows different systems to adopt their own scheduling policies.

• Portability: complies with the POSIX 1003.2 standard and can be used from the shell, from batch jobs, and in other environments.

• Adaptability: adapts to various management policies and provides an extensible authentication and security model. Supports dynamic distribution of load across wide-area networks, and virtual organizations built from physical entities at multiple sites.

• Flexibility: supports both interactive and batch jobs.

In practice, PBS can be used flexibly: to control tasks on a single node or on a large cluster, to balance load between multiple systems, or to run parallel and serial jobs on multi-node computers. In real deployments these uses can also be mixed.

PBS consists of three parts:

• PBS server: runs on the server node of the cluster; it controls transactions and starts and runs jobs.

• Maui scheduler: the Maui scheduling program generates a priority list from the resource status of each node and the system job information obtained through the resource manager.

• MOM daemon: a mom daemon runs on each node; it is what actually starts and stops jobs on that node.

The main PBS components are located in the $OSCAR_HOME/sbin directory on each node.

pbs_server is the PBS server, pbs_mom is the MOM daemon, and pbs_sched is the scheduler. Normally, after Linux starts, the pbs_server server and the pbs_mom daemon are brought up and wait for work. You can verify this with the ps command:

# ps -ef | grep pbs_

For more information about pbs_server, see man pbs_server. Reading that page carefully, you will find many configuration options, such as the default node: if PBS cannot find the node you specified, or no node is specified, it uses the default_node parameter. You can change server settings with the PBS tool qmgr:

 

Qmgr: s s default_node = big

Here s s is the qmgr abbreviation for "set server"; the command specifies the default_node value.
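The same setting can also be made non-interactively by piping commands into qmgr. This is a sketch assuming you run it on the PBS server host with manager privileges:

```shell
# Non-interactive qmgr session (run on the PBS server as a manager):
# sets the default node, then prints the server attributes to verify.
qmgr <<'EOF'
set server default_node = big
print server
EOF
```

This form is convenient for putting server configuration into a setup script rather than typing it at the Qmgr: prompt.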

PBS is operated mainly through its commands for managing your jobs. For details, refer to the man pages on Linux: for PBS resource limits see the pbs_resources page, and for PBS server configuration see the pbs_server_attributes page. All the commands can be found in $OSCAR_HOME/PBS/bin. Several common PBS commands and their functions are listed below:

qsub: submits your job to PBS.

qdel: deletes a PBS job.

qstat [-n]: displays the current job status and the associated nodes.

pbsnodes [-a]: displays node status.

pbsdsh: the PBS distributed shell, which runs a program on the nodes allocated to a job.

Here is a simple example to give readers a concrete feel for PBS. The following job runs the my_script.sh script with Y virtual processors (VPs) on each of X nodes; note that to submit a task you must have a script that runs it.

$ qsub -N my_jobname -e my_stderr.txt -o my_stdout.txt -q workq \
    -l nodes=X:ppn=Y:all,walltime=1:00:00 my_script.sh

The -N option specifies the job name, -e the standard-error file, -o the standard-output file, and -q the queue name; -l is followed by the resource limits, here the nodes, the number of VPs per node, and the maximum running time. The nodes specification deserves a closer look. The value of the nodes parameter is a list of node definitions joined by "+" signs, and the attributes within each definition are separated by colons. The general form is node_count:node_name(optional):ppn=<VPs per node>:all (or another resource). For example, 2:red:ppn=2 means that on two machines whose node name is red, each machine starts two VPs to run the job. A VP is a virtual processor: by default one CPU runs one VP, but you can start multiple VPs on a single CPU. PBS schedules in terms of VPs rather than physical CPUs.
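The -l resource string is easy to mis-quote on the command line. One way to keep it readable (a sketch; X=2 and Y=2 are hypothetical values for the example above) is to build it in a variable first and echo the final command as a dry run before submitting:

```shell
#!/bin/sh
# Dry-run builder for the qsub example above. X nodes and Y VPs per
# node are hypothetical values; remove the echo to actually submit.
X=2
Y=2
RESOURCES="nodes=${X}:ppn=${Y}:all,walltime=1:00:00"

echo qsub -N my_jobname -e my_stderr.txt -o my_stdout.txt \
     -q workq -l "$RESOURCES" my_script.sh
```

Once the printed command line looks right, dropping the echo submits the job for real.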

After submitting the job with the command above, let us look at the script my_script.sh:

#!/bin/sh
echo launch node is `hostname`
pbsdsh /path/to/my_executable

The echo line prints the name of the launching machine into the output file, so the administrator knows which computer ran the script; pbsdsh then launches the program on the nodes allocated to the job.
