Basic configuration and tuning policies for high-availability and high-performance computer systems

Qingdao Railway Station Computing Center: Yu Dong, Liu Mei

Computer users need a computing environment that is reliable and whose performance is sufficient to handle varied loads in complex environments. Users run various applications on the system, and the application software activates preset program segments in response to commands sent from the human-machine interface. Coordinated operation of the whole system, minimal downtime, and sustained performance together make up appropriate system performance. In addition, the computer system must be able to absorb increased load without degrading performance.

This article describes how to configure a system and includes the following information:

  • General terms related to performance and availability
  • How to achieve high performance and high availability
  • How to plan your configuration
  • Configuration and tuning suggestions
  • Procedure for configuring and tuning the system

This article uses Digital UNIX as its reference environment and covers system monitoring, identification of performance anomalies, CPU optimization, and tuning of the virtual memory and network subsystems. Other system environments provide different tools and methods; that is generally understood and is not discussed further here.

1.1 Terms and concepts

1.1.1 System Configuration: the combined hardware and software configuration of a standalone system or a cluster. CPUs, memory boards, the operating system, and mirrored disks are all examples of configuration objects. The host can be a single-CPU or a multi-CPU system. In a multi-CPU system, the processors share physical memory, run the same copy of the operating system, access common memory, and execute instructions in parallel. Some environments, such as LDB (large database) environments, require multiprocessor systems with large memory and storage configurations to handle the load. A VLM (very large memory) system uses a 64-bit architecture with multiple CPUs and at least 2 GB of memory. A VLDB (very large database) system is a VLM system that also requires a large and complex storage configuration. The following is a typical VLM/VLDB configuration:

  • SMP with more than two high-speed CPUs;
  • More than 4 GB physical memory;
  • Multiple high-performance host bus adapters;
  • High-performance and high-availability RAID storage Configuration.

Virtual memory: various algorithms control how memory is allocated to each process. Virtual memory is backed by a portion of physical memory together with swap space on disk.

Page: the smallest unit of physical memory that the system can allocate. A page is 8 KB.

Virtual memory operations include paging and swapping.

Paging: reclaiming pages so that they can be reused.

Swapping: writing the modified (dirty) pages of a suspended process to swap space on disk, which can free a large amount of memory.
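As a small worked example of the page arithmetic, the sketch below counts how many 8 KB pages make up a given amount of physical memory. The function name and the sample sizes are illustrative assumptions and are not part of Digital UNIX.

```python
PAGE_SIZE = 8 * 1024  # bytes; the 8 KB page size stated in the text
GB = 1024 ** 3


def pages_for(memory_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Return the number of whole pages needed to hold memory_bytes."""
    if memory_bytes < 0:
        raise ValueError("memory_bytes must be non-negative")
    # Round up: a partially used page still occupies a whole page.
    return (memory_bytes + page_size - 1) // page_size


print(pages_for(2 * GB))  # pages in the 2 GB VLM minimum
print(pages_for(4 * GB))  # pages in a 4 GB VLM/VLDB configuration
```

Swapping out a process frees all of its resident pages at once, which is why it can release far more memory than reclaiming individual pages.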

After configuring a system, you may need to adjust system parameters to improve performance. Kernel variables affect the behavior and performance of the kernel, the virtual memory and I/O subsystems, and application software. Changing a kernel variable while the system is running changes the kernel temporarily; changing attribute values changes kernel behavior permanently.

If tuning fails to achieve the expected results, you must reconfigure the system: add CPUs or memory, change the storage configuration, or modify the application software. Choosing among these requires you to determine where the overall bottleneck lies.

1.1.2 System Performance

System performance depends on efficient use of system resources, both hardware and software: CPUs, memory, network components, and disk storage.

First, the system must perform well under normal load. Second, when the load changes over time, for example when new users are added or additional applications are run, you must reconfigure the system to cope with the increased load.

Scalability means that adding resources yields the expected increase in performance, or that the system can absorb increased load with its existing resources without significant performance degradation.

A bottleneck is a component of the system configuration that restricts the flow of information through the system. Potential bottlenecks include the virtual memory subsystem and the I/O buses. A bottleneck occurs when the load demands more of a resource than the resource can provide. The capacity of a resource is its maximum theoretical throughput.
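The definition of a bottleneck (demand exceeding what a resource can provide) can be sketched as a simple comparison of demand against capacity. The resource names and numbers below are hypothetical, chosen only to illustrate the idea.

```python
def find_bottlenecks(demand: dict, capacity: dict) -> list:
    """Return (resource, utilization) pairs whose demand exceeds capacity, worst first."""
    overloaded = []
    for name, cap in capacity.items():
        if cap <= 0:
            raise ValueError(f"capacity of {name!r} must be positive")
        utilization = demand.get(name, 0.0) / cap
        if utilization > 1.0:
            overloaded.append((name, utilization))
    # The most heavily overloaded resource comes first.
    return sorted(overloaded, key=lambda item: item[1], reverse=True)


# Hypothetical capacities and measured demand for one system.
capacity = {"cpu_percent": 100.0, "memory_mb": 4096.0, "disk_iops": 400.0}
demand = {"cpu_percent": 70.0, "memory_mb": 5120.0, "disk_iops": 600.0}

for resource, utilization in find_bottlenecks(demand, capacity):
    print(f"{resource}: demand is {utilization:.2f}x capacity")
```

In this made-up case the disk subsystem is the worst bottleneck, so adding CPUs would not help.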

Performance is often described in terms of bandwidth and throughput. Bandwidth is the rate at which the I/O subsystem transfers bytes of data (the transfer rate). Bandwidth is particularly important for applications that perform large sequential data transfers. Throughput is the rate at which the I/O subsystem completes I/O operations, and it matters most for applications that perform many small I/O operations.
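The difference between the two measures can be illustrated with a short calculation. The workloads below are invented figures, and the function names are assumptions made for this sketch.

```python
def bandwidth_mb_per_s(bytes_transferred: int, seconds: float) -> float:
    """Bandwidth: bytes moved per second, expressed in MB/s (1 MB = 10**6 bytes)."""
    return bytes_transferred / seconds / 1e6


def throughput_ops_per_s(operations: int, seconds: float) -> float:
    """Throughput: I/O operations completed per second."""
    return operations / seconds


# Hypothetical sequential workload: 100 transfers of 1 MB each in 2 seconds.
seq_bw = bandwidth_mb_per_s(100 * 1_000_000, 2.0)
seq_tp = throughput_ops_per_s(100, 2.0)

# Hypothetical random workload: 2000 transfers of 8 KB each in 2 seconds.
rnd_bw = bandwidth_mb_per_s(2000 * 8192, 2.0)
rnd_tp = throughput_ops_per_s(2000, 2.0)

print(f"sequential: {seq_bw:.1f} MB/s, {seq_tp:.0f} ops/s")
print(f"random:     {rnd_bw:.1f} MB/s, {rnd_tp:.0f} ops/s")
```

The sequential workload has the higher bandwidth and the random workload the higher throughput, which is why the two figures have to be read together.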

Performance can also be measured by latency, the total time needed to complete a specified operation. Latency is also called delay. High performance requires low latency. I/O latency is measured in milliseconds, and memory latency in nanoseconds. Memory latency depends on the memory configuration.

1.1.3 Disk Performance

Disk performance is described by disk access time.

Disk access time is determined by seek time and rotational latency.

Seek time: the time the head takes to move to the specified track.

Rotational latency: the time the disk takes to rotate to the specified sector.
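A rough estimate of access time follows from these two components. The sketch below assumes the common rule of thumb that average rotational latency is half a revolution; the seek time and spindle speed used are example values, not figures from this article.

```python
def average_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency, assuming the target sector is half a turn away."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    ms_per_revolution = 60_000.0 / rpm
    return 0.5 * ms_per_revolution


def average_access_time_ms(seek_ms: float, rpm: float) -> float:
    """Access time = seek time + rotational latency (transfer time ignored)."""
    return seek_ms + average_rotational_latency_ms(rpm)


# Example values only: an 8 ms average seek on a 7200 rpm disk.
print(f"{average_access_time_ms(8.0, 7200):.2f} ms")
```

Because both components are mechanical, access time is measured in milliseconds, which is why reducing the number of disk operations matters so much.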

The UBC (Unified Buffer Cache) affects disk I/O performance. The UBC uses part of physical memory to cache the most recently accessed file system data. Acting as a layer between the operating system and the storage subsystem, the UBC can reduce the number of disk operations.
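To show how a cache of recently accessed data reduces disk operations, here is a toy block cache. It uses least-recently-used eviction purely as an assumption for illustration; the article does not describe the replacement policy the UBC actually uses, and the class name is hypothetical.

```python
from collections import OrderedDict


class BlockCache:
    """Toy cache of recently read blocks (LRU eviction assumed for illustration)."""

    def __init__(self, capacity_blocks: int):
        if capacity_blocks <= 0:
            raise ValueError("capacity_blocks must be positive")
        self.capacity = capacity_blocks
        self._blocks = OrderedDict()
        self.hits = 0
        self.disk_reads = 0

    def read(self, block_no: int) -> None:
        if block_no in self._blocks:
            # Cache hit: no disk operation, just mark the block as recently used.
            self._blocks.move_to_end(block_no)
            self.hits += 1
            return
        # Cache miss: the block has to be fetched from disk.
        self.disk_reads += 1
        self._blocks[block_no] = True
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict the least recently used block


trace = [1, 2, 3, 1, 2, 3, 1, 2, 3]
large, small = BlockCache(4), BlockCache(2)
for block in trace:
    large.read(block)
    small.read(block)
print(large.disk_reads, small.disk_reads)
```

With enough memory the working set stays cached and only the first pass touches the disk; with too little, every access goes to disk.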

Disk I/O performance also depends on the I/O characteristics of the application software: the amount of data transferred and the direction of data flow (reads from disk or writes to disk).

Data transfers follow different access patterns. A sequential access pattern reads or writes contiguous or adjacent blocks on the disk. A random access pattern reads or writes blocks that are not contiguous.
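One simple way to tell the two patterns apart in a trace of block numbers is to count how often an access lands on the block immediately after the previous one. This is a hypothetical heuristic written for this article, not a Digital UNIX tool.

```python
def sequential_fraction(block_numbers: list) -> float:
    """Fraction of accesses that hit the block right after the previous access."""
    if len(block_numbers) < 2:
        return 0.0
    adjacent = sum(
        1 for prev, cur in zip(block_numbers, block_numbers[1:]) if cur == prev + 1
    )
    return adjacent / (len(block_numbers) - 1)


print(sequential_fraction([10, 11, 12, 13, 14]))  # purely sequential trace
print(sequential_fraction([10, 500, 3, 77, 78]))  # mostly random trace
```

A value near 1 suggests a sequential workload that benefits from bandwidth; a value near 0 suggests a random workload that is limited by access time.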

In addition, data transfers include both file system data and raw I/O. Raw I/O bypasses buffers and caches and can therefore provide better performance than file system I/O. Raw I/O is often used for operating system I/O and for database application I/O.

Disk I/O performance is also affected by RAID technology. RAID can provide better disk I/O performance and better data reliability. Digital UNIX supports RAID in software through LSM (Logical Storage Manager); it also supports hardware-based RAID products, which provide RAID functions by using intelligent controllers and caches.

There are four basic RAID levels:

RAID 0: also known as disk striping. RAID 0 divides data into blocks and distributes the blocks across the disks that make up a disk group. Spreading the I/O load across disks and controllers improves disk I/O performance.

RAID 1: disk mirroring, which maintains identical copies of the data on different disks of the same disk group. Keeping copies on different disks can improve disk read performance and improves data security.

RAID 3: a parity RAID. RAID 3 also divides data into blocks and distributes them across different disks, which provides parallel data access and increased bandwidth. RAID 3 places redundant parity information on a dedicated disk. When a disk fails, the parity information is used to regenerate the data. RAID 3 therefore provides higher data security and availability.

RAID 5: also a parity RAID. RAID 5 allows independent access to data and can handle simultaneous I/O operations, which improves throughput.
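The two mechanisms behind these levels, striping and parity, can be sketched in a few lines. The code below assumes XOR parity, which is the usual way parity RAID regenerates a lost block; the block size, the data, and the function names are made up for illustration.

```python
from functools import reduce
from operator import xor


def stripe(data: bytes, disks: int, block_size: int) -> list:
    """RAID 0 style layout: deal fixed-size blocks round-robin onto the disks."""
    layout = [[] for _ in range(disks)]
    for offset in range(0, len(data), block_size):
        disk = (offset // block_size) % disks
        layout[disk].append(data[offset:offset + block_size])
    return layout


def xor_blocks(blocks: list) -> bytes:
    """XOR equal-length blocks byte by byte (used for parity and for rebuilding)."""
    if not blocks:
        raise ValueError("need at least one block")
    if any(len(block) != len(blocks[0]) for block in blocks):
        raise ValueError("blocks must have equal length")
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Striping: consecutive blocks land on alternating disks.
print(stripe(b"abcdefgh", disks=2, block_size=2))

# Parity: compute it from the data blocks, then rebuild a lost block.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])  # disk 1 has failed
print(rebuilt == data_blocks[1])
```

Rebuilding requires reading every surviving disk, which is why a parity set performs worse while one of its members has failed.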

1.1.4 High Availability

High availability is the ability of system resources to withstand hardware and software failures. Resources such as systems or disk data can be made highly available through redundancy or multiple copies.

For example, disk mirroring makes disk data highly available: if the original disk fails, the copy on the mirror disk can be used. With parity RAID, the redundant data is stored as parity information, from which damaged data can be regenerated after a failure.

In addition, redundant network connections make the network more available. Network availability depends on the application software, the network configuration, and the network protocols.

To make the system as a whole more available, you must set up a cluster. A cluster is a loosely coupled group of servers, each of which is a cluster member. An application can run on any member server, and large applications can run on several member servers at once. A high-performance interconnect between cluster members supports fast and reliable data communication among them.

Cluster application failover ensures high availability. When a member system fails, all cluster-configured applications running on that system are transferred to another member server, where they continue to run.
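The failover behaviour can be sketched as moving every application from the failed member to a surviving one. The placement policy below (pick the least loaded survivor) and all member and application names are assumptions for illustration; the article does not say how the cluster software chooses the target.

```python
def fail_over(placement: dict, failed: str, members: list) -> dict:
    """Return a new application -> member placement after `failed` goes down."""
    survivors = [m for m in members if m != failed]
    if not survivors:
        raise RuntimeError("no surviving cluster member to fail over to")
    load = {m: 0 for m in survivors}
    for app, member in placement.items():
        if member == failed:
            continue
        if member not in load:
            raise ValueError(f"{app!r} is placed on unknown member {member!r}")
        load[member] += 1
    new_placement = dict(placement)
    # Move each affected application to the least loaded survivor (name breaks ties).
    for app in sorted(a for a, m in placement.items() if m == failed):
        target = min(survivors, key=lambda m: (load[m], m))
        new_placement[app] = target
        load[target] += 1
    return new_placement


members = ["node1", "node2", "node3"]
placement = {"db": "node1", "web": "node2", "batch": "node1"}
after = fail_over(placement, failed="node1", members=members)
print(after)
```

Applications that were not on the failed member stay where they are, so users of those applications see no change.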

To protect a configuration fully against failure, every point of failure must be eliminated:

  • Use a cluster to protect against system failure.
  • Use two network connections to protect against network failure.
  • Mirror disks across different buses to protect against disk, bus, or adapter failure.

Where availability requirements are higher, multiple layers of redundancy can protect against multiple failures.

Availability is also measured by the reliability of resources. Reliability is the average time a component runs before a failure or a loss of data. It can be expressed as MTDL (mean time to data loss), MTTF (mean time to first failure), or MTBF (mean time between failures).
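As a worked example, MTBF can be computed from observed running times between failures. The availability formula used here, MTBF / (MTBF + MTTR), is the standard steady-state estimate; the repair time (MTTR) is not discussed in this article and is introduced only as an assumption, as are the sample numbers.

```python
def mtbf_hours(uptimes_hours: list) -> float:
    """Mean time between failures: average running time between observed failures."""
    if not uptimes_hours:
        raise ValueError("need at least one observed uptime")
    return sum(uptimes_hours) / len(uptimes_hours)


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability, assuming a mean time to repair (MTTR)."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)


# Hypothetical observations: hours of operation between three failures.
observed = [900.0, 1100.0, 1000.0]
mtbf = mtbf_hours(observed)
print(f"MTBF: {mtbf:.0f} h, availability: {availability(mtbf, mttr=2.0):.4%}")
```

The calculation shows why redundancy matters: shortening the outage after a failure raises availability just as surely as making failures rarer.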

1.2 Refined High Availability

A highly available resource can withstand specific hardware and software failures in a way that is transparent to users. This is achieved through resource redundancy and failover mechanisms.

The system configuration depends on how critical your system is and on the characteristics of its software and hardware. An environment that rarely fails, or that can tolerate downtime, needs only data-level high availability, while a demanding system must have no single point of failure, with redundant resources backing up every important resource.

Figure 1 shows a configuration with multiple potential points of failure (system, network, disk, and bus):

1.2.1 Eliminating failure points:

  • System failure:

    To protect against the failure of a single system, configure a cluster with at least two members. If one member fails, applications can still run on the other member. A two-member cluster is not fully highly available, however: once one member has crashed, the remaining member becomes a potential point of failure. A highly available configuration therefore needs at least three member systems.

  • Disk failure:

    To protect against disk failure, mirror disks or use parity RAID.

  • Host bus adapter or bus failure:

    Mirror data across different buses.

  • Network connection failure:

    Network connections fail because of network interface or network problems. Install multiple network interfaces in a system, and make sure that the application software supports them.

  • Power failure:

    Systems and storage units are vulnerable to power failure. Use redundant power supplies in both the host system and the storage units, and use a UPS in the machine room.

  • Cluster interconnect failure:

    Use two cluster interconnect communication channels.

Fully redundant cluster configuration: see Figure 2.
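The checklist of failure points can be turned into a simple configuration check. The component names, the inventory format, and the minimum counts below are assumptions derived from the list above (three cluster members, two of everything else); they do not correspond to any real Digital UNIX tool.

```python
# Minimum count of each component needed to avoid a single point of failure.
REQUIRED_MINIMUM = {
    "cluster_members": 3,        # two members leave one exposed after a crash
    "network_interfaces": 2,
    "disk_mirrors": 2,
    "host_bus_adapters": 2,
    "power_supplies": 2,
    "cluster_interconnects": 2,
}


def single_points_of_failure(config: dict) -> list:
    """Return the components whose count is below the required minimum, sorted by name."""
    return sorted(
        name
        for name, minimum in REQUIRED_MINIMUM.items()
        if config.get(name, 0) < minimum
    )


# Hypothetical inventory of one configuration.
config = {
    "cluster_members": 2,
    "network_interfaces": 2,
    "disk_mirrors": 2,
    "host_bus_adapters": 1,
    "power_supplies": 2,
    "cluster_interconnects": 2,
}
print(single_points_of_failure(config))
```

An empty result would correspond to the fully redundant configuration; anything listed is a component whose failure could still take the service down.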

1.2.2 Methods and costs of improving system availability:

  • Method: configure a cluster with at least two members.

    Cost: additional hardware and software expense, and increased management complexity.

  • Method: use the latest versions of hardware, firmware, and the operating system.

    Cost: possible downtime during upgrades.

  • Method: configure multiple network connections.

    Cost: additional hardware cost, and sufficient I/O expansion slots.

  • Method: configure a redundant cluster interconnect.

    Cost: additional hardware cost, and a PCI slot.

  • Method: use a backup UPS.

    Cost: UPS hardware cost.

1.2.3 Methods and costs of improving data availability:

  • Method: mirror disks.

    Cost: additional disk cost, and reduced write performance.

  • Method: use parity RAID.

    Cost: additional software and hardware cost, increased management complexity, and variable write performance.

  • Method: mirror data between disks on different buses.

    Cost: additional I/O bus slots and hardware cost.

  • Method: configure a cluster.

    Cost: additional hardware and software expense, and increased management complexity.

  • Method: use a backup UPS.

    Cost: UPS cost.

1.2.4 Impact of high availability on system performance:

  • Solution: mirroring.

    Impact: improves disk read performance but reduces write performance.

  • Solution: mirroring across buses.

    Impact: prevents a single bus from becoming an I/O bottleneck.

  • Solution: parity RAID.

    Impact: improves disk I/O performance while all disks in the RAID set are available; performance is reduced when a disk in the set has failed.

  • Solution: redundant network connections.

    Impact: improves network performance and increases client access.

  • Solution: cluster.

    Impact: distributing the load across different systems provides more CPU and memory resources for applications and users, which improves the performance of the system as a whole.

1.3 Refined High Performance

The system must perform reliably to meet the needs of users and applications. The system configuration must first respond quickly to the current load, and it should also leave performance headroom for increased load.

A scalable system environment allows users to add hardware to improve performance or to absorb load increases.

System performance depends on the interaction between the hardware and software configuration and the load. A good system makes full and efficient use of CPU, memory, and I/O resources. Once a resource reaches its capacity limit, it may reduce performance and become a bottleneck. Bottlenecks are interrelated. For example, insufficient memory may cause excessive paging and swapping, which in turn may cause a disk I/O bottleneck.

To configure a system that meets its performance requirements, you must determine which resource would have the greatest impact on system performance if it fell short. If the application software is CPU-intensive, configure multiple CPUs and high memory bandwidth. If the application is memory-intensive, plan the memory capacity carefully; otherwise, a shortage of memory will reduce the performance of the entire system. If the application software performs a large amount of disk I/O, give priority to avoiding disk and bus bottlenecks when configuring the system. If the system is an Internet server, make sure that it can handle a large number of concurrent network requests. In addition, if you want a configuration that is both highly available and high-performance, you must also consider the negative impact that high-availability configurations can have on performance.

The possible causes of performance problems are:

  • Improper kernel variable values

    Modify kernel variable values to match your configuration and load in order to optimize performance.

  • Configuration not tailored to the load

    If tuning fails to improve performance, the system configuration may not suit the characteristics of the load, and the configured resources may not meet its demands. For example, you may need to add CPU or memory resources, upgrade to higher-performance hardware, or add disks.

  • Disk fragmentation

    File data that is stored discontiguously on disk may reduce read/write performance.

  • Poorly written or unoptimized application software

    If you establish that an application is the cause of a performance problem, you must rewrite or optimize the software.

1.4 Planning and Configuration

To plan your Digital UNIX configuration, follow these steps:

  1. Understand your load and the characteristics of your application software
  2. Determine your availability and performance requirements
  3. Select hardware and software configurations that meet both the availability and the performance requirements

Understanding load characteristics:

1. Application characteristics:

  • What is the purpose of the application?
  • What are its performance requirements?
  • Is the application CPU-intensive?
  • How much memory does the application require?
  • What are the application's disk storage requirements?
  • Does the application require high bandwidth or high throughput?
  • Does the application perform many small data transfers?
  • What is the average data transfer size?
  • What percentage of data transfers are reads?
  • What percentage of data transfers are writes?
  • Does the application perform many network operations?
  • System availability requirements
  • Data availability requirements
  • Network availability requirements

2. User characteristics:

  • User type: data processing or code compilation?
  • Number of users
  • User goals
  • Tasks performed by users
  • Applications run by users
  • Data storage requirements
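Several of the application questions above (average transfer size, read and write percentages) can be answered from a record of I/O operations. The record format and the function below are hypothetical; they only show the arithmetic involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoRecord:
    kind: str        # "read" or "write"
    size_bytes: int  # amount of data moved by this transfer


def summarize(records: list) -> dict:
    """Return the average transfer size and the read/write ratios of a workload."""
    if not records:
        raise ValueError("need at least one I/O record")
    for record in records:
        if record.kind not in ("read", "write"):
            raise ValueError(f"unknown I/O kind: {record.kind!r}")
    total = len(records)
    reads = sum(1 for record in records if record.kind == "read")
    return {
        "average_transfer_bytes": sum(r.size_bytes for r in records) / total,
        "read_ratio": reads / total,
        "write_ratio": (total - reads) / total,
    }


# Made-up sample: three 8 KB reads and one 64 KB write.
sample = [IoRecord("read", 8192)] * 3 + [IoRecord("write", 65536)]
profile = summarize(sample)
print(profile)
```

A read-heavy profile like this one is a good candidate for mirroring, which improves read performance at the expense of writes.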

1.5 Basic configuration and tuning recommendations:

The Digital UNIX utility sys_check collects performance-related information and outputs it in a readable format. sys_check helps users check their configuration and kernel variable settings, and it provides warnings and tuning suggestions when necessary.

sys_check can be obtained from: ftp://ftp.digital.com/pub/DEC/IAS/sys_check

Basic tuning suggestions are listed below:

  • Operating system and kernel recommendations

    1. Make sure you are running the latest operating system version (DECevent tool).
    2. Make sure adapters, controllers, and disks are running the latest firmware (DECevent tool).
    3. Make sure important applications have a high priority (use the nice command and the class scheduler to set CPU priority).
    4. Modify kernel variables to suit this configuration.
  • Memory recommendations:
    1. Make sure there is enough memory.
    2. Make sure there is sufficient swap space, and distribute the swap space across different disks and buses.
    3. Increase the address space available to processes.
    4. Increase the system resources available to processes.
    5. Reduce application memory requirements.
    6. If the system performs only a small amount of disk I/O, reduce the amount of memory allocated to the UBC (Unified Buffer Cache).
    7. Modify the swapping rate.
    8. Change the rate at which dirty pages are prewritten.
  • Disk and I/O recommendations:
    1. Use high-performance hardware.
    2. Distribute disk I/O and file systems across disks and buses.
    3. Defragment file systems.
    4. For disk I/O-intensive applications, increase the memory allocated to the UBC.
    5. Increase the maximum number of simultaneously open files.
    6. Increase the size of the namei cache.
  • AdvFS recommendations:
    1. Use multivolume file domains.
    2. Increase the memory allocated to the AdvFS buffer cache.
    3. Threshold)
    4. Reduce the read-ahead size.
  • UFS recommendations:
    1. Modify the file system fragment size.
    2. Increase the size of the metadata buffer cache.
  • Network recommendations:
    1. Increase the size of the hash table that the kernel uses to look up TCP/IP control blocks.
    2. Increase the limit on partial TCP connections in the socket listen queue.
    3. Increase the maximum number of dynamically allocated ports.
  • NFS recommendations:
    1. Make sure that enough nfsd daemons are running on the server.
    2. Make sure that enough nfsiod daemons are running on the client.

1.6 Procedure for configuring and tuning the system

1. Configure the system:

Make sure that you understand the requirements of your environment, and select a configuration that meets them. Then configure the hardware, the operating system, layered products, and application software.

2. Complete all recommended initial tuning:

Some configurations must be tuned immediately after they are set up. For example, if the system is used as an Internet server, the default system parameters and attributes must be modified as soon as configuration is complete.

3. Monitor system performance:

If performance is acceptable, keep monitoring it. Resource usage changes over time, and performance degrades when a resource reaches its capacity; a significant change in the environment may also affect performance. If performance is unacceptable, determine the source of the problem.

4. Analyze the causes of performance problems:

Use the tools provided by Digital UNIX to locate problems.

5. Determine a tuning plan to eliminate the performance problems:

If every possible tuning option has been exhausted and performance still does not improve, you must reconfigure the system.

6. Eliminate the performance problems:

First, try simple, no-cost solutions, such as running some applications during off-peak hours or restricting disk access.

Then try more complex or expensive solutions, such as tuning or adding hardware.

If you are sure that the CPU and the application software have been optimized, focus on tuning the virtual memory subsystem for the best performance. If the problem persists, tune the I/O subsystem next. Tuning generally involves modifying kernel attributes, but you can also use system management tools, for example to defragment file systems or to change the stripe width.

7. Monitor system performance:

After tuning, observe system performance carefully to determine whether the tuning produced the expected results. Figure 3 describes the configuration and tuning process.
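The decision logic in steps 3 through 7 can be summarized as a small function. The inputs and the wording of the returned actions are assumptions chosen to mirror the procedure above; this is a sketch of the flow, not a monitoring tool.

```python
def next_action(performance_ok: bool, cause_found: bool, tuning_options_left: bool) -> str:
    """Pick the next step of the monitor / analyze / tune / reconfigure cycle."""
    if performance_ok:
        return "keep monitoring"                      # step 3: nothing to fix yet
    if not cause_found:
        return "analyze the cause"                    # step 4: locate the problem
    if tuning_options_left:
        return "apply a tuning option, then monitor"  # steps 5 to 7
    return "reconfigure the system"                   # tuning is exhausted


print(next_action(performance_ok=True, cause_found=False, tuning_options_left=True))
print(next_action(performance_ok=False, cause_found=True, tuning_options_left=False))
```

Reconfiguration appears only as the last branch, which reflects the advice to try inexpensive measures before adding hardware.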
