Maintenance and Performance Optimization of Application Systems on the AIX Operating System

Source: Internet
Author: User

Application maintenance is meticulous work. Besides being rigorous and responsible, technical staff need strong troubleshooting skills and hands-on experience with a variety of emergencies. With the establishment of ICBC's two large data centers, ICBC's core business data has been concentrated in the data centers, and the tier-one branches now focus on maintaining peripheral systems such as the business system, the integrated front-end system, and the historical data query system. ICBC uses the AIX operating system widely. On the application maintenance side, five application systems run on AIX: the integrated front-end system, the cross-bank payment system, the customer reconciliation system, the historical data query system, and the international business settlement system. Over the past few years we have accumulated some experience in maintaining and optimizing application systems in the AIX environment, summarized here in five parts for reference by our peers.

 

  
  I. Data Security Measures of the AIX System
Data security should be considered as early as the hardware configuration of the IBM minicomputers. The system resources must be configured to meet fault-tolerance requirements. The following points should be considered:

① After a hardware fault occurs on the production host, the standby host can take over the application system automatically and immediately;
② Improve the hardware redundancy of the system and minimize the impact of single points of failure (SPOF);
③ Strengthen system backups to reduce the impact of system version upgrades.
To meet these requirements, several security measures can be taken, mainly redundant backup of hardware resources, sensible placement of system software and application software, and the use of highly reliable cluster software. Our experience is as follows. The AIX operating system and the HACMP (High Availability Cluster Multi-Processing) software are installed on rootvg. If rootvg is damaged, the system cannot run, and even restoring it from the backup tape means downtime. Therefore, when disk space allows, consider mirroring rootvg. Specifically, make the host's two built-in hard disks mirror copies of rootvg. This improves system safety and protects against the failure of a single disk: even if one built-in disk fails, the system keeps running normally. When mirroring rootvg, try to use disks attached to different SCSI adapters to balance the load. In addition, to improve fault tolerance, configure disk mirroring (RAID 1) or RAID 5 redundancy on the disk array, define it as datavg, and install the databases and applications there. To improve node reliability, set up an HACMP cluster for dual-host hot standby, that is, configure HACMP on the primary and standby hosts so that they meet the hot-standby requirement. In daily operation, back up the system regularly, and keep backups of the production machine on two or more media.

1. rootvg Configuration
Mirror hdisk0 and hdisk1 and configure them as rootvg.
① Add hdisk0 and hdisk1 to rootvg: smitty extendvg (hdisk1 and hdisk0 → rootvg).
② Create the mirror: mirrorvg -c 2 rootvg.
③ Create a boot image on hdisk0 and hdisk1: bosboot -ad /dev/hdisk0 and bosboot -ad /dev/hdisk1.
④ Change the boot device order: bootlist -m normal hdisk0 hdisk1 cd0
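
After mirroring, it can be useful to confirm that every logical volume in rootvg really has two copies. The following is a minimal sketch, not an official procedure: it assumes the usual column layout of lsvg -l output (LV NAME, TYPE, LPs, PPs, PVs, LV STATE, MOUNT POINT) and flags any logical volume whose PP count is less than twice its LP count. The function names and the sample listing are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch: list logical volumes that do not appear to have two copies.
# Reads "lsvg -l rootvg"-style text on stdin.
# Assumed columns: LV NAME, TYPE, LPs, PPs, PVs, LV STATE, MOUNT POINT.
unmirrored_lvs() {
    awk '
        # Only data rows have a numeric LPs column; this skips the
        # "rootvg:" line and the column header.
        $3 ~ /^[0-9]+$/ && $4 + 0 < 2 * $3 { print $1 }
    '
}

# Hypothetical sample listing, used only to demonstrate the function.
sample_lsvg_output() {
    cat <<'EOF'
rootvg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
hd5                 boot       1       2       2    closed/syncd  N/A
hd6                 paging     8       16      2    open/syncd    N/A
hd4                 jfs        2       2       1    open/syncd    /
hd2                 jfs        40      80      2    open/syncd    /usr
EOF
}
```

On a live system the function would be fed real output, for example lsvg -l rootvg | unmirrored_lvs. An empty result suggests that every logical volume has at least two copies.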

2. Working Principle of HACMP
HACMP manages cluster resources. Depending on the complexity and configuration of the applications, takeover of cluster resources takes from 30 to 300 seconds and requires no manual intervention. Cluster resources generally include applications, hard disks, volume groups (VG), file systems, NFS file systems, and IP addresses. Resources belong to resource groups, of which there are three types: cascading, rotating, and concurrent. Each type corresponds to a different takeover method. A cluster can contain several resource groups of different types, so resource management is varied and the configuration is flexible.
We generally use the cascading hot-standby mode. It works as follows: NodeA and NodeB are both members of resource group A, which is set to cascading mode with NodeA having the highest priority. While NodeA is active, it controls all resources in resource group A, and NodeB is idle.
When NodeA goes down, NodeB takes over resource group A. Once NodeA rejoins the cluster, NodeB releases resource group A and NodeA regains control. A failure of NodeB has no impact on resource group A.
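
The cascading rule described above can be modeled in a few lines. This is only an illustrative sketch of the ownership logic, not HACMP code: the function name cascade_owner and its two arguments (a priority-ordered node list and the list of nodes currently up) are assumptions made for the example.

```shell
#!/bin/sh
# Sketch of cascading ownership: the resource group is owned by the
# highest-priority node that is currently up.
# Usage: cascade_owner "<nodes in priority order>" "<nodes currently up>"
cascade_owner() {
    for node in $1; do              # walk the priority list, highest first
        for up in $2; do
            if [ "$node" = "$up" ]; then
                echo "$node"        # first match is the owner
                return 0
            fi
        done
    done
    echo "none"                     # no member of the group is up
    return 1
}
```

For example, cascade_owner "NodeA NodeB" "NodeA NodeB" selects NodeA, and cascade_owner "NodeA NodeB" "NodeB" selects NodeB. When NodeA reappears in the second argument the result goes back to NodeA, which mirrors the fallback behavior of the cascading mode.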

  
  II. AIX System Maintenance Experience

1. Fault Information Collection

Collecting fault information is essential for determining and diagnosing the cause of a fault and for repairing the system. We check the system error report (errlog), the error reports mailed to the root user (mail), and the contents of hacmp.out, smit.log, boot.log, and similar files to determine whether the system is faulty, and then handle the fault according to that information.

(1) System fault records
The errdemon process starts automatically at system boot and records hardware, software, and other operational information. The fault record file is /var/adm/ras/errlog. It can be backed up or copied to another machine for analysis with the errpt command (ordinary user permissions are sufficient).
# errpt | more            List brief error messages
# errpt -d H              List all hardware error messages
# errpt -d S              List all software error messages
# errpt -a -j error_id    List detailed error information
(2) LED codes on the control panel
(3) SMS (System Management Services) fault records
When the console displays the keyboard icon (the LED shows "E1F1"), press 1. Select "Utilities", then select "Error", and copy the eight-digit fault code.
(4) Mail check
After a system failure, the system sends mail to the root user reporting the error. As long as the fault has not been checked and fixed, the system keeps reminding the root user by mail at regular intervals.

(5) Run the diagnostic program
diag checks and diagnoses the system hardware. Run it as soon as a hardware fault is found.
# diag
> Select Advanced Diagnostics
> Select Problem Determination or System Verification
After diag finishes, it reports the SRN code, the name of the faulty device with a percentage, and the address code.

(6) Other commands used to collect system information
lsdev -C    View system device information
lspv        View physical volume information
lsvg        View volume group information
lslpp       View fileset information
lsattr      View device parameter settings
lscfg       View VPD (Vital Product Data) information
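
When the error log is long, a quick tally by error class shows where to look first. The sketch below assumes the summary layout printed by errpt (IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION), where the C column is H for hardware and S for software. The function name and the sample entries are hypothetical and are not taken from a real system.

```shell
#!/bin/sh
# Sketch: count errpt summary entries by class (4th column).
# Reads errpt-style summary text on stdin.
errpt_class_counts() {
    awk '
        NR == 1 { next }                 # skip the column header
        { total++; count[$4]++ }
        END {
            printf "hardware=%d software=%d other=%d\n",
                   count["H"], count["S"],
                   total - count["H"] - count["S"]
        }
    '
}

# Hypothetical sample entries, used only to demonstrate the function.
sample_errpt_output() {
    cat <<'EOF'
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
AAAA0001   0312091524 P H hdisk1         DISK OPERATION ERROR
AAAA0002   0312081024 T S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
AAAA0003   0311230024 P H ent0           ADAPTER ERROR
AAAA0004   0311220024 I O errdemon       ERROR LOGGING TURNED ON
EOF
}
```

On a live system this would be used as errpt | errpt_class_counts. A high hardware count is a hint to continue with errpt -d H and diag.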

2. System hardware fault locating
Hardware faults on IBM minicomputers are located using the checkpoints, error codes, and SRNs shown on the display panel of the cabinet. Checkpoints are the series of codes shown on the display panel while the initial program load (IPL) runs after the system is powered on.

The IPL process starts automatically after the system is powered on and consists of four steps:
① Service processor initialization. This step begins when power is applied and ends when OK appears on the display panel of the cabinet. It displays 8xxx or 9xxx checkpoint codes.
② Hardware initialization directed by the service processor. This step begins when the white power button on the cabinet is pressed. It displays 9xxx checkpoints. "91FF" is the final code and marks the start of step 3.
③ System firmware initialization. A system processor takes over control and continues to initialize system resources. This step displays Exxx codes. "E105" is the final code and marks the start of AIX boot in step 4. Various location codes are displayed during this step (a location code identifies a component of the system).
④ AIX boot. While AIX boots, the display panel shows 0xxx codes, with the location code on the second line. When the AIX login window appears on the console, step 4 ends and the display panel shows nothing further.

When an error is detected during system operation, an SRN (Service Request Number) is shown on the display panel in the form xxx-xxx and is also logged in the AIX error log. When an SSA enclosure fails, the corresponding SRN is shown on the LCD on its front panel, the yellow indicator flashes, and the error is logged in the AIX error log. Record the code when the problem occurs and report it to IBM to obtain a solution.


3. Software issue handling
Software faults are complex. The following describes several common troubleshooting methods.

(1) Insufficient file system space
Check whether any file system is full. In particular, /, /var, and /tmp should not exceed 90%. A full file system prevents the system from working normally, especially the basic AIX file systems. For example, if / (the root file system) is full, users cannot log in. Use df -k to check.
# df -k
(View the basic AIX file systems.)
Apart from /usr, file systems should not be too full, generally no more than 80%.
Solution 1: Delete junk files
# du -sk * | sort -rn | head
Find the subdirectories that occupy the most space in the current directory, and descend level by level until you find the files that occupy the most space (note which directories are mount points of other file systems and which are subdirectories of this file system). Delete the files to release space. Sometimes the space is not released immediately because the deleted file is still held open by a program. The space is released only after that program stops, and sometimes the system needs to be restarted.

Solution 2: Increase the file system size
# smitty chjfs
As long as the volume group (VG) has free space, the file system can be enlarged at any time.

(2) Check the integrity of the file system
# umount filesystem_name
# fsck -y filesystem_name
Note: Unmount the file system before checking and repairing it; otherwise the consequences are unpredictable.

(3) View the volume group information
Check whether any logical volume is in the "stale" state. If so, run the syncvg command to resynchronize it.

(4) Check paging space usage
Check whether usage exceeds 70%. If it does, use chps -s X pgname to add X PPs to an existing paging space, or use mkps -a -n -s X myvg to create a new paging space of X PPs on myvg.

(5) Memory leaks on minicomputers
A memory leak occurs when the system or an application process fails to release memory it has used, so that available memory gradually shrinks. When available memory falls to a minimum, the system or application can no longer fork child processes, and the system is paralyzed. We generally use the ps and sar commands to view overall memory and CPU usage and the trend in memory and CPU usage of individual processes. Use ps to view the basic memory and CPU figures and look for a process whose memory usage keeps growing; that process may be leaking memory.
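
One simple way to spot a process whose memory keeps growing is to compare two ps snapshots taken some time apart. The sketch below is an illustration only: it assumes each snapshot file holds three columns (PID, VSZ, COMMAND) with a one-line header, such as might be produced by ps -e -o pid,vsz,comm, and the function name growing_procs is hypothetical.

```shell
#!/bin/sh
# Sketch: report processes whose virtual size grew between two snapshots.
# Usage: growing_procs <earlier snapshot file> <later snapshot file>
# Each file: header line, then "PID VSZ COMMAND" rows.
growing_procs() {
    awk '
        FNR == 1 { next }                        # skip each file header
        NR == FNR { before[$1] = $2 + 0; next }  # first file: remember VSZ per PID
        ($1 in before) && $2 + 0 > before[$1] {
            # PID, command, earlier size, later size
            print $1, $3, before[$1], $2
        }
    ' "$1" "$2"
}
```

A single increase proves little; a process that shows up in every comparison over several hours or days is the candidate worth investigating.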


4. Management and Maintenance of the IBM HACMP dual-host hot-standby system
(1) Starting the HACMP dual-host system

Log on to each node as the root user and run # smit clstart.
(2) Stopping the HACMP dual-host system

Log on to each node as the root user and run # smit clstop.
(3) Querying the status of the HACMP dual-host system

Only when the operator knows the current status of the running dual-host system can abnormal situations be recovered and high availability and fault tolerance be ensured. To query the status, log on to the node to be queried as the root user. Use the # lssrc -g cluster command to check whether the HACMP software has started on this node. Three entries shown as active indicate that the HACMP software has started normally.
Once HACMP has started normally, run # /usr/sbin/cluster/clstat -a on the command line to view the current status of the dual-host system.
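
The check for three active entries can be scripted. This is a minimal sketch that assumes the usual lssrc layout (Subsystem, Group, PID, Status, with the status in the last column). The function name and the sample listing are hypothetical, and the subsystem names in the sample are illustrative only.

```shell
#!/bin/sh
# Sketch: count subsystems reported as "active" in lssrc-style output.
# Reads "lssrc -g cluster"-style text on stdin.
active_subsystems() {
    awk 'NR > 1 && $NF == "active" { n++ } END { print n + 0 }'
}

# Hypothetical sample listing, used only to demonstrate the function.
sample_lssrc_output() {
    cat <<'EOF'
Subsystem         Group            PID          Status
 clstrmgr         cluster          15480        active
 clsmuxpd         cluster          16514        active
 clinfo           cluster          17032        active
EOF
}
```

On a node this would be run as lssrc -g cluster | active_subsystems. A result lower than the expected three suggests that one of the cluster daemons has not started.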

5. Network Fault Handling
(1) Diagnostic process for network disconnection
Use ifconfig to check whether the network interface is enabled (up). Use netstat -i to check the interface status and whether Ierrs/Ipkts or Oerrs/Opkts exceeds 1%. Ping the local interface address, then ping the addresses of other machines. If the pings fail, run diag on the machine to check whether the NIC is faulty.

Within the same network, the subnet masks should be consistent.
(2) Basic Network Configuration Methods
① To modify the network address and host name, use the chdev command.
# chdev -l inet0 -a hostname=myhost
# chdev -l en0 -a netaddr='112.0.15.1' -a netmask='255.255.255.0'
② View the NIC status: # lsdev -Cc if
③ Confirm the network address: # ifconfig en0
④ Enable the NIC: # ifconfig en0 up

⑤ Configure the route. There are two ways to add a route:
A. Permanent route
# chdev -l inet0 -a route='112.1.15.2','112.0.15.254'
B. Temporary route
# route add 112.1.15.2 112.0.15.254
Run netstat -rn to view the routing table.
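
The 1% error-ratio rule from the diagnostic process in (1) can be checked mechanically. The sketch below assumes that the last five columns of netstat -i output are Ipkts, Ierrs, Opkts, Oerrs, and Coll, and counts from the end of each line because the Address column can be empty. The function name and the sample figures are hypothetical.

```shell
#!/bin/sh
# Sketch: list interfaces whose input or output error ratio exceeds 1%.
# Reads "netstat -i"-style text on stdin.
nic_error_report() {
    awk '
        NR == 1 || NF < 6 { next }       # skip header and short lines
        {
            ipkts = $(NF-4) + 0; ierrs = $(NF-3) + 0
            opkts = $(NF-2) + 0; oerrs = $(NF-1) + 0
            bad = 0
            if (ipkts > 0 && ierrs / ipkts > 0.01) bad = 1
            if (opkts > 0 && oerrs / opkts > 0.01) bad = 1
            if (bad && !seen[$1]++) print $1   # report each interface once
        }
    '
}

# Hypothetical sample listing, used only to demonstrate the function.
sample_netstat_output() {
    cat <<'EOF'
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en0   1500  link#2      0.9.6b.3e.1a.2c   100000  2500    90000     0     0
en0   1500  112.0.15    myhost            100000  2500    90000     0     0
en1   1500  link#3      0.9.6b.3e.1a.2d    50000    10    40000     5     0
lo0   16896 link#1                          2000     0     2000     0     0
EOF
}
```

On a live system this would be used as netstat -i | nic_error_report. An interface that appears in the output is a candidate for the diag check.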

6. IBM inspection procedure

IBM's inspection procedure is a useful guide to the items we should focus on in daily maintenance.
(1) Check the system hardware status: whether any device fault indicator is lit.
(2) Check the system error report.
(3) Check for error reports mailed to the root user.
(4) Check hacmp.out, smit.log, and boot.log.
(5) Check that usage of key file systems does not exceed 80%.
(6) Check whether any logical volume is stale.
(7) Check whether paging space utilization exceeds 70%.
(8) Check whether the paging space is 1.5 times the physical memory size.
(9) Check the backup status (whether system backups and user data backups exist, and whether the tape drive needs cleaning).
(10) Check communication settings (NIC, IP address, routing table, ping, /etc/hosts, DNS settings, and so on).
(11) Check whether data is protected by RAID 10, RAID 5, or a similar method, and whether there is a hot spare.
(12) Check whether the system dump settings are correct.
(13) Check whether the system parameters are correct.
(14) Check whether rootvg is mirrored.
(15) Check whether errdemon and srcmstr are running normally.
(16) Check the machine room environment (voltage and humidity).
(17) Check system performance: whether there is a performance bottleneck (topas, vmstat).
(18) Check patches (PTF) and microcode (whether an upgrade is required).
(19) Run the HACMP test: cluster verification.
(20) Diagnose the system hardware: run the diagnostic program (diagnostics).

  III. Optimization of AIX System Parameters
The AIX kernel is dynamic, and most core parameters adjust themselves automatically. After the system is installed, the parameters that generally need to be modified are the following.

1. maxlogin

Set maxlogin according to the number of users.
Change it with the chlicense command; the value is recorded in the /etc/security/login.cfg file. The change takes effect after the system restarts.

2. Limits parameters for system users

These parameters are in the /etc/security/limits file. A parameter can be set to -1 (no limit). Edit /etc/security/limits with vi. All changes take effect the next time the user logs in.


3. Paging Space
Check the size of the paging space. When physical memory is less than 2 GB, the paging space should be at least 1.5 times the physical memory. When physical memory is larger than 2 GB, adjust it as appropriate. When creating paging spaces, spread them across different hard disks as far as possible to improve performance. Use smitty chps to change the size of an existing paging space, or smitty mkps to add a paging space.
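
Before resizing, it helps to see how full each paging space currently is. The sketch below applies the 70% utilization guideline used elsewhere in this article. It assumes the lsps -a column layout (Page Space, Physical Volume, Volume Group, Size, %Used, ...), with the usage figure in the fifth column. The function name and the sample listing are hypothetical.

```shell
#!/bin/sh
# Sketch: list paging spaces whose utilization exceeds a limit (default 70).
# Reads "lsps -a"-style text on stdin.
# Usage: busy_paging_spaces [limit-percent]
busy_paging_spaces() {
    limit=${1:-70}
    awk -v limit="$limit" '
        NR > 1 && $5 + 0 > limit + 0 { print $1, $5 "%" }
    '
}

# Hypothetical sample listing, used only to demonstrate the function.
sample_lsps_output() {
    cat <<'EOF'
Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
paging00        hdisk1            datavg        1024MB    82   yes   yes    lv
hd6             hdisk0            rootvg         512MB    35   yes   yes    lv
EOF
}
```

On a live system this would be used as lsps -a | busy_paging_spaces 70. Any paging space it lists is a candidate for enlargement with smitty chps.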

4. System core parameter configuration
Use lsattr -El sys0 to check the values of parameters such as maxuproc, minpout, and maxpout. maxuproc is the maximum number of processes per user. If the system runs Oracle, maxuproc should generally be raised from the default of 128 to 500. An increase in maxuproc takes effect immediately. When an application performs a large amount of sequential reading and writing that affects the response time of foreground programs, set maxpout to 33 and minpout to 16 with the smitty chgsys command.

5. File System space settings

Generally, usage of the /, /usr, /var, and /tmp file systems should not exceed 80%. /tmp should be at least 300 MB. A full file system prevents the system from working normally, especially the basic AIX file systems. For example, if / (the root file system) is full, users cannot log in. Use df to check: # df -k (view the basic AIX file systems), and use smitty chfs to expand a file system.
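
The 80% guideline is easy to check automatically. The following sketch assumes the AIX df -k column layout (Filesystem, 1024-blocks, Free, %Used, Iused, %Iused, Mounted on), so the usage figure is the fourth column and the mount point the seventh. The function name and the sample listing are hypothetical.

```shell
#!/bin/sh
# Sketch: list file systems whose usage exceeds a limit (default 80).
# Reads "df -k"-style text on stdin.
# Usage: full_filesystems [limit-percent]
full_filesystems() {
    limit=${1:-80}
    awk -v limit="$limit" '
        NR == 1 { next }                 # skip the column header
        {
            used = $4
            sub(/%/, "", used)           # "93%" -> "93"
            # rows such as /proc show "-" and are ignored
            if (used ~ /^[0-9]+$/ && used + 0 > limit + 0)
                print $7, used "%"
        }
    '
}

# Hypothetical sample listing, used only to demonstrate the function.
sample_df_output() {
    cat <<'EOF'
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4            65536     20000   70%     2100     7% /
/dev/hd2          1310720    100000   93%    30000    10% /usr
/dev/hd9var         65536      9000   87%      500     4% /var
/dev/hd3           327680    300000    9%       60     1% /tmp
/proc                   -         -    -         -     -  /proc
EOF
}
```

On a live system this would be used as df -k | full_filesystems 80, for example from a daily cron job, so that a file system approaching the limit is noticed before users are locked out.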

6. Activate the SSA fast-write cache
Use smitty ssafastw to activate the fast-write cache of each hdiskN logical disk: select the disk, change "Enable Fast-Write" to yes, and press Enter.

7. High water mark for pending write I/Os per file (maxpout) and low water mark for pending write I/Os per file (minpout)
The default value is 0. In a dual-host environment, set the high water mark to 33 and the low water mark to 24. Both parameters can be set with smitty chgsys.

8. Data flush frequency of the syncd daemon

This value is how often, in seconds, in-memory data is flushed to disk. The default is 60; it can be changed to 20 or adjusted to suit the actual situation. The parameter is set in /sbin/rc.boot: in the line nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &, change 60 to 20.

  
  IV. AIX System Backup and Recovery
Backup and recovery are routine tasks for system administrators and cover both rootvg backup and user data backup.

1. Backup of the operating system and system programs
# tctl -f /dev/rmt0 rewind
# smit mksysb

Enter "/dev/rmt0" in the "Backup DEVICE or FILE" field and press Enter. The backup runs for a long time. Wait until the screen shows OK, then remove the tape; the system backup is complete. mksysb backs up only the mounted file systems in rootvg.

2. User Data Backup
(1) Frequently used tape drive options

/dev/rmt0: with /dev/rmt0, the tape is rewound to the beginning each time a tape is inserted or a write completes, so the next backup overwrites the previous one.
/dev/rmt0.1: with /dev/rmt0.1, the tape drive does not rewind when a tape is inserted or a write completes, so one tape can hold several files or file systems backed up one after another.
(2) # smit fs

Select "Backup a File System", enter the name of the file system to back up, and enter "/dev/rmt0.1" as the device. Repeat these steps to back up multiple file systems to the same tape.

3. Restore rootvg
Boot the machine into maintenance mode.
On the Base Operating System Installation and Maintenance screen, select 3 "Start Maintenance Mode for System Recovery". To restore the system, select 4 "Install from a System Backup".
On the mksysb device screen, select "/dev/rmt0", insert the tape, and press Enter. The system restores the operating system automatically.

4. User Data Recovery

# tctl -f /dev/rmt0 rewind
# smit fs

Select "Restore a File System" and enter the device name and the target directory. The system restores the corresponding directory automatically.

  
  V. Routine inspection of the AIX System
Routine inspection of the AIX system is an important part of application maintenance and lets you catch system faults at an early stage. The following routine checks are a useful reference.
(1) Hardware check
Check the status of each indicator light and the availability of each physical device.
(2) Process check
Check for dead processes; use the ps -ef command to list all running processes.
(3) File system usage
Use the df -k command to check file system usage in KB.
(4) System error log check
Run errpt | more to check the log. Existing log entries can be cleared with errclear 0.
(5) Illegal login check
Use the last command to check login records.
(6) Core file check
Use the find / -name core -print command to check whether the system has generated large core files. Core files can be deleted directly.
(7) System performance check
① CPU performance: check with the vmstat and topas commands.
② Memory usage: check with the topas and vmstat commands.
③ I/O balance: check with the iostat command.
④ Paging space usage: check with the lsps -a command.
(8) Mail check

 
