Basic elements (time, user, problem)
The user had just deployed an Oracle 11.2.0.4 RAC environment on Linux 5.8. After it had been running for some time, switching to the grid user failed with "Resource temporarily unavailable", as follows:
# su - grid
su: cannot set user id: Resource temporarily unavailable
Switching to other users, such as oracle, still worked. The CRS cluster was running normally, and client connections and end users were not affected for the time being. When the problem first appeared, the user worked around it by restarting the server, but it came back before long. The user therefore needed a permanent fix, to avoid further risks to normal business applications.
Problem analysis
Step one: Check the operating system resource limit configuration
In a situation like this, the first thing to consider is whether the resource limit parameters set for the grid user during implementation are the problem. In a RAC implementation, user resource limits are set in two places, /etc/security/limits.conf and /etc/profile, so check the contents of these two files first:
# cat /etc/security/limits.conf
grid soft nproc 16384
grid hard nproc 65536
grid soft nofile 2047
grid hard nofile 65536
oracle soft nproc 16384
oracle hard nproc 65536
oracle soft nofile 2047
oracle hard nofile 65536
# cat /etc/profile
if [ $USER = "oracle" ] || [ $USER = "grid" ]; then
    if [ $SHELL = "/bin/ksh" ]; then
        ulimit -p 16384
        ulimit -n 65536
    else
        ulimit -u 16384 -n 65536
    fi
    umask 022
fi
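The limits.conf entries listed above can also be read with a small helper instead of by eye. The following is a minimal sketch, not part of the original troubleshooting: show_limit is a hypothetical name, it reads only the single file it is given, and it matches only lines whose first field is exactly the user name (wildcard and group entries are ignored).

```shell
# Print the soft and hard values of one limits.conf item for one user.
# Usage: show_limit USER ITEM [FILE]   (show_limit is a hypothetical helper)
show_limit() {
    awk -v u="$1" -v i="$2" '
        /^[[:space:]]*#/ { next }          # skip comment lines
        NF >= 4 && $1 == u && $3 == i {    # fields: domain type item value
            if ($2 == "soft" || $2 == "-") soft = $4
            if ($2 == "hard" || $2 == "-") hard = $4
        }
        END { printf "%s %s soft=%s hard=%s\n", u, i, soft, hard }
    ' "${3:-/etc/security/limits.conf}"
}
```

Running show_limit grid nproc against a file containing the grid entries above prints the two nproc values on one line, which makes it easy to compare them with the counts reported by ps.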
Here nproc controls the maximum number of processes the user can run. The soft value is the limit that applies by default; the user can raise it, but never beyond the hard value. The soft value is normally smaller than the hard value, which is the rigid ceiling. Each line of /etc/security/limits.conf has the form <domain> <type> <item> <value>, where the item can be nproc, nofile, fsize and so on.
Our settings grid soft nproc 16384 and grid hard nproc 65536 mean that the grid user is limited to 16384 processes by default and can raise that limit to at most 65536. Next we should look at the number of processes running under the user, as follows:
# ps -u grid | wc -l
156
# ps -aux | wc -l
659
Neither count is large, and both are far below the limit, so "Resource temporarily unavailable" should not have been raised. I suspected that the options I was passing to ps were the problem, so after looking up an introduction to the ps command on Baidu I changed the options and ran it again:
# ps -eL | wc -l
17730
This time the anomaly is obvious: there are 17,730 entries on a single node. The -e option selects every process on the system and -L lists each thread (LWP) on its own line, so the earlier ps -aux output was clearly hiding most of them:
-a  shows the processes attached to all terminals
-e  shows all processes
-L  shows threads, one line per LWP
The first form shows only what is running on terminals and lists each process once; the second gives a complete view of everything running in the current environment. Going carefully through the unexpected entries, we found that the vast majority of them belonged to ONS.
Counting them with a single command gives a total of 16,530 ONS entries:
# ps -eL | grep ons | wc -l
16530
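Rather than guessing which name to grep for, the thread list can be grouped by command name so the largest consumer stands out. This is a minimal sketch, assuming the procps ps found on Linux; top_thread_owners is a hypothetical helper that only post-processes whatever list of names it receives on standard input.

```shell
# Read one command name per line (one line per thread) on stdin and print
# "COUNT NAME" for the names with the most threads, largest first.
# Usage: ps -eL -o comm= | top_thread_owners [HOW_MANY]
top_thread_owners() {
    awk '{ count[$0]++ } END { for (name in count) print count[name], name }' |
        sort -k1,1nr -k2 |
        head -n "${1:-5}"
}
```

On a node in the state described here, ons would be expected at the top of that list with a count in the thousands.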
We have finally found the source of the problem; next we need to deal with it.
Step two: ONS process analysis
Oracle's official description of ONS (Oracle Notification Services) is "a publish and subscribe service for communicating information on all FAN events". It is mainly responsible for communication between RAC nodes and is a very important service process. So why were there so many ONS processes? The note "ONS has thousand processes/threads and still increasing" (Doc ID 1547703.1) gives the reason:
Applies to:
Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.
Symptoms
The number of ONS processes/threads continuously increases.
oracle 9470 17663  7447 0 7599 07:11 ? 00:00:00 /orahome/app/grid/opmn/bin/ons -d
oracle 9470 17663  8920 0 7599 07:12 ? 00:00:00 /orahome/app/grid/opmn/bin/ons -d
oracle 9470 17663 10425 0 7599 07:13 ? 00:00:00 /orahome/app/grid/opmn/bin/ons -d
..
The output of the command "onsctl debug":
IP ADDRESS       PORT   TIME      SEQUENCE  FLAGS
---------------  -----  --------  --------  --------
127.0.0.1        6200   511C7CCB  00000001  00000008
Listener:
TYPE      BIND ADDRESS     PORT   SOCKET
--------  ---------------  -----  ------
Local     127.0.0.1        6100   5
Remote    any              6200   6
Remote    any              6200   -
Connection topology: (1)
IP               PORT   VERS  TIME
---------------  -----  ----  ---------
127.0.0.1        6200   4     511c7cdd=
**  127.0.0.1  6200
**  127.0.0.1  6200
Server Connections:
ID        CONNECTION ADDRESS  PORT   FLAGS   SENDQ  REF  WSAQ
--------  ------------------  -----  ------  -----  ---  ----
6         127.0.0.1           6200   090026  00000  001
Client Connections:
ID        CONNECTION ADDRESS  PORT   FLAGS   SENDQ  REF  SUB  W
--------  ------------------  -----  ------  -----  ---  ---  -
1         Internal            0      01008a  00000  001  002
2         127.0.0.1           6100   01001a  00000  001  001
5         127.0.0.1           6100   01001a  00000  001  000
Request   127.0.0.1           6100   03201a  00000  001  000
Cause
Misconfigured /etc/hosts for the loopback interface
-------------------------------------------------------------------
127.0.0.1 EMSDB01 localhost.localdomain localhost
-------------------------------------------------------------------
Solution
Change the loopback interface entry to the following:
-------------------------------------------------------------------
127.0.0.1 localhost.localdomain localhost
-------------------------------------------------------------------
Resolution process
Step one: Check the /etc/hosts file
Looking at the /etc/hosts file, we found that the 127.0.0.1 line did indeed still contain the host name, a result of not being careful enough during implementation. The fix is to remove the host name from that line. The file as found:
# cat /etc/hosts
127.0.0.1 rac01 localhost.localdomain localhost
192.168.4.23 rac01
192.168.4.24 rac02
192.168.4.27 rac01-vip
192.168.4.28 rac02-vip
192.168.4.30 scan-rac
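The same check can be scripted so that it can be run on every node before and after the change. This is a minimal sketch under stated assumptions: check_loopback is a hypothetical helper, it looks only at lines whose address is exactly 127.0.0.1, and it treats any name beginning with "localhost" as an acceptable alias.

```shell
# Print every name on the 127.0.0.1 line(s) of a hosts file that is not a
# localhost alias. No output means the loopback entry looks clean.
# Usage: check_loopback [FILE]   (defaults to /etc/hosts)
check_loopback() {
    awk '
        { sub(/#.*/, "") }                 # drop comments before matching
        $1 == "127.0.0.1" {
            for (f = 2; f <= NF; f++)
                if ($f !~ /^localhost[.0-9A-Za-z]*$/) print $f
        }
    ' "${1:-/etc/hosts}"
}
```

Against the file shown above this reports the host name that has to be removed; once the line contains only the localhost names it prints nothing.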
After /etc/hosts had been adjusted on both nodes, the nodes were restarted one after the other. The onsctl debug command then produced the following output:
IP ADDRESS       PORT   TIME      SEQUENCE  FLAGS
---------------  -----  --------  --------  --------
127.0.0.1        6200   511C7CCB  00000001  00000008
Listener:
TYPE      BIND ADDRESS     PORT   SOCKET
--------  ---------------  -----  ------
Local     127.0.0.1        6100   5
Remote    any              6200   6
Remote    any              6200   -
Connection topology: (1)
IP               PORT   VERS  TIME
---------------  -----  ----  ---------
127.0.0.1        6200   4     511c7cdd=
192.168.4.23     6200
192.168.4.24     6200
Server Connections:
ID        CONNECTION ADDRESS  PORT   FLAGS   SENDQ  REF  WSAQ
--------  ------------------  -----  ------  -----  ---  ----
6         127.0.0.1           6200   090026  00000  001
Client Connections:
ID        CONNECTION ADDRESS  PORT   FLAGS   SENDQ  REF  SUB  W
--------  ------------------  -----  ------  -----  ---  ---  -
1         Internal            0      01008a  00000  001  002
2         127.0.0.1           6100   01001a  00000  001  001
5         127.0.0.1           6100   01001a  00000  001  000
Request   127.0.0.1           6100   03201a  00000  001  000
Compared with the earlier output, the node IP addresses are now displayed correctly. Querying the ONS processes again shows that they have dropped to about 2, and the problem is completely solved.
# ps -eL | grep ons | wc -l
2
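To notice a recurrence before su starts failing, the thread count of a user can be compared with its nproc limit on a schedule. This is a minimal sketch, not something the original fix included: check_nproc_headroom is a hypothetical helper, the default 80 percent threshold is an arbitrary assumption, and the count and the limit are passed in by the caller.

```shell
# Compare a thread COUNT with an nproc LIMIT.
# Prints OK and returns 0 while COUNT is below PERCENT% of LIMIT,
# otherwise prints WARN and returns 1.
# Usage: check_nproc_headroom COUNT LIMIT [PERCENT]
check_nproc_headroom() {
    count=$1
    limit=$2
    pct=${3:-80}
    if [ $(( count * 100 )) -ge $(( limit * pct )) ]; then
        echo "WARN: $count of $limit ($pct% threshold reached)"
        return 1
    fi
    echo "OK: $count of $limit"
}
# Typical use for the grid user with the soft limit from limits.conf:
#   check_nproc_headroom "$(ps -L -u grid -o lwp= | wc -l)" 16384
```

Threads are counted with -L because on Linux the nproc limit applies to threads as well as processes, which is why a per-process count can look harmless while the limit has already been reached.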
Key points
1. With the ps command, note that options with and without a leading "-" are different. For example, to see all processes use ps aux; ps -aux is parsed differently and may not show all of them, because:
Option descriptions:
-a  shows the processes of all terminals, except session leaders.
a   shows all processes attached to a terminal, including those of other users.
-e  shows all processes.
e   shows the environment variables of each process along with the command.
2. When implementing 11gR2 RAC, be sure to remove the host name from the 127.0.0.1 line of the hosts file; otherwise a large number of ONS processes will be created.
A RAC environment generates a large number of ONS processes, exhausting the user's process limit, so that switching users fails with "Resource temporarily unavailable"