Author :? SHOUG Member-Luo Min, ORACLEACS Senior Consultant, received a call from the Eastern service sales manager one day in early August 2014: "Luo, you are on a business trip to Shanghai tomorrow, can I go to XX Airlines first? One of their important systems is down." It is understood that the customer did not purchase the Oracle on-site ACS service, according
Author :? SHOUG Member-oracle acs senior consultant Luo Min on-site live broadcast fire fighting process one day in early August 2014, suddenly received a call from the Eastern Service Sales Manager: "Luo, you will be on a business trip to Shanghai tomorrow, can I go to XX Airlines first? One of their important systems is down." It is understood that the customer did not purchase the Oracle on-site ACS service, according
Author :?
SHOUG Member-oracle acs senior consultant Luo Min
- Live Video Fire Fighting Process
One day in early August 2014, I suddenly received a call from the service sales manager in the Eastern Region: "Old Luo, can you go to Shanghai on a business trip tomorrow? Can you go to XX Airlines? Their important system is down." It is understood that the customer did not purchase the Oracle on-site ACS service. According to the Oracle company's policy of having chickens first and then having eggs, we cannot perform any substantive service work on-site. However, starting from the national conditions, it is reasonable to consider the customer experience and customer relationship. As an ACS pre-sales consultant, it is reasonable to go to the site to assist in analyzing and solving problems and further understand the customer's current situation and needs, it's not just an attack. So I decided to change my itinerary, change the first flight on the next day, and fly to Shanghai at noon.
When I got a taxi at Hongqiao Airport, I tried to say sorry to the master. The master may have waited for a few hours and met me as a bad guy. I only needed to get started when I went to the customer site near the airport. The master was very dedicated and took me to the Information Center Building of the airline in 10 minutes in the hot sun at noon.
When I arrived at the site, the customer's O & M department leaders had long been waiting for me to bring me to the conference room with enthusiasm. In addition, dozens of people from the entire O & M department and development unit were summoned to the meeting room, there are also Indian experts responsible for application development. After the customer briefly introduced the system overview and fault conditions, he asked me to directly connect to the system and connect my computer to the large screen, dozens of eyes began to brush the scene to observe how Oracle consultants saved fire, Comrade Luo began to smell again, haha.
- Field pulse
To be honest, despite all the battles, IT systems are so complex and applications are so varied, and new IT technologies are emerging one after another. No expert dared to say that he was ill. However, the ideas and methods for analyzing and diagnosing problems are the same, that is, first understanding the system overview, and then understanding the fault situation, especially collecting fault-related data, ask if there are major changes to the application or environment before the fault, analyze and locate the problem gradually, and provide the final solution to the problem. The following are the specific situations related to the system and faults:
- Platform and architecture
It runs on the 2-node SUN Solaris platform. The database version is 11.2.0.4 RAC, and the database capacity reaches 1.6 TB.
Instance 1 restarts at around 15:15:44, and instance 2 restarts at, and node 1 is evicted. Before a fault occurs, node 1 consumes a very high amount of memory, reaches 100%, and generates a large number of SWAP operations. Node 2 also consumes 90% of the memory. However, the customer did not install OSWatcher, that is, the operating system data before and after the fault was not collected. At the same time, log files such as RAC and GI alert. log and crsd. log do not record obvious error data.
According to the customer, the system installed new patches for application software before January 1, August 1, that is, some application software was deployed. Based on the analysis of AWR reports from to before the downtime, the three SQL statements in these new applications consume a lot of resources. After RAC is restarted, the newly deployed application software is rolled back. At present, the RAC system runs smoothly.
It can be seen that new application software problems may be an important cause of RAC downtime!
- In-depth Application Analysis
New applications may be an important cause of RAC downtime, and Indian experts responsible for the application module development are also on site, therefore, we first conduct an in-depth analysis of one of the SQL statements. We only extract the following main parts for space limitations:
First, this statement consumes a lot of resources, Buffer Gets and Disk Reads are both very high, and the running time is as long as 555 seconds. By analyzing the execution plan of the statement, we find that the statement scans the entire table of the three large tables. The direct cause of full table scan is the usage of the UPPER function in the following part of the statement:
AND (CUSDOCINF. DOCTYP =: 2 and upper (CUSDOCINF. DOCNUM) = UPPER (: 3) OR
(CUSDOCINF. DOCTYP =: 4 and upper (CUSDOCINF. DOCNUM) = UPPER (: 5 )))
In fact, when we remove the UPPER function OR modify the OR operation to the in operation, the Oracle execution plan is very reasonable and the statement efficiency is very high.
However, after careful observation, I found that the developer has actually designed the UPPER function index and collected statistics. But why does Oracle not use the function index? At this moment, Indian engineers took the initiative to tell me that Oracle Bug 14630247 would cause Oracle optimizer to use full table scanning instead of function indexing. So I immediately analyzed Bug 14630247 and related Bug 14828235 through the Oracle website, especially after reading the Bug 14828235 ORA-7445 [evaopn3] from query with Function based index and order by clause, it is found that the Bug has been fixed in 11.2.0.4, and if the Bug occurs, there should be a ORA-7445 error. However, the above statement does not cause a ORA-7445 error, and the system is already 11.2.0.4 version, so whether it is caused by Bug 14630247 or Bug 14828235, I still cannot judge at the scene. Therefore, we recommend that you create another SR for this issue. Oracle GCS and the R & D department should confirm whether these bugs have been fixed on the 11.2.0.4 for Solaris platform, or if they have been released again. However, as an ACS onsite service team, I suggest taking some Workaround measures at the application level to avoid this problem, such as whether to cancel the upper function or cancel the or operation.
Well, application-related problems can only be analyzed at the site for the time being. But is this the only cause of the above fault? Is it because these statements consume too much resources and lead to downtime? Because the customer has not installed OSWatcher, that is, they are unable to obtain the operating system data when the system is down, especially the memory and process data. Therefore, they are unable to make accurate judgments.
- More serious problems found
In addition to the memory consumption caused by the above-mentioned poor applications, is there a problem with the RAC environment itself? So I checked the RAC environment through the Oracle cluvfy tool and soon found a more serious problem! Some details are as follows:
Grid @ ffpdb01:-bash :~ $ Cluvfy comp sys-n all-p crs-verbose
Verifying system requirement
Check: Total memory
Node Name ???? Available ???????????????? Required ????????????????? Status
----? --------? --------? ----
Ffpdb02 ?????? 96 GB (1.00663296E8KB )???? 2 GB (2097152.0KB )???????? Passed
Ffpdb01 ?????? 96 GB (1.00663296E8KB )???? 2 GB (2097152.0KB )???????? Passed
Result: Total memory check passed
... ...
Check: Hard limits for "maximum open file descriptors"
Node Name ???????? Type ????????? Available ???? Required ????? Status
------? ----? ----? ----? ------
Ffpdb02 ?????????? Hard ????????? 8192 ??? ?????? 65536 ???????? Failed
Ffpdb01 ?????????? Hard ????????? 8192 ????????? 65536 ???????? Failed
Result: Hard limits check failed for "maximum open file descriptors"
... ...
Check: Kernel parameter for "tcp_smallest_anon_port"
Node Name ???? Current ?????????????????? Required ????????????????? Status
----? --------? --------? ----
Ffpdb02 ?????? 32768 ???????????????????? 9000 ????????????????????? Failed (ignorable)
Ffpdb01? ????? 32768 ???????????????????? 9000 ????????????????????? Failed (ignorable)
Result: Kernel parameter check failed for "tcp_smallest_anon_port"
Check: Kernel parameter for "tcp_largest_anon_port"
Node Name ???? Current ?????????????????? Required ???????? ????????? Status
----? --------? --------? ----
Ffpdb02 ?????? 65535 ???????????????????? 65500 ???????????????????? Failed (ignorable)
Ffpdb01 ?????? 65535 ???????????????????? 65500 ???????????????????? Failed (ignorable)
Result: Kernel parameter check failed for "tcp_largest_anon_port"
Check: Kernel parameter for "udp_smallest_anon_port"
Node Name ???? Current ?????????????????? Required ????????????????? Status
----? --------? --------? ----
Ffpdb02 ?????? 32768 ???????????????????? 9000 ????????????????????? Failed (ignorable)
Ffpdb01 ?????? 32768 ???????????????????? 9000 ????????????????????? Failed (ignorable)
Result: Kernel parameter check failed for "udp_smallest_anon_port"
Check: Kernel parameter for "udp_largest_anon_port"
Node Name ???? Current ?????????????????? Required ????????????????? Status
----? --------? --------? ----
Ffpdb02 ?????? 65535 ???????????????????? 65500 ???????????????????? Failed (ignorable)
Ffpdb01 ?????? 65535 ???????????????????? 65500 ???????????????????? Failed (ignorable)
Result: Kernel parameter check failed for "udp_largest_anon_port"
... ...
Verification of system requirement was unsuccessful on all the specified nodes.
Grid @ ffpdb01:-bash :~ $
My mom, the core and network parameters of the operating system did not meet the Oracle RAC installation requirements, which would seriously cause Oracle GI and RAC to run abnormally! This is probably the more important cause of RAC downtime. Of course, to be accurate, it should be due to the sharp increase in external application pressure, and the aforementioned internal problems in the RAC environment have jointly led to downtime and failure.
- Customer tangle and pain
Even if the environment parameters are not configured properly, the 11g RAC is forcibly installed and started to work due to illness. Who did it? The customer leader's answer was a bit arrogant. He said that the Oracle company's product pre-sales department was doing the work, and that the Oracle company's hardware department was doing the work later. Well, don't go into it. Don't embarrass the leaders. I guess it is likely to be a third-party local company for installation, and the company's technical staff may not even carefully read the Oracle Installation documentation, specifically Oracle? Grid Infrastructure Installation Guide11g Release 2 (11.2) for Oracle Solaris, more specifically, the section "2.10 Verifying UDP and TCP Kernel Parameters" and "2.11 Checking Resource Limits for Solaris" in this document. Alas, it is very likely that third-party technical staff have casually found a short article on RAC installation in Baidu and Google, and started training on such an important airline system.
This is the difference between the non-professional service team and the original professional service team. The original technical staff will at least carefully read the official Oracle Installation documents, and will be guided by the Oracle RAC implementation methodology, combined with several best practices of Oracle, we will carry out comprehensive and in-depth implementation of RAC software and patch installation, high availability configuration, and application deployment to ensure the high quality of database RAC implementation.
What should I do now? Can I directly modify several memory unlimited parameters and TCP and UDP parameters to solve the problem and ensure that RAC is not down? As an on-site engineer, after all, I am not directly a product developer and cannot make such a commitment. Therefore, it is recommended that the customer further seek confirmation from the Oracle Background Service Team and product R & D department through SR. However, based on my previous similar experience, the best way is to reconfigure the environment parameters and re-install the RAC system.
Therefore, on the one hand, I proposed the re-installation, and on the other hand, in order to reduce the impact on the production system downtime, I further proposed to first install a Data Guard environment, switch the existing production system Data to the Data Guard environment, reinstall the 11g RAC of the existing production system, and switch back to the 11g RAC suggestions. However, when I exit these re-installation suggestions, the customer leaders immediately sighed and complained: "The system has just been online for less than a month. How can I explain the re-installation to the leaders ?" "Alas, if you had been here a month earlier, you would have discovered environmental problems before going online. At that time, there was no problem with re-installation ."
There are also more tangled and painful questions: "Luo Gong, can your Oracle company provide such evidence? It proves that this RAC downtime is caused by unreasonable environment parameter configuration ?". How can this be proved? OSWatcher is not installed, and valuable information is not captured in other log files. More importantly, based on past experience, if Oracle software installation is found to be faulty, the Oracle background will not continue further analysis and diagnosis. We recommend that you reinstall the software before proceeding. Yes. If A is wrong, B Based on A is wrong. Oracle should stop analysis B and correct A first. Then, it is too logical to check the running status of B.
- More senses and insights
In addition to the above feelings about the professionalism and non-professionalism of the original manufacturer and third-party manufacturer in the installation and implementation of RAC, more insights include:
- Never underestimate Oracle software installation, especially cluster and RAC installation. This is indeed a very professional job. An unreasonable environment parameter configuration may pose a deep risk to the system.
- When encountering problems and failures, we should still be realistic and pragmatic and respect objective rules. We should not think too much about face, especially the comments of leaders. Although it is possible to make a thing solid and perfect, although it may have to pay a lot of price, it is still very face-oriented, and leaders will be satisfied, huh, huh.
- Make another sales promotion for the Oracle service department. Oracle's various professional service departments, including the PS department that provides standard services at the backend and the ACS department that provides on-site services at the front end, are professional teams that cooperate with each other and complement each other, it is valuable to customers and is indispensable. Taking this case as an example, the background PS department can give full play to the advantages of product implementation analysis and communication with the R & D department, while the foreground ACS Department communicates with the customer on site to learn more about the multi-system and application background, it also helps customers communicate with PS departments to jointly promote problem analysis and solutions.
- More feelings and insights for everyone... ...
2014January 1, October 6
Related posts:
- Oracle Acs senior consultant Luo minoluo core technical sentiment: Is Clusterware a mature product?
- Oracle Acs senior consultant Luo minoluo technology core sentiment: automatic scanning of SQL statement tools?
- Oracle Acs senior consultant Luo minoluo core technical sentiment: Table sharding or partition?
- Oracle Acs senior consultant Luo minoluo core technical sentiment: Niu! Automatic Optimization of 11 GB and SQL Profile
- [Oracle RAC optimization] Different gcs_server_processes parameters may cause gc cr multi block request wait events.
- Understand Oracle Validated Configurations
- How many LMS processes for Oracle Rac 9i?
- Oracle database 11g r2 latest installation experience
- Oracle RDBMS Server 11gR2 Preinstall RPM For Oracle Linux 6
- Oracle Recommended Kernel Parameter settings for HP Itanium v3 11.31
Original article address: Oracle Acs senior consultant Luo minoluo core technology sentiment: the sentiment after the appearance of the Smell again, thanks to the original author to share.