Technology Life Series (16): Red Alert! A Wave of Oracle Outages, Act Quickly!



1

Objective


On the eve of Valentine's Day (February 14), an Oracle 11.2.0.4 RAC cluster in a data center went down!

A few days later, another RAC cluster went down!

A few days after that, yet another RAC cluster followed.

If you work in operations and hear that other customers are suffering outages like these, don't you feel an inexplicable pang of panic?

So the question is: could your own company's data center suffer a similar outage?

What is causing these failures?

Will this madness continue ...

If the truth is not found in time, Little Y believes the wave of outages will continue!

The Oracle databases in your data center may be getting closer to downtime, and the scary thing is that you probably haven't noticed ...

This is definitely not alarmism!

This is a real failure at a very large data center: in less than two weeks, instances of three different Oracle databases terminated abnormally!

Coincidentally, other customers served by Little Y also showed early signs of the same outage, one after another. Fortunately, these were detected and handled in time.

Let's look at how the outages came about, and how Little Y cut through the complexity and helped the customer uncover the truth.

Once the truth is revealed, you will see that this is a common problem!

So Little Y did not dare delay, and hurried to share it with everyone and sound the red alert!

In this 16th installment, Little Y takes you through the analysis of an Oracle database outage at a data center.

At the end of the article you will find specific warnings and verification methods. What are you waiting for? Check it out!

2

Here's the problem.


"Little Y, something has happened. One of our systems lost a RAC node this morning and another node tonight. The operating system did not restart, but the database instances crashed! An SR has been opened, but the cause is still unclear. Management is taking this very seriously. Can you come over tomorrow and look into it with us? They hope to establish the cause by tomorrow."

"Yes, it is an 11.2.0.4 RAC with the latest PSU applied!"

The phone call got Little Y's full attention.

The caller was a very large state-owned bank in China whose own Oracle DBAs are highly skilled.

The problems that reach Little Y are usually strange and complex ones. Someone who knows only the database, without sufficient understanding of the operating system, middleware, storage, and so on, often cannot solve them.

It seemed a tough battle was inevitable ...

3

Start analysis



First, the database alert log:

The next morning, at the customer site, the customer first described the previous day's failure: at around 9:00 on the morning of February 12, node 1 of the 11.2.0.4 RAC went down, and at around 22:00, node 2 went down.

After the customer helped him log in to the system, Little Y first checked the database alert log, shown below:


[Screenshot: database alert log]

From this we can see:

At 8:53:49 on 2017/2/12, the database's ASMB background process lost communication with the ASM instance, so ASMB terminated the DB instance. Little Y therefore needed to check the ASM alert log next, to see whether the ASM instance had a problem first and brought the database down with it.

Next, the ASM alert log:

[Screenshot: ASM alert log]

From this we can see:

A few tens of seconds before the database crashed, at 8:53:15, the RBAL background process of the ASM instance hit an ORA-07445 error and dumped core, so the PMON process terminated the ASM instance.

In other words, an ORA-7445 error in the ASM instance's RBAL process terminated the ASM instance, and because the DB instance depends on the ASM instance, the DB instance was terminated too. The specific ORA-7445 error in the ASM instance was:

ORA-07445: exception encountered: core dump [__lwp_kill()+48] [SIGIOT]
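When many alert logs have to be checked, the two bracketed arguments of this message (the function with its offset, and the signal) can be pulled out mechanically. The following is only a minimal sketch: the regular expression is written against the single message quoted above, and the helper name `parse_ora_7445` is a made-up name for illustration, not an Oracle tool.

```python
import re

# Pattern for the bracketed arguments of an ORA-07445 message such as the
# one quoted above. It tolerates stray spaces and mixed case.
ORA_7445 = re.compile(
    r"ORA-0?7445:.*?\[\s*(?P<func>\w+)\s*\(\)\s*\+\s*(?P<offset>\d+)\s*\]"
    r"\s*\[\s*(?P<signal>\w+)\s*\]",
    re.IGNORECASE,
)

def parse_ora_7445(line):
    """Return (function, offset, signal), or None if the line does not match."""
    m = ORA_7445.search(line)
    if m is None:
        return None
    return m.group("func"), int(m.group("offset")), m.group("signal").upper()

print(parse_ora_7445(
    "ORA-07445: exception encountered: core dump [__lwp_kill()+48] [SIGIOT]"
))
```

The function name returned here is only where the process died; as the rest of the article shows, the interesting frames are further down the call stack.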

When Little Y first saw this error, he shook his head helplessly: trouble!

Why did Little Y feel this way? A veteran DBA may well feel the same on seeing this error!

Because __lwp_kill is far too generic: there may be ten thousand or more functions in which a core dump ends this way, and Metalink cannot possibly record them all ...

Still, Little Y was confident: switch to the methodical, by-the-book mode of analysis, and even an unknown problem can be tracked down quickly.

Yes, as long as we concentrate on finding the cause of this ORA-7445 [__lwp_kill()+48] [SIGIOT] error, we also unlock the truth behind the whole series of outages.

4

A bad start

At this point, Little Y had a brief discussion with the engineers who had already looked at the problem.

The upshot of the discussion, in two words: not good.

Some of the customer's senior engineers had already examined the problem and searched Metalink for similar issues. There were several cases with a matching call stack, but none of them reached a direct conclusion. The customer had opened an SR with GCS, which was still being analyzed ...

The customer wanted at least a preliminary result that day, so time was short.

The customer knows Little Y's habits: however urgent things are, he first takes the time to smoke a cigarette ...

After a word with the customer, Little Y went downstairs for a smoke.


5

Some serious background, please.


During the smoke break, Little Y sorted out his thoughts.

Some readers may be puzzled by a few of the terms above: what is a call stack, and what are ORA-600 and ORA-7445 errors? Little Y has found that many DBAs are in the same position, so here is a little background:

Key knowledge points

Knowledge point 1: What is an ORA-600 error?

Some people say that ORA-7445 is the same as ORA-600, an internal Oracle error. In Little Y's view this is not accurate! ORA-600 is an internal error, but ORA-7445 is not!

ORA-600 is an exception caught by checks in the Oracle source code. It typically occurs in a particular function, is relatively specific, and usually indicates an Oracle bug.

Knowledge point 2: What is an ORA-7445 error?

ORA-7445 errors are different from ORA-600 errors.

An ORA-7445 error is reported when an Oracle process receives a fatal signal from the operating system while running. The operating system itself detects certain illegal operations by a process, such as an attempt to write to an invalid memory location. To protect itself, the operating system sends the process a fatal signal such as SIGBUS or SIGSEGV, and that is when you see the process dump core.

ORA-7445 errors can occur anywhere in the code; the exact location of the error has to be determined from the core file.


As this shows, ORA-7445 has a wider range of possible causes. In essence the operating system has sent the process a fatal signal, so the cause may be a database bug, but it may equally well be some anomaly on the operating system side.

This is why ORA-7445 errors are harder to analyze than ORA-600 errors.


Knowledge point 3: What is a call stack?


When we talk about a bug or defect, one question always comes up: what condition triggers it?

Bugs are usually triggered in a particular scenario. The call stack is the trail of function calls, and it represents the specific scenario that triggered the bug.

This is what Little Y referred to earlier: before he arrived, the customer's engineers had already searched for bugs with a matching call stack.

Unfortunately, the cases on MOS with the same call stack had reached no conclusion and so offered nothing to go on ...


6

Looking for the truth in the call stack


Little Y next opened the trace file of the RBAL process that raised the ORA-7445 error and located the call stack section, shown below:


[Screenshots: call stack section of the RBAL process trace file]

First, find the function named in the first bracket of the ORA-7445 error, which is __lwp_kill.

This means that the RBAL process dumped core in the __lwp_kill system call.

LWP stands for lightweight process, and kill means to terminate.

Attentive readers will notice that __lwp_kill begins with two underscores: it is not a function called directly from Oracle code, but a low-level function called internally by another function.

So how does Little Y know what these calls mean?

It is actually very simple: these are standard operating system calls, so a quick search on Baidu or Google will do.

In the trace file, the call stack reads from the bottom up: a function lower in the list ran before the function above it. This error is far too common and generic: __lwp_kill can be called to terminate a process from many places, so analyzing this function alone tells us nothing. We need to keep looking down the stack, as shown below:


[Screenshot: call stack frames below __lwp_kill]

1) The caller of __lwp_kill is __pthread_kill, which delivers a signal to a thread. It too is called by another function, so we keep looking down.

2) _raise sends a signal to the running program; it is _raise that calls __pthread_kill.

3) abort(), as its name suggests, terminates: it causes the process to terminate abnormally by raising the SIGABRT signal, unless SIGABRT is caught and the signal handler does not return.

4) _assert() terminates program execution if its condition is not met. Put simply, the program was doing something, hit an error, and had to stop.

From this it is easy to see that the call trail is:

__lwp_kill <--__pthread_kill <--_raise <--abort <--_assert

In plain English, this call stack says:

The RBAL process encountered an error during execution and therefore terminated itself.
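The "process aborts itself and is seen as killed by a signal" behaviour is easy to reproduce outside Oracle. The sketch below is an illustration only, assuming a POSIX system: a child Python process calls `os.abort()`, and the parent reads back which signal ended it. The helper name `signal_that_killed_child` is invented for this example.

```python
import signal
import subprocess
import sys

def signal_that_killed_child(code):
    """Run `code` in a child Python process and return the name of the
    signal that terminated it, or None if it exited normally."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if proc.returncode < 0:
        # On POSIX a negative return code -N means "terminated by signal N".
        return signal.Signals(-proc.returncode).name
    return None

# A child that aborts itself, much like the abort() frame in the stack above.
print(signal_that_killed_child("import os; os.abort()"))
# A child that exits normally.
print(signal_that_killed_child("pass"))
```

On typical Linux systems the first call reports SIGABRT. On many platforms SIGIOT is another name for the same signal number, which would be consistent with the [SIGIOT] shown in the ORA-7445 message, although the article itself does not state this.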


So what went wrong? To find out, we need to keep looking further down the call stack.


[Screenshot: call stack frames below _assert]

From this we can see:

_assert() was in turn called by another function: its caller is clsuassertmsg.

This function name has no leading underscore; it is one of Oracle's own functions, so naturally Baidu or Google will turn up nothing.

So let's follow Little Y and make an educated guess.

Split the name into clsu + assertmsg: the assert part indicates that an error was encountered.

Looking further down we find clsgpnp_oramemalloc. It is not hard to see that oramemalloc allocates memory, and that this concerns memory allocation in a GPnP-related module.

Combining this with the earlier part of the call stack, let's summarize:

The RBAL process encountered an error while executing clsgpnp_oramemalloc to allocate memory, so it dumped core, which is the ORA-7445 error, and this in turn crashed the ASM instance and then the DB instance.
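The reading technique used above (skip the generic operating system frames, then look at the first Oracle frames) can be written down as a small routine. This is a rough sketch of the heuristic only: the frame list is the one discussed in this article, while the `LIBC_NAMES` set and the function name `first_application_frames` are assumptions made for illustration.

```python
# Frames in the order the article walks through them: innermost call first.
STACK = [
    "__lwp_kill", "__pthread_kill", "_raise", "abort", "_assert",
    "clsuassertmsg", "clsgpnp_oramemalloc",
]

# Assumption for this sketch: OS/libc frames either start with an underscore
# or appear in this small hand-written set.
LIBC_NAMES = {"abort", "raise", "kill"}

def first_application_frames(stack, count=2):
    """Skip the leading OS/libc frames and return the next `count` frames."""
    for i, name in enumerate(stack):
        if not name.startswith("_") and name not in LIBC_NAMES:
            return stack[i:i + count]
    return []

print(first_application_frames(STACK))
# prints ['clsuassertmsg', 'clsgpnp_oramemalloc']
```

The underscore rule is only a convention; a real stack may need a longer list of known system functions.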

7

Why is memory not allocated?


At this point we know that Oracle hit an error while executing clsgpnp_oramemalloc. But what error? That is the key! Is there any way to find out?

Oracle reports an ORA-7445 error in __lwp_kill, but what we really care about is the error that clsgpnp_oramemalloc ran into!

If only the trace would tell us what went wrong!

Many people who get this far in the analysis find themselves at a dead end!

In fact, Little Y read the trace file for only a few minutes and found the truth.

Stop and think for a minute or two: if it were you, how would you continue ...


---------------------

Thinking time ... don't scroll down just yet ...



-------------------------

8

Slowly approaching the truth.


Little Y's method is very simple: analyze the problem with ordinary reasoning and everyday language.

On the surface, the call that dumped core is __lwp_kill, but the Oracle function that actually had the problem is clsgpnp_oramemalloc. Clearly, we need to know what error this function reported when it tried to allocate memory! Remember: you need evidence, not a guess!

Some will say: can't we just search the trace for the keyword clsgpnp_oramemalloc?

Unfortunately, you may well find nothing.

Is your reasoning wrong?


No! It is your method!

Little Y searched with only the first part of the clsgpnp_oramemalloc keyword (you will soon see why).

Searching for the keyword clsgpnp_oram gives the following result:

[Screenshot: trace file search result for clsgpnp_oram]

We can see:

The Oracle function that really had the problem is clsgpnp_oramemalloc.

When this function tried to allocate memory, it reported that it could not allocate about 120K: Failed to allocate 120024 bytes.

Seeing this, you understand why the name of the failing function had to be truncated when used as a search keyword!

The function name is wrapped across lines in the trace, so if you search for the entire name you may find nothing!
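The wrapping problem can be shown on a toy example. In the sketch below the trace excerpt is hypothetical: only the text "Failed to allocate 120024 bytes" comes from the article, and the position of the line break inside the function name is an assumption made to mimic the wrapping. The two helper functions are illustrative names.

```python
import re

# Hypothetical trace excerpt. The line break inside the function name is an
# assumption that imitates how the trace wrapped the name.
TRACE = (
    "some earlier trace text\n"
    "clsgpnp_oramem\n"
    "alloc: Failed to allocate 120024 bytes\n"
)

def find_plain(text, keyword):
    """Plain substring search, as an editor's Find would do."""
    return keyword in text

def find_ignoring_wraps(text, keyword):
    """Match the keyword even if line breaks fall between its characters."""
    pattern = r"[\r\n]*".join(re.escape(ch) for ch in keyword)
    return re.search(pattern, text) is not None

print(find_plain(TRACE, "clsgpnp_oramemalloc"))           # full name: False
print(find_plain(TRACE, "clsgpnp_oram"))                  # truncated: True
print(find_ignoring_wraps(TRACE, "clsgpnp_oramemalloc"))  # wrap-tolerant: True
```

Truncating the keyword by hand, as Little Y did, is the quick fix; a wrap-tolerant search is an alternative when you do not know where the break falls.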

This is a small tip from Little Y's own experience.

At this point someone may ask:

Why not simply guess what error clsgpnp_oramemalloc hit when allocating memory?

Isn't a failure to allocate memory the most likely? True, but that is not the whole story.

A memory allocation call can fail in many ways, not only the one you have in mind!

Do you remember the by-the-book mode of analysis that Little Y mentioned earlier?

Guessing cannot convince the customer or yourself and cannot form a complete chain of evidence; it is typical of the "self-taught improviser" style. Little Y has interviewed many people over the years, and the results have not been encouraging: most candidates are improvisers who, when solving a problem, strike here and there and rely on luck, rather than working step by step in the by-the-book manner. Faced with a complex problem, the improviser soon comes up short.

9

The truth surfaced.


Oracle's clsgpnp_oramemalloc function, when allocating memory, reported that it could not allocate about 120K: Failed to allocate 120024 bytes.

By now you must be itching to try your own hand. After all, with this error in view, the scope of the problem has narrowed further!

1) Could clsgpnp_oramemalloc not allocate memory because the machine was short of memory?

The answer is no. The monitoring data and OSWatcher show that the machine still had plenty of free memory.

2) Are the operating system's ulimit memory settings too small?

The answer is no. ulimit is configured properly.

Are we heading in the wrong direction?

Sometimes we seem close to the truth and yet brush right past it. Why is that?

Here Little Y will keep you in suspense for a moment. The answer is below; scroll down when you are ready ...

As Little Y said, if you understand only the database and lack sufficient understanding of the operating system, middleware, storage, and so on, you often cannot solve the complex problems of a large data center.

Look at the memory consumption of the RBAL process before the outage, shown below:

[Screenshot: glance history showing memory use of the RBAL process]

The operating system here is HP-UX 11.31 ia64 (in fact the problem is not specific to the operating system).

Looking back through the glance history, we can see that Res Mem reached 4209480K, that is, about 4G.

Does 4G ring a bell? Yes, it looks like a limit!

[Screenshot: operating system kernel parameter settings]

As you can see, the operating system's maxdsiz_64bit parameter is set to 4G; that is, the data segment of a single process can grow to at most 4G!
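The comparison between the glance figure and the limit is simple unit arithmetic. The sketch below assumes that "K" means KiB and that "4G" means 4 GiB; both are assumptions, since the article does not spell out the units.

```python
KIB_PER_GIB = 1024 * 1024

def kib_to_gib(kib):
    """Convert a size in KiB to GiB."""
    return kib / KIB_PER_GIB

res_mem_kib = 4209480   # Res Mem reported by glance, from the article
limit_gib = 4           # per-process data limit, assumed to be 4 GiB

print(round(kib_to_gib(res_mem_kib), 2))     # prints 4.01
print(kib_to_gib(res_mem_kib) >= limit_gib)  # prints True
```

Resident memory covers more than the data segment, so this is only a rough indicator, but it shows the process sitting right at the 4G boundary.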

Why was the RBAL process using so much memory?

Clearly, the RBAL process has a memory leak. Normally, the RBAL process uses on the order of 100M.

We checked the historical data and confirmed that the RBAL process was leaking memory!

What triggers the RBAL memory leak?

Analysis and comparison showed one difference between this database and the others:

Voting file relocation occurs frequently in the ASM alert log, as shown below.

[Screenshot: voting file relocation messages in the ASM alert log]

Finally, after the voting file relocation was resolved, the memory of the RBAL process stopped growing and the problem was fundamentally fixed. Afterward, the customer also requested a patch.

10

Reconstructing the wave of outages


By reading the ORA-7445 call stack, Little Y reduced a complex problem to a simple one and reconstructed what happened.

1. Why did the ASM RBAL process dump core with an ORA-7445 [__lwp_kill] error?

The RBAL background process of the ASM instance has a memory leak. When the leaked memory reaches the operating system's limit for a single process, the process can no longer allocate memory and crashes, which brings down the ASM instance and then the DB instance.

2. Why did this cause a wave of outages?

Because the RBAL process leaks memory at the same rate everywhere. Several databases were started on the same maintenance day, so after a little under a year they reached the operating system's per-process limit at almost the same time, and a "wave of outages" followed.
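The "same rate, same start day, same limit" argument amounts to a linear projection. The sketch below is a simplified model with made-up sample numbers (100 MB at start, 1300 MB after 100 days, a 4096 MB limit); the real leak rate is not given in the article and need not be exactly linear.

```python
def growth_rate(sample_a, sample_b):
    """Each sample is (day_number, memory_mb). Returns MB per day."""
    (day_a, mb_a), (day_b, mb_b) = sample_a, sample_b
    return (mb_b - mb_a) / (day_b - day_a)

def days_until_limit(current_mb, limit_mb, growth_mb_per_day):
    """Linear projection of the days left before memory reaches limit_mb."""
    if growth_mb_per_day <= 0:
        return None  # not growing, so no projected date
    return max(0.0, (limit_mb - current_mb) / growth_mb_per_day)

# Hypothetical samples: 100 MB on day 0, 1300 MB on day 100.
rate = growth_rate((0, 100), (100, 1300))
print(rate)                                # prints 12.0
print(days_until_limit(1300, 4096, rate))  # prints 233.0
```

Two instances started on the same day with the same rate and the same limit get the same projected date, which is why the failures cluster within days of each other.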

In fact, when the first cluster went down, Little Y had already helped the customer identify the ASM RBAL memory leak, but before all systems could be fully reviewed and fixed, two more RAC clusters went down.

3. Will an outage always occur?

Not necessarily. If the operating system places no upper limit on the memory of a single process, there will be no outage of this kind, but the RBAL process will eventually use up the memory of the whole system. If this is not caught by monitoring in time, it may lead to performance problems and the inability to log in via telnet/ssh.

In short, the same problem shows different symptoms under different OS configurations!

4. Does the RBAL core dump always occur in the clsgpnp_oramemalloc function?

Clearly, once the leak reaches the OS limit for a single process, any function that needs to allocate memory may fail to get it and dump core. So the answer is: not necessarily. One fault can show a number of different symptoms, but in essence it is the same thing!

5. Does the RBAL memory leak occur only on HP-UX?

No! We found this problem not only on HP-UX: in another customer's AIX environment, the RBAL process had already used 8G of memory and was still growing. The issue is not platform-specific. So far the affected version is confirmed to be 11.2.0.4; we are still checking other versions.

11

The red alert is sounding!

An Oracle 11.2.0.4 RAC cluster went down!

A few days later, another RAC cluster went down!

A few days after that, yet another RAC cluster followed!


For the detailed reasons, see the analysis in section 10 above.


1. On behalf of the team, Little Y solemnly warns everyone of a significant risk:

In Oracle 11.2.0.4 RAC, the RBAL background process of the ASM instance has a memory leak that can lead to an outage. Multiple customers have already been affected, on operating systems including HP-UX, AIX, and Linux.

2. Suggestions

We recommend using the following checks to review all systems for this condition, and adding process-level monitoring of memory usage.

1) ps -elf | grep -i asm_rbal or ps aux. Memory use is normally on the order of 100M; you can also compare against the other ASM background processes.

2) SELECT * FROM v$version;

3) Check whether the following messages appear in the ASM alert log:

[Screenshot: voting file relocation messages in the ASM alert log]
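The process-level check in step 1 can be scripted. The sketch below parses text with the columns PID, RSS (in KB), and COMMAND, as produced for example by `ps -eo pid,rss,args` on Linux; the column options differ on HP-UX and AIX. The sample output, the process names in it, and the 1 GiB threshold are all hypothetical values chosen for illustration.

```python
def rbal_processes_over(ps_output, threshold_kb):
    """Return (pid, rss_kb, command) for ASM RBAL processes whose RSS
    exceeds threshold_kb. Expects the columns PID, RSS (KB), COMMAND."""
    hits = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # header line or malformed line
        pid, rss_kb, command = int(parts[0]), int(parts[1]), parts[2].strip()
        if "asm_rbal" in command.lower() and rss_kb > threshold_kb:
            hits.append((pid, rss_kb, command))
    return hits

# Hypothetical sample output, not taken from the affected systems.
SAMPLE = """\
  PID     RSS COMMAND
 4021   98304 asm_pmon_+ASM1
 4077 2150400 asm_rbal_+ASM1
"""

print(rbal_processes_over(SAMPLE, threshold_kb=1024 * 1024))
# prints [(4077, 2150400, 'asm_rbal_+ASM1')]
```

Run on a schedule and fed with real ps output, a check like this would raise the alarm long before the process reaches the per-process limit.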

3. How to solve the problem:

If your checks reveal a similar problem and you want to know how to solve it,

you can add Little Y: shadow-huang-bj

Note in your request: join the Oracle technology group

Little Y will announce the solution in the group and in the next post on the "Antu" official account.

