Diagnosing an Oracle RMAN Backup Error (II): Tracing the Error Messages and Choosing a Direction for Locating the Problem

Source: Internet
Author: User
Tags: sessions, metalink, backup

While checking the output of the database backup script today, I found that the RMAN backup had failed with an error.

This article follows up on the error messages and looks for a direction in which to locate the problem.

As described in the previous article, the problem has turned out to be more and more complex: what started as a simple RMAN backup error now involves 3 long-running jobs in the system and a large number of racgmain check processes on the current node of the RAC environment.

Although the problem is complex, do not rush into blind action. Start with a simple analysis of the current situation.

The problem was first noticed through an error reported by the RMAN backup script, but the error message and subsequent tests showed that it is reproducible and not a simple RMAN problem. The likely cause is that a shared resource is held by another session, and RMAN terminates after waiting a long time without obtaining it.

When checking shared resources in the database, 3 suspicious sessions were found holding locks in the V$LOCK view. Further queries showed that these sessions belong to 3 jobs, which have been running for a long time without finishing or raising an error.

While checking operating system processes, a large number of racgmain check processes were found on the current system. This is quite unusual: no such processes exist on other RAC environments that are running normally.

These 3 problems may be interrelated, or they may be independent. If they are unrelated, each should be analyzed and solved separately. If they are related, we need to find a breakthrough point that leads quickly to the root cause.

Based on the symptoms described in the previous article, these 3 problems are most likely related. The RMAN error is not a common RMAN configuration problem: according to the information in the alert file, it is caused by a shared resource. Querying V$LOCK then shows 3 jobs that have not finished for a long time, which is unlikely to be a coincidence. Although we have not yet looked at what these 3 jobs are doing, it is reasonable to suspect that the resource RMAN is waiting for is held by the jobs. This links the first and second problems. In addition, the shared resource information in the alert file is itself closely tied to the RAC environment, because the LMD process (Global Enqueue Service Daemon) exists only in a RAC environment. This suggests that the 3 problems are not independent but interrelated.
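One way to test the suspicion that a waiter is queued behind the jobs is to pair each pending lock request with the sessions that hold the same resource. The query below is a minimal sketch and is not part of the original diagnosis. It assumes access to GV$LOCK, the cluster-wide counterpart of V$LOCK, and follows the common convention of matching holders and waiters on TYPE, ID1 and ID2.

```sql
-- Sketch: list holder/waiter pairs across all RAC instances.
-- A holder has LMODE > 0; a waiter has REQUEST > 0 on the same (TYPE, ID1, ID2).
SELECT h.inst_id AS holder_inst,
       h.sid     AS holder_sid,
       w.inst_id AS waiter_inst,
       w.sid     AS waiter_sid,
       h.type,
       h.id1,
       h.id2,
       h.lmode   AS held_mode,
       w.request AS requested_mode,
       w.ctime   AS waiter_ctime
  FROM gv$lock h
  JOIN gv$lock w
    ON w.type = h.type
   AND w.id1  = h.id1
   AND w.id2  = h.id2
 WHERE h.lmode   > 0
   AND w.request > 0
   AND (h.inst_id <> w.inst_id OR h.sid <> w.sid)
 ORDER BY w.ctime DESC;
```

If the query returns no rows, that does not rule out contention. It may only mean that the waiter is not queued on an enqueue visible in this view at the moment the query runs.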

Since the 3 problems are related, which one should we start with? In principle any of them will do, because they all lead to the same root cause in the end. In practice the choice matters: picking a starting point blindly is likely to mean more effort for less result. Two considerations generally guide the choice. The first is to analyze the symptom that is closest to the cause. If symptom A causes symptom B, analyze A, because analyzing B is less valuable: A is closer to the root cause. The second is ease of entry. Start with the symptom that is easiest to get into, so that progress is faster and, if the direction turns out to be wrong, you find out quickly.

Of the current 3 problems, the RMAN connection error is initially suspected to be related to the running jobs. In other words, the RMAN problem is probably caused by the job problem, so the job problem should be analyzed in preference to the RMAN problem. The dependency between the job problem and the racgmain check processes cannot yet be determined, so causality alone does not tell us which of the two to analyze first. That leaves the difficulty of each analysis. Job execution information is stored in the database, where many views are available for reference, and this kind of analysis is familiar ground. The racgmain check process, by contrast, is completely opaque: even where individual trace files exist, their contents are all but indecipherable and bound to be hard to analyze. Clearly, the unfinished jobs in the database should be analyzed first.

Incidentally, the method above applies to most situations, but not all. Take the same 3 problems: given the information obtained so far, if you want to search Metalink for help, the RMAN problem should come first. It carries a lot of precise information, both the RMAN error message and the command that triggered the error, so it is easy to find the most relevant problem descriptions. For the racgmain check processes, "racgmain check" is itself a good search keyword, and related information is easy to find in Metalink. The job problem is not unsearchable, but the information available so far is not enough to find anything meaningful. You first need to investigate why the jobs have not finished in order to find suitable search keywords. Although the RMAN problem and the racgmain check problem lend themselves to searching, unfortunately few of the search results resemble the current situation, and it is hard to tell whether the problems described in Metalink are the same as the current one. Therefore, the analysis continues with the job information in Oracle.

SQL> SELECT SID, JOB FROM DBA_JOBS_RUNNING;

       SID        JOB
---------- ----------
       118          4
       102         74
       289         27

SQL> COL WHAT FORMAT A60
SQL> SELECT JOB, LOG_USER, WHAT
  2  FROM DBA_JOBS
  3  WHERE JOB IN (4, 27, 74);

       JOB LOG_USER             WHAT
---------- -------------------- ------------------------------------------------------------
         4 NDMAIN               DBMS_STATS.GATHER_SCHEMA_STATS(USER, CASCADE => TRUE);
        27 ZHEJIANG             dbms_stats.gather_schema_stats(user, cascade => true);
The P_project_stat GPO;

According to the job information, the first two jobs gather statistics, while the third job runs a user-defined procedure.
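To see what the sessions behind these jobs are doing at the moment, the running jobs can be joined to their sessions. The following is only a sketch under stated assumptions: it presumes Oracle 10g or later, where V$SESSION exposes the EVENT and SECONDS_IN_WAIT columns, and it only sees sessions on the instance you are connected to.

```sql
-- Sketch: show start time and current wait event for each running job.
-- THIS_DATE is the time the current execution of the job started.
SELECT r.job,
       r.sid,
       s.serial#,
       r.this_date AS started_at,
       s.status,
       s.event,
       s.seconds_in_wait
  FROM dba_jobs_running r
  JOIN v$session s
    ON s.sid = r.sid
 ORDER BY r.this_date;
```

A session that has been sitting on the same wait event for a very long time is a natural candidate for closer inspection.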

SQL> SELECT SID, TYPE, ID1, ID2, LMODE, REQUEST, CTIME, BLOCK
  2  FROM V$LOCK
  3  WHERE SID IN (102, 118, 289)
  4  ORDER BY CTIME DESC;

       SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------- -- ---------- ---------- ---------- ---------- ---------- ----------
       118 JQ          0          4          6          0     151090          2
       118 TO     196404          1          3          0     151087          2
       118 TX     262181      87690          6          0     151087          2
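The numeric columns in this output are easier to read once decoded. The query below is a sketch rather than part of the original session: it reuses the same three SIDs, translates LMODE and REQUEST into the standard lock mode names, and expresses CTIME (seconds) in hours. For example, a CTIME of 151090 seconds is roughly 42 hours. In a RAC environment a BLOCK value of 2 marks a global lock, which may or may not actually be blocking another session.

```sql
-- Sketch: decode lock modes and convert CTIME from seconds to hours.
-- The SID list comes from the DBA_JOBS_RUNNING output shown earlier.
SELECT sid,
       type,
       id1,
       id2,
       DECODE(lmode,   0, 'None', 1, 'Null', 2, 'Row-S (SS)', 3, 'Row-X (SX)',
                       4, 'Share', 5, 'S/Row-X (SSX)', 6, 'Exclusive',
                       TO_CHAR(lmode))   AS held_mode,
       DECODE(request, 0, 'None', 1, 'Null', 2, 'Row-S (SS)', 3, 'Row-X (SX)',
                       4, 'Share', 5, 'S/Row-X (SSX)', 6, 'Exclusive',
                       TO_CHAR(request)) AS requested_mode,
       ROUND(ctime / 3600, 1)            AS hours_in_mode,
       block
  FROM v$lock
 WHERE sid IN (102, 118, 289)
 ORDER BY ctime DESC;
```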
