Oracle Diagnostics Case-job task stops executing [final version]

Source: Internet
Author: User
Tags execution final include reference reserved variable sqlplus
oracle| execution
Oracle Diagnostics Case-job task stops executing

Last Updated:friday, 2004-11-26 9:48 eygle






Yesterday, a researcher reported that the database timed tasks did not perform properly, causing some operations to fail.

began to intervene to deal with the accident.
System environment:
SunOS DB 5.8 generic_108528-21 sun4u SPARC sunw,ultra-4
Oracle9i Enterprise Edition Release 9.2.0.3.0-production

1. First step in checking database tasks



$ sqlplus "/as SYSDBA" Sql*plus:release 9.2.0.3.0-production on Wed Nov 2004Copyright 20:23:53 (c) 1982, 2002, Oracl E Corporation. All rights reserved. Connected to:oracle9i Enterprise Edition release 9.2.0.3.0-productionwith The partitioning, OLAP and Oracle Data Mining Optionsjserver release 9.2.0.3.0-productionsql> Select Job,last_date,last_sec,next_date,next_sec,broken, Failures from Dba_jobs; JOB last_date last_sec next_date next_sec B failuresinterval-------------------------------------------------------- -------------------------------------------16-nov-04 01:00:02 17-nov-04 01:00:00 N 0trunc (sysdate+1) +1/24-16-nov -04 00:00:04 17-nov-04 00:00:00 n 0TRUNC (sysdate) + 1 29 16-nov-04 01:00:02 17-nov-04 01:00:00 N 0trunc (sysdate+1) +1/24 16-nov-04 00:00:04 17-nov-04 00:00:00 n 0TRUNC (sysdate) + 1 01-nov-04 06:00:01 01-dec-04 06:00:00 n 0trunc (add_months ( sysdate,1), ' MM ') +6/24 16-nov-04 04:00:03 17-nov-04 04:00:00 N 0trunc (sysdate+1) +4/24/16-nov-04 02:14:27 17-nov-04 02:14:27 n 0sysdate+1, 16-nov-04 03:00:02 17-nov-04 18:14:49 n 0trunc (sysdate+1) +3/248 rows selected.

The Discovery Job task is not performing properly and the earliest one should be executed at 17-nov-04 01:00:00. But not executed.

2. Establish a Test job



Create or replace PROCEDURE pining Isbegin NULL; End;/variable jobno number;variable instno number;begin Select Instance_number into:instno from V$instance; Dbms_job.submit (: Jobno, ' pining; ', trunc (sysdate+1/288, ' mi '), ' trunc (sysdate+1/288, ' mi ') ', TRUE,: Instno); end;/

Found the same, not executed.
But there is no problem with dbms_job.run (<job>) execution.

3. Make a recovery attempt

Suspect is CJQ0 process invalidation, first set job_queue_processes for 0,oracle will kill CJQ0 and corresponding job process
sql> ALTER SYSTEM SET job_queue_processes = 0;

Wait 2-3 minutes, reset

sql> ALTER SYSTEM SET job_queue_processes = 5;

At this point, Pmon will restart the CJQ0 process.

You can see the following information in the alert log:



Thu Nov 11:59:50 2004ALTER SYSTEM SET job_queue_processes=0 scope=memory; Thu Nov 12:01:30 2004ALTER SYSTEM SET job_queue_processes=10 scope=memory; Thu Nov 12:01:30 2004Restarting dead Background process cjq0cjq0 started with pid=8



But the job still does not execute, and when revise again, the CJQ0 dies directly.



Thu Nov 13:52:05 2004ALTER SYSTEM SET job_queue_processes=0 scope=memory; Thu Nov 14:09:30 2004ALTER SYSTEM SET job_queue_processes=10 scope=memory; Thu Nov 14:10:27 2004ALTER SYSTEM SET job_queue_processes=0 scope=memory; Thu Nov 14:10:42 2004ALTER SYSTEM SET job_queue_processes=10 scope=memory; Thu Nov 14:31:07 2004ALTER SYSTEM SET job_queue_processes=0 scope=memory; Thu Nov 14:40:14 2004ALTER SYSTEM SET job_queue_processes=10 scope=memory; Thu Nov 14:40:28 2004ALTER SYSTEM SET job_queue_processes=0 scope=memory; Thu Nov 14:40:33 2004ALTER SYSTEM SET Job_queue_processes=1 scope=memory; Thu Nov 14:40:40 2004ALTER SYSTEM SET job_queue_processes=10 scope=memory; Thu Nov 15:00:42 2004ALTER SYSTEM SET job_queue_processes=0 scope=memory; Thu Nov 15:01:36 2004ALTER SYSTEM SET job_queue_processes=15 scope=memory;

4. Try to restart the database
This has to be done at night.



Pmon started with pid=2dbw0 started with PID=3LGWR started and pid=4ckpt with started Pid=5smon with started St Arted with Pid=7cjq0 started and Pid=8qmn0 started with pid=9 ....

CJQ0 starts normally, but the job still does not execute.

5. There's no way ...

Continue to study ... Actually found Oralce had such a bug

1. Clear description of the problem encountered:
SLGCSF ()/Slgcs () on Solaris'll stop incrementing after
497 days 2 hrs mins (approx) machine uptime.


2. Pertinent configuration information
No special configuration other than long machine uptime. .

3. Indication of the frequency and predictability of the problem
100% but only after 497 days.

4. Sequence of events leading to the problem
If the Gethrtime () OS call returns a value > 42949672950000000
Nanoseconds then Slgcs () stays at 0xFFFFFFFF. This can
Cause some problems in parts of the code which rely on
Slgcs () to keep moving.
Eg:in KKJSSRH () does "now = Slgcs (&se)" and compares that
to a previous timestamp. After 497 days uptime Slgcs ()
Keeps returning 0xFFFFFFFF so "NOW-KKJLSRT" would
Always return 0. .

5. Technical impact on the customer. Include persistent After effects.
In this case DBMS, JOBS stopped running after 497 days uptime.
The other symptoms could occur in various places in the code.

Well, it turns out that the timer overflowed, a check on my host:

bash-2.03$ Uptime 10:00pm + day (s), 14:57, 1 user, load average:1.31, 1.09, 1.08bash-2.03$ Datefri-Nov 22:00:14 C ST 2004



It was a little more than 497 days when it happened. Ft.

6. Arrange to reboot the mainframe system.

This question is depressing enough, NND, who had wanted Oracle this to be ...

Oracle finally claimed:

Fix made it into 9.2.0.6 Patchset

9206 has not been released on Solaris ... Dizzy.

Well, as an experience, if there is a very strange question, then boldly suspected Oracle bar, is a bug, may be a bug.

The problem is solved after the reboot, the status is as follows:

$ sqlplus "/as SYSDBA" Sql*plus:release 9.2.0.3.0-production on Fri Nov-09:21:21 2004Copyright (c) 1982, 2002, Oracl E Corporation. All rights reserved. Connected to:oracle9i Enterprise Edition release 9.2.0.3.0-productionwith The partitioning, OLAP and Oracle Data Mining Optionsjserver release 9.2.0.3.0-productionsql> select Job,last_date,last_sec,next_date,next_sec from User_jobs; JOB last_date last_sec next_date next_sec------------------------------------------------------------26-nov-04 09 : 21:04 26-nov-04 09:26:00sql>/JOB last_date last_sec next_date next_sec------------------------------------------ ------------------26-nov-04 09:26:01 26-nov-04 09:31:00sql> sql> select * from V$timer; Hsecs----------3388153sql> select * from V$timer; Hsecs----------3388319sql>



7.FAQ

Some friends ask questions on the pub.
Q: Does the same problem exist for different platforms?

A: For different platforms, there is the same problem
Because Oracle uses the standard C function gethrtime
Reference:
Http://www.eygle.com/unix/Man.Page.Of.gethrtime.htm

There is a problem with the code that uses the function.

In the Metalink note:3427424.8 document, Oracle-defined platform impact is: Generic (all/most platforms affected)

Q. Counter overflow, looked at the job is basically about 1 days to perform once, if the set 3 days to perform a job, whether the problem of uptime should be 497*3 after it?

A: No

Oracle internally passes timers to improve relative time.
Because the Oracle internal hrtime_t uses a 32-bit count

So the maximum value is 0xFFFFFFFF.
0xFFFFFFFF = 4294967295

The Slgcs () is 1 billion of a second, spilling over the 42949672950000000 point.

Note that here 0xFFFFFFFF, when this value is reached, is an unsigned integer and now becomes-1, then the value increments by +1 = 0.
Time has stopped.

I wrote a small piece of code to verify this content, for reference:

[Oracle@jumper oracle]$ cat unsign.c#include int main (void) {unsigned int num = 0xffffffff;printf ("num is%d bits long\n", sizeof (NUM) * 8);p rintf ("num = 0x%x\n", num);p rintf ("num + 1 = 0x%x\n", num + 1); return 0;} [Oracle@jumper oracle]$ gcc-o unsign.sh unsign.c[oracle@jumper oracle]$./unsign.shnum is-bits longnum = 0xffffffffnum + 1 = 0x0[oracle@jumper oracle]$




Q: One of the internal clocks should be this: V$timer accurate to 1/100 seconds of data

That's right!

Pay attention to what you said earlier:

4. Sequence of events leading to the problem
If the Gethrtime () OS call returns a value > 42949672950000000
Nanoseconds then Slgcs () stays at 0xFFFFFFFF. This can
Cause some problems in parts of the code which rely on
Slgcs () to keep moving.

In other words, if the gethrtime () operating system call returns a value greater than 42949672950000000 (in 1 billion per second)

In other words, Oracle will get a time value of 4294967295 for the CS value

And the 4294967295 value is 0xFFFFFFFF.

So V$timer's timing was:

Sql> select * from V$timer; Hsecs----------4294967295sql>/hsecs----------4294967295sql>/hsecs----------4294967295sql>







The author of this article:
Eygle,oracle technology concern, from China's largest Oracle technology forum Itpub.
Www.eygle.com is the author's personal site. You can contact the author by Guoqiang.Gai@gmail.com. Welcome technical discussions and exchange of links.

Original source:

Http://www.eygle.com/case/Job.Can.Not.Execute.Auto.htm




Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.