Cause and solution of SQL Server disk request timeout (error 833)

Source: Internet
Author: User

Recently, a SQL Server instance became extremely slow to respond and client requests began to fail. The database errorlog contained messages reporting that disk I/O requests had taken longer than 15 seconds to complete.

Is a problem like this caused by a storage system or disk fault, by SQL Server itself, or by the application? And how should it be handled?

This article gives a brief analysis of some of the factors that can cause this problem. It cannot cover every possibility, so similar problems still need to be analyzed case by case.

Disk request timeout in SQL Server

The English version of the error message is as follows:

SQL Server has encountered %d occurrence(s) of I/O requests taking longer than %d seconds to complete on file [%ls] in database id %d. The OS file handle is 0x%p. The offset of the latest long I/O is: %#016I64x

The Chinese version of the error message is as follows:

SQL Server has encountered %1! occurrence(s) of I/O requests to file [%3!] in database ID %4! taking longer than %2! seconds to complete. The operating system file handle is 0x%5!. The offset of the latest long I/O is: %6!

For reference, this is message 833 in the SQL Server system messages.
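When many of these messages accumulate in the errorlog, it can help to pull out the file and database they refer to. The following is a minimal sketch in Python that matches the English wording quoted above. The group names, the `parse_833` helper and the sample line are made up for illustration, and other SQL Server versions may word the message slightly differently.

```python
import re

# Pattern for the English text of message 833 as quoted above.
# The group names are illustrative labels, not official field names.
MSG_833 = re.compile(
    r"SQL Server has encountered (?P<count>\d+) occurrence\(s\) of I/O requests "
    r"taking longer than (?P<seconds>\d+) seconds to complete on file "
    r"\[(?P<file>[^\]]+)\] in database id (?P<db_id>\d+)"
)


def parse_833(line):
    """Return the fields of an 833 message found in `line`, or None."""
    m = MSG_833.search(line)
    if m is None:
        return None
    return {
        "count": int(m.group("count")),
        "seconds": int(m.group("seconds")),
        "file": m.group("file"),
        "db_id": int(m.group("db_id")),
    }


# Hypothetical errorlog line, invented for this example.
sample = (
    "SQL Server has encountered 3 occurrence(s) of I/O requests taking longer "
    "than 15 seconds to complete on file [D:\\Data\\sales.mdf] in database id 7."
)
print(parse_833(sample))
```

Grouping the parsed results by file shows quickly whether the slow requests concentrate on one data file or are spread across the whole instance.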

Disk request timeout caused by 833 error

The specific error is as follows:

SQL Server has encountered an I/O request to file *** in database n that took longer than 15 seconds to complete. The operating system file handle is ***. The offset of the latest long I/O is: ***

In this case, the error was raised while a database file was growing automatically.


Interestingly, the DBA reported this error to the engineers responsible for storage (SAN storage, not locally attached disks), suspecting that the storage system was faulty or unstable.

The storage engineers found nothing wrong with the storage. After checking the server, they said the server itself was abnormal because its memory was "almost full". For a database server, memory being almost full is normal. The storage engineers are not professional DBAs and may not be familiar with how SQL Server uses memory, so the question is understandable.

The database server uses high-performance SAN storage, which is provided as a service and shared by many servers. None of the other servers reported disk request timeouts, so a problem on a single server is unlikely to be a storage fault and should not simply be written off as one.

So what is the reason?

Description of database engine error 833

First, let's look at what error 833 actually means. Rather than explain it in my own words, I will defer to a classic reference book, which describes it very clearly.

In short, it means that a disk read or write requested by SQL Server did not complete within 15 seconds, because the disk was busy or for some other reason.
For example, reading or writing data requires a request to the disk. If the disk is busy or has some other problem, it responds late or not at all, and this seriously affects the response time of the services that SQL Server provides.

That is the simple explanation. Because this problem does not occur routinely and the storage system is unlikely to be at fault, the cause most likely lies with the server itself.

Cause Analysis

Because this is a dedicated SQL Server machine with no requests from other applications, the problem is most likely related to the requests sent to the SQL Server database.

In fact, there had been warning signs long before the error appeared. The server is normally stable (CPU rarely exceeds 60%, the memory PLE stays above 20 minutes, disk I/O latency is low, and so on), but occasionally it acts up for a while.

When it acts up, CPU usage rises to about 80%, the memory PLE drops sharply, and I/O latency increases sharply.
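To put a number on "I/O latency increases", one common approach is to sample the cumulative per-file counters that SQL Server exposes in `sys.dm_io_virtual_file_stats` twice and divide the stall time by the number of I/Os in between. The sketch below assumes the two samples have already been fetched into dictionaries. The `avg_latency_ms` function and the sample values are hypothetical; only the counter names come from that view.

```python
def avg_latency_ms(before, after):
    """Average read/write latency (ms per I/O) between two snapshots.

    Each snapshot is a dict holding the cumulative counters
    num_of_reads, io_stall_read_ms, num_of_writes, io_stall_write_ms.
    """
    reads = after["num_of_reads"] - before["num_of_reads"]
    writes = after["num_of_writes"] - before["num_of_writes"]
    read_stall = after["io_stall_read_ms"] - before["io_stall_read_ms"]
    write_stall = after["io_stall_write_ms"] - before["io_stall_write_ms"]
    return {
        # Guard against division by zero when no I/O happened in the interval.
        "read_ms": read_stall / reads if reads else 0.0,
        "write_ms": write_stall / writes if writes else 0.0,
    }


# Invented snapshot values for one data file, taken some time apart.
before = {"num_of_reads": 1000, "io_stall_read_ms": 5000,
          "num_of_writes": 200, "io_stall_write_ms": 1000}
after = {"num_of_reads": 1500, "io_stall_read_ms": 30000,
         "num_of_writes": 300, "io_stall_write_ms": 1500}
print(avg_latency_ms(before, after))
```

Working on the difference between two samples matters because the counters accumulate from instance startup, so a single reading averages over the whole uptime and hides a short spike.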

At this point, the only place to start was the SQL Server sessions. Looking at the active sessions in SQL Server, we found that one particular type of SQL statement was taking a very long time to run.
These statements are normally executed frequently within certain periods of time.

Under normal circumstances, however, these statements execute quite efficiently. Why did that suddenly change?

Checking the active sessions and their execution plans showed that all of them were in an I/O wait state (PAGEIOLATCH_SH), and the statements were running far less efficiently than expected.

Because similar queries are executed frequently and come from many different clients, a large number of active sessions pile up on the server as soon as execution efficiency drops.

So why did normally fast SQL statements become slow?

The reason is that, at some point, SQL Server automatically triggered a statistics update. The table involved is relatively large, and the default sampling ratio used by the automatic update was too low. With an insufficient sampling percentage, the resulting statistics are effectively unusable.

Once the statistics had been refreshed automatically, the optimizer chose an inefficient access method for the SQL statement (a table scan instead of an index seek) based on the newly collected statistics. That choice was not reasonable.
As a result, the statement ran with a poor execution plan, which in turn caused sessions to pile up: clients kept opening sessions, and every one of them executed slowly and inefficiently.

That is why CPU usage soared, I/O latency increased, and the memory PLE dropped sharply.

It is not hard to see what happened next. Dozens of query sessions were frantically sending disk requests through an inefficient plan, and the disk was saturated serving them. When a data or index file then needed to grow automatically, that request could not be served in time, and the error was raised.

In the end, the problem was resolved by rebuilding the index, which also refreshes its statistics (updating the statistics alone would have worked too). For long-term prevention, schedule a job that updates statistics using a manually defined threshold and sampling percentage.
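To see why a custom threshold can matter for a large table, the sketch below compares the number of row modifications that, as far as I know from the SQL Server documentation, triggers an automatic statistics update (500 + 20% of the rows for tables with more than 500 rows, or the smaller of that and SQRT(1000 * rows) under the dynamic threshold used from compatibility level 130) with a stricter policy a scheduled job might apply. The `needs_update` function and its 5% default are assumptions made for illustration, not values from the original case.

```python
import math


def auto_update_threshold(rows, dynamic=False):
    """Approximate modification count that triggers an automatic
    statistics update for a table with more than 500 rows."""
    classic = 500 + 0.20 * rows
    if dynamic:
        # Dynamic threshold: grows with the square root of the row count.
        return min(classic, math.sqrt(1000 * rows))
    return classic


def needs_update(rows, modifications, fraction=0.05):
    """Hypothetical job policy: refresh statistics once the number of
    modified rows reaches `fraction` of the table."""
    return modifications >= rows * fraction


rows = 10_000_000
print(auto_update_threshold(rows))                # classic threshold
print(auto_update_threshold(rows, dynamic=True))  # dynamic threshold
print(needs_update(rows, modifications=600_000))
```

A job built on a check like this would then run the statistics update with an explicit sampling percentage, or a full scan, for the tables that qualify, instead of relying on the default sample.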

Summary:

Many problems on a database server are chain reactions, and an observed symptom may not mean what it appears to mean on the surface (a disk request times out, so the problem must be in storage?).
Specialized problems need specialists. Here, the DBA initially mistook the issue for a storage problem, and the storage engineers thought that full server memory was abnormal. Neither was the root cause.
When facing a problem, it is essential to trace it back to its root cause.
