How to process 430 million records per day in SQL Server


Project background

This was a project done for a data center. The project was extremely difficult, and it really made me feel that the marketplace is like a battlefield and I was just one soldier in it: too many tactics, too many contests between the higher-ups, too much insider dealing. I may write related posts about the specifics of this project when I have time.

The project required environmental monitoring. For now, let's call the monitored equipment acquisition devices, and the properties of an acquisition device monitoring indicators. The project requirements: the system must support no fewer than 100,000 monitoring indicators, each monitoring indicator must be updated at least once every 20 seconds, and storage latency must not exceed 120 seconds. A simple calculation then gives the ideal amount of data to store: 300,000 records per minute, 18 million per hour, that is, 432 million per day. In practice the data volume is about 5% larger than this. (Most of it is actually junk data that could be reduced by compression, but when others are out to make trouble for you, what can you do?)

Those are the required targets. I imagine many students with big-data experience will turn up their noses: is that all? Well, I had also read plenty about big-data processing but had never dealt with it before; reading other people's well-written articles about distributed systems and read/write separation, it all looked easy to solve. The problem, however, was not that simple. As I said above, this was a very bad project, a typical case of vicious competition within the industry.

    1. There were no additional servers. Besides the database and the centralized collector (the program that does data parsing, alarming, and storage), this one server also had to support a 300,000-point northbound interface (SNMP). Before the program was optimized, the CPU sat at 80% or more all year round. Because the project required dual-machine hot standby, to save effort and avoid unnecessary trouble we put the related services together so we could take full advantage of the HA features (an externally purchased HA system).
    2. The requirements on data correctness were extremely strict: from the bottom-level collection system all the way to the top-level monitoring system, not a single record may be lost.
      Our system architecture is shown below. You can see that the database is under very heavy pressure, especially at the LevelA node:

    3. The hardware configuration is as follows:
      CPU: Intel® Xeon® Processor E5-2609 (4 cores, 2.40GHz, 10MB, 6.4 GT/s)
      Memory: 4GB (2x2GB) DDR3 RDIMM, 1333MHz, ECC
      HDD: 500GB 7200 RPM 3.5" SATA3 HDD, RAID5.
    4. Database version
      SQL Server 2012 Standard Edition, genuine software provided by HP, which lacks many of the nice features of the Enterprise Edition.
Write bottlenecks

The first obstacle we ran into was that with the existing program, SQL Server simply could not handle this volume of data. What was going on?

Our storage structure

Generally, to store a large amount of historical data we physically split tables; otherwise, with millions of records every day, a single table grows to hundreds of millions of rows within a year. So our table structure was this:

CREATE TABLE [dbo].[His20140822](
    [No] [bigint] IDENTITY(1,1) NOT NULL,
    [Dtime] [datetime] NOT NULL,
    [MgrObjId] [varchar](36) NOT NULL,
    [Id] [varchar](50) NOT NULL,
    [Value] [varchar](50) NOT NULL,
    CONSTRAINT [PK_His20140822] PRIMARY KEY CLUSTERED
    (
        [No] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

No is the unique identifier; the other columns are the acquisition device ID (a GUID, MgrObjId), the monitoring indicator ID (varchar(50), Id), the record time (Dtime), and the recorded value (Value). The acquisition device ID and the monitoring indicator ID were indexed so that lookups would be fast.
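The original index DDL is not shown here; a minimal sketch of what those secondary indexes might have looked like (the index names, and the choice of two single-column indexes rather than one composite index, are assumptions):

    CREATE NONCLUSTERED INDEX IX_His20140822_MgrObjId ON dbo.His20140822(MgrObjId)
    CREATE NONCLUSTERED INDEX IX_His20140822_Id ON dbo.His20140822(Id)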

Bulk Write

Writes were done with SqlBulkCopy. Yes, that one, the one praised for writing millions of records in seconds:

    public static int BatchInsert(string connectionString, string desTable, DataTable dt, int batchSize = 500)
    {
        using (var sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.UseInternalTransaction)
        {
            BulkCopyTimeout = 300,
            NotifyAfter = dt.Rows.Count,
            BatchSize = batchSize,
            DestinationTableName = desTable
        })
        {
            // Map each DataTable column to the destination column with the same name
            foreach (DataColumn column in dt.Columns)
                sbc.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            sbc.WriteToServer(dt);
        }
        return dt.Rows.Count;
    }
What's the problem?

With the architecture above, 40 million records a day was no problem. But when the configuration was scaled up to the volumes described above, the central monitoring program ran out of memory. Analysis showed it was receiving too much data and holding it in memory without being able to write it to the database in time; data was being produced faster than it was consumed, memory overflowed, and the program stopped working.

Where exactly is the bottleneck?

Was it a problem with the RAID disks? With the data structure? With the hardware? With the SQL Server version? With the absence of a partitioned table? Or a problem with the program itself?

At the time we had only one week. If it wasn't sorted out within a week, the project supervisor would ask us to leave. So came the feat of working 48 hours straight, and phone calls everywhere begging people for help...

But what was needed at that moment was calm, and more calm... The SQL Server version? The hardware? Neither was likely to be changed at that point. The RAID array? Probably not it either. So what on earth was it? Honestly, it was damned hard to stay calm.

You probably cannot feel the tension of that atmosphere; even now, after all this time, it is hard for me to put myself back into that situation. But I can say this: maybe we have all kinds of methods now, or as bystanders we think more clearly, but when a project is pressing you to the point of giving up, your thinking, constrained by the environment on site, can go badly astray. You may think quickly, or your thinking may simply stall. Some colleagues under that high pressure made even more low-level mistakes; their thinking was completely scrambled and they became less efficient... 36 hours without sleep, or only a nap of two or three hours at the construction site (where the ground was mud on rainy days and mud dust when dry), and then back to work, for a week straight! And it had to go on!

Many people offered many ideas, but they all seemed useful and yet useless. Wait, why did they "seem useful yet useless"? I vaguely felt I had grasped a direction; what was it exactly? By the way, we were verifying things in the live field environment, and the fact that there had been no problem before did not mean there was no problem under the current pressure. Analyzing one small function inside a large system has too many confounding influences, so we should break it apart. Yes: "unit testing", testing one method at a time. We needed to verify each function separately: how long does each individual step take?

Stepwise testing to verify system bottlenecks

Modifying the SqlBulkCopy parameters
First of all I thought of the various parameters of SqlBulkCopy. BulkCopyTimeout and BatchSize were adjusted and tested over and over, but the results always fluctuated within a range and made no real difference. It may have affected the CPU a little, but it fell far short of my expectations: the write speed still hovered around 10,000-20,000 records per 5 seconds, far below the requirement of writing 200,000 records within 20 seconds.

Storage by acquisition device
Yes, in the structure above every value of every indicator is one record. Isn't that rather wasteful? Could we use acquisition device + acquisition time as one record instead? The question is how to handle the fact that different acquisition devices have different properties. Here a colleague showed his talent: the monitoring indicators and their values can be stored in XML format. Wow, you can do that?! And for queries, you can use the FOR XML form.

So we switched to this structure: No, MgrObjId, Dtime, XMLData
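The text only names the columns; a minimal sketch of what such a per-device table and a query against it might look like (the table name, the XML shape, and the use of the xml data type with .value() are assumptions for illustration):

    CREATE TABLE dbo.HisXml20140822(
        [No] bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
        [MgrObjId] varchar(36) NOT NULL,   -- acquisition device ID
        [Dtime] datetime NOT NULL,         -- acquisition time
        [XMLData] xml NOT NULL             -- all indicator IDs and values for this device at this time
    )

    -- Pull one indicator's value back out of the XML column
    SELECT Dtime,
           XMLData.value('(/row/item[@id="indicator-id"]/@value)[1]', 'varchar(50)') AS Value
    FROM dbo.HisXml20140822
    WHERE MgrObjId = ''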

The results were slightly better than with the structure above, but not obviously so.

Data table Partitioning???
At the time I had not learned this technique yet. From the articles I read online it looked rather complicated, and with so little time left I did not dare to try it.

Stop other programs
I knew this could not be the real solution, because the software and hardware architecture could not be changed for the moment. But I wanted to verify whether these factors had an influence. The result: the improvement was indeed noticeable, but it still did not meet the requirements.

Is it the bottleneck of SQL Server?
With nothing else left, was this SQL Server's bottleneck? I looked up material online: it might be an I/O bottleneck. And what could be done about that, upgrade the server, switch databases? But would the project party go for that?

Wait, there seemed to be one more thing: the index, yes, the index! The presence of an index affects inserts and updates.

Remove index

Yes, without the index queries were bound to be slow, but I had to verify whether removing the index would speed up writes. So I decisively removed the indexes on the two fields MgrObjId and Id.
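A minimal sketch of the removal, using the assumed index names from the earlier sketch:

    DROP INDEX IX_His20140822_MgrObjId ON dbo.His20140822
    DROP INDEX IX_His20140822_Id ON dbo.His20140822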

Run it, and a miracle appeared: each batch of 100,000 records was written completely within 7-9 seconds, which met the system requirement.

How to solve the query?

A table gaining more than 400 million records a day is impossible to query without an index. What to do!? I thought of our old trick again: physical table splitting. Yes, we had been splitting tables by day; now we would split them by hour. That gives 24 tables a day, each of which only needs to store about 18 million records.

Then query one indicator's history over an hour or a few hours. The result: slow! Slow!! Slow!!! Querying more than 10 million records without an index is simply unthinkable. What else could be done?

Keep splitting, I thought. We could also split tables by the underlying collector, because the acquisition devices attached to different collectors are different, and when we query a historical curve we only ever look at the historical curve of a single indicator, so those queries can be scattered across different tables.

No sooner said than done. Splitting by the 10 collectors and by the 24 hours produces 240 tables per day (the history table names look like his_001_2014112615). In the end, the problem of writing more than 400 million records a day while supporting simple queries was solved!!!
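The creation script for these tables is not shown; a minimal sketch of generating the 240 per-collector, per-hour tables with dynamic SQL (the collector numbering, naming pattern, and column layout are assumptions based on the description above):

    DECLARE @collector int = 1, @hour int;
    DECLARE @day char(8) = '20141126', @table sysname, @sql nvarchar(max);
    WHILE @collector <= 10
    BEGIN
        SET @hour = 0;
        WHILE @hour <= 23
        BEGIN
            -- e.g. his_001_2014112615 = collector 001, 2014-11-26, hour 15
            SET @table = 'his_' + RIGHT('00' + CAST(@collector AS varchar(3)), 3) + '_'
                       + @day + RIGHT('0' + CAST(@hour AS varchar(2)), 2);
            SET @sql = N'CREATE TABLE dbo.' + QUOTENAME(@table) + N'(
                [No] bigint IDENTITY(1,1) NOT NULL,
                [Dtime] datetime NOT NULL,
                [MgrObjId] varchar(36) NOT NULL,
                [Id] varchar(50) NOT NULL,
                [Value] varchar(50) NOT NULL,
                CONSTRAINT ' + QUOTENAME('PK_' + @table) + N' PRIMARY KEY CLUSTERED ([No] ASC))';
            EXEC sp_executesql @sql;
            SET @hour += 1;
        END
        SET @collector += 1;
    END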

Query optimization

After the problems above were solved, half of the project's difficulty was gone, and the project supervisor was too embarrassed to keep coming to find fault. I don't know what kind of tactical arrangement was behind that.

A long while later, towards the end of the year, trouble came again: they would drag you along to the end of the year so that you could not take on other projects.

The new demand was this: the figures above were simulated with 100,000 monitoring indicators, but in the actual deployment only about 50,000 devices' worth were online. That obviously could not satisfy the tender requirements and could not pass acceptance. So what to do? These clever people thought: since the monitoring indicators are halved, let's halve the time as well; doesn't that meet the target? That is, with the current 50,000 indicators, storage has to complete within 10 seconds. Come on: by that logic, if we had only 500 monitoring indicators, would storage have to finish within 0.1 seconds? Don't the monitored devices get any say at all?

But what can you do when someone is determined to play with you like this? So the time limit became 10 seconds, and the problem appeared. Analyzing the logic above carefully: the tables are split by collector; now there are fewer collectors, but more indicators per collector. What happens? Writes can still keep up, but each table holds close to 4 million records, and for collectors with more monitoring indicators, close to 6 million. How to break through?

The technical staff met to discuss countermeasures.

How to optimize queries without indexing?

Some colleagues suggested that the order of the conditions in the WHERE clause affects query performance, because each condition filters the result set before the next one is applied: the first condition filters out part of the data, and the next condition then filters what is left. It sounded quite reasonable, but doesn't the SQL Server query analyzer optimize this automatically? Forgive my ignorance, but I felt it should behave like the VS compiler and optimize automatically.

What to do? Let the facts speak:

After the colleague modified the client, the test feedback was that there was a big improvement. I looked at the code:

Does it really have such a big impact? Wait, did he forget to clear the cache and create an illusion?
So I asked the colleague to execute the following statements to get more information:

-- Before optimization
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS IO ON
SELECT Dtime, Value FROM dbo.his20140825 WHERE Dtime >= '' AND Dtime <= '' AND MgrObjId = '' AND Id = ''
SET STATISTICS IO OFF

-- After optimization
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS IO ON
SELECT Dtime, Value FROM dbo.his20140825 WHERE MgrObjId = '' AND Id = '' AND Dtime >= '' AND Dtime <= ''
SET STATISTICS IO OFF

The results are as follows:

Was it actually better before the "optimization"?

Looking closely at the IO data, we found that the read-ahead counts were the same, meaning the data records to be queried were consistent, and the physical reads and table scans were the same as well. Only the logical reads differed slightly, which should be due to cache hits. In other words, without an index, the order of the conditions in the WHERE clause does not optimize the query at all.

So the only option left was the index.

Attempt to build an index

Building indexes is not a simple matter; it requires some basic knowledge. I took quite a few detours in this process before finally getting the indexes set up properly.

The following experiments were all validated against the following total number of records:

Index by a single field
The idea, largely influenced by the data structures I build in code, was this: my in-memory structure is Dictionary<MgrObjId, Dictionary<Id, Property>>. I thought that creating an index on MgrObjId first and then an index on Id would let SQL Server query faster.

Indexing by MgrObjId first: index size 550MB, build time 5 minutes 25 seconds. The result: just as the estimated plan suggested, the index was not used at all, and the query was actually slower.
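The statement used here is not shown; a sketch of a single-column index on MgrObjId (the index name is an assumption):

    CREATE NONCLUSTERED INDEX Idx_His20141008_MgrObjId ON dbo.his20141008(MgrObjId)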

Indexing by multiple criteria
OK, since that did not work, what about building the index on multiple columns?

CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008(MgrObjId,Id,Dtime)

The result: the query speed did indeed double:

Wait, is that all the benefit an index brings? Trading 7 minutes 25 seconds and 1.1GB of space just for this? There had to be something wrong, so I started digging through references and reading some related books, and in the end made further progress.

Correct index creation

First, we need to understand several key points about indexes:

    • After indexing, putting the indexed field with the fewest duplicate values first gives the best results. In our table, if a clustered index on No exists, putting No first in the WHERE clause is best, followed by Id, then MgrObjId, and the time last; and since each table only covers one hour, it is best not to use the time condition at all.
    • The order of the WHERE clause determines whether the query analyzer uses the index. For example, if an index on MgrObjId and Id is created, then WHERE MgrObjId='' AND Id='' AND Dtime='' will use the index lookup, while WHERE Dtime='' AND MgrObjId='' AND Id='' will not necessarily use it.
    • Put the non-key result columns into the included columns. Since our conditions are MgrObjId, Id and Dtime, the result only needs to return Dtime and Value, so putting Dtime and Value in the included columns means the index itself already contains those values; without touching the physical table, we achieve the optimal speed.

Following the principles above, we create the following index:

CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008(MgrObjId,Id) INCLUDE(Value,Dtime)

Time to build: 6 minutes, with an index size of 903MB.

Let's look at the estimated plan:

As you can see, the index is fully used here, with no extra cost. And the actual execution result: in less than 1 second, the result was filtered out of 11 million records!! Awesome!!

How do I apply an index?

Now that writes are covered and reads are covered, how do we combine them? We can index the data that is more than one hour old, while the data of the current hour stays unindexed. In other words, do not create the index at the time the table is created!!
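One hedged way to automate this (the hourly scheduling, the collector number 001, and the naming convention are assumptions carried over from the sketches above) is a job step that builds the covering index on the previous hour's table once that hour has been fully written:

    DECLARE @prevHour varchar(10) =
        CONVERT(varchar(8), DATEADD(HOUR, -1, GETDATE()), 112)
        + RIGHT('0' + CAST(DATEPART(HOUR, DATEADD(HOUR, -1, GETDATE())) AS varchar(2)), 2);
    DECLARE @table sysname = 'his_001_' + @prevHour;   -- repeat per collector table
    DECLARE @sql nvarchar(max) =
        N'CREATE NONCLUSTERED INDEX ' + QUOTENAME('Idx_' + @table)
        + N' ON dbo.' + QUOTENAME(@table) + N'(MgrObjId, Id) INCLUDE(Value, Dtime)';
    EXEC sp_executesql @sql;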

How can I optimize it?

We can try read/write separation with two databases: a real-time library and a read-only library. Data from the last hour is queried from the real-time library; data older than one hour is queried from the read-only library, which is loaded on a schedule and then indexed; data older than one week is analyzed, processed, and then stored. That way, no matter what time period a query covers, it can be handled correctly: the last hour from the real-time library, one hour to one week from the read-only library, and anything older than one week from the report library.

If physical table splitting is not needed, the read-only library can simply rebuild its indexes periodically.
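A minimal sketch of such a periodic rebuild in the read-only library (the table name is an assumption; ALTER INDEX ALL rebuilds every index on the table):

    ALTER INDEX ALL ON dbo.his20141008 REBUILD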

Summary

Handling billions of rows of (historical) data in SQL Server can be done along the following lines:

      • Remove all indexes from the table
      • Insert with SqlBulkCopy
      • Split tables or use partitioning to reduce the amount of data per table
      • Build the index only after a table is completely written
      • Specify the correct index fields
      • Put the fields you need returned into the included columns (so the returned index contains everything)
      • Return only the fields you actually need when querying
