How do I process 400 million records per day in SQL Server

Source: Internet
Author: User
Tags: sql server, query
First, let me state that I am only a programmer, not a professional DBA. This article describes how a problem was solved step by step, rather than handing you the correct answer from the start; if anything in it is wrong, please point it out so that I can handle this kind of work better.

Project Background

This is a project for a data center, and an extremely difficult one. It really made me feel that the marketplace is a battlefield and I am just one soldier in it: too many tactics, too many contests among the higher-ups, too many inside stories. I will write separate blog posts about the details of this project when I have time.

The project requires environment monitoring. For now, let's call the monitored equipment "collection devices" and the collected properties "monitoring metrics". The project requirements: the system must support no fewer than 100,000 monitoring metrics, the data of each metric must be updated at an interval of no more than 20 seconds, and storage latency must not exceed 120 seconds. Simple arithmetic then gives the ideal figure for the data to be stored: 100,000 metrics every 20 seconds is 300,000 records per minute, 18 million per hour, and more than 400 million per day. In reality the data volume runs about 5% larger than this. (In fact most of it is junk data that could be handled by compression, but when other people are out to make trouble for you, that is not an option.)

Those are the figures the project requires. I imagine many students with plenty of big-data experience will scoff: is that all? Well, I had read a lot about big-data processing too, but had never actually done it, and watching other people hold forth about distributed this and read/write splitting that made it look easy. The problem, however, is not that simple. As I said above, this was a very unpleasant project, a typical case of vicious competition in the industry.

  1. There are no additional servers. This one server not only runs the database and the centralized collector (that is, the data parsing, alarm, and storage programs), but also has to serve a northbound interface (SNMP) for the 300,000 points. Before the program was optimized, CPU usage stayed above 80% all the time. Because the project requires dual-host hot standby, to save effort and avoid unnecessary trouble we put the related services together so as to make full use of the HA features (an externally purchased HA system).
  2. The requirements for data correctness are brutally strict. Not a single record may be lost or wrong on its way from the underlying collection system up to the top-level monitoring system.
    Our system architecture is as follows; you can see that the pressure on the database is very high, especially on the LevelA node:
  3. The hardware configuration is as follows:
    CPU: Intel® Xeon® Processor E5-2609 (4 cores, 2.40 GHz, 10 MB cache, 6.4 GT/s)
    Memory: 4 GB (2 x 2 GB) DDR3 RDIMM, 1333 MHz, ECC
    Hard disk: 500 GB 7200 RPM 3.5" SATA3, RAID 5.
  4. Database Version
    SQL Server 2012 Standard Edition is used. The licensed software supplied by HP lacks many of the powerful features of the Enterprise Edition.
Write bottleneck

The first obstacle we ran into was that, with the existing program, SQL Server simply could not cope with this much data. What was the situation, exactly?

Our Storage Structure

Large amounts of historical data are normally stored through physical table sharding; otherwise, with hundreds of millions of records a day, a single table would grow to over a hundred billion rows in a year. So our original table structure was as follows:

    CREATE TABLE [dbo].[His20140822](
        [No] [bigint] IDENTITY(1,1) NOT NULL,
        [Dtime] [datetime] NOT NULL,
        [MgrObjId] [varchar](36) NOT NULL,
        [Id] [varchar](50) NOT NULL,
        [Value] [varchar](50) NOT NULL,
        CONSTRAINT [PK_His20140822] PRIMARY KEY CLUSTERED ([No] ASC)
            WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]

No is the unique identifier (an identity column), MgrObjId is the collection device Id (a GUID stored as varchar(36)), Id is the monitoring metric Id (varchar(50)), Dtime is the record time, and Value is the recorded value. The collection device Id and metric Id are indexed so lookups are fast.
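The article does not show the DDL for those lookup indexes; a minimal sketch of what they presumably looked like (the index name is my own invention, and it could just as well have been two single-column indexes):

    -- Hypothetical reconstruction of the original lookup index (name assumed)
    CREATE NONCLUSTERED INDEX IX_His20140822_MgrObjId_Id
        ON dbo.His20140822 (MgrObjId, Id);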

Batch write

For writing we used SqlBulkCopy. That's right, the class that is said to write millions of records in a matter of seconds.

    // Bulk-inserts a DataTable into the destination table using SqlBulkCopy
    // (requires System.Data and System.Data.SqlClient).
    public static int BatchInert(string connectionString, string desTable, DataTable dt, int batchSize = 500)
    {
        using (var sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.UseInternalTransaction)
        {
            BulkCopyTimeout = 300,          // seconds before the copy times out
            NotifyAfter = dt.Rows.Count,    // raise SqlRowsCopied once, at the end
            BatchSize = batchSize,          // rows sent to the server per batch
            DestinationTableName = desTable
        })
        {
            // Map columns by name so the DataTable layout need not match ordinal positions.
            foreach (DataColumn column in dt.Columns)
                sbc.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            sbc.WriteToServer(dt);
        }
        return dt.Rows.Count;
    }
What are the problems?

With the architecture above, tens of millions of records a day was no problem. However, under the conditions described in the background above, the centralized monitoring program ran out of memory. Analysis showed that too much incoming data was piling up in memory without being written to the database; data was being produced faster than it was consumed, memory overflowed, and the program stopped working.

What is the bottleneck?

Was it the RAID array? A data-structure problem? A hardware problem? The SQL Server version? The absence of partitioned tables? Or a problem in the program itself?

At the time we had only one week. If it wasn't fixed within the week, the project supervisor would have us thrown off the project, so we worked 48 hours straight and called everyone we could think of to rope them in to help...

What we actually needed, though, was to calm down and then... The SQL Server version? The hardware? Neither was likely to be changed at this point. The RAID array? Probably not it either. So what was it? Honestly, it was damned hard to calm down.

You probably cannot imagine how tense the atmosphere on site was; in fact, after all this time it is hard even for me to put myself back in that situation. But I can say this: we may have all kinds of methods now, or more ideas as outsiders, but when a project is pressing you to the point of giving up, your thinking and judgment on site, under the constraints of that environment, can deviate badly. It may make you think fast, or it may make your mind go blank. In such a high-pressure environment some colleagues made even more elementary mistakes; their thinking was a complete mess and their efficiency even lower... Going 36 hours without closing your eyes, or only dozing for two or three hours squatting on the construction site (all mud when it rained, all dust when it dried), then carrying on working, for a whole week straight! Or carrying on still!

People offered plenty of ideas, but they all seemed useful and yet useless. Wait, why "useful and yet useless"? I seemed to have caught a faint sense of direction. What was it? Right: verification. We were running in the live environment, and the fact that there had been no problem before did not mean there was no problem under the current pressure. Analyzing such a small function inside a big system had too large a blast radius, so we had to break it down. Yes, unit testing: test each method on its own, and verify how long each independent step of every function takes.

Step-by-Step testing and verification of system bottlenecks

Modify BulkCopy Parameters
First I wanted to tune the SqlBulkCopy parameters, BulkCopyTimeout and BatchSize. With continuous testing and adjustment, the results always fluctuated within a certain range and nothing actually changed. It may have affected some CPU counters, but it fell far short of my expectations: write speed still fluctuated around 10,000 to 20,000 records per 5 seconds, far from the goal of writing 200,000 records every 20 seconds.

Storage by collection device
Right: the structure above stores one record per metric value. Isn't that terribly wasteful? Would it be feasible to store one record per collection device per collection time? The problem then is how to deal with the fact that different collection devices have different properties. Here a colleague got to show off: the monitoring metrics and their values could be stored as XML. Wow, you can do that? And queries could use FOR XML.

So we end up with this structure: No, MgrObjId, Dtime, XMLData.
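A minimal sketch of what such a table might look like; the article does not give the DDL, so the table name and column types here are assumptions:

    -- Hypothetical per-device layout: one row per device per collection time,
    -- with all metric values for that moment packed into an XML column.
    CREATE TABLE [dbo].[HisXml20140822](
        [No]       [bigint] IDENTITY(1,1) NOT NULL,
        [MgrObjId] [varchar](36) NOT NULL,   -- collection device Id
        [Dtime]    [datetime]    NOT NULL,   -- collection time
        [XMLData]  [xml]         NOT NULL,   -- e.g. one element per metric Id/value pair
        CONSTRAINT [PK_HisXml20140822] PRIMARY KEY CLUSTERED ([No] ASC)
    ) ON [PRIMARY]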

Verification showed the result was slightly better than before, but not by much.

Table partitioning???
I had not yet learned this technique at the time. Reading articles about it online, it looked rather complicated, and with time so short I did not dare to try it.
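For reference, a rough sketch of what date-based partitioning would have involved in SQL Server; the object names are made up, and the team never actually did this (table partitioning also requires Enterprise Edition in SQL Server 2012, which this project's Standard Edition did not have):

    -- Hypothetical sketch only: partition the history data by day
    CREATE PARTITION FUNCTION pfHisByDay (datetime)
        AS RANGE RIGHT FOR VALUES ('2014-08-22', '2014-08-23', '2014-08-24');

    CREATE PARTITION SCHEME psHisByDay
        AS PARTITION pfHisByDay ALL TO ([PRIMARY]);

    -- The table (and its clustered index, which must include the partitioning column)
    -- would then be created on the partition scheme instead of on [PRIMARY]:
    -- CREATE TABLE dbo.His ( ... ) ON psHisByDay(Dtime);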

Stop other programs
I knew this was definitely not viable, because the software and hardware architecture could not be changed for the time being, but I wanted to verify whether these factors had an impact. The result showed a clearly noticeable improvement, yet it still did not meet the requirements.

Is it the bottleneck of SQLServer?
No luck. Could it be a SQL Server bottleneck? I checked the relevant information online; it might be an I/O bottleneck. What else could I do? Upgrade the server and switch databases? But would the project ever grant that?

Wait, there seems to be something else: the index, the index! The existence of indexes affects inserts and updates.

Remove Index

Yes, queries will be slow once the index is removed, but I first had to verify whether removing it would speed up writes. So I removed the index on the MgrObjId and Id fields.
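In SQL terms this is just a DROP INDEX; a sketch using the hypothetical index name from earlier:

    -- Hypothetical: drop the lookup index so bulk writes are no longer slowed down
    DROP INDEX IX_His20140822_MgrObjId_Id ON dbo.His20140822;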

A miracle occurred: each batch of records was now written within about 9 seconds, which met the system requirements.

How to solve the query?

But a table that takes in more than 400 million records a day simply cannot be queried without an index. What to do?! I thought of our old trick: physical table sharding. Yes, we used to use daily tables, so now let's use hourly tables: 24 tables per day, each needing to hold only about 18 million records.

Then query the history of one metric over one hour or several hours. The result: slow! Slow!! Slow!!! With the index removed, querying among more than ten million records in a table was simply out of the question. What else could be done?

Keep sharding. We could shard further by the underlying collector, because the collection devices differ from collector to collector, and when we query a historical curve we only ever look at the curve of a single metric, so the data can safely be scattered across different tables.

So, with 10 embedded collectors and sharding by 24 hours, 240 tables are generated each day (the history table names look like His_001_2014112615), and the problem of writing 400+ million records per day while supporting simple queries was finally solved!!!
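A sketch of how one of these shard tables might be created on the hour; the naming pattern follows the article, while the DDL itself is my assumption (note that, per the later sections, no secondary index is created at this point):

    -- Hypothetical: the shard table for collector 001, hour 2014-11-26 15:00
    CREATE TABLE [dbo].[His_001_2014112615](
        [No]       [bigint] IDENTITY(1,1) NOT NULL,
        [Dtime]    [datetime]    NOT NULL,
        [MgrObjId] [varchar](36) NOT NULL,
        [Id]       [varchar](50) NOT NULL,
        [Value]    [varchar](50) NOT NULL,
        CONSTRAINT [PK_His_001_2014112615] PRIMARY KEY CLUSTERED ([No] ASC)
    ) ON [PRIMARY]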

Query Optimization

With the problem above solved, half of the project's difficulty was out of the way, and the project supervisor was too embarrassed to come looking for faults. Who knows what tactical arrangement was behind that.

After a while, as the end of the year drew near, the problem came back, the point being to drag you out to year end so that you cannot take on any other projects.

The demand was this: the figures above were for simulated monitoring metrics, but now that the system is actually online there are only about 50,000 of them. That clearly fails to meet the requirements in the bidding documents, so it cannot pass acceptance. So what to do? These clever people decided that since the monitoring metrics had been halved, the time would be cut in half as well; doesn't that make up for it? That is, with the current 50,000 points, the data must be stored within 10 s. Excuse me, but by that logic, if we had only 500 monitoring metrics, would we have to store them within 0.1 seconds? Have you ever spared a thought for how the monitored devices feel?

But when somebody wants to mess with you, what can you do? Take it. The result, after the interval was cut to 10 seconds, was that a problem appeared. Analyzing the logic above carefully: table sharding is done per collector; the number of collectors is now smaller, but the number of metrics each one carries has grown. Writing still holds up, but each table ends up with nearly 4 million records, and for some collection devices with many metrics it approaches 6 million. How can this be solved?

So the technical staff held a meeting to discuss how to respond.

How can I optimize the query without adding an index?

A colleague suggested that the order of conditions in the where clause would affect the query's performance: if the first condition filters out a subset of the data, the remaining conditions have less to process. It sounds reasonable, but wouldn't the SQL Server query analyzer optimize that automatically? Forgive my naivety; I just had a feeling it would behave like the VS compiler and optimize automatically.

The right way to settle it is to let the facts speak:

After the client was modified, the feedback from testing was that it had improved dramatically. So I reviewed the code:

Really, that big an effect? Wait, did you forget to clear the cache?
So I had my colleague run the following statements to get more information:

    -- Before optimization
    DBCC FREEPROCCACHE
    DBCC DROPCLEANBUFFERS
    SET STATISTICS IO ON
    SELECT Dtime, Value FROM dbo.His20140825 WHERE Dtime >= '' AND Dtime <= '' AND MgrObjId = '' AND Id = ''
    SET STATISTICS IO OFF

    -- After optimization
    DBCC FREEPROCCACHE
    DBCC DROPCLEANBUFFERS
    SET STATISTICS IO ON
    SELECT Dtime, Value FROM dbo.His20140825 WHERE MgrObjId = '' AND Id = '' AND Dtime >= '' AND Dtime <= ''
    SET STATISTICS IO OFF

The result is as follows:

Was the version before the "optimization" actually better?

Looking carefully at the IO figures, the read-ahead reads are identical, meaning the set of records being queried is the same; physical reads and table scans are identical too. Logical reads differ slightly only because of cache hit counts. In other words, when there is no index, the order of the conditions in the where clause has no obvious effect on query performance.

So the only option left was the index.

Index creation attempt

Creating an index is not a trivial matter; you need to know some basics. I took a lot of detours in the process before finally getting the index built properly.

The experiments below were all run against the same table with the same total number of records:

Index on a single field
This idea was mainly influenced by the data structure I use in memory: a nested Dictionary keyed first by MgrObjId and then by Id. I assumed that creating an index on MgrObjId and an index on Id would make SQL Server queries faster.

I created an index on MgrObjId; the index was 550 MB and took about 5 minutes 25 seconds to build. The result, just as the estimated execution plan suggested, was that it didn't help at all; it was actually slower.

Index creation based on multiple conditions
OK, since that didn't work, what about creating an index over multiple conditions?

    CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008(MgrObjId, Id, Dtime)

As a result, the query speed is indeed doubled:

Hold on. Is that all the benefit an index brings? Spending 7 minutes 25 seconds and 1 GB of space in exchange for this? Something had to be wrong. So I started digging through materials and reading relevant books, and eventually made some progress.

Correct index creation

First, we need to understand a few key points about indexes:

  • After indexing, putting first the index field that narrows the results down to the fewest records gives the best effect. In our table, if a clustered index on No existed, putting No first in the where clause would be best, followed by Id, then MgrObjId, and finally the time; and since each table covers one hour, it is best not to use the time field at all.
  • The order of the where clause determines whether the query analyzer uses the index for the query. For example, if an index is created on MgrObjId and Id, then where MgrObjId='' and Id='' and Dtime='' will use an index seek, whereas where Dtime='' and MgrObjId='' and Id='' will not necessarily use one.
  • Put non-key result columns in the included columns. Since our conditions are MgrObjId, Id, and Dtime, the returned result only needs to contain Dtime and Value; if Dtime and Value are placed in the included columns, the index itself carries the values and the query never has to go back to the physical table, which gives the best speed.

With the above principles in mind, we create the following index:

    CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008(MgrObjId, Id) INCLUDE(Value, Dtime)

It takes more than 6 minutes and the index size is 903 MB.

Let's look at the estimated plan:

We can see that the index is used fully, with no extra cost. The actual execution took less than 1 second: the results were filtered out of the millions of records in under a second!! Awesome!!

How to Apply indexes?

Now that writing and reading each work on their own, how do we combine them? Data generated an hour or more ago can be indexed, while the data for the current hour stays unindexed. In other words: do not create the index when the table is created!!
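A sketch of what indexing a just-completed hour table could look like; the table and index names follow the patterns above, and scheduling it (for example from a SQL Agent job) is left out:

    -- Hypothetical: once the 15:00 table for collector 001 is no longer being written to,
    -- build the covering query index on it, following the pattern from the previous section.
    CREATE NONCLUSTERED INDEX Idx_His_001_2014112615
        ON dbo.His_001_2014112615 (MgrObjId, Id)
        INCLUDE (Value, Dtime);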

How can we optimize it?

We can try read/write splitting and write to two databases: one real-time database and one read-only database. Data from within the last hour is queried from the real-time database; data older than an hour is queried from the read-only database. The read-only database is loaded on a regular schedule and then indexed; data older than a week is analyzed, processed, and archived. That way, whatever time range is queried, it can be handled correctly: the last hour from the real-time database, one hour to one week from the read-only database, and anything older than a week from the report database.

If physical table sharding is not used, the index in the read-only database can simply be rebuilt on a regular schedule.
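A minimal sketch of such a periodic rebuild in the read-only database; the index and table names are the placeholders used above, and in practice this would be driven by a scheduled job:

    -- Hypothetical: periodically rebuild the query index in the read-only database
    ALTER INDEX Idx_His20141008 ON dbo.his20141008 REBUILD;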

Summary

To sum up, SQL Server can handle hundreds of millions of records (historical data) per day if you:

  • Remove all indexes from the table
  • Insert with SqlBulkCopy
  • Shard or partition the tables to reduce the total data volume of each one
  • Build the index only after a table has been fully written
  • Specify the right index fields
  • Put the fields the query needs to return into the included columns (so the index alone returns everything)
  • Return only the required fields when querying
If you found this article helpful, click the "Recommendation" button; your thumbs-up is my greatest motivation to write! If you want to follow my articles, scan the QR code to follow the Ma Fei Code official account and the latest articles will be pushed to you automatically:




http://www.cnblogs.com/marvin/p/HowCanIHandleBigDataBySQLServer.html#commentform
