"Open source Framework those things 27" leisurely strum: A SQL-induced performance crisis and its hidden design behind the lack of

Source: Internet
Author: User

There is a classmate, said that the system has a performance problem, said to let me help diagnose.  Originally did not want to spend this time, the result can not withstand each other's stalker dozen, as long as promised to help see. The background of the story is that when the file is uploaded, there is time for someone to upload the file, but in the end it does not use the uploaded file, which will produce some junk files.
The original software author would like to write a background timer task program to clear these junk files?
Because the author is determined not to let me send her SQL statement (this I also understand, so ugly sql), so here does not originate code, hair Pseudo-code.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
void deletemisslinkfile{
List filelist=getfilelist ();
List deletefilelist=new ArrayList ();
for (file:filelist) {
int Count1=execute (SELECT COUNT (*) from ...);
int Count2=execute (SELECT COUNT (*) from ...);
int Count3=execute (SELECT COUNT (*) from ...);
int Count4=execute (SELECT COUNT (*) from ...);
int Count5=execute (SELECT COUNT (*) from ...);
if (count1==0&&count2==0&&count3==0&&count4==0&&count5==0) {
Deletefilelist.add (file);
}
}
Delete (deletefilelist);
}

Of course, I have done some processing here, making it look a lot more beautiful, in fact, um, it is ugly. What are the performance conditions at this time? There are more than 500 data in the table, but the execution time is more than 100 seconds, but actually the actual application scenario is far more than this order of magnitude, and as the data increases, performance will decrease exponentially.
I said you go to add 100,000 records to test, make sure you don't come out overnight.
Okay, cut the crap, and then look at how to optimize the program.
Before we start, we can assume that there are n files, there are m file reference tables, and assume that all the file references are the same as the number of record bars in a table.
It is clear that the original implementation method was executed: 1 times the number of files query +n*m statistics operations
the stupidest method of optimizationFirst, use the lowest cost method to optimize a piece:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
void deletemisslinkfile{
List filelist=getfilelist ();
List deletefilelist=new ArrayList ();
for (file:filelist) {
int Count1=execute (SELECT COUNT (*) from ...);
if (count1>0) continue;
int Count2=execute (SELECT COUNT (*) from ...);
if (count2>0) continue;
int Count3=execute (SELECT COUNT (*) from ...);
if (count3>0) continue;
int Count4=execute (SELECT COUNT (*) from ...);
if (count4>0) continue;
int Count5=execute (SELECT COUNT (*) from ...);
if (count1>0) continue;
Deletefilelist.add (file);
}
Delete (deletefilelist);
}

Well, with the refactoring above, the performance can be boosted by a few times.  Ugly is a little ugly, but 1 time times is not a small increase oh. Reason, the original is to calculate all the statistical values, and then to judge, through the above reconstruction, average as long as the half can be checked out, so the performance will be 1 time times the promotion.
1 times File count query +N*M/2 statistics operation
general method of optimizationI reminded her that you can change the inside and outside, performance will improve a lot, the result is not understand,.
In fact, the logic is this, because the statistical operation is very low execution efficiency, and the query with the primary key is very fast, that is, the logic from: Traverse all the files to see how many citations, change to remove all the referenced files from all file lists, the rest is to delete the junk file.
1
2
3
4
5
6
7
8
9
10
11
12
13
void deletemisslinkfile{
List filelist=getfilelist ();
List reflist1=execute (select File from Tb1 ...)
for (Ref:reflist1) {
Filelist.remove (ref)
}
List reflist2=execute (select File from TB2 ...)
for (REF:REFLIST2) {
Filelist.remove (ref)
}
......
Delete (deletefilelist);
}

Through the above optimization, the SQL statements that need to be executed are: 1+m SQL statement, the other is a large amount of memory data pair, relatively speaking, performance will be much higher, through a certain skill to do some optimization, there will be a greater increase.
This way, I Mao Yan than the original way, can be increased by two orders of magnitude above.
Why is it that two or so orders of magnitude have been raised or more stupid?
Because this method has a significant improvement over the original method, there are still serious design problems.
First of all, when the amount of data is small (this is compared to the data in the Internet application), do a full traversal is not a problem, but when the amount of data is relatively large, with a SQL to traverse all the data, there is a very big problem. This time to introduce a series of complex problems to solve, such as: the single-machine computing into cluster computing, the entire calculation into a segmented time, no matter what, is a very complex process.
the method of inactionThe quickest, easiest and most efficient method is to be introduced below.
In fact, as long as the algorithm is the optimization of space and room, so generally speaking I rarely say full. This time I used the "most" word, that is used to indicate that the future has no optimization of space, then what kind of algorithm can not optimize the space? The answer is: do nothing.
Of course, in fact, it is impossible to do nothing, the problem is where, you do not do how can be good?
is actually the task of a certain decomposition. Through the rational analysis and design of the architecture, all the files uploaded and deleted into public methods (or services), in the need to deal with the document, when dealing with the document, do the following:
    • File Upload: Add a piece of data to the file upload data, such as: file-related information, unique identification, the number of citations is 0
    • File association: When data is associated with a file, the number of changes to the reference is +1
    • File unlink: When the data is associated with the file (usually when the deletion or edit is empty or replaced by another one), the number of references is 1
Since then, when it is time to clean up the rubbish, it is very simple, as long as:
Select ... where ref_times=0
And then do the corresponding cleanup work just fine.
At this time, the processing mode is optimized, and the maintenance of file reference data is decomposed into the process of business work, which can greatly improve the processing efficiency of garbage disposal. Of course some people say: if this will make my business process slow, then what? In fact, it doesn't matter, you can turn this into an asynchronous message, notify the file reference processing to do this thing, so it won't affect your business processing efficiency.
SummaryThrough the above analysis, we have optimized the garbage cleaning process during the file uploading process, analyzed the original problem, and the following 3 kinds of optimization methods and their advantages and disadvantages.
Of course, many friends will have a better solution, welcome to participate in the discussion, and criticize correct.
If, you like my blog, pleaseconcernMe, in order to receive my latest news.
If you are interested in my open source framework, you canhereGet to the latest code, or you can accessTiny official websiteGet more messages, or go toTiny CommunityFor instant communication.


Copyright NOTICE: This article for Bo Master original article, without Bo Master permission not reproduced.

"Open source Framework those things 27" leisurely strum: A SQL-induced performance crisis and its hidden design behind the lack of

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.