Lao Li shares: Big Data Performance Tuning Cases
1. "Trading space for time" and "processing data in memory"
For example, the user_id.csv file contains 200,000 distinct user_id values. For each user_id, we need to look up that user's recently published posts and retrieve post_id, post_title, post_time, and user_id from the post table (the post table has a user_id column indicating which user each post belongs to, and it holds about 2 million posts). How should this be handled?
My solution:
A. Export the four columns post_id, post_title, post_time, and user_id from the post table into a posts.csv file, then read the records from posts.csv into a csvRecords collection using a CSV-reading component.
B. Then, applying the idea of trading space for time, read the user_id values from user_id.csv into a userIdList list object, and convert the list into a dictionary:
var userIdDict = userIdList.Distinct().ToDictionary(c => c, c => 1);
C. Finally, match csvRecords against userIdDict to get the result:
var resultRecords = csvRecords.Where(c => userIdDict.ContainsKey(c.UserId)).ToList();
The ContainsKey lookup here has O(1) time complexity.
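Putting the three steps together, here is a minimal runnable sketch. The file names come from the text above; the PostRecord type, the one-id-per-line layout of user_id.csv, and the four-column comma-separated layout of posts.csv are assumptions made for illustration:

using System;
using System.IO;
using System.Linq;

class PostRecord
{
    public string PostId;
    public string PostTitle;
    public string PostTime;
    public string UserId;
}

class SpaceForTimeDemo
{
    static void Main()
    {
        // Step B: read the 200,000 user ids (one id per line assumed)
        // and build a dictionary so each membership test is O(1).
        var userIdList = File.ReadLines("user_id.csv").ToList();
        var userIdDict = userIdList.Distinct().ToDictionary(c => c, c => 1);

        // Step A: read the ~2 million exported post records
        // (four comma-separated columns assumed).
        var csvRecords = File.ReadLines("posts.csv")
            .Select(line => line.Split(','))
            .Select(f => new PostRecord { PostId = f[0], PostTitle = f[1], PostTime = f[2], UserId = f[3] })
            .ToList();

        // Step C: keep only the posts whose user_id is in the dictionary.
        var resultRecords = csvRecords.Where(c => userIdDict.ContainsKey(c.UserId)).ToList();
        Console.WriteLine(resultRecords.Count);
    }
}

A HashSet<string> with set.Contains(c.UserId) would express the same idea; the dictionary variant simply mirrors the original code.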
2. Using JOIN to optimize query performance
A page query was extremely inefficient (no results after more than a minute). The query executed 3 SQL statements in the background, of which 2 took about 39 seconds to execute, causing the database connection to time out.
The backend database queries use the EF framework, and improper use of EF can easily lead to poor query performance.
A simplified simulation follows (the table names have been adjusted and the record counts changed, without affecting the conclusion):
Two tables are used: an account table, accounts, with about 3,000 records, and a post table, posts, with about 1.9 million records.
The backend processing goes roughly like this: first obtain the list of account_id values, accountIds, from the accounts table based on the query criteria, then use that list to look up post records (the posts table has an account_id field), roughly:
var posts = db.Posts.Where(m).Where(c => accountIds.Contains(c.AccountId));
There is no need to dwell on this line of code (m stands for some other filter condition); using a monitoring tool, the SQL it generates looks like this:
SELECT * FROM posts
WHERE
(account_id = ?) OR (account_id = ?) OR ...
ORDER BY created_at DESC
LIMIT 0, 15;
There are 2,000 or more OR conditions above. The EXPLAIN output shows a rows value of more than 1.34 million, and no index is used.
The ORDER BY clause is also costly, but it cannot simply be dropped as an optimization, because the business requires the first page to show the latest 15 posts.
Switching to a JOIN query, I hand-wrote the SQL and the results came back almost instantly. The SQL is roughly as follows:
SELECT * FROM
posts AS a
JOIN
accounts AS b
ON a.account_id = b.id
WHERE a.category = 1  # query posts by category
ORDER BY a.created_at DESC
LIMIT 0, 15;
So here we abandoned EF's LINQ query and rewrote the code to execute the hand-written SQL instead. Speed improved greatly: the query that previously took nearly 2 minutes to display results now takes only 3-4 seconds.
This scenario uses JOIN to improve query efficiency because one table has only about 3,000 records while the other has millions. If both tables had millions of records, a JOIN would not necessarily improve query efficiency.
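As a sketch of what "dropping the LINQ query and spelling the SQL by hand" can look like, the JOIN can be run through EF6's Database.SqlQuery. The Post entity, BlogContext, and LatestPostsByCategory names here are hypothetical illustrations, the column aliases are assumed, and the @p0 placeholder follows the EF6 documentation for raw parameters (the exact placeholder syntax can vary by database provider):

using System;
using System.Collections.Generic;
using System.Data.Entity;
using System.Linq;

class Post
{
    public long PostId { get; set; }
    public long AccountId { get; set; }
    public int Category { get; set; }
    public DateTime CreatedAt { get; set; }
}

class BlogContext : DbContext
{
    public DbSet<Post> Posts { get; set; }
}

static class PageQueries
{
    // Runs the hand-written JOIN instead of letting EF expand
    // accountIds.Contains(...) into a 2,000-condition OR chain.
    public static List<Post> LatestPostsByCategory(BlogContext db, int category)
    {
        const string sql = @"
            SELECT a.post_id AS PostId,
                   a.account_id AS AccountId,
                   a.category AS Category,
                   a.created_at AS CreatedAt
            FROM posts AS a
            JOIN accounts AS b ON a.account_id = b.id
            WHERE a.category = @p0
            ORDER BY a.created_at DESC
            LIMIT 0, 15;";
        // SqlQuery maps result columns to Post properties by name,
        // which is why the snake_case columns are aliased above.
        return db.Database.SqlQuery<Post>(sql, category).ToList();
    }
}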
3. Optimization at the business-logic code level
Understand the business logic and eliminate redundant business-logic code.
Original link: http://www.cnblogs.com/laoli0201