A preliminary study of MYSQL SQL statement technique (i.)

Source: Internet
Author: User

A preliminary study of MYSQL SQL statement technique (i.)

This article is a summary of some of the features that I recently learned about some of the SQL methods () that will be updated later:

    1. Rand and RAND (n) Implement the extraction of random rows and the principle of order by.
    2. Bit_and,bit_or,bit_count equal-bit operation method to achieve statistical functions.

    1. Rand and RAND (n) extract

Use ORDER by rand (n) with limit N to lift a corresponding n random sample.

Example: SELECT * FROM Student grade= ' 2 ' ORDER by rand () limit 5

Randomly extract 5 Grade 2 student data.

There was some confusion at first about the way order by a random value. To understand this combination of statements, it is necessary to understand the implementation principle of order by (long ago to know that there are so many special reasons for order by using the index, etc.) so I found this place.

17807345

I have some doubts about this remark.

Note: MySQL can use at most one index at query time. Therefore, if the where condition already occupies the index, then the index is not used in the sort.

After querying the data, we found a concept of a merged index, which is not the focus of this article. (But here's an idea. If so, then I use the Join method to execute, need to have an index, but because where the index can not use the order by SQL statement, can effectively improve SQL efficiency? To be verified later)

The previous article learned that order by has Indexsort and filesort two ways to achieve it.

The Indexsort method is the sort implemented by the index. The indexes are ordered (except for special option when established). No further sorting is required.

Tips:sql Statement Execution order: SELECT from--1 WHERE--2 GROUP by--3 have--4 ORDER by--5)

Filesort is implemented by sorting algorithms:

Specific ways have

Read all data that needs to be sorted

Each row of data

Algorithm 1 (original): Store sort key and row pointers

Algorithm 2 (modified): Stores the fields in the sort key and select

Each sort sort_buffer_size can hold the number of rows, the sorting result is written to the Io_cache object (may be called IO1), the position information of this sort result is written to another Io_cache object (may be called IO2);

Io_cache Write temp file when more than 64k

When order by has limit N, only the first n sorting results are written to Io_cache;

Sort key length <=20 and sort key number between 1000 and 100,000 using radixsort, otherwise use quicksort

Merge Buffer

Read the sorting result (algorithm 2 reads the result directly from the temporary file; Algorithm 1 reads the row pointer from the temporary file and then reads the data from the table)

From

51988943

That is, the external sort is used most externally, and no external sort is used to sort the buckets or to sort quickly. Depending on the size of the data volume, Sort_buffer_size also affects the performance of the sorting.

As you can see from the above, order by rand () actually sorts the query query field with the n record or the n select Record, plus a Rand value. Returns the sort result for output interception.

Baidu ORDER by Rand () will jump out of a large number of reproduced: SQL Optimization-order by Rand in the article mentioned the reorganization will make SQL efficiency drops sharply!.

53446902

I did a bit of testing on that, too.

Explain select * from T_financial_bak ORDER by Cashandbank limit 5

Spents 1.085s

Explain select * FROM T_financial_bak ORDER by rand () limit 5

Spents 14.5s

Two are useless to the index, both use the filesort. But the statement using ORDER by RAND () uses temporary and takes up to 14.5s. As mentioned in that article, the rand () function cannot be used in an ORDER BY clause many times, because it causes the data column to be scanned multiple times, and Rand () is executed several times in the ORDER BY clause, natural efficiency and low.

I was a little bit at first, but because of this example I understand:

    1. SELECT * from T_financial_bak where id>= (select Floor (MAX (ID) * RAND ()) from T_financial_bak) Limit 5

Spents 2.728s

2. Select * from T_financial_bak as T1 joins (select rand () * (select MAX (ID) from T_financial_bak) as ID) as T2 limit 5

Spents 0.014s

Is the author's proposed improvement of Order by RAND (), a way of using subqueries, a join method. Subqueries are more in a way than join methods.

We should have concluded the question of the efficiency of order by Rand above.

The efficiency issue of ORDER by Rand () is not a special place between order by and Rand (), but because each data call Rand statement, resulting in a large number of RAND function call time, It is even necessary to open a temporary tablespace for SQL to meet the need for space to invoke the RAND function. This results in inefficient SQL.

Instead of using the sub-query and join method above is also an equity, because Rand () * MAX (ID) is a certain size, and the smaller number can not necessarily be larger than this number, only in greater than Rand () *max (ID) 5 pairs of small data (for already ordered data). The 5 data are clearly adjacent. The ORDER by RAND () results in a good randomness for each piece of data.

My solution:

If you can: you can add a random number to each row of data, using join a random number way to go random. However, there is a need to sacrifice a certain amount of data space, of course, if it can be implemented in the program more convenient.

The realization of bit_and,bit_or,bit_count statistic function

79841153

We have been exposed to the basics of C language, but the bit manipulation is not significant and not intuitive (although the rules are simple)

I am puzzled by the explanation of "1<<day" in the text.

Here is the explanation for 1<<day: Here The count takes a bitmap method, every day there is access to 1, 02 days to indicate that 100,23 represents 1000. (23 of 0).

After the last pass bit_or the bitwise or after becomes 10000. (20 x 0) 100, and then using Bit_count Statistics 2 1 is 2, very ingenious.

Said the shady. In fact, it is simple as the bit operation in C, which means to shift the 1 to the left day bit, so that different values occupy different bits.

In this paper, three methods are used to realize the statistic of de-weight.

Multiple query methods

Select year, month, COUNT (0)

from (Select year, month, Day,count (0) from T1 Group by year, month, day) as TMP

Group by year, month;

Bit Operation method

Select year,month,bit_count (Bit_or (1<<day)) as days from T1 Group by Year,month;

and the Distinct method

Select year,month,count (distinct day) from T1 Group by Year,month;

Let's compare the efficiency of the three.

Select QYGM,HYDM, COUNT (QYGM) from (select QYGM,HYDM from T_financial_bak Group by HYDM,QYGM) as TMP GROUP by HYDM

(Internal SQL removes content that HYDM and QYGM are duplicated)

Where QYGM and HYDM are indexed, spents 2.035s

You can see that both internal and external SQL statements are not indexed, and both temporary and filesort are used.

Inefficient.

Select Qygm,hydm,bit_count (Bit_or (1<<QYGM)) from T_financial_bak GROUP by HYDM

Spents 2.293s

The result of the explain is that the index named Hydm_sy is used and is not used for temporary and filesort

But it's still used for quite a short time.

Select Qygm,hydm,count (Distinct (QYGM)) from T_financial_bak GROUP by HYDM

Time 2.024s before both are at the same order of magnitude.

It also uses the index Hydm_sy, indicating that distinct is not specified in the use of the index, and that columns that are not distinct are still available for indexing.

So where do we spend the time of the three?

+----------------------+----------+

| Status | Duration |

+----------------------+----------+

| Starting | 8.6E-5 |

| Checking Permissions | 7E-6 |

| Opening Tables | 0.000135 |

| init | 1.6E-5 |

| System Lock | 9E-6 |

| Optimizing | 3E-6 |

| Optimizing | 4E-6 |

| Statistics | 1.5E-5 |

| Preparing | 1.6E-5 |

| Creating tmp Table | 6.2E-5 |

| Sorting result | 6E-6 |

| Statistics | 7E-6 |

| Preparing | 7E-6 |

| Creating tmp Table | 5.9E-5 |

| Sorting result | 4E-6 |

| Executing | 1.1E-5 |

| Sending Data | 1.1E-5 |

| Executing | 2E-6 |

| Sending Data | 2.086242 |

| Creating Sort Index | 0.005237 |

| removing TMP table | 1.8E-5 |

| Creating Sort Index | 0.002354 |

| Creating Sort Index | 0.002965 |

| End | 4E-6 |

| removing TMP table | 8E-6 |

| End | 3E-6 |

| Query End | 9E-6 |

| Closing Tables | 2E-6 |

| removing TMP table | 1E-5 |

| Closing Tables | 1.3E-5 |

| Freeing items | 5.2E-5 |

| Cleaning Up | 1.3E-5 |

+----------------------+----------+

The results show that all three of the time are spent on the sanding data, with a limit that I have modified the return of the information.

Results distinct and bit_count time were shortened to 0.002s

and internal and external query of SQL time is still 2s.

The time spent on both of them is also mostly used on sanding data.

When I replaced an indexed HYDM with a column that was not indexed, the three returned to the order of magnitude 2s. It is true that the gap between internal and external queries and the other two approaches is HYDM index.

Conclusion The combination of Bit_count () and bit_or () does not significantly improve the efficiency of SQL compared with the usual distinct.

Experience: These two examples of low rate of evaluation in daily SQL, ORDER by rand () are inefficient, not application-level SQL. And Bit_count,bit_or has not significantly improved the efficiency of SQL. In this process, however, you can understand the order by,distinct, as well as the view of SQL efficiency in a deeper understanding.

Because distinct and order by are inextricably linked. We put the exploration of distinct in the next section.

A preliminary study of MYSQL SQL statement technique (i.)

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.