Implementing the median Algorithm in MySQL

Source: Internet
Author: User

Implementing the median Algorithm in MySQL

Purpose:

MySQL does not have a specific median algorithm. For those unfamiliar with SQL, median can only be written in Java or other languages. It is not recommended that you use MySQL to complete median calculation. The following implementation only aims to understand some common and uncommon functions and functions of MySQL during algorithm parsing, and give you an instant thinking.

Of course, for some temporary requirements, some temporary algorithms need to be tested, verified, and exported. MySQL can be used to complete these algorithms to highlight their efficiency.

Speaking of median, we need a batch of data for testing and implementation, and create the following table:

Drop table if exists CaseRent;

Create table CaseRent (

ID int (6) not null AUTO_INCREMENT,

ResidentialAreaID int (6) default null,

CaseFrom varchar (30) default null,

Price int (6) default null,

Primary key (ID)

);

It is called the rental case table. The key fields include the residential ID, Case source, and price.

Next, assign a random number to the rental case table:

Insert into CaseRent (ResidentialAreaID, CaseFrom, Price)

Select round (RAND () * 100) + 1), 'chains', ROUND (RAND () * 8000) + 1000)

The statement contains the following knowledge points:
1. assign values through insert into... SELECT (widely used, table creation is also available)
2. Use the Rand () random number function and ROUND () to rounding the function to complete the process from 0 ~ 100, price from 1000 ~ 9000 of random input.

Of course, one piece of data is not enough. We can execute it several times to increase the number of data entries to nearly 10. Modify the value assignment statement.

Insert into CaseRent (ResidentialAreaID, CaseFrom, Price)

Select round (RAND () * 100) + 1), 'chains', ROUND (RAND () * 8000) + 1000) FROM CaseRent

Continue N times, and then change the source "home online" to "home world" for a value assignment.

Insert into CaseRent (ResidentialAreaID, CaseFrom, Price)

Select round (RAND () * 100) + 1), 'House world', ROUND (RAND () * 8000) + 1000) FROM CaseRent

This completes the simulation! Example:

In fact, there are hundreds of online medians, but none of them are: long code, self-association or use of temporary variables.

Of course, there are similar methods we will talk about next. Either way, you need to learn more about and expand your knowledge.

Next, let's take the User-Defined simulation data as an example and arrange the first question:
1. Find the price median of the residential area ID = 99

The median of this type can be said to be the simplest, and most medians on the Internet are targeted at this median (single condition). The problem is similar to ours from the above websites, however, the amount of code is quite large.

We will analyze the problem: to obtain the median price, we must use order by for sorting. After sorting, We will calculate the total number of items to obtain the price of the intermediate one as the result (if it is an even number, two mean values can be obtained, or the first one can be taken for example, six data records. 3rd or four average values can be taken for calculation. Here, the first one is used for algorithm simulation)

The first step is to sort the prices from small to large:

SELECT * FROM CaseRent WHERE ResidentialAreaID = 99 order by Price

After sorting, the IDs are disordered. If we do this manually, we can only search for several numbers manually. Therefore, we need to have a new auto-increment ID, in this way, you can learn the corresponding ranking more quickly.

How to get a new auto-increment ID? We can create a new table and use insert into... SELECT to input new data to achieve auto-increment of Data IDs: for example:

Insert into NewCaseRent (ResidentialAreaID, CaseFrom, Price)

SELECT ResidentialAreaID, CaseFrom, Price FROM CaseRent WHERE ResidentialAreaID = 99 order by Price

However, we need to create a table, which is obviously troublesome. Because of an auto-incrementing table, creating a new table cannot meet the requirements,

Then we need a variable to implement the auto-increment function.

Like JAVA, Python, and other development languages, Mysql also has variables. Generally, User-Defined variables starting with @ and system variables starting.

So how do we use variables? It is easy to create and assign variable values through SET, and then query the results through SELECT, for example:

SET @ ID = 0;

SELECT @ ID;

With the variable, we can use the variable as the new auto-increment ID to replace the operation for creating a new table,

The variable auto-increment operation completes the new auto-increment ID function:

SET @ ID = 0;

SELECT @ ID: = @ ID + 1 as id, ResidentialAreaID, CaseFrom, Price FROM CaseRent WHERE ResidentialAreaID = 99 order by Price

Notes:
1. assign a value to the temporary variable in SELECT, using: =
2. Each statement is a loop query at the underlying layer. Therefore, you can add statements one by one.

Of course, the preceding statement is actually two statements, which can be executed in JAVA or other languages, which may be inconvenient. Therefore, you can modify the statements as follows:

SELECT @ ID: = @ ID + 1 as id, ResidentialAreaID, CaseFrom, Price FROM CaseRent, (SELECT @ ID: = 0) B

WHERE ResidentialAreaID = 99 order by Price

Result example:

The effect is good. Next we need to get the data with ID = Total number of entries/2.

Think about how to get a simple result?

SELECT * FROM (

SELECT @ ID: = @ ID + 1 as id, ResidentialAreaID, CaseFrom, Price FROM CaseRent, (SELECT @ ID: = 0) B

WHERE ResidentialAreaID = 99 order by Price) a where id = @ ID/2

Through simple median selection, you can deeply understand the usage of Mysql temporary variables.

Next we will introduce the median of deeper layers:
1. Calculate the median of the cell ID = 99 from different sources based on the case source.

Analysis problem: there is one more condition than the first step, and the result is also one more data.

So what should we do?

We know that we need to sort data by case source and price. What will happen if we directly add an ID?

SELECT @ ID: = @ ID + 1 as id, ResidentialAreaID, CaseFrom, Price FROM CaseRent, (SELECT @ ID: = 0) B

WHERE ResidentialAreaID = 99 order by CaseFrom, Price

Obviously, if you want to implement a real auto-incremental ID, you need to calculate the ID from 1 again when you go to the chainhouse online step.

So we divide it into two counts? If there are N cases, this method is obviously not feasible.

Next we will introduce the Mysql function IF

IF (condition, true, false)

Why is IF introduced? We need to determine whether the source of the case is the same as that of the previous case when auto-increment is sorted. If it is different, it indicates that the source is switched to the new source. At this time, the @ ID is set to start from 1, you can implement two auto-increment IDs from different sources.

To determine whether the source is the same, we have to add a temporary variable @ CaseFrom.


Copy code
SET @ ID: = 0, @ CaseFrom = '';

Select if (@ CaseFrom! = CaseFrom, @ ID: = 1, @ ID: = @ ID + 1) as id, ResidentialAreaID, CaseFrom, Price,

@ CaseFrom: = CaseFrom wy

FROM CaseRent WHERE ResidentialAreaID = 99 order by CaseFrom, Price;

Here, the wy field is purely used to assign a value to CaseFrom. It is useless for other operations.

The result is as follows:

But the problem arises. @ ID cannot be used to determine Count (*)/2. Because @ ID is already the online ID of the chain house, rather than the offline ID.

By creating a temporary table: temporary tables can be easily solved:

The Temporary table Temporary is only used in the current session. Other sessions create Temporary tables with the same name. They do not conflict with each other and do not directly generate entity tables.

However, temporary tables cannot be self-associated.


Copy code
SET @ ID: = 0, @ CaseFrom = '';

Drop table if exists CS_1;

Create temporary table CS_1

Select if (@ CaseFrom! = CaseFrom, @ ID: = 1, @ ID: = @ ID + 1) as id, ResidentialAreaID, CaseFrom, Price, @ CaseFrom: = CaseFrom wy

FROM CaseRent WHERE ResidentialAreaID = 99 order by CaseFrom, Price;

Drop table if exists CS_2;

Create temporary table CS_2

SELECT CaseFrom, FLOOR (Max (ID)/2) CenterID FROM CS_1 group by CaseFrom;

SELECT * FROM CS_1 a inner join CS_2 B ON a. ID = B. CenterID AND a. CaseFrom = B. CaseFrom;

This is an obvious drag-and-drop. After writing so much code, I created two temporary tables and joined them to obtain the results. However, for some temporary operations such as computing and export, even if you write a script in python, the amount of code is far greater than this.

In the preceding method, the multi-level median acquisition is realized through the temporary table + IF. However, we know that judging by IF means that if I add a new level, for example:
1. Obtain the median of each residential area and source.

In this way, we need to add a temporary variable for the Cell ID. Not only does the case source change, but we need to reset the ID to 1. When the cell ID changes, we also need to reset it to 1. The Code is as follows:


Copy code
SET @ ID: = 0, @ CaseFrom = '', @ ResidentialAreaID = 0;

Drop table if exists CS_1;

Create temporary table CS_1

Select if (@ CaseFrom! = CaseFrom, @ ID: = 1, @ ID: = @ ID + 1) as id,

IF (@ ResidentialAreaID! = ResidentialAreaID, @ ID: = 1, 1) AS ID2,

ResidentialAreaID, CaseFrom, Price, @ CaseFrom: = CaseFrom wy, @ ResidentialAreaID: = ResidentialAreaID wy2

FROM CaseRent order by ResidentialAreaID, CaseFrom, Price;

Drop table if exists CS_2;

Create temporary table CS_2

SELECT ResidentialAreaID, CaseFrom, FLOOR (Max (ID)/2) CenterID FROM CS_1 group by ResidentialAreaID, CaseFrom;

SELECT * FROM CS_1 a inner join CS_2 B ON a. ID = B. CenterID AND a. CaseFrom = B. CaseFrom

AND a. ResidentialAreaID = B. ResidentialAreaID;

An IF judgment is added, a temporary variable is added, and a field is associated.

This does not add much code to anyone familiar with and familiar with this logic, but it adds a level of logic and needs to be understood, which may be obfuscated.

It looks a lot more simple than other methods, but we still have a lot to try!

For example, writing Mysql user-defined functions and stored procedures for implementation, but this is a little different.

Next, use another method.

Use GROUP_CONCAT and SUBSTRING_INDEX to implement the median Algorithm

Group_concat is generally not a stranger. It is generally used with Group By. Of course, it is not useful for Group.

You can use Group_concat to separate result fields by commas by default to form a new string.

For example:

SELECT GROUP_CONCAT (Price) FROM CaseRent WHERE ResidentialAreaID = 99;

The result is as follows:

In GROUP_CONCAT, you can also write some SQL code. For example

GROUP_CONCAT (Price order by Price)

Or:

GROUP_CONCAT (DISTINCT Price)

Is it convenient? You can sort and remove duplicates by yourself to form a new string.

Next, we will introduce another function: SUBSTRING_INDEX.

First look at the results:

SELECT SUBSTRING_INDEX ('batch, number, data', ',', 1)

= Batch

SELECT SUBSTRING_INDEX ('batch, number, data', ',', 2)

= Batch, quantity

SELECT SUBSTRING_INDEX ('batch, number, data', ',', 3)

= Batch, number, Data

It is clear that the first parameter is a string, the second is a delimiter, and the third is the number of characters obtained.

-1 and-1 are common. In Redis and python, delimiter and search characters are often used, meaning reverse value. For example:

SELECT SUBSTRING_INDEX ('batch, number, data', ',',-1)

= Data

Combining the features of these two functions, the median can be obtained.

Let's take a look:

SELECT SUBSTRING_INDEX (GROUP_CONCAT (Price order by Price), ',', Count (1)/2), ',',-1) zws

FROM CaseRent WHERE ResidentialAreaID = 99;

The preceding two functions are involved: SUBSTRING_INDEX and GROUP_CONCAT,

GROUP_CONCAT sorts the results to form a new comma-separated string, and obtains the result of total/2 through SUBSTRING_INDEX, and then obtains the first reciprocal value through SUBSTIRNG_INDEX-1, that is, the median result.

So what will the Code become if we add the case source to obtain the median?

SELECT CaseFrom, SUBSTRING_INDEX (GROUP_CONCAT (Price order by Price), ',', Count (1)/2), ',',-1) zws

FROM CaseRent WHERE ResidentialAreaID = 99 Group By CaseFrom;

What about differentiation of cells? :

SELECT ResidentialAreaID, CaseFrom,

SUBSTRING_INDEX (GROUP_CONCAT (Price order by Price), ',', Count (1)/2), ',',-1) zws

FROM CaseRent Group By ResidentialAreaID, CaseFrom;

It seems simple, but GROUP_CONCAT has a default length of 1024

If you do not modify the parameters, the median statistics of a large amount of data exceeds the length of GROUP_CONCAT, resulting in a calculation error.

In general, we cannot modify the Mysql parameter of the server:

Show variables like 'group _ concat_max_len'

To see the current parameter.

And:

-- Temporarily modify the support length of GROUP_CONCAT based on the current session.

SET @ GROUP_CONCAT_MAX_LEN = 1024000;

Of course, if necessary, you can directly notify O & M personnel to modify the parameter length. If not commonly used, you can use this method for temporary use. Therefore, when the data volume is large, the correct statement is as follows:


Copy code
SET @ GROUP_CONCAT_MAX_LEN = 1024000;

SELECT ResidentialAreaID, CaseFrom,

SUBSTRING_INDEX (GROUP_CONCAT (Price order by Price), ',', Count (1)/2), ',',-1) zws

FROM CaseRent Group By ResidentialAreaID, CaseFrom;

At this point, the median algorithm ends.

Key knowledge points:

Temporary Variable

Temporary table

System Variables

IF

GROUP_CONCAT

SUBSTRING_INDEX

This article permanently updates link: https://www.bkjia.com/Linux/2018-03/151272.htm

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.