How can MySQL aggregate statistics over 500 million rows of daily table data?

Ask: We have daily log tables (one generated per day), each holding around a million rows. From these daily tables we need to compute statistics: count the number of distinct IP addresses per appid, with the IPs de-duplicated. The approximate SQL is:

SELECT appid, COUNT(DISTINCT ip) FROM log0812_tb WHERE iptype = 4 GROUP BY appid;

Then the appid and its distinct-IP count are written into another statistics table. The constraints:

  1. Executing the SQL directly will certainly time out (the system's configured read timeout is only a matter of milliseconds).

  2. Loading all of the data into memory and de-duplicating it there is not possible either: only 50 MB of memory is available...

Is there an optimization that works within these limits? Thank you.

Reply content:

Try the following optimizations (both sketched in SQL below):

  1. Create a composite index on (appid, ip).

  2. Store the IP address as an integer rather than a string.
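
A minimal SQL sketch of both changes, assuming the table from the question (the ip_int column name is invented for illustration; INET_ATON is MySQL's built-in IPv4-to-integer conversion):

ALTER TABLE log0812_tb ADD COLUMN ip_int INT UNSIGNED;          -- integer form of the IPv4 address
UPDATE log0812_tb SET ip_int = INET_ATON(ip);                   -- e.g. '1.2.3.4' -> 16909060
ALTER TABLE log0812_tb ADD INDEX idx_appid_ip (appid, ip_int);  -- composite index for the GROUP BY

SELECT appid, COUNT(DISTINCT ip_int)
FROM log0812_tb
WHERE iptype = 4
GROUP BY appid;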

If it still times out, you could try reading the data into memory and de-duplicating there, but with only 50 MB of memory that will not fit. In that case, try HyperLogLog: its memory consumption is extremely small, though the counts it produces deviate slightly, by about 2%.
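The reply does not name a specific HyperLogLog implementation; one low-effort option is Redis's built-in PFADD/PFCOUNT commands (each Redis HLL is capped at roughly 12 KB of memory). A minimal PHP sketch, assuming the phpredis extension, a PDO connection, and the table from the question; the key names and connection details are made up:

<?php
// One Redis HyperLogLog per appid; duplicates of the same ip are absorbed automatically.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=logs', 'user', 'pass');
// Unbuffered query so PHP streams rows instead of loading the result set into its 50 MB.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$stmt = $pdo->query('SELECT appid, ip FROM log0812_tb WHERE iptype = 4');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // PFADD updates the per-appid HLL in O(1) and constant memory.
    $redis->pfAdd('uv:' . $row['appid'], [$row['ip']]);
}

// PFCOUNT returns the approximate number of distinct ips for each appid.
foreach ($redis->keys('uv:*') as $key) {
    echo $key, ' => ', $redis->pfCount($key), PHP_EOL;
}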

Finally, it is recommended that you not store this kind of log data in an SQL database; a NoSQL store such as HBase or MongoDB would fit these requirements better.

@Manong
Thank you, both optimization schemes are good.

I created a composite index on (iptype, appid, ip). This way the statement is answered entirely from the index without going back to the table, and it finishes in under 1.5 s, which works.
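For reference, a covering index along those lines might be declared like this (the index name is illustrative). With iptype first, the WHERE filter, the GROUP BY on appid, and the DISTINCT ip can all be answered from the index alone, with no table lookups:

ALTER TABLE log0812_tb ADD INDEX idx_iptype_appid_ip (iptype, appid, ip);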

As for the HyperLogLog algorithm, I have only just read up on it and have not used it in practice; thank you for the suggestion.

Another method I actually used was to process the 500 million+ rows in batches via scheduled tasks. The data is de-duplicated twice: each batch is de-duplicated first, then array_diff compares it against the IPs already counted by earlier batches, and only the newly found IPs are added to the running total. This keeps each run below 1 s. One trick (see the sketch below): after the first comparison, the array of seen IPs is joined into a string before being stored; the second comparison converts the string back into an array. This saves a lot of memory, since nested arrays consume far more memory than arrays holding long string values.
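A rough PHP reconstruction of that batch approach; the id-range chunking, batch size, and variable names are assumptions rather than the poster's actual code:

<?php
$pdo = new PDO('mysql:host=127.0.0.1;dbname=logs', 'user', 'pass');
$batch = 100000; // rows scanned per scheduled-task run (assumed)
$maxId = (int) $pdo->query('SELECT MAX(id) FROM log0812_tb')->fetchColumn();

$seen  = []; // appid => comma-joined string of already-counted ips (the memory trick)
$total = []; // appid => running distinct-ip count

$stmt = $pdo->prepare('SELECT DISTINCT appid, ip FROM log0812_tb
                       WHERE iptype = 4 AND id >= ? AND id < ?');

for ($from = 0; $from <= $maxId; $from += $batch) {
    // First de-duplication happens inside MySQL, one id range at a time.
    $stmt->execute([$from, $from + $batch]);

    $perApp = [];
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $perApp[$row['appid']][] = $row['ip'];
    }

    foreach ($perApp as $appid => $ips) {
        // Second de-duplication: expand the stored string back into an array and
        // keep only the ips that no earlier batch has counted yet.
        $old = isset($seen[$appid]) ? explode(',', $seen[$appid]) : [];
        $new = array_diff($ips, $old);
        $total[$appid] = ($total[$appid] ?? 0) + count($new);
        // Store one long string instead of a nested array: far cheaper in PHP memory.
        $seen[$appid] = implode(',', array_merge($old, $new));
    }
}

print_r($total); // appid => distinct ip count, ready to write to the statistics table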
