An article table has 5 million rows, and about 100,000 of them are updated every day. How can I efficiently pick 3,000 rows at random from those 100,000 updated rows for data research?

Source: Internet
Author: User
This is an interview question.
My idea is to use another table that holds the 100,000 articles updated today;
That first step is as far as I get; I don't know how to implement what comes next;
Even with that approach, once I know which 100,000 rows were updated that day, I still have to go back to the 5 million rows to pull out the 3,000 records.
Does anyone have a good idea?

Reply content:


3,000 out of 100,000 is a probability of 3%.

So in the code path that saves an article, record the updated article in a cache with a probability of 3%.

The best cache for this is a Redis set: a set does not store duplicate elements, so an article that is updated several times will not appear in the list more than once.

The key can follow a format like "analyze:list:{Y-m-d}".

Set the cache to expire after 48 hours, and archive the previous day's set to the database if necessary.

To allow for random error, you can raise the 3% to 5%; the final set will then almost certainly hold more than 3,000 records, but not too many more, and in the end you only use 3,000 of them anyway.

Recording every update, whether to the cache or to the database, means most of the records are useless; it is better to filter by probability up front.
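A minimal sketch of that write path, assuming the phpredis extension; recordUpdatedArticle() is a hypothetical helper to be called from wherever an article update is saved, and the key format and the 5% rate follow the description above:

<?php
// Sketch of the write path: sample roughly 5% of updates into a daily Redis set.
// Assumes the phpredis extension; the key format follows "analyze:list:{Y-m-d}".
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

function recordUpdatedArticle(Redis $redis, int $articleId): void
{
    // Keep about 5 in every 100 updates (3% needed plus a safety margin).
    if (mt_rand(1, 100) > 5) {
        return;
    }
    $key = 'analyze:list:' . date('Y-m-d');
    $redis->sAdd($key, $articleId);      // a set ignores repeated updates of the same article
    $redis->expire($key, 48 * 3600);     // let the set expire after 48 hours
}

// Call it from the code path that saves an article, e.g.:
// recordUpdatedArticle($redis, $articleId);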

Actually, recording each article's update_time also works: WHERE update_time >= ? AND update_time <= ? ORDER BY RAND() LIMIT 3000 is pretty much enough, since it only has to run once a day. But since this is an interview, why not show off a bit of design?

Advantages:
1. It works even without an update_time column, so it places no requirements on the existing table structure; adding a column to a production database is a troublesome thing.
2. If the production database is under heavy load, an ORDER BY RAND() query can bog it down, so you would rather run it in a read/write-split setup against the read-only replica, which turns into an architectural requirement. My design is a bypass that only records to Redis; nothing else is required.
3. You only record as much as you need, so the extra I/O is small.

Some humble ideas, for reference only
Partition
With 5 million rows, for convenience, partition the table by the rows' update time (I mean MySQL partitioning here; if you haven't used it, it's worth reading up on).
For example, partition by month: assuming your 5 million rows cover one year, that gives 12 partitions of roughly 420,000 rows each.
This way, when you search by update time, MySQL picks the partition that matches that update time.
That is, the search only has to look through those ~420,000 rows (certainly much faster than searching through 5 million; partitioning by day would be faster still).
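A rough DDL sketch of that monthly partitioning, assuming a table named article with an update_time column (both names are placeholders); note that MySQL requires the partitioning column to be part of every unique key, including the primary key:

<?php
// Sketch only: repartition a hypothetical `article` table by month of update_time.
// MySQL requires every unique key (including the primary key) to contain the
// partitioning column, so the primary key may first need to become (id, update_time).
$pdo = new PDO('mysql:host=127.0.0.1;dbname=test', 'user', 'pass');

$sql = <<<SQL
ALTER TABLE article
PARTITION BY RANGE (TO_DAYS(update_time)) (
    PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
    PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
    PARTITION p2012_03 VALUES LESS THAN (TO_DAYS('2012-04-01')),
    PARTITION p_rest   VALUES LESS THAN MAXVALUE
)
SQL;
$pdo->exec($sql);  // add one PARTITION ... line per remaining month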

Add cache
There is nothing special here: every day, write the 3,000 rows you pulled from MySQL into Redis or MongoDB, and have the research code read from the cache in PHP instead of reading from MySQL.
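A small sketch of that daily caching step, again assuming phpredis; the key name analyze:rows:{Y-m-d} is just an example, and sample3000FromMysql() is a placeholder for whichever selection method above is used:

<?php
// Sketch: cache the day's 3,000 sampled rows so research reads never hit MySQL.
// sample3000FromMysql() below is only a placeholder for whichever selection method is used.
function sample3000FromMysql(): array
{
    // run one of the selection queries described above and return the 3,000 rows
    return [];
}

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$key = 'analyze:rows:' . date('Y-m-d');
$redis->set($key, json_encode(sample3000FromMysql()));
$redis->expire($key, 48 * 3600);

// The research job then reads only from the cache:
$rows = json_decode($redis->get($key), true);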

Multi-process
You said this is for research, so I assume your research algorithm is quite heavy. Learn some swoole, start three processes, have each process handle 1,000 rows, and merge the results at the end.
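A rough sketch of that fan-out using Swoole\Process (assuming the swoole extension is installed); analyze() is a made-up stand-in for the actual research computation:

<?php
// Sketch: split 3,000 IDs over 3 worker processes with swoole, then merge the results.
// Requires the swoole extension; analyze() stands in for the real research work.
function analyze(array $ids): array
{
    // ... expensive research on 1,000 rows ...
    return ['count' => count($ids)];
}

$ids     = range(1, 3000);            // the 3,000 sampled IDs from the steps above
$chunks  = array_chunk($ids, 1000);   // 3 chunks of 1,000
$workers = [];

foreach ($chunks as $chunk) {
    $proc = new Swoole\Process(function (Swoole\Process $worker) use ($chunk) {
        $worker->write(json_encode(analyze($chunk)));   // send the partial result to the parent
    });
    $proc->start();
    $workers[] = $proc;
}

$results = [];
foreach ($workers as $proc) {
    $results[] = json_decode($proc->read(), true);      // collect one partial result per worker
    Swoole\Process::wait();                              // reap the finished child
}
// $results now holds the three partial results to summarize.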

  1. Fetch the 100,000 rows updated that day

  2. Put their IDs into an array and randomly pick 3,000 IDs from it

  3. Read those 3,000 records with a SELECT ... IN query

SELECT id FROM table WHERE date_refresh = 20120329;
SELECT * FROM table WHERE id IN (id_0, id_1, id_2, ..., id_2999);
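A minimal PDO sketch of those three steps; the table name article and the columns id / date_refresh are placeholders matching the queries above:

<?php
// Sketch of steps 1-3: the day's IDs -> 3,000 random IDs -> SELECT ... IN.
// The table name `article` and the columns `id` / `date_refresh` are placeholders.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=test', 'user', 'pass');

// 1. IDs of all rows updated that day
$ids = $pdo->query('SELECT id FROM article WHERE date_refresh = 20120329')
           ->fetchAll(PDO::FETCH_COLUMN);

// 2. Randomly pick 3,000 of them
shuffle($ids);
$picked = array_slice($ids, 0, 3000);

// 3. Fetch those 3,000 records in one IN query
$in   = implode(',', array_fill(0, count($picked), '?'));
$stmt = $pdo->prepare("SELECT * FROM article WHERE id IN ($in)");
$stmt->execute($picked);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);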

https://www.zhihu.com/question/20151242

    • First, use a cache to record the primary keys of the rows that are updated each day.

    • Randomly pick 3,000 primary keys from the cache.

    • Take those 3,000 primary keys and use an IN query to fetch the corresponding rows (a sketch follows this list).
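A minimal sketch of that read path, assuming phpredis and the analyze:list:{Y-m-d} set filled by the write-path sketch earlier; SRANDMEMBER with a positive count returns distinct members:

<?php
// Sketch of the read path: random primary keys from the Redis set -> IN query.
// Assumes phpredis and the analyze:list:{Y-m-d} set from the earlier sketch;
// the table name `article` is a placeholder.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$pdo = new PDO('mysql:host=127.0.0.1;dbname=test', 'user', 'pass');

$key = 'analyze:list:' . date('Y-m-d');
$ids = $redis->sRandMember($key, 3000);        // up to 3,000 distinct members

$in   = implode(',', array_fill(0, count($ids), '?'));
$stmt = $pdo->prepare("SELECT * FROM article WHERE id IN ($in)");
$stmt->execute($ids);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);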

A shallow analysis; please don't laugh.

1. Get the ID range

SELECT MAX(id) AS max_id, MIN(id) AS min_id
FROM article_tb
WHERE update_time >= ? AND update_time < ?

update_time has an index; id is an auto-increment primary key.
2. Random Access

-- the application computes rand_id = floor(min_id + rand() * (max_id - min_id)) for each lookup, then runs:
SELECT * FROM article_tb WHERE id >= rand_id ORDER BY id LIMIT 1

Run the query 3,000 times, skipping any duplicate IDs that come back.

// STEP 1: get the ID range of that day's articles
// maxId -> SELECT MAX(id) FROM news WHERE <that day's time condition>
// minId -> SELECT MIN(id) FROM news WHERE <that day's time condition>

// STEP 2: pick random IDs
// You have 100,000 rows a day and the total data volume is large,
// so avoid doing the randomization inside MySQL.
$minId = 5000000;
$maxId = 5100000;
$i = 0;
$resultIds = [];
while (true) {
    $randId = rand($minId, $maxId);
    if (in_array($randId, $resultIds)) {
        continue;
    }

    // Query to validate the row:
    // check whatever you need, e.g. whether it has been approved and is normal data,
    // and only if it is normal add it to the result array.
    $resultIds[] = $randId;
    $i++;

    if ($i == 3000) {
        break;
    }
}

// At this point the result is ready.
// Store the result set and page through it some other way for research or browsing.