The problem is an interview question.
My idea is to keep a separate table that holds the 100,000 articles updated that day;
but that is only the first step, and I don't know how to implement what comes next;
even with this approach, knowing which 100,000 rows were updated each day, I would still have to go into the 5 million rows to pull out the 3,000 records.
Does anyone have a good idea?
Reply content:
3,000 out of 100,000 is a probability of 3%.
So whenever an article is saved or updated, add it to a cache with a 3% probability.
Redis's set type is the best fit for this cache: a set never stores duplicate elements, so an article that is updated several times will not appear in the list more than once.
The key can follow a format like "analyze:list:(y-m-d)".
Set the key to expire after 48 hours; the previous day's set can be archived to the database first if necessary.
To absorb the randomness of the sampling, raise the 3% to 5%; you will certainly end up with a bit more than 3,000 records, but not too many, and in the end only 3,000 of them are used anyway.
Recording every single update, whether to the cache or to a database, means most of the records go unused, so it is better to filter by probability up front.
Actually, just recording each article's update_time also works: WHERE update_time >= ? AND update_time <= ? ORDER BY RAND() LIMIT 3000 is pretty much enough, since it only runs once a day. But since this is an interview, why not show off a bit of thinking.
Advantages:
1. It works even without an update_time field and puts no requirements on the existing table structure; adding a field to a production database is a troublesome thing.
2. If the production database load is already high, an ORDER BY RAND() query could bog the database down; ideally you would want a read/write-split architecture and run the query only on a read-only replica, which imposes requirements on the architecture. This design records on the side and requires nothing beyond Redis.
3. It records only as much as it needs, so the extra I/O is small.
Some humble ideas, for reference only
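A minimal sketch of this write-time sampling, assuming the phpredis extension; the 5% rate, the set type, the key format and the 48-hour expiry follow the answer above, while function and variable names are made up for illustration:

<?php
// Write-time sampling: call this wherever an article is saved or updated.
// Requires the phpredis extension; names here are illustrative only.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

function recordForAnalysis(Redis $redis, int $articleId): void
{
    if (mt_rand(1, 100) > 5) {            // keep roughly 5% of updates
        return;
    }
    $key = 'analyze:list:' . date('Y-m-d');
    $redis->sAdd($key, $articleId);       // set type: repeated updates of the same article stay unique
    $redis->expire($key, 48 * 3600);      // keep the day's set for 48 hours
}

// At analysis time, any 3,000 members of the day's set are enough:
// $ids = $redis->sRandMember('analyze:list:' . date('Y-m-d'), 3000);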
Partition
With 5 million rows, for convenience, partition the data by its update time (this can be done with MySQL partitioning, discussed further below).
For example, partition by month: assuming your 5 million rows cover one year, that gives 12 partitions of roughly 420,000 rows each.
That way, when you search by update time, MySQL picks the partition to scan based on that time.
In other words, the search runs over those ~420,000 rows (certainly much faster than over 5 million; partitioning by day would be faster still).
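As a rough illustration of the partition idea (not the answerer's actual schema): assuming an article table with a DATETIME update_time column, a monthly range partition could be declared as below; note that MySQL requires the partitioning column to be part of every unique key, so the primary key may have to become (id, update_time).

<?php
// Hypothetical migration script: range-partition the article table by month of update_time.
// Assumes a PDO connection; table, column and partition names are placeholders.
// Note: MySQL needs the partitioning column in every unique key,
// so the primary key may have to be (id, update_time).
$pdo = new PDO('mysql:host=127.0.0.1;dbname=test', 'user', 'pass');
$pdo->exec("
    ALTER TABLE article
    PARTITION BY RANGE (TO_DAYS(update_time)) (
        PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
        PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
        -- ... one partition per month ...
        PARTITION p_max    VALUES LESS THAN MAXVALUE
    )
");
// A query filtered on update_time now scans only the matching partition:
$stmt = $pdo->prepare('SELECT id FROM article WHERE update_time >= ? AND update_time < ?');
$stmt->execute(['2012-03-01 00:00:00', '2012-03-02 00:00:00']);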
Add cache
This part is simple: every day, take the 3,000 rows out of MySQL and write them into Redis or MongoDB; when doing the research, do not read from MySQL at all, read from the cache in PHP.
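A small sketch of that daily copy, assuming PDO and phpredis; the table, column and key names are placeholders:

<?php
// Daily job: copy the day's 3,000 sampled rows from MySQL into Redis.
// Assumes PDO and phpredis; table/column/key names are placeholders.
$pdo   = new PDO('mysql:host=127.0.0.1;dbname=test', 'user', 'pass');
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$rows = $pdo->query(
    "SELECT * FROM article WHERE update_time >= CURDATE() ORDER BY RAND() LIMIT 3000"
)->fetchAll(PDO::FETCH_ASSOC);

$key = 'analyze:rows:' . date('Y-m-d');
$redis->set($key, json_encode($rows));
$redis->expire($key, 48 * 3600);

// The analysis code then reads from the cache, never from MySQL:
$sample = json_decode($redis->get($key), true);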
Multi-process
You said this is for research, so I will assume the research algorithm is heavy. Look into swoole: start three processes, let each process handle 1,000 rows, and merge the results at the end.
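A rough sketch of the three-process split using Swoole\Process (requires the swoole extension); analyze() and the /tmp paths are placeholders for whatever the real research code does:

<?php
// Split the 3,000 rows across three worker processes with Swoole\Process.
function analyze(array $rows): array
{
    // ... the real, expensive research computation goes here ...
    return ['processed' => count($rows)];
}

$sample = json_decode(file_get_contents('/tmp/sample_rows.json'), true); // or read from the cache above
$chunks = array_chunk($sample, 1000);        // three chunks of 1,000 rows

$workers = [];
foreach ($chunks as $i => $chunk) {
    $p = new Swoole\Process(function (Swoole\Process $worker) use ($chunk, $i) {
        // Each child analyzes its own 1,000 rows and writes a partial result.
        file_put_contents("/tmp/result_$i.json", json_encode(analyze($chunk)));
    });
    $p->start();
    $workers[] = $p;
}

foreach ($workers as $w) {
    Swoole\Process::wait(true);              // reap each child, then merge the partial results
}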
Fetch the 100,000 rows updated that day,
put their IDs into an array and randomly pick 3,000 IDs from the array,
then read those 3,000 records with SELECT ... IN.
SELECT id FROM table WHERE date_refresh = 20120329
SELECT * FROM table WHERE id IN (id_0, id_1, id_2, ..., id_2999)
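In PHP the three steps above might look roughly like this (PDO assumed; `table` and date_refresh are the placeholder names from the SQL above):

<?php
$pdo = new PDO('mysql:host=127.0.0.1;dbname=test', 'user', 'pass');

// 1. IDs of the rows updated that day (~100,000 integers fit easily in memory).
$ids = $pdo->query('SELECT id FROM `table` WHERE date_refresh = 20120329')
           ->fetchAll(PDO::FETCH_COLUMN);

// 2. Pick 3,000 distinct IDs at random.
$keys   = array_rand($ids, 3000);
$picked = array_map(fn ($k) => $ids[$k], $keys);

// 3. Fetch the chosen rows in a single IN query.
$in   = implode(',', array_fill(0, count($picked), '?'));
$stmt = $pdo->prepare("SELECT * FROM `table` WHERE id IN ($in)");
$stmt->execute($picked);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);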
https://www.zhihu.com/question/20151242
First, use a cache to record the primary keys of the rows updated each day.
Then randomly take 3,000 primary keys from the cache.
Finally, use an IN query on those 3,000 primary keys to fetch the corresponding data.
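A sketch of the cache side with phpredis, assuming the primary keys are collected in a Redis set (the key name is made up); the final IN query is the same as in the previous answer:

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$key = 'updated:ids:' . date('Y-m-d');

// Wherever an article is updated, record its primary key:
// $redis->sAdd($key, $articleId);

// Once a day: a positive count makes SRANDMEMBER return distinct random members.
$ids = $redis->sRandMember($key, 3000);
// ... then SELECT * FROM article WHERE id IN (...$ids), as shown above.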
Just a shallow analysis, don't laugh.
1. Get the ID range
select max(id) as max_id, min(id) as min_id from article_tb where update_time >= ? and update_time <= ?
(update_time has an index; id is an auto-increment ID)
2. Random Access
select * from article_tb where id >= ( rand()*(max_id-min_id) + min_id ) limit 1
Run this query 3,000 times.
// STEP 1: get the ID range of the articles updated today
// maxId -> select max(id) from news where <today's date condition>
// minId -> select min(id) from news where <today's date condition>

// STEP 2: draw random IDs
// You have 100,000 rows a day and a very large total volume,
// so avoid doing the randomization inside MySQL.
$minId = 5000000;
$maxId = 5100000;
$i = 0;
$resultIds = [];
while (true) {
    $randId = rand($minId, $maxId);
    if (in_array($randId, $resultIds)) {
        continue;
    }
    // Query to validate the row:
    // check whatever you need, e.g. whether it is approved or a normal record;
    // if it passes, add it to the result array.
    $resultIds[] = $randId;
    $i++;
    if ($i == 3000) {
        break;
    }
}
// At this point the result set is ready;
// it can be stored and then paged through for research or browsing by other means.