How to optimize tens of millions of MySQL records
This post was last edited by demofere on 2015-02-11 13:56:31
function insert() {
    global $m;
    if (!isset($_REQUEST['strs']) || !isset($_REQUEST['type'])
        || !isset($_REQUEST['hash'])) {
        echo 'param error';
        return;
    }
    // strs holds all the posted strings
    $poststr = $_REQUEST['strs'];
    $xstrs = json_decode(stripslashes($poststr), true);
    $type = $_REQUEST['type'];
    $hash = $_REQUEST['hash'];
    if (count($xstrs) <= 0) {
        $msg = 'str error';
        DsLog::errLog($msg . $poststr);
        echo $msg;
        return;
    }
    if ($type == '0') {
        $table = 'white';
    } else if ($type == '1') {
        $table = 'black';
    } else {
        $msg = 'type error';
        DsLog::errLog($msg);
        echo $msg;
        return;
    }
    // Normalize the strings: trim, lowercase, escape, cap at 256 characters,
    // and keep only those at least 7 characters long
    $strs = array();
    for ($i = 0; $i < count($xstrs); $i++) {
        $strtmp = $xstrs[$i];
        $strtmp = trim($strtmp);
        $strtmp = strtolower($strtmp);
        $strtmp = addslashes($strtmp);
        if (strlen($strtmp) > 256) {
            $strtmp = substr($strtmp, 0, 256);
        }
        if (strlen($strtmp) >= 7) {
            array_push($strs, $strtmp);
        }
    }
    // Concatenate all strings into one IN (...) list
    $tmp = '("' . implode('","', $strs) . '")';
    // Fetch the strings that already exist
    $sql = "select * from $table where str in $tmp";
    $ret = mysql_query($sql, $m);
    if (!$ret) {
        $msg = 'exec error: ' . mysql_error($m) . ', ' . $sql;
        DsLog::errLog($msg);
        echo $msg;
        return;
    }
    $exists = array();
    $notexists = array();
    $count = mysql_num_rows($ret);
    for ($i = 0; $i < $count; $i++) {
        $item = mysql_fetch_assoc($ret);
        if (!$item) {
            break;
        }
        array_push($exists, $item['str']);
    }
    // Split the input into strings that already exist and strings that do not
    for ($i = 0; $i < count($strs); $i++) {
        if (in_array($strs[$i], $exists)) {
            continue;
        }
        array_push($notexists, $strs[$i]);
    }
    for ($i = 0; $i < count($exists); $i++) {
        $exists[$i] = addslashes($exists[$i]);
    }
    for ($i = 0; $i < count($notexists); $i++) {
        $notexists[$i] = addslashes($notexists[$i]);
    }
    if (count($exists) > 0) {
        // Update the count field of the existing strings
        $tmp = '("' . implode('","', $exists) . '")';
        $time = date('ymdhi');
        $sql = "update $table set count = count + 1 where str in $tmp";
        $ret = mysql_query($sql, $m);
        if (!$ret) {
            $msg = 'exec error: ' . mysql_error($m) . ', ' . $sql;
            DsLog::errLog($msg);
            echo $msg;
            return;
        }
        // Update the upd field of the existing strings
        $sql = "update $table set upd = '$time' where str in $tmp";
        $ret = mysql_query($sql, $m);
        if (!$ret) {
            $msg = 'exec error: ' . mysql_error($m) . ', ' . $sql;
            DsLog::errLog($msg);
            echo $msg;
            return;
        }
    }
    // Insert the strings that are new
    if (count($notexists) > 0) {
        $time = date('ymdhi');
        $sql = "insert ignore into $table (str, hash, count, upd) values ";
        for ($i = 0; $i < count($notexists); $i++) {
            $str = $notexists[$i];
            $crc = sprintf("%u", crc32($str));
            $sql .= "('$str', '$crc', '1', '$time'),";
        }
        $sql = substr($sql, 0, strlen($sql) - 1);
        $ret = mysql_query($sql, $m);
        if (!$ret) {
            $msg = 'insert error: ' . mysql_error($m) . ', ' . $sql;
            DsLog::errLog($msg);
            echo $msg;
            return;
        }
    }
    echo !!$ret;
}
I want to keep statistics on how many times each string occurs, essentially a map of string to count; the database also stores each string's CRC.
The insert function receives strs from the POST data. It is an array of strings, and on average each post carries about 1500 of them.
The table currently holds only hundreds of thousands of records, yet each insert call already takes about 20 seconds on average and often hits the 30-second timeout.
How can this be optimized? Only a small part of the data has been collected so far; the final volume is estimated at roughly 1 to 10 billion strings.
------ Solution ----------------------
Are you sure the bottleneck really is those three query and update statements?
You said each post carries about 1500 strings on average.
Is that 1500 strings? Even if it is actually 1500 characters of total length, that is still about 75 strings at an average of 20 characters each.
Your filter condition is str in $tmp, i.e. str in ('xxx', 'xxx', ...). That means every record in the table has to be compared against a list of 1500 (or 75) values, and each comparison hits at most 1 time in 1500 (or 75). Do you think that is reasonable?
A database's strength is comparing records against records; making it compare one column against a long list of literal values plays to its weakness.
You can load the input data into a temporary table (one string per row) and join it against the main table, so the database is used where it is strong. A sketch follows.
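For example, roughly along these lines. This is only a sketch: the staging table tmp_strs and its definition are assumptions, and $m, $table, $strs and $time are reused from the original function.

    // Staging table for the posted strings, one per row
    // (a prefix index may be needed instead, depending on charset and key-length limits)
    mysql_query("create temporary table tmp_strs (str varchar(256) not null, key (str))", $m);

    // Fill it with the already escaped, de-duplicated input strings
    $values = array();
    foreach (array_unique($strs) as $s) {
        $values[] = "('$s')";
    }
    mysql_query("insert into tmp_strs (str) values " . implode(',', $values), $m);

    // Update existing rows through a join instead of a long IN list
    mysql_query("update $table t join tmp_strs s on t.str = s.str
                 set t.count = t.count + 1, t.upd = '$time'", $m);

    // Insert the strings that are not in the main table yet
    mysql_query("insert ignore into $table (str, hash, count, upd)
                 select s.str, crc32(s.str), 1, '$time'
                 from tmp_strs s left join $table t on t.str = s.str
                 where t.str is null", $m);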
What's more, can't your two consecutive UPDATE statements simply be merged into one?
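Even without a staging table, the two UPDATEs in the original code can be combined into a single statement, for instance (same $table, $tmp and $time as in the original function):

    // One pass over the matching rows instead of two
    $sql = "update $table set count = count + 1, upd = '$time' where str in $tmp";
    $ret = mysql_query($sql, $m);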
------ Solution ----------------------
1. MySQL has MEMORY (HEAP) tables, and you are not using them.
2. Merging the two UPDATEs into one can save roughly 10 seconds.
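A rough sketch for point 1 (the table name is an assumption); point 2 is the same merge shown after the first reply.

    // Keep the per-request staging table in RAM with the MEMORY (HEAP) engine.
    // Subject to max_heap_table_size; MEMORY tables do not support TEXT/BLOB columns.
    mysql_query("create table tmp_strs_mem (str varchar(256) not null, key (str)) engine=memory", $m);

    // It can then be filled and joined exactly like the temporary table sketched earlier.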
------ Solution ----------------------
The old mysql_* library functions are not efficient and are no longer officially recommended; I have hardly used them, and went straight to PDO when I was learning PHP.
Also, with the IN keyword the query apparently does not use the index.
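For the PDO suggestion above, a minimal sketch with bound parameters (the connection details are placeholders, not from the original post; $strs and $table are reused from the original function):

    // Connect once; throw exceptions on errors instead of checking return values
    $pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'password');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // One placeholder per string, so no manual addslashes() escaping is needed
    $placeholders = implode(',', array_fill(0, count($strs), '?'));
    $stmt = $pdo->prepare("select str from $table where str in ($placeholders)");
    $stmt->execute($strs);
    $exists = $stmt->fetchAll(PDO::FETCH_COLUMN);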
------ Solution ----------------------
Reference: "The effect is almost the same."
That is most likely a problem with your test setup; in most cases a join query is faster than IN.
Your test result could be due to:
1. The queried table does not yet contain enough data.
2. The two query methods return different result sets.
3. The queried field is not indexed, or the wrong index is chosen.
4. Too many indexes getting in the way.
5. Other reasons.
The most likely one, I think, is reason 2.
When you join the tables you do not specify which columns to select, so the joined query necessarily returns more columns than the non-join query.
There is also the one-to-many case in joined queries, where one record turns into several rows and a GROUP BY is then needed to collapse them.
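A sketch of what this reply is getting at, reusing the assumed tmp_strs staging table from above: select only the column you need and collapse any one-to-many duplication with GROUP BY.

    // Select only the needed column and group, so a one-to-many join
    // cannot inflate the result set compared with the IN version
    $sql = "select t.str from $table t join tmp_strs s on t.str = s.str group by t.str";
    $ret = mysql_query($sql, $m);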