Sorted 100 million QQ numbers in PHP; posting the writeup.
The approach: split the data into thousands of parts, sort each part, then merge.
First, create a txt file containing 100 million QQ numbers.
<?php
// The opening lines of the original script were garbled in transit; reconstructed from context.
$st = microtime(true);
$fp = fopen('qq.txt', 'w');
foreach (range(0, 9999) as $v) {
    $arr = range($v * 10000 + 10000, 10000 * ($v + 1) + 9999); // 10000 numbers per block
    shuffle($arr);                                             // randomize the order
    fputs($fp, implode("\n", $arr) . "\n");
    unset($arr);
}
echo microtime(true) - $st;
?>
A minute or two later, the 100 million random QQ numbers are ready.
QQ numbers start from 10000; the file is about 840 MB.
Next, split them into thousands of files.
Use the length of the QQ number as the folder name and its first three digits as the file name.
<?php
// The start of the original script was garbled in transit; the read loop below is
// reconstructed from context (the original processed the file in chunks, counting them in $i).
function flush_buckets($text_arr) {
    foreach ($text_arr as $k => $v) {
        $n_dir = 'qq_no/' . $k;                        // number length as folder name
        if (!is_dir($n_dir)) mkdir($n_dir, 0777, true);
        foreach ($v as $tag => $val) {
            $n_tf = fopen($n_dir . '/' . $tag . '.txt', 'a+');
            fputs($n_tf, implode("\n", $val) . "\n");
            fclose($n_tf);
        }
    }
}
$st = microtime(true);
$fp = fopen('qq.txt', 'r');
$text_arr = array();
$i = 0;
while (($v = fgets($fp)) !== false) {
    $v = rtrim($v);
    if ($v != '') {
        $tag = "$v[0]$v[1]$v[2]";                      // first three digits as file name
        $text_arr[strlen($v)][$tag][] = $v;
    }
    if (++$i % 1000000 == 0) {                         // flush periodically to bound memory
        flush_buckets($text_arr);
        $text_arr = array();
    }
}
flush_buckets($text_arr);
echo microtime(true) - $st;
?>
Finally, sort each small file and merge everything.
<?php
// The opening lines were garbled in transit; $root and the output handle are reconstructed from context.
$st = microtime(true);
$root = 'qq_no';
$qq_done = fopen('qq_done.txt', 'a+');
$dirs = array();
foreach (scandir($root) as $val) {
    if ($val != '.' && $val != '..') $dirs[$val] = scandir($root . '/' . $val);
}
foreach ($dirs as $key => $val) {
    foreach ($val as $v) {
        if ($v != '.' && $v != '..') {
            $file = $root . '/' . $key . '/' . $v;
            $c = file_get_contents($file);
            $arr = explode("\n", trim($c));
            sort($arr);   // numbers in a bucket share a length, so string order equals numeric order
            fputs($qq_done, implode("\n", $arr) . "\n");
            unlink($file);
        }
    }
    rmdir($root . '/' . $key);
}
rmdir($root);
echo microtime(true) - $st;
?>
It took about 20 minutes in total.
It's done, but the method is clumsy; experts on the forum, please improve on it.
Reply to discussion (solution)
I've never learned php. Couldn't you use a hash to split them up and sort?
Just passing by.
People are still up at this hour?
Just watching the show with melon seeds, like the one upstairs.
Impressive array work.
Hand it over to the database.
I used your code to generate qq.txt,
then ran sort on it directly.
It produced the result in 18 minutes on my fedora (vmware).
(Maybe a little less, since I ran it in the background.)
I've never learned php. Couldn't you use a hash to split them up and sort?
I've never learned it either; I only know C and Java... As for hashing: if the QQ numbers don't repeat, they can be mapped one-to-one.
Here's a C version:
#include <stdio.h>

#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 100009999  /* assumed upper bound; the original constant was lost in formatting */

int a[1 + N / BITSPERWORD];

void set(int i) { a[i >> SHIFT] |= (1 << (i & MASK)); }   /* i & MASK is equivalent to i % 32 */
void clr(int i) { a[i >> SHIFT] &= ~(1 << (i & MASK)); }
int test(int i) { return a[i >> SHIFT] & (1 << (i & MASK)); }

int main()
{
    int i;
    for (i = 0; i < N; i++) clr(i);          /* initialize the bit vector */
    while (scanf("%d", &i) != EOF) set(i);   /* read the file, set each bit */
    for (i = 0; i < N; i++)
        if (test(i)) printf("%d\n", i);      /* output comes out in sorted order */
    return 0;
}
Do a boundary test first.
Having 100 million QQ numbers doesn't mean every number is below 100 million.
A friend of mine who applied in March already got the QQ number 1450250164.
Never underestimate the range of the numbers.
What if you store the data in a database first and then read it back? Would that be faster, or would it just crash?
A question, classmate: how did you do the sort?
I used your code to generate qq.txt,
then ran sort on it directly.
It produced the result in 18 minutes on my fedora (vmware).
(Maybe a little less, since I ran it in the background.)
True... I hadn't taken that into account.
The difference is huge.
You can see how large the gap is just from the size statistics.
Roughly:
All six-digit QQ numbers together come to about 7 MB.
All seven-digit QQ numbers together come to about 70 MB.
All eight-digit QQ numbers together come to about 700 MB.
...
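(A rough sanity check on those estimates, assuming one number per line plus a newline: there are 9,000,000 seven-digit numbers, so that bucket is about 9,000,000 × 8 bytes, roughly 72 MB, in line with the 70 MB above. Each extra digit means ten times as many numbers, hence files roughly ten times larger.)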
Do a boundary test first.
Having 100 million QQ numbers doesn't mean every number is below 100 million.
A friend of mine who applied in March already got the QQ number 1450250164.
Never underestimate the range of the numbers.
The C method is worth learning.
Still testing the database sort...
No success yet; the import is the problem.
A question, classmate: how did you do the sort?
Quoting helloyou0's reply on the 6th floor:
I used your code to generate qq.txt,
then ran sort on it directly.
It produced the result in 18 minutes on my fedora (vmware).
(Maybe a little less, since I ran it in the background.)
It's the unix sort command.
Some other day I'll also try the many-small-files approach and a mysql database.
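(For anyone wondering how to reproduce this: the thread doesn't give the exact invocation, so the following is a guess. A minimal sketch wrapped in php for timing; the file names and the numeric-sort flag are my assumptions.)
<?php
set_time_limit(0);
$st = microtime(true);
// GNU coreutils sort: -n compares numerically, -o writes the result to a file
shell_exec('sort -n qq.txt -o qq_sorted.txt');
echo microtime(true) - $st;
?>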
I have tested mysql.
Quoting ci1699's reply on the 12th floor:
A question, classmate: how did you do the sort?
Quoting helloyou0's reply on the 6th floor:
I used your code to generate qq.txt,
then ran sort on it directly.
It produced the result in 18 minutes on my fedora (vmware).
(Maybe a little less, since I ran it in the background.)
It's the unix sort command.
Some other day I'll also try the many-small-files approach and a mysql database.
Sorting with mysql:
importing, sorting, and exporting took about 10 minutes at most.
That's faster than the php above.
Now let me tell the story,
including every place where it hurt.
To sort with mysql, the data first has to be imported into the database.
So the first problem to solve is getting 100 million QQ numbers into the database.
You can't run INSERT 100 million times, can you?
At first I wanted to convert the generated QQ numbers directly into one SQL file and import that.
But the generated SQL came to more than 1 GB. That big, how do you even play with it?
So I thought: split it into pieces of a few hundred MB each.
Then import those few-hundred-MB SQL files one by one.
On import, CPU and memory shot up, and there was no response for a long time.
I checked the process list to see what was going on, but the table was still empty.
No error reported, no output shown. Was the file still too large?
Tried again, splitting into a few dozen MB.
Waited more than 10 minutes; it still failed, reporting a max_allowed_packet error.
Raised max_allowed_packet and continued the import; still an error.
In the end there was nothing for it but to try a tiny amount of data, and that finally revealed the cause.
The SQL statement itself was too long: ... (), (), ()..., so besides splitting the file, the SQL statements must be split as well.
I split the SQL statements at about 1 MB each.
The code is pasted below; you can try it.
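(The code block pasted at this point did not survive in this transcript. Below is a minimal sketch of what it plausibly did: stream qq.txt, build multi-row INSERT statements capped near 1 MB, and spread them across five SQL files. Every name in it is an assumption, not the poster's original code.)
<?php
// Hypothetical reconstruction, not the poster's original code.
set_time_limit(0);
$in = fopen('qq.txt', 'r');
$file_no = 1;
$out = fopen('SQL_' . $file_no . '.txt', 'w');
$stmt = '';
$bytes = 0;
while (($line = fgets($in)) !== false) {
    $qq = trim($line);
    if ($qq === '') continue;
    $stmt .= ($stmt === '') ? "INSERT INTO test (qq) VALUES ($qq)" : ",($qq)";
    if (strlen($stmt) >= 1000000) {                  // cap each statement near 1 MB
        fputs($out, $stmt . ";\n");
        $bytes += strlen($stmt);
        $stmt = '';
        if ($bytes >= 200000000 && $file_no < 5) {   // roughly a fifth of the data per file
            fclose($out);
            $out = fopen('SQL_' . ++$file_no . '.txt', 'w');
            $bytes = 0;
        }
    }
}
if ($stmt !== '') fputs($out, $stmt . ";\n");
fclose($out);
fclose($in);
?>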
After running it, five SQL files are generated.
Then go to the mysql console:
use test;
and source D:xxxx\SQL_X.txt for each file in turn.
OK. All 100 million rows are in the database.
Finally, sort and export to a txt file.
Completed successfully.
Two factors matter in this process: the engine and the index.
I tested both; compare for yourself.
CREATE TABLE IF NOT EXISTS `test` (
  `qq` int(10) NOT NULL,
  KEY `qq` (`qq`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

# MyISAM -----------------------
# Without index:
#   importing 100 million rows: about 210 seconds.
#   export: SELECT * FROM `test` ORDER BY `qq` ASC INTO OUTFILE "ok1.txt" took 426.73 seconds
#   (the ORDER BY `qq` alone takes 60 seconds; the export writes fast, about 60 MB/s).
#   Total: about 636 seconds, roughly 10 minutes.
# With index:
#   importing 100 million rows: 1100.1 seconds.
#   export: SELECT * FROM `test` ORDER BY `qq` ASC INTO OUTFILE "ok2.txt" took 391.24 seconds
#   (the ORDER BY `qq` alone takes 0.0003 seconds, quite fast, a little shocking.
#   But I don't know why the file writing during export is so slow: 2-3 MB/s on average).
#   Total: about 1491 seconds, roughly 24 minutes.
# InnoDB -----------------------
# Without index: importing 100 million rows takes about 1544 seconds.
# With index: importing 100 million rows takes > 4200 seconds (inserts slow down toward the end).
#   Too slow; I didn't bother exporting.
To sum up: the best option is the MyISAM engine without an index.
Finally, some questions. How does INTO OUTFILE actually work?
Why does the export write speed differ so much between the indexed and unindexed cases?
And what's depressing is InnoDB: why is it so slow?
Unimaginable. Enough, off to bed. 0_0zzzz
Awesome.
Really impressive; I'm speechless.
Impressive array work.
Since there are ready-made data files, there is no need to construct an insert string.
set_time_limit(0);
$sql =<<< SQL
CREATE TABLE IF NOT EXISTS qq1 (
  `qq` int(10) NOT NULL,
  KEY `qq` (`qq`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
SQL;
mysql_connect('localhost', 'root', '');
mysql_select_db('test');
mysql_query($sql);
$filename = str_replace('\\', '/', realpath('qq.txt'));
$sql =<<< SQL
LOAD DATA INFILE '$filename' INTO TABLE qq1
SQL;
check_speed(1);   // check_speed() is the poster's own timing helper
mysql_query($sql) or print(mysql_error());
check_speed();
Time: 182,955,851 microseconds
Memory: 664
set_time_limit(0);
mysql_connect('localhost', 'root', '');
mysql_select_db('test');

echo "ascending\n";
$filename = str_replace('\\', '/', dirname(__FILE__) . '/qq_1.txt');
if (file_exists($filename)) unlink($filename);
$sql =<<< SQL
SELECT qq FROM qq1 ORDER BY qq ASC INTO OUTFILE '$filename'
SQL;
check_speed(1);
mysql_query($sql) or print(mysql_error());
check_speed();

echo "descending\n";
$filename = str_replace('\\', '/', dirname(__FILE__) . '/qq_2.txt');
if (file_exists($filename)) unlink($filename);
$sql =<<< SQL
SELECT qq FROM qq1 ORDER BY qq DESC INTO OUTFILE '$filename'
SQL;
check_speed(1);
mysql_query($sql) or print(mysql_error());
check_speed();

echo "primary key order, unsorted\n";
$filename = str_replace('\\', '/', dirname(__FILE__) . '/qq_0.txt');
if (file_exists($filename)) unlink($filename);
$sql =<<< SQL
SELECT qq FROM qq1 INTO OUTFILE '$filename'
SQL;
check_speed(1);
mysql_query($sql) or print(mysql_error());
check_speed();
Ascending
Time: 46,308,538 microseconds
Memory: 520
Descending order
Time: 105,860,001 microseconds
Memory: 432
Primary key unordered
Time: 38,615,022 microseconds
Memory: 432
What's depressing is InnoDB: why is it so slow?
All data of InnoDB tables is stored in a single ibdata1 file, with rows located through file pointers.
But since InnoDB does not require contiguous disk space (unlike oracle, which allocates a contiguous region as a tablespace when a database is created), the access speed of an InnoDB table depends on how fragmented your hard disk is: the more fragments, the slower it gets.
I'm moved by the truth-seeking spirit here. I've thought about this problem too: if only memory were involved, I'd say hashing is best. But once the hard disk is involved and memory is constrained, massive IO becomes the main bottleneck, and that is exactly mysql's situation.
Since you already have the data loaded, try one more thing: the time to rebuild the index. If it takes more than 10 minutes, don't bother.
Creating an index generates an index file, and that is a genuine sorting process. Mysql uses a B+ tree structure; as the tree gets deeper, locating data takes longer and longer, so generating the index file gets slower and slower. Not optimistic. As for indexes on InnoDB, don't even consider them.
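(A minimal sketch of that re-index timing test, assuming the qq1 MyISAM table built above and reusing the poster's check_speed() helper; dropping and recreating the index is my reading of "rebuild".)
<?php
set_time_limit(0);
mysql_connect('localhost', 'root', '');
mysql_select_db('test');
check_speed(1);
// forces mysql to redo the whole B+ tree build for the column
mysql_query('ALTER TABLE qq1 DROP INDEX qq') or print(mysql_error());
mysql_query('ALTER TABLE qq1 ADD INDEX qq (qq)') or print(mysql_error());
check_speed();
?>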
InnoDB table test
set_time_limit(0);
$sql =<<< SQL
CREATE TABLE IF NOT EXISTS qq2 (
  `qq` int(10) NOT NULL,
  KEY `qq` (`qq`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SQL;
mysql_connect('localhost', 'root', '');
mysql_select_db('test');
mysql_query($sql);
echo "import\n";
$filename = str_replace('\\', '/', realpath('qq.txt'));
$sql =<<< SQL
LOAD DATA INFILE '$filename' INTO TABLE qq2
SQL;
check_speed(1);
mysql_query($sql) or print(mysql_error());
check_speed();
Import
Time: 3,286,588,777 microseconds
Memory: 664
Export
set_time_limit(0);
mysql_connect('localhost', 'root', '');
mysql_select_db('test');

echo "ascending\n";
$filename = str_replace('\\', '/', dirname(__FILE__) . '/qq_1.txt');
if (file_exists($filename)) unlink($filename);
$sql =<<< SQL
SELECT qq FROM qq2 ORDER BY qq ASC INTO OUTFILE '$filename'
SQL;
check_speed(1);
mysql_query($sql) or print(mysql_error());
check_speed();

echo "descending\n";
$filename = str_replace('\\', '/', dirname(__FILE__) . '/qq_2.txt');
if (file_exists($filename)) unlink($filename);
$sql =<<< SQL
SELECT qq FROM qq2 ORDER BY qq DESC INTO OUTFILE '$filename'
SQL;
check_speed(1);
mysql_query($sql) or print(mysql_error());
check_speed();

echo "primary key order, unsorted\n";
$filename = str_replace('\\', '/', dirname(__FILE__) . '/qq_0.txt');
if (file_exists($filename)) unlink($filename);
$sql =<<< SQL
SELECT qq FROM qq2 INTO OUTFILE '$filename'
SQL;
check_speed(1);
mysql_query($sql) or print(mysql_error());
check_speed();
Ascending
Time: 367,638,625 microseconds
Memory: 520
Descending order
Time: 390,232,528 microseconds
Memory: 432
Primary key unordered
Time: 367,762,026 microseconds
Memory: 432
OK, I tried it: 100,020,001 QQ numbers in all.
sort took 22 minutes.
Quoting ci1699's reply on the 12th floor:
A question, classmate: how did you do the sort?
Quoting helloyou0's reply on the 6th floor:
I used your code to generate qq.txt,
then ran sort on it directly.
It produced the result in 18 minutes on my fedora (vmware).
(Maybe a little less, since I ran it in the background.)
It's the unix sort command.
Some other day I'll also try the many-small-files approach and a mysql database.
Giving only the test time is not very meaningful, because how the same code runs on different machines and operating systems does not depend on the algorithm or the code itself.
Only different algorithms and implementations run in the same environment can show which is better.
So it can be done that way too.
I just ran it;
it took 387.14144778252 seconds.
But the imported data already has its index,
so it looks like I'm still using the index...
Since there are ready-made data files, there is no need to construct an insert string.
PHP code
set_time_limit(0);
$sql =<<< SQL
CREATE TABLE IF NOT EXISTS qq1 (
  `qq` int(10) NOT NULL,
  KEY `qq` (`qq`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
SQL;
mysql_connec......
With the same file, mysql myisam table import / index creation / export:
4 + 13.5 + 2 = 19.5 minutes, faster than sort.
OK, I tried it: 100,020,001 QQ numbers in all.
sort took 22 minutes.
Quoting helloyou0's reply on the 16th floor:
Quoting ci1699's reply on the 12th floor:
A question, classmate: how did you do the sort?
Quoting helloyou0's reply on the 6th floor:
I used your code to generate qq.txt,
then ran sort on it directly.
It produced the result on my fedora (vmware) in 18 min ......
Learned something. I had always thought InnoDB writes were faster than MyISAM's.
It seems InnoDB's advantages lie only elsewhere, in transactions and the like.
Quoting ci1699's reply on the 18th floor:
What's depressing is InnoDB: why is it so slow?
All data of InnoDB tables is stored in a single ibdata1 file, with rows located through file pointers.
But since InnoDB does not require contiguous disk space (unlike oracle, which allocates a contiguous region as a tablespace when a database is created), the access speed of an InnoDB table depends on how fragmented your hard disk is: the more fragments, the slower ......
I also want to try the C program from the 5th floor to see how fast it is:
http://topic.csdn.net/u/20111213/20/269a86f6-640e-4921-b8b1-b65840a5ef63_2.html
But in a quick try on a small file it has a bug, so it needs debugging first.
I'm moved by the truth-seeking spirit here. I've thought about this problem too: if only memory were involved, I'd say hashing is best. But once the hard disk is involved and memory is constrained, massive IO becomes the main bottleneck, and that is exactly mysql's situation.
Since you already have the data loaded, try one more thing: the time to rebuild the index. If it takes more than 10 minutes, don't bother.
Creating an index generates an index file, and that is a genuine sorting process. Mysql uses a B+ tree structure; as the tree gets deeper, locating data takes longer and longer, so generating the index file gets slower and slower ......
Sorry for all the trouble.
I also want to try the C program from the 5th floor to see how fast it is:
http://topic.csdn.net/u/20111213/20/269a86f6-640e-4921-b8b1-b65840a5ef63_2.html
But in a quick try on a small file it has a bug, so it needs debugging first.
:) Haha.
Actually what I want to say is: using this sort of thing as an interview question is pointless....
In my view the tests so far show that using sort is the most realistic option.
Obviously a job like this won't run very often; 22 minutes is completely acceptable, and far beyond what I imagined (I ran it in a VM on my desktop, with 1 GB of memory allocated to the VM). On a real server it would obviously be faster. With sort there is nothing to fuss over, just one line of command....
By the way, I really admire the authors of sort and the other unix tools... really good....
Sorry for all the trouble.
Quoting helloyou0's reply on the 32nd floor:
I also want to try the C program from the 5th floor to see how fast it is:
http://topic.csdn.net/u/20111213/20/269a86f6-640e-4921-b8b1-b65840a5ef63_2.html
But in a quick try on a small file it has a bug, so it needs debugging first.
C is bound to be faster than php; after all, php is interpreted, and what ultimately executes is still a C program.
Doesn't the time for importing 100 million rows into the database count? The database approach is definitely not viable! (Unless, of course, the data already lives in the database.)
Impressive, OP.
Haha. It's true this job won't come up often. With sort on linux it takes under ten minutes; that fast.
But you can't know without testing, and now there's experience for next time.
:) Haha.
Actually what I want to say is: using this sort of thing as an interview question is pointless....
In my view the tests so far show that using sort is the most realistic option.
Obviously a job like this won't run very often; 22 minutes is completely acceptable, and far beyond what I imagined (I ran it in a VM on my desktop, with 1 GB of memory allocated to the VM). On a real server it would obviously be faster. With sort there is nothing to fuss over, just one line of command....
By the way, sort and the other unix ......
C is bound to be faster than php; after all, php is interpreted, and what ultimately executes is still a C program.
I'm not so sure the C code is faster than linux's sort command.
Would any student who writes C care to test it?
Doesn't the time for importing 100 million rows into the database count? The database approach is definitely not viable! (Unless, of course, the data already lives in the database.)
You didn't read the thread carefully: the tests above currently show that the database approach is the fastest.
I've been doing php for a bit more than half a year, now doing web development at an Internet company. Writing my own code every day, I feel I'm not learning anything. How many years did it take you all to learn php?
Several years... and I'm still a rookie.
I've been doing php for a bit more than half a year, now doing web development at an Internet company. Writing my own code every day, I feel I'm not learning anything. How many years did it take you all to learn php?
The main post defaults to a maximum QQ number of 100009999, but the real maximum QQ number today may be 9999999999, about two orders of magnitude larger. For the real-world case, bit sorting needs about 9999999999 / 8 / 1024 / 1024 ≈ 1192 MB, that is, roughly 1.2 GB of memory...
Tested with the data generated by the main post on a weak machine: generating the data took 300+ seconds, and the bit-sorting method implemented in C took about 1 minute 10 seconds in total for reading, sorting, and writing. LZ can test it too. By the way, I changed the main post's generator so it writes one qq number per line, for easy reading:
fputs($fp,implode(PHP_EOL, $arr).PHP_EOL);
Compiled with gcc on ubuntu linux:
#include <stdio.h>
#include <stdlib.h>

#define MAX 100009999
#define SHIFT 5
#define MASK 0x1F
#define DIGITS 32

int a[1 + MAX / DIGITS];

void set(int n)   { a[n >> SHIFT] = a[n >> SHIFT] | (1 << (n & MASK)); }
void clear(int n) { a[n >> SHIFT] = a[n >> SHIFT] & (~(1 << (n & MASK))); }
int test(int n)   { return a[n >> SHIFT] & (1 << (n & MASK)); }

int main(int argc, char *argv[])
{
    int i;
    int tp;
    FILE *ip;
    FILE *op;

    for (i = 1; i <= MAX; i++) {
        clear(i);                            /* zero the bit vector */
    }
    ip = fopen("qq_before_sort.txt", "r");
    while (fscanf(ip, "%d\n", &tp) == 1) {   /* checking the return avoids the feof pitfall */
        set(tp);
    }
    fclose(ip);
    op = fopen("qq_after_sort.txt", "w");
    for (i = 1; i <= MAX; i++) {
        if (test(i)) {
            fprintf(op, "%d\n", i);          /* numbers come out in sorted order */
        }
    }
    fclose(op);
    return 0;
}
Seeing this thread again; busy right now, just here to learn.
C programs are this fast????
The main post defaults to a maximum QQ number of 100009999, but the real maximum QQ number today may be 9999999999, about two orders of magnitude larger. For the real-world case, bit sorting needs about 9999999999 / 8 / 1024 / 1024 ≈ 1192 MB, that is, roughly 1.2 GB of memory...
Tested with the data generated by the main post on a weak machine: generating the data took 300+ seconds, and the bit-sorting method implemented in C took about 1 minute 10 seconds in total for reading, sorting, and writing. LZ can test it too. By the way, I changed the main post's generator so it writes one qq number per line, which is convenient ......
Why won't this compile for me...
Error: qq.c line 7: array size too large.