Matching the CSDN user database against a site's own users and filtering out the overlapping accounts

Source: Internet
Author: User

Process:
1. Obtain the CSDN user database and import it locally
I first tried to open the file in EditPlus, but it reported insufficient memory and there was no way around that. I then looked at the file on a colleague's Linux machine. The basic format is as follows:
Username # password # email
Username # password # email
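As a minimal sketch of how one such line can be split into its three fields, the helper below splits on `#` and trims the surrounding spaces and the trailing line break. The name `parse_csdn_line` is hypothetical and not part of the original scripts. It assumes no field itself contains a `#`, and returns null for any line that does not have exactly three fields.

```php
<?php
// Hypothetical helper: split one "username # password # email" line into fields.
function parse_csdn_line($line)
{
    $parts = explode('#', $line);
    if (count($parts) !== 3) {
        return null; // malformed line, or a field that itself contains '#'
    }
    // trim() removes the spaces around each field and the trailing line break.
    list($username, $password, $email) = array_map('trim', $parts);
    return array('username' => $username, 'password' => $password, 'email' => $email);
}

$row = parse_csdn_line("alice # secret123 # alice@example.com\r\n");
echo $row['username'], '|', $row['password'], '|', $row['email'], "\n";
```

Trimming matters here because a plain `explode("#", ...)` leaves the separator spaces in every field and the line break at the end of the email.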
Table structure:
CREATE TABLE IF NOT EXISTS `csdn_userdb` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `username` varchar(50) CHARACTER SET gbk NOT NULL,
  `password` varchar(50) CHARACTER SET gbk NOT NULL,
  `email` varchar(50) CHARACTER SET gbk NOT NULL,
  PRIMARY KEY (`id`),
  KEY `username` (`username`),
  KEY `email` (`email`)
) ENGINE=MyISAM DEFAULT CHARSET=gbk AUTO_INCREMENT=1;

I had suspected that fopen reads the whole file into memory, but in practice it opens very quickly, so that does not seem to be the case. The following code imports the data:
<?php
// Note: the mysql_* functions were removed in PHP 7; this script targets the older ext/mysql extension.
$link = mysql_connect('localhost', 'root', 'admin', true);
mysql_select_db('csdn', $link);
$handle = fopen('C:\Users\zhudong\Desktop\www.csdn.net.sql', 'r');
$i = 0;
while (!feof($handle)) {
    $i++;
    $buffer = fgets($handle);
    list($u, $p, $e) = explode("#", $buffer);
    // Values are interpolated unescaped; a quote in the data breaks the statement.
    mysql_query("INSERT INTO csdn_userdb (username, password, email) VALUES ('$u', '$p', '$e')", $link);
    if ($i % 1000 == 0) echo $i . "\n"; // progress output every 1000 lines
}
fclose($handle);
?>

The code above is very inefficient, so I changed it to insert the rows in batches:
<?php
$link = mysql_connect('localhost', 'root', 'admin', true);
mysql_select_db('csdn', $link);
$handle = fopen('C:\Users\zhudong\Desktop\www.csdn.net.sql', 'r');
$perpage = 50; // rows per INSERT statement
$i = 0;
$insertValue = array();
while (!feof($handle)) {
    $i++;
    $buffer = fgets($handle);
    list($u, $p, $e) = explode("#", $buffer);
    // Values are interpolated unescaped; a quote in the data makes the whole batch fail.
    $insertValue[] = "('$u', '$p', '$e')";
    if ($i % $perpage == 0) {
        $effectvaluestring = implode(',', $insertValue);
        mysql_query("INSERT INTO csdn_userdb (username, password, email) VALUES $effectvaluestring", $link);
        echo $i . "\n";
        $insertValue = array();
    }
}
// Rows still in $insertValue here (fewer than $perpage) are never inserted.
fclose($handle);
?>

To find out which factors affect the import speed, I ran the import with several different settings.
Total CSDN user records: 6,428,600
$perpage = 500: imported 5,902,000; lost 526,600; loss rate 8%; table engine MyISAM; indexes present; total time 15 minutes
$perpage = 200: imported 6,210,200; lost 218,400; loss rate 3.3%; table engine MyISAM; indexes present; total time 30 minutes
$perpage = 200: imported 6,210,200; lost 218,400; loss rate 3.3%; table engine InnoDB; indexes present; total time 65 minutes
$perpage = 200: imported 6,210,200; lost 218,400; loss rate 3.3%; table engine MyISAM; no indexes; total time 14 minutes (indexes created after the import)
$perpage = 50: imported 6,371,200; lost 57,400; loss rate 0.8%; table engine MyISAM; no indexes; total time 20 minutes
These results can be summarized as follows:
1. Importing the data first and creating the indexes afterwards is about twice as fast as importing into a table that already has indexes (14 minutes versus 30 minutes at $perpage = 200).
2. For inserts from a single process, InnoDB is much slower than MyISAM (65 minutes versus 30 minutes).
3. With $perpage set to 50, the data loss rate is below 1%.
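In every run above, the number of lost rows is a whole number of batches (218,400 = 1,092 × 200 and 57,400 = 1,148 × 50), plus, for $perpage = 500, the 100 rows at the end of the file that never fill a batch (526,600 = 1,053 × 500 + 100). That pattern suggests that entire multi-row INSERT statements failed. One plausible cause, which the original does not confirm, is a quote or backslash in a field breaking the SQL. The sketch below, under that assumption, escapes every value and keeps the final partial batch. Both function names are hypothetical, and `addslashes` merely stands in for a connection-aware escaper such as `mysqli_real_escape_string`.

```php
<?php
// Hypothetical helper: build one multi-row INSERT from already-parsed rows.
// $escape is any string-escaping callable; addslashes is only a stand-in here.
function build_batch_insert(array $rows, $escape)
{
    $values = array();
    foreach ($rows as $row) {
        $fields = array_map($escape, $row);
        $values[] = "('" . implode("', '", $fields) . "')";
    }
    return 'INSERT INTO csdn_userdb (username, password, email) VALUES '
        . implode(', ', $values);
}

// Hypothetical helper: split rows into batches, keeping the final partial batch.
function make_batches(array $rows, $perpage)
{
    return array_chunk($rows, $perpage);
}

$rows = array(
    array('alice', "pa'ss", 'alice@example.com'),
    array('bob', 'hunter2', 'bob@example.com'),
    array('carol', 'qwerty', 'carol@example.com'),
);
foreach (make_batches($rows, 2) as $batch) {
    echo build_batch_insert($batch, 'addslashes'), ";\n";
}
```

With three rows and a batch size of two, the second statement carries the single leftover row, so nothing is dropped at the end of the input.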
Running the script through the browser can time out and is slow, so a long import like this is better run from the command line.
At first, I ran the following command:
php.exe E:\usr\www\importcsdndb.php
However, it reported the error: Call to undefined function mysql_connect()
After some digging, I found that php.ini was not being loaded.
The correct command is:
php.exe -c E:/usr/local/apache2/php.ini importcsdndb.php

2. Import the user data to be matched into the local database
Open the mysql client from the command line (look it up on Baidu if you are not sure how)
Run mysql> source C:/Users/zhudong/Desktop/userdb.sql
3. Compare and filter users
Once the comparison program is written, remember to run it from the command line:
<?php
$link = mysql_connect('localhost', 'root', 'admin', true);
mysql_select_db('csdn', $link);
// The file mode is missing in the source; "w" (overwrite) is assumed here.
$handle_username = fopen("E:/records_username.txt", "w");
// $handle_email = fopen("E:/records_email.txt", "w");
$username_num = $email_num = $uid = 0;
while ($uid < 2181106) {
    $nextuid = $uid + 10000;
    // "<=" so that a uid equal to $nextuid is not skipped between ranges
    $query = mysql_query("SELECT * FROM pw_members WHERE uid > '$uid' AND uid <= '$nextuid'");
    while ($rt = mysql_fetch_array($query, MYSQL_ASSOC)) {
        $username = $rt['username'];
        $email = $rt['email'];
        // One extra query per member: this is what makes the script so slow.
        $query2 = mysql_query("SELECT * FROM csdn_userdb WHERE username = '$username' OR email = '$email'");
        while ($rt2 = mysql_fetch_array($query2, MYSQL_ASSOC)) {
            if ($rt['password'] == md5($rt2['password'])) {
                if ($rt2['username'] == $username) {
                    $username_num++;
                    fwrite($handle_username, 'own:' . $rt['uid'] . '|' . $rt['username'] . '|' . $rt['password'] . '|' . $rt['email'] . 'csdn:' . $rt2['username'] . '|' . $rt2['password'] . '|' . $rt2['email'] . "\r\n");
                    echo 'username_num=' . $username_num . "\r\n";
                    continue;
                }
                /*
                if ($rt2['email'] == $email) {
                    $email_num++;
                    fwrite($handle_email, 'own:' . $rt['uid'] . '|' . $rt['username'] . '|' . $rt['password'] . '|' . $rt['email'] . 'csdn:' . $rt2['username'] . '|' . $rt2['password'] . '|' . $rt2['email'] . "\r\n");
                    echo 'email_num=' . $email_num . "\r\n";
                }
                */
            }
        }
        mysql_free_result($query2);
    }
    $uid = $nextuid;
}
?>

The code above is very poor: it is so inefficient that, with millions of rows, it takes more than 10 hours to run. How could I have forgotten the basics of table joins? The corrected version follows.
<?php
$link = mysql_connect('localhost', 'root', 'admin', true);
mysql_select_db('csdn', $link);
// The file mode is missing in the source; "w" (overwrite) is assumed here.
$handle_username = fopen("E:/records_username.txt", "w");
$username_num = $uid = 0;
while ($uid < 2181106) { // The number here is the maximum ID of the user table being compared
    $nextuid = $uid + 10000;
    // A single join per range of 10,000 uids replaces the per-member lookup.
    $query = mysql_query("SELECT m.uid, m.username, m.password, m.email, u.password AS csdn_password, u.email AS csdn_email FROM own_members m LEFT JOIN csdn_userdb u USING (username) WHERE m.uid > '$uid' AND m.uid <= '$nextuid' AND u.username != ''");
    while ($rt = mysql_fetch_array($query, MYSQL_ASSOC)) {
        if ($rt['password'] == md5($rt['csdn_password'])) {
            $username_num++;
            fwrite($handle_username, 'own:' . $rt['uid'] . '|' . $rt['username'] . '|' . $rt['password'] . '|' . $rt['email'] . 'csdn:' . $rt['username'] . '|' . $rt['csdn_password'] . '|' . $rt['csdn_email'] . "\r\n");
            echo 'username_num=' . $username_num . "\r\n";
        }
    }
    $uid = $nextuid;
    echo 'uid=' . $uid;
}
?>
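
The match test inside the loop compares the site's stored hash with the MD5 of the leaked plaintext password. A minimal sketch of that check in isolation follows. The function name `is_same_password` is hypothetical, and it assumes, as the scripts above do, that the site stores unsalted MD5 hashes in lowercase hex.

```php
<?php
// Hypothetical helper: does the leaked plaintext hash to the site's stored MD5?
function is_same_password($ownMd5Hash, $csdnPlainPassword)
{
    // Strict comparison: both sides must be identical strings.
    return $ownMd5Hash === md5($csdnPlainPassword);
}

$stored = md5('123456'); // what the site's table would hold for this password
var_dump(is_same_password($stored, '123456'));  // same password
var_dump(is_same_password($stored, '1234567')); // different password
```

The sketch uses `===` rather than `==` because PHP's loose comparison can treat two different numeric-looking hash strings (for example ones starting with `0e`) as equal.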

The whole comparison now takes 25 minutes, a great improvement over the previous 10 hours.
Total users with duplicate names: 34,175
Share of all members: 1.7%
An overlap of 1.7% is still quite serious. I hope this article helps you run the same comparison for the users of your own site.
