Matching the CSDN user database with the official user and filtering out the overlapping parts of the user _php instance

Source: Internet
Author: User
Tags explode md5
Process:
1, get the CSDN user database import Local
Trial EditPlus Open prompt memory, did not find a way, colleagues under the Linux view, the basic format is as follows:
User name # password # mailbox
User name # password # mailbox
corresponding data structure:
Copy Code code as follows:

CREATE TABLE IF not EXISTS ' Csdn_userdb ' (
' ID ' int (a) not NULL auto_increment,
' username ' varchar (m) Character set GBK not NULL,
' Password ' varchar (m) Character set GBK not NULL,
' Email ' varchar character Set GBK not NULL,
PRIMARY KEY (' id '),
KEY ' username ' (' username '),
KEY ' email ' (' email ')
) Engine=myisam DEFAULT CHARSET=GBK auto_increment=1;

Always suspected fopen Open file is write cache, but the practice proved fast, should be no write cache, the following is the code to import data
Copy Code code as follows:

<?php
$link = mysql_connect (' localhost ', ' root ', ' admin ', true);
mysql_select_db (' csdn ', $link);
$handle = fopen ("C:\Users\zhudong\Desktop\www.csdn.net.sql", "R");
while (!feof ($handle)) {
$i + +;
$buffer = fgets ($handle);
List ($u, $p, $e) = Explode ("#", $buffer);
mysql_query ("INSERT into Csdn_userdb (username,password,email) VALUES (' $u ', ' $p ', ' $e ')", $link);
if ($i%1000 = = 0) echo $i. " \ n ";
}
Fclose ($handle);
?>

The above code efficiency is very poor, so make the modified code as follows:
Copy Code code as follows:

<?php
$link = mysql_connect (' localhost ', ' root ', ' admin ', true);
mysql_select_db (' csdn ', $link);
$handle = fopen ("C:\Users\zhudong\Desktop\www.csdn.net.sql", "R");
$perpage = 50;
while (!feof ($handle)) {
$i + +;
$buffer = fgets ($handle);
List ($u, $p, $e) = Explode ("#", $buffer);
$insertValue [] = "(' $u ', ' $p ', ' $e ')";
if ($i% $perpage = = 0) {$perpage = = 0) {
$instrtValueString = Implode (', ', $insertValue);
mysql_query ("INSERT into Csdn_userdb (username,password,email) VALUES $instrtValueString", $link);
echo $i. " \ n ";
$insertValue = Array ();
}
}
Fclose ($handle);

In order to understand the factors that have an impact on the efficiency of importing data, the tests are based on different settings
Total number of CSDN users according to 6428600
When $perpage=500 data: 5,902,000 data loss 526600 loss rate: 8%; Datasheet engine: MyISAM; index: there; Total time: 15 minutes
When $perpage=200, total data after import: 6,210,200; data loss: 218400; loss rate: 3.3%; Datasheet engine: MYISAM; index: there; Total time: 30 minutes
When $perpage=200, total data after import: 6,210,200; data loss: 218400; loss rate: 3.3%; Datasheet engine: INNODB; index: there; Total time: 65 minutes
When $perpage=200, total data after import: 6,210,200 data loss: 218400; loss rate: 3.3%; Datasheet engine: MYISAM; index: none; Total time consuming: 14 minutes (after data is imported and indexed separately)
When $perpage=50, total data after import: 6,371,200; Data loss: 57400, loss rate: 0.8%; Datasheet engine: MYISAM; Index: None: Total time: 20 minutes
Based on the above, summarized as follows:
1, the first import data after the index is more efficient than the first index after the import data is one times higher
2, InnoDB in the single process data insertion efficiency is much lower than MyISAM
3, when perpage=50 in the case of data loss rate below 1%
Copy Code code as follows:

Because of the problem of timeout through the browser, and the efficiency of the underground, so run through the command line, the process encountered a little trouble and delayed a lot of time
At first I executed the following code:
Php.exe E:\usr\www\importcsdndb.php
But always the error: Call to undefined function mysql_connect
Toss found not loaded into the php.ini
The correct code is:
Php.exe-c E:/usr/local/apache2/php.ini importcsdndb.php

2, import needs to match the user data data to the local
Command line to enter MSYQL (will not own Baidu)
Then execute: Mysql>source c:/users/zhudong/desktop/userdb.sql
3. Compare and filter users
The contrast program is written, remember to run it at the command line:
Copy Code code as follows:

<?php
$link = mysql_connect (' localhost ', ' root ', ' admin ', true);
mysql_select_db (' csdn ', $link);
$handle _username = fopen ("E:/records_username.txt", "a");
$handle _email = fopen ("E:/records_email.txt", "a");
$username _num = $email _num = $uid = 0;
while ($uid <2181106) {
$nextuid = $uid +10000;
$query = mysql_query ("select * from Pw_members WHERE uid> ' $uid ' and uid< ' $nextuid '");
while ($rt = Mysql_fetch_array ($query, Mysql_assoc)) {
$username = $rt [' username '];
$email = $rt [' email '];
$query 2 = mysql_query ("select * from Scdn_userdb WHERE username= ' $username ' OR email= ' $email");
while ($rt 2 = mysql_fetch_array ($query 2,mysql_assoc)) {
if ($rt [' password '] = MD5 ($rt 2[' password ')) {
if ($rt 2[' username '] = = $username) {
$username _num++;
Fwrite ($handle _username, ' OWN: '. $rt [' uid ']. ' | ' $rt [' username ']. $rt [' Password ']. $rt [' email ']. ' CSDN: '. $rt 2[' username ']. $rt 2[' password ']. $rt 2[' email ']. " \ r \ n ");
Echo ' username_num= '. $username _num. " \ r \ n ";
Continue
}
/*
if ($rt 2[' email '] = = $email) {
$email _num++;
Fwrite ($handle _email, ' OWN: '. $rt [' uid ']. ' | ' $rt [' username ']. $rt [' Password ']. $rt [' email ']. ' CSDN: '. $rt 2[' username ']. $rt 2[' password ']. $rt 2[' email ']. " \ r \ n ");
Echo ' email_num= '. $email _num. " \ r \ n ";
}
*/
}
}
Mysql_free_result ($query 2);
}
$uid = $nextuid;
}
?>

You see the above code is very bad, because its efficiency is particularly low, millions of of data, to run more than 10 hours, how can forget the table query so basic things, the following is the revised method
Copy Code code as follows:

$link = mysql_connect (' localhost ', ' root ', ' admin ', true);
mysql_select_db (' csdn ', $link);
$handle _username = fopen ("E:/records_username.txt", "a");
while ($uid <2181106) {//The number here is to compare the maximum ID of the user library
$nextuid = $uid +10000;
$query = mysql_query ("Select M.uid,m.username,m.password,m.email,u.password as csdn_password,u.email as Csdn_email From Own_members M-left JOIN csdn_userdb u USING (username) WHERE m.uid> ' $uid ' and m.uid<= ' $nextuid ' and u.username!= ''");
while ($rt = Mysql_fetch_array ($query, Mysql_assoc)) {
if ($rt [' password '] = = MD5 ($rt [' Csdn_password '])) {
$username _num++;
Fwrite ($handle _username, ' OWN: '. $rt [' uid ']. ' | ' $rt [' username ']. $rt [' Password ']. $rt [' email ']. ' CSDN: '. $rt [' username ']. $rt [' Csdn_password ']. $rt [' Csdn_email ']. " \ r \ n ");
Echo ' username_num= '. $username _num. " \ r \ n ";
}
}
$uid = $nextuid;
Echo ' uid= '. $uid;
}
?>

The total comparison time is 25 minutes, compared to the first more than 10 hours of implementation is really a great improvement
Total Duplicate User: 34175
Percentage of total membership: 1.7%
1.7% of the name of the user is very serious, I hope this article on the webmaster to compare the site to help users

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.