-- Authored by Li JiaYou
Alibaba Merchant Information Collection Instructions
1. How to obtain the link to the merchant list page
http://www.alibaba.com/corporations/jiangmen/CN-----------.html
For example, this page lists all Jiangmen merchants on Alibaba, and its pager shows Page: 1/29.
The link to the second page is:
http://www.alibaba.com/corporations/jiangmen/CN-----------/2.html?tracelog=24581_list_turnpage
Notice that the end of the path has changed to /2.html.
Remove the ? and the parameters after it, then change the number to 3, 4, 5, and so on:
http://www.alibaba.com/corporations/jiangmen/CN-----------/2.html
http://www.alibaba.com/corporations/jiangmen/CN-----------/3.html
http://www.alibaba.com/corporations/jiangmen/CN-----------/5.html
So the general list page link, where $page is the page number, is:
http://www.alibaba.com/corporations/jiangmen/CN-----------/$page.html
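The URL pattern above can be sketched as a small helper. This is only an illustration: the function name list_page_url is hypothetical, and it assumes the first page is also reachable as /1.html, as the example in section 3 suggests.

```php
<?php
// Hypothetical helper: builds the list page URL for a given page number.
function list_page_url(int $page): string
{
    $base = 'http://www.alibaba.com/corporations/jiangmen/CN-----------';
    // Assumption: page 1 follows the same /N.html pattern as the later pages.
    return $base . '/' . $page . '.html';
}

// Build the URLs for all 29 list pages.
$urls = array_map('list_page_url', range(1, 29));
```

Keeping the pattern in one function means the page count or region can be changed in a single place.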
2. Retrieve all page content from the list page
Because Alibaba has anti-scraping measures, we disguise the HTTP request as coming from Internet Explorer.
<?php
$HTTP_SESSION = _rand();
$HTTP_SESSION;
$HTTP_URL = "http://www.alibaba.com/corporations/jiangmen/CN-----------/" . $page . ".html";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $HTTP_URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)");
$res = curl_exec($ch);
curl_close($ch);
?>
In this way, the content of the list page is assigned to $res.
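The same request can be wrapped in a reusable function that also reports failures. This is a minimal sketch, not the author's code: the name fetch_page, the timeout parameter, and the shortened User-Agent string are assumptions made for illustration.

```php
<?php
// Hypothetical helper: fetches a URL with a browser-like User-Agent.
// Returns the response body, or null if the request fails.
function fetch_page(string $url, int $timeoutSeconds = 30): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)');
    // With CURLOPT_RETURNTRANSFER set, curl_exec() returns the body or false.
    $body = curl_exec($ch);
    // The handle is released when $ch goes out of scope.
    return $body === false ? null : $body;
}
```

A caller would check for null before running any regular expression on the result, so that one failed request does not produce empty rows.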
3. How to obtain a specific merchant link from the list page
Take the first page as an example:
http://www.alibaba.com/corporations/jiangmen/CN-----------/1.html
View the page source and you will find that every merchant name links to an address of this form:
Jiangmen Ronda Battery Co., Ltd.
http://{CompanyName}.en.alibaba.com
Use a regular expression to find every {CompanyName} in $res:
preg_match_all('/href\s*=\s*["|\']?([^\s"\'>]*)\.en\.alibaba\.com\"/i', $res, $arr);
In this way, $arr[1] holds, for every seller link on the first list page, the part of the link before .en.alibaba.com.
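To see the extraction step in isolation, here is a sketch run against made-up markup. The sample HTML and the seller "example-shop" are hypothetical, and the pattern is a variant of the one above that consumes "http://" so that only the subdomain is captured.

```php
<?php
// Hypothetical sample of list-page markup; the real page may differ.
$res = '<a href="http://rondabattery.en.alibaba.com">Jiangmen Ronda Battery Co., Ltd.</a>'
     . '<a href="http://example-shop.en.alibaba.com">Example Shop</a>'
     . '<a href="http://rondabattery.en.alibaba.com">Contact</a>';

// Capture only the subdomain between "http://" and ".en.alibaba.com".
preg_match_all('/href\s*=\s*["\']?http:\/\/([^\s"\'>.\/]+)\.en\.alibaba\.com/i', $res, $arr);

// A list page may link to the same seller more than once, so deduplicate.
$sellers = array_values(array_unique($arr[1]));
// $sellers is now ['rondabattery', 'example-shop']
```

Deduplicating here avoids fetching the same contact page several times in the next step.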
4. How to collect merchant Information
First, loop over the sellers one at a time.
<?php
foreach ($arr[1] as $a => $web) { /* collect one seller here */ }
?>
Append .en.alibaba.com to $web to build the seller link.
For example, http://rondabattery.en.alibaba.com/
Browsing the site shows that each company's contact information is at a URL such as http://rondabattery.en.alibaba.com/contactinfo.html
Again disguised as IE, fetch the contact information page of a single merchant.
<?php
$HTTP_SESSION = _rand();
$HTTP_SESSION;
$HTTP_Server = $web;
$HTTP_URL = ".en.alibaba.com/contactinfo.html";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $HTTP_Server . $HTTP_URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)");
$res1 = curl_exec($ch);
curl_close($ch);
?>
In this way, $res1 contains the content of the rondabattery contact information page.
5. How to collect detailed information such as contacts and telephones
http://rondabattery.en.alibaba.com/contactinfo.html
Check the page source to find:
Company Name:
Jiangmen Ronda Battery Co., Ltd.
The company name and the other fields all follow this format, so a regular expression can match them:
preg_match("/Company Name:(.*?)<\/td>/s", $res1, $Cname);
In this way, $Cname holds the matched content:
Company Name:
Jiangmen Ronda Battery Co., Ltd.
This is apparently not what we need yet, so strip the remaining tags and the surrounding whitespace:
$Cname = trim(strip_tags($Cname[1]));
trim: removes whitespace at the beginning and end of a string.
strip_tags: removes HTML and PHP tags from a string.
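Since every field is matched and cleaned the same way, the pattern can be factored into one function. The following is a sketch under assumptions: extract_field is a hypothetical name, and the sample markup and the zip value are invented for the demonstration and may not match the real contact page.

```php
<?php
// Hypothetical helper: returns the cleaned text that follows "$label:" up to
// the end of its table cell, or null when the label is not found.
function extract_field(string $html, string $label): ?string
{
    $pattern = '/' . preg_quote($label, '/') . '\s*:(.*?)<\/td>/is';
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }
    // Drop the markup, decode entities such as &amp;, then trim whitespace.
    return trim(html_entity_decode(strip_tags($m[1])));
}

// Hypothetical sample of contact-page markup with made-up values.
$res1 = '<tr><th>Company Name:</th><td> Jiangmen Ronda Battery Co., Ltd. </td></tr>'
      . '<tr><th>Zip:</th><td>123456</td></tr>';

$company = extract_field($res1, 'Company Name'); // 'Jiangmen Ronda Battery Co., Ltd.'
$zip     = extract_field($res1, 'Zip');          // '123456'
$fax     = extract_field($res1, 'Fax');          // null, the label is absent
```

Returning null for a missing label makes it possible to tell an absent field from an empty one before saving.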
Tip:
The content sometimes contains double quotation marks ("), which are common in the company introduction and similar fields. They need to be replaced; otherwise an SQL statement error occurs when saving to the database.
$name = str_replace("\"", "", $name);
6. Store the obtained information to the database
<?php
mysql_pconnect("localhost", "root", "password") or
die("could not connect: " . mysql_error());
mysql_select_db("company");
mysql_query("set names 'utf8'");
$result = mysql_query("
INSERT INTO alibaba (
Name, Company, Address, City, Province, Region, Zip, Tel, Phone, Fax, Web
) VALUES (
'" . htmlspecialchars($name) . "',
'" . htmlspecialchars($Cname) . "',
'" . htmlspecialchars($Add) . "',
'" . htmlspecialchars($City) . "',
'" . htmlspecialchars($Pronvice) . "',
'" . htmlspecialchars($Region) . "',
'" . htmlspecialchars($Zip) . "',
'" . htmlspecialchars($Tel) . "',
'" . htmlspecialchars($Phone) . "',
'" . htmlspecialchars($Fax) . "',
'" . htmlspecialchars($Web) . "'
)");
?>
htmlspecialchars escapes HTML special characters in the content. With its default flags before PHP 8.1 it converts double quotes but leaves single quotes unchanged, and it does not perform SQL escaping, which is why quotation marks were handled separately above.
Modify the account, password, database name, table name, and field definitions as needed.
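Which quotation marks htmlspecialchars converts depends on its flags, so it can help to check the behavior directly. The sketch below passes the flags explicitly instead of relying on the default; the sample string is made up.

```php
<?php
$value = 'He said "hi" & it\'s <b>fine</b>';

// ENT_COMPAT converts double quotes but leaves single quotes alone.
$compat = htmlspecialchars($value, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES converts both double and single quotes.
$quotes = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
```

Because the INSERT statement above delimits values with single quotes, ENT_QUOTES is the variant that keeps a stray single quote from ending the SQL string early.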
7. How to process each list page
The previous loop only collects the 20 sellers on the first list page, but there are 29 list pages to process.
To handle this, the script jumps back to itself with a page parameter.
<?php
if ($page >= 29) {
    echo "OVER!"; exit();
} else {
    echo " "; // jump to this script again with the page number ++$page
}
?>
The page number of the list to process is read at the beginning of the PHP file.
<?php
if (!empty($_GET['page'])) {
    $page = $_GET['page'];
} else {
    $page = '1';
}
?>
In this way, the script first obtains the page number of the list to process.
If the page parameter is not set, execution starts from the first page.
It then builds the link mentioned at the beginning of the article:
$HTTP_URL = "http://www.alibaba.com/corporations/jiangmen/CN-----------/" . $page . ".html";
At the end of the program, it checks the page number just processed. If the maximum page number has been reached, OVER is displayed and the program terminates.
Otherwise, ++$page increments the page number by 1 and the script jumps to itself to continue.
8. PHP code
Alibaba collection
<?php
set_time_limit(0);
// Returns a random 26-character string of digits and lowercase letters.
function _rand() {
    $length = 26;
    $chars = "0123456789abcdefghijklmnopqrstuvwxyz";
    $max = strlen($chars) - 1;
    mt_srand((int) ((double) microtime() * 1000000));
    $string = "";
    for ($i = 0; $i < $length; $i++) {
        $string .= $chars[mt_rand(0, $max)];
    }
    return $string;
}
error_reporting(0);
ini_set('html_errors', false);
ini_set('display_errors', false);
mysql_pconnect("localhost", "root", "password") or
die("could not connect: " . mysql_error());
mysql_select_db("company");
mysql_query("set names 'utf8'");
// Page number of the list to process; defaults to the first page.
if (!empty($_GET['page'])) {
    $page = $_GET['page'];
} else {
    $page = '1';
}
$HTTP_SESSION = _rand();
$HTTP_SESSION;
// Fetch the list page, disguised as Internet Explorer.
$HTTP_URL = "http://www.alibaba.com/corporations/jiangmen/CN-----------/" . $page . ".html";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $HTTP_URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)");
$res = curl_exec($ch);
curl_close($ch);
// Collect the seller links found on the list page.
preg_match_all('/href\s*=\s*["|\']?([^\s"\'>]*)\.en\.alibaba\.com\"/i', $res, $arr);
foreach ($arr[1] as $a => $web) {
    $HTTP_SESSION = _rand();
    $HTTP_SESSION;
    // Fetch the contact information page of this seller.
    $HTTP_Server = $web;
    $HTTP_URL = ".en.alibaba.com/contactinfo.html";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $HTTP_Server . $HTTP_URL);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)");
    $res1 = curl_exec($ch);
    curl_close($ch);
    // Extract each field, then strip tags and surrounding whitespace.
    preg_match("/contactName(.*?)<\/a>/s", $res1, $name);
    $name = strip_tags($name[1]);
    // $name = str_replace("\"", "", $name);
    // $name = str_replace(">", "", $name);
    $name = trim($name);
    preg_match("/Company Name:(.*?)<\/td>/s", $res1, $Cname);
    $Cname = trim(strip_tags($Cname[1]));
    preg_match("/Street Address:(.*?)<\/td>/s", $res1, $Add);
    $Add = trim(strip_tags($Add[1]));
    preg_match("/City:(.*?)<\/td>/s", $res1, $City);
    $City = trim(strip_tags($City[1]));
    preg_match("/Province\/State:(.*?)<\/td>/s", $res1, $Pronvice);
    $Pronvice = trim(strip_tags($Pronvice[1]));
    preg_match("/Country\/Region:(.*?)<\/td>/s", $res1, $Region);
    $Region = trim(strip_tags($Region[1]));
    preg_match("/Zip:(.*?)<\/td>/s", $res1, $Zip);
    $Zip = trim(strip_tags($Zip[1]));
    preg_match("/Telephone:(.*?)<\/td>/s", $res1, $Tel);
    $Tel = trim(strip_tags($Tel[1]));
    preg_match("/Mobile Phone:(.*?)<\/td>/s", $res1, $Phone);
    $Phone = trim(strip_tags($Phone[1]));
    preg_match("/Fax:(.*?)<\/td>/s", $res1, $Fax);
    $Fax = trim(strip_tags($Fax[1]));
    preg_match("/Website:(.*?)<\/td>/s", $res1, $Web);
    $Web = trim(strip_tags($Web[1]));
    // Save the seller to the database.
    $result = mysql_query("
    INSERT INTO alibaba (
    Name,
    Company,
    Address,
    City,
    Province,
    Region,
    Zip,
    Tel,
    Phone,
    Fax,
    Web
    ) VALUES (
    '" . htmlspecialchars($name) . "',
    '" . htmlspecialchars($Cname) . "',
    '" . htmlspecialchars($Add) . "',
    '" . htmlspecialchars($City) . "',
    '" . htmlspecialchars($Pronvice) . "',
    '" . htmlspecialchars($Region) . "',
    '" . htmlspecialchars($Zip) . "',
    '" . htmlspecialchars($Tel) . "',
    '" . htmlspecialchars($Phone) . "',
    '" . htmlspecialchars($Fax) . "',
    '" . htmlspecialchars($Web) . "'
    )");
}
if ($page >= 29) {
    echo "OVER!"; exit();
} else {
    echo " "; // jump to this script again with the page number ++$page
}
?>