Collecting Alibaba merchant information with 100 lines of PHP code

Source: Internet
Author: User

-- Authored by Li JiaYou

Alibaba merchant information collection instructions

1. How to obtain the link to the merchant list page

http://www.alibaba.com/corporations/jiangmen/CN-----------.html

For example, this page lists all Jiangmen merchants on Alibaba, and its pager shows Page: 1/29. The link to the second page is:

http://www.alibaba.com/corporations/jiangmen/CN-----------/2.html?tracelog=24581_list_turnpage

Notice that the path now ends in 2.html.

Remove the ? and the parameters after it, then change the number to 3, 4, 5, and so on:

http://www.alibaba.com/corporations/jiangmen/CN-----------/2.html

http://www.alibaba.com/corporations/jiangmen/CN-----------/3.html

http://www.alibaba.com/corporations/jiangmen/CN-----------/5.html

So the general form of the list page link is:

"http://www.alibaba.com/corporations/jiangmen/CN-----------/" . $page . ".html"
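The two steps described above (dropping the tracking parameter and substituting the page number) can be sketched as a pair of small helpers. The names `LIST_BASE`, `stripQuery` and `listPageUrl` are made up for this illustration, and the URL pattern is only the one observed above, so it may not hold for other categories.

```php
<?php
// Base of the list URL observed above; the page number and ".html" are appended.
const LIST_BASE = 'http://www.alibaba.com/corporations/jiangmen/CN-----------/';

// Drop the "?" and everything after it (for example the tracelog parameter).
function stripQuery(string $url): string
{
    $pos = strpos($url, '?');
    return $pos === false ? $url : substr($url, 0, $pos);
}

// Build the link for list page number $page.
function listPageUrl(int $page): string
{
    return LIST_BASE . $page . '.html';
}

echo stripQuery(LIST_BASE . '2.html?tracelog=24581_list_turnpage'), "\n";
echo listPageUrl(3), "\n";
```

Keeping the base in one constant means only a single line changes if another city's list is collected.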

2. Retrieve all page content from the list page

Because Alibaba has anti-scraping measures, we disguise the HTTP request as coming from Internet Explorer.

<?php
// Random session string (see _rand() in the full listing); it is not sent with the request.
$HTTP_SESSION = _rand();
// List page to fetch; $page is the current list page number.
$HTTP_URL = "http://www.alibaba.com/corporations/jiangmen/CN-----------/" . $page . ".html";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $HTTP_URL);
// Return the response body instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Present the request as Internet Explorer 6.
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)");
$res = curl_exec($ch);
curl_close($ch);
?>

In this way, the content of the list page is assigned to $res.
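Since the same request is made again later for each merchant, the fetch can be wrapped in a helper. This is only a sketch: the function name `fetchPage`, the timeout and the error logging are additions for illustration and are not part of the original script.

```php
<?php
// Fetch a URL with an Internet Explorer User-Agent.
// Returns the response body, or null if the request failed.
function fetchPage(string $url, int $timeoutSeconds = 30): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);  // do not hang forever on a slow page
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)');
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('fetch failed: ' . curl_error($ch));
    }
    curl_close($ch);
    return $body === false ? null : $body;
}
```

With such a helper, the list page would be read with `$res = fetchPage($HTTP_URL);`, and a null result signals that the page should be skipped or retried.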

3. How to obtain a specific merchant link from the list page

Take the first page as an example:

http://www.alibaba.com/corporations/jiangmen/CN-----------/1.html

View the page source and you will find that every merchant name, such as

Jiangmen Ronda Battery Co., Ltd.

is a link to an address of this form:

http://{CompanyName}.en.alibaba.com

Use a regular expression to extract every {CompanyName} from $res:

preg_match_all('/href\s*=\s*["\']?([^\s"\'>]*)\.en\.alibaba\.com"/i', $res, $arr);

$arr[1] now holds the captured link prefix (for example http://rondabattery) of every merchant on the first list page.
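To see what the match returns, the expression can be tried on a small piece of markup. The HTML below is a hypothetical fragment written for this illustration, not copied from Alibaba, and the deduplication step is an addition that assumes a merchant may be linked more than once on a page.

```php
<?php
// Hypothetical fragment of a list page, for illustration only.
$res = '<a href="http://rondabattery.en.alibaba.com">Jiangmen Ronda Battery Co., Ltd.</a>'
     . '<a href="http://rondabattery.en.alibaba.com">Contact</a>'
     . '<a href="http://www.alibaba.com/help">Help</a>';

// Capture everything between href=" and .en.alibaba.com"
preg_match_all('/href\s*=\s*["\']?([^\s"\'>]*)\.en\.alibaba\.com"/i', $res, $arr);

// $arr[0] holds the full matches, $arr[1] the captured prefixes.
// Drop duplicates and reindex so each merchant is visited once.
$merchants = array_values(array_unique($arr[1]));

print_r($merchants);
```

Note that the captured value includes the http:// scheme, which is why the later steps can append the rest of the address directly to it.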

4. How to collect merchant Information

First, loop over the matches to get each merchant's link in turn.

<?php
foreach ($arr[1] as $a => $web) {
    // the per-merchant steps described below go here
}
?>

Append .en.alibaba.com to $web to build the merchant link.

For example, http://rondabattery.en.alibaba.com/

Browsing the site shows that each company's contact information is at /contactinfo.html, for example http://rondabattery.en.alibaba.com/contactinfo.html

Again disguised as IE, fetch the contact information page of a single merchant.

<?php
$HTTP_SESSION = _rand();
// $web is the prefix captured from the list page, e.g. http://rondabattery
$HTTP_Server = $web;
$HTTP_URL = ".en.alibaba.com/contactinfo.html";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $HTTP_Server . $HTTP_URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)");
$res1 = curl_exec($ch);
curl_close($ch);
?>

In this way, $res1 contains the content of the rondabattery company contact information page.
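The address concatenation used here can be isolated in a one-line helper, which makes it easy to check what is actually requested. The function name `contactUrl` is hypothetical, and the sketch assumes $web already carries the http:// scheme, as the list-page regular expression captures it.

```php
<?php
// $web is a prefix captured from the list page, e.g. "http://rondabattery".
function contactUrl(string $web): string
{
    return $web . '.en.alibaba.com/contactinfo.html';
}

echo contactUrl('http://rondabattery'), "\n";
```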

5. How to collect detailed information such as contacts and telephones

http://rondabattery.en.alibaba.com/contactinfo.html

View the page source and you will find:

Company Name:

Jiangmen Ronda Battery Co., Ltd.

The company name and the other fields all follow this format, so a regular expression can match them:

preg_match("/Company Name:(.*?)<\/td>/s", $res1, $Cname);

$Cname now holds the matched content, still wrapped in its HTML tags:

Company Name:

Jiangmen Ronda Battery Co., Ltd.

That is clearly not what we need, so strip the tags and the surrounding whitespace:

$Cname = trim(strip_tags($Cname[1]));

trim removes whitespace from the beginning and end of a string, and strip_tags removes the HTML tags.
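Because every field is extracted the same way, the match, strip_tags and trim steps can be folded into one function. This is a sketch under assumptions: the name `extractField` is invented here, and the sample markup is hypothetical, since the real contact page layout is not reproduced in this article.

```php
<?php
// Return the text that follows "$label:" up to the next </td>, or '' if the label is absent.
function extractField(string $html, string $label): string
{
    // preg_quote escapes characters such as "/" in labels like "Province/State".
    $pattern = '/' . preg_quote($label, '/') . '\s*:(.*?)<\/td>/s';
    if (!preg_match($pattern, $html, $m)) {
        return '';
    }
    return trim(strip_tags($m[1]));
}

// Hypothetical markup; the real contact page may be laid out differently.
$res1 = "<tr><th>Company Name:</th>\n<td> Jiangmen Ronda Battery Co., Ltd. </td></tr>"
      . "<tr><th>Province/State:</th><td>Guangdong</td></tr>";

echo extractField($res1, 'Company Name'), "\n";
echo extractField($res1, 'Province/State'), "\n";
```

Returning an empty string for a missing field avoids reading index 1 of an empty match array, which is what happens in the plain preg_match version when a merchant leaves a field blank.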

Tip:

Content such as the company introduction sometimes contains double quotation marks ("). Replace them before saving, otherwise the SQL statement can fail when writing to the database.

$name = str_replace("\"", "", $name);

6. Store the obtained information to the database

<?php
mysql_pconnect("localhost", "root", "password") or
    die("could not connect" . mysql_error());
mysql_select_db("company");
mysql_query("set names 'utf8'");
$result = mysql_query("
    INSERT INTO alibaba (
        Name, Company, Address, City, Province, Region, Zip, Tel, Phone, Fax, Web
    ) VALUES (
        '" . htmlspecialchars($name) . "',
        '" . htmlspecialchars($Cname) . "',
        '" . htmlspecialchars($Add) . "',
        '" . htmlspecialchars($City) . "',
        '" . htmlspecialchars($Pronvice) . "',
        '" . htmlspecialchars($Region) . "',
        '" . htmlspecialchars($Zip) . "',
        '" . htmlspecialchars($Tel) . "',
        '" . htmlspecialchars($Phone) . "',
        '" . htmlspecialchars($Fax) . "',
        '" . htmlspecialchars($Web) . "'
    )");
?>

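htmlspecialchars escapes HTML special characters in the content. It is not an SQL escaping function, though: with its default flags before PHP 8.1 it converts double quotation marks to &quot; but leaves single quotation marks untouched, so quotation marks in the content still need to be handled separately, as mentioned in the tip above.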

Modify the account, password, database name, table name, and field definitions as needed.
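The mysql_* functions used in this article were removed in PHP 7.0, so on a current PHP the storage step needs another extension. The following is a minimal sketch using PDO with a prepared statement, not the author's method. The function name `saveMerchant` and the sample row are invented, and the demo runs against an in-memory SQLite database purely so that it works without a server; for MySQL the DSN would instead look like `mysql:host=localhost;dbname=company;charset=utf8`.

```php
<?php
// Insert one merchant row. Placeholders keep quotation marks in the
// scraped text from breaking the SQL statement.
function saveMerchant(PDO $db, array $m): void
{
    $stmt = $db->prepare(
        'INSERT INTO alibaba
           (Name, Company, Address, City, Province, Region, Zip, Tel, Phone, Fax, Web)
         VALUES
           (:Name, :Company, :Address, :City, :Province, :Region, :Zip, :Tel, :Phone, :Fax, :Web)'
    );
    $stmt->execute($m);
}

// Demo database (assumption: SQLite in memory, same column names as the article).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE alibaba (Name TEXT, Company TEXT, Address TEXT, City TEXT,
           Province TEXT, Region TEXT, Zip TEXT, Tel TEXT, Phone TEXT, Fax TEXT, Web TEXT)');

// Hypothetical row containing both kinds of quotation marks.
saveMerchant($db, [
    'Name' => 'Example "Quoted" Contact',
    'Company' => "O'Example Battery Co., Ltd.",
    'Address' => '', 'City' => 'Jiangmen', 'Province' => 'Guangdong',
    'Region' => 'China', 'Zip' => '', 'Tel' => '', 'Phone' => '', 'Fax' => '',
    'Web' => 'http://example.en.alibaba.com',
]);

$stored = $db->query('SELECT Company FROM alibaba')->fetchColumn();
echo $stored, "\n";
```

With bound parameters the values are stored exactly as scraped, so neither the str_replace on quotation marks nor htmlspecialchars is needed at insert time; HTML escaping can be applied later, when the data is displayed.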

7. How to process each list page

The loop above only collects the 20 merchants on the first list page, but there are 29 list pages to process.

So, at the end of the script, a jump to the same script with a page parameter is made.

<?php
if ($page >= 29) {
    echo "OVER !"; exit();
} else {
    echo "";  // output the jump to the next page (page number + 1) here
}
?>

Obtain the page number of the list to be processed at the beginning of the PHP file.

<?php
// Use the page parameter when present, otherwise start from page 1.
if (!empty($_GET['page'])) {
    $page = (int) $_GET['page'];
} else {
    $page = 1;
}
?>

In this way, the PHP script first obtains the number of the list page to process. If the page parameter is not set, it starts from the first page.

It then builds the link mentioned at the beginning of the article:

$HTTP_URL = "http://www.alibaba.com/corporations/jiangmen/CN-----------/" . $page . ".html";

At the end of the program, it checks the number of the page just processed. If the last page has been reached, it prints OVER ! and terminates.

Otherwise, it adds 1 to $page and jumps to the same program to continue.
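The paging logic described above can be sketched as two small functions plus the final jump. The names `currentPage` and `nextPageLink` are invented for this example, and the meta refresh is only one assumed way to perform the jump; the article does not show which markup the original script printed.

```php
<?php
// Read the page number from a query array such as $_GET; fall back to 1
// when it is missing, not a number, or smaller than 1.
function currentPage(array $query): int
{
    $raw = $query['page'] ?? '1';
    return filter_var($raw, FILTER_VALIDATE_INT, ['options' => ['default' => 1, 'min_range' => 1]]);
}

// Query string for the next run, or null once the last list page is done.
function nextPageLink(int $page, int $lastPage): ?string
{
    return $page >= $lastPage ? null : '?page=' . ($page + 1);
}

$page = currentPage($_GET);
// ... collect the merchants of list page $page here ...
$next = nextPageLink($page, 29);
if ($next === null) {
    echo "OVER !";
} else {
    // Assumed mechanism: reload this script with the next page number.
    echo '<meta http-equiv="refresh" content="0;url=' . $next . '">';
}
```

Processing one list page per request keeps each run short, which is presumably why the script jumps to itself instead of looping over all 29 pages in one execution.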

8. PHP code

Alibaba collection

<?php
set_time_limit(0);

// Return a random 26-character string of digits and lowercase letters.
function _rand() {
    $length = 26;
    $chars = "0123456789abcdefghijklmnopqrstuvwxyz";
    $max = strlen($chars) - 1;
    mt_srand((int) ((double) microtime() * 1000000));
    $string = '';
    for ($i = 0; $i < $length; $i++) {
        $string .= $chars[mt_rand(0, $max)];
    }
    return $string;
}

error_reporting(0);
ini_set('html_errors', false);
ini_set('display_errors', false);

mysql_pconnect("localhost", "root", "password") or
    die("could not connect" . mysql_error());
mysql_select_db("company");
mysql_query("set names 'utf8'");

// Use the page parameter when present, otherwise start from page 1.
if (!empty($_GET['page'])) {
    $page = (int) $_GET['page'];
} else {
    $page = 1;
}

// Fetch the list page, disguised as Internet Explorer.
$HTTP_SESSION = _rand();
$HTTP_URL = "http://www.alibaba.com/corporations/jiangmen/CN-----------/" . $page . ".html";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $HTTP_URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)");
$res = curl_exec($ch);
curl_close($ch);

// $arr[1] receives the link prefix of every merchant on this list page.
preg_match_all('/href\s*=\s*["\']?([^\s"\'>]*)\.en\.alibaba\.com"/i', $res, $arr);

foreach ($arr[1] as $a => $web) {
    // Fetch the contact information page of one merchant.
    $HTTP_SESSION = _rand();
    $HTTP_Server = $web;
    $HTTP_URL = ".en.alibaba.com/contactinfo.html";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $HTTP_Server . $HTTP_URL);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)");
    $res1 = curl_exec($ch);
    curl_close($ch);

    // Extract each field, then strip the HTML tags and surrounding whitespace.
    preg_match("/contactName(.*?)<\/a>/s", $res1, $name);
    $name = strip_tags($name[1]);
    //$name = str_replace("\"", "", $name);
    //$name = str_replace(">", "", $name);
    $name = trim($name);

    preg_match("/Company Name:(.*?)<\/td>/s", $res1, $Cname);
    $Cname = trim(strip_tags($Cname[1]));

    preg_match("/Street Address:(.*?)<\/td>/s", $res1, $Add);
    $Add = trim(strip_tags($Add[1]));

    preg_match("/City:(.*?)<\/td>/s", $res1, $City);
    $City = trim(strip_tags($City[1]));

    preg_match("/Province\/State:(.*?)<\/td>/s", $res1, $Pronvice);
    $Pronvice = trim(strip_tags($Pronvice[1]));

    preg_match("/Country\/Region:(.*?)<\/td>/s", $res1, $Region);
    $Region = trim(strip_tags($Region[1]));

    preg_match("/Zip:(.*?)<\/td>/s", $res1, $Zip);
    $Zip = trim(strip_tags($Zip[1]));

    preg_match("/Telephone:(.*?)<\/td>/s", $res1, $Tel);
    $Tel = trim(strip_tags($Tel[1]));

    preg_match("/Mobile Phone:(.*?)<\/td>/s", $res1, $Phone);
    $Phone = trim(strip_tags($Phone[1]));

    preg_match("/Fax:(.*?)<\/td>/s", $res1, $Fax);
    $Fax = trim(strip_tags($Fax[1]));

    preg_match("/Website:(.*?)<\/td>/s", $res1, $Web);
    $Web = trim(strip_tags($Web[1]));

    // Store the merchant in the database.
    $result = mysql_query("
        INSERT INTO alibaba (
            Name,
            Company,
            Address,
            City,
            Province,
            Region,
            Zip,
            Tel,
            Phone,
            Fax,
            Web
        ) VALUES (
            '" . htmlspecialchars($name) . "',
            '" . htmlspecialchars($Cname) . "',
            '" . htmlspecialchars($Add) . "',
            '" . htmlspecialchars($City) . "',
            '" . htmlspecialchars($Pronvice) . "',
            '" . htmlspecialchars($Region) . "',
            '" . htmlspecialchars($Zip) . "',
            '" . htmlspecialchars($Tel) . "',
            '" . htmlspecialchars($Phone) . "',
            '" . htmlspecialchars($Fax) . "',
            '" . htmlspecialchars($Web) . "'
        )");
}

// Stop after the last list page, otherwise jump to the next one.
if ($page >= 29) {
    echo "OVER !"; exit();
} else {
    echo "";  // output the jump to the next page (page number + 1) here
}
?>
