Data fetching with the PHP cURL extension

Source: Internet
Author: User

PHP Version: 5.5.30

Server: Apache

Site to crawl: http://nc.mofcom.gov.cn/channel/gxdj/jghq/jg_list.shtml

Fetch target: the price data for the current day

First, the preparatory work:

1. Open the php.ini configuration file and enable the cURL extension:

extension=php_curl.dll

If the cURL functions are still unavailable after enabling the extension (remember to restart the Apache service), check that the system environment variable is set (typically PATH, which should include the PHP installation directory).
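A quick way to confirm that the extension is active is to ask PHP itself. The following is a minimal sketch, assuming it is run by the same PHP installation that Apache uses; the variable name $curlLoaded is only illustrative.

```php
<?php
// Check whether the cURL extension is loaded in this PHP installation.
$curlLoaded = extension_loaded('curl') && function_exists('curl_init');

if ($curlLoaded) {
    echo "cURL extension is enabled\n";
} else {
    // php_ini_loaded_file() reports which php.ini was actually read (false if none).
    echo "cURL extension is NOT enabled; php.ini in use: ";
    echo var_export(php_ini_loaded_file(), true) . "\n";
}
```

If the reported php.ini is not the file that was edited, the extension line was probably added to the wrong configuration file.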

Second, analyze the structure of the page to crawl

View the source code of http://nc.mofcom.gov.cn/channel/gxdj/jghq/jg_list.shtml.

All of the data sits between the <tbody> and </tbody> tags.
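Cutting out the part of a page between two markers can be sketched with substr() and strpos(), the same functions the article relies on. The helper name extractBetween and the sample markup are assumptions made for illustration; the real page source will differ.

```php
<?php
// Hypothetical helper: return the text from $open up to (not including) $close,
// or null if either marker is missing.
function extractBetween($html, $open, $close)
{
    $start = strpos($html, $open);
    if ($start === false) {
        return null;
    }
    // Search for the closing marker only after the opening one.
    $end = strpos($html, $close, $start);
    if ($end === false) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

// Made-up sample markup, standing in for the real page source.
$sample = '<table><tbody><tr><td>Apple</td><td>3.50</td></tr></tbody></table>';
$tbody  = extractBetween($sample, '<tbody>', '</tbody>');
echo $tbody, "\n";
```

Returning null when a marker is missing makes it easy for the caller to tell an empty table apart from a page that did not load as expected.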

Third, the idea

1. Fetch the current page and cut out the part between <tbody> and </tbody>. (The data consists of multiple rows: <tr><td></td></tr>)

2. Within <tbody> and </tbody>, read each row and the values of its columns to get the data.

3. The data for the current day may span the first page, the second page, and further pages. The program therefore reads page after page until it reaches a row that is not dated today, and then stops.

A total of three nested loops: page loop -> row loop -> column loop
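The three nested loops and the stop condition can be tried out without any network access. The sketch below is not the article's implementation: it swaps in preg_match_all() for the row and column extraction, and it runs over made-up in-memory pages with a fixed date, all of which are assumptions for illustration.

```php
<?php
// Made-up pages standing in for fetched HTML; dates and values are illustrative only.
$today = '2016-01-05'; // a live crawl would use date("Y-m-d") here
$pages = [
    1 => '<tbody><tr><td>Apple</td><td>3.50</td><td>Market A</td><td>2016-01-05</td><td>x</td></tr></tbody>',
    2 => '<tbody><tr><td>Pear</td><td>2.10</td><td>Market B</td><td>2016-01-05</td></tr>'
       . '<tr><td>Plum</td><td>4.00</td><td>Market C</td><td>2016-01-04</td></tr></tbody>',
    3 => '<tbody><tr><td>Kiwi</td><td>5.00</td><td>Market D</td><td>2016-01-03</td></tr></tbody>',
];

$resultSet = [];
$stoppedAt = null;

foreach ($pages as $pageNo => $html) {                  // 1. page loop
    preg_match_all('#<tr>(.*?)</tr>#s', $html, $rows);
    foreach ($rows[1] as $rowHtml) {                    // 2. row loop
        if (strpos($rowHtml, $today) === false) {
            $stoppedAt = $pageNo;                       // first row not dated today
            break 2;                                    // leave the row and page loops
        }
        preg_match_all('#<td>(.*?)</td>#s', $rowHtml, $cells);
        $row = [];
        foreach ($cells[1] as $j => $cell) {            // 3. column loop
            if ($j >= 4) {
                break;                                  // only the first four columns
            }
            $row[] = trim(strip_tags($cell));
        }
        $resultSet[] = $row;
    }
}

print_r($resultSet);
echo "Stopped at page ", var_export($stoppedAt, true), "\n";
```

With this sample data the loop collects the rows from page 1 and the first row of page 2, then stops on page 2 without ever reading page 3.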

Fourth, the implementation of the code

<?php
// Define the page address to crawl

$url = 'http://nc.mofcom.gov.cn/channel/gxdj/jghq/jg_list.shtml'; // "?page=N" is appended in the page loop

// Initialize a cURL session
$curl = curl_init();
// Do not include the response headers in the output
curl_setopt($curl, CURLOPT_HEADER, 0);
// Return the result of curl_exec() as a string instead of printing it to the screen
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);

// Page loop execution flag. "Start": keep going. "End": stop (a row was found that does not contain today's date).
$flag = "Start";

// Start the page loop
$i = 1; // $i is the page number and controls page flipping
$myResultSet = []; // array that stores the data for all rows
while ($i > 0) // Loop stop condition: a row is found that does not contain today's date
{
    if ($flag == "End") {
        // Stop the page loop
        break;
    }
    // Set the page address to crawl, including the page number
    curl_setopt($curl, CURLOPT_URL, $url . "?page=" . $i);

    // Run cURL to request the web page
    $data = curl_exec($curl);
    // If the request failed or the page has no <tbody>, end the crawl; otherwise the page loop would never stop
    if ($data === false || strpos($data, "<tbody>") === false) {
        $flag = "End";
        $i++;
        continue;
    }

    // Use substr() and strpos() to get the data between <tbody> and </tbody>
    $mycontent = substr($data, strpos($data, "<tbody>"), strpos($data, "</tbody>") - strpos($data, "<tbody>"));
    
    // Row loop: read each <tr> of the table
    while (strpos($mycontent, "<tr>") !== false) { // while another row exists
        $myResultTr = []; // local row array; redefined for each row so the previous row's data is cleared
        // Get the row
        $mytr = substr($mycontent, strpos($mycontent, "<tr>"), strpos($mycontent, "</tr>") - strpos($mycontent, "<tr>"));
        // If the row is not for the current date, set the termination flag and exit the row loop
        if (strpos($mytr, date("Y-m-d")) === false) { // this row does not contain today's date
            $flag = "End"; // set the flag to "End"
            break;
        }
        // Column loop: get the first four columns
        $j = 0; // column counter
        // && strpos($mytr, date("Y-m-d")) !== false
        while (strpos($mytr, "<td>") !== false) {
            if ($j >= 4) { // data after the 4th column is not required
                break;
            }
            // Get the column
            $mytd = substr($mytr, strpos($mytr, "<td>"), strpos($mytr, "</td>") - strpos($mytr, "<td>"));

            // Characters to strip from the value: spaces, tabs, newlines, carriage returns
            $pre = array(" ", "\t", "\n", "\r");
            // str_replace() removes those characters from the value; strip_tags() removes the HTML tags
            // echo str_replace($pre, "", strip_tags($mytd));

            // Push the column value into the row result
            array_push($myResultTr, str_replace($pre, "", strip_tags($mytd)));

            // Cut off the processed column and keep the rest of the row
            $mytr = substr($mytr, strpos($mytr, "</td>") + 5);
            $j++;
        }
         
        // Push the row result into the total result set
        array_push($myResultSet, $myResultTr);

        // Update $mycontent: cut off the first row and keep the remaining data
        $mycontent = substr($mycontent, strpos($mycontent, "</tr>") + 5);
    }

    $i++;
}
var_dump($myResultSet); // print all the crawled data
echo "Application stop in Page " . --$i;
// Close the cURL session
curl_close($curl);
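The request step above does not check whether the transfer succeeded. One possible way to make it more defensive is sketched below; the function name fetchPage and the timeout values are assumptions, not part of the original code, while curl_error(), curl_getinfo() and the CURLOPT_* constants are standard cURL extension calls.

```php
<?php
// Hypothetical wrapper around the request step.
// Returns the page body as a string, or null if the request failed.
function fetchPage($curl, $url, $page)
{
    curl_setopt($curl, CURLOPT_URL, $url . '?page=' . $page);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10); // assumed limit, in seconds
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);        // assumed limit, in seconds

    $data = curl_exec($curl);
    if ($data === false) {
        // curl_error() describes the transport-level failure (DNS, timeout, ...).
        error_log('cURL error on page ' . $page . ': ' . curl_error($curl));
        return null;
    }

    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($status !== 200) {
        error_log('Unexpected HTTP status ' . $status . ' on page ' . $page);
        return null;
    }
    return $data;
}
```

In the page loop, a null return could be treated like the "End" flag, so that a failed request stops the crawl instead of being parsed as an empty page.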

Fifth, the test results:

127 rows of data were crawled, and the program stopped at page 10.
