How to collect articles in PHP

Last Update:2013-11-24 Source: Internet

Author: User


Most data collection relies on regular expressions. Here I briefly introduce how to implement data collection in PHP. The script is generally run on a local machine; putting it on hosted web space is unwise, because it not only consumes resources but also requires support for remote fetching functions such as file_get_contents($urls) and file($url).

1. Automatic switching of article list pages and acquisition of article paths.

2. Obtaining the title and content.

3. Saving to the database

4. Problems

1. Automatic switching of article list pages and acquisition of article paths.

A. Automatic switching of list pages generally relies on dynamic pages, for example:

http://www.phpfirst.com/foru...d=1&page=$i

You can then auto-increment $i or use a range, for example $i++;

You can also, as penzi demonstrated, have the code control the range from a starting page number to an ending page number.
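The page-range idea above can be sketched as follows. The helper name listPageUrls() and the example.com address are placeholders chosen for illustration; they are not from the original article.

```php
<?php
// Hypothetical helper: builds the list-page URLs for pages $first..$last.
// $base is an assumed URL prefix that ends in "page=".
function listPageUrls($base, $first, $last)
{
    $urls = array();
    foreach (range($first, $last) as $i) {
        $urls[] = $base . $i;
    }
    return $urls;
}

// Placeholder address, not the forum URL mentioned in the text.
$pages = listPageUrls("http://example.com/list.php?fid=1&page=", 1, 3);
print_r($pages);
```

Each URL in $pages can then be passed to whatever function collects the article links from a list page.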

B. There are two ways to obtain the article paths: without a regular expression and with one.

1) Without a regular expression, simply collect every link on the article list page.

It is best to filter and process the links, though: detect duplicate links and keep only one of each, and turn relative paths (such as ../) into absolute paths.

The following is the messy function I wrote to do this:

PHP:

--------------------------------------------------------------------------------

// $e = clinchgeturl("http://phpfirst.com/forumdisplay.php?fid=1");
// var_dump($e);

// Collects the unique page links found at $url and returns them in $out[0].
// preg_* and explode() are used because eregi() and split() were removed in PHP 7.
function clinchgeturl($url)
{
    // $url = "http://127.0.0.1/1.htm";
    // Site root (http://127.0.0.1/) and directory of the list page.
    $roopath = explode("/", $url);
    $siteroot = $roopath[0] . "//" . $roopath[2] . "/";
    $rootpath = $siteroot;
    $nnn = count($roopath) - 1;
    for ($yu = 3; $yu < $nnn; $yu++) {
        $rootpath .= $roopath[$yu] . "/";
    }
    // Parent directory of the list page, used for links starting with ../
    $parent = ($rootpath == $siteroot) ? $siteroot : preg_replace('#[^/]+/$#', '', $rootpath);

    echo "$url has the following links:\n";
    $out = array(array());
    $fcontents = file($url);
    if ($fcontents === false) {
        return $out;
    }
    foreach ($fcontents as $line) {
        // Pull every href value out of the line, quoted or not.
        preg_match_all('/href\s*=\s*["\']?([^"\'\s>]+)/i', $line, $regs);
        foreach ($regs[1] as $link) {
            if (!preg_match('#^https?://#i', $link)) {
                if (preg_match('#^\.\./#', $link)) {
                    // ../xx/xxxxxx.xx : one level above the list page
                    $link = $parent . substr($link, 3);
                } elseif (preg_match('#^/#', $link)) {
                    // /xx/xxxxxx.xx : relative to the site root
                    $link = $siteroot . substr($link, 1);
                } else {
                    // xxxxxx.xx : relative to the directory of the list page
                    $link = $rootpath . $link;
                }
            }
            // Keep only links that look like pages.
            if (preg_match('/[.](htm|shtm|html|asp|aspx|php|jsp|cgi)/i', $link)) {
                $out[0][] = $link;
            }
        }
    }
    // Drop duplicate links, keeping the first occurrence of each.
    $out[0] = array_values(array_unique($out[0]));
    return $out;
}

Code like the above should really be Zend-encoded; otherwise it is an eyesore :(

Once all the unique links have been collected, they are stored in an array.

2) With a regular expression

If you want to obtain exactly the article links you need, use this method.

Following Ketle's idea, use:

PHP:

--------------------------------------------------------------------------------

function cut($file, $from, $end) {
    $message = explode($from, $file);
    $message = explode($end, $message[1]);
    return $message[0];
}

$from is the HTML code just before the list.

$end is the HTML code just after the list.

Both parameters can be submitted through a form.

Cutting away the parts of the list page that are not the list leaves only the required links,

which can then be extracted with a regular expression such as the following:

PHP:

--------------------------------------------------------------------------------

preg_match("/^(http:\/\/)?(.*)/i",
    $url, $matches);
return $matches[2];
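As a rough sketch of how the two steps might fit together, the snippet below cuts out the list region and then collects the href values inside it. The helper linksBetween(), the href pattern, and the sample HTML are assumptions made for illustration; only cut() comes from the text.

```php
<?php
// cut() as given in the text: returns the part of $file between $from and $end.
function cut($file, $from, $end)
{
    $message = explode($from, $file);
    $message = explode($end, $message[1]);
    return $message[0];
}

// Hypothetical helper: collects the quoted href values inside the cut region.
function linksBetween($html, $from, $end)
{
    $list = cut($html, $from, $end);
    preg_match_all('/href\s*=\s*["\']([^"\']+)["\']/i', $list, $matches);
    return $matches[1];
}

// Made-up sample page: navigation links surround the article list.
$html = '<a href="/home.php">Home</a><ul id="list">'
      . '<li><a href="a1.html">One</a></li><li><a href="a2.html">Two</a></li>'
      . '</ul><a href="/about.php">About</a>';
$links = linksBetween($html, '<ul id="list">', '</ul>');
print_r($links);
```

Only the two links inside the list are returned; the navigation links outside the markers are ignored. Note that cut() assumes both markers are present in the page.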

2. Obtain the title and content.

A. First, read the target page using the article path obtained above.

You can use the following functions:

PHP:

--------------------------------------------------------------------------------

// Reads the page at $url in 2048-byte chunks and returns its contents.
function getcontent($url) {
    if ($handle = fopen($url, "rb")) {
        $contents = "";
        do {
            $data = fread($handle, 2048);
            if (strlen($data) == 0) {
                break;
            }
            $contents .= $data;
        } while (true);
        fclose($handle);
    } else {
        exit("........");
    }
    return $contents;
}

Or directly

PHP:

--------------------------------------------------------------------------------

file_get_contents($urls);

The latter is more convenient, though it has drawbacks compared with the function above.
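The text does not say which drawbacks it has in mind. Assuming that missing error handling and timeouts are among them, a minimal sketch of a wrapper could look like this; fetchPage() and its default timeout are hypothetical.

```php
<?php
// Hypothetical wrapper around file_get_contents() with a timeout.
// Returns the body as a string, or null when the fetch fails.
function fetchPage($url, $timeout = 10)
{
    // The 'timeout' option (in seconds) applies to http:// URLs.
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeout),
    ));
    // @ suppresses the warning; failure is reported through the return value.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Usage (placeholder address):
// $html = fetchPage("http://example.com/article.html");
// if ($html === null) { /* skip this article or retry later */ }
```

Returning null instead of calling exit() lets a collection loop skip one unreachable page and carry on with the rest.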

B. Next, obtain the title.

This is generally done as follows:

PHP:

--------------------------------------------------------------------------------

// assumes the title sits inside <title> tags
preg_match("|<title>(.*)</title>|i", $allcontent, $title);

The pattern to match is supplied through the submitted form.

You can also use a series of cut functions.

For example, with the cut($file, $from, $end) function mentioned above, the specific string can be cut out using string-handling functions; this is described in more detail under "Obtain the content" below.
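As an illustration of the cutting approach applied to the title, here is a small sketch. cutOrNull() is a hypothetical variant of cut() that reports a missing marker instead of raising a warning, and the sample page is made up.

```php
<?php
// Hypothetical variant of cut(): returns the text between $from and $end,
// or null when either marker is missing.
function cutOrNull($file, $from, $end)
{
    $start = strpos($file, $from);
    if ($start === false) {
        return null;
    }
    $start += strlen($from);
    $stop = strpos($file, $end, $start);
    if ($stop === false) {
        return null;
    }
    return substr($file, $start, $stop - $start);
}

// Made-up sample page.
$allcontent = "<html><head><title>How to collect articles</title></head><body></body></html>";
$title = cutOrNull($allcontent, "<title>", "</title>");
echo $title, "\n";
```

Because it uses plain string functions, the markers need no regular-expression escaping.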

C. Obtain the content

The idea is the same as for the title, but the situation is more complicated because the content is rarely as simple.

1) The marker strings around the content often contain double quotation marks, spaces, and line breaks, which are a major obstacle.

Double quotation marks need to be changed to \", which can be done with addslashes().

To remove line breaks, you can use:

PHP:

--------------------------------------------------------------------------------

// preg_replace() stands in for ereg_replace(), which was removed in PHP 7
$a = preg_replace("/[\r\n]+/", "", $a);
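A minimal sketch of this clean-up step, assuming the goal is to make markers match regardless of how the page source is laid out. flatten() and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: removes line breaks and collapses runs of whitespace,
// so a marker copied from the page source matches whatever the layout.
function flatten($html)
{
    $html = str_replace(array("\r", "\n"), "", $html);
    return preg_replace('/\s+/', ' ', $html);
}

// Made-up sample: the content block is spread over several lines.
$page = "<div class=\"content\">\r\n  <p>First\r\n paragraph.</p>\r\n</div>";
$flat = flatten($page);
echo $flat, "\n";
```

The markers passed to a cutting function then need to be written in the same flattened form.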

2) Idea 2: extracting the content with a large number of cutting functions takes a lot of practice and debugging. I am still working on this and have not made any breakthroughs yet.

3. Saving to the database

A. Make sure your database accepts the insert.

For example, I can insert the following directly:

PHP:

--------------------------------------------------------------------------------

$sql = "INSERT INTO $articles VALUES ('', '$title', '$article', 'clinch', 'from', 'keyword', 1, '$columnid', '$time', 1);";

Here the empty first value, '', fills the auto-increment column.
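Collected titles and content often contain quotation marks, so one option is a prepared statement rather than building the SQL string by hand. The sketch below uses PDO with an in-memory SQLite database for demonstration; the table layout, column names, and insertArticle() are assumptions, since the text only shows positional VALUES.

```php
<?php
// Hypothetical schema: articles(id, title, content, column_id, created_at).
// The column names are assumptions made for this sketch.
function insertArticle(PDO $db, $title, $article, $columnId, $time)
{
    $stmt = $db->prepare(
        "INSERT INTO articles (title, content, column_id, created_at)
         VALUES (?, ?, ?, ?)"
    );
    // Bound values need no manual escaping.
    $stmt->execute(array($title, $article, $columnId, $time));
    return (int) $db->lastInsertId();
}

// Demonstration against an in-memory SQLite database (requires pdo_sqlite).
$db = new PDO("sqlite::memory:");
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT, content TEXT, column_id INTEGER, created_at INTEGER)");
$id = insertArticle($db, 'It\'s a "quoted" title', "Body text", 1, time());
echo $id, "\n";
```

The id column is left out of the INSERT so the database assigns it automatically, which plays the same role as the empty first value in the statement above.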

This article is an English version of an article originally published in Chinese on aliyun.com.
