Keep the news data of your website synchronized with Sina

Source: Internet
Author: User
Content collection (scraping) is nothing new. Many webmasters rely on it because they lack the manpower to fill their sites by hand; my personal website, www.xxfsw.com, has collected a large amount of news this way. So how is it done? Today we implement it in PHP.

Collection comes down to two things. The first is fetching the source code of a remote page, which PHP's curl extension can do. The second is extracting the information you need from that source, which is a job for regular expressions.

To enable curl on Windows:

1. Copy libeay32.dll, ssleay32.dll, php5ts.dll, and php_curl.dll from the PHP directory to the system32 directory.

2. Edit php.ini: set extension_dir and remove the semicolon in front of extension=php_curl.dll.

3. Restart Apache.

To enable curl on Linux:

Go to the source directory from which PHP was originally installed, then run:

cd ext
cd curl
phpize
./configure --with-curl=DIR
make

This generates curl.so under PHPDIR/ext/curl/modules.

Copy curl.so to the extensions directory configured in php.ini, then edit php.ini to load it.
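
Once the extension is enabled on either platform, a quick check from PHP can confirm that it actually loaded. The following is a minimal sketch; the helper name curl_is_available is made up for illustration and is not part of the original script.

```php
<?php
// Hypothetical helper: reports whether the curl extension is loaded
// and its core function is callable.
function curl_is_available(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curl_is_available()) {
    // curl_version() returns an array; 'version' holds the libcurl version string.
    $info = curl_version();
    echo "curl is enabled, libcurl " . $info['version'] . "\n";
} else {
    echo "curl is NOT enabled; check extension_dir and the extension line in php.ini\n";
}
```

If the check fails after editing php.ini, remember that the web server has to be restarted before the change takes effect.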

With curl enabled, you can fetch the source code of any URL. Here is a small wrapper function:

function getwebcontent($url) {
    $ch = curl_init();
    $timeout = 10;                                       // connection timeout in seconds
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);         // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);         // follow HTTP redirects
    $contents = trim(curl_exec($ch));
    curl_close($ch);
    return $contents;
}
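
Note that curl_exec() returns false when a request fails, and the wrapper above turns that into an empty string. If you need to tell a failed request from an empty page, a variant along these lines may help. This is only a sketch: the name fetch_or_null, the null return value, and the overall transfer timeout are assumptions added for illustration.

```php
<?php
// Hypothetical variant of getwebcontent(): returns null on any failure
// instead of an empty string, so callers can tell "empty page" from "error".
function fetch_or_null(string $url, int $timeout = 10): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout * 2);   // cap on the whole transfer (assumed value)
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $body = curl_exec($ch);                            // string on success, false on failure
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);   // 0 if no HTTP response was received
    // The handle is released when $ch goes out of scope.

    if ($body === false || $status >= 400) {
        return null;
    }
    return trim($body);
}
```

A caller can then skip a feed when fetch_or_null() returns null rather than running the regular expressions against an empty string.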

Next, regular expressions in PHP:

1. Brackets

[0-9] matches any digit from 0 to 9

[a-z] matches any lowercase letter from a to z

[A-Z] matches any uppercase letter from A to Z

[a-zA-Z] matches any uppercase or lowercase letter

You can build other ranges the same way, based on ASCII order.
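
As a quick illustration of bracket ranges, the sketch below runs a few of them through preg_match(). The sample strings are made up.

```php
<?php
// preg_match() returns 1 on a match and 0 on no match.
var_dump(preg_match('/[0-9]/', 'room 42'));      // int(1): contains a digit
var_dump(preg_match('/[a-z]/', 'ABC'));          // int(0): no lowercase letter
var_dump(preg_match('/[A-Z]/', 'abC'));          // int(1): contains an uppercase letter
var_dump(preg_match('/^[a-zA-Z]+$/', 'Sina'));   // int(1): letters only
var_dump(preg_match('/^[a-zA-Z]+$/', 'Sina2'));  // int(0): the digit breaks the match
```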

2. Quantifiers

p+ matches any string containing at least one p
p* matches any string containing zero or more p
p? matches any string containing zero or one p
p{2} matches any string containing a sequence of two p
p{2,3} matches any string containing a sequence of two or three p
p$ matches any string ending with p
^p matches any string starting with p
[^a-zA-Z] matches any string containing a character other than a-z and A-Z
p.p matches any string containing p, followed by any character, followed by p
^.{2}$ matches any string of exactly two characters
<b>(.*)</b> matches any string enclosed in <b> and </b>
p(hp)* matches any string containing p followed by zero or more occurrences of hp
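
The quantifiers and anchors above can be tried out with a few lines of PHP. This is a small sketch with made-up sample strings; the helper name matches is hypothetical.

```php
<?php
// Hypothetical helper: true when $pattern matches somewhere in $subject.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

var_dump(matches('/p+/', 'apple'));      // bool(true): at least one p
var_dump(matches('/p{2}/', 'apple'));    // bool(true): contains "pp"
var_dump(matches('/p{2}/', 'open'));     // bool(false): only a single p
var_dump(matches('/p$/', 'stop'));       // bool(true): ends with p
var_dump(matches('/^p/', 'stop'));       // bool(false): does not start with p
var_dump(matches('/p.p/', 'pop'));       // bool(true): p, any character, p
var_dump(matches('/^.{2}$/', 'ab'));     // bool(true): exactly two characters
var_dump(matches('/^.{2}$/', 'abc'));    // bool(false): three characters
var_dump(matches('/p(hp)*/', 'phphp'));  // bool(true): p followed by "hp" twice

// A capture group returns the text between two tags.
preg_match('/<b>(.*)<\/b>/', 'a <b>bold</b> word', $m);
echo $m[1], "\n";                        // bold
```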

3. Predefined character ranges

[:alpha:] same as [a-zA-Z]
[:alnum:] same as [a-zA-Z0-9]
[:cntrl:] control characters such as tab, backspace, and escape
[:digit:] same as [0-9]
[:graph:] printable characters in the ASCII range 33 to 126
[:lower:] same as [a-z]
[:punct:] punctuation characters
[:upper:] same as [A-Z]
[:space:] whitespace characters, including space, horizontal tab, line feed, form feed, and carriage return
[:xdigit:] hexadecimal digits, same as [a-fA-F0-9]
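
With the preg_* functions used in this article, these class names only work inside a bracket expression, so [:digit:] is written as [[:digit:]]. A short sketch with made-up sample strings:

```php
<?php
// POSIX class names must sit inside an outer pair of brackets in PCRE.
var_dump(preg_match('/^[[:digit:]]+$/', '2024'));      // int(1): digits only
var_dump(preg_match('/^[[:alpha:]]+$/', 'news2024'));  // int(0): digits are not letters
var_dump(preg_match('/^[[:alnum:]]+$/', 'news2024'));  // int(1): letters and digits
var_dump(preg_match('/^[[:xdigit:]]+$/', 'ff00aa'));   // int(1): valid hexadecimal

// Collapse runs of whitespace into a single space.
echo preg_replace('/[[:space:]]+/', ' ', "a \t\n b"), "\n";      // a b

// Strip punctuation.
echo preg_replace('/[[:punct:]]/', '', 'Hello, world!'), "\n";   // Hello world
```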

Enough talk; here is my source code. If anything in it is unclear, search for it online.

<?php
header("Content-type: text/html; charset=utf-8");

// One rolling-news feed per category; the second argument is the category id.
getinfo("http://rss.sina.com.cn/rollnews/news/gn_total.js", 1);
getinfo("http://rss.sina.com.cn/rollnews/news/gj_total.js", 2);
getinfo("http://rss.sina.com.cn/rollnews/news/sh_total.js", 3);
getinfo("http://rss.sina.com.cn/rollnews/sports/sports_total.js", 4);
getinfo("http://rss.sina.com.cn/rollnews/tech/tech1_total.js", 5);
getinfo("http://rss.sina.com.cn/rollnews/finance/finance1_news_total.js", 6);
getinfo("http://rss.sina.com.cn/rollnews/ent/ent_total.js", 7);
getinfo("http://rss.sina.com.cn/rollnews/jczs/jczs_total.js", 8);

// Fetch one feed, extract the titles and links, then store each article.
function getinfo($infourl, $catid)
{
    $pagecontent = getwebcontent($infourl);
    preg_match_all("/title:\"(.*?)\"/", $pagecontent, $match);
    $titlearr = $match[1];
    preg_match_all("/link:\"(.*?)\"/", $pagecontent, $match);
    $urlarr = $match[1];
    // The loop condition is missing from the source. This bound is an
    // assumption that keeps $titlearr[$i - 1] and $urlarr[$i] in range.
    $count = min(count($titlearr), count($urlarr) - 1);
    for ($i = 1; $i <= $count; $i++) {
        echo "go {$titlearr[$i - 1]}\n";
        // Sina serves GBK; convert to UTF-8 before storing.
        $title = iconv("gbk", "UTF-8", $titlearr[$i - 1]);
        $content = iconv("gbk", "UTF-8", getnewscontent($urlarr[$i]));
        // mysql_escape_string() was removed in PHP 7; escape or bind the
        // values inside insertdb() with your database driver instead.
        if (!insertdb($title, $content, $catid)) break;
    }
}

function insertdb($title, $content, $catid)
{
    // Write the data to your database here.
    // Return false to stop processing the current feed.
    return true;
}

// Fetch one article page and cut it down to the article body.
function getnewscontent($newsurl)
{
    $newscontent = getwebcontent($newsurl);
    // The HTML markers inside the patterns below are missing from the source.
    // Supply the markers that delimit the article body on the target pages.
    preg_match_all("/([\s\S]*?)/", $newscontent, $match);
    $content = preg_replace("//si", "", $match[1][0]);
    $content = preg_replace("/.*?<\/div>/si", "", $content);
    $content = preg_replace("/.*?<\/div>/si", "", $content);
    $content = str_replace("", "", $content);
    return $content;
}

// Fetch the source code of a URL with curl.
function getwebcontent($url)
{
    $ch = curl_init();
    $timeout = 10;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $contents = trim(curl_exec($ch));
    curl_close($ch);
    return $contents;
}
?>
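
To see how the title and link patterns behave without touching the network, the extraction step can be tried on a sample string. This is only a sketch: the sample text is made up in the shape the patterns expect and is not real Sina output, and the helper name extract_field is hypothetical.

```php
<?php
// Made-up sample in the shape the patterns expect; NOT real Sina output.
$sample = 'item:{title:"First headline",link:"http://example.com/a.html"},'
        . 'item:{title:"Second headline",link:"http://example.com/b.html"}';

// Hypothetical helper: returns every captured value for a key:"value" pair.
function extract_field(string $key, string $text): array
{
    // (.*?) is non-greedy, so each capture stops at the first closing quote.
    preg_match_all('/' . preg_quote($key, '/') . ':"(.*?)"/', $text, $match);
    return $match[1];
}

$titles = extract_field('title', $sample);
$links  = extract_field('link', $sample);

// Pair each title with its link.
print_r(array_combine($titles, $links));
```

The non-greedy quantifier matters here: with a greedy (.*) the first capture would run on to the last quote in the feed.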

So how do we keep the data synchronized in real time? Use Task Scheduler on Windows or crontab on Linux to run this script at a regular interval (for example, every 10 minutes), and you no longer need to worry about your website lacking content. I have also opened a studio, www.beijingjianzhan.com (Beijing site building). We developed a system that not only collects information but also automatically reprocesses it into pseudo-original content that better suits search engines, so that your website gets indexed heavily. You can also add me at q3700004340 to discuss technical topics.
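
When a script runs on a schedule, a slow run can overlap with the next one. One way to guard against that is a lock file, sketched below. The crontab line, the file paths, and the helper name acquire_run_lock are all assumptions for illustration and are not part of the original script.

```php
<?php
// Hypothetical crontab entry (paths are assumptions), running every 10 minutes:
//   */10 * * * * /usr/bin/php /path/to/collect.php

// Hypothetical helper: returns an open lock handle if this run may proceed,
// or null if the lock file cannot be opened or another run holds the lock.
function acquire_run_lock(string $lockfile)
{
    $handle = fopen($lockfile, 'c');            // create if missing, do not truncate
    if ($handle === false) {
        return null;
    }
    if (!flock($handle, LOCK_EX | LOCK_NB)) {   // non-blocking exclusive lock
        fclose($handle);
        return null;
    }
    return $handle;
}

$lock = acquire_run_lock(sys_get_temp_dir() . '/collect.lock');
if ($lock === null) {
    echo "Previous run still in progress, skipping\n";
} else {
    // Call getinfo() for each feed here.
    flock($lock, LOCK_UN);
    fclose($lock);
}
```

The lock is released automatically if the script crashes, because the operating system drops it when the process exits.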
