Collection (scraping content from other sites) is no longer a new term. Many webmasters who are short on manpower rely on it to fill their websites; for example, my personal website www.xxfsw.com has collected a large amount of news this way. So how is it done? Today we will implement this function in PHP.
Collection comes down to two things. The first is obtaining the source code of a remote web page, which can be done with PHP's curl extension. The second is matching the information you need inside that source, and the tool for that is the regular expression.
To enable curl in Windows, follow these steps:
1. Copy the libeay32.dll, ssleay32.dll, php5ts.dll, and php_curl.dll files from the PHP directory to the system32 directory.
2. Modify php.ini: configure extension_dir and remove the semicolon before extension=php_curl.dll.
3. Restart Apache.
To enable curl in Linux, follow these steps:
Go to the source code directory from which PHP was originally installed, then run:
cd ext
cd curl
phpize
./configure --with-curl=DIR
make
The curl.so file is generated under PHPDIR/ext/curl/modules.
Copy the curl.so file to the directory configured as extension_dir and modify php.ini to load it (extension=curl.so).
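Before going further it may help to confirm that the extension is actually loaded. The following is a minimal sketch; the helper name curl_is_enabled is made up for this example, while extension_loaded() and function_exists() are standard PHP functions.

```php
<?php
// Hypothetical helper: true when the curl extension is available to this PHP build.
function curl_is_enabled()
{
    return extension_loaded('curl') && function_exists('curl_init');
}

echo curl_is_enabled()
    ? "curl is enabled\n"
    : "curl is NOT enabled; check extension_dir and php.ini\n";
```

Run it from the command line and through the web server, since the two can read different php.ini files.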
Then you can use curl to obtain the source code of the web page at a given URL. Here is a wrapper function:
function getwebcontent($url) {
    $ch = curl_init();
    $timeout = 10;
    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the page as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Give up if no connection is made within $timeout seconds.
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    // Follow HTTP redirects.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $contents = trim(curl_exec($ch));
    curl_close($ch);
    return $contents;
}
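Note that curl_exec() returns false when a request fails, so the wrapper above turns every failure into an empty string. If you want to tell failures apart, something like the following sketch could be used. The function name fetch_page_or_null and the total-transfer timeout of three times the connect timeout are assumptions for this example, not part of the original program.

```php
<?php
// Hypothetical variant of getwebcontent() that reports failures.
// Returns the page body as a string, or null if the request failed.
function fetch_page_or_null($url, $timeout = 10)
{
    if (!function_exists('curl_init')) {
        return null; // curl extension is not loaded
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    // Cap on the whole transfer; the factor of 3 is an assumed value.
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout * 3);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $body = curl_exec($ch);
    if ($body === false) {
        // Log the reason so a failed fetch is not mistaken for an empty page.
        error_log('curl error ' . curl_errno($ch) . ': ' . curl_error($ch));
        return null;
    }
    // The handle is released when $ch goes out of scope.
    return trim($body);
}
```

A caller can then skip an article when the result is null instead of inserting empty content into the database.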
Next, let's look at regular expressions in PHP:
1. Brackets
[0-9] matches the digits 0-9
[a-z] matches the lowercase letters a-z
[A-Z] matches the uppercase letters A-Z
[a-zA-Z] matches all uppercase and lowercase letters
You can build further ranges in the same way, following ASCII order.
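As a small illustration of the bracket ranges, here is a sketch using preg_match(), which returns 1 on a match and 0 otherwise. The helper name is_lowercase_word is made up for this example.

```php
<?php
// Hypothetical helper: true when $s consists only of lowercase a-z.
function is_lowercase_word($s)
{
    return preg_match('/^[a-z]+$/', $s) === 1;
}

var_dump(is_lowercase_word('hello'));           // bool(true)
var_dump(is_lowercase_word('Hello'));           // bool(false): uppercase H
var_dump(preg_match('/[0-9]/', 'room 42'));     // int(1): a digit is present
var_dump(preg_match('/^[a-zA-Z]+$/', 'Hello')); // int(1)
// Pitfall: [A-z] also covers the ASCII characters between Z and a: [ \ ] ^ _ `
var_dump(preg_match('/^[A-z]$/', '_'));         // int(1)
```

The last line shows why [a-zA-Z] is the safer way to write "any letter": ranges follow ASCII order, so [A-z] quietly accepts a few punctuation characters as well.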
2. Quantifiers
p+ matches any string containing at least one p
p* matches any string containing zero or more p
p? matches any string containing zero or one p
p{2} matches any string containing a sequence of two p
p{2,3} matches any string containing a sequence of two or three p
p$ matches any string ending with p
^p matches any string starting with p
[^a-zA-Z] matches any string containing a character that is not in a-z or A-Z
p.p matches any string containing p, followed by any character, followed by p
^.{2}$ matches any string of exactly two characters
<b>(.*)</b> matches any string surrounded by <b> and </b>
p(hp)* matches any string containing p followed by zero or more hp
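The quantifiers can be tried out directly. The sketch below also shows the difference between the greedy (.*) and the lazy (.*?), since the lazy form is the one the collection program relies on. The helper name first_bold_text and the sample HTML are made up for this example.

```php
<?php
// Hypothetical helper: text of the first <b>...</b> element, or null if none.
function first_bold_text($html, $lazy = true)
{
    $pattern = $lazy ? '/<b>(.*?)<\/b>/' : '/<b>(.*)<\/b>/';
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

$html = '<b>one</b> and <b>two</b>';
var_dump(first_bold_text($html, false)); // greedy: "one</b> and <b>two"
var_dump(first_bold_text($html, true));  // lazy:   "one"

var_dump(preg_match('/^p{2,3}$/', 'pp'));    // int(1)
var_dump(preg_match('/^p{2,3}$/', 'pppp'));  // int(0): four p is too many
var_dump(preg_match('/^p(hp)*$/', 'phphp')); // int(1): p followed by hp twice
var_dump(preg_match('/^.{2}$/', 'abc'));     // int(0): three characters
```

Greedy matching runs to the last closing tag in the string, which is rarely what you want when a page contains many items; the lazy form stops at the first one.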
3. Predefined character ranges
[:alpha:] same as [a-zA-Z]
[:alnum:] same as [a-zA-Z0-9]
[:cntrl:] control characters, such as tab, backspace, and escape
[:digit:] same as [0-9]
[:graph:] all printable characters in the ASCII range 33 to 126
[:lower:] same as [a-z]
[:punct:] punctuation marks
[:upper:] same as [A-Z]
[:space:] blank characters, including spaces, horizontal tabs, line breaks, page breaks, and carriage returns
[:xdigit:] hexadecimal digits, same as [a-fA-F0-9]
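In the preg_* functions these predefined ranges are only recognized inside a bracket expression, so they are written with double brackets, for example [[:alpha:]]. A short sketch follows; the helper names normalize_spaces and is_hex are made up for this example.

```php
<?php
// Hypothetical helper: collapse every run of whitespace into one space.
function normalize_spaces($s)
{
    return preg_replace('/[[:space:]]+/', ' ', $s);
}

// Hypothetical helper: true when $s is a non-empty hexadecimal string.
function is_hex($s)
{
    return preg_match('/^[[:xdigit:]]+$/', $s) === 1;
}

var_dump(normalize_spaces("a \t\n b"));                       // "a b"
var_dump(is_hex('ff00AA'));                                   // bool(true)
var_dump(is_hex('xyz'));                                      // bool(false)
var_dump(preg_replace('/[[:punct:]]/', '', 'Hello, world!')); // "Hello world"
```

Collapsing whitespace this way is handy for cleaning collected article text before it is stored.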
Without further ado, here is my source code. If something is unclear, read through it a few more times.
<?php
header("Content-type: text/html; charset=utf-8");

// Each feed is imported into the category ID given as the second argument.
getinfo("http://rss.sina.com.cn/rollnews/news/gn_total.js", 1);
getinfo("http://rss.sina.com.cn/rollnews/news/gj_total.js", 2);
getinfo("http://rss.sina.com.cn/rollnews/news/sh_total.js", 3);
getinfo("http://rss.sina.com.cn/rollnews/sports/sports_total.js", 4);
getinfo("http://rss.sina.com.cn/rollnews/tech/tech1_total.js", 5);
getinfo("http://rss.sina.com.cn/rollnews/finance/finance1_news_total.js", 6);
getinfo("http://rss.sina.com.cn/rollnews/ent/ent_total.js", 7);
getinfo("http://rss.sina.com.cn/rollnews/jczs/jczs_total.js", 8);

// Fetch one feed, extract the titles and links, and store every article.
function getinfo($infourl, $catid) {
    $pagecontent = getwebcontent($infourl);
    preg_match_all("/title:\"(.*?)\"/", $pagecontent, $match);
    $titlearr = $match[1];
    preg_match_all("/link:\"(.*?)\"/", $pagecontent, $match);
    $urlarr = $match[1];
    // The loop condition was lost from the published listing;
    // stopping at the end of $urlarr is an assumption.
    for ($i = 1; $i < count($urlarr); $i++) {
        echo "go {$titlearr[$i-1]}\n";
        // The feed is GBK-encoded; convert to UTF-8 before storing.
        $title = iconv("gbk", "UTF-8", $titlearr[$i-1]);
        $content = iconv("gbk", "UTF-8", getnewscontent($urlarr[$i]));
        // mysql_escape_string() exists only in PHP 5; on PHP 7 and later,
        // use prepared statements inside insertdb() instead.
        $content = mysql_escape_string($content);
        if (!insertdb($title, $content, $catid)) break;
    }
}

function insertdb($title, $content, $catid) {
    // Write the data to your database here.
    // Return false on failure so that getinfo() stops importing.
    return true;
}

// Fetch one article page and cut out the article body.
function getnewscontent($newsurl) {
    $newscontent = getwebcontent($newsurl);
    // NOTE: the HTML markers inside the patterns below were stripped when
    // this listing was published. Fill in the markers that delimit the
    // article body and the blocks to remove on the pages you collect from.
    preg_match_all("/([\s\S]*?)/", $newscontent, $match);
    $content = preg_replace("//si", "", $match[1][0]);
    $content = preg_replace("/.*?<\/div>/si", "", $content);
    $content = preg_replace("/.*?<\/div>/si", "", $content);
    $content = str_replace("", "", $content);
    return $content;
}

// Download the source code of the page at $url.
function getwebcontent($url) {
    $ch = curl_init();
    $timeout = 10;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $contents = trim(curl_exec($ch));
    curl_close($ch);
    return $contents;
}
?>
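To see how the title and link patterns behave without touching the network, they can be run against a small string. The sample below is invented for illustration and only imitates the title:"...",link:"..." shape those patterns look for; it is not the real feed. The function name extract_items is also made up, and pairing titles and links by the same index is an assumption that fits this sample, whereas the program above offsets the two arrays by one.

```php
<?php
// Invented sample in the shape the patterns expect; not real feed data.
$sample = 'item:[{title:"First story",link:"http://example.com/1.html"},'
        . '{title:"Second story",link:"http://example.com/2.html"}]';

// Hypothetical helper: pair up the captured titles and links.
function extract_items($js)
{
    preg_match_all('/title:"(.*?)"/', $js, $t);
    preg_match_all('/link:"(.*?)"/', $js, $l);
    $items = array();
    // Stop at the shorter list so a missing link cannot cause an undefined index.
    $n = min(count($t[1]), count($l[1]));
    for ($i = 0; $i < $n; $i++) {
        $items[] = array('title' => $t[1][$i], 'link' => $l[1][$i]);
    }
    return $items;
}

foreach (extract_items($sample) as $item) {
    echo $item['title'] . ' => ' . $item['link'] . "\n";
}
```

Testing patterns against a saved copy of a page in this way makes it easier to adjust them when the source site changes its markup.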
Then how do we keep the content synchronized in near real time? Use Scheduled Tasks in Windows or crontab in Linux to run this program at regular intervals (for example, every 10 minutes), and you will no longer have to worry about a lack of content on your website. I have also opened a studio, www.beijingjianzhan.com (Beijing site). We developed a system that not only collects information but also automatically reprocesses it into pseudo-original content, so that it better suits the taste of search engines and your website gets indexed heavily. You can also add q3700004340 to discuss technical topics.