Application of Regular Expressions in Web Scrapers ("Thief" Programs)

Last Update:2018-12-03 Source: Internet

Author: User

Almost every programming language provides a way to fetch content from other websites quickly. For data that is updated in real time in particular, pulling it into your own site promptly can be a practical resource. However, these methods return the entire source code of the web page. The raw data is rarely what you want as it stands, so you need to process it.

1. Obtain source code

$content = file_get_contents("http://blog.csdn.net/shanshan209?viewmode=contents");

In PHP, the function used to fetch a page's source code is file_get_contents; you only need to pass it the URL of the web page. For convenience, the page fetched here is the article list of my blog.

If you output $content, you may find that everything is garbled. That is because the character encoding has not been converted yet:

$content = iconv("UTF-8", "gb2312//IGNORE", $content);

This converts the fetched UTF-8 page to GB2312. Remember to append //IGNORE for fault tolerance; without it, the conversion stops at the first character it cannot convert, and the captured content will be incomplete.
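As a minimal sketch of the conversion step, the helper below wraps iconv and reports a failed conversion instead of passing a broken string along. The function name convert_page_encoding and the sample title are illustrative assumptions, not part of the original code, and the example assumes the script file itself is saved as UTF-8. Which direction you convert in depends on the encoding of the fetched page and of your own site.

```php
<?php
// Hypothetical helper (not from the original article): convert fetched HTML
// from one character set to another, skipping characters that the target
// character set cannot represent.
function convert_page_encoding(string $html, string $from, string $to): string
{
    // "//IGNORE" asks iconv to drop unconvertible characters instead of failing.
    $converted = iconv($from, $to . '//IGNORE', $html);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $from to $to");
    }
    return $converted;
}

// Example: a UTF-8 title converted to GB2312 and back again.
$utf8   = '正则表达式';
$gb2312 = convert_page_encoding($utf8, 'UTF-8', 'GB2312');
$back   = convert_page_encoding($gb2312, 'GB2312', 'UTF-8');
var_dump($back === $utf8);
```

How iconv treats unconvertible characters can vary with the underlying iconv library, so it is worth testing the conversion against a real page from the site you are fetching.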

PS: Some hosting providers disable PHP's allow_url_fopen option, in which case file_get_contents cannot be used to fetch a remote web page. You can then fall back on cURL. Because file_get_contents stays defined even when allow_url_fopen is off, a function_exists check cannot detect this situation; read the setting with ini_get instead.

if (ini_get('allow_url_fopen')) {
    $file_contents = file_get_contents($url);
} else {
    $ch = curl_init();
    $timeout = 5; // connection timeout in seconds
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $file_contents = curl_exec($ch);
    curl_close($ch);
}

2. Process intercepted data

Output $content and you will find that not all of it is what you need; you only want the article titles from the blog list. How can the data be processed effectively? This is where regular expressions come in.

preg_match_all("/<div class=\"article_title\">(.+?)<\/div>/s", $content, $article_list);

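This captures the content of every div whose class is article_title. (.+?) is a lazy (non-greedy) match: it matches as few characters as possible, so each match ends at the nearest closing </div>. The s modifier lets the dot match newlines as well, so a title block that spans several lines is still captured.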
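To see why the lazy quantifier matters, here is a small sketch comparing greedy and lazy matching. The sample HTML string is made up for illustration and only mimics the structure of the blog list.

```php
<?php
// Sample HTML standing in for the fetched page (illustrative only).
$html = '<div class="article_title">First</div><div class="article_title">Second</div>';

// Greedy: .+ runs to the LAST </div>, so both titles land in one combined match.
preg_match_all('/<div class="article_title">(.+)<\/div>/s', $html, $greedy);

// Lazy: .+? stops at the FIRST </div> after each opening tag, one match per title.
preg_match_all('/<div class="article_title">(.+?)<\/div>/s', $html, $lazy);

print_r($greedy[1]);
print_r($lazy[1]);
```

With the greedy pattern the captured group swallows the markup between the two titles, which is why the lazy form is the one to use when a page contains several blocks of the same kind.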

preg_match stops after the first match, whereas preg_match_all searches the full text and returns every match of the expression.
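The difference between the two functions can be sketched on a short sample string. The HTML and variable names below are assumptions chosen for illustration.

```php
<?php
// Illustrative sample with three title blocks.
$sample  = '<div class="article_title">A</div>'
         . '<div class="article_title">B</div>'
         . '<div class="article_title">C</div>';
$pattern = '/<div class="article_title">(.+?)<\/div>/s';

// preg_match stops after the first match; $first[1] is a single string.
$found_one = preg_match($pattern, $sample, $first);

// preg_match_all keeps scanning; $all[1] is an array with one entry per match.
$found_all = preg_match_all($pattern, $sample, $all);

printf("preg_match: %d match, first title = %s\n", $found_one, $first[1]);
printf("preg_match_all: %d matches, titles = %s\n", $found_all, implode(', ', $all[1]));
```

Both functions return the number of matches found, so the return value can also serve as a quick check that the pattern matched anything at all.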

3. Get the data you want

$out = array();
foreach ($article_list[1] as $i => $key) {
    $out[$i] = trim(strip_tags($key));
}

strip_tags removes the HTML tags and trim removes the surrounding whitespace. The final data is the array $out.
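Putting the matching and cleanup steps together, the following sketch runs them on a hard-coded sample instead of a live page. The function name extract_article_titles and the sample markup are hypothetical; the real blog list may be structured differently.

```php
<?php
// Hypothetical helper: return the plain-text titles found in a blog list page.
function extract_article_titles(string $html): array
{
    $titles = [];
    // preg_match_all returns false on error and 0 when nothing matches.
    if (!preg_match_all('/<div class="article_title">(.+?)<\/div>/s', $html, $matches)) {
        return $titles;
    }
    foreach ($matches[1] as $i => $fragment) {
        // strip_tags removes the inner markup; trim drops the surrounding whitespace.
        $titles[$i] = trim(strip_tags($fragment));
    }
    return $titles;
}

// Made-up markup that mimics a list of article titles.
$html = <<<HTML
<div class="article_title">
    <span class="ico"></span>
    <a href="/post/1">Regular expressions in PHP</a>
</div>
<div class="article_title">
    <a href="/post/2">Fetching pages with cURL</a>
</div>
HTML;

print_r(extract_article_titles($html));
```

Wrapping the steps in a function keeps the pattern in one place, which makes it easier to adjust when the target page changes its markup.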

PS: The above is only a small example; real cases may be much more complicated. Handle each one as it comes.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
