Tips on data processing based on preg_match_all (encoding conversion and regular expression matching)

Source: Internet
Author: User
This article collects some notes on processing data after it has been collected from another site: encoding conversion and regular expression matching based on preg_match_all.

1. Use curl to collect pages from other sites

Please refer to my previous note: http://www.jb51.net/article/46432.htm
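
For readers without access to that note, the collection step can be sketched roughly as follows. The helper name fetch_page, the timeout value, and the example URL are assumptions made for illustration and do not come from the original article; only standard curl extension calls are used.

```php
<?php
// Hypothetical helper (not from the original note): fetch a URL with curl
// and return the response body, or null if the transfer fails.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $body = curl_exec($ch);

    // curl_exec() returns false on failure when CURLOPT_RETURNTRANSFER is set
    return is_string($body) ? $body : null;
}

// Usage (placeholder URL):
// $contents = fetch_page('https://example.com/');
```

The returned string is what the later steps refer to as $str or $contents.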

2. Encoding conversion
First, look at the page source to find out which encoding the collected website uses, then transcode it with the mb_convert_encoding function.

Usage:

The code is as follows:


// The source string is $str

// The original encoding is known to be GBK; convert it to UTF-8
$str = mb_convert_encoding($str, "UTF-8", "GBK");

// The original encoding is unknown; let mbstring detect it, then convert to UTF-8
// ("auto" only tries the encodings in mbstring's detection list, which may not include GBK)
$str = mb_convert_encoding($str, "UTF-8", "auto");
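
When it is not certain whether a page is already UTF-8, one cautious approach is to convert only when the bytes are not valid UTF-8. The sketch below illustrates this; the helper name to_utf8 and the choice of GBK as the fallback encoding are assumptions for this example, not something the original note prescribes.

```php
<?php
// Hypothetical helper: return $str as UTF-8, assuming that any input which
// is not valid UTF-8 is encoded in $fallback (GBK here, as an assumption).
function to_utf8(string $str, string $fallback = 'GBK'): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already valid UTF-8, nothing to convert
    }
    return mb_convert_encoding($str, 'UTF-8', $fallback);
}

$gbk  = "\xD6\xD0\xCE\xC4";   // the two characters "中文" encoded as GBK
$utf8 = to_utf8($gbk);

echo bin2hex($utf8), "\n";    // prints e4b8ade69687, the UTF-8 bytes of "中文"
```

Checking validity first avoids converting text that is already UTF-8 a second time, which would corrupt it.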

3. To keep unpredictable whitespace such as line breaks and spaces from breaking the later matching, remove the line breaks, space characters, and tabs from the collected source code.

The code is as follows:


// Method 1: replace with str_replace
$contents = str_replace("\r\n", '', $contents); // remove Windows line breaks
$contents = str_replace("\n", '', $contents);   // remove line breaks
$contents = str_replace("\t", '', $contents);   // remove tabs
$contents = str_replace(" ", '', $contents);    // remove space characters

// Method 2: replace with a regular expression
// (no "|" inside the character class, otherwise literal "|" characters are removed too)
$contents = preg_replace("/[\r\n\t ]+/", '', $contents);
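
The effect of this cleanup is easiest to see on a small input. The sample HTML below is made up for illustration. The last statement shows a gentler variant, which is an addition and not part of the original note: it removes only the whitespace between tags.

```php
<?php
// Made-up sample input; real input would be the collected page source.
$contents = "<div>\r\n\t<p> Hello world </p>\n</div>";

// Remove every carriage return, line feed, tab and space in one pass.
$stripped = preg_replace('/[\r\n\t ]+/', '', $contents);
echo $stripped, "\n";   // <div><p>Helloworld</p></div>

// The same result with a single str_replace call and an array of needles.
$viaStrReplace = str_replace(["\r\n", "\n", "\t", " "], '', $contents);

// Gentler variant (an assumption, not from the original): drop only the
// whitespace that sits between a closing ">" and the next "<".
$betweenTags = preg_replace('/>\s+</', '><', $contents);
echo $betweenTags, "\n"; // <div><p> Hello world </p></div>
```

Removing every space also joins the words inside the text ("Hello world" becomes "Helloworld") and the attributes inside tags, so the between-tags variant may be preferable when the extracted text is meant to be read.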

4. Use regular expression matching to find the code segments you need; preg_match_all performs the matching.

The code is as follows:


Function explanation:
int preg_match_all(string pattern, string subject, array matches [, int flags])
pattern is the regular expression.
subject is the text to be searched.
matches is the array that receives the results.
flags selects how the results are stored (PREG_PATTERN_ORDER is used if it is omitted):
PREG_PATTERN_ORDER; // the result is a two-dimensional array. $arr1[0] is the array of full matches (including the boundary text); $arr1[1] is the array of strings captured by the first pair of parentheses (without the boundary text).
PREG_SET_ORDER; // the result is a two-dimensional array. $arr2[0][0] is the first full match (including the boundary text), $arr2[0][1] is the first match without the boundary text, and so on for the following matches.
PREG_OFFSET_CAPTURE; // the result is a three-dimensional array. $arr3[0][0][0] is the first full match (including the boundary text) and $arr3[0][0][1] is the offset at which it starts in subject; $arr3[1][0][0] is the first match without the boundary text and $arr3[1][0][1] is the offset at which that captured text starts, that is, just past the opening boundary. The same layout repeats for the following matches.
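
The three storage modes are easier to compare side by side on the same input. The subject string and variable names in this sketch are made up for illustration.

```php
<?php
$subject = '<b>one</b><b>two</b>';
$pattern = '/<b>(.*?)<\/b>/';

// Grouped by pattern part: index 0 holds full matches, index 1 the first group.
preg_match_all($pattern, $subject, $byPattern, PREG_PATTERN_ORDER);
// $byPattern[0] is ['<b>one</b>', '<b>two</b>']
// $byPattern[1] is ['one', 'two']

// Grouped by match: each element is one match with its groups.
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);
// $bySet[0] is ['<b>one</b>', 'one']
// $bySet[1] is ['<b>two</b>', 'two']

// Pattern order, but every string is paired with its byte offset in $subject.
preg_match_all($pattern, $subject, $withOffsets, PREG_OFFSET_CAPTURE);
// $withOffsets[0][0] is ['<b>one</b>', 0]   and $withOffsets[1][0] is ['one', 3]
// $withOffsets[0][1] is ['<b>two</b>', 10]  and $withOffsets[1][1] is ['two', 13]

print_r($bySet);
```

PREG_SET_ORDER is usually the most convenient mode when each match should be handled as one record.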

// Practical application
preg_match_all('/<p>(.*?)<\/p>/', $contents, $out, PREG_SET_ORDER);
// $out receives all the matches
// $out[0][0] is the first full match, including the <p> and </p> tags
// $out[0][1] contains only the text matched by (.*?) in the parentheses

// Similarly, the text captured by the nth match is:
// $out[n-1][1]

// If the regular expression contains several pairs of parentheses, the mth group of the nth match is:
// $out[n-1][m]
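
A pattern with two pairs of parentheses shows the $out[n-1][m] indexing in practice. The sample list of links below is made up for illustration.

```php
<?php
// Made-up sample input with two links.
$contents = '<li><a href="/a.html">First</a></li><li><a href="/b.html">Second</a></li>';

// Group 1 captures the URL, group 2 captures the link text.
preg_match_all('/<a href="(.*?)">(.*?)<\/a>/', $contents, $out, PREG_SET_ORDER);

foreach ($out as $i => $match) {
    // $match[0] is the whole <a>...</a> element
    printf("%d: %s => %s\n", $i + 1, $match[1], $match[2]);
}

// Second match (n = 2), second group (m = 2):
echo $out[1][2], "\n";   // Second
```

Array indexes start at 0 for the matches, while group numbers start at 1 because index 0 of each match is reserved for the full matched text.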

5. After obtaining the text you were looking for, remove any remaining HTML tags with the strip_tags function that PHP provides.

The code is as follows:


// Example
$result = strip_tags($out[0][1]);
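
Steps 4 and 5 can be combined as in the sketch below. The sample paragraph is made up, and the isset guard is an addition for the case where nothing matches.

```php
<?php
// Made-up sample input: one paragraph containing nested tags.
$contents = '<p>Price: <b>42</b> <a href="/buy">buy now</a></p>';

preg_match_all('/<p>(.*?)<\/p>/', $contents, $out, PREG_SET_ORDER);

// Guard against pages where the pattern finds nothing.
$result = isset($out[0][1]) ? strip_tags($out[0][1]) : '';

echo $result, "\n";   // Price: 42 buy now
```

strip_tags removes the tags but keeps the text between them, so the content of the nested <b> and <a> elements survives.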
