Filtering advertisement content with PHP (part-time jobs, QQ numbers, Taobao part-time jobs, websites)

Source: Internet
Author: User
Tags: comments, join, regular expression


Advertisements in user comments and other user-submitted content usually take one of these forms:

1: Taobao part-time job, add QQ group 123456789 (a QQ number, WeChat ID, or other numeric contact)
2: taobao part-time job, add QQ number (keywords written in English letters)
3: Taobao part-time job, add QQ ① (special number characters)
4: ２２２２２２２２ (full-width digits)

Filtering method:
Advertisements almost always carry a QQ number or some other contact detail, so the approach is to normalize the comment and then use regular expressions to check whether it contains consecutive digits or known keywords (both full-width and half-width forms are handled). The comment must first be "purified": convert full-width characters to half-width and strip the "sand", such as punctuation marks and spaces (letters can be stripped as well), so that mostly Chinese characters and digits remain.

Example:

$comment = 'this $% is a ①8x8 website, b. Come and join ④@# qq １２３４５６７８';

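1: "Purify" the content and remove punctuation marks

// Full-width punctuation that the ASCII-only [[:punct:]] class does not cover.
$flag_arr = array('？', '！', '￥', '（', '）', '：', '‘', '’', '“', '”', '…', '。', 'nbsp', '】', '【', '～');

// Remove the listed characters, decode HTML entities, strip tags, then drop ASCII punctuation and whitespace.
$comment = preg_replace('/\s/', '', preg_replace('/[[:punct:]]/', '', strip_tags(html_entity_decode(str_replace($flag_arr, '', $comment), ENT_QUOTES, 'UTF-8'))));
After processing, $comment no longer contains punctuation, spaces, or HTML markup. The text and the digits remain, including special digits such as ① and ④ and any full-width digits.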
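
The nested call in step 1 is easier to follow when it is unrolled. The following is a minimal sketch, assuming PHP 7 or later; the function name purify_comment and the shortened punctuation list are illustrative and are not part of the original code.

```php
<?php
// Hypothetical helper: the same calls as step 1, one per line.
function purify_comment(string $comment): string
{
    // Illustrative subset of full-width punctuation; ASCII punctuation is
    // handled by [[:punct:]] below.
    $flagArr = ['？', '！', '（', '）', '：', '。', '【', '】'];

    $text = str_replace($flagArr, '', $comment);            // full-width punctuation
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // &lt;b&gt; becomes <b>
    $text = strip_tags($text);                              // drop HTML tags
    $text = preg_replace('/[[:punct:]]/', '', $text);       // ASCII punctuation
    return preg_replace('/\s/', '', $text);                 // ASCII whitespace
}

echo purify_comment('Join <b>now</b>! QQ: 1 2 3-4 5 6'), "\n"; // JoinnowQQ123456
```

Entities are decoded before strip_tags runs so that markup hidden as &lt;b&gt; is removed together with ordinary tags.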

2: The comment may contain full-width symbols, digits, or letters. Convert them to half-width with the table below.

$quanjiao = array(
    '０' => '0', '１' => '1', '２' => '2', '３' => '3', '４' => '4',
    '５' => '5', '６' => '6', '７' => '7', '８' => '8', '９' => '9',
    'Ａ' => 'A', 'Ｂ' => 'B', 'Ｃ' => 'C', 'Ｄ' => 'D', 'Ｅ' => 'E', 'Ｆ' => 'F', 'Ｇ' => 'G',
    'Ｈ' => 'H', 'Ｉ' => 'I', 'Ｊ' => 'J', 'Ｋ' => 'K', 'Ｌ' => 'L', 'Ｍ' => 'M', 'Ｎ' => 'N',
    'Ｏ' => 'O', 'Ｐ' => 'P', 'Ｑ' => 'Q', 'Ｒ' => 'R', 'Ｓ' => 'S', 'Ｔ' => 'T', 'Ｕ' => 'U',
    'Ｖ' => 'V', 'Ｗ' => 'W', 'Ｘ' => 'X', 'Ｙ' => 'Y', 'Ｚ' => 'Z',
    'ａ' => 'a', 'ｂ' => 'b', 'ｃ' => 'c', 'ｄ' => 'd', 'ｅ' => 'e', 'ｆ' => 'f', 'ｇ' => 'g',
    'ｈ' => 'h', 'ｉ' => 'i', 'ｊ' => 'j', 'ｋ' => 'k', 'ｌ' => 'l', 'ｍ' => 'm', 'ｎ' => 'n',
    'ｏ' => 'o', 'ｐ' => 'p', 'ｑ' => 'q', 'ｒ' => 'r', 'ｓ' => 's', 'ｔ' => 't', 'ｕ' => 'u',
    'ｖ' => 'v', 'ｗ' => 'w', 'ｘ' => 'x', 'ｙ' => 'y', 'ｚ' => 'z',
    '（' => '(', '）' => ')', '〔' => '[', '〕' => ']', '【' => '[', '】' => ']',
    '〖' => '[', '〗' => ']', '“' => '[', '｛' => '{', '｝' => '}', '《' => '<', '》' => '>',
    '％' => '%', '＋' => '+', '—' => '-', '－' => '-', '～' => '-',
    '：' => ':', '。' => '.', '，' => '.', '、' => '.', '；' => ',',
    '？' => '?', '！' => '!', '…' => '-', '‖' => '|', '”' => '"',
    '’' => '`', '‘' => '`', '｜' => '|', '〃' => '"', '　' => '',
);

$comment = strtr($comment, $quanjiao);
PHP's strtr function translates specific characters or substrings in a string.
It can be called as
strtr(string, from, to)
or
strtr(string, array)

After processing, the full-width characters in $comment have become half-width, so the contact number now reads qq12345678. Special digits such as ① and ④ are still unconverted.
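
The two call forms of strtr behave differently, which matters for full-width text. The short sketch below uses made-up sample strings: the three-argument form translates single bytes, so only the array form is suitable for multi-byte UTF-8 characters.

```php
<?php
// Three-argument form: each single-byte character in the second argument is
// replaced by the character at the same position in the third.
$ascii = strtr('a-b-c', '-', '+');

// Array form: keys are whole substrings, so multi-byte UTF-8 characters work.
$halfWidth = strtr('ＱＱ１２３', ['Ｑ' => 'Q', '１' => '1', '２' => '2', '３' => '3']);

// Text that has already been replaced is not scanned again, so a table may
// map characters onto each other without the replacements chaining.
$swapped = strtr('ab', ['a' => 'b', 'b' => 'a']);

echo $ascii, "\n";     // a+b+c
echo $halfWidth, "\n"; // QQ123
echo $swapped, "\n";   // ba
```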

3: The comment may also contain special number characters (extend the array below with new ones as you meet them)

$special_num_char = array('①' => '1', '②' => '2', '③' => '3', '④' => '4', '⑤' => '5', '⑥' => '6', '⑦' => '7', '⑧' => '8', '⑨' => '9', '⑩' => '10', '⑴' => '1', '⑵' => '2', '⑶' => '3', '⑷' => '4', '⑸' => '5', '⑹' => '6', '⑺' => '7', '⑻' => '8', '⑼' => '9', '⑽' => '10', '一' => '1', '二' => '2', '三' => '3', '四' => '4', '五' => '5', '六' => '6', '七' => '7', '八' => '8', '九' => '9', '零' => '0');
$comment = strtr($comment, $special_num_char);
After processing, the special digits in $comment are ordinary digits (① becomes 1 and ④ becomes 4), and the contact number still reads qq12345678.
If comments also contain traditional numerals such as '零', '壹', '贰', '叁', '肆', '伍', '陆', '柒', and '捌', add them to $special_num_char in the same way.
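
The table can be extended without editing the original array. This is a minimal sketch, assuming the traditional numerals meant above are the financial forms 零 through 捌; 玖 (9) is added here for completeness, and the starting table is abbreviated.

```php
<?php
// Abbreviated stand-in for the $special_num_char table from step 3.
$special_num_char = ['①' => '1', '②' => '2', '③' => '3', '④' => '4'];

// Assumed extension: financial numerals mapped to ordinary digits.
$traditional = [
    '零' => '0', '壹' => '1', '贰' => '2', '叁' => '3', '肆' => '4',
    '伍' => '5', '陆' => '6', '柒' => '7', '捌' => '8', '玖' => '9',
];

// The + operator keeps the existing keys and adds the new ones.
$special_num_char = $special_num_char + $traditional;

echo strtr('加Q壹贰叁④伍陆', $special_num_char), "\n"; // 加Q123456
```

Characters such as 陆 also occur in ordinary words and surnames, so a wider table raises the chance of false positives. That trade-off is worth checking against real comments.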

4: The comment may also mix ordinary digits with Chinese numerals and the special characters above. The method from step 3 converts all of them to ordinary digits.

Example: This is an advertisement qq 1 2 2 2 45 6 7899
After conversion:
This is an advertisement qq 1224567899
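
A mixed number like this can be handled in one pass by merging the tables. The sketch below uses a made-up mix of ordinary, Chinese, circled, and full-width digits (the exact characters of the original example are not known) and abbreviated tables.

```php
<?php
// Abbreviated tables; the full ones come from steps 2 and 3.
$quanjiao = ['４' => '4'];
$special_num_char = ['②' => '2', '二' => '2', '六' => '6'];

$comment = 'This is an advertisement qq 1二②４5六7899';

// One strtr call over the merged table, then drop the whitespace.
$comment = strtr($comment, $quanjiao + $special_num_char);
$comment = preg_replace('/\s/', '', $comment);

// The contact number is now one unbroken run of ASCII digits.
preg_match('/\d+/', $comment, $m);
echo $m[0], "\n"; // 1224567899
```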

5: Use a regular expression to detect advertisements

Collect every run of digits with preg_match_all('/\d+/', $comment, $match);
then analyze the matches in $match[0]:

$is_ad = false;
foreach ($match[0] as $val) // look for a numeric QQ number or WeChat ID
{
    if (strlen($val) >= 6)
    {   // A run of 6 or more consecutive digits is highly likely to be an advertisement.
        $is_ad = true;
        break;
    }
}
if (count($match[0]) >= 10)
{   // Many separate groups of digits are also suspicious.
    $is_ad = true;
}

With these steps you can judge whether a piece of content is an advertisement and filter out most common ads.
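
The five steps can be combined into a single function. The following is a sketch under stated assumptions, not the author's exact code: the name looks_like_ad and its two threshold parameters are hypothetical, the digit table is abbreviated, and PHP 7 or later is assumed.

```php
<?php
// Hypothetical helper combining steps 1 to 5; the table is abbreviated.
function looks_like_ad(string $comment, int $minRun = 6, int $maxGroups = 10): bool
{
    // Steps 2 to 4: normalize full-width, circled, and Chinese digits.
    $digits = [
        '０' => '0', '１' => '1', '２' => '2', '３' => '3', '４' => '4',
        '５' => '5', '６' => '6', '７' => '7', '８' => '8', '９' => '9',
        '①' => '1', '②' => '2', '③' => '3', '④' => '4', '⑤' => '5',
        '一' => '1', '二' => '2', '三' => '3', '四' => '4', '五' => '5',
        '六' => '6', '七' => '7', '八' => '8', '九' => '9', '零' => '0',
        "\u{3000}" => ' ', // full-width space
    ];
    $text = strtr($comment, $digits);

    // Step 1: strip markup, ASCII punctuation, and ASCII whitespace.
    $text = strip_tags(html_entity_decode($text, ENT_QUOTES, 'UTF-8'));
    $text = preg_replace('/[[:punct:]\s]+/', '', $text);

    // Step 5: inspect the runs of digits that are left.
    preg_match_all('/\d+/', $text, $match);
    foreach ($match[0] as $run) {
        if (strlen($run) >= $minRun) {
            return true; // long run: probably a QQ number or similar
        }
    }
    return count($match[0]) >= $maxGroups; // many scattered digit groups
}

var_dump(looks_like_ad('add Q ①２3 四5六 part-time job')); // bool(true)
var_dump(looks_like_ad('I bought 2 of these in 2019.'));    // bool(false)
```

The sketch normalizes before stripping, the reverse of the order in the steps above, so that anything the table turns into ASCII punctuation or whitespace is still removed. The thresholds of 6 and 10 are the ones used in step 5 and will flag legitimate comments that contain order numbers or phone numbers, so they are worth tuning per site.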
