Spam advertisements in user comments and other user-submitted content usually take one of the following forms:
1: Taobao part-time job, add QQ group 123456789 (contains a QQ number, WeChat ID, or other string of digits)
2: taobao part-time job, add QQ number (keywords written in English letters)
3: Taobao part-time job, add QQ ① (special number characters)
4: Taobao part-time job, add QQ 22222222 (full-width digits)
Filtering method:
Because advertisements almost always carry a QQ number or other contact information, the idea is to normalize the comment with regular expressions and string replacement, and then check whether it contains long runs of digits or known keywords (both full-width and half-width characters are handled). The comment is therefore "purified" first: full-width characters are converted to half-width, and the "sand", such as punctuation marks and spaces (and, if desired, letters), is removed, so that essentially only Chinese characters and digits are left.
Example:
$comment = 'this $ % is a (1) 8x8 website, b. Come and join ④ @ # qq 1 2 3 4 5 6 7 8';
1: "Purify" the content: remove punctuation marks, HTML tags and whitespace
$flag_arr = array('?', '!', '¥', '(', ')', ':', '‘', '’', '“', '”', '…', '。', 'nbsp', '】', '【', '~');
$comment = preg_replace('/\s/', '', preg_replace('/[[:punct:]]/', '', strip_tags(html_entity_decode(str_replace($flag_arr, '', $comment), ENT_QUOTES, 'UTF-8'))));
After processing, $comment becomes: "This is a (1) 8x8 website B. Come and join ① ④ haha qq12345678"
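The purification step can be wrapped in a small function. The following is only a sketch: the name `purify_comment` is hypothetical, the list of full-width punctuation is an illustrative subset, and the order of calls follows the one-liner above (removing `nbsp` before decoding entities, so that `&nbsp;` collapses to `&;` and is stripped as punctuation).

```php
<?php
// Hypothetical helper: strip markup, punctuation and whitespace from a comment.
function purify_comment(string $comment): string
{
    // Illustrative list of full-width punctuation plus 'nbsp'; extend as needed.
    $flagArr = ['?', '!', '¥', '(', ')', ':', '…', '。', 'nbsp', '【', '】', '~'];

    $comment = str_replace($flagArr, '', $comment);               // full-width punctuation
    $comment = html_entity_decode($comment, ENT_QUOTES, 'UTF-8'); // &amp; -> &
    $comment = strip_tags($comment);                              // drop HTML tags
    $comment = preg_replace('/[[:punct:]]/', '', $comment);       // ASCII punctuation
    return preg_replace('/\s/', '', $comment);                    // whitespace
}

echo purify_comment('<b>Join</b> us! QQ:&nbsp;1 2 3-4 5 6 &amp; more。'), "\n";
// JoinusQQ123456more
```

Without the `/u` modifier, `[[:punct:]]` only covers ASCII punctuation, which is why the full-width marks are removed separately with `str_replace`.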
2: The comment may contain full-width symbols, letters or digits; convert them to half-width with the mapping below.
$quanjiao = array(
    '0' => '0', '1' => '1', '2' => '2', '3' => '3', '4' => '4',
    '5' => '5', '6' => '6', '7' => '7', '8' => '8', '9' => '9',
    'A' => 'A', 'B' => 'B', 'C' => 'C', 'D' => 'D', 'E' => 'E',
    'F' => 'F', 'G' => 'G', 'H' => 'H', 'I' => 'I', 'J' => 'J',
    'K' => 'K', 'L' => 'L', 'M' => 'M', 'N' => 'N', 'O' => 'O',
    'P' => 'P', 'Q' => 'Q', 'R' => 'R', 'S' => 'S', 'T' => 'T',
    'U' => 'U', 'V' => 'V', 'W' => 'W', 'X' => 'X', 'Y' => 'Y',
    'Z' => 'Z',
    'a' => 'a', 'b' => 'b', 'c' => 'c', 'd' => 'd', 'e' => 'e',
    'f' => 'f', 'g' => 'g', 'h' => 'h', 'i' => 'i', 'j' => 'j',
    'k' => 'k', 'l' => 'l', 'm' => 'm', 'n' => 'n', 'o' => 'o',
    'p' => 'p', 'q' => 'q', 'r' => 'r', 's' => 's', 't' => 't',
    'u' => 'u', 'v' => 'v', 'w' => 'w', 'x' => 'x', 'y' => 'y',
    'z' => 'z',
    '(' => '(', ')' => ')', '〔' => '[', '〕' => ']', '【' => '[',
    '】' => ']', '〖' => '[', '〗' => ']', '{' => '{', '}' => '}',
    '《' => '<', '》' => '>', '%' => '%', '+' => '+', '—' => '-',
    '-' => '-', '~' => '-', ':' => ':', '。' => '.', '、' => '.',
    ',' => '.', ';' => ',', '?' => '?', '!' => '!', '…' => '-',
    '‖' => '|', '|' => '|', '“' => '"', '”' => '"', '‘' => '`',
    '’' => '`', '〃' => '"', ' ' => ' '
);
$comment = strtr($comment, $quanjiao);
PHP's strtr function translates specific characters or substrings in a string.
It can be called as
strtr(string, from, to)
or
strtr(string, array)
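The difference between the two call forms is easy to see in a short example. The strings and the `$pairs` variable below are made up for illustration.

```php
<?php
// Three-argument form: each byte found in $from is replaced by the byte
// at the same position in $to.
echo strtr('Hi all', 'ai', 'eo'), "\n";   // Ho ell

// Array form: whole substrings are replaced, the longest matching key wins,
// and text that has already been replaced is not scanned again.
$pairs = ['Hi' => 'Hello', 'Hello' => 'Bye'];
echo strtr('Hi Hello', $pairs), "\n";     // Hello Bye
```

Because the three-argument form works byte by byte, the array form is the one to use for multi-byte UTF-8 characters such as full-width digits.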
After processing, $comment becomes: "This is a website B with 18 artifacts, 3 and 4. Come and join ① ④ haha qq12345678"
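As an alternative to typing the table by hand, most of it can be generated. This is a sketch of a different technique from the hand-written array, not the original author's method. It relies on the fact that the Unicode full-width forms U+FF01 to U+FF5E mirror ASCII 0x21 to 0x7E at a fixed offset of 0xFEE0. The function name `build_fullwidth_map` is hypothetical, and the `"\u{...}"` escape assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: build a full-width => half-width table for strtr().
function build_fullwidth_map(): array
{
    $map = ["\u{3000}" => ' '];               // ideographic (full-width) space
    for ($cp = 0xFF01; $cp <= 0xFF5E; $cp++) {
        // Encode the code point as 3-byte UTF-8 by hand (no extensions needed).
        $fullwidth = chr(0xE0 | ($cp >> 12))
                   . chr(0x80 | (($cp >> 6) & 0x3F))
                   . chr(0x80 | ($cp & 0x3F));
        $map[$fullwidth] = chr($cp - 0xFEE0); // matching ASCII character
    }
    return $map;
}

$map = build_fullwidth_map();
echo strtr("\u{FF31}\u{FF31}\u{3000}\u{FF11}\u{FF12}\u{FF13}", $map), "\n"; // QQ 123
```

The generated table covers only that one Unicode block plus the full-width space. CJK punctuation such as 。 or 《 lies outside it and still needs explicit entries like the ones in $quanjiao.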
3: The comment may also contain special number characters (extend the array below with new ones as needed)
$special_num_char = array('①' => '1', '②' => '2', '③' => '3', '④' => '4', '⑤' => '5', '⑥' => '6', '⑦' => '7', '⑧' => '8', '⑨' => '9', '⑩' => '10', '⑴' => '1', '⑵' => '2', '⑶' => '3', '⑷' => '4', '⑸' => '5', '⑹' => '6', '⑺' => '7', '⑻' => '8', '⑼' => '9', '⑽' => '10', '一' => '1', '二' => '2', '三' => '3', '四' => '4', '五' => '5', '六' => '6', '七' => '7', '八' => '8', '九' => '9', '零' => '0');
$comment = strtr($comment, $special_num_char);
After processing, $comment becomes: "This is an 18-piece website B. Come and join the 14 qq12345678"
If the comment contains traditional (financial) Chinese numerals, such as '零', '壹', '贰', '叁', '肆', '伍', '陆', '柒', '捌', add them to $special_num_char above in the same way.
4: The comment may also mix ordinary digits with Chinese numerals. These can be converted to ordinary digits with the same method as in step 3.
Example: This is an advertisement qq 1二2四5六7899
After conversion:
This is an advertisement qq 1224567899
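The numeral conversion from steps 3 and 4 can be sketched as follows. The table here is a deliberately small, illustrative subset of $special_num_char, and the sample comment is made up.

```php
<?php
// Illustrative subset of the numeral table; extend it as needed.
$special_num_char = [
    '①' => '1', '②' => '2', '③' => '3', '④' => '4',   // circled digits
    '一' => '1', '二' => '2', '四' => '4', '六' => '6',   // ordinary Chinese numerals
    '壹' => '1', '贰' => '2',                             // traditional (financial) numerals
];

$comment = 'qq1二2四5六7899';
$comment = strtr($comment, $special_num_char);
echo $comment, "\n"; // qq1224567899
```

Since every key is a single character that maps to one or two ASCII digits, the array form of strtr handles all of them in one pass.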
5: Use a regular expression to detect advertisements
Match every run of digits with preg_match_all('/\d+/', $comment, $match);
Then examine the array of matches in $match[0]:
$is_ad = false;
foreach ($match[0] as $val) { // look for a digit string such as a QQ number or WeChat ID
    if (strlen($val) >= 6) {
        // A run of 6 or more consecutive digits: highly likely to be an advertisement.
        $is_ad = true;
        break;
    }
}
if (count($match[0]) >= 10) {
    // Many separate groups of digits: also suspicious.
    $is_ad = true;
}
With these checks you can decide whether a piece of content is an advertisement and filter out most common ads.