Using regular expressions in PHP to extract Chinese text: implementation notes
This article introduces how to use regular expressions in PHP to extract Chinese text. It also covers the character ranges for Korean and Japanese, and gives the implementation code and a usage example for readers who need them.
Recently my boss assigned a data-checking task that involves extracting the Chinese text from a file containing Chinese text segments and storing it, with the development done in PHP. This raised the question of how to match Chinese characters with PHP regular expressions. I collected a lot of material online, but it was disorganized and led to no clear conclusion, so after revising and testing my own code I am first writing down the extraction function.
The first thing to note is the multi-byte character encoding problem. We may run into the same encoding problem with Korean, Japanese and other languages, and it works the same way as for Chinese.
1. GBK (GB2312/GB18030)
The code is as follows:
\x00-\xff  byte range of the GBK double-byte encoding
\x20-\x7f  ASCII
\xa1-\xff  Chinese (GB2312)
\x80-\xff  Chinese (GBK)
2. UTF-8 (Unicode)
The code is as follows:
\u4e00-\u9fa5  (Chinese)
\u3130-\u318f  (Korean)
\uac00-\ud7a3  (Korean)
\u0800-\u4e00  (Japanese)
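To see why a byte range such as \x80-\xff can pick out Chinese text, it can help to look at the raw bytes of a single character in both encodings. The following is a minimal sketch rather than part of the original code: the helper name all_high_bytes() is hypothetical, and the sample character 中 is chosen here only for illustration.

```php
<?php
// Sample character "中" (U+4E2D), written as escaped bytes so that the
// encoding of this source file does not matter.
$utf8 = "\xE4\xB8\xAD"; // "中" in UTF-8 (3 bytes)
$gbk  = "\xD6\xD0";     // "中" in GBK/GB2312 (2 bytes)

// Hypothetical helper: true when every byte of $s is in 0x80-0xFF.
// No u modifier is used, so PCRE compares the subject byte by byte.
function all_high_bytes($s) {
    return preg_match('/^[\x80-\xff]+$/', $s) === 1;
}

echo bin2hex($utf8), "\n";       // e4b8ad
echo bin2hex($gbk), "\n";        // d6d0
var_dump(all_high_bytes($utf8)); // bool(true)
var_dump(all_high_bytes($gbk));  // bool(true)
var_dump(all_high_bytes("abc")); // bool(false)
```

In both encodings the bytes of this character are all above the ASCII range, which is what a byte-level character class relies on.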
In Notepad++ we can first check whether our regular expression is written correctly. The first expression I tested was [\u4e00-\u9fa5]+, where + means one or more of the matched character. The result was as expected, so can the same expression be used in a script?
Let's test it. We call preg_match_all('/[\u4e00-\u9fa5]+/', $subject, $matches), and the result is: Compilation failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 2 .... A real headache. What is the reason for this?
After looking through a lot of material, I found the u (PCRE_UTF8) pattern modifier. PCRE is the Perl-compatible regular expression library used by PHP's preg functions. This modifier turns on additional PCRE functionality that is incompatible with Perl: the pattern string is treated as UTF-8. It is available from PHP 4.1.0 on Unix and from PHP 4.2.3 on Win32. PHP regular expressions also differ in how hexadecimal values are written: \u is not supported, and hexadecimal values are written with \x instead (\xhh for a single byte, or \x{hhhh} for a code point when the u modifier is used). With that in mind we revise the code, and the extraction function becomes:
The code is as follows:
class StoreDataAdapter extends Store {
    private $dsData;

    /**
     * Data conversion function: calls preg_match_all() to match $subject against $pattern
     * and stores the results as an array in $matches.
     * $matches[0] contains the text that matched the full pattern, $matches[1] the text that
     * matched the first captured parenthesized subpattern, and so on.
     * @see Store::data_convert()
     */
    public function data_convert($pattern, $subject) {
        $matches = array();
        if (preg_match_all($pattern, $subject, $matches)) {
            return $matches[0];
        } else {
            return null;
        }
    }
}
It is then called like this:
The code is as follows:
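$store = new StoreDataAdapter($txtContent);
// [\x7f-\xff]+ matches runs of bytes from 0x7f to 0xff, i.e. the bytes of multi-byte characters
$dsName = $store->data_convert('/[\x7f-\xff]+/', $txtContent);
// data_convert() returns null when nothing matches
if ($dsName !== null) {
    foreach ($dsName as $val) {
        echo $val . "\n";
    }
}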
The input file is:
The following is the output file content after extracting the Chinese:
The result meets the expected requirements.
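A byte-range pattern such as /[\x7f-\xff]+/ matches any run of non-ASCII bytes, so it would also pick up Korean or Japanese text mixed into the input. As an alternative, here is a sketch under the assumption that the input is valid UTF-8: with the u modifier, PCRE accepts code points written as \x{hhhh}, so the Notepad++ range [\u4e00-\u9fa5] can be written as [\x{4e00}-\x{9fa5}]. The function name extract_chinese() and the sample string are hypothetical and are not part of the original class.

```php
<?php
// Hypothetical helper, assuming $subject is valid UTF-8.
function extract_chinese($subject) {
    $matches = array();
    // The u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{4e00}-\x{9fa5} is a range of code points, not of bytes.
    if (preg_match_all('/[\x{4e00}-\x{9fa5}]+/u', $subject, $matches)) {
        return $matches[0];
    }
    return array();
}

// Sample input written as escaped UTF-8 bytes:
// "abc" + 中文 + "123" + 한 (Korean) + 汉字
$text = "abc\xE4\xB8\xAD\xE6\x96\x87123\xED\x95\x9C\xE6\xB1\x89\xE5\xAD\x97";

// Code-point range: only the Chinese runs are returned (中文, 汉字).
print_r(extract_chinese($text));

// Byte range, for comparison: the Korean character is kept together
// with the Chinese run that follows it (中文, 한汉字).
$byBytes = preg_match_all('/[\x7f-\xff]+/', $text, $m) ? $m[0] : array();
print_r($byBytes);
```

Which pattern fits depends on the data: the byte-range version needs no knowledge of the encoding, while the code-point version is stricter but only works on UTF-8 input.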