How to match Chinese characters with a UTF-8 regular expression in PHP
The following code checks whether the entered content contains illegal characters:
$str = "编程";
// UTF-8: Chinese characters, letters, digits, underscores
// if (!preg_match("/^[\x{4e00}-\x{9fa5}A-Za-z0-9_]+$/u", $str))
// UTF-8: Chinese characters only
if (!preg_match("/^[\x{4e00}-\x{9fa5}]+$/u", $str)) {
    echo "<font color=red>[" . $str . "] contains illegal characters</font>";
} else {
    echo "<font color=green>[" . $str . "] Completely legal, pass!</font>";
}
-----------------------
UTF-8 match:
In JavaScript, it is easy to test whether a string consists entirely of Chinese characters.
For example:
Var str = "php programming ";
If (/^ [\ u4e00-\ u9fa5] + $/. test (str ))
{Alert ("all strings are Chinese ");
}
Else {alert ("Not all strings are Chinese ");
}
In PHP, \x is used to represent a hexadecimal value.
So the natural first attempt is to translate the JavaScript into the following code:
$ Str = "php programming ";
If (preg_match ("/^ [\ x4e00-\ x9fa5] + $/", $ str ))
{
Print ("all strings are Chinese ");
}
Else {print ("Not all strings are Chinese ");
}
No error is reported, and the result looks correct. However, if you replace $str with "编程", the output is still "Not all strings are Chinese", so this test is clearly not accurate.
Important:
After reading the book Mastering Regular Expressions, I found a fuller explanation of why [\x4e00-\x9fa5] fails.
In a PHP regular expression, \x followed directly by digits takes at most two hexadecimal digits. To write a longer value, such as a four-digit one, the digits must be enclosed in braces: \x{hhhh}. Without braces, [\x4e00-\x9fa5] is read as the character \x4e ("N"), the character "0", the range from "0" to \x9f, and the characters "a" and "5". That character class has nothing to do with Chinese.
In addition, any value greater than \x{FF} must be used together with the u modifier. Otherwise the pattern fails to compile.
Searching online, I could only find patterns that match full-width (multi-byte) characters by byte value, such as ^[\x80-\xff]*$, where braces are unnecessary because each value has only two hex digits. [\u4e00-\u9fa5] would match Chinese characters, but PHP does not support the \u escape in patterns. This also answers the question of why the range \x4e00-\x9fa5 behaves differently from its JavaScript counterpart even though \x denotes a hexadecimal value: the braces are missing.
So I switched to the following code and found that it was really accurate:
$ Str = "php programming ";
If (preg_match ("/^ [\ x {4e00}-\ x {9fa5}] + $/u", $ str ))
{
Print ("all strings are Chinese ");
}
Else {print ("Not all strings are Chinese ");
}
So the correct expression for matching Chinese characters in UTF-8 encoded text in PHP is /^[\x{4e00}-\x{9fa5}]+$/u. Based on the above, I wrote the following test code (copy it and save it as a .php file):
<?php
$action = trim($_GET['action'] ?? '');
if ($action == "sub") {
    $str = $_POST['dir'] ?? '';
    // GB2312: Chinese characters, letters, digits, underscores
    // if (!preg_match("/^[" . chr(0xa1) . "-" . chr(0xff) . "A-Za-z0-9_]+$/", $str))
    // UTF-8: Chinese characters, letters, digits, underscores
    if (!preg_match("/^[\x{4e00}-\x{9fa5}A-Za-z0-9_]+$/u", $str)) {
        // Escape rejected input before echoing it back into the page
        echo "<font color=red>[" . htmlspecialchars($str) . "] contains illegal characters</font>";
    } else {
        echo "<font color=green>[" . $str . "] Completely legal, pass!</font>";
    }
}
?>
<form method="POST" action="?action=sub">
Input characters (digits, letters, Chinese characters, and underscores):
<input type="text" name="dir" value="">
<input type="submit" value="submit">
</form>
GBK:
preg_match("/^[" . chr(0xa1) . "-" . chr(0xff) . "A-Za-z0-9_]+$/", $str); // GB2312: Chinese characters, letters, digits, underscores
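The byte-range pattern can be tried on a hand-encoded sample. In the sketch below, the variable names are illustrative, and the bytes B1 E0 B3 CC are written out as the GB2312 encoding of "编程" so that the example does not depend on the mbstring or iconv extensions.

```php
<?php
// "编程" as GB2312 bytes, written by hand (B1 E0 = 编, B3 CC = 程).
$gbText = "\xB1\xE0\xB3\xCC";

// Same pattern as above. There is no u modifier, because the subject is
// not UTF-8 and is compared byte by byte.
$pattern = "/^[" . chr(0xa1) . "-" . chr(0xff) . "A-Za-z0-9_]+$/";

var_dump(preg_match($pattern, $gbText));           // int(1): every byte is in A1-FF
var_dump(preg_match($pattern, $gbText . "_php"));  // int(1): letters and underscore allowed
var_dump(preg_match($pattern, $gbText . " php"));  // int(0): the space is rejected
```

Because this pattern inspects individual bytes rather than characters, it accepts any byte from A1 to FF, including sequences that do not form valid GB2312 characters.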
That covers how to match Chinese characters with a UTF-8 regular expression in PHP. I hope you find it useful.