Yesterday I tried to batch-process a set of previously downloaded files, using regular expressions to pick out the key content in each file for centralized processing. The file operations failed because of how the Windows operating system encodes text. This post describes how PHP's mbstring function library can be used to handle Chinese characters on Windows.
As we all know, on Windows (the Chinese version, of course) both file names and file contents are encoded in GBK, while the IDE used during development saves scripts as UTF-8. (I will not discuss why here, only how to bring the two to the same encoding.) As a result, the Chinese characters in the regular expression pattern I wrote in UTF-8 did not match anything in the GBK-encoded files.
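To make the mismatch concrete, here is a minimal sketch. The sample word "中文" and the variable names are illustrative assumptions, not taken from my actual files; the script itself is assumed to be saved as UTF-8.

```php
<?php
// Illustrative sample only: the word "中文" is an assumption for this sketch.
$pattern     = '/中文/';            // pattern bytes are UTF-8 (script saved as UTF-8)
$utf8Content = "中文";              // the same word as UTF-8 bytes
$gbkContent  = "\xD6\xD0\xCE\xC4";  // the same word as GBK bytes

echo bin2hex($utf8Content), "\n";   // e4b8ade69687
echo bin2hex($gbkContent), "\n";    // d6d0cec4

// preg_match() compares bytes, so the UTF-8 pattern only finds UTF-8 text.
$matchesUtf8 = preg_match($pattern, $utf8Content); // 1
$matchesGbk  = preg_match($pattern, $gbkContent);  // 0
var_dump($matchesUtf8, $matchesGbk);
```

The two byte sequences have nothing in common, which is why the pattern silently finds nothing in the GBK file.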
At first I was stuck. I tried changing the encoding of the PHP script file itself to GBK. That works, but it felt too crude, so I looked for a PHP function that would do what I needed.
That reminded me of the iconv() function, which is often used to handle file names on Windows. Its prototype is as follows:
string iconv ( string $in_charset , string $out_charset , string $str )
Performs a character set conversion on the string str from in_charset to out_charset.
We often use:
$out_charset = 'utf-8';
$fileName = iconv('gbk', $out_charset, $fileName);
to convert a file name from GBK to UTF-8. Only the encoding changes; the text itself stays the same.
Some additional notes, translated from the manual:
If you append //TRANSLIT to the output charset, that is, $out_charset = 'utf-8//TRANSLIT', then a character that cannot be represented in the target charset is replaced with one or more similar-looking characters.
If you append //IGNORE, that is, $out_charset = 'utf-8//IGNORE', then a character that cannot be represented in the target charset is silently skipped.
If you append nothing, the conversion stops at the first character that cannot be converted, and PHP raises an E_NOTICE.
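The usage above can be sketched as follows. The sample file name and the word "café" are assumptions chosen for illustration. The results of the //TRANSLIT and //IGNORE calls depend on the iconv implementation PHP was built against, so the sketch prints them rather than promising a particular output.

```php
<?php
// Illustrative sample: "中文.txt" encoded as GBK bytes.
$gbkName  = "\xD6\xD0\xCE\xC4.txt";
$utf8Name = iconv('GBK', 'UTF-8', $gbkName); // arguments: from, to, string
echo $utf8Name, "\n"; // 中文.txt

// "é" has no ASCII equivalent, so the suffix decides what happens to it.
// @ suppresses the notice raised when a character cannot be converted.
$text     = "café";
$translit = @iconv('UTF-8', 'ASCII//TRANSLIT', $text); // approximates "é"
$ignored  = @iconv('UTF-8', 'ASCII//IGNORE', $text);   // drops "é"
$plain    = @iconv('UTF-8', 'ASCII', $text);           // fails at "é"

// Results vary between iconv implementations, so just inspect them.
var_dump($translit, $ignored, $plain);
```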
However, when I used this function on the file contents, the output was cut off after about 64 characters.
That is roughly the length of a typical file name, so iconv() appeared to handle only about 64 characters in my case, while the contents of my files are obviously much longer than that.
With no other choice, I went looking for another function.
Eventually I found the mbstring function library. It is usually bundled with PHP, and you can check whether it is enabled in the phpinfo() output.
The mbstring library provides an mb_convert_encoding() function that changes the encoding of a string. Its prototype is as follows:
string mb_convert_encoding ( string $str , string $to_encoding [, mixed $from_encoding ] )
Converts the character encoding of string str to to_encoding from optionally from_encoding.
Its basic usage is similar to iconv(), although the argument order differs: the target encoding comes before the source encoding. It does not accept suffix modifiers on the output encoding, and I did not run into any limit on the length of the string.
In addition, $from_encoding is optional. If it is omitted, mbstring's internal encoding is assumed. It can also be given as a comma-separated list of candidate encodings, in which case the function tries to identify the source encoding from that list.
I have not yet found a character that it fails to transcode, so I do not know how it handles one.
After I passed the whole file through mb_convert_encoding(), the problem was solved smoothly.
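A minimal sketch of that approach is shown below. The helper name matchInGbkText() and the sample text are hypothetical, introduced only for this example; it assumes the mbstring extension is enabled and the script is saved as UTF-8.

```php
<?php
// Hypothetical helper for illustration: convert GBK text to UTF-8 first,
// then run a UTF-8 regular expression against it.
function matchInGbkText(string $gbkText, string $utf8Pattern): array
{
    // Note the argument order: string, target encoding, source encoding.
    $utf8Text = mb_convert_encoding($gbkText, 'UTF-8', 'GBK');
    preg_match_all($utf8Pattern, $utf8Text, $matches);
    return $matches[0];
}

// Sample GBK content: "中文 abc 中文" with the Chinese words as GBK bytes.
$gbkText = "\xD6\xD0\xCE\xC4 abc \xD6\xD0\xCE\xC4";
$found   = matchInGbkText($gbkText, '/中文/u');

echo count($found), "\n"; // 2
```

For a real file, the GBK text would come from something like file_get_contents($path) before being passed to the helper.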
Finally, a brief introduction to the mbstring function library. Its full name is Multibyte String. Many of its functions are multibyte-aware versions of PHP's built-in string functions, with "mb_" prefixed to the function name. Besides doing what the original function does, these functions accept an extra optional $encoding parameter at the end of the parameter list, which specifies the encoding the function should use when interpreting the string.
For example, the strpos() function finds the position of one string inside another.
Take a UTF-8 encoded script, a Chinese string $haystack, and a single Chinese character $needle that is preceded by four other Chinese characters in $haystack. strpos($haystack, $needle, 0) returns 12, because strpos() counts bytes and each of these Chinese characters occupies 3 bytes in UTF-8.
With mb_strpos(), mb_strpos($haystack, $needle, 0, 'utf-8') returns 4, because the function treats the string as UTF-8 text and counts characters.
mb_strpos($haystack, $needle, 0, 'gbk'), on the other hand, returns 6, because the same 12 bytes are then read as six two-byte GBK characters.
Of course, the library has many more features.
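The byte-versus-character difference can be reproduced with a short sketch. The sample string "欢迎来访问" is an assumption chosen so that four Chinese characters precede the needle "问"; the script is assumed to be saved as UTF-8 with mbstring enabled.

```php
<?php
// Hypothetical sample strings, chosen so four characters precede the needle.
$haystack = "欢迎来访问";
$needle   = "问";

// strpos() and strlen() count bytes; each character here is 3 bytes in UTF-8.
$bytePos = strpos($haystack, $needle, 0);            // 12
$byteLen = strlen($haystack);                        // 15

// mb_strpos() and mb_strlen() count characters in the given encoding.
$charPos = mb_strpos($haystack, $needle, 0, 'UTF-8'); // 4
$charLen = mb_strlen($haystack, 'UTF-8');             // 5

var_dump($bytePos, $byteLen, $charPos, $charLen);

// Declaring the wrong encoding makes mbstring misread the bytes, so the
// position it reports is no longer meaningful for this UTF-8 text.
var_dump(mb_strpos($haystack, $needle, 0, 'GBK'));
```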
The following describes how to enable the PHP mbstring extension on Windows.
A few days ago, I ran a PHP program that needed to convert character encodings, but the mbstring extension was not enabled on the server. I checked, and PHP's extension directory did contain the php_mbstring.dll file.
Here is how to enable it.
1. Make sure php_mbstring.dll is present in Windows/system32. If it is not, copy it there from the extensions directory under your PHP installation directory.
2. Open php.ini in the Windows directory with an editor, search for php_mbstring.dll, and find the line
;extension=php_mbstring.dll
Remove the leading ; to enable the extension.
3. Restart the PHP service (if that does not work, restart the computer).
4. Done.
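After the restart, a quick check like the following sketch can confirm whether the extension was actually loaded. The function name mbstringStatus() is hypothetical; extension_loaded() and function_exists() are standard PHP functions.

```php
<?php
// Hypothetical helper for illustration: report whether mbstring is usable.
function mbstringStatus(): string
{
    if (extension_loaded('mbstring') && function_exists('mb_convert_encoding')) {
        return 'mbstring is enabled';
    }
    return 'mbstring is NOT enabled; check the extension line in php.ini';
}

$status = mbstringStatus();
echo $status, "\n";
```

The same information is also listed in the phpinfo() output under the mbstring section.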