Summary of methods for truncating Chinese strings in PHP
Truncating Chinese strings with substr often produces garbled characters on the home page and in vTigerCRM, because substr counts bytes and can cut a multi-byte character in half. The functions collected here are better ways to truncate Chinese strings, shared for reference.
Program 1: truncating Chinese strings in PHP
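// Assumes a double-byte encoding such as GB2312/GBK, where a byte above 0xa0
// starts a two-byte character. Note that the scan always begins at byte 0, so
// the result covers the first $start + $len bytes rather than skipping $start.
function msubstr($str, $start, $len) {
    $tmpstr = "";
    $strlen = $start + $len;
    for ($i = 0; $i < $strlen; $i++) {
        if (ord(substr($str, $i, 1)) > 0xa0) {
            // Lead byte of a double-byte character: copy both bytes together.
            $tmpstr .= substr($str, $i, 2);
            $i++;
        } else {
            $tmpstr .= substr($str, $i, 1);
        }
    }
    return $tmpstr;
}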
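Since msubstr always copies from the beginning of the string, a variant that actually skips the bytes before $start may be useful. The following is only a sketch under the same double-byte assumption; the name msubstr_from is hypothetical and not part of the original code.

```php
<?php
/**
 * Hypothetical variant of msubstr() that skips characters before $start.
 * Assumes a GB2312/GBK-style encoding: a byte above 0xa0 starts a
 * two-byte character. $start and $len are byte offsets.
 */
function msubstr_from($str, $start, $len)
{
    $result = "";
    $end = min($start + $len, strlen($str));
    for ($i = 0; $i < $end; $i++) {
        // Decide the width of the character that begins at byte $i.
        $width = (ord($str[$i]) > 0xa0) ? 2 : 1;
        if ($i >= $start) {
            $result .= substr($str, $i, $width);
        }
        // Skip the trailing byte of a double-byte character.
        $i += $width - 1;
    }
    return $result;
}

// "\xd6\xd0\xce\xc4" is "中文" in GB2312; byte escapes keep the example
// independent of the encoding this file is saved in.
$gbk = "ab\xd6\xd0\xce\xc4cd";
echo bin2hex(msubstr_from($gbk, 2, 4)), "\n"; // d6d0cec4
echo msubstr_from($gbk, 0, 2), "\n";          // ab
```

Scanning from byte 0 is kept on purpose: it is the only way to know whether a given offset is the first or second byte of a double-byte character.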
Program 2: truncating a UTF-8 string in PHP without splitting a character
<?php
/******************************************************************
 * Truncates a UTF-8 string without leaving half a character at the end.
 * English letters and digits (halfwidth) take 1 byte (8 bits); Chinese (fullwidth) characters take 3 bytes.
 * @return the truncated string; when $len is less than or equal to 0, the entire string is returned.
 * @param $str source string
 * @param $len length of the substring taken from the left (a Chinese character counts as 2, a halfwidth character as 1)
 ******************************************************************/
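function utf_substr($str, $len)
{
    if ($len <= 0) {
        return $str; // documented behaviour: a non-positive length returns everything
    }
    $new_str = array();
    for ($i = 0; $i < $len && strlen($str) > 0; $i++)
    {
        $temp_str = substr($str, 0, 1);
        if (ord($temp_str) > 127)
        {
            // A multi-byte character counts as two units of $len.
            $i++;
            if ($i < $len)
            {
                $new_str[] = substr($str, 0, 3); // assumes every non-ASCII character is 3 bytes
                $str = substr($str, 3);
            }
        }
        else
        {
            $new_str[] = substr($str, 0, 1);
            $str = substr($str, 1);
        }
    }
    return join($new_str);
}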
?>
PHP UTF-8 string truncation
<?php
function cutstr($string, $length) {
    // Split the string into UTF-8 characters (1 to 4 bytes each).
    preg_match_all('/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|\xe0[\xa0-\xbf][\x80-\xbf]|[\xe1-\xef][\x80-\xbf][\x80-\xbf]|\xf0[\x90-\xbf][\x80-\xbf][\x80-\xbf]|[\xf1-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf]/', $string, $info);
    $wordscut = '';
    $j = 0; // running width: 2 per multi-byte character, 1 per ASCII character
    for ($i = 0; $i < count($info[0]); $i++) {
        $wordscut .= $info[0][$i];
        $j = ord($info[0][$i]) > 127 ? $j + 2 : $j + 1;
        if ($j > $length - 3) {
            return $wordscut . "...";
        }
    }
    return join('', $info[0]);
}
// Sample input; use a string containing Chinese characters to exercise the multi-byte branch.
$string = "242432 objection: 456 the attack against a wide range of embassy places 7890";
for ($i = 0; $i < strlen($string); $i++) {
    echo cutstr($string, $i) . "\n";
}
?>
A UTF-8 string truncation function
To support multiple languages, strings in the database may be stored in UTF-8, and during website development PHP often needs to take part of such a string. To avoid garbled characters, the following UTF-8 string truncation function was written.
For the principles behind UTF-8, see the UTF-8 FAQ.
A UTF-8 encoded character consists of 1 to 3 bytes, and the exact number can be determined from the first byte. (Longer sequences exist, for example 4-byte characters whose first byte is 240 or above, but the code here assumes a character never exceeds 3 bytes.)
If the first byte is 224 or greater, it and the 2 bytes after it form one UTF-8 character.
If the first byte is at least 192 but less than 224, it and the 1 byte after it form one UTF-8 character.
Otherwise, the first byte is an English character on its own (including digits and a small set of punctuation marks).
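The lead-byte rule above can be sketched as a small helper. This is an illustration under the same 3-byte assumption; utf8_seq_len is a hypothetical name, and the sample string assumes the file is saved as UTF-8.

```php
<?php
/**
 * Hypothetical helper: number of bytes in the UTF-8 sequence that
 * starts with byte value $lead (0-255), following the rules above.
 * Like the article, it treats every lead byte >= 224 as a 3-byte
 * sequence; 4-byte sequences (lead >= 240) are not handled separately.
 */
function utf8_seq_len($lead)
{
    if ($lead >= 224) {
        return 3;
    }
    if ($lead >= 192) {
        return 2;
    }
    return 1; // ASCII (continuation bytes 128-191 also land here)
}

// Walk a string character by character using the lead byte.
$text = "aé中"; // 1-byte, 2-byte and 3-byte characters
$chars = array();
for ($i = 0; $i < strlen($text); $i += $len) {
    $len = utf8_seq_len(ord($text[$i]));
    $chars[] = substr($text, $i, $len);
}
echo count($chars), "\n";        // 3
echo implode("|", $chars), "\n"; // a|é|中
```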
The following code was written earlier for a website (it is also the function the home page uses for truncation).
// $sourcestr is the string to be processed
// $cutlength is the truncation length (that is, the number of characters)
function cut_str($sourcestr, $cutlength)
{
    $returnstr = '';
    $i = 0;
    $n = 0;
    $str_length = strlen($sourcestr); // number of bytes in the string
    while (($n < $cutlength) && ($i < $str_length))
    {
        $temp_str = substr($sourcestr, $i, 1);
        $ascnum = ord($temp_str); // value of the byte at offset $i
        if ($ascnum >= 224) // lead byte of a 3-byte character
        {
            $returnstr = $returnstr . substr($sourcestr, $i, 3); // per the UTF-8 encoding, three consecutive bytes form one character
            $i = $i + 3; // advance 3 bytes
            $n++; // counts as 1 character
        }
        elseif ($ascnum >= 192) // lead byte of a 2-byte character
        {
            $returnstr = $returnstr . substr($sourcestr, $i, 2); // per the UTF-8 encoding, two consecutive bytes form one character
            $i = $i + 2; // advance 2 bytes
            $n++; // counts as 1 character
        }
        elseif ($ascnum >= 65 && $ascnum <= 90) // uppercase letter
        {
            $returnstr = $returnstr . substr($sourcestr, $i, 1);
            $i = $i + 1; // still advances 1 byte
            $n++; // for overall appearance, an uppercase letter counts as a full character width
        }
        else // everything else, including lowercase letters and halfwidth punctuation
        {
            $returnstr = $returnstr . substr($sourcestr, $i, 1);
            $i = $i + 1; // advance 1 byte
            $n = $n + 0.5; // lowercase letters and halfwidth punctuation count as half a character width
        }
    }
    if ($i < $str_length) {
        $returnstr = $returnstr . "..."; // append an ellipsis only when part of the string was cut off
    }
    return $returnstr;
}
Another UTF-8 string truncation function
function FSubstr($title, $start, $len = "", $magic = true)
{
    if ($len == "") $len = strlen($title);
    if ($start != 0)
    {
        $startv = ord(substr($title, $start, 1));
        if ($startv >= 128)
        {
            if ($startv < 192)
            {
                // $start points at a continuation byte: back up to the lead byte.
                for ($i = $start - 1; $i > 0; $i--)
                {
                    $tempv = ord(substr($title, $i, 1));
                    if ($tempv >= 192) break;
                }
                $start = $i;
            }
        }
    }
    if (strlen($title) <= $len) return substr($title, $start, $len);
    $alen = 0;    // characters counted as half width
    $blen = 0;    // characters counted as full width
    $realnum = 0; // characters of any width
    $length = 0;  // number of bytes to return
    for ($i = $start; $i < strlen($title); $i++)
    {
        $ctype = 0;
        $cstep = 0;
        $cur = substr($title, $i, 1);
        if ($cur == "&")
        {
            // An HTML entity counts as a single character.
            if (substr($title, $i, 4) == "&lt;")
            {
                $cstep = 4;
                $length += 4;
                $i += 3;
                $realnum++;
                if ($magic)
                {
                    $alen++;
                }
            }
            else if (substr($title, $i, 4) == "&gt;")
            {
                $cstep = 4;
                $length += 4;
                $i += 3;
                $realnum++;
                if ($magic)
                {
                    $alen++;
                }
            }
            else if (substr($title, $i, 5) == "&amp;")
            {
                $cstep = 5;
                $length += 5;
                $i += 4;
                $realnum++;
                if ($magic)
                {
                    $alen++;
                }
            }
            else if (substr($title, $i, 6) == "&quot;")
            {
                $cstep = 6;
                $length += 6;
                $i += 5;
                $realnum++;
                if ($magic)
                {
                    $alen++;
                }
            }
            else if (preg_match("/&#(\d+);?/i", substr($title, $i, 8), $match))
            {
                // Numeric character reference, counted as a fullwidth character.
                $cstep = strlen($match[0]);
                $length += strlen($match[0]);
                $i += strlen($match[0]) - 1;
                $realnum++;
                if ($magic)
                {
                    $blen++;
                    $ctype = 1;
                }
            }
            else
            {
                // A bare "&" that starts no entity: count it as one halfwidth character.
                $cstep = 1;
                $length += 1;
                $realnum++;
                if ($magic)
                {
                    $alen++;
                }
            }
        } else {
            // The lead byte determines how many bytes the character occupies.
            if (ord($cur) >= 252)
            {
                $cstep = 6;
                $length += 6;
                $i += 5;
                $realnum++;
                if ($magic)
                {
                    $blen++;
                    $ctype = 1;
                }
            } elseif (ord($cur) >= 248) {
                $cstep = 5;
                $length += 5;
                $i += 4;
                $realnum++;
                if ($magic)
                {
                    $ctype = 1;
                    $blen++;
                }
            } elseif (ord($cur) >= 240) {
                $cstep = 4;
                $length += 4;
                $i += 3;
                $realnum++;
                if ($magic)
                {
                    $blen++;
                    $ctype = 1;
                }
            } elseif (ord($cur) >= 224) {
                $cstep = 3;
                $length += 3;
                $i += 2;
                $realnum++;
                if ($magic)
                {
                    $ctype = 1;
                    $blen++;
                }
            } elseif (ord($cur) >= 192) {
                $cstep = 2;
                $length += 2;
                $i += 1;
                $realnum++;
                if ($magic)
                {
                    $blen++;
                    $ctype = 1;
                }
            } elseif (ord($cur) >= 128) {
                // Stray continuation byte: keep it but do not count it as a character.
                $length += 1;
            } else {
                $cstep = 1;
                $length += 1;
                $realnum++;
                if ($magic)
                {
                    if (ord($cur) >= 65 && ord($cur) <= 90)
                    {
                        // Uppercase letters are counted as full width.
                        $blen++;
                    } else {
                        $alen++;
                    }
                }
            }
        }
        if ($magic)
        {
            // $len is measured in fullwidth characters, i.e. 2 halfwidth units each.
            if (($blen * 2 + $alen) == ($len * 2)) break;
            if (($blen * 2 + $alen) == ($len * 2 + 1))
            {
                if ($ctype == 1)
                {
                    // The last fullwidth character overshot the limit: drop it.
                    $length -= $cstep;
                    break;
                } else {
                    break;
                }
            }
        } else {
            if ($realnum == $len) break;
        }
    }
    unset($cur);
    unset($alen);
    unset($blen);
    unset($realnum);
    unset($ctype);
    unset($cstep);
    return substr($title, $start, $length);
}