ORACLE _ extract Chinese characters from strings (excluding fullwidth characters and Japanese and Korean characters)

Source: Internet
Author: User

Reprinted from: http://blog.csdn.net/atgc/article/details/2036799

Thanks to atgc, this problem has been studied for a day and finally solved !!

When extracting Chinese characters from the table, you need to consider character sets. Different character sets have different Chinese characters encoded.
Taking gb2312 as an example, write a function to accurately extract simplified Chinese characters from the table.

Assume that the database character set encoding is gb2312, and the environment variable (registry or other) Character Set is also gb2312 encoding.
The Chinese characters saved to the table are also gb2312 encoded.

That is, the Chinese character is dubyte, And the encoding range of simplified Chinese characters is
B0a1-f7fe
Convert to the 10th hexadecimal format
B0 A1 F7 fe
176,161-247,254

Let's take a look at the definition of the asciistr function.
Non-ASCII characters are converted to the form/xxxx, where XXXX represents a UTF-16 code unit.
However, this does not indicate that the character starting with "/" is a Chinese character.

Example:
SQL> select * from test;

Name
--------------------
Ah oo10 ha
Hello AA
Hello everyone, AA/
☆Sea 123
★ABC

The first record contains a solid pentagram.
Use the asciistr function to convert the data.
SQL> select name, asciistr (name) from test;

Name asciistr (name)
------------------------------------------
Ah oo10,/554aoo10/54c8
Hello AA/4f60/597daa
Hello everyone, AA // 5927/5 bb6/597daa/
☆Sea 123/2606/5927/6 d77123
★ABC/2605abc

We can see that the solid pentagram in the last record starts "/".
In this case, we cannot use asciistr (field) to determine whether there is.

My function is as follows. The basic idea is to determine whether the character encoding is within the Chinese character encoding range specified by gb2312.

create or replace function get_chinese(p_name in varchar2) return varchar2as  v_code     varchar2(30000) := '';  v_chinese  varchar2(4000)  := '';  v_comma    pls_integer;  v_code_q   pls_integer;  v_code_w   pls_integer;begin  if p_name is not null then     select replace(substrb(dump(p_name,1010),instrb(dump(p_name,1010),'ZHS16GBK:')),'ZHS16GBK: ','') into v_code from dual where rownum=1;     for i in 1..length(p_name) loop         if lengthb(substr(p_name,i,1))=2 then            v_comma  := instrb(v_code,',');            v_code_q := to_number(substrb(v_code,1,v_comma-1));            v_code_w := to_number(substrb(v_code,v_comma+1,abs(instrb(v_code,',',1,2)-v_comma-1)));            if v_code_q>=176 and v_code_q<=247 and v_code_w>=161 and v_code_w<=254 then               v_chinese := v_chinese||substr(p_name,i,1);            end if;            v_code := ltrim(v_code,'1234567890');            v_code := ltrim(v_code,',');         end if;         v_code := ltrim(v_code,'1234567890');         v_code := ltrim(v_code,',');     end loop;     return v_chinese;  else     return '';  end if;end;

Okay. Run some statements now.
SQL> select * from test;

Name
--------------------
Ah oo10 ha
Hello AA
Hello everyone, AA/
☆Sea 123
★ABC

5 rows selected.

1. List records with Chinese characters
SQL> select name from test where length (get_chinese (name)> 0;

Name
--------------------
Ah oo10 ha
Hello AA
Hello everyone, AA/
☆Sea 123

4 rows selected.

2. List records with Chinese characters and only list Chinese Characters

SQL> select get_chinese (name) from test where length (get_chinese (name)> 0;

Get_chinese (name)
---------------------------------------------------------------------------
Aha
Hi!
Hello everyone
Sea

4 rows selected.

It must be noted that gb2312 has a total of 6763 Chinese characters, I .e. 72*94-5 = 6763
Here I calculate 72*94, without subtracting the five, and the five are empty. And then subtract
================

Rewrite this function to extract non-Chinese or Chinese Characters
This function has two parameters. The first parameter indicates the string to be extracted, the second parameter is 1, indicating that the Chinese character is extracted, and the non-1 parameter indicates that the non-Chinese character is extracted.

create or replace function get_chinese(  p_name    in varchar2,  p_chinese in varchar2) return varchar2as  v_code         varchar2(30000) := '';  v_chinese      varchar2(4000)  := '';  v_non_chinese  varchar2(4000)  := '';  v_comma        pls_integer;  v_code_q       pls_integer;  v_code_w       pls_integer;begin  if p_name is not null then  select replace(substrb(dump(p_name,1010),instrb(dump(p_name,1010),'ZHS16GBK:')),'ZHS16GBK: ','') into v_code from dual where rownum=1;  for i in 1..length(p_name) loop      if lengthb(substr(p_name,i,1))=2 then         v_comma  := instrb(v_code,',');         v_code_q := to_number(substrb(v_code,1,v_comma-1));         v_code_w := to_number(substrb(v_code,v_comma+1,abs(instrb(v_code,',',1,2)-v_comma-1)));         if v_code_q>=176 and v_code_q<=247 and v_code_w>=161 and v_code_w<=254 then            v_chinese := v_chinese||substr(p_name,i,1);         else            v_non_chinese := v_non_chinese||substr(p_name,i,1);         end if;         v_code := ltrim(v_code,'1234567890');         v_code := ltrim(v_code,',');      else         v_non_chinese := v_non_chinese||substr(p_name,i,1);            end if;      v_code := ltrim(v_code,'1234567890');      v_code := ltrim(v_code,',');  end loop;  if p_chinese = '1' then     return v_chinese;  else     return v_non_chinese;  end if;  else     return '';  end if;end;

SQL> select * From;

Name
--------------------
We,
He (AI) is★Are
His/ah @

SQL> select get_chinese (name, 1) from;

Get_chinese (name, 1)
-----------------------------------------
We
He Aiah, you are.
His.

SQL> select get_chinese (name, 0) from;

Get_chinese (name, 0)
-----------------------------------------
,
()★
/@

SQL>

But later, I found that this function could not scan very clearly, as shown below: (but common words can be scanned out)

If you need to filter these distinctive words, you can:

The to_single_byte function can be used to forcibly convert the full-width to a half-width before computation. If the characters are in Chinese, the system will recognize that the dual-byte cannot be distinguished:
Select * From zk. cm_customer where lengthb (to_single_byte (cust_name)-lengthc (to_single_byte (cust_name)> = 2;

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.