(An old article, but it still has some value, so I have tidied it up a bit. Because a customer project is involved, the content has been desensitized.)
1 Preface
Although there is a solution to this problem, I do not recommend offering it to the customer, for the reasons given in the final conclusions.
2 Problem description
On 2010.10.12 the business unit submitted a problem report (dated 2010.09.14): in the "payment notice sheet details import" function at the XX Railway Bureau, the system cannot recognize the rare character "Jia" (written as "dragon" above "day") and raises an error, so the business has had to be suspended. They hope for a solution.
3 Problem analysis
3.1 Preliminary analysis
We believed that garbled rare characters should not occur in the production system (unless a character is outside the GBK character set), because the system has been running for more than 2 years and some of the rare characters in GB2312 have already been entered.
In the development environment, importing personal basic information that contains rare characters works without problems.
In the 21 test environment (AIX + WebLogic + Oracle), importing personal basic information that contains rare characters fails, and the characters cannot be typed directly into the interface either.
I started analyzing the problem in the 21 test environment:
1. Checked the character set settings of AIX, ORACLE and related components; no problem found;
2. After uploading the file, the information in the BLOB field is correct;
3. Modified the startWebLogic.sh file so that Java is started with the following parameters:
-Dfile.encoding=GBK -Ddefault.client.encoding=GBK, no effect;
4. Modified the client BAT file to add the same parameters, no effect;
Based on item 2, the environment is not the problem; the problem is in the import code, where reading or conversion may cause the garbling.
Check the code. In the batchProcess method of the Xxx.xx.xxxx.PersonInfoInputBO class, I added debug output and found that nothing was garbled just before the data was actually saved; only after the following code ran was the data stored in the database garbled:
if (!bHasErr) { getHibernateTemplate().saveOrUpdateAll(arr); }
At this point the code appeared to be fine, so only the database driver remained under suspicion. I verified this as follows:
1. Delete the WEB-INF/lib/classes12.zip file.
2. In the EPRK_A environment on 17, change the database connection pool driver type from Oracle to WebLogic (database: Dtp/[email protected]_db).
After re-importing the data, the result was correct and no garbling occurred.
Also, the rare character cannot be entered in the interface, which is suspected to be related to the character set settings of the client program (JRE).
The problem could therefore be attributed to the Oracle JDBC driver. The Oracle Thin JDBC driver (ojdbc14.jar/classes12.jar) differs from the JDBC drivers of other databases in its character set settings. The JDBC drivers of other databases (such as DB2, SQL Server, etc.) allow a character set to be specified in the URL, but Oracle's JDBC driver does not; it automatically uses the current character set of the system and user.
Using the WebLogic database driver resolves this problem and improves performance, but the reliability of a third-party driver is questionable and requires large-scale testing.
The issue was not actually resolved.
The conclusions above turned out to be wrong.
3.2 Further analysis
Colleagues on other project teams called to say they had the same problem. I felt this problem was a real challenge and needed a clear answer, because it will come up again in the future; it also felt like a serious affront during a relatively idle period of mine, so I was determined to solve it. It took a week of countless checks, releases, debugging, Googling, midnight oil, hope, disappointment, despair ... and then, finally, a conclusion. I could have cried against the wall.
Besides documenting and commemorating my nearly 50 working hours, I also want to remind myself, and any colleagues interested in this document, of the following basic points:
First, do not dismiss any hypothesis you think "should not be possible"; it must be verified;
Second, keep a global view of the problem and verify every step, upstream and downstream, meticulously;
Third, Microsoft, IBM, ORACLE: all TM fucked! The GB encodings are even more fucked!!
Fourth, it is not that your brother here is incompetent; the communist army really is too cunning.
Fifth, with a lot of things, when you reason with them, they bully you; when you bully them, they reason with you. Better to just bully. Alas.
The preliminary analysis concluded that the Oracle database driver was at fault, but on WINDOWS, connecting to the same database with the Oracle JDBC driver works without problems, so that conclusion is wrong. It remains true, however, that the Oracle driver has no character set setting.
I still suspected an environment problem and repeatedly switched between GBK and GB18030. Because every attempt required a release on AIX, this process took up almost 80% of the time. I tried almost every possibility and combination, and the problem was still not resolved.
3.2.1 Key points of the environment check
1. Check the Oracle database character set:
SELECT * FROM nls_database_parameters;
Note the entry marked with the red box in the screenshot. In addition, the following SQL statements are related:
SELECT * FROM nls_instance_parameters;
SELECT * FROM nls_session_parameters;
SELECT userenv('language') FROM dual;
A question here: as I understand it, ZHS16GBK corresponds to the GBK encoding. According to the database, GB18030 should correspond to the ZHS32GB18030 encoding, but because changing the character set after the Oracle database has been installed carries considerable risk, I did not try it. The following SQL lists all character sets supported by Oracle:
SELECT * FROM v$nls_valid_values v WHERE v.parameter = 'CHARACTERSET';
2. Check the AIX system character set (not needed on WINDOWS)
3. Check the AIX user environment settings
env
LANG should be GB18030 or ZH_CN.GB18030 (note the case)
NLS_LANG should be AMERICAN_AMERICA.ZHS16GBK. In testing, this parameter made no difference, but the relevant documents require it, so it is better to set it.
4. JAVA
Add the following parameters to the Java startup scripts of the server (TOMCAT, WebLogic) and of the client:
SUN JDK: -Dfile.encoding=GB18030
IBM JDK: -Dibm.stream.nio=true -Dfile.encoding=GB18030
5. Check the environment at run time
Refer to the following code:
(omitted)
You can embed such code in a back-end business processing class or elsewhere and direct its output to the logger, paying attention to the encoding-related values in the output.
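The original checking code was omitted, so the following is only a minimal sketch of what such a runtime check might look like. The class name EncodingReport and the choice of properties are assumptions made for illustration; some of the properties are JVM-specific and may not be set on every JDK.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not from the original project) that collects encoding-related runtime settings. */
final class EncodingReport {

    // System properties that commonly influence character conversion in a JVM.
    // Some (e.g. sun.jnu.encoding, sun.io.unicode.encoding) are implementation-specific
    // and may be absent, in which case "<not set>" is reported.
    private static final String[] KEYS = {
        "file.encoding", "sun.jnu.encoding", "sun.io.unicode.encoding",
        "user.language", "user.country", "java.vendor", "java.version", "os.name"
    };

    /** Returns the settings in a stable order so that logs from two machines can be compared. */
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        for (String key : KEYS) {
            report.put(key, System.getProperty(key, "<not set>"));
        }
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        report.put("GB18030 supported", String.valueOf(Charset.isSupported("GB18030")));
        return report;
    }

    public static void main(String[] args) {
        // In a server you would send these lines to the logger instead of stdout.
        for (Map.Entry<String, String> e : collect().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Running the same class on the development machine and on the AIX test server and comparing the two outputs line by line is one way to spot a differing default encoding.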
A rather tiresome detail: on Windows the sun.io.unicode.encoding value is UnicodeLittle, while on AIX it is UnicodeBig. What a TM mess. Why Little on Windows and Big on AIX? Just because of the "Micro" in Microsoft? That is too TM arbitrary. Oh, I'm a Big man!
BTW: Kim Jong Il is the son of Kim Il Sung. It is said to have been translated into E: The king who was Fxxking is the son of the king who had fxxked.
3.2.2 Download and install the IBM JDK
Having temporarily ruled out the Oracle JDBC driver, I traced the code and found that before JdbcTemplate was called to save the VO, the correct Chinese character could be seen (it could be written to the logger). This was my biggest mistake and oversight: in fact the character was already garbled at that point. Believing it was a problem in Hibernate's encoding conversion, I frantically traced and analyzed the related Hibernate code and found that although the character was displayed, its value was empty. Suspicion then moved to the IBM JDK.
Sun has never provided a JDK for the IBM POWER series CPUs. Before JDK 1.4, IBM developed the JDK on the basis of the Sun JDK, which was modified to form the IBM JDK, JDK 1.4 and later. Because the problem only occurred in the test environment, and sending a release there each time was inconvenient, I decided to download the IBM JDK onto my own machine to see whether the problem could be reproduced.
The IBM JDK is very hard to download. Only JRE downloads are available on the official web site, and they must be installed on IBM machines, because the installer checks the BIOS! fxxk!
NND, treating a dead horse as a live one, and refusing to give up, I decided to download everything Java-related that the IBM official website offers. Fortunately (the Americans are always overbearing, after all), the Eclipse-based development environment provided by IBM contains IBM JDK 1.6, a complete JDK with no BIOS restriction. I have forgotten where exactly I found it, but the file name after download is:
Ibm_developmentpackage_for_eclipse_win32_3.0.0.zip
All right, got it done.
3.2.3 Using the IBM JDK with embedded Tomcat
Change the JRE of the project in the development environment to the JRE provided in the IBM JDK. Starting Tomcat then produces an error:
On checking, Tomcat's SSL setting defaults to SUN's X509 algorithm (SunX509), but in the IBM JDK it was changed to IbmX509.
How to set it? Really annoying. After some wild attempts, the modification is as follows:
Modify StartServer:
1. At the appropriate place, add c.setAttribute("algorithm", "IbmX509");
2. Comment out the code in the main function that is related to XML processing, such as:
I don't know whether that is enough. I guess not: related files in the IBM JDK also need to be modified. Exactly what I changed I do not remember, and do not want to remember. The final files are as follows:
1, put the Xerces.jar file in the Ibm_sdk60\jre\lib directory;
2. Put the xerces.properties file in the same directory.
With the above changes, StartServer runs normally on the IBM JDK.
The debugging environment is ready, so the work can now be a little more comfortable.
3.3 Troubleshooting
3.3.1 Viewing the contents of the database BLOB field
As described earlier, after a file is uploaded, the data in the BLOB field in the database does not appear to have any problem.
Information in the database BLOB:
Viewed in hexadecimal:
That is why I had been convinced, without any doubt, that the problem lay in the import operation. Having reached this point, I could only look further upstream. Based on the current information, I suspected the upload itself, for the following reasons:
1. When tracing the "import" code, I found that although the character could be written to the log, its value was empty;
2. The Chinese characters that GB18030 adds to GBK (especially in the FE zone) have two codes in Unicode;
3. The JDK default encoding uses Unicode (UTF8);
4. The WINDOWS default encoding uses GBK.
3.3.2 Code conversion test in Oracle
SELECT CONVERT('fart', 'ZHS16GBK', 'ZHS16GBK') gbk,
       CONVERT('Booty', 'ZHS32GB18030', 'ZHS32GB18030') gb18030,
       CONVERT('fart', 'UTF8', 'UTF8') utf8,
       CONVERT('Booty', 'ZHS16GBK', 'UTF8') utf2gbk,
       CONVERT('fart', 'ZHS32GB18030', 'ZHS16GBK') gbk2gb18030,
       CONVERT('Booty', 'ZHS32GB18030', 'UTF8') utf2gb18030,
       CONVERT(CONVERT('fart', 'UTF8', 'UTF8'), 'ZHS32GB18030', 'UTF8') utf2gb18030
FROM dual;
Description: CONVERT(content, target character set, source character set)
Pay special attention to gbk2gb18030: it is garbled. GBK is a subset of GB18030, and according to the official description transcoding between them should not be a problem. In practice, transcoding "fart" works, but the rare character does not. My personal inference: Oracle's internal default character set is Unicode (UTF8), and transcoding goes through Unicode as an intermediate step, so a character that has two encodings in Unicode but only one in GB18030 comes out garbled.
View the encoding in the source file:
There's no problem! Alas, nothing seems wrong here, and the hexadecimal encoding is the same as in the database BLOB field, but the key to the problem is right here. Let me try to explain what I understand, which is not necessarily right:
The rare character is not included in GBK, yet UltraEdit on WINDOWS displays it normally. Why? Because I have installed Office (or other software) that provides support for the GB18030 character set, while WINDOWS itself still uses GBK encoding by default (WIN 7 too)! Being able to display a character does not mean its encoding is correct! I also did not specify an encoding format for the TXT file. So although it displays fine, when the IBM JDK reads the file as a byte stream it assumes the file's encoding is Unicode (UTF8). Even if I explicitly request transcoding to GB18030, the conversion is then from Unicode (UTF8) to GB18030, and because this character has two Unicode encodings that correspond to the same GB18030 code, the result is garbled.
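The general point, that the same bytes mean different things depending on the charset assumed when decoding, can be illustrated with a small sketch. It deliberately uses the common character U+4E2D (bytes D6 D0 in GB2312/GBK/GB18030) rather than the rare character from the incident, and the class name DecodeDemo is made up for this example.

```java
import java.nio.charset.Charset;

/** Illustrative only: shows that decoding depends on the charset the reader assumes. */
final class DecodeDemo {

    /** Decodes raw bytes with an explicitly named charset instead of the platform default. */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    /** Formats bytes as upper-case hex pairs, similar to a hex editor view. */
    static String hex(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two bytes a GBK or GB18030 text file contains for U+4E2D.
        byte[] raw = {(byte) 0xD6, (byte) 0xD0};
        System.out.println("bytes on disk     : " + hex(raw));
        System.out.println("read as GB18030   : " + decode(raw, "GB18030"));
        // Not valid UTF-8, so the decoder substitutes replacement characters.
        System.out.println("read as UTF-8     : " + decode(raw, "UTF-8"));
        System.out.println("platform default  : " + Charset.defaultCharset());
    }
}
```

The bytes on disk never change; only the reader's assumption does, which is why a hex dump that looks correct does not by itself prove that the import will decode the file correctly.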
At this point, I suspected that the encoding format of the TXT file was the problem. Hey.
3.3.3 Modifying the source file encoding format
I tried transcoding with UltraEdit, but it does not support GB18030 ... so I downloaded EditPlus, which does. After transcoding, files with the same content but different encoding formats differ in size (the GB18030-encoded file has more bytes), and the rare character in the GB18030-encoded file is garbled in Notepad. That garbling is promising: the file is not GBK, not Unicode, it is GB18030.
Open the TXT file with EditPlus and use Save As:
Note: by default there is no GB18030 in the encoding box; you first need to use the "more ..." option to move the required encoding to the left side of the popup box:
The contents of the GB18030 file are not displayed correctly in Notepad:
Modify the source code accordingly
Xxx.xx.xxxx.mapping.DataMapping
public Document xsltTransform(Document sourceDoc, MappingContext context) {
    Document targetDoc = XmlUtils.getDocumentBuilder().newDocument();
    System.out.println("Begin XSLT Transform" + new Date() + ":" + new Date().getTime());
    try {
        mappingContextLocal.set(context);
        Transformer transformer = template.newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "GB18030");
        transformer.transform(new DOMSource(sourceDoc), new DOMResult(targetDoc));
    } catch (TransformerException ex) {
        throw new EaiException(ex);
    }
    return targetDoc;
}
Add the code marked in red in the original, i.e. the line that sets the output encoding to "GB18030".
private Document getSourceDocFromExcel(Object src, Worksheet page) throws BiffException, IOException, EaiException {
    ...
    } else if (src instanceof byte[]) {
        byte[] byteData = (byte[]) src;
        // 0xD0 0xCF (-48, -49 as signed bytes) start the OLE2 signature of .xls files
        if (byteData.length > 2 && byteData[0] == -48 && byteData[1] == -49) {
            workbook = Workbook.getWorkbook(new ByteArrayInputStream((byte[]) src));
        } else {
            isCsv = true;
            csvData = new String((byte[]) src, "GB18030");
        }
    } else {
        isCsv = true;
        csvData = (String) src;
    }
    ...
}
Modify the code marked in red in the original, i.e. the explicit "GB18030" charset when building csvData.
Xxx.xx.xxxx.File_upBO
private void afterInputSave(File_upVO vo, String path) {
    if (DataMapping.getMappingContext() != null
            && DataMapping.getMappingContext().getErrMsg() != null
            && "".equals(DataMapping.getMappingContext().getErrMsg())) {
        try {
            vo.setFiles(DataMapping.getMappingContext().getErrMsg().getBytes("GB18030"));
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e.getMessage());
        }
        vo.setMemo("Import data Failed");
    } else {
        vo.setMemo("Import data all Success");
    }
}
Modify the code marked in red in the original, i.e. the explicit "GB18030" charset passed to getBytes.
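What these three changes have in common is that every conversion between text and bytes names its charset explicitly instead of relying on the JVM default. A minimal sketch of that idea follows; the class Gb18030RoundTrip and its method names are hypothetical and not part of the original project.

```java
import java.nio.charset.Charset;

/** Hypothetical helper: one place that fixes the charset used for BLOB content. */
final class Gb18030RoundTrip {

    // Looked up once; Charset.forName throws if the JVM does not ship GB18030.
    private static final Charset GB18030 = Charset.forName("GB18030");

    /** Text to bytes, as would be stored in a BLOB column. */
    static byte[] toBlobBytes(String text) {
        return text.getBytes(GB18030);
    }

    /** Bytes read back from a BLOB column to text. */
    static String fromBlobBytes(byte[] data) {
        return new String(data, GB18030);
    }

    public static void main(String[] args) {
        String common = "\u4e2d\u6587";   // two common characters
        String extA = "\u3400";           // first code point of CJK Extension A
        System.out.println("common : " + toBlobBytes(common).length + " bytes");
        System.out.println("ext A  : " + toBlobBytes(extA).length + " bytes");
        System.out.println("round trip ok: " + common.equals(fromBlobBytes(toBlobBytes(common))));
    }
}
```

Keeping the charset in a single constant means the upload side and the import side cannot silently drift apart, which is the failure mode described above.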
3.3.4 Modifying JRE parameters
Specify the correct JRE parameters for StartServer and LoginUI:
The main one is -Dfile.encoding=GB18030. On the IBM JDK it is best to also add -Dibm.stream.nio=true, because the IBM JDK is stricter than the SUN JDK in IO exception handling; adding it saves trouble.
Note: if you use the client, you must make the corresponding settings in Login.bat.
3.3.5 Run and test
No problems: upload, import and display all work normally.
4 Summary of key points
1. The encoding format of the source file must be correct;
2. The encoding specified in the JRE parameters must be correct;
3. The character set settings used for transcoding in the JAVA code must be correct;
4. The operating system must support GB18030;
5. The language environment of the operating system user must be GB18030, including the database NLS settings;
6. The database character set must be ZHS16GBK. (What if it is ZHS32GB18030? I had no environment to test it, but I expect that would also work.)
7. In this case the source data is of TXT type, so storing it in a BLOB field is not appropriate; with a CLOB, both the situation and the code would be much simpler.
5 Experience
1. The standardization and internationalization of Chinese characters still has a long way to go, and right now the going looks chaotic; who knows how many feet are doing the walking.
2. Oracle, are you really an oracle?
6 Final conclusions and recommendations
First, a grumble. You say the 6,000-plus Chinese characters in GB2312 are not enough for choosing a name; fine, my friend, are the 21,000 characters in GBK enough? Do you really have to go to GB18030? Why? Windows does not support it, many input methods do not support it, and for ID cards, bank cards and a whole series of things that need your information stored in a computer, it may simply not be possible. You insist? Then I'll give you a name: sb!
From my experience in my previous industry, a name should not be too complicated. It is best to use characters from GB2312; characters that are too complex may be unknown even to teachers, or troublesome to write. Think about it: other children's names have ten-odd strokes, your child's name has 100. At the start of a new term ten or so exercise books are handed out, with the instruction to write your name on them right away. Believe it or not, your child will cry while writing and still not get it right. What a sin.
To the point: although there is a solution to this problem, I do not recommend offering it to customers, for the following reasons:
1. The Windows default encoding is GBK, and so is that of many input methods;
2. Even if this rare character is solved, who knows whether other uncommon characters have problems? The Kangxi Dictionary has 47,000 characters; can they all be supported? No!
3. You cannot rule out strong-willed people like Wu Zetian, who coin their own characters on top of the work of Cangjie, Xu Shen and others.
4. As with computers, there are too many dialects and too many sloppy translations; characters that exist in one dialect do not exist in another, although reasonably they should, indeed must, exist.
5. The code would need large-scale inspection and modification.
6. How should test coverage be defined? How do you define uncommon characters? Do we know how many rare characters there are?
Conclusion: the system does not support Chinese characters that are not included in GBK. Either leave the entry blank, or use pinyin instead, or let the customer come up with an alternative.
As for how to judge whether a Chinese character is in GBK, the most convenient solution I can think of at present, though a more troublesome one for the customer, is this: we provide the customer with the GBK Chinese character coding table in the attachments, and the customer checks against it one by one when needed.
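For comparison with the manual lookup, a programmatic check is also conceivable. The sketch below is not what was proposed to the customer; it assumes the JVM ships a GBK charset and that its mapping table agrees with the attached coding table, which would need to be verified (especially for the FE zone). The class name GbkChecker is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Hypothetical validator: tests whether text can be represented in the JVM's GBK charset. */
final class GbkChecker {

    /** True if every character of the text can be encoded in GBK. */
    static boolean isGbkEncodable(String text) {
        CharsetEncoder encoder = Charset.forName("GBK").newEncoder();
        return encoder.canEncode(text);
    }

    /** Returns the index of the first character that cannot be encoded in GBK, or -1 if none. */
    static int firstUnencodable(String text) {
        CharsetEncoder encoder = Charset.forName("GBK").newEncoder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp); // 2 for surrogate pairs
            if (!encoder.canEncode(text.subSequence(i, i + len))) {
                return i;
            }
            i += len;
        }
        return -1;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "\u4e2d\u6587";
        int bad = firstUnencodable(name);
        System.out.println(bad < 0 ? "all characters are in GBK"
                                   : "character at index " + bad + " is not in GBK");
    }
}
```

Such a check could reject an unsupported name at input time with a clear message, rather than letting it fail later during import.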
I do not know whether the customer will accept this. Will they give up on their own? Or will they want it done automatically?
There may be problems with my method or description, but that is my conclusion. Take it or leave it.
7 Attachments
7.1 Chinese character coding tables
GBK Chinese character coding table.xls
GB18030 Chinese character coding table.xls
7.2 Some Chinese characters (FE zone) added to GBK in GB18030
80 Chinese characters and radicals were added, comprising 28 radicals and 52 Chinese characters. Their GBK codes are FE50-FE7E and FE80-FEA0.
When GBK was formulated, Unicode had no such characters, so code points in the private use area were used; the code points for these 80 characters are 0xE815-0xE864. Later, Unicode included the 52 Chinese characters in "CJK Unified Ideographs Extension A", and 14 of the 28 radicals were included in the CJK Radicals Supplement block. So in Unicode, these characters all have two encodings.
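The code ranges above can be explored with a short sketch. It enumerates the 80 two-byte codes and reports how many of them the running JVM decodes into the Unicode Private Use Area under its GBK and GB18030 charsets. The class name FeZone is an assumption for illustration, and the counts depend on the JDK's mapping tables, so no expected numbers are given.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Illustrative probe of the FE zone; results vary with the JDK's charset tables. */
final class FeZone {

    /** The two-byte codes FE50-FE7E and FE80-FEA0 named in the text. */
    static List<byte[]> codes() {
        List<byte[]> out = new ArrayList<>();
        for (int trail = 0x50; trail <= 0xA0; trail++) {
            if (trail == 0x7F) continue; // 0x7F is not used as a trail byte
            out.add(new byte[] {(byte) 0xFE, (byte) trail});
        }
        return out;
    }

    /** True for code points in the BMP Private Use Area (U+E000 to U+F8FF). */
    static boolean isPrivateUse(int codePoint) {
        return codePoint >= 0xE000 && codePoint <= 0xF8FF;
    }

    public static void main(String[] args) {
        List<byte[]> codes = codes();
        System.out.println("code positions: " + codes.size());
        for (String name : new String[] {"GBK", "GB18030"}) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this JVM");
                continue;
            }
            Charset cs = Charset.forName(name);
            int privateUse = 0;
            for (byte[] code : codes) {
                String decoded = new String(code, cs);
                if (!decoded.isEmpty() && isPrivateUse(decoded.codePointAt(0))) {
                    privateUse++;
                }
            }
            System.out.println(name + ": " + privateUse + " of " + codes.size()
                    + " decode to the Private Use Area");
        }
    }
}
```

A difference between the two counts on a given JVM would be one concrete sign of the "two Unicode encodings" situation described above.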
A painful experience caused by the word ""