This post describes how I troubleshot a failure to insert special (control) characters when loading database data through the DBUnit test framework. I hope it offers some ideas to anyone who runs into a similar problem.
Background:
In the unit tests for the module that interacts with the database, the ext field of a table has to be populated before the code under test reads and processes it. The format of the ext field is key1 CTRL^D value1 CTRL^C key2 CTRL^D value2. DBUnit is a database testing framework built as a JUnit extension. In this project, the data to be inserted into the database is organized in XML files. Part of one such file is as follows:
<?xml version="1.0" encoding="UTF-8"?> <dataset>
In Java, CTRL^D is (char) 4 and CTRL^C is (char) 3. As a reminder, the string \u0003 in Java source code is a Unicode escape that stands for the single character (char) 3. In the unit test, we found that what should have been CTRL^C (\u0003) came back from the database as six separate characters: \, u, 0, 0, 0, 3.
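The difference between the one-character escape and the six-character text can be seen in a small Java sketch. The class and constant names here are made up for illustration and are not part of the project.

```java
class EscapeDemo {
    // The compiler converts this Unicode escape into one control character, code point 3.
    static final String CONTROL = "\u0003";

    // An escaped backslash followed by the letters u0003: six ordinary characters.
    // This is what ended up in the database.
    static final String LITERAL = "\\u0003";

    public static void main(String[] args) {
        System.out.println(CONTROL.length());               // 1
        System.out.println(LITERAL.length());               // 6
        System.out.println(CONTROL.charAt(0) == (char) 3);  // true
    }
}
```

An XML file is not Java source, so nothing converts the text \u0003 in the data file into a control character; it is read as six plain characters.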
Troubleshooting Process:
Since the value read from the database was wrong, the value inserted must have been wrong. I did not know Java well and had not looked closely at the packages the unit test imported, so I did not realize that JUnit and DBUnit were in use. As a result, I spent a while studying DBUnitBaseTest, the unit test base class written by a colleague, treating the issue as a transcoding problem, without result. Only after reading the source code of that base class did I learn that DBUnit was involved. I also found that the base class does no special processing: it initializes a DataSource from the configuration file and inserts the data from the XML data file into the corresponding database table.
Next, I searched Google for the keywords dbunit \u0003. The query was too specific and turned up little useful information, and I struggled for some time without finding a solution.
Then a colleague suggested using CDATA, so I looked up how CDATA works, which gave me some ideas.
"CDATA is the keyword used in an XML document to tell the XML parser that this part of the content does not need to be parsed and is meant for other programs, such as JavaScript. All text in an XML document is parsed by the parser; only the text in a CDATA section is ignored by the parser."
Later, I found a useful reference on the search results page: numeric character references.
Because XML syntax uses some characters for tags and attributes, it is not possible to directly use those characters inside XML tags or attribute values. To include special characters inside XML files you must use the numeric character reference instead of that character. The numeric character reference must be UTF-8 because the supported encoding for XML files is defined in the prolog as encoding="UTF-8" and should not be changed.
The numeric character reference uses the format:
&#nn; Decimal form
&#xhh; Hexadecimal form
So I tried the following approaches:
1. Write \u0004 directly as a numeric character reference, that is, &#4;. This failed with the error: Character reference "&#4" is an invalid XML character
2. Replace only the backslash in \u0004 with a numeric character reference, that is, &#92;u0004. The result was still six characters, the same as writing \u0004 directly.
3. Use CDATA: ext=<![CDATA["postage\u000410.0\u0003"]>. This failed with an error saying the XML was malformed, so my syntax was probably wrong. (XML does not allow a CDATA section inside an attribute value, so this form cannot work when ext is written as an attribute.)
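Approaches 1 and 2 can be reproduced outside DBUnit with the XML parser that ships with the JDK. The following is only a sketch under assumptions: the element name item and the class name are hypothetical, and it exercises the JDK parser, not DBUnit's own loading code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class Xml10ControlCharDemo {
    /** Parses a one-row XML 1.0 document; returns the ext attribute, or null if the parser rejects it. */
    static String parseExt(String attributeText) {
        String xml = "<?xml version=\"1.0\"?><dataset><item ext=\""
                + attributeText + "\"/></dataset>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element item = (Element) doc.getDocumentElement().getFirstChild();
            return item.getAttribute("ext");
        } catch (SAXException e) {
            return null; // not well-formed, for example an invalid character reference
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Approach 1: a reference to code point 4 is not a legal XML 1.0 character.
        System.out.println(parseExt("a&#4;b") == null);
        // Approach 2: only the backslash is a reference, so the letters u0004 stay plain text.
        System.out.println(parseExt("a&#92;u0004b").length());
    }
}
```

In the second call the parser returns the letter a, a backslash, the five characters u0004, and the letter b, so no control character is ever produced.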
At this point I felt I was getting close to the truth. Every search involving \u0004 had been too narrow to find an answer. Later I discussed the problem with colleagues, and one of them mentioned that these control characters were probably not being encoded correctly, which was the wake-up call. I searched directly for "does xml support control characters" and found the following:
Specifically, 0x1-0x1F and 0x7F-0x9F must be encoded as escapes in XML 1.1. The former were forbidden and the latter were optionally not-escaped in 1.0.
This explains the failure of approach 1: XML 1.0 does not allow these control characters, so &#4; is reported as an invalid XML character. According to the search result above, XML 1.1 does allow them when escaped, so I happily changed the XML version in the data file from 1.0 to 1.1. The result was another error:
org.dbunit.dataset.DataSetException: Line 1: XML version "1.1" is not supported, only XML 1.0 is supported.
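For comparison, the version difference itself can be checked with the JDK's built-in parser, which, as far as I know, accepts XML 1.1 documents. This is a minimal sketch with a hypothetical element name and class name; it says nothing about DBUnit, which rejected version 1.1 as shown in the error above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class XmlVersionDemo {
    /** Parses a one-row document declared with the given XML version; returns ext, or null on a parse error. */
    static String parseExt(String xmlVersion, String attributeText) {
        String xml = "<?xml version=\"" + xmlVersion + "\"?><dataset><item ext=\""
                + attributeText + "\"/></dataset>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // fatal errors are still thrown
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element item = (Element) doc.getDocumentElement().getFirstChild();
            return item.getAttribute("ext");
        } catch (SAXException e) {
            return null;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same attribute text, two declared versions.
        System.out.println(parseExt("1.0", "k&#4;v") == null);
        String ext = parseExt("1.1", "k&#4;v");
        System.out.println(ext != null && ext.charAt(1) == (char) 4);
    }
}
```

So the escaped control character is a valid construct in XML 1.1; the obstacle in this case was that the DBUnit loader only accepts XML 1.0.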
In the end I settled on a simple, crude solution: for this table field, use the DataSource directly and execute an SQL statement through a Statement to update the field to the desired value.
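A minimal sketch of this workaround might look like the following. The table name my_table, the id column, and the class and method names are all hypothetical, since the post does not show the real schema. The sketch also uses a PreparedStatement rather than a plain Statement so that the control characters are passed as a bound parameter instead of being embedded in the SQL text.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;
import java.util.StringJoiner;
import javax.sql.DataSource;

class ExtFieldFixer {
    static final char KEY_VALUE_SEPARATOR = (char) 4; // CTRL^D
    static final char PAIR_SEPARATOR = (char) 3;      // CTRL^C

    /** Builds an ext value in the key CTRL^D value CTRL^C key CTRL^D value format. */
    static String buildExt(Map<String, String> pairs) {
        StringJoiner joiner = new StringJoiner(String.valueOf(PAIR_SEPARATOR));
        for (Map.Entry<String, String> entry : pairs.entrySet()) {
            joiner.add(entry.getKey() + KEY_VALUE_SEPARATOR + entry.getValue());
        }
        return joiner.toString();
    }

    /** Overwrites the ext column of one row; table and column names are assumptions. */
    static int updateExt(DataSource dataSource, long id, String ext) throws SQLException {
        String sql = "UPDATE my_table SET ext = ? WHERE id = ?";
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, ext);
            statement.setLong(2, id);
            return statement.executeUpdate(); // number of rows changed
        }
    }
}
```

In a test, updateExt would be called after DBUnit has inserted the rest of the row from the XML file, so only the field with control characters bypasses the XML data set.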
PS: Later I ran into garbled characters in a Java properties file. I checked the code of DBUnitTest, the unit test base class written by my colleague, and found that the properties file is loaded through the Properties class with prop.load(new FileInputStream(file)). Looking up the documentation of the load method of the Properties class, I found:
The input stream is in a simple line-oriented format as specified in load(Reader)
and is assumed to use the ISO 8859-1 character encoding;
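The effect of that assumption, and one possible fix, can be sketched as follows. The example assumes the properties file is stored as UTF-8, which the post does not state, and the class and method names are hypothetical. It loads the same bytes once through load(InputStream) and once through load(Reader) with an explicit charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEncodingDemo {
    /** Loads with load(InputStream), which decodes the bytes as ISO 8859-1. */
    static Properties loadFromStream(byte[] data) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Loads with load(Reader), decoding the bytes as UTF-8 explicitly. */
    static Properties loadAsUtf8(byte[] data) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // "caf" plus e-acute: the accented letter takes two bytes in UTF-8.
        byte[] data = ("greeting=caf" + (char) 0xE9).getBytes(StandardCharsets.UTF_8);
        System.out.println(loadFromStream(data).getProperty("greeting").length()); // 5, garbled
        System.out.println(loadAsUtf8(data).getProperty("greeting").length());     // 4, intact
    }
}
```

With a real file, the same idea applies by wrapping the FileInputStream in an InputStreamReader with the file's actual charset before passing it to load.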