Problem Reproduction:
1. PG client:

postgres=# CREATE TABLE text_test (id int, info text);
CREATE TABLE
postgres=# INSERT INTO text_test VALUES (1, E'\0x00');
ERROR:  invalid byte sequence for encoding "UTF8": 0x00
2. SQL Server generates the data:

CREATE TABLE test_varchar (id int, name varchar(50));
INSERT INTO test_varchar VALUES (1, 'name' + CHAR(0));

Fetching this data through a Java program and inserting it into PG raises the same error:
invalid byte sequence for encoding "UTF8": 0x00
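As a quick illustration (a plain-Java sketch, not the actual migration code), the value SQL Server builds with CHAR(0) contains a real NUL character that survives the trip into a Java String:

```java
public class NulDemo {
    public static void main(String[] args) {
        // Java equivalent of the string SQL Server builds with 'name' + CHAR(0)
        String s = "name" + (char) 0;

        System.out.println(s.length());        // 5: the NUL is a real character
        System.out.println((int) s.charAt(4)); // 0: the trailing unit is 0x00
    }
}
```

It is exactly this trailing 0x00 that PG later rejects on insert.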
At first we suspected that the error came from a failed character-set conversion from GB2312 to UTF-8. UTF-8 is a variable-length encoding of 1 to 6 bytes (per the original design; the current standard limits it to 4). Its encoding rules are as follows:
Bits | Last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4   | Byte 5   | Byte 6
   7 | U+007F          | 0xxxxxxx |          |          |          |          |
  11 | U+07FF          | 110xxxxx | 10xxxxxx |          |          |          |
  16 | U+FFFF          | 1110xxxx | 10xxxxxx | 10xxxxxx |          |          |
  21 | U+1FFFFF        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |          |
  26 | U+3FFFFFF       | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
  31 | U+7FFFFFFF      | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
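These byte patterns are easy to verify in Java, whose `getBytes(StandardCharsets.UTF_8)` performs the encoding (a standalone sketch; the class name is ours):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Patterns {
    public static void main(String[] args) {
        // U+0041 'A' fits in 7 bits  -> one byte, pattern 0xxxxxxx
        // U+00E9 'é' needs 11 bits   -> two bytes, 110xxxxx 10xxxxxx
        // U+0000 NUL also fits in 7 bits -> the single byte 0x00
        for (String s : new String[] {"A", "\u00E9", "\u0000"}) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02X ", b));
            }
            System.out.printf("U+%04X -> %s%n", s.codePointAt(0), hex.toString().trim());
        }
    }
}
```

Running it prints `U+0041 -> 41`, `U+00E9 -> C3 A9`, and `U+0000 -> 00`: the NUL character encodes to a single 0x00 byte, fully within the rules.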
A 0x00 byte is therefore perfectly valid UTF-8 (it matches the single-byte pattern above), which surprised us. We then found two pieces of evidence that confirmed where the problem really lies:
1. From Stack Overflow:

"PostgreSQL doesn't support storing NULL (\0x00) characters in text fields (this is obviously different from the database NULL value, which is fully supported). If you need to store the NULL character, you must use a bytea field, which should store anything you want but won't support text operations on it. Given that PostgreSQL doesn't support it in text values, there's no good way to get it to remove it. You could import your data into bytea and later convert it to text using a special function (in Perl or something, maybe?), but it's likely going to be easier to do this in preprocessing before you load it."

Source: http://stackoverflow.com/questions/1347646/postgres-error-on-insert-error-invalid-byte-sequence-for-encoding-utf8-0x0
2. From the SQL Server bcp documentation, the list of supported terminators:

Terminating character | Indicated by
Tab | \t (the default field terminator)
Newline character | \n (the default row terminator)
Carriage return/line feed | \r\n
Backslash | \\
Null terminator (nonvisible terminator) | \0
Any printable character (control characters are not printable, except null, tab, newline, and carriage return) | *, A, t, l, and so on
String of up to 10 printable characters, including some or all of the terminators listed | **\t**, end, !!!!!!!!!!, \t-\n, and so on
Source:http://msdn.microsoft.com/en-us/library/ms191485.aspx
This confirmed that PG handles the NUL character differently from SQL Server, which is exactly why the error appears.
The PG code that raises the error is in pg_verify_mbstr_len (src/backend/utils/mb/wchar.c):

    if (!IS_HIGHBIT_SET(*mbstr))
    {
        if (*mbstr != '\0')
        {
            mb_len++;
            mbstr++;
            len--;
            continue;
        }
        if (noError)
            return -1;
        report_invalid_encoding(encoding, mbstr, len);
    }

with the supporting macros:

    #define IS_HIGHBIT_SET(ch)  ((unsigned char) (ch) & HIGHBIT)
    #define HIGHBIT             (0x80)
report_invalid_encoding is the function that produces the error message:

invalid byte sequence for encoding "UTF8": 0x00

The real cause is now clear: when *mbstr is 0x00, !IS_HIGHBIT_SET(*mbstr) is true, so the code checks whether *mbstr is '\0'; since it is, the loop does not accept the byte and instead falls through to report_invalid_encoding, raising the error. So the problem occurs because PG and SQL Server treat the NUL character differently.
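That control flow can be restated as a small Java sketch (the names mirror PG's C macros, but this is an illustration, not PG source):

```java
public class HighbitCheck {
    static final int HIGHBIT = 0x80;

    // Java analogue of PG's IS_HIGHBIT_SET(ch) macro
    static boolean isHighbitSet(byte ch) {
        return ((ch & 0xFF) & HIGHBIT) != 0;
    }

    public static void main(String[] args) {
        byte nul = 0x00;
        // The high bit is clear, so PG enters the single-byte (ASCII) branch...
        System.out.println(!isHighbitSet(nul)); // true
        // ...but the byte equals '\0', so instead of accepting it the loop
        // falls through to report_invalid_encoding.
        System.out.println(nul == 0);           // true
    }
}
```

Any other byte with the high bit clear (e.g. 'a', 0x61) passes the '\0' test and is accepted; only 0x00 is singled out for rejection.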
Solutions:

1. Fix the source data in SQL Server:

Update: this seems to work:

select * from TABLE
where UNICODE(SUBSTRING(naughtyField, LEN(naughtyField), 1)) = 0

so:

UPDATE TABLE
SET naughtyField = SUBSTRING(naughtyField, 1, LEN(naughtyField) - 1)
WHERE UNICODE(SUBSTRING(naughtyField, LEN(naughtyField), 1)) = 0

Source: http://stackoverflow.com/questions/3533320/sql-server-remove-end-string-character-0-from-data
2. Fix it in the application: after fetching the data from SQL Server, strip the NUL characters before inserting into PG, the same transformation as in the first method.
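A minimal sketch of the second approach, assuming the values arrive as Java Strings from the SQL Server driver (the helper name stripNul is ours):

```java
public class StripNul {
    // Remove every NUL character before handing the value to PG
    static String stripNul(String s) {
        return (s == null) ? null : s.replace("\u0000", "");
    }

    public static void main(String[] args) {
        String fromSqlServer = "name" + (char) 0; // what 'name' + CHAR(0) yields
        String cleaned = stripNul(fromSqlServer);

        System.out.println(cleaned);          // name
        System.out.println(cleaned.length()); // 4
    }
}
```

Unlike the SQL-side fix, which trims only a trailing NUL, `replace` removes NULs anywhere in the string, so it also handles embedded ones.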