Research on the use of text_factory attribute of Python sqlite3 module

Source: Internet
Author: User
Tags sqlite database

Writing this article originates from writing a script to write a CSV file (which uses GBK or UTF-8 encoding) into the SQLite database.

Python version: 2.7.9

The Sqlite3 module provides a method such as con = Sqlite3.connect ("d:\\text_factory.db3") to create the database (new library when the file does not exist), the database is encoded as UTF-8 by default, and the encoding is supported with special SQL statements

PRAGMA encoding = "UTF-8";
PRAGMA encoding = "UTF-16";
PRAGMA encoding = "Utf-16le";
PRAGMA encoding = "UTF-16BE";

However, the setup encoding must precede the main library, otherwise it cannot be changed. https://www.sqlite.org/pragma.html#pragma_encoding

Knowing the Text_factory attribute, you should all be aware of the following errors:

Sqlite3. Programmingerror:you must 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (l Ike text_factory = str). It is highly recommended, instead just switch your application to Unicode strings.

The main idea is to recommend that you convert the string into a Unicode string, you want to use a bytestring byte type string (such as ASCII, GBK,UTF-8), you need to add a statement text_factory = str.

Python has two types of strings. A standard string is a single-byte sequence of characters that allows binary data and embedded null characters to be included. A Unicode string is a sequence of double-byte characters, and one character is saved using two bytes, so there can be up to 65536 different Unicode characters. Although the latest Unicode standard supports up to 1 million different characters, Python does not now support this latest standard.

Default text_factory = Unicode, which originally thought this Unicode, str is a function pointer, but seemingly not, is <type ' Unicode ' > and <type ' str ' >

A test verification code is written below:

1 #-*-coding:utf-8-*-2 ImportSqlite33 " "4 GBK UNIC UTF-85 b8a3 798F E7 A6 8F Fu6 d6dd 5DDE E5 B7 9E State7 " "8 9con = Sqlite3.connect (": Memory:")Ten #con = sqlite3.connect ("d:\\text_factory1.db3") One #con.executescript (' PRAGMA encoding = ' UTF-16 '; ') ACur =con.cursor () -  -A_text ="Fu Zhou" theGb_text ="\XB8\XA3\XD6\XDD" -Utf8_text ="\xe7\xa6\x8f\xe5\xb7\x9e" -unicode_text= u"\u798f\u5dde" -  + Print 'Part 1:con.text_factory=str' -Con.text_factory =Str + Printtype (con.text_factory) ACur.execute ("CREATE TABLE table1 (city);") atCur.execute ("INSERT into table1 VALUES (?);", (A_text,)) -Cur.execute ("INSERT into table1 VALUES (?);", (Gb_text,)) -Cur.execute ("INSERT into table1 VALUES (?);", (Utf8_text,)) -Cur.execute ("INSERT into table1 VALUES (?);", (Unicode_text,)) -Cur.execute ("Select City from table1") -res =Cur.fetchall () in Print "--Result:%s"%(RES) -  to Print 'Part 2:con.text_factory=unicode' +Con.text_factory =Unicode - Printtype (con.text_factory) theCur.execute ("CREATE TABLE table2 (city);") *Cur.execute ("INSERT into Table2 VALUES (?);", (A_text,)) $ #Cur.execute ("INSERT into Table2" VALUES (?); ", (Gb_text,))Panax Notoginseng #Cur.execute ("INSERT into Table2" VALUES (?); ", (Utf8_text,)) -Cur.execute ("INSERT into Table2 VALUES (?);", (Unicode_text,)) theCur.execute ("Select City from Table2") +res =Cur.fetchall () A Print "--Result:%s"%(RES) the  + Print 'Part 3:optimizedunicode' -Con.text_factory =Str $Cur.execute ("CREATE TABLE table3 (city);") $Cur.execute ("INSERT into Table3 VALUES (?);", (A_text,)) - #Cur.execute ("INSERT into Table3" VALUES (?); ", (Gb_text,)) -Cur.execute ("INSERT into Table3 VALUES (?);", (Utf8_text,)) theCur.execute ("INSERT into Table3 VALUES (?);", (Unicode_text,)) -Con.text_factory =Sqlite3. OptimizedunicodeWuyi Printtype (con.text_factory) theCur.execute ("Select City from Table3") -res =Cur.fetchall () Wu Print "--Result:%s"%(RES) -  About Print 'Part 4:custom fuction' $Con.text_factory =LambdaX:unicode (x,"GBK","Ignore") - Printtype (con.text_factory) -Cur.execute ("CREATE TABLE table4 (city);") -Cur.execute ("INSERT into Table4 VALUES (?);", (A_text,)) ACur.execute ("INSERT into Table4 VALUES (?);", (Gb_text,)) +Cur.execute ("INSERT into Table4 VALUES (?);", (Utf8_text,)) theCur.execute ("INSERT into Table4 VALUES (?);", (Unicode_text,)) -Cur.execute ("Select City from Table4") $res =Cur.fetchall () the Print "--Result:%s"% (RES)

Printing results:

Part 1:con.text_factory=str<type ' type ' >--  result: [(' Fu Zhou ',), (' \xb8\xa3\xd6\xdd ',), (' \xe7\xa6\x8f\ Xe5\xb7\x9e ',), (' \xe7\xa6\x8f\xe5\xb7\x9e ',)]part 2:con.text_factory=unicode<type ' type ' >--  result: [(U ' Fu Zhou ',), (U ' \u798f\u5dde ',)]part 3:optimizedunicode<type ' type ' >--  result: [(' Fu Zhou ',), (U ' \u798f\ U5dde ',), (U ' \u798f\u5dde ',)]part 4:custom fuction<type ' function ' >--  result: [(U ' Fu Zhou ',), (U ' \u798f\ U5dde ',), (U ' \u7ec2\u5fd3\u7a9e ',), (U ' \u7ec2\u5fd3\u7a9e ',)]

Part 1:unicode is converted into utf-8,utf-8 and GBK, written to the database, and the GBK string is removed and displayed, it needs to be similar to ' GBK chars '. Decode ("cp936") The statement to parse print

Part 2: Default settings, note drop will produce the above classic error, the input range is limited to Unicode object or pure ASCII code

Part 3: Auto-optimized, ASCII as STR object, non-ASCII to Unicode object

Part 4:GBK is converted correctly, Utf-8 and Unicode are transferred to the default encoding Utf-8 storage, both ' \xe7\xa6\x8f\xe5\xb7\x9e ', when they are stored in the database.

IN[16]: Unicode (' \xe7\xa6\x8f\xe5\xb7\x9e ', ' GBK ')
OUT[16]: U ' \u7ec2\u5fd3\u7a9e '

The above results are obtained.

Next, use the software to see how the database is stored.

Use the official Sqlite3.exe and sqlitespy to view, sqlite3.exe because the command line interface, command line with the GBK display; Sqlitespy is shown with UTF, so GBK display garbled. This once again confirms that GBK is allowed to be stored in the database, the raw data is stored, and will not be forced into the database of the default encoding Utf-8 saved .

Connection. text_factory uses this property to control what objects we can get from the text type ( I: This also confirms that when writing to a database, you need to code yourself and not rely on this ). By default, this property is set to the Unicode,sqlite3 module to return a Unicode object for text. If you want to return the ByteString object, you can set it to Str.

For reasons of efficiency, there is also a method that returns a Unicode object only for non- ASCII data, and all other data returns the ByteString object. To activate it, set this property to Sqlite3. Optimizedunicode.

You can also set it to any other Callabel, receive a parameter of type BYTESTIRNG, and return the result object. "From http://www.360doc.com/content/11/1102/10/4910_161017252.shtml"

The above paragraph is an excerpt of the official document's Chinese version of the text_factory description.

In conclusion, I would like to talk about my views * and suggestions for use:

1) when the Sqlite3 module executes the INSERT, the raw data is written, the type is judged according to the Text_factory property before writing, and the default is to determine whether the write is a Unicode object;

2) When reading from a database using Fetchall (), the conversion is based on the Text_factory attribute.

3) Input string is GBK encoded Bytestring,decode to Unicode write, or add text_factory=str direct write, read out is still GBK, the premise requires database encoding for UTF-8, notice with Sqlitespy view is garbled.

4) Input string is UTF-8 encoded bytestring, can be set TEXT_FACTORY=STR direct write directly read out, Sqlitespy view normal display.

5) If it is not a high-performance scenario, transfer to Unicode before storage, the performance cost is also very small, the test data can not be found, like I say a day to study this line of code, it is better to let the machine run fraction seconds each time.

* (because there is no source code to view the Sqlite3 module, so just guess)

In addition, the attached database is set to UTF-16 encoding, resulting in a more chaotic, not recommended.

Part 1:con.text_factory=str<type ' type ' >--  result: [(' Fu Zhou ',), (' \xc2\xb8\xc2\xa3\xef\xbf\xbd\xef\xbf\ Xbd ',), (' \xe7\xa6\x8f\xe5\xb7\x9e ',), (' \xe7\xa6\x8f\xe5\xb7\x9e ',)]part 2:con.text_factory=unicode<type ' type ' >--  Result: [(U ' Fu Zhou ',), (U ' \u798f\u5dde ',)]part 3:optimizedunicode<type ' type ' >--  result: [(' Fu Zhou ',), (U ' \u798f\u5dde ',), (U ' \u798f\u5dde ',)]part 4:custom fuction<type ' function ' >--  result: [(U ' Fu Zhou ',), (U ' \u8d42\u62e2\u951f\u65a4\u62f7 ',), (U ' \u7ec2\u5fd3\u7a9e ',), (U ' \u7ec2\u5fd3\u7a9e ',)]

  

Research on the use of text_factory attribute of Python sqlite3 module

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.