Methods of processing str and Unicode in Python
2015/03/25 · Basic Knowledge · Python
Source: liuaiqi627's Blog
Handling Chinese text in Python 2.x is a headache. The articles about it on the web are of uneven quality and sometimes wrong, so I intend to write a summary here. I will keep learning and keep revising this post. This assumes that the reader already has the basic knowledge associated with
Question one:
The Save As dialog of Windows Notepad can convert a file between the GBK, Unicode, Unicode big endian, and UTF-8 encodings. They are all TXT files, so how does Windows recognize which encoding was used?
I noticed early on that TXT files saved as Unicode, Unicode big endian, and UTF-8 begin with a few extra bytes, namely FF
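The extra leading bytes mentioned above are byte-order marks (BOMs). The following is a minimal Python 3 sketch of how a reader could use them to pick a decoder. The helper names `sniff_bom` and `decode_notepad_bytes` are hypothetical, and falling back to GBK when no BOM is present is an assumption that only fits a Simplified Chinese Windows system.

```python
import codecs

# BOMs for the three Notepad encodings discussed above.
# The codec names on the right are Python's; the pairing is this sketch's choice.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE (Notepad's "Unicode")
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF (Notepad's "Unicode big endian")
]


def sniff_bom(data, default="gbk"):
    """Return (codec_name, bom_length) guessed from the leading bytes.

    Assumption: data without a BOM is treated as `default` (GBK here).
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def decode_notepad_bytes(data):
    """Strip the BOM, if any, and decode the rest with the guessed codec."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name)
```

A file's raw bytes can be obtained with `open(path, "rb").read()` and passed to `decode_notepad_bytes`. A BOM-less file is only guessed at, so the fallback may be wrong for other locales.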
In SQL Server, data types fall into two categories: Unicode data types and non-Unicode data types. In general, if the database stores text in multiple languages, I recommend using Unicode data types rather than non-Unicode ones.
First, the reasons for using
1. Review of three encodings
ANSI strings are the ones we know best: an English letter occupies one byte, a Chinese character occupies 2 bytes, and the string ends with \0. They are common in TXT text files.
Unicode strings: each character (Chinese character or English letter) occupies 2 bytes. In the VC++ world, Microsoft prefers Unicode, for example wchar_t.
UTF-8 is a variable-length, more compact form of Unicode. English A is expressed as 0x0041 in
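The byte counts described above can be checked with a short Python 3 sketch. It assumes that "ANSI" means GBK (the code page of a Simplified Chinese Windows system) and that "Unicode" means UTF-16, which is how Windows uses the term. The helper name `encoded_sizes` is hypothetical.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character takes in each encoding."""
    return {
        "gbk": len(ch.encode("gbk")),           # "ANSI" on Chinese Windows (assumption)
        "utf-16": len(ch.encode("utf-16-le")),  # Windows "Unicode", no BOM
        "utf-8": len(ch.encode("utf-8")),
    }


sizes_a = encoded_sizes("A")     # an English letter
sizes_zh = encoded_sizes("中")   # a Chinese character

# In big-endian UTF-16 the letter A is the two bytes 00 41, i.e. 0x0041.
a_utf16_be = "A".encode("utf-16-be")
```

The string terminator \0 is a C convention and is not part of the encoded bytes counted here.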
How to convert between Unicode and UTF-8 in PHP
Recently I happened to need Unicode encoding conversion and went to check the PHP library functions, but did not find a function that can
3 Character encoding model
Programmers often face complex problems, and the simplest way to reduce complexity is to divide and conquer. Peter Constable describes a four-layer model of character encoding in his article "Character set encoding basics: Understanding character set encodings and legacy encodings". I think this description shows clearly what happens in character encoding, so I will introduce it here.
3.1 Abstract character repertoire
The first layer of charac
The previous article, on getting out of the Runtime Library pit in Visual Studio, helped us understand the various C/C++ runtime libraries on Windows and where they come from. That is a place where C++ development goes astray particularly easily, so we summarized it there. In this article we continue with another concept that is easily confused in C++ development: multibyte character sets and Unicode charact
1. Unicode character set
UTF, whose full name is Unicode Transformation Format, is the practical representation of UCS (Universal Multiple-Octet Coded Character Set, the universal character set defined by international standard ISO 10646). It is classified by the number of bits in its basic unit into three forms: UTF-8, UTF-16, and UTF-32. UTF can be said to be a collection of other character sets, it m
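To make the three forms concrete, the sketch below counts how many basic units (8, 16, or 32 bits wide) each form needs for a given string. It is a Python 3 illustration only; the names `UNIT_BYTES` and `code_units` are hypothetical, and the little-endian codecs are chosen simply so that no BOM is added to the count.

```python
# Width in bytes of one basic unit for each form.
UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_units(text, form):
    """Return how many basic units `text` occupies in the given UTF form."""
    return len(text.encode(form)) // UNIT_BYTES[form]


ascii_letter = "A"
cjk_char = "中"                  # U+4E2D, inside the Basic Multilingual Plane
supplementary = "\U00020000"    # U+20000, outside the Basic Multilingual Plane

units = {
    name: [code_units(ch, form) for form in ("utf-8", "utf-16-le", "utf-32-le")]
    for name, ch in [
        ("ascii", ascii_letter),
        ("cjk", cjk_char),
        ("supplementary", supplementary),
    ]
}
```

UTF-32 always uses one unit per character, UTF-16 uses one or two, and UTF-8 uses one to four.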
From ASCII code to Unicode
Double-byte character sets
So far, we've seen character sets of 256 characters (extended ASCII). But the writing systems of China, Japan, and Korea have about 21,000 ideographs. How can these languages be accommodated while still keeping some compatibility with ASCII?
The solution (if that is the right word) is the double-byte character set (DBCS). A DBCS starts with 256 codes, just like ASCII. As with any well-behaved code page, the
ANSI, UTF-8, Unicode
ANSI, UTF-8, and Unicode are three encoding formats for character codes. One character can be encoded in ANSI, UTF-8, or Unicode format; the three formats differ only in representation and denote the same content.
Unicode and UTF-8
1. ASCII code
We know that inside a computer, all information is ultimately represented as a binary string. Each binary bit has two states, 0 and 1, so eight bits can be combined into 256 states; this unit is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in all, from 00000000 to 11111111.
In the 1960s, the United States deve
(Declaration: This article is original. If it is reproduced, please indicate the author and the original link)
/* Author: Wu. Jian (WU Jian) English name: Sword
/* Date: 2007-12-13
/* Purpose: Knowledge Sharing
These days I ran into the problem of converting UTF-8 to GB2312 in an embedded environment where no API is available. I checked a lot of material online, but most of it calls interfaces provided by VC or Linux. Here I will summarize my work over the past two days.
In general,
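For comparison with the embedded case, here is what the same conversion looks like when a library is available. This is a minimal Python 3 sketch, not the author's method: it relies on Python's built-in `utf-8` and `gb2312` codecs, and the function name `utf8_to_gb2312` is hypothetical. It shows that the conversion passes through Unicode code points as the intermediate form.

```python
def utf8_to_gb2312(data, errors="strict"):
    """Convert UTF-8 bytes to GB2312 bytes via Unicode text.

    `errors` is passed to the encoder: "strict" raises UnicodeEncodeError
    for characters that GB2312 cannot represent, "replace" substitutes "?".
    """
    text = data.decode("utf-8")           # UTF-8 bytes -> Unicode code points
    return text.encode("gb2312", errors)  # code points -> GB2312 bytes


gb_bytes = utf8_to_gb2312("中文".encode("utf-8"))

# A character outside GB2312 (here an emoji) cannot be converted losslessly.
lossy = utf8_to_gb2312("\U0001F600".encode("utf-8"), errors="replace")
```

Without such a codec, the second step needs a lookup table from Unicode code points to GB2312 codes, since the two are not related by a formula.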
A good article on str and unicode
Sorting out Python encoding-related content
Note: the following discussion applies to Python 2.x; py3k has yet to be tried
Begin
When handling Chinese in Python, for example when reading files or messages, HTTP parameters, and so on,
you run the code and find garbled text (in string processing, file reads and writes, print).
Most people then call encode/decode by trial and error while debugging, without thinking clearly about why the text is garbled.
So the most common errors tha
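Rather than calling encode/decode by trial and error, it helps to decode bytes once at the input boundary, work with text in between, and encode once on output. The sketch below illustrates this in Python 3 syntax, where `bytes` plays the role of Python 2.x `str` and `str` plays the role of Python 2.x `unicode`. The helper names `to_text`, `to_bytes`, and `try_decode` are hypothetical.

```python
def to_text(raw, encoding):
    """Decode bytes to text; call this once, where the data enters the program."""
    return raw.decode(encoding)


def to_bytes(text, encoding):
    """Encode text to bytes; call this once, where the data leaves the program."""
    return text.encode(encoding)


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


gbk_bytes = to_bytes("中文", "gbk")

# Decoding with the encoding the bytes were written in round-trips cleanly.
ok = to_text(gbk_bytes, "gbk")

# Decoding with the wrong encoding is the typical source of errors and garbled text.
wrong = try_decode(gbk_bytes, "utf-8")
```

Here the wrong guess fails loudly with `UnicodeDecodeError`. Other wrong guesses decode without error and silently produce garbled characters, which is why the actual encoding of the input has to be known rather than guessed.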
Before the main content starts, let's first understand a basic concept: the coded character set.
Coded character set: a coded character set is a character set that assigns a unique number to each character. The core of the Unicode standard is a coded character set. The letter "A" is encoded as 0041 (base 16) and the character "€" is encoded as 20AC (base 16). The Unicode standard always uses hexadecimal
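As a small illustration of a coded character set, Python 3's built-in `ord` and `chr` map between a character and its Unicode number. The helper name `code_point_label` is hypothetical; the "U+" prefix followed by at least four hexadecimal digits is the usual notation for such numbers.

```python
def code_point_label(ch):
    """Return the Unicode number of one character in U+XXXX notation."""
    return "U+%04X" % ord(ch)


label_a = code_point_label("A")     # Latin capital letter A
label_euro = code_point_label("€")  # euro sign

# chr() goes the other way, from number to character.
euro = chr(0x20AC)
```

These numbers identify characters only; how they are stored as bytes is decided separately by an encoding such as UTF-8 or UTF-16.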
lpUsedDefaultChar must be NULL when this value is set
CP_UTF8: UTF-8; both lpDefaultChar and lpUsedDefaultChar must be NULL when this value is set
I think CP_ACP and CP_UTF8 are the most common ones. The former converts wide characters to ANSI and the latter converts them to UTF-8.
dwFlags: specifies how to handle characters that cannot be converted. The function runs faster if this parameter is not set, so I set it to 0. The configurable values are listed below:
WC_NO_BEST_FIT_CHARS converts
software uses the extended ASCII box-drawing characters to draw tables; when such software is used on a Chinese system, these characters are mistaken for Chinese characters and the display is garbled. In addition, because countries and regions each have their own text encoding rules, and those rules conflict with one another, exchanging information between countries and regions becomes very troublesome.
2. The birth of Unicode
To really solve this problem, one cannot start from the perspective of extended ASCII,
ANSI and Unicode encoding: the meaning of TCHAR, LPSTR, LPCSTR, LPWSTR, LPCWSTR, LPTSTR, and LPCTSTR
A character can be expressed in 1 byte, that is, ANSI encoding;
A character can also be expressed in 2 bytes, that is, Unicode encoding (Unicode actually covers more than 2 bytes can hold).
Visual C++ supports char and wchar_t as the primitive data
Recently I happened to need Unicode encoding conversion, so I went to check the PHP library functions, but did not find a function that can encode and decode strings as Unicode! Never mind: if it can't be found, implement it yourself ...
The difference between Unicode and UTF-8 encoding
Unicode is a character set, and UTF-8 is on
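The distinction between the character set (numbers for characters) and UTF-8 (a way to write those numbers as bytes) can be shown by doing the encoding by hand. The sketch below is written in Python 3 rather than PHP and is not the article's implementation; the function name `utf8_encode_code_point` is hypothetical. It follows the standard UTF-8 bit layout of one to four bytes.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point (an int) as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point: %r" % cp)
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


encoded_zh = utf8_encode_code_point(0x4E2D)  # the character 中
```

In practice Python's own `"中".encode("utf-8")` does the same job; the manual version only makes the bit layout visible.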
Python Unicode and Chinese Processing
From: http://hi.baidu.com/jackleehit/blog/item/ea93618e1051131cb31bbaac.html
Unicode in Python is confusing and difficult to understand. This article tries to resolve these problems thoroughly;
1. The relationships among Unicode, GBK, GB2312, and UTF-8;
This article explains it well: http://www.pythonclub.org/python-basic/encode-detail UTF-8 i