Development Notes: How to Handle HTML Entities in Python
On some web pages, non-ASCII characters are stored as HTML entities. In this representation, each Unicode character is written as
&# + decimal code point + ;
For example, the word for "charger" (充电器) is written as
&#20805;&#30005;&#22120;
where the code points 20805, 30005, and 22120 are the three characters 充 ("fill"), 电 ("electricity"), and 器 ("device") respectively.
To parse such a page, the program needs to convert these HTML entities back into the corresponding characters. In Java we can use the decode method of org.htmlparser.util.Translate from HTMLParser to do the conversion (the Translate class is very powerful). In Python there seems to be no ready-made method, so we need to implement a conversion function ourselves. Below is a simple implementation.
def decodeHtmlEntity(s):
    import re
    result = s
    # Match decimal entities of the form &#NNNNN; capturing the whole entity and the number.
    entityRe = r'(&#(\d+);)'
    entities = re.findall(entityRe, s)
    for entity in entities:
        # entity[0] is the full entity, entity[1] is the decimal code point.
        result = result.replace(entity[0], unichr(int(entity[1])))
    return result.encode('utf-8', 'ignore')
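A quick check of the function, applied to the entity string from the example above (Python 2, to match the code):

# Decode the three decimal entities and print the UTF-8 result.
s = '&#20805;&#30005;&#22120;'
print decodeHtmlEntity(s)   # prints the UTF-8 bytes for 充电器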
The idea is to use Python's built-in unichr function to turn a Unicode code point back into the corresponding Unicode character.
However, this function only handles decimal entities. If you also need to convert hexadecimal entities, it has to be modified accordingly; see the sketch below.
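A minimal sketch of the hexadecimal case (my own variant, not part of the original function; HTML also allows entities of the form &#xNNNN;, so only the pattern and the base passed to int() change):

import re

def decodeHexHtmlEntity(s):
    # Handle hexadecimal entities such as &#x5145; (Python 2, same approach as above).
    result = s
    for whole, hexcode in re.findall(r'(&#[xX]([0-9a-fA-F]+);)', s):
        result = result.replace(whole, unichr(int(hexcode, 16)))
    return result.encode('utf-8', 'ignore')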
_________________________________________________
ASCII
------------------------------------------------------------------------------------
ASCII uses 7 bits (0x00-0x7F). Codes 32 to 127 represent printable characters: 32 is the space, and codes below 32 are invisible control characters.
The 8th bit was unused, and people around the world came up with different uses for it, for example the OEM character set on the IBM PC.
Eventually a consensus was reached on the characters below 128, and that became the ASCII standard.
Characters 128 and above could be interpreted in different ways, and these different interpretations are called code pages.
There were even code pages meant to handle multiple languages on the same computer.
Meanwhile, in Asia, things were even crazier. Asian character sets usually contain thousands of characters, far more than 8 bits can express.
The result was DBCS (Double Byte Character Set), a messy system in which some characters occupy 1 byte and some occupy 2.
Moving forward through such a string is easy, but moving backward is troublesome.
Programmers were advised not to use s++ or s-- to step forward and backward, but to call functions such as Windows' AnsiNext and AnsiPrev, because those functions know what is going on.
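A small illustration of why byte-wise traversal breaks down (my own sketch, using GBK as an example double-byte encoding):

# In a DBCS encoding such as GBK, ASCII letters take 1 byte and Chinese
# characters take 2, so byte positions and character positions diverge.
s = u'ab\u5145\u7535'.encode('gbk')   # 'a', 'b', then two Chinese characters
print len(s)                          # 6 bytes
print len(s.decode('gbk'))            # but only 4 characters
# Stepping backward byte by byte can land in the middle of a character,
# which is why Windows provides AnsiPrev/AnsiNext instead of s-- / s++.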
These different assumptions (code pages) caused no problems as long as a string stayed on a single machine. But with the growth of the Internet, strings routinely move from one machine to another, and that is where the trouble starts. So Unicode appeared.
Unicode
---------------------------------------------------------------------------------------
Unicode was a brave effort to merge every reasonable writing system on the planet into a single character set.
Many people still believe that Unicode is simply 16 bits, each character occupying 16 bits, for a total of 65,536 possible characters.
That is wrong, but don't feel bad: it is the most common misconception about Unicode.
In fact, Unicode thinks about characters in a completely different way, and we have to understand that way of thinking.
So far we have assumed that a character maps directly to some bits stored on disk or in memory, for example: A -> 0100 0001.
In Unicode, a character actually maps to something called a code point.
For example, the letter A is an abstract, platonic ideal.
An A in Times New Roman is the same character as an A in Helvetica or any other font, but it is different from the lowercase a.
In other languages, however, such as Hebrew, German, and Arabic, whether different glyphs of the same letter really are the same letter can be controversial. After lengthy debate, these questions were eventually settled.
Every abstract letter in every alphabet is assigned a number, such as U+0645. This number is called a code point.
The "U+" means Unicode, and the number is in hexadecimal.
You can browse all of these characters with the charmap utility (on Windows 2000/XP), or visit the Unicode website (http://www.unicode.org).
There is no limit on how large a code point can be, and the numbers have already gone past 65,535, so not every character can be stored in two bytes.
So the string "Hello" is represented in Unicode by five code points:
U+0048 U+0065 U+006C U+006C U+006F
These are just numbers. We have not yet said anything about how to store them on disk or represent them in an email message; that is what encodings are for.
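You can see these code points directly in Python (a small aside, not in the original text):

# Print the code points of u"Hello" as U+XXXX (Python 2).
for ch in u"Hello":
    print "U+%04X" % ord(ch)
# U+0048 U+0065 U+006C U+006C U+006F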
Encodings
-------------------------------------------------------------------------
The earliest Unicode encoding used two bytes per character, so "Hello" became:
00 48 00 65 00 6C 00 6C 00 6F
In fact there is also another representation:
48 00 65 00 6C 00 6C 00 6F 00
Whether the high byte or the low byte comes first depends on which order the CPU handles faster, so both forms exist.
To tell the two representations apart, people adopted a strange convention: prepend FEFF (the Unicode byte order mark, or BOM) to every Unicode string.
If the byte order is swapped, it shows up as FFFE, and whoever reads the string knows to swap every pair of adjacent bytes.
However, in the early days not every Unicode string actually carried this mark.
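Python's codecs make the two byte orders and the BOM easy to see (a short illustration of the point above, Python 2 syntax):

# Big-endian, little-endian, and BOM-prefixed UTF-16 for u"Hello".
print u"Hello".encode('utf-16-be').encode('hex')   # 00480065006c006c006f
print u"Hello".encode('utf-16-le').encode('hex')   # 480065006c006c006f00
print u"Hello".encode('utf-16').encode('hex')      # starts with the BOM (fffe or feff, depending on platform)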
This looked fine, but programmers started complaining: "Look at all those zeros!" Americans mostly deal with English text, which rarely uses code points above U+00FF, and some of them could not bear to spend twice the storage on every character.
For these reasons many people simply ignored Unicode, and in the meantime things got worse.
Then UTF-8 was invented: another system for storing a string of Unicode code points in memory, using 8-bit bytes.
In UTF-8, every code point from 0 to 127 is stored in a single byte; only code points 128 and above are stored using 2, 3, and up to 6 bytes.
The layout is as follows (v marks the bits of the code point):

Smallest code point (hex)   Largest code point (hex)   Bytes in memory
------------------------------------------------------------------------------
00000000                    0000007F                   0vvvvvvv
00000080                    000007FF                   110vvvvv 10vvvvvv
00000800                    0000FFFF                   1110vvvv 10vvvvvv 10vvvvvv
00010000                    001FFFFF                   11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
00200000                    03FFFFFF                   111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
04000000                    7FFFFFFF                   1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
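In Python this is easy to observe (a small check, not from the original notes):

# UTF-8 length depends on the code point: ASCII stays 1 byte,
# U+0645 needs 2 bytes, and a Chinese character such as U+5145 needs 3.
print len(u'A'.encode('utf-8'))        # 1
print len(u'\u0645'.encode('utf-8'))   # 2
print len(u'\u5145'.encode('utf-8'))   # 3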
This works out nicely: English text looks exactly the same as it does in ASCII, so Americans never notice anything different. Only the rest of the world has to use multi-byte sequences.
Specifically, the code points of "Hello" are U+0048 U+0065 U+006C U+006C U+006F, which UTF-8 stores as the bytes 48 65 6C 6C 6F, exactly the same as ASCII, ANSI, and every OEM character set on the planet.
If you want accented characters or Greek, you do need several bytes per code point, but Americans don't mind that.
(Another benefit of UTF-8 is that old string-processing code, which treats a zero byte as the null terminator, will still handle the string without truncating it.)
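A quick confirmation of that claim (my own check, Python 2):

# The UTF-8 encoding of u"Hello" is byte-for-byte identical to the ASCII string.
print u"Hello".encode('utf-8')               # Hello
print u"Hello".encode('utf-8') == "Hello"    # True
print u"Hello".encode('utf-8').encode('hex') # 48656c6c6f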
So far we have seen three ways of representing Unicode:
the traditional two-byte form, called UCS-2 (because it has 2 bytes) or UTF-16 (because it has 16 bits), where you still have to figure out whether it is the high-byte-first or the low-byte-first flavor of UCS-2;
and the newer UTF-8, which has the nice property that old programs handling only English text keep working unchanged.
There are actually a bunch of other ways to encode Unicode as well.
There is UTF-7, which is much like UTF-8 but guarantees that the high bit is always 0, so if you have to push Unicode through an e-mail system that believes 7 bits are plenty, UTF-7 still gets through unharmed.
There is also UCS-4, which stores every code point in 4 bytes. Its advantage is that every character has the same length; the obvious disadvantage is that it wastes a lot of storage space.
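Python ships codecs for most of these, so the size trade-offs are easy to compare (a rough illustration; the 'utf-32' codec requires Python 2.6 or later):

# Bytes needed to store the same three Chinese characters in different encodings.
s = u'\u5145\u7535\u5668'                 # the word for "charger" from the first example
for enc in ('utf-7', 'utf-8', 'utf-16', 'utf-32'):
    print enc, len(s.encode(enc))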
So now you should think of every character as an abstract Unicode code point, which can then be encoded in any old encoding scheme.
For example, you can encode the Unicode string "Hello" (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, in the old OEM Greek encoding, in the Greek ANSI encoding, and so on, with one catch: some characters may not show up!
If a Unicode code point has no equivalent in the encoding you chose, it usually appears as a question mark (?) or a little white box.
Some encodings commonly used for English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, a.k.a. Latin-1 (also fine for any Western European language).
Try to store Russian characters in these encodings and you will get a bunch of question marks.
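That is exactly what Python's 'replace' error handler shows (a small demonstration, using Russian text as in the sentence above):

# Russian text has no equivalents in Latin-1, so every character
# is replaced by a question mark.
print u'\u041f\u0440\u0438\u0432\u0435\u0442'.encode('latin-1', 'replace')   # ??????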
UTF-7, UTF-8, UTF-16, and UTF-32, on the other hand, can all store any code point correctly.
The simplest and most important concept
================================================================================
It does not make sense to have a string without knowing what encoding it uses.
Do not assume that "plain" text is ASCII. There is no such thing as plain text.
If you have a string, in memory, in a file, or in an email message, you have to know its encoding, or you cannot interpret it or display it to the user correctly.
Almost every silly "my web page looks garbled" or "my email message is unreadable" problem comes down to someone failing to say which encoding is actually in use: UTF-8? ASCII? ISO 8859-1? Windows 1252? Without that, the text cannot be interpreted and displayed properly, and you cannot even tell where the string ends.
So how do we preserve the information about which encoding a string uses? There are standard ways to do it.
For an email message, for example, you put a header of this form at the top:
Content-Type: text/plain; charset="UTF-8"
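On the receiving side, the charset parameter can be pulled out of such a header with the standard library; a minimal sketch using cgi.parse_header (Python 2):

import cgi

# Extract the charset parameter from a Content-Type header.
ctype, params = cgi.parse_header('text/plain; charset="UTF-8"')
print ctype                  # text/plain
print params.get('charset')  # UTF-8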
For a web page, the web server sends a similar Content-Type header along with the page itself
(not inside the HTML, but as an HTTP response header sent before the HTML page).
There is a problem with that. If your web server hosts many sites at once, built by many different people in many different programming languages, the server itself has no way of knowing which encoding each file uses, so it cannot send the correct Content-Type header.
It would be convenient to record the Content-Type information inside each HTML file instead. That sounds crazy: how can you read the file to learn its encoding before you know the encoding?
Fortunately, almost every encoding in common use interprets the characters between 32 and 127 the same way, so you can always get far enough into the file to find the tag. You can therefore write this in every HTML file:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Note that this meta tag really has to come very early in the <head> section. As soon as the web browser sees it, it stops parsing the page and starts over, reinterpreting the whole page with the encoding it just read.
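A rough Python sketch of that trick: because bytes 32-127 mean the same thing in nearly every encoding, you can scan the raw bytes for the charset declaration before you know the real encoding (my own illustration, not from the original notes):

import re

def sniff_meta_charset(raw_html):
    # Look for charset=... in the raw bytes; this works because the meta
    # tag itself is plain ASCII in almost every encoding.
    m = re.search(r'charset\s*=\s*["\']?([-\w]+)', raw_html, re.IGNORECASE)
    return m.group(1) if m else None

print sniff_meta_charset('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">')   # UTF-8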
What does a web browser do if it finds no Content-Type in either the HTTP headers or a meta tag?
IE does this: it guesses, based on how often particular bytes appear in typical text of various languages and encodings.
If the guess is wrong and the page looks garbled, the user can try other encodings through the View | Encoding menu (of course, not everyone knows to do that).
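The same kind of statistical guessing is available in Python through the third-party chardet library (a sketch assuming chardet is installed; it is not mentioned in the original notes, and guesses on short inputs can be unreliable):

import chardet

# Guess the encoding of raw bytes from their statistical profile.
raw = (u'\u5145\u7535\u5668' * 20).encode('gbk')
print chardet.detect(raw)   # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}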
In VB, COM, and Windows NT/2000/XP, the default string type is UCS-2 (2 bytes per character).
In C++ code we can declare strings as wchar_t (wide char) and use the wcs family of functions instead of the str family: wcscat and wcslen instead of strcat and strlen.
In C code, a UCS-2 string literal is written with an L prefix, as in L"Hello".
For web pages, it is best to standardize on UTF-8; this encoding has been well supported by web browsers for many years.