Front-End Learning: HTTP Character Sets

Source: Internet
Author: User

Preface

HTTP messages can carry content in any language, just as they can carry images, movies, or any other type of media. To HTTP, the entity body is just a container for binary data. To support international content, the server must tell the client the alphabet and language of each document so that the client can correctly unpack the document's bits into characters and present the content to the user. Character sets, described in detail below, are what make this possible.

Header Overview

The server tells the client a document's alphabet and language through the charset parameter of the HTTP Content-Type header and through the Content-Language header. These headers describe what is in the entity body's "box of bits": how to convert the content into the proper characters for display on the screen, and what language the words are in.

Likewise, the client needs to tell the server which languages the user understands and which alphabet encoding algorithms the browser has installed. The client sends the Accept-Charset and Accept-Language headers to tell the server which character set encoding algorithms and languages it understands, and in what order of preference.

The Accept headers in the following HTTP message might be sent by a native French speaker who prefers his native language but also speaks a little English, and whose browser supports the iso-8859-1 Western European character set encoding and the UTF-8 Unicode character set encoding:

Accept-Language: fr, en;q=0.8
Accept-Charset: iso-8859-1, utf-8

The parameter "q=0.8" is a quality factor. It gives English a lower priority (0.8) than French, which takes the default value of 1.0.
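
To make the quality factor concrete, here is a minimal Python sketch of how a server might rank the values of an Accept header. The function name parse_accept is hypothetical, and treating a malformed q value as 0 is an assumption of this sketch, not something the HTTP specification is quoted as requiring.

```python
def parse_accept(header_value):
    """Return (token, q) pairs sorted by descending q; ties keep header order."""
    items = []
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        token, *params = [p.strip() for p in part.split(";")]
        q = 1.0  # default quality when no q parameter is given
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: a malformed q means "not acceptable"
        items.append((token.lower(), q))
    # sorted() is stable, so entries with equal q stay in header order
    return sorted(items, key=lambda item: item[1], reverse=True)


print(parse_accept("fr, en;q=0.8"))
```

For the header above, French sorts ahead of English because its implicit quality of 1.0 is higher than 0.8.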

Encoding process

An HTTP character set value describes how to convert the bits of the entity content into characters of a particular alphabet. Each character set tag names an algorithm that converts bits to characters (and vice versa). Character set tags are standardized in the MIME character set registry maintained by the IANA. Many of these character sets are outlined in Appendix H.

The following Content-Type header tells the receiver that the content is an HTML file. Its charset parameter tells the receiver to use the decoding algorithm of the iso-8859-6 Arabic character set to convert the bits of the content into characters:

Content-Type: text/html; charset=iso-8859-6

The iso-8859-6 encoding maps 8-bit values to Latin and Arabic letters, as well as digits, punctuation, and other symbols. For example, the byte value 225 is mapped in iso-8859-6 to the Arabic letter "FEH" (pronounced like the English letter F).

[note] Unlike Chinese and Japanese, Arabic has only 28 letters. An 8-bit space has 256 distinct values, which is enough to hold the Latin characters, the Arabic characters, and other symbols.

Some character encodings, such as UTF-8 and iso-2022-jp, are more complex. They are variable-length encodings, meaning that the number of bits per character varies. This kind of encoding can use extra bits for alphabets with large numbers of characters (such as Chinese and Japanese) while using only a few bits for the standard Latin characters.

We want to convert the bits in a document into characters for display on the screen. But because there are many different alphabets, and many different ways to encode characters into bits (each with its own advantages and disadvantages), we need a standard way to describe and apply the bits-to-characters decoding algorithm.

Converting bits to a character takes two steps:

In step a, bits from the document are converted into a character code that identifies a particular numbered character in a particular coded character set. In this example, the decoded character code is 225.

In step b, the character code is used to select a particular element of the coded character set. In iso-8859-6, the value 225 corresponds to the Arabic letter "FEH". The algorithms used in steps a and b are determined by the MIME charset tag.
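
The two steps can be sketched in Python as follows. The helper names bytes_to_codes and build_table are hypothetical. The lookup table is derived from Python's built-in iso-8859-6 codec rather than typed in by hand, and unassigned codes are stored as None.

```python
def bytes_to_codes(data):
    """Step a: with an 8-bit identity encoding, each byte is one character code."""
    return list(data)


def build_table(charset):
    """Step b helper: a 256-entry array mapping code number -> character."""
    table = []
    for code in range(256):
        try:
            table.append(bytes([code]).decode(charset))
        except UnicodeDecodeError:
            table.append(None)  # code not assigned in this coded character set
    return table


ISO_8859_6 = build_table("iso-8859-6")

codes = bytes_to_codes(b"\xe1")          # step a: bits -> character code
chars = [ISO_8859_6[c] for c in codes]   # step b: character code -> character
print(codes, chars)
```

Here the single byte 11100001 yields the code 225, and the table lookup yields the Arabic letter FEH (U+0641).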

A key goal of an internationalized character system is to separate semantics (letters) from presentation (graphical display forms). HTTP concerns itself only with transporting the character data and the associated language and charset labels. Displaying the character shapes (step c) is handled by the user's graphical display software (browser, operating system, fonts, and so on).

If the client uses the wrong charset parameter, it displays strange, garbled characters. Suppose the browser gets the value 225 (binary 11100001) from the body:

If the browser believes the body is encoded with iso-8859-1 Western European character codes, it displays a lowercase Latin letter "a" with an acute accent.

If the browser uses the iso-8859-6 Arabic codes, it displays the Arabic letter "FEH".

If the browser uses the iso-8859-7 Greek codes, it displays the lowercase Greek letter "Alpha".

If the browser uses the iso-8859-8 Hebrew codes, it displays the Hebrew letter "BET".
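
The four cases above can be reproduced with Python's standard codecs. This is only an illustration: the helper name_in is hypothetical, and unicodedata.name is used so that the result is readable even on a terminal that cannot render the glyphs.

```python
import unicodedata

BYTE = bytes([225])  # binary 11100001


def name_in(charset):
    """Decode the single byte 225 with the given charset and return its Unicode name."""
    return unicodedata.name(BYTE.decode(charset))


for charset in ("iso-8859-1", "iso-8859-6", "iso-8859-7", "iso-8859-8"):
    print(charset, BYTE.decode(charset), name_in(charset))
```

The same byte produces four unrelated letters, which is exactly what a reader sees as garbled text when the charset label is wrong.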

"Normalized MIME charset value"

The combination of a particular character encoding scheme and a particular coded character set is called a MIME charset. HTTP uses standardized MIME charset tags in the Content-Type and Accept-Charset headers. MIME charset values are registered with the IANA.

The following table lists some of the MIME charset encoding schemes used by documents and browsers.

[note] The complete list of registered character sets is available from the IANA character set registry.

The web server sends the MIME charset tag to the client in the charset parameter of the Content-Type header:

Content-Type: text/html; charset=iso-2022-jp

If the character set is not explicitly listed, the receiver may try to infer it from the document content. For HTML content, the character set can be found in a <meta http-equiv="Content-Type"> tag that describes the charset.

The following example shows how an HTML meta tag sets the character set to the Japanese encoding iso-2022-jp:

<meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp">

If the document is not HTML, or if it contains no meta Content-Type tag, software can try to infer the character encoding by scanning the actual text for common patterns of particular languages and encodings.
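
The lookup order described here (Content-Type header first, then the HTML meta tag) might be sketched in Python like this. The function names and the 1024-byte scan window are assumptions made for illustration, and the regular expression is deliberately simplistic; real browsers use a far more careful sniffing algorithm.

```python
import re
from email.message import Message

# Very rough pattern for a charset declaration inside a <meta ...> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def sniff_charset(content_type, body):
    """Prefer the header's charset; fall back to an HTML meta tag near the top of the body."""
    charset = charset_from_header(content_type) if content_type else None
    if charset:
        return charset
    match = META_CHARSET.search(body[:1024])  # assumed scan window
    if match:
        return match.group(1).decode("ascii").lower()
    return None  # caller would have to guess from the text itself


html = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-2022-JP"></head>'
print(sniff_charset("text/html", html))
```

When neither source names a charset, the sketch returns None and leaves the statistical guessing mentioned above to the caller.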

Over the past few decades, thousands of character encoding and decoding methods have been developed. Most clients cannot support all of these character encodings and mapping systems.

An HTTP client can use the Accept-Charset request header to tell the server explicitly which character systems it supports. The Accept-Charset header value lists the character encoding schemes that the client supports. For example, the following HTTP request header indicates that the client accepts the Western European iso-8859-1 character system as well as the UTF-8 variable-length Unicode-compatible system. The server is free to return content in either of these character encoding schemes.

Accept-Charset: iso-8859-1, utf-8

[note] There is no Content-Charset response header to match the Accept-Charset request header. For compatibility with the MIME standard, the response character set is carried back from the server in the charset parameter of the Content-Type response header. The asymmetry is unfortunate, but all the needed information is still there.

Encoding syntax

Terminology

Here are eight terms about electronic character systems that you should know:

1. Character

A character is a letter, digit, punctuation mark, ideograph (as in Chinese), symbol, or other textual "atom" of writing. The Universal Character Set (UCS), known informally as Unicode, pioneered a series of standardized textual names for many characters in many languages. These names are commonly used to identify characters conveniently and unambiguously.

2. Glyph

A stroke pattern or unique graphical shape that describes a character. A character may have multiple glyphs if it can be written in different ways.

3. Coded character

A unique number assigned to a character so that we can work with it.

4. Coding space

The range of integers that we plan to use as character code values.

5. Code width

The number of bits in each fixed-size character code.

6. Character repertoire

A particular working set of characters (a subset of all the characters in the world).

7. Coded character set

A set of coded characters that takes a character repertoire (a selection of characters from around the world) and assigns each character a code from a coding space. In other words, it maps numeric character codes to real characters.

8. Character encoding scheme

An algorithm that encodes numeric character codes into a sequence of content bits (and decodes them back). Character encoding schemes can be used to reduce the amount of data needed to identify characters (compression), to work around transmission restrictions, and to unify overlapping coded character sets.
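
A few of these terms can be tied to concrete values in Python. This is only an illustration: Python's ord returns a Unicode (UCS) code number, and the variable names are chosen for this sketch.

```python
char = "\u00e9"                      # character: LATIN SMALL LETTER E WITH ACUTE
code = ord(char)                     # coded character: its number in the UCS, 233

# Two character encoding schemes turn the same code into different bytes.
as_latin1 = char.encode("iso-8859-1")   # 8-bit identity encoding: one byte
as_utf8 = char.encode("utf-8")          # variable-length encoding: two bytes

# Decoding reverses the scheme and recovers the same character.
same = as_latin1.decode("iso-8859-1") == as_utf8.decode("utf-8") == char

print(code, as_latin1, as_utf8, same)
```

The character and its code stay the same; only the bytes on the wire change with the encoding scheme.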

"Bad naming."

Technically, the MIME charset tag (used in the charset parameter of the Content-Type header and in the Accept-Charset header) does not name a character set at all. A MIME charset value names a complete algorithm for mapping data bits to codes for unique characters. It combines the two concepts of a character encoding scheme and a coded character set.

This terminology is sloppy and confusing, because published standards already exist for character encoding schemes and for coded character sets. Here is how the authors of HTTP/1.1 describe their use of the term:

The term "character set" is used in this document to refer to a method that converts a sequence of octets into a sequence of characters. Note: This use of the term "character set" is more commonly referred to as a "character encoding". However, since HTTP and MIME share the same registry, it is important that the terminology also be shared.

The IETF also adopts nonstandard terminology in RFC 2277:

This document uses the term "charset" to mean a set of rules for mapping from a sequence of octets to a sequence of characters, such as the combination of a coded character set and a character encoding scheme. This is also what is used as an identifier in MIME "charset=" parameters and registered in the IANA charset registry. (Note that this is not a term used by other standards bodies, such as ISO.)

[note] Even worse, the MIME charset tag often borrows the name of a particular coded character set or encoding scheme. For example, iso-8859-1 is a coded character set (it assigns numeric codes to a set of 256 European characters), but MIME uses the charset value iso-8859-1 to mean an 8-bit identity encoding of that coded character set. This imprecise terminology is not fatal, but when reading standards documents, be clear about what is being assumed.

So, when reading standards documents, stay alert so that you know exactly what is being defined.

Characters

Characters are the most basic building blocks for writing. Characters can represent letters, numbers, punctuation, ideographic symbols (such as in Chinese), mathematical symbols, or other basic units of writing

Characters are independent of font and style. The same character, named LATIN SMALL LETTER A in the UCS, can be drawn in several variants. Although their stroke patterns and styles differ greatly, a native reader of a Western European language immediately recognizes all five shapes as the same character.

In many writing systems, the same character takes different stroke shapes depending on its position in a word. For example, four different shapes all represent the single character ARABIC LETTER AIN.

Figure a shows how AIN is written as a standalone character. Figure d shows AIN at the beginning of a word, Figure c shows AIN in the middle of a word, and Figure b shows AIN at the end of a word.

Glyph

Don't confuse characters with glyphs. A character is a unique, abstract "atom" of language. A glyph is a particular way of drawing a character. Depending on artistic style and technique, each character can have many different glyphs.

Also, don't confuse characters with presentation forms. To make writing look nicer, many handwriting styles and typefaces join adjacent characters into ligatures, in which two characters connect smoothly. English typesetters often join "f" and "i" into an "fi" ligature, and Arabic writers often join the characters "LAM" and "ALIF" into an elegant ligature.

Here is the general rule: if the meaning of the text changes when you substitute one glyph for another, the glyphs are different characters. Otherwise, they are the same character rendered in different styles.

"Coded Character Set"

According to the definitions in RFC 2277 and RFC 2130, a coded character set maps integers to characters. Coded character sets are often implemented as arrays indexed by code number, whose elements are characters.

Let's look at a few important coded character set standards, including the historic US-ASCII character set, the iso-8859 extensions to ASCII, the Japanese JIS X 0201 character set, and the Universal Character Set (UCS, or Unicode).

1. US-ASCII: the ancestor of all character sets

ASCII is the most famous coded character set. It was standardized by ANSI in 1968 as standard X3.4, "American Standard Code for Information Interchange". ASCII code values run only from 0 to 127, so 7 bits are enough to cover the code space. The preferred name for ASCII is US-ASCII, which distinguishes it from the internationalized variants of the 7-bit character set. HTTP messages (headers, URIs, and so on) use US-ASCII.

2. iso-8859

The iso-8859 character set standards are 8-bit supersets of US-ASCII that use the high-order bit to add characters for international writing. The extra bit provides only 128 additional codes, which is not enough even for all the European characters, let alone the Asian ones. For this reason, iso-8859 defines a separate character set for each region, as shown below:

iso-8859-1    Western European languages (for example, English, French)
iso-8859-2    Central and Eastern European languages (for example, Czech, Polish)
iso-8859-3    Southern European languages
iso-8859-4    Northern European languages (for example, Latvian, Lithuanian, Greenlandic)
iso-8859-5    Slavic languages (for example, Bulgarian, Russian, Serbian)
iso-8859-6    Arabic
iso-8859-7    Greek
iso-8859-8    Hebrew
iso-8859-9    Turkish
iso-8859-10   Nordic languages (for example, Icelandic, Inuit)
iso-8859-15   Modification of iso-8859-1 that includes the new euro character

iso-8859-1, also known as Latin1, is the default character set for HTML. It can represent text in most Western European languages. Because iso-8859-15 includes the new euro symbol, there has been some discussion of using it in place of iso-8859-1 as the default HTTP character set. However, iso-8859-1 is so widely adopted that a widespread change to iso-8859-15 is unlikely any time soon.
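
The difference between iso-8859-1 and iso-8859-15 is easy to see with Python's built-in codecs. This small sketch looks only at byte 0xA4, the position where iso-8859-15 places the euro sign; the variable names are chosen for illustration.

```python
import unicodedata

byte = b"\xa4"

in_latin1 = byte.decode("iso-8859-1")    # generic currency sign
in_latin9 = byte.decode("iso-8859-15")   # euro sign

print(unicodedata.name(in_latin1), "/", unicodedata.name(in_latin9))

# iso-8859-1 has no code at all for the euro sign, so encoding it fails.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False
```

A page that contains a euro sign therefore cannot be served as iso-8859-1 without substituting or escaping that character.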

3. JIS X 0201

JIS X 0201 is a minimal character set that extends ASCII with Japanese half-width katakana characters. Half-width katakana were originally used in the Japanese telegraph system. JIS X 0201 is often called "JIS Roman"; JIS is an abbreviation of "Japanese Industrial Standard".

4. JIS X 0208 and JIS X 0212

Japanese includes thousands of characters from several writing systems. Although it is possible to get by, with difficulty, using the 63 basic katakana characters in JIS X 0201, practical use requires a far more complete character set.

The JIS X 0208 character set was the first multi-byte Japanese character set. It defines 6,879 coded characters, most of which are kanji derived from Chinese. The JIS X 0212 character set adds another 6,067 characters.

5. UCS

The Universal Character Set (UCS) is a worldwide standardization effort to combine all of the world's characters into a single coded character set. UCS is defined by ISO 10646. Unicode is a commercial consortium that tracks the UCS standards. UCS has coding space for millions of characters, although the basic set consists of only about 50,000 characters.

"Character encoding scheme"

A character encoding scheme specifies how to pack character code numbers into content bits, and how to unpack them back into character codes at the other end.

There are three main types of character encoding schemes:

1. Fixed width

Fixed-width encodings represent each coded character with a fixed number of bits. They are fast to process but can waste space.

2. Variable width (nonmodal)

Variable-width encodings use different numbers of bits for different character code numbers. They reduce the number of bits needed for common characters, and they retain compatibility with legacy 8-bit character sets while still allowing multiple bytes for international characters.

3. Variable width (modal)

Modal encodings use special "escape" patterns to shift between different modes. For example, a modal encoding can be used to switch among multiple, overlapping character sets in the middle of a text. Modal encodings are more complicated to process, but they can efficiently support complicated writing systems.

Let's look at some common encoding schemes.

1. 8-bit

The 8-bit fixed-width identity encoding simply encodes each character code as its corresponding 8-bit value. It supports only character sets with a code range of 256 characters. The iso-8859 family of character sets uses the 8-bit identity encoding.

2. UTF-8

UTF-8 is a popular character encoding scheme designed for UCS. UTF stands for "UCS Transformation Format". UTF-8 uses a nonmodal, variable-length encoding for character code values. The leading bits of the first byte indicate the length of the encoded character in bytes, and each subsequent byte carries 6 bits of the code value.

If the high-order bit of the first encoded byte is 0, the length is 1 byte and the remaining 7 bits hold the character code. A pleasant consequence is compatibility with ASCII (but not with the iso-8859 family, because that family uses the high-order bit).

For example, character code 90 (ASCII "Z") is encoded as 1 byte (01011010), while code 5073 (13-bit binary value 1001111010001) is encoded as 3 bytes: 11100001 10001111 10010001
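
The bit layout just described can be written out as a short Python function. The name utf8_encode is hypothetical, and the sketch is a simplification: it does not reject surrogate code values or values above 0x10FFFF, which a production encoder would.

```python
def utf8_encode(code):
    """Encode one character code as UTF-8 bytes (simplified, no validity checks)."""
    if code < 0x80:                      # 0xxxxxxx
        return bytes([code])
    if code < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    if code < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    return bytes([0xF0 | (code >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code >> 12) & 0x3F),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])


for code in (90, 5073):
    encoded = utf8_encode(code)
    print(code, " ".join(format(b, "08b") for b in encoded))
```

For code 5073 the 13 significant bits are padded to 16 and split 4 + 6 + 6 across the three bytes, which gives the byte pattern shown in the text.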

3. iso-2022-jp

iso-2022-jp is a widely used encoding for Japanese documents on the Internet. It is a variable-length, modal encoding in which all values are less than 128, which avoids compatibility problems with software that does not support 8-bit characters.

The encoding context is always set to one of four predefined character sets, and special escape sequences switch from one set to another. iso-2022-jp starts out in the US-ASCII character set and can switch to the JIS X 0201 (JIS-Roman) character set, or to the much larger JIS X 0208-1978 and JIS X 0208-1983 character sets, using 3-byte escape sequences.

These escape sequences are listed in the following table. In practice, Japanese text begins with ESC $ @ or ESC $ B and ends with ESC ( B or ESC ( J.

In US-ASCII or JIS-Roman mode, each character uses a single byte. When one of the larger JIS X 0208 character sets is in use, each character code uses 2 bytes. The encoding restricts the bytes sent to the range 33 to 126.
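
The modal behavior can be observed with Python's iso2022_jp codec. This sketch assumes that the codec switches to JIS X 0208-1983 with ESC $ B and returns to US-ASCII with ESC ( B at the end of the text, which is how CPython's implementation behaves; the sample string is five hiragana characters.

```python
text = "\u3053\u3093\u306b\u3061\u306f"   # five hiragana characters
encoded = text.encode("iso2022_jp")

shift_in = encoded[:3]     # escape sequence that enters the 2-byte mode
payload = encoded[3:-3]    # two bytes per character
shift_out = encoded[-3:]   # escape sequence that returns to US-ASCII

print(shift_in, payload, shift_out)
print("7-bit clean:", all(b < 128 for b in encoded))
```

Plain ASCII text passes through unchanged, because the encoding starts in US-ASCII mode and never needs an escape sequence.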

4. EUC-JP

EUC-JP is another popular Japanese encoding. EUC stands for "Extended UNIX Code", which was originally developed to support Asian characters on UNIX operating systems.

Like iso-2022-jp, EUC-JP is a variable-length encoding that allows the use of several standard Japanese character sets. Unlike iso-2022-jp, however, EUC-JP is not modal. There are no escape sequences to shift between modes.

EUC-JP supports four coded character sets: JIS X 0201 (JIS-Roman, ASCII with a few Japanese substitutions), JIS X 0208, half-width katakana (the 63 characters originally used in the Japanese telegraph system), and JIS X 0212.

JIS Roman (which is compatible with ASCII) is encoded in 1 byte, JIS X 0208 and half-width katakana in 2 bytes, and JIS X 0212 in 3 bytes. The encoding wastes a little space but is simple to process.
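
The byte lengths can be checked with Python's euc_jp codec. This sketch covers only three of the four character sets (ASCII-range, JIS X 0208, and half-width katakana); a JIS X 0212 example is left out because it depends on picking a character that is defined only in that set.

```python
samples = {
    "ASCII-range letter": "A",
    "JIS X 0208 hiragana": "\u3042",        # HIRAGANA LETTER A
    "half-width katakana": "\uff71",        # HALFWIDTH KATAKANA LETTER A
}

lengths = {}
for label, char in samples.items():
    encoded = char.encode("euc_jp")
    lengths[label] = len(encoded)
    print(label, encoded, len(encoded))

# Non-modal: no ESC byte appears anywhere in mixed text.
mixed = "A\u3042\uff71".encode("euc_jp")
has_escape = b"\x1b" in mixed
```

Because the lead byte alone tells the decoder which character set follows, mixed text needs no mode switches.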

The following table summarizes the pattern of this encoding

Language tags

Language tags are short, standardized strings that name spoken languages.

Names need to be standardized. Otherwise, some people would tag French documents "French", others would use "Français", others "France", and some lazy people might use "Fra" or even "F". Standardized language tags avoid this confusion.

The tag for English is en, the tag for German is de, the tag for Korean is ko, and so on. Language tags can describe regional variants and dialects of a language, such as Brazilian Portuguese (pt-BR), U.S. English (en-US), and Hunan Chinese (zh-xiang). There is even a standard language tag for Klingon (i-klingon).

The Content-Language entity header field describes the target audience languages of the entity. If the content is intended primarily for a French audience, its Content-Language header field would contain:

Content-Language: fr

The Content-Language header is not limited to text documents. Audio clips, movies, and applications may all be intended for audiences of a particular language. Any media type that targets a particular language audience can have a Content-Language header. For example, an audio file can be tagged for Navajo listeners.

If the content is intended for multiple audiences, you can list multiple languages. As suggested in the HTTP specification, a rendition of the "Treaty of Waitangi", presented in both Maori and English, could be described as:

Content-Language: mi, en

However, the mere presence of multiple languages in an entity does not mean that it is intended for multiple linguistic audiences. A beginner's language primer, such as "A First Lesson in Latin", is clearly intended for an English-literate audience and should be tagged only with en.

Most of us know at least one language. HTTP lets us send our language restrictions and preferences to web servers. If the web server has versions of a resource in multiple languages, it can give us the content in our most preferred language.

A client requesting Spanish content:

Accept-Language: es

Multiple language tags can be placed in the Accept-Language header to enumerate all supported languages and their order of preference (from left to right). Here, the client prefers English but will also accept Swiss German (de-CH) or other variants of German (de):

Accept-Language: en, de-CH, de

Clients use the Accept-Language and Accept-Charset headers to request content they can understand.

RFC 3066, "Tags for the Identification of Languages", documents the standardized syntax for language tags. Language tags can express: general language classes (for example, es for Spanish); country-specific languages (for example, en-GB for British English); dialects of a language (for example, no-bok for Norwegian "Book Language"); regional sign languages (for example, sgn-US-MA for Martha's Vineyard sign language); standardized nonvariant languages (for example, i-navajo); and nonstandard languages (for example, x-snowboarder-slang).

A language tag has one or more parts, separated by hyphens, called subtags:

The first subtag is called the primary subtag, and its values are standardized. The second subtag is optional and follows its own naming standard. Any trailing subtags are unregistered.

The primary subtag contains only letters (A-Z). Subsequent subtags can contain letters and digits. Each subtag is at most 8 characters long.
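
These syntax rules fit in a single regular expression. The following Python sketch checks only well-formedness as stated above (letters in the primary subtag, letters or digits afterward, 1 to 8 characters each); it does not check whether a tag is actually registered. The names LANGUAGE_TAG and is_well_formed are hypothetical.

```python
import re

# primary subtag: 1-8 letters; each later subtag: 1-8 letters or digits
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_well_formed(tag):
    """True if the tag follows the subtag syntax rules described above."""
    return LANGUAGE_TAG.fullmatch(tag) is not None


for tag in ("en", "de-CH", "sgn-US-MA", "e4-US", "en-abcdefghi"):
    print(tag, is_well_formed(tag))
```

The last two sample tags fail: one has a digit in the primary subtag, and the other has a 9-character subtag.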

An example is given in

All tags are case-insensitive; that is, the tags "en" and "EN" are equivalent. However, by convention, lowercase is used for general languages and uppercase for specific countries. For example, "fr" means all languages classified as French, while "FR" signifies the country France.

The values of the first and second language subtags are defined by various standards documents and their maintaining organizations. The IANA administers the list of standard language tags, using the rules outlined in RFC 3066.

If a language tag is composed of standard country and language values, the tag does not need to be specially registered. Only language tags that cannot be composed from the standard country and language values need to be specially registered with the IANA.

The first subtag is usually a standardized language token, chosen from the ISO 639 set of language standards. However, it can also be the letter "i" to identify IANA-registered names, or "x" for private or extension names. Here are the rules:

If the first subtag has 2 characters, it is a language code from the ISO 639 and 639-1 standards. If it has 3 characters, it is a language code listed in the ISO 639-2 standard and its extensions. If it is the letter "i", the language tag is explicitly registered with the IANA. If it is the letter "x", the language tag is a private, nonstandard, or extension subtag.

Some examples are given in the following table

The second subtag is usually a standardized country token, chosen from the ISO 3166 set of country code and region standards. However, it can also be another string registered with the IANA. Here are the rules:

If the second subtag has 2 characters, it is a country or region defined by ISO 3166. If it has 3 to 8 characters, it may be a value registered with the IANA. If it is a single character, it is illegal.
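
The length-based rules for the first two subtags can be summarized in a small Python sketch. The function names are hypothetical, and the functions classify by length only; they do not look anything up in the actual ISO 639, ISO 3166, or IANA lists.

```python
def describe_primary(subtag):
    """Classify the first subtag by the rules above (length check only)."""
    s = subtag.lower()
    if s == "i":
        return "IANA-registered"
    if s == "x":
        return "private or extension"
    if len(s) == 2:
        return "ISO 639 language code"
    if len(s) == 3:
        return "ISO 639-2 language code"
    return "unrecognized"


def describe_second(subtag):
    """Classify the second subtag by the rules above (length check only)."""
    n = len(subtag)
    if n == 2:
        return "ISO 3166 country code"
    if 3 <= n <= 8:
        return "possibly IANA-registered"
    return "illegal"  # a single character, or longer than 8 characters


primary, second = "en-GB".split("-")
print(describe_primary(primary), "/", describe_second(second))
```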

Some of the country codes in ISO 3166 are listed in the table below

The third and following subtags have no special rules, other than being at most 8 characters long (letters and digits).

Internationalized URIs

To date, URIs have provided little support for internationalization. With a few (poorly defined) exceptions, today's URIs consist of a subset of US-ASCII characters. There are efforts underway to allow a richer set of characters in the hostnames and paths of URLs, but so far these standards have not been widely accepted or deployed.

The URI designers wanted everyone around the world to be able to share URIs with each other by email, by phone, by billboard, even over the radio. They also wanted URIs to be easy to use and remember. These two goals conflict.

To make it easy for people around the world to enter, manipulate, and share URIs, the designers chose a very limited set of common characters for URIs (the basic Latin letters, digits, and a few special symbols). Most software and keyboards in the world support this small set of characters.

Unfortunately, restricting the character set means that URIs are not easy for people around the world to use and remember. A large part of the world does not even know the Latin alphabet, and for them it is nearly impossible to remember URIs as abstract patterns.

The URI designers felt that ensuring the transcribability and sharability of resource identifiers was more important than having them consist of the most meaningful characters. As a result, today's URIs essentially consist of a restricted subset of ASCII characters.

The subset of US-ASCII characters permitted in URIs can be divided into reserved, unreserved, and escape character classes. Unreserved characters can be used in any component of a URI that allows them. Reserved characters have special meanings in many URIs, so they should not be used in general.

All unreserved, reserved, and escaped characters are listed in the following table

Escapes

URI escapes provide a safe way to insert reserved characters and other unsupported characters (such as various whitespace) inside URIs. Each escape is a three-character sequence: a percent sign (%) followed by two hexadecimal digits. The two hex digits represent the code of a US-ASCII character.

For example, to insert a space (ASCII 32) in a URL, you can use the escape %20, because 20 is the hexadecimal representation of 32. Similarly, to include a percent sign without having it treated as an escape, you can enter %25, where 25 is the hexadecimal value of the ASCII code for the percent sign.
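
The escape rule is simple enough to write by hand, and Python's urllib.parse module offers quote and unquote for the same job. In this sketch, escape_char is a hypothetical helper that applies the rule to one ASCII character, and the file name is made up for illustration.

```python
from urllib.parse import quote, unquote


def escape_char(ch):
    """Percent sign followed by the two-digit hexadecimal ASCII code of ch."""
    return "%{:02X}".format(ord(ch))


print(escape_char(" "), escape_char("%"))   # space and percent sign

escaped = quote("my file 100%.html")        # hypothetical file name
print(escaped)
print(unquote(escaped))
```

Letters, digits, and the period are left alone, while the spaces and the percent sign are escaped.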

The conceptual characters of a URI are converted into code bytes for characters in the current character set. When the URI is needed for processing, the escapes are undone, yielding the underlying ASCII code bytes.

Internally, HTTP applications should transport and forward URIs with the escapes in place. HTTP applications should unescape URIs only when the data is needed. More importantly, applications should ensure that no URI is ever unescaped twice, because percent signs that were themselves encoded in an escape would be unescaped as well, leading to data loss.
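
The double-unescape problem is easy to reproduce. In this Python sketch the original data is the literal three-character string "%41" (a made-up example), which must survive one round of escaping and unescaping unchanged.

```python
from urllib.parse import quote, unquote

original = "%41"                 # literal text: percent sign, "4", "1"
escaped = quote(original)        # the percent sign itself is escaped

once = unquote(escaped)          # correct: back to the original text
twice = unquote(once)            # wrong: "%41" is now read as an escape for "A"

print(escaped, once, twice)
```

After the second pass the original percent sign is gone and the data has silently turned into a different string.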

Note that the values being escaped should themselves be in the range of US-ASCII codes (0 to 127). Some applications attempt to use escape values to represent the extended characters of iso-8859-1 (codes 128 to 255). For example, a web server might erroneously use escapes to encode filenames that contain international characters. Doing so is wrong and may cause problems for other applications.

For example, the filename Sven Ölssen.html (containing an umlaut) might be encoded by a web server as Sven%20%D6lssen.html. It is fine to encode the space as %20, but it is technically illegal to encode the Ö as %D6, because the code D6 (decimal 214) falls outside the ASCII range. ASCII defines codes only up to 0x7F (decimal 127).
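
The ambiguity behind this rule shows up as soon as sender and receiver assume different character sets. The following Python sketch uses urllib.parse with explicit encodings; the UTF-8 form is included only for contrast and is not something the text above prescribes.

```python
from urllib.parse import quote, unquote

name = "Sven \u00d6lssen.html"   # U+00D6 is the letter O with umlaut

as_latin1 = quote(name, encoding="iso-8859-1")   # what the server in the example sends
as_utf8 = quote(name, encoding="utf-8")          # a different server might send this

print(as_latin1)
print(as_utf8)

# A receiver that guesses the wrong character set cannot recover the name.
wrong_guess = unquote(as_latin1, encoding="utf-8", errors="replace")
right_guess = unquote(as_latin1, encoding="iso-8859-1")
print(wrong_guess, "/", right_guess)
```

Nothing in the URI itself says which character set the escaped bytes belong to, so the receiver can only guess.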

"Modal Switching"

Some URIs also use sequences of ASCII characters to represent characters in other character sets. For example, an iso-2022-jp encoding might be used to insert "ESC ( J" to switch to the JIS-Roman character set and "ESC ( B" to switch back to the ASCII character set. This works in some local circumstances, but the behavior is not well defined, and there is no standardized scheme to identify the particular encoding used for the URL. As the authors of RFC 2396 say:

For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC 2277].

However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used. It is expected that a systematic treatment of character encoding within URIs will be developed as a future modification of this specification.

Currently, URIs are not very friendly to internationalized applications. The goal of URI portability outweighed the goal of language flexibility. Efforts are underway to make URIs more international, but in the near term HTTP applications should stick with ASCII. It has been around since 1968, so it can't be all that bad.

Precautions

HTTP headers must consist of characters from the US-ASCII character set. However, not all clients and servers implement this correctly, so you may occasionally receive illegal characters with code values greater than 127.

Many HTTP applications use operating system and library routines to process characters (for example, the Unix ctype character classification library), but not all of these libraries support character codes outside the ASCII range (0 to 127).

In some cases (generally with older implementations), these libraries may return incorrect results or crash the application when given non-ASCII characters. Because a message may contain illegal data, read the documentation for these character libraries carefully before using them to process HTTP messages.
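
One defensive approach is to check header bytes before handing them to text routines. The following Python sketch is an illustration only: decode_header_value is a hypothetical helper, and replacing illegal bytes with U+FFFD is one possible policy among several (rejecting the message is another).

```python
def decode_header_value(raw):
    """Return (text, is_clean) for raw header bytes.

    is_clean is False when any byte falls outside the US-ASCII range 0-127.
    Illegal bytes are replaced with U+FFFD instead of raising an exception.
    """
    is_clean = all(byte < 128 for byte in raw)
    text = raw.decode("ascii", errors="replace")
    return text, is_clean


print(decode_header_value(b"text/html; charset=utf-8"))
print(decode_header_value(b"caf\xe9.html"))
```

The caller can then log or reject messages for which is_clean is False instead of crashing on them.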

The HTTP specification clearly defines the legal GMT date formats, but be aware that not all web servers and clients follow the rules. For example, we have seen web servers send invalid HTTP Date headers with the month expressed in the local language.

HTTP applications should try to tolerate nonconforming dates and should not crash on receiving them. However, they may not always be able to interpret every date that is sent. If a date cannot be parsed, servers should treat it conservatively.
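
A tolerant date parser can be sketched with Python's email.utils module. The wrapper parse_http_date is hypothetical, and returning None for an unparseable date is an assumption about what "conservative" handling means here. The second sample date uses a German month name purely as an illustration of a localized header.

```python
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Return a datetime for a parseable date header, or None instead of raising."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


good = parse_http_date("Sun, 06 Nov 1994 08:49:37 GMT")
localized = parse_http_date("Fri, 06 Mai 1994 08:49:37 GMT")   # non-English month name

print(good, localized)
```

A caller that gets None back can, for example, treat a cached response as already stale rather than trusting a date it could not read.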

DNS currently does not support the use of internationalized characters in domain names. Standardization of multi-lingual domain names is now in progress, but has not yet been extensively deployed
