First of all, if you find the following code somewhat confusing, it is recommended to read the MSDN article on specifying a character set for platform invoke: http://msdn.microsoft.com/zh-cn/library/7b93s42f.aspx
Learning .NET (C
What is the use of the HTTP Accept-Charset header?
Do the Accept-Charset, Accept-Encoding, and Accept-Language HTTP headers have much effect? The Content-Type header seems to matter far more.
A program that encounters a person with an easy-to-write page that returns HTML code.
Spring MVC 3.1: the @ResponseBody annotation generates a very large Accept-Charset header
Spring 3 MVC with @ResponseBody generates a large response header (the Accept-Charset value can exceed 4 KB), because StringHttpMessageConverter by default
Differences between contentType, charset, and pageEncoding
===========================================================================
The contentType attribute specifies the HTTP content type of the response. If contentType is not specified, the
HTTP Header
Accept-Charset lists the character sets that the browser accepts.
Example: Accept-Charset: ISO-8859-1, UTF-8;q=0.7, *;q=0.3
This browser prefers the following character sets, in order:
1) ISO-8859-1 first (no q value, so it defaults to 1.0)
2) then UTF-8 (q=0.7)
3) The
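The ranking described above can be reproduced with a small parser. This is a minimal sketch in Python, not taken from any of the excerpted articles; the function name parse_accept_charset is a hypothetical helper, and it assumes that an entry without a q parameter has a weight of 1.0.

```python
def parse_accept_charset(header_value):
    """Return (charset, q) pairs sorted from most to least preferred."""
    prefs = []
    for item in header_value.split(","):
        parts = [p.strip() for p in item.split(";")]
        name = parts[0]
        if not name:
            continue
        q = 1.0  # assumption: a charset without an explicit q counts as 1.0
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # treat an unreadable weight as "not wanted"
        prefs.append((name, q))
    # sorted() is stable, so entries with equal weights keep header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


ranked = parse_accept_charset("ISO-8859-1, UTF-8; q = 0.7, *; q = 0.3")
print(ranked)  # [('ISO-8859-1', 1.0), ('UTF-8', 0.7), ('*', 0.3)]
```

The parser tolerates the spaces around "=" seen in the example header.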
The original question is as follows:
http://topic.csdn.net/u/20080902/02/a6445aa1-2e6b-45c6-a47c-79009718c0fa.html
The contents of an HTML Web page are roughly as follows:
CSDN Home Page ... .....
I use the following statement to crawl a page
Introduction: Java uses Charset to represent an encoding. This class defines methods for creating decoders and encoders and for retrieving the various names associated with a charset. Instances of this class are immutable. This class also defines static methods
When building dealspider, you must first determine the charset of the page, then convert the page to UTF-8, and finally use glib regular expressions for matching and searching. Curl itself does not provide charset detection. Previously, we saw the following in
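The detect-then-convert step can be sketched as follows. This is an illustration in Python rather than the curl and glib code the excerpt refers to; find_charset and to_utf8 are hypothetical helper names, and the sketch assumes the charset is declared either in the Content-Type header or in a meta tag within the first 1024 bytes of the page.

```python
import re

# Matches charset=gbk, charset="gbk", and charset = 'gbk'
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def find_charset(content_type, body, default="utf-8"):
    """Look for a charset in the HTTP header first, then in the HTML head."""
    match = _CHARSET_RE.search(content_type or "")
    if match is None:
        # Assumption: a meta declaration, if present, is near the top
        head = body[:1024].decode("ascii", errors="ignore")
        match = _CHARSET_RE.search(head)
    return match.group(1).lower() if match else default


def to_utf8(content_type, body):
    """Decode the page with its declared charset and re-encode as UTF-8."""
    charset = find_charset(content_type, body)
    try:
        text = body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back instead of failing
        text = body.decode("utf-8", errors="replace")
    return text.encode("utf-8")


page = '<meta charset="gbk"><title>首页</title>'.encode("gbk")
print(find_charset("text/html", page))
print(to_utf8("text/html", page).decode("utf-8"))
```

Pages that declare no charset, or declare the wrong one, still need a statistical detector; this sketch only covers the declared case.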
Use java.nio.charset.CharsetDecoder to automatically recognize character sets
I studied the methods for automatic character set recognition that can be found on the Internet. The effective method is to use the third-party
http://127.0.0.1/bom.html
Set header to: Content-Type: text/html; charset=UTF-8
Page content:
Specifically, bom.html is saved as "Unicode" (UTF-16 little-endian). That is, the file begins with the BOM FF FE, which conflicts with the UTF-8 charset declared in the header.
Open this page in IE, Chrome, Opera, and Firefox.
We can
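For reference, the BOM of such a file can be recognized from its first bytes. The following is a minimal sketch in Python; sniff_bom is a hypothetical helper, and it shows only how a BOM maps to an encoding, not how any particular browser resolves the conflict with the Content-Type header.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Simulated content of a page saved as "Unicode": FF FE followed by UTF-16 LE text
raw = b"\xff\xfe" + "<html>bom</html>".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
print(encoding)                    # utf-16-le
print(raw[skip:].decode(encoding))  # <html>bom</html>
```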