Curl programming: What is a non-ASCII platform, and how to obtain the charset of a page

When writing a deal spider, you need to know the charset of each page so that you can convert it to UTF-8 and then use GLib's regular expressions for matching and searching. curl itself does not provide this. Earlier, we noticed CURLOPT_CONV_TO_NETWORK_FUNCTION and CURLOPT_CONV_FROM_NETWORK_FUNCTION in the curl_easy_setopt man page and assumed they could transcode automatically. Later we found that they cannot: these two options are only used on non-ASCII platforms.
What is a non-ASCII platform? In short, not all computer systems use ASCII; IBM mainframes are one example. Text protocols such as HTTP and FTP must use ASCII, so a machine that does not use ASCII natively has to convert protocol data in both directions. The two options exist for that purpose, not for converting the charset of page content.
The following exchange addresses this question:

> I wanna convert all HTTP responses to UTF-8 because, you know, not all
> web pages are written in UTF-8. I skimmed the manual of "curl_easy_setopt",
> seems "CURLOPT_CONV_TO_NETWORK_FUNCTION",
> "CURLOPT_CONV_FROM_NETWORK_FUNCTION" do helps.

Not really. The purpose of that functionality is for platforms that do not
speak ASCII natively to provide a way to make the protocols we use that are
ASCII-based to still work fine.

For background on non-ASCII platforms, the Wikipedia article on EBCDIC explains it clearly: http://en.wikipedia.org/wiki/Extended_Binary_Coded_Decimal_Interchange_Code

If you want to obtain the charset, you can read it from the HTTP response headers. To do so, call curl_easy_setopt with CURLOPT_HEADERFUNCTION to install your own header callback, and look for a line such as "Content-Type: text/html; charset=UTF-8".
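As a rough illustration of the header approach, here is a minimal sketch of such a callback in C. The struct `charset_info` and the names `header_cb` and `has_prefix_ci` are hypothetical, introduced only for this example. The callback uses the signature libcurl documents for header callbacks, but the block deliberately avoids including curl headers so that the parsing logic can be compiled and tried on its own. It assumes a simple `charset=value` parameter and does not handle every form the header grammar allows.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: holds the charset seen in the response headers. */
struct charset_info {
    char name[64];   /* stays an empty string until a charset is found */
};

/* Case-insensitive check that s (len bytes, not NUL-terminated) starts with prefix. */
static int has_prefix_ci(const char *s, size_t len, const char *prefix)
{
    size_t n = strlen(prefix);
    if (len < n)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)s[i]) != tolower((unsigned char)prefix[i]))
            return 0;
    return 1;
}

/* Header callback: invoked once per header line. The line is size * nitems
 * bytes long and is NOT NUL-terminated. */
size_t header_cb(char *buffer, size_t size, size_t nitems, void *userdata)
{
    struct charset_info *info = userdata;
    size_t len = size * nitems;

    if (has_prefix_ci(buffer, len, "content-type:")) {
        /* Scan the rest of the line for "charset", optional blanks, then '='. */
        for (size_t i = 13; i + 7 <= len; i++) {
            if (!has_prefix_ci(buffer + i, len - i, "charset"))
                continue;
            size_t j = i + 7;
            while (j < len && (buffer[j] == ' ' || buffer[j] == '\t'))
                j++;
            if (j >= len || buffer[j] != '=')
                continue;
            j++;
            /* Skip blanks and an optional opening quote before the value. */
            while (j < len && (buffer[j] == ' ' || buffer[j] == '\t' || buffer[j] == '"'))
                j++;
            /* Copy the value up to ';', a closing quote, or whitespace/CRLF. */
            size_t k = 0;
            while (j < len && k + 1 < sizeof info->name &&
                   buffer[j] != ';' && buffer[j] != '"' &&
                   !isspace((unsigned char)buffer[j]))
                info->name[k++] = buffer[j++];
            info->name[k] = '\0';
            break;
        }
    }
    /* Report every byte as handled; returning less makes libcurl abort the transfer. */
    return len;
}
```

In a real program the callback would be registered with `curl_easy_setopt(curl, CURLOPT_HEADERFUNCTION, header_cb)` and a pointer to a `struct charset_info` passed through `CURLOPT_HEADERDATA`. An alternative that avoids a callback is to call `curl_easy_getinfo` with `CURLINFO_CONTENT_TYPE` after the transfer and parse the returned string the same way.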

However, the charset in the header is not necessarily the charset actually used in the HTML. The HTML can also declare a charset in a meta tag, but that is not always accurate either, and to parse the HTML in the first place we already need to know its charset. For this reason, some suggest writing a function that guesses the charset from the bytes themselves, which is the most principled approach. As Joel said, there is no fixed rule for determining a charset, so detection relies mostly on heuristics and is never 100% accurate. There is no perfect solution to this problem; charset detection is simply the best method available. This is also why IE and Firefox sometimes display garbled pages: their automatic charset detection guessed wrong.
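To make the byte-based idea concrete, here is a small sketch in C of one common heuristic: check for a byte order mark first, then test whether the buffer is well-formed UTF-8. The function names `is_valid_utf8` and `guess_charset` are hypothetical, and this is only one narrow heuristic. Telling legacy encodings such as GBK and Shift_JIS apart requires statistical analysis that is beyond this example, so the sketch returns NULL when it cannot decide.

```c
#include <stddef.h>

/* Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
 * Rejects overlong forms, surrogates, and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;            /* number of continuation bytes expected */
        unsigned long cp, min;   /* decoded code point and its minimum legal value */

        if (c < 0x80) { i++; continue; }                      /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                                        /* invalid lead byte */

        if (len - i <= extra)
            return 0;                                         /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;                                     /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += extra + 1;
    }
    return 1;
}

/* Best-effort guess from raw bytes. Returns a charset name, or NULL when
 * this heuristic cannot decide. A 0xFF 0xFE prefix could also begin a
 * UTF-32LE byte order mark; this sketch ignores that case. */
const char *guess_charset(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    /* Pure ASCII also passes this check, which is harmless: ASCII is a subset of UTF-8. */
    if (is_valid_utf8(buf, len))
        return "UTF-8";
    return NULL;
}
```

A reasonable way to combine the sources is to try the byte heuristic first and fall back to the header or meta tag charset when it returns NULL, keeping in mind that any of the three can be wrong.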
