When working on dealspider, you need to know the charset of each page, convert the page to UTF-8, and only then use glib regular expressions for matching and searching. curl itself does not provide this conversion. Earlier, we noticed two options in the curl_easy_setopt man page, CURLOPT_CONV_TO_NETWORK_FUNCTION and CURLOPT_CONV_FROM_NETWORK_FUNCTION, and assumed they could transcode automatically. Later, we found that they cannot. These two options are only used on non-ASCII platforms.
What is a non-ASCII platform? In short, not all computer systems use ASCII natively; IBM mainframes are one example. Text protocols such as HTTP and FTP must be sent in ASCII, so on such machines the protocol text has to be converted between the native encoding and ASCII. The two options exist for that purpose, not for converting the charset of page content.
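Because these options do not transcode page content, the conversion to UTF-8 has to happen in the application once the body has been downloaded. The following is a minimal sketch using POSIX iconv, which is an assumption on my part: the text above does not say which conversion library is used, and glib's g_convert would serve the same role. The helper name to_utf8 and the 4x output buffer estimate are also hypothetical choices made for illustration.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd, NUL-terminated UTF-8 copy of
 * the first in_len bytes of `in`, or NULL on failure. The caller frees it. */
static char *to_utf8(const char *in, size_t in_len, const char *from_charset)
{
    assert(in != NULL && from_charset != NULL);

    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return NULL;                      /* charset name not supported */

    /* Assumption: 4 output bytes per input byte is enough for common
     * web charsets. If it is not, iconv fails and we return NULL. */
    size_t out_size = in_len * 4 + 1;
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in;               /* iconv takes a non-const pointer */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_size - 1;       /* keep one byte for the NUL */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {               /* invalid input or buffer too small */
        free(out);
        return NULL;
    }

    *dst = '\0';
    return out;
}
```

For example, to_utf8("caf\xE9", 4, "ISO-8859-1") should yield the UTF-8 bytes for "café". The returned buffer can then be passed to the glib regular expression functions, which expect UTF-8 input.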
The same misunderstanding came up in the following exchange, quoted here in English:
> I wanna convert all HTTP responses to UTF-8 because, you know, not all
> web pages are written in UTF-8. I skimmed the manual of "curl_easy_setopt",
> seems "CURLOPT_CONV_TO_NETWORK_FUNCTION",
> "CURLOPT_CONV_FROM_NETWORK_FUNCTION" do helps.
Not really. The purpose of that functionality is for platforms that do not
speak ASCII natively to provide a way to make the protocols we use that are
ASCII-based to still work fine.
For background on non-ASCII platforms, the Wikipedia article on EBCDIC explains the issue clearly: http://en.wikipedia.org/wiki/Extended_Binary_Coded_Decimal_Interchange_Code
If you want to obtain the charset, you can read it from the HTTP response headers. With curl, you only need to call curl_easy_setopt to register your own callback via CURLOPT_HEADERFUNCTION; one of the header lines passed to that callback will look like this: "Content-Type: text/html; charset=UTF-8".
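A header callback along these lines could pick the charset out of the Content-Type line. This is only a sketch: the struct charset_info and the helper has_prefix_ci are hypothetical names, and the parsing is deliberately simple (it does not handle spaces around the "=" sign). The callback uses the signature libcurl expects for CURLOPT_HEADERFUNCTION, and it relies on the length argument because the header line libcurl passes in is not NUL-terminated.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for the charset seen in the response headers.
 * name stays an empty string if no charset parameter was found. */
struct charset_info {
    char name[64];
};

/* Case-insensitive prefix test on a buffer that is not NUL-terminated. */
static int has_prefix_ci(const char *buf, size_t len, const char *prefix)
{
    size_t n = strlen(prefix);
    if (len < n)
        return 0;
    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)buf[i]) != tolower((unsigned char)prefix[i]))
            return 0;
    }
    return 1;
}

/* Called once per header line. Register it with:
 *   curl_easy_setopt(curl, CURLOPT_HEADERFUNCTION, header_cb);
 *   curl_easy_setopt(curl, CURLOPT_HEADERDATA, &info);
 */
static size_t header_cb(char *buffer, size_t size, size_t nitems, void *userdata)
{
    struct charset_info *info = userdata;
    size_t len = size * nitems;

    assert(info != NULL);

    if (has_prefix_ci(buffer, len, "content-type:")) {
        /* Scan the rest of the line for a "charset=" parameter. */
        for (size_t i = 13; i + 8 <= len; i++) {
            if (!has_prefix_ci(buffer + i, len - i, "charset="))
                continue;
            size_t j = i + 8, k = 0;
            if (j < len && buffer[j] == '"')
                j++;                      /* skip an opening quote */
            while (j < len && k + 1 < sizeof info->name &&
                   strchr("\";, \t\r\n", buffer[j]) == NULL)
                info->name[k++] = buffer[j++];
            info->name[k] = '\0';
            break;
        }
    }
    return len;   /* any other return value makes libcurl abort the transfer */
}
```

After the transfer finishes, info.name holds the declared charset (for the example header above, "UTF-8"), or an empty string if the server sent none.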
However, the charset in the header is not always the character set actually used in the HTML. The HTML may also declare a charset in a meta tag, but that is not necessarily accurate either. On top of that, to parse the HTML and find the meta tag, you already need to know the charset in advance. For these reasons, it is often suggested to determine the charset from the bytes of the content itself, which is the most reliable approach. As Joel said, there is no fixed rule for identifying a charset, so detection relies mostly on heuristics and is not 100% accurate. There is therefore no perfect solution to this problem; charset detection is simply the best method available. This is also why IE and Firefox sometimes display garbled characters on a page: their automatic charset detection guessed wrong.
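One common byte-level heuristic can be sketched as follows: if the content is well-formed UTF-8, it is very likely UTF-8, because text in other multi-byte charsets rarely happens to form valid UTF-8 sequences. This is an illustration of the idea only, not a full detector; real detectors combine several such checks with statistics. The function name is_valid_utf8 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the buffer is well-formed UTF-8, 0 otherwise.
 * Pure ASCII input also returns 1, since ASCII is a subset of UTF-8. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need;            /* number of continuation bytes expected */
        unsigned long cp;       /* decoded code point */
        unsigned long min_cp;   /* smallest value allowed for this length */

        if (c < 0x80) {         /* single-byte (ASCII) character */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min_cp = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min_cp = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min_cp = 0x10000;
        } else {
            return 0;           /* stray continuation byte or invalid lead */
        }

        if (len - i - 1 < need)
            return 0;           /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;       /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong encodings, surrogates, and out-of-range values. */
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += need + 1;
    }
    return 1;
}
```

For instance, the bytes E4 B8 AD (the UTF-8 encoding of "中") pass the check, while D6 D0 (the same character in GBK) fail it. When the check fails, the charset from the header or the meta tag is the next thing to try.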