Building a Web Crawler (I)

It's been ages since I last blogged; lately I have been busy as a dog, and half a year has gone by without a write-up, which would make all that work feel wasted. So here begins what may become a series of summaries, all about how to build a focused crawler (a term I only learned months later), implemented on Node.js.

Background

The general crawler logic goes like this: given an initial link, download the page at that link and save it, then parse the links inside the page; for each target link, check whether it has already been requested, and if not, put it into the request queue; once a page finishes downloading, hand it to an indexer for indexing. This builds up a document library for a search engine to use. I didn't need all of that at the time: I only needed to grab data from a few sites and export specified fields as structured files, to be analyzed in Excel in the end. At most I needed the full set of a site's product data and nothing else, so whatever wasn't needed simply wasn't saved.
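To make that loop concrete, here is a minimal sketch in Node.js (the save and extractLinks helpers are hypothetical stand-ins; real code needs error handling, politeness delays, and a proper HTML parser):

    var http = require('http');
    var url = require('url');

    var queue = ['http://example.com/'];        // seed link (placeholder)
    var seen = { 'http://example.com/': true }; // links already requested

    function save(link, body) { /* hypothetical: persist page or export fields */ }

    function extractLinks(body, base) {
      // hypothetical: naive href scan; a real crawler should parse the HTML
      var links = [], re = /href="([^"]+)"/g, m;
      while ((m = re.exec(body))) links.push(url.resolve(base, m[1]));
      return links;
    }

    function crawl() {
      var link = queue.shift();
      if (!link) return;                  // queue drained: done
      http.get(link, function (res) {
        var chunks = [];
        res.on('data', function (c) { chunks.push(c); });
        res.on('end', function () {
          var body = Buffer.concat(chunks).toString(); // assumes UTF-8; more on encoding below
          save(link, body);
          extractLinks(body, link).forEach(function (next) {
            if (!seen[next]) {            // only enqueue links not yet requested
              seen[next] = true;
              queue.push(next);
            }
          });
          crawl();                        // fetch the next queued link
        });
      }).on('error', function () { crawl(); });
    }

    crawl();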

I'll record the problems and ideas in chronological order:

    1. Encoding
    2. Redirects
    3. Concurrency

Encoding

Encoding was the first problem I hit, and it lurks behind nearly all the others. To solve encoding problems thoroughly, you need to sort out three questions:

    1. Why does text come out garbled?
    2. How do I find the correct encoding of the content?
    3. How do I transcode?

Why garbled? Because the encoding used to read the string is not the "correct" one, where "correct" means the encoding the string was actually stored with in memory. Take an example: the character "你", encoded as UTF-8 [1], has the following value in memory:

    E4 BD A0

Decoding these bytes as ASCII cannot yield the correct text, because ASCII is a 7-bit encoding whose printable characters top out at 0x7E [2]. The reason Node.js prints non-UTF-8 strings as mojibake is that it decodes as UTF-8 by default, so you must find the source's real encoding and convert it to UTF-8.
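You can see this directly in Node.js (byte values from [1]):

    // "你" is stored as the three UTF-8 bytes E4 BD A0
    var buf = Buffer.from([0xe4, 0xbd, 0xa0]);
    console.log(buf.toString('utf8'));   // "你": right encoding, right text
    console.log(buf.toString('ascii'));  // mangled: Node strips the high bit of each byte
    console.log(buf.toString('latin1')); // "ä½" plus a non-breaking space: three unrelated characters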

So how do you find the content's correct encoding? Look to the HTTP protocol! The HTTP headers include a Content-Type field [3], which specifies the media type of the body: images, text, audio, and so on. We only care about text, specifically HTML, and the media type is usually followed by the character set. For example:

    Content-Type: text/html; charset=ISO-8859-4

Here both the text type (HTML) and the character set are declared. When we receive a response, we just read the Content-Type field out of the response headers, and that solves the majority of cases. But not all of them: some sites don't say which character set they use in the HTTP header, for example:

    Content-Type: text/html

That is the header baidu.com returns. So how does a browser know how to parse the page? It clearly does, since the page opens without mojibake. Looking more closely at the returned HTML file, you find a META tag in the document that specifies the encoding [4]:

When the http-equiv attribute is specified on a meta element, the element is a pragma directive. You can use this element to simulate an HTTP response header, but only if the server doesn't send the corresponding real header; you can't override the HTTP header with a meta http-equiv element.

In other words, the protocol header has higher priority, so the correct parsing order is: look for the encoding in the protocol header first, and only if it is absent, look for the http-equiv META tag in the document. (Two further articles cover UTF-8 and Unicode in detail: [5], [6].)
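Expressed as code, the lookup order is roughly this (a minimal sketch; the name findCharset is mine, the regexes are simplified, and real pages vary in quoting and attribute order):

    function findCharset(contentTypeHeader, htmlBuffer) {
      // 1. Prefer the charset declared in the Content-Type response header
      var m = /charset=([\w-]+)/i.exec(contentTypeHeader || '');
      if (m) return m[1].toLowerCase();

      // 2. Fall back to a meta declaration inside the document. The
      //    declaration itself is ASCII, so decoding the raw bytes as
      //    latin1 is safe enough for this search.
      var head = htmlBuffer.toString('latin1');
      m = /<meta[^>]+charset=["']?([\w-]+)/i.exec(head);
      if (m) return m[1].toLowerCase();

      return null; // nothing declared: fall back to a default or to detection
    }

On a response this would be called as findCharset(res.headers['content-type'], bodyBuffer).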

Once you have found the correct encoding, a library can do the transcoding; conceptually, converting from, say, GBK to UTF-8 is just substituting bytes according to the mapping between the two encodings. In Node.js, iconv and iconv-lite both handle this well. iconv wraps a native C/C++ library, so installing it can be a hassle; iconv-lite is a pure JavaScript implementation, so installation is easy, with the drawbacks that its encoding coverage may be incomplete and, at the time of writing, it had no official stable release.
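For example, with iconv-lite (a minimal sketch; the byte values are the GBK encoding of "你好"):

    var iconv = require('iconv-lite'); // npm install iconv-lite

    var gbkBytes = Buffer.from([0xc4, 0xe3, 0xba, 0xc3]);
    console.log(gbkBytes.toString('utf8'));     // mojibake: decoded with the wrong encoding
    console.log(iconv.decode(gbkBytes, 'gbk')); // "你好"

iconv.decode(buffer, encoding) is the whole trick: give it the raw bytes and the source encoding, and it hands back a proper JavaScript string.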

Thanks to a commenter for the reminder; let me add one more point. Encoding can also be obtained by detection, but on the web I don't think that is the right approach. Detection is a last resort, for when you are forced to handle content without knowing the source's encoding, whereas the HTTP protocol and the HTML document both provide ways to declare the character encoding explicitly! A probabilistic detection model has an inherent error rate; Node.js has a library for this called jschardet, which is what node-webcrawler uses, and practice tells me it guesses wrong quite often (see the sketch below). Thanks again, and other ideas are welcome.
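For reference, detection looks like this; jschardet returns a guess plus a confidence score, which is precisely why it should only be a fallback:

    var jschardet = require('jschardet'); // npm install jschardet

    var bytes = Buffer.from([0xc4, 0xe3, 0xba, 0xc3]); // GBK bytes of "你好"
    console.log(jschardet.detect(bytes));
    // => an object like { encoding: ..., confidence: ... }; the guess can be wrong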

[1]. http://zh.wikipedia.org/wiki/UTF-8
[2]. http://zh.wikipedia.org/wiki/ASCII
[3]. http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
[4]. http://www.w3.org/wiki/HTML/Elements/meta
[5]. A probe into file encoding formats
[6]. Unicode explained in detail
