Crawler Configuration Prerequisites: A Summary of DOM Node Selection with jQuery, querySelector, and Cheerio

Author: fbysss

QQ: Wine bar Bar I scattered

Blog: blog.csdn.net/fbysss

Statement: This is an original article by fbysss. Please credit the source when reprinting.


Preface

Crawling web pages is time-consuming and tedious. Because page formats differ from site to site, it is hard to rely entirely on automatic machine recognition.

In general, we can use CSS selectors to select DOM nodes and extract what we need from the whole page.

On the front end, the most familiar tool is probably jQuery. If jQuery does not work well on a page, you can use the native document.querySelectorAll directly; most browsers now support it.

In a Node.js crawler, the DOM is typically parsed with the Cheerio module (which can be thought of as jQuery for the back end).

Although Cheerio closely imitates jQuery, there are some differences, and some features have not been implemented yet. Try to keep it updated to the latest version.

This article does not list every expression; it focuses on some DOM selection expressions and related working methods.

The front-end examples are also expressions that Cheerio supports; the test environment is Chrome.

Expressions

1. Simple Expression

document.querySelectorAll("div"); // tag
document.querySelectorAll(".classA"); // single class
document.querySelectorAll("#idA"); // id selector

2. Hierarchical Relationship

document.querySelectorAll("div span");
document.querySelectorAll("div span.classA");

3. Multiple classes

<div class="classA classB">

document.querySelectorAll(".classA.classB");
document.querySelectorAll("[class='classA classB']"); // if the value in the brackets contains spaces, it must be quoted

4. Multiple selection (OR relationship)

<div class="classA">
<div class="classB">

document.querySelectorAll(".classA, .classB");

<div id="article">
<div class="description">

In Cheerio, "#article, .description" does not work; only ".description, #article" can be used.

5. Next element

document.querySelectorAll('h1+figure'); // selects the figure element immediately following an h1

6. Special symbols

<div class="tw:classA">
document.querySelectorAll("div[class='tw:classA']"); // the value contains a colon, so it must be quoted
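
When attribute values come from crawler configuration rather than being typed by hand, it can help to always quote them. The helper below is a hypothetical sketch (attrEquals is not part of jQuery, Cheerio, or the DOM API) that builds a quoted attribute selector and escapes the characters that would otherwise end the quoted value.

```javascript
// Hypothetical helper: build an attribute selector with a safely quoted value.
function attrEquals(tag, attr, value) {
  // Escape backslashes and double quotes so the value can sit inside "...".
  const escaped = String(value).replace(/["\\]/g, '\\$&');
  return tag + '[' + attr + '="' + escaped + '"]';
}

const selector = attrEquals('div', 'class', 'tw:classA');
console.log(selector); // div[class="tw:classA"]
```

The resulting string can be passed to document.querySelectorAll or to Cheerio's $ in the same way as a hand-written selector.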

7. Wildcard characters

document.querySelectorAll("input[id^='code']"); // all input tags whose id attribute starts with "code"
document.querySelectorAll("input[id$='code']"); // all input tags whose id attribute ends with "code"
document.querySelectorAll("input[id*='code']"); // all input tags whose id attribute contains "code"
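
To make the meaning of the three operators concrete, here is a plain-JavaScript sketch that mirrors them with string methods. The matchAttr helper and the sample ids are hypothetical and only illustrate the matching rules; real selection is still done by the selector engine.

```javascript
// Hypothetical helper mirroring the ^=, $= and *= attribute operators.
function matchAttr(value, operator, needle) {
  // In CSS, an empty needle matches nothing for these three operators.
  if (needle === '') return false;
  switch (operator) {
    case '^=': return value.startsWith(needle); // starts with
    case '$=': return value.endsWith(needle);   // ends with
    case '*=': return value.includes(needle);   // contains
    default: throw new Error('unsupported operator: ' + operator);
  }
}

// Made-up id values standing in for <input id="..."> elements.
const ids = ['codeInput', 'zipcode', 'barcodeField', 'name'];

const startsWithCode = ids.filter((id) => matchAttr(id, '^=', 'code'));
const endsWithCode = ids.filter((id) => matchAttr(id, '$=', 'code'));
const containsCode = ids.filter((id) => matchAttr(id, '*=', 'code'));

console.log(startsWithCode, endsWithCode, containsCode);
```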


8. Not equal to

document.querySelectorAll("div:not([id^=\"blog\"])"); // divs whose id does not start with "blog"

Cheerio example:

var cheerio = require("cheerio");
var str = '<div class=redColor>fbysss</div> <div class=blueColor></div> <div class=yellowColor></div> <div class=normal></div>';
var $ = cheerio.load(str);
var len = $("div:not([class$=\"color\"])").length;
console.log("len is:", len);

9. Ignore case

Using the same example as above, change

var len = $("div:not([class$=\"color\"])").length;

to

var len = $("div:not([class$=\"color\" i])").length; // \ is the escape character; it is needed when quotes are nested several levels deep and the single quotes have already been used up

Reference:

http://stackoverflow.com/questions/5671238/css-selector-case-insensitive-for-attributes

10. Select text nodes

$(elem)
  .contents()
  .filter(function () {
    return this.nodeType === 3; // Node.TEXT_NODE
  });

11. Select comment nodes

var cheerio = require('cheerio');
var str = ' <!--comment node text--> <span><!--Comment In span--><ul><!--comment in ul--></ul></span>  <!--comment node 2-->  ';
var $ = cheerio.load(str);
$.root().find('*').contents().filter(function () { return this.nodeType === 8; }).length; // note that this is 2
var str2 = "<div id='domroot'>" + str + "</div>";
$ = cheerio.load(str2);
$.root().find('*').contents().filter(function () { return this.nodeType === 8; }).length; // 4. To pick up all comment nodes, wrap the string in a root node first
$.root().find('*') can also be written directly as $('*').



12. Display issues in Chrome

In recent versions of Chrome, the console display has changed. When you select nodes with document.querySelectorAll and want to inspect a single node, append [0]; otherwise the whole NodeList is returned, which is not very intuitive. For example: document.querySelectorAll("div")[0]


For multiple nodes, you can iterate like this:

var cates = document.querySelectorAll('.article-title-link');
[].forEach.call(cates, function (cate) { console.log(cate.href) })
or
[].forEach.call(cates, function (cate) { console.log(cate) })


A better way:
document.querySelectorAll('.article-title-link').forEach(function (cate) { console.log(cate) })
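
The [].forEach.call form works because forEach only needs an object with numeric indexes and a length. The sketch below shows this with a plain array-like object standing in for a NodeList (an assumption made so the snippet runs outside a browser), and also shows Array.from, which turns the collection into a real array so that map and filter become available.

```javascript
// Stand-in for a NodeList: an array-like object with indexed entries and a length.
const fakeNodeList = { 0: { href: '/a' }, 1: { href: '/b' }, length: 2 };

// Borrow Array.prototype.forEach, as in the [].forEach.call form above.
const viaCall = [];
Array.prototype.forEach.call(fakeNodeList, function (item) {
  viaCall.push(item.href);
});

// Convert to a real array first, then use any array method.
const viaArrayFrom = Array.from(fakeNodeList).map(function (item) {
  return item.href;
});

console.log(viaCall, viaArrayFrom);
```

In the browser console the same two forms apply to the result of document.querySelectorAll.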

Other problems encountered and how to resolve them:

1. Chrome suddenly stopped printing anything in the console. It turned out to be an option problem: in the filter at the top of the console, select "All". Previously only "Errors" was selected, so nothing else was shown.

2. jQuery: in the examples above, the front end uses document.querySelectorAll and the back end uses Cheerio.

If you want to use jQuery on the front end, be aware that it works on some websites and not on others. This depends on the jQuery version the browser supports and on the jQuery version each website references.

If nothing can be selected, first test in the console: $().jquery reports the current jQuery version.

Try running the following code:

var jq = document.createElement('script');
jq.src = "http://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js";
document.getElementsByTagName('head')[0].appendChild(jq);
jQuery.noConflict(); // run this line after the script has finished loading

For HTTPS pages, change http to https in the script URL; otherwise an error may be reported: VM273:4 Mixed Content: The page at 'https://countrycode.org/' was loaded over HTTPS, but requested an insecure script 'http://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js'. This request has been blocked; the content must be served over HTTPS.
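
One way to avoid the mixed-content error is to pick the script URL's protocol from the page being inspected. The sketch below is an illustration under that idea: matchPageProtocol is a hypothetical helper, and the injection part only runs where a document object exists (that is, in a browser console).

```javascript
// Hypothetical helper: upgrade an http:// script URL when the page itself is HTTPS.
function matchPageProtocol(scriptUrl, pageProtocol) {
  if (pageProtocol === 'https:') {
    return scriptUrl.replace(/^http:\/\//i, 'https://');
  }
  return scriptUrl;
}

const original = 'http://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js';
const upgraded = matchPageProtocol(original, 'https:');
console.log(upgraded);

// In a browser console, inject the script and call noConflict once it has loaded.
if (typeof document !== 'undefined') {
  const jq = document.createElement('script');
  jq.src = matchPageProtocol(original, location.protocol);
  jq.onload = function () {
    jQuery.noConflict();
  };
  document.getElementsByTagName('head')[0].appendChild(jq);
}
```

Calling jQuery.noConflict() from the onload handler avoids running it before the injected script has been fetched.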

Conclusion

The above is the experience I have summarized while configuring crawlers. I hope it will be helpful to those who work on web crawlers.

