Author: fbysss
QQ: Wine bar Bar I scattered
Blog: blog.csdn.net/fbysss
Statement: This is an original article by fbysss. Please credit the source when reprinting.
Preface
Crawling web pages is time-consuming and tedious work. Because page formats vary, it is difficult to rely entirely on automatic machine recognition.
In general, we can use CSS selectors to select DOM nodes and extract what we need from the whole page.
On the front end, the most familiar tool is probably jQuery. If jQuery does not work, you can use the native document.querySelectorAll directly; most browsers now support it.
For a Node.js crawler, the DOM is typically parsed with the Cheerio module (which can be thought of as jQuery for the back end).
Although Cheerio closely imitates jQuery, there are some differences, and some features have not been implemented yet. Try to update to the latest version.
Rather than listing every expression, this article focuses on some DOM selections and related techniques.
The front-end examples are also expressions supported by Cheerio; the test environment is Chrome.
Expressions
1. Simple Expression
document.querySelectorAll("div"); // tag
document.querySelectorAll(".classA"); // single class
document.querySelectorAll("#idA"); // id selector
2. Hierarchical Relationship
document.querySelectorAll("div span");
document.querySelectorAll("div span.classA");
3. Multiple Classes
<div class="classA classB">
document.querySelectorAll(".classA.classB");
document.querySelectorAll("[class='classA classB']"); // if the value inside the brackets contains spaces, it must be quoted
4. Multiple Selection (OR relationship)
<div class="classA">
<div class="classB">
document.querySelectorAll(".classA,.classB");
<div id="article">
<div class="description">
"#article,.description" does not work in Cheerio; only ".description,#article" does.
5. Next Element
document.querySelectorAll('h1+figure'); // select the figure element immediately following an h1
6. Special Characters
<div class="tw:classA">
document.querySelectorAll("div[class='tw:classA']"); // a value containing special characters such as : must be quoted
7. Wildcards
document.querySelectorAll("input[id^='code']"); // all input tags whose id attribute starts with code
document.querySelectorAll("input[id$='code']"); // all input tags whose id attribute ends with code
document.querySelectorAll("input[id*='code']"); // all input tags whose id attribute contains code
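The three operators above behave like simple string tests on the attribute value. Here is a minimal sketch in plain JavaScript that runs in Node without a DOM. Note that matchesAttr and the sample ids are hypothetical, written only for illustration; they are not part of Cheerio or the browser API.

```javascript
// Hypothetical helper: mimics how ^=, $= and *= compare an attribute value.
// Per the CSS Selectors spec, an empty needle never matches for these operators.
function matchesAttr(value, op, needle) {
  if (needle === '') return false;
  switch (op) {
    case '^=': return value.startsWith(needle); // starts with
    case '$=': return value.endsWith(needle);   // ends with
    case '*=': return value.includes(needle);   // contains
    default: throw new Error('unsupported operator: ' + op);
  }
}

// Made-up id values, standing in for the id attributes of input tags.
var ids = ['codeInput', 'zipcode', 'barcodeField', 'name'];

['^=', '$=', '*='].forEach(function (op) {
  var hits = ids.filter(function (id) { return matchesAttr(id, op, 'code'); });
  console.log('input[id' + op + "'code'] ->", hits.join(', '));
});
```

With these sample values, ^= matches only codeInput, $= matches only zipcode, and *= matches the first three.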
8. Negation
document.querySelectorAll("div:not([id^=\"blog\"])"); // divs whose id does not start with blog
Cheerio example:
var cheerio = require("cheerio");
var str = '<div class=redcolor>fbysss</div> <div class=bluecolor></div> <div class=yellowcolor></div> <div class=normal></div>';
var $ = cheerio.load(str);
var len = $("div:not([class$=\"color\"])").length;
console.log("len is:", len);
9. Ignore Case
Using the same example as above:
var len = $("div:not([class$=\"color\"])").length; // case-sensitive comparison
var len = $("div:not([class$=\"color\" i])").length; // the i flag ignores case. \ is the escape character: with several levels of nesting, once the single quotes are used up, you need to escape the inner double quotes
Reference:
http://stackoverflow.com/questions/5671238/css-selector-case-insensitive-for-attributes
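To see what the i flag changes, the comparison can be modelled in plain JavaScript. This is only a sketch: endsWithValue, countNotEnding and the mixed-case class names are assumptions made for illustration, and toLowerCase is used as a rough stand-in for the ASCII case-insensitive matching that selectors perform.

```javascript
// Hypothetical model of [class$="..."] with and without the i flag.
function endsWithValue(value, needle, ignoreCase) {
  if (needle === '') return false; // an empty value never matches
  if (ignoreCase) {
    value = value.toLowerCase();
    needle = needle.toLowerCase();
  }
  return value.endsWith(needle);
}

// Counts the values that do NOT end with needle, like div:not([class$="..."]).
function countNotEnding(classes, needle, ignoreCase) {
  return classes.filter(function (c) {
    return !endsWithValue(c, needle, ignoreCase);
  }).length;
}

// Made-up class values with mixed case.
var classes = ['redColor', 'blueCOLOR', 'yellowcolor', 'normal'];

console.log(countNotEnding(classes, 'color', false)); // 3: only yellowcolor ends with "color"
console.log(countNotEnding(classes, 'color', true));  // 1: only normal is left
```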
10. Select Text Nodes
$(elem)
  .contents()
  .filter(function () {
    return this.nodeType === 3; // Node.TEXT_NODE
  });
11. Select Comment Nodes
var cheerio = require('cheerio');
var str = ' <!--comment node text--> <span><!--comment in span--><ul><!--comment in ul--></ul></span> <!--comment node 2--> ';
var $ = cheerio.load(str);
$.root().find('*').contents().filter(function () { return this.nodeType === 8; }).length; // note: the result here is 2
var str2 = "<div id='domroot'>" + str + "</div>";
$ = cheerio.load(str2);
$.root().find('*').contents().filter(function () { return this.nodeType === 8; }).length; // 4. To select all comment nodes, wrap the string in a root node first
$.root().find('*') can also be written directly as $('*')
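Why 2 and then 4? contents() returns the children of the matched elements, so a comment that sits at the top level, outside every element, is never visited. The sketch below models this with plain objects instead of Cheerio. The el, comment, allElements and countComments helpers are hypothetical, whitespace text nodes are left out, and it assumes the parser adds no implicit wrapper elements around the fragment.

```javascript
// Node type constants as defined by the DOM: 1 = element, 8 = comment.
var ELEMENT = 1, COMMENT = 8;

function el(name, children) {
  return { nodeType: ELEMENT, name: name, children: children || [] };
}
function comment(text) {
  return { nodeType: COMMENT, data: text, children: [] };
}

// Collect every element at any depth, like find('*') from the root.
function allElements(nodes) {
  var out = [];
  nodes.forEach(function (n) {
    if (n.nodeType === ELEMENT) {
      out.push(n);
      out = out.concat(allElements(n.children));
    }
  });
  return out;
}

// Count comments among the direct children of every element,
// like find('*').contents().filter(nodeType === 8).length.
function countComments(rootNodes) {
  return allElements(rootNodes).reduce(function (sum, e) {
    return sum + e.children.filter(function (c) {
      return c.nodeType === COMMENT;
    }).length;
  }, 0);
}

var fragment = [
  comment('comment node text'),
  el('span', [comment('comment in span'), el('ul', [comment('comment in ul')])]),
  comment('comment node 2')
];

console.log(countComments(fragment));              // 2: top-level comments are skipped
console.log(countComments([el('div', fragment)])); // 4: the wrapper makes them children
```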
12. Display issues in Chrome:
In the latest version of Chrome, the console display has changed. When you select nodes with document.querySelectorAll and want to see a single node, you need to add [0]; otherwise a large NodeList is returned, which is not very intuitive. For example: document.querySelectorAll("div")[0]
For multiple nodes, you can write:
var cates = document.querySelectorAll('.article-title-link');
[].forEach.call(cates, function (cate) { console.log(cate.href) }) or [].forEach.call(cates, function (cate) { console.log(cate) })
A better way:
document.querySelectorAll('.article-title-link').forEach(function (cate) { console.log(cate) })
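The [].forEach.call form works because forEach only needs indexed entries and a length. The sketch below shows this in Node with a plain object standing in for a NodeList. The fakeNodeList object and its href values are made up for illustration. Array.from is standard JavaScript, not something the examples above use, and is shown here as one more option.

```javascript
// Stand-in for a NodeList: indexed entries plus a length, but no array methods.
var fakeNodeList = { 0: { href: '/a' }, 1: { href: '/b' }, length: 2 };

// Borrow Array.prototype.forEach, as in the first form above.
var hrefs = [];
[].forEach.call(fakeNodeList, function (cate) { hrefs.push(cate.href); });
console.log(hrefs);

// Array.from copies the entries into a real array, so map and filter work too.
var viaFrom = Array.from(fakeNodeList).map(function (cate) { return cate.href; });
console.log(viaFrom);
```

In the browser, the same Array.from call can be applied to the result of document.querySelectorAll.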
Other problems encountered and their solutions:
1. The Chrome console suddenly stopped showing any output. It turned out to be a filter setting: select "All" at the top of the console. "Errors" had been selected, so nothing else was displayed.
2. jQuery: in the examples above, the front-end ones use document.querySelectorAll and the back-end ones use Cheerio.
If you want to use jQuery on the front end, be aware that it works on some websites and not on others. This depends on the jQuery version the browser supports and on the jQuery version each website references.
If selection does not work, first test in the console: $().jquery shows the current jQuery version.
Try running the following code:
var jq = document.createElement('script');
jq.src = "http://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js";
document.getElementsByTagName('head')[0].appendChild(jq);
jQuery.noConflict(); // run this once the script has finished loading
For HTTPS pages, change http to https in the script URL, otherwise you may get an error: VM273:4 Mixed Content: The page at 'https://countrycode.org/' was loaded over HTTPS, but requested an insecure script 'http://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js'. This request has been blocked; the content must be served over HTTPS.
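One way to avoid editing the URL by hand is to pick the scheme from the page's own protocol. This is a minimal sketch: scriptUrlFor is a hypothetical helper, and in the browser console the first argument would be location.protocol.

```javascript
// Hypothetical helper: upgrade an http:// script URL when the page itself is HTTPS.
function scriptUrlFor(pageProtocol, scriptUrl) {
  var u = new URL(scriptUrl);
  if (pageProtocol === 'https:' && u.protocol === 'http:') {
    u.protocol = 'https:';
  }
  return u.href;
}

var src = 'http://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js';
console.log(scriptUrlFor('https:', src)); // upgraded to https://
console.log(scriptUrlFor('http:', src));  // left unchanged
```

In the injection code above, this would become jq.src = scriptUrlFor(location.protocol, src).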
Conclusion
The above is the experience I have summed up while configuring crawlers. I believe it will be helpful to anyone working on web crawlers.