[Open Source .NET Cross-platform Data Acquisition Crawler Framework: DotnetSpider] [II] The most basic and most flexible way to use it
The previous article introduced the crawler framework's design from scratch, with an emphasis on how to use the crawler.
Data extraction Definition
Some readers also suggested that using attributes + a model class to define extraction rules is too fancy and not very practical. In fact, they may not have looked at the design carefully: the core extraction is not the attribute + model itself, but a JSON-like definition format that supports nesting and any complex scenario you can imagine. Here is the earliest version of the definition (the latest version has been modified, but the design ideas have not changed):
"Entities": [{"Targeturls": [{"Sourceregion": "", "Values": ["Http://shu.taobao.com/top/[\\d]+/search"]} ], "expression": "//div[contains (@class, ' mod ')]", "multi": true, "selector": "XPat H "," schema ": {" databasename ":" Alibaba "," tablename ":" Industry_rank " }, "Identity": "Industry_rank", "fields": [{"datatype": "Strin G (+) "," Expression ":"./h3[1] "," Name ":" category "," selector ": "XPath", "Multi": false}, {"datatype": { "Fields": [{"DataType": "String (100)", "ExPression ":".//a[1] "," name ":" keyword "," selector ":" XPath " , "Multi": false}, { "DataType": "string", "expression": "./@data-rise", "Name": "Rise", "selector": "XPath", "multi": F Alse}]}, "expression": ".//ol/l I "," multi ": True," name ":" Results "," selector ":" XPath " } ] } ]
- Entities is an array, indicating that multiple data objects can be extracted from one page
- The first field of the first entity has the data type string(100) (a common data type)
- The second field of the first entity is itself a data object
So the crawler definition is very open-ended. Extraction rules written with the attribute + model approach are first converted into the definition above and then passed to the parsing class. I designed this parsing class with cross-language use in mind: as long as you can pass in correct JSON, it can be parsed into a correct crawler. Anyone interested can write a provider in their own language; it really just amounts to serializing a few classes into JSON and handing them over.
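As a concrete illustration of that idea, here is a minimal sketch of building such a definition in C# and serializing it with Newtonsoft.Json (the same library the pipeline example below uses). The POCO class and property names here are made up for illustration and are not the framework's actual model classes; only the shape of the resulting JSON matters.

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical POCOs that mirror the JSON definition shown above.
// The real framework's model classes may differ; this only illustrates
// that "a provider in any language" boils down to emitting this JSON.
public class FieldDefinition
{
    public string DataType { get; set; }
    public string Expression { get; set; }
    public string Name { get; set; }
    public string Selector { get; set; } = "XPath";
    public bool Multi { get; set; }
}

public class EntityDefinition
{
    public string Expression { get; set; }
    public string Selector { get; set; } = "XPath";
    public bool Multi { get; set; }
    public string Identity { get; set; }
    public List<FieldDefinition> Fields { get; set; } = new List<FieldDefinition>();
}

public static class DefinitionDemo
{
    public static string Build()
    {
        var entity = new EntityDefinition
        {
            Expression = "//div[contains(@class,'mod')]",
            Multi = true,
            Identity = "industry_rank",
            Fields =
            {
                new FieldDefinition { DataType = "string(100)", Expression = "./h3[1]", Name = "category" }
            }
        };

        // The resulting JSON string is what gets handed to the parsing class.
        return JsonConvert.SerializeObject(new { Entities = new[] { entity } }, Formatting.Indented);
    }
}
```

A provider written in any other language would simply produce the same JSON string and pass it to the parsing class.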
Is it flexible enough?
It was also suggested that the attribute + model approach is not flexible enough to cover most situations. In fact, the most flexible way is to use the core library directly, i.e. the Core DLL in this project, which implements the basic crawler logic: URL scheduling, deduplication, HTML selectors, a basic downloader, multi-thread control, and so on. With it you can be as free and flexible as you like.
How to use the core library
As we said in the previous article, a complete business crawler requires 4 major modules: a downloader (already implemented), a URL scheduler (already implemented), a data extractor (to be implemented by you), and a data store (to be implemented by you). So you only need to implement 2 modules to complete a crawler.
Defining Data Objects
```csharp
public class YoukuVideo
{
    public string Name { get; set; }

    public string Volume { get; set; }
}
```
Implementation of data extraction
You only need to implement the IPageProcessor interface.
```csharp
public class MyPageProcessor : IPageProcessor
{
    public Site Site { get; set; }

    public void Process(Page page)
    {
        var totalVideoElements = page.Selectable.SelectList(Selectors.XPath("//li[@class='yk-col4 mr1']")).Nodes();
        List<YoukuVideo> results = new List<YoukuVideo>();
        foreach (var videoElement in totalVideoElements)
        {
            var video = new YoukuVideo();
            video.Name = videoElement.Select(Selectors.XPath(".//li[@class='title']/a[1]")).GetValue();
            video.Volume = videoElement.Select(Selectors.XPath(".//ul[@class='info-list']/li[3]")).GetValue();
            video.Volume = video.Volume.Replace("\r", "");
            results.Add(video);
        }
        page.AddResultItem("VideoResult", results);
    }
}
```
There are 4 points to note here:
- public Site Site { get; set; } is required, but you do not need to assign it yourself; it is set when the Spider class is initialized
- When the Page object is passed in, the downloaded HTML has already been loaded into its Selectable property, so you only need to call Selectable's interface with the appropriate XPath, CSS, JsonPath, or Regex query to get the values you want; Selectable can also be called repeatedly, e.g. on each node in a loop (see the sketch after this list)
- Selectable's GetValue returns the result with its HTML tags when true is passed in
- Put the objects you have assembled, such as the YoukuVideo list above, into the Page's result items and specify a key
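To make these points concrete, here is a small sketch of what such Selectable calls can look like inside Process. It follows the patterns used elsewhere in this article; the Selectors.Css helper and the GetValue(true) overload are assumptions based on the query types and behavior described above, so verify them against the version of the core library you are using.

```csharp
public void Process(Page page)
{
    // Single-node query: plain text value of the first match (same pattern as the example above).
    var firstTitle = page.Selectable.Select(Selectors.XPath("//li[@class='title']/a[1]")).GetValue();

    // Assumed CSS helper, mirroring Selectors.XPath; CSS is listed above as a supported query type.
    var firstPageLink = page.Selectable.Select(Selectors.Css("ul.yk-pages a")).GetValue();

    // Passing true is described above as keeping the HTML tags in the returned value.
    var rawHtml = page.Selectable.Select(Selectors.XPath("//div[@class='mod'][1]")).GetValue(true);

    // SelectList + Nodes() yields each matched element, which can be queried again in a loop.
    var names = new List<string>();
    foreach (var node in page.Selectable.SelectList(Selectors.XPath("//li[@class='yk-col4 mr1']")).Nodes())
    {
        names.Add(node.Select(Selectors.XPath(".//li[@class='title']/a[1]")).GetValue());
    }

    // Store what you assembled under a key so the pipeline can retrieve it by that key.
    page.AddResultItem("DemoResult", names);
}
```

Whatever you store here under a key is exactly what the pipeline's Process method receives through ResultItems.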
Data storage
You only need to implement IPipeline. Here you use the same key that was used to store the data in the PageProcessor to retrieve the data objects, and then you decide how to save them yourself, whether to a file, MySQL, MSSQL, or MongoDB.
```csharp
public class MyPipeline : IPipeline
{
    private string _path;

    public MyPipeline(string path)
    {
        if (string.IsNullOrEmpty(path))
        {
            throw new Exception("XXXX");
        }
        _path = path;
        if (!File.Exists(_path))
        {
            File.Create(_path);
        }
    }

    public void Process(ResultItems resultItems, ISpider spider)
    {
        foreach (YoukuVideo entry in resultItems.Results["VideoResult"])
        {
            File.AppendAllText(_path, JsonConvert.SerializeObject(entry));
        }
    }

    public void Dispose()
    {
    }
}
```
Run the crawler
After writing the two required modules above, we can run the crawler. First, download the latest DotnetSpider code and compile it; after a successful build the DLLs are in the output folder under the solution folder. Then create an empty console program and reference the necessary DLLs, such as:
Add the above 3 classes to the project.
The code to run is as follows. Note that you must add a start URL; this is the first URL the crawler starts from. If you can work out all the paging URLs up front, you can also fully initialize them here.
```csharp
public static void Main()
{
    HttpClientDownloader downloader = new HttpClientDownloader();

    var site = new Site { EncodingName = "UTF-8" };
    site.AddStartUrl("http://www.youku.com/v_olist/c_97_g__a__sg__mt__lg__q__s_1_r_0_u_0_pt_0_av_0_ag_0_sg__pr__h__d_1_p_1.html");

    Spider spider = Spider.Create(site, new MyPageProcessor(), new QueueDuplicateRemovedScheduler())
        .AddPipeline(new MyPipeline("test.json"))
        .SetThreadNum(1);
    spider.Run();
}
```
Press F5 to run the project; the results are as follows:
How to page and extract target pages
The example above crawls only a single page, so how do you extract the paging URLs, or other target pages, from a page? You only need to parse out your target URLs in the PageProcessor and add them to the Page object's target request list. We make the following changes:
```csharp
public class MyPageProcessor : IPageProcessor
{
    public Site Site { get; set; }

    public void Process(Page page)
    {
        var totalVideoElements = page.Selectable.SelectList(Selectors.XPath("//li[@class='yk-col4 mr1']")).Nodes();
        List<YoukuVideo> results = new List<YoukuVideo>();
        foreach (var videoElement in totalVideoElements)
        {
            var video = new YoukuVideo();
            video.Name = videoElement.Select(Selectors.XPath(".//li[@class='title']/a[1]")).GetValue();
            video.Volume = videoElement.Select(Selectors.XPath(".//ul[@class='info-list']/li[3]")).GetValue();
            video.Volume = video.Volume.Replace("\r", "");
            results.Add(video);
        }
        page.AddResultItem("VideoResult", results);

        foreach (var url in page.Selectable.SelectList(Selectors.XPath("//ul[@class='yk-pages']")).Links().Nodes())
        {
            page.AddTargetRequest(new Request(url.GetValue(), 0, null));
        }
    }
}
```
Re-run the program, and you can see the crawler continuously paging through and crawling the following pages.
Summary
That covers the most basic and most flexible way to use the framework. Isn't it simple? As for the friend who mentioned having to use a lot of if/else and Replace calls: yes, you do that assembly and extraction work yourself in the PageProcessor. Personally, I feel nothing is perfect; flexibility may require more code, and the inflexibility of the attribute + model approach does not make it useless. In my experience it covers 70%-80% of cases, not to mention that you can also configure various formatters on the attributes. Of course, that is related to the structure of most of the objects I crawl. Here is a quick preview of the upcoming chapters:
- HTTP header and cookie settings, POST usage
- Parsing JSON data
- Configuration-based usage (extension project)
- WebDriverDownloader usage, including basic login and manual login
- Distributed usage
The crawler framework