The previous article introduced the crawler framework's overall design and focused on how to use it.
Data Extraction Definition
Some readers suggested that using the attribute + model approach to define extraction rules is too fancy and not very practical. In fact, they may not have looked at the design carefully: the core extraction is not driven by the attribute + model at all, but by a JSON-like definition format that supports arbitrary nesting and can express any complex scenario you can imagine. Below is the earliest version of the definition (the latest version has been revised, but the design idea has not changed):
"Entities": [ { "Targeturls": [ { "Sourceregion": "", "Values": [ "Http://shu.taobao.com/top/[\\d]+/search" ] } ], "Expression": "//div[contains (@class, ' mod ')]", "Multi":true, "Selector": "XPath", "Schema": { "DatabaseName": "Alibaba", "TableName": "Industry_rank" }, "Identity": "Industry_rank", "Fields": [ { "DataType": "String (100)", "Expression": "./h3[1]", "Name": "Category", "Selector": "XPath", "Multi":false }, { "DataType": { "Fields": [ { "DataType": "String (100)", "Expression": ".//a[1]", "Name": "Keyword", "Selector": "XPath", "Multi":false }, { "DataType": "String (100)", "Expression": "./@data-rise", "Name": "Rise", "Selector": "XPath", "Multi":false } ] }, "Expression": ".//ol/li", "Multi":true, "Name": "Results", "Selector": "XPath" } ] } ]
- Entities is an array, indicating that multiple data objects can be extracted from one page
- The first field of the first entity has the data type string(100) (a common data type)
- The second field of the first entity is itself a nested data object
Therefore, the crawler's parsing is completely decoupled. Extraction rules written with the attribute + model are first converted into the definition above and then passed to the parsing class. I designed the parsing class this way partly with cross-language use in mind: as long as you can pass in the correct JSON, it can be parsed into a correct crawler. So anyone interested can write a provider in their own language; essentially it just means writing a few classes, serializing them into JSON, and passing that over, as sketched below.
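For example, a provider only needs to emit the definition JSON. Here is a minimal C# sketch of the idea; the POCO names EntityDefine and FieldDefine are hypothetical (the real classes in DotnetSpider may be named and shaped differently):

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical POCOs mirroring the definition format shown above.
public class FieldDefine
{
    public object DataType { get; set; }   // "string(100)" or a nested definition
    public string Expression { get; set; }
    public string Name { get; set; }
    public string Selector { get; set; } = "XPath";
    public bool Multi { get; set; }
}

public class EntityDefine
{
    public string Expression { get; set; }
    public string Selector { get; set; } = "XPath";
    public bool Multi { get; set; }
    public string Identity { get; set; }
    public List<FieldDefine> Fields { get; set; } = new List<FieldDefine>();
}

public static class DefinitionDemo
{
    public static string Build()
    {
        var entity = new EntityDefine
        {
            Expression = "//div[contains(@class,'mod')]",
            Multi = true,
            Identity = "industry_rank",
            Fields =
            {
                new FieldDefine { DataType = "string(100)", Expression = "./h3[1]", Name = "Category" }
            }
        };

        // Whatever language produced it, JSON like this is what the parsing class consumes.
        return JsonConvert.SerializeObject(new { Entities = new[] { entity } });
    }
}
```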
Is it flexible enough?
It was also suggested that the attribute + model is not flexible enough to cover most situations. In fact, the most flexible way is to use the core library (the Core DLL) directly. It implements the basic crawler logic: URL scheduling, de-duplication, HTML selectors, a basic downloader, multi-thread control, and so on. Using it directly gives you as much freedom and flexibility as you need.
How to use the core library
As we said in the previous article, implementing a complete business crawler requires 4 major modules: a downloader (already implemented), a URL scheduler (already implemented), a data extractor (to be implemented by yourself), and a data store (to be implemented by yourself). So you only need to implement 2 modules to complete a crawler.
Defining Data Objects
```csharp
public class YoukuVideo
{
    public string Name { get; set; }

    public string Volume { get; set; }
}
```
Implementation of data extraction
You just have to implement the IPageProcessor interface.
```csharp
public class MyPageProcessor : IPageProcessor
{
    public Site Site { get; set; }

    public void Process(Page page)
    {
        var totalVideoElements = page.Selectable.SelectList(Selectors.XPath("//li[@class='yk-col4 mr1']")).Nodes();
        List<YoukuVideo> results = new List<YoukuVideo>();
        foreach (var videoElement in totalVideoElements)
        {
            var video = new YoukuVideo();
            video.Name = videoElement.Select(Selectors.XPath(".//li[@class='title']/a[1]")).GetValue();
            video.Volume = videoElement.Select(Selectors.XPath(".//ul[@class='info-list']/li[3]")).GetValue();
            video.Volume = video.Volume.Replace("\r", "");
            results.Add(video);
        }
        page.AddResultItem("VideoResult", results);
    }
}
```
There are 4 points to note here.
- public Site Site { get; set; } is required. You do not need to assign it a value yourself; it will be set when the Spider class is initialized
- By the time the Page object is passed in, the downloaded HTML has already been loaded into its Selectable property, so you only need to call Selectable's interface and pass in the appropriate XPath, CSS, JsonPath, or Regex query to extract the values you want; Selectable can also be queried repeatedly in a loop (see the sketch after this list)
- Selectable's GetValue returns the result with its HTML tags when passed true
- Put the objects you have assembled (such as the YoukuVideo list above) into the Page's result items via AddResultItem, and specify a key
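A minimal sketch of how those selector calls compose. The Selectors.Css and Selectors.Regex factory methods, and the GetValue(true) behavior, are assumed here from the notes above rather than taken from the article's own code:

```csharp
public class SelectorDemoProcessor : IPageProcessor
{
    public Site Site { get; set; }

    public void Process(Page page)
    {
        // XPath query, as used in the article's example.
        var title = page.Selectable.Select(Selectors.XPath("//title")).GetValue();

        // The same Selectable can be queried again with other selector types
        // (Selectors.Css / Selectors.Regex are assumptions based on the list above).
        var firstLink = page.Selectable.Select(Selectors.Css("a")).GetValue();
        var year = page.Selectable.Select(Selectors.Regex(@"\d{4}")).GetValue();

        // Passing true to GetValue is described above as keeping the HTML tags.
        var titleWithTags = page.Selectable.Select(Selectors.XPath("//title")).GetValue(true);

        page.AddResultItem("DemoResult", new { title, firstLink, year, titleWithTags });
    }
}
```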
Data storage
You only need to implement IPipeline. Here we use the key under which the data was stored in the PageProcessor to take the data objects back out; whether you then save the data to a file, MySQL, MSSQL, MongoDB, etc. is up to you to implement.
```csharp
public class MyPipeline : IPipeline
{
    private string _path;

    public MyPipeline(string path)
    {
        if (string.IsNullOrEmpty(path))
        {
            throw new Exception("XXXX");
        }
        _path = path;
        if (!File.Exists(_path))
        {
            File.Create(_path);
        }
    }

    public void Process(ResultItems resultItems, ISpider spider)
    {
        foreach (YoukuVideo entry in resultItems.Results["VideoResult"])
        {
            File.AppendAllText(_path, JsonConvert.SerializeObject(entry));
        }
    }

    public void Dispose()
    {
    }
}
```
Run the crawler
After we have written the two required modules above, we can run the crawler. First, download the latest DotnetSpider code and compile it; after a successful build, the DLLs are placed in the output folder under the solution folder. Then create an empty console program and reference the necessary DLLs.
Add the above 3 classes to the project.
The code to run is shown below. Note that you must add a start URL, which is the first URL the crawler visits; if you can compute all the paging URLs up front, you can also add them all here (see the sketch after the code).
```csharp
public static void Main()
{
    HttpClientDownloader downloader = new HttpClientDownloader();

    var site = new Site() { EncodingName = "UTF-8" };
    site.AddStartUrl("http://www.youku.com/v_olist/c_97_g__a__sg__mt__lg__q__s_1_r_0_u_0_pt_0_av_0_ag_0_sg__pr__h__d_1_p_1.html");

    Spider spider = Spider.Create(site, new MyPageProcessor(), new QueueDuplicateRemovedScheduler())
        .AddPipeline(new MyPipeline("test.json"))
        .SetThreadNum(1);
    spider.Run();
}
```
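If all the paging URLs can be computed in advance, as mentioned above, they can all be added as start URLs. A minimal sketch, assuming the list-page URL simply varies its trailing `_p_{n}` page index (the exact Youku URL pattern is an assumption here):

```csharp
// Hypothetical pre-initialization of paging URLs; the "_p_{n}" page index is assumed.
var site = new Site() { EncodingName = "UTF-8" };
for (int pageIndex = 1; pageIndex <= 10; pageIndex++)
{
    site.AddStartUrl(
        "http://www.youku.com/v_olist/c_97_g__a__sg__mt__lg__q__s_1_r_0_u_0_pt_0_av_0_ag_0_sg__pr__h__d_1_p_"
        + pageIndex + ".html");
}
```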
Press F5 to run the project; the results are as follows.
How to page and extract target pages
The example above crawls only one page, so how do you extract the paging URLs, or other target pages, from a page? You only need to parse out the target URLs in the PageProcessor and add them to the Page object's target requests. We make the following changes:
```csharp
public class MyPageProcessor : IPageProcessor
{
    public Site Site { get; set; }

    public void Process(Page page)
    {
        var totalVideoElements = page.Selectable.SelectList(Selectors.XPath("//li[@class='yk-col4 mr1']")).Nodes();
        List<YoukuVideo> results = new List<YoukuVideo>();
        foreach (var videoElement in totalVideoElements)
        {
            var video = new YoukuVideo();
            video.Name = videoElement.Select(Selectors.XPath(".//li[@class='title']/a[1]")).GetValue();
            video.Volume = videoElement.Select(Selectors.XPath(".//ul[@class='info-list']/li[3]")).GetValue();
            video.Volume = video.Volume.Replace("\r", "");
            results.Add(video);
        }
        page.AddResultItem("VideoResult", results);

        // Extract the paging links and add them as new target requests.
        foreach (var url in page.Selectable.SelectList(Selectors.XPath("//ul[@class='yk-pages']")).Links().Nodes())
        {
            page.AddTargetRequest(new Request(url.GetValue(), 0, null));
        }
    }
}
```
Re-run the program, and we can see that the crawler keeps paging through and crawling page after page.
Summary
That covers the most basic and most flexible way to use the framework. Isn't it simple? As for the friend who mentioned needing a lot of if/else and Replace calls: all of that assembly and extraction work can be done inside the PageProcessor. Personally I feel nothing is perfect; flexibility may require more code, and the supposedly inflexible attribute + model is not useless either. In my own use it covers 70%-80% of cases, not to mention that the attributes can also be configured with various formatters. Of course, that also depends on the structure of the objects I usually crawl. The following topics will be covered in later chapters:
- HTTP header and cookie settings, and POST usage
- Parsing of JSON data
- Configuration-based usage (extension project)
- WebDriverDownloader usage, including basic login and manual login
- Distributed use
[Open source .NET cross-platform data collection crawler framework: DotnetSpider] [II] The most basic and most flexible way to use it