[Open Source .NET Cross-platform Data Acquisition Crawler Framework: DotnetSpider] [II] The most basic, most flexible way to use it


The previous article introduced the framework design of the crawler from the ground up; this one focuses on how to actually use it.

Data Extraction Definition

It was also suggested that defining extraction rules with the attribute + model approach is too fancy and not very practical. That commenter may not have looked at my design carefully: the core extraction is not the attribute + model itself, but a JSON-like definition format that allows all kinds of nesting and covers any complex scenario you can imagine. See the earliest version of the definition below (the latest version has been modified, but the design ideas have not changed):

"Entities": [        {            "Targeturls": [                {                    "Sourceregion": "",                    "Values": [                        "Http://shu.taobao.com/top/[\\d]+/search"                    ]                }            ],            "Expression": "//div[contains (@class, ' mod ')]",            "Multi":true,            "Selector": "XPath",            "Schema": {                "DatabaseName": "Alibaba",                "TableName": "Industry_rank"            },            "Identity": "Industry_rank",            "Fields": [                {                    "DataType": "String (100)",                    "Expression": "./h3[1]",                    "Name": "Category",                    "Selector": "XPath",                    "Multi":false                },                {                    "DataType": {                        "Fields": [                            {                                "DataType": "String (100)",                                "Expression": ".//a[1]",                                "Name": "Keyword",                                "Selector": "XPath",                                "Multi":false                            },                            {                                "DataType": "String (100)",                                "Expression": "./@data-rise",                                "Name": "Rise",                                "Selector": "XPath",                                "Multi":false                            }                        ]                    },                    "Expression": ".//ol/li",                    "Multi":true,                    "Name": "Results",                    "Selector": "XPath"                }            ]        }    ]
    1. Entities is an array, meaning that multiple data objects can be extracted from a single page.
    2. The first field of the first entity has the common data type String(100).
    3. The second field of the first entity is itself a nested data object with its own fields.

Therefore, crawler parsing is extremely open. The attribute + model extraction is first converted into the definition above and then passed to the parser class. One reason I designed this parser class is cross-language support: as long as correct JSON can be passed in, it can be parsed into a correct crawler. So anyone interested can write a provider in their own language; it really just means writing a few classes, serializing them into JSON and passing that over.
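To make the cross-language idea concrete, here is a minimal sketch of what such a "provider" could look like in C#: plain classes mirroring the JSON definition above, serialized with Newtonsoft.Json. The class and property names are illustrative only, not the framework's actual model types.

    using System.Collections.Generic;
    using Newtonsoft.Json;

    // Illustrative classes that mirror the JSON definition above.
    public class FieldDefinition
    {
        public object DataType { get; set; }   // "String(100)" or a nested definition
        public string Expression { get; set; }
        public string Name { get; set; }
        public string Selector { get; set; } = "XPath";
        public bool Multi { get; set; }
    }

    public class EntityDefinition
    {
        public string Expression { get; set; }
        public bool Multi { get; set; }
        public string Selector { get; set; } = "XPath";
        public string Identity { get; set; }
        public List<FieldDefinition> Fields { get; set; } = new List<FieldDefinition>();
    }

    public static class DefinitionBuilder
    {
        public static string Build()
        {
            var entity = new EntityDefinition
            {
                Expression = "//div[contains(@class,'mod')]",
                Multi = true,
                Identity = "Industry_rank",
                Fields =
                {
                    new FieldDefinition { DataType = "String(100)", Expression = "./h3[1]", Name = "Category" }
                }
            };
            // Any language that can produce this JSON can drive the same parser.
            return JsonConvert.SerializeObject(new { Entities = new[] { entity } }, Formatting.Indented);
        }
    }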

Is it flexible enough?

It was also suggested that the attribute + model is not flexible enough to cover most situations. In fact, the most flexible way is to use the core library, i.e. the Core DLL in this project, which implements the basic crawler logic: URL scheduling, de-duplication, HTML selectors, a basic downloader, multi-threading control, and so on. That gives you as much freedom and flexibility as you want.

How to use the core library

As we said in the previous article, a complete business crawler requires 4 major modules: a downloader (already implemented), a URL scheduler (already implemented), a data extractor (implemented by yourself) and a data store (implemented by yourself). So you only need to implement 2 modules to complete a crawler.
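For orientation, the two extension points you implement yourself look roughly as follows. This is a simplified sketch inferred from the example code later in this article, not the library's exact interface definitions.

    // Simplified sketch of the two extension points used in this article.
    // Signatures are inferred from the examples below, not copied from the library.
    public interface IPageProcessor
    {
        Site Site { get; set; }

        // Extract data from a downloaded page and/or add new target URLs.
        void Process(Page page);
    }

    public interface IPipeline : System.IDisposable
    {
        // Persist the data objects collected by the page processor.
        void Process(ResultItems resultItems, ISpider spider);
    }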

Defining Data Objects
    public class YoukuVideo
    {
        public string Name { get; set; }

        public string Volume { get; set; }
    }

Implementation of data extraction

You just have to implement the IPageProcessor interface.

    public class MyPageProcessor : IPageProcessor
    {
        public Site Site { get; set; }

        public void Process(Page page)
        {
            // Select every video block on the page.
            var totalVideoElements = page.Selectable.SelectList(Selectors.XPath("//li[@class='yk-col4 mr1']")).Nodes();
            List<YoukuVideo> results = new List<YoukuVideo>();
            foreach (var videoElement in totalVideoElements)
            {
                var video = new YoukuVideo();
                video.Name = videoElement.Select(Selectors.XPath(".//li[@class='title']/a[1]")).GetValue();
                video.Volume = videoElement.Select(Selectors.XPath(".//ul[@class='info-list']/li[3]")).GetValue();
                video.Volume = video.Volume.Replace("\r", "");
                results.Add(video);
            }
            // Store the assembled objects under a key for the pipeline to pick up.
            page.AddResultItem("VideoResult", results);
        }
    }

There are 4 points to note here.

    1. public Site Site { get; set; } is required; you do not need to assign it yourself, it will be set when the Spider class is initialized.
    2. When the Page object is passed in, the downloaded HTML has already been loaded into its Selectable property, so you only need to call Selectable's interface with the appropriate XPath, CSS, JsonPath or Regex query to get the values you want, and Selectable can be queried repeatedly, including on selected sub-elements (see the short sketch after this list).
    3. When true is passed to Selectable's GetValue, the returned result keeps the HTML tags.
    4. Put the objects you have assembled, such as the YoukuVideo list above, into the Page's ResultItems and specify a key for them.
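A short sketch of points 2 and 3, following the selector calls used in this article; the GetValue(bool) overload is assumed from point 3 and its exact signature may differ between DotnetSpider versions.

    // Sketch only: illustrates querying Selectable and the GetValue(true) behaviour described in point 3.
    public static class SelectableExamples
    {
        public static void Show(Page page)
        {
            // Point 2: the downloaded HTML is already loaded into page.Selectable,
            // so it can be queried directly with XPath (or CSS, JsonPath, Regex).
            var title = page.Selectable
                .Select(Selectors.XPath("//li[@class='title']/a[1]"))
                .GetValue();        // plain text value

            // Point 3: passing true keeps the HTML tags in the returned value.
            var titleHtml = page.Selectable
                .Select(Selectors.XPath("//li[@class='title']/a[1]"))
                .GetValue(true);
        }
    }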

Data storage

You just need to implement IPipeline. Here we use the key specified in the PageProcessor to take the data objects out of the ResultItems, and then it is up to you to save them to a file, MySQL, MSSQL, MongoDB or anywhere else.

    public class MyPipeline : IPipeline
    {
        private string _path;

        public MyPipeline(string path)
        {
            if (string.IsNullOrEmpty(path))
            {
                throw new Exception("XXXX");
            }
            _path = path;
            if (!File.Exists(_path))
            {
                File.Create(_path);
            }
        }

        public void Process(ResultItems resultItems, ISpider spider)
        {
            // Take the data objects out by the key set in the page processor and append them to a file.
            foreach (YoukuVideo entry in resultItems.Results["VideoResult"])
            {
                File.AppendAllText(_path, JsonConvert.SerializeObject(entry));
            }
        }

        public void Dispose()
        {
        }
    }

Running the crawler

After writing the two required modules above, we can run the crawler. First download the latest DotnetSpider code and compile it; after a successful compilation the DLLs are copied to the output folder under the solution folder. Then create an empty console program and reference the necessary DLLs.

Add the above 3 classes to the project.

The run code is as follows. Note that you must add a start URL; this is the first URL the crawler starts from. If you can compute all the paging URLs up front, you can also initialize them all here (see the sketch after the code below).

    public static void Main()
    {
        HttpClientDownloader downloader = new HttpClientDownloader();

        // Define the site and its first start URL.
        var site = new Site() { EncodingName = "UTF-8" };
        site.AddStartUrl("http://www.youku.com/v_olist/c_97_g__a__sg__mt__lg__q__s_1_r_0_u_0_pt_0_av_0_ag_0_sg__pr__h__d_1_p_1.html");

        // Assemble the spider: page processor + scheduler + pipeline, single thread.
        Spider spider = Spider.Create(site, new MyPageProcessor(), new QueueDuplicateRemovedScheduler())
            .AddPipeline(new MyPipeline("test.json"))
            .SetThreadNum(1);
        spider.Run();
    }
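As mentioned above, if all the paging URLs can be computed up front, they can all be added as start URLs instead of being discovered while crawling. This is only a sketch: the "p_{n}" suffix pattern is assumed from the example URL and may not match the real paging scheme.

    // Sketch: add every precomputed paging URL as a start URL.
    private static void AddAllStartUrls(Site site, int totalPages)
    {
        for (int pageIndex = 1; pageIndex <= totalPages; pageIndex++)
        {
            site.AddStartUrl(
                "http://www.youku.com/v_olist/c_97_g__a__sg__mt__lg__q__s_1_r_0_u_0_pt_0_av_0_ag_0_sg__pr__h__d_1_p_"
                + pageIndex + ".html");
        }
    }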

Press F5 to run the project; the crawler runs and writes out the extracted results (to test.json in this example).

How to page and extract target pages

The above example crawls only a single page, so how do you extract paging URLs or other target pages from a page? You only need to parse your target URLs in the PageProcessor and add them to the Page object's target requests. We make the following changes:

    public class MyPageProcessor : IPageProcessor
    {
        public Site Site { get; set; }

        public void Process(Page page)
        {
            var totalVideoElements = page.Selectable.SelectList(Selectors.XPath("//li[@class='yk-col4 mr1']")).Nodes();
            List<YoukuVideo> results = new List<YoukuVideo>();
            foreach (var videoElement in totalVideoElements)
            {
                var video = new YoukuVideo();
                video.Name = videoElement.Select(Selectors.XPath(".//li[@class='title']/a[1]")).GetValue();
                video.Volume = videoElement.Select(Selectors.XPath(".//ul[@class='info-list']/li[3]")).GetValue();
                video.Volume = video.Volume.Replace("\r", "");
                results.Add(video);
            }
            page.AddResultItem("VideoResult", results);

            // Add every paging link found on the page as a new target request.
            foreach (var url in page.Selectable.SelectList(Selectors.XPath("//ul[@class='yk-pages']")).Links().Nodes())
            {
                page.AddTargetRequest(new Request(url.GetValue(), 0, null));
            }
        }
    }

Re-run the program and you can see the crawler keep following the paging links and crawling page after page.

Summary

That covers the most basic and most flexible way to use the framework. Isn't it simple? As for the friend who mentioned having to use a lot of if/else and Replace calls: you can do all of that assembly and extraction work inside the PageProcessor. Personally I feel nothing is perfect; flexibility may require more code, and the "inflexible" attribute + model approach is far from useless, since in my experience it covers 70%-80% of cases, not to mention that various formatters can also be configured on the attributes. Of course, that also depends on the structure of most of the objects I crawl. The following chapters will cover:

    • HTTP header and cookie settings, POST usage
    • Parsing of JSON data
    • Configuration-based usage (extension project)
    • WebDriverDownloader usage, including basic login and manual login
    • Distributed usage

