Golang Web Crawler Framework gocolly/colly (Part 1)



Gocolly has 3400+ stars on GitHub, ranking it at the top of Go crawler projects. gocolly is fast and elegant: it can issue more than 1K requests per second on a single core; it exposes a set of callback-based interfaces with which any kind of crawler can be implemented; and, thanks to its goquery dependency, elements can be selected much like with jQuery.

Gocolly's official website is http://go-colly.org/, which provides detailed documentation and sample code. Install colly:

go get -u github.com/gocolly/colly/...

  

Import the package in your code:

import "github.com/gocolly/colly"

The main body of colly is the Collector object, which manages network communication and is responsible for executing the attached callback functions while a job is running. To use colly, first initialize a Collector:

c := colly.NewCollector()
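
NewCollector also accepts functional options that configure the Collector at creation time. The following is a minimal sketch, assuming the colly.AllowedDomains, colly.MaxDepth and colly.UserAgent options from the colly package; the domain and User-Agent string are only placeholders:

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Configure the Collector at creation time instead of setting its fields afterwards.
	c := colly.NewCollector(
		colly.AllowedDomains("go-colly.org"),   // only crawl this domain
		colly.MaxDepth(2),                      // follow links at most 2 levels deep
		colly.UserAgent("example-crawler/1.0"), // custom User-Agent header
	)

	// Print every URL the Collector is about to request.
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}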

You can attach various types of callback functions to the Collector to control a collection job or to retrieve information. Adding callback functions:

c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL)
})

c.OnError(func(_ *colly.Response, err error) {
	log.Println("Something went wrong:", err)
})

c.OnResponse(func(r *colly.Response) {
	fmt.Println("Visited", r.URL)
})

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	e.Request.Visit(e.Attr("href"))
})

c.OnHTML("tr td:nth-of-type(1)", func(e *colly.HTMLElement) {
	fmt.Println("First column of a table row:", e.Text)
})

c.OnScraped(func(r *colly.Response) {
	fmt.Println("Finished", r.URL)
})

  

The callback functions are called in the following order:

1. OnRequest

Called before a request is made

2. OnError

Called if an error occurs during the request

3. OnResponse

Called after a response is received

4. OnHTML

Called right after OnResponse, if the received content is HTML

5. OnScraped

Called after OnHTML has finished
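
To see this order in practice, here is a minimal sketch that registers all five callbacks and crawls a single page. The target URL is only an example, and OnError only fires if the request fails:

package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("1. OnRequest:", r.URL)
	})
	c.OnError(func(_ *colly.Response, err error) {
		log.Println("2. OnError:", err)
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("3. OnResponse:", r.URL)
	})
	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("4. OnHTML, page title:", e.Text)
	})
	c.OnScraped(func(r *colly.Response) {
		fmt.Println("5. OnScraped:", r.URL)
	})

	// Crawling one page triggers the callbacks in the order shown above.
	c.Visit("http://go-colly.org/")
}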

The officially provided basic sample code:

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector()

	// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
	c.AllowedDomains = []string{"hackerspaces.org", "wiki.hackerspaces.org"}

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}

  

The example program only visits links within the hackerspaces.org domains. The selector passed to the OnHTML callback is a[href], which matches a elements that have an href attribute; when such a link is found, the crawler continues by visiting it. Part of the output looks like this:

PS e:\mygo\src\github.com\gocolly\colly\_examples\basic> .\basic.exe
Visiting https://hackerspaces.org/
Link found: "Navigation" -> #column-one
Link found: "Search" -> #searchInput
Link found: "" -> /File:Cbase07.jpg
Visiting https://hackerspaces.org/File:Cbase07.jpg
Link found: "Navigation" -> #column-one
Link found: "Search" -> #searchInput
Link found: "File" -> #file
Link found: "File History" -> #filehistory
Link found: "File Usage" -> #filelinks
Link found: "" -> /images/e/ec/Cbase07.jpg
Visiting https://hackerspaces.org/images/e/ec/Cbase07.jpg
Link found: "800x600 pixels" -> /images/thumb/e/ec/Cbase07.jpg/800px-Cbase07.jpg
Visiting https://hackerspaces.org/images/thumb/e/ec/Cbase07.jpg/800px-Cbase07.jpg
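
The example keeps following every discovered link, so the crawl can fan out widely. A common way to bound it is a depth limit. The sketch below is a variation of the basic example, assuming the colly.MaxDepth option; it uses e.Request.Visit so that the depth of each follow-up request is tracked:

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Same crawl as the basic example, but stop following links
	// more than two levels away from the start page.
	c := colly.NewCollector(
		colly.MaxDepth(2),
	)
	c.AllowedDomains = []string{"hackerspaces.org", "wiki.hackerspaces.org"}

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Relative links are resolved automatically; requests beyond
		// MaxDepth or outside AllowedDomains are skipped.
		e.Request.Visit(link)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	c.Visit("https://hackerspaces.org/")
}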
