Writing a Crawler in Go


I used to write crawlers in Python, but recently I found that Go is also very convenient for the job. Here is a brief introduction.
By "crawler" I don't mean a full crawler that continuously follows links across the network, but simply a program that fetches certain pages and extracts specific information from them. The task splits naturally into two parts: fetching the web page, and parsing it.
Fetching a web page generally means sending an HTTP GET or POST request to the server and reading the response. Go's net/http package handles this very well.
A GET request:

```go
resp, err := http.Get("http://www.legendtkl.com")
```

A POST request:

```go
resp, err := http.Post("http://example.com/upload", "image/jpg", &buf)
resp, err := http.PostForm("http://example.com/form", url.Values{"key": {"Value"}, "id": {"123"}})
```

Of course, if you need more control over the HTTP client's behavior, you can build your own client:

```go
client := &http.Client{
	CheckRedirect: redirectPolicyFunc,
}
```

The client structure is as follows:

```go
type Client struct {
	Transport     RoundTripper  // Transport specifies the mechanism by which HTTP requests are made
	CheckRedirect func(req *Request, via []*Request) error // if nil, use the default policy (stop after 10 consecutive requests)
	Jar           CookieJar     // specifies the cookie jar
	Timeout       time.Duration // request timeout
}
```

The HTTP Header type holds the key-value pairs we submit with a request; it is defined as follows:

```go
type Header map[string][]string
```

It provides operations such as Add, Del, Set, and Get.
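These operations can be seen in a short sketch; the `headerDemo` helper name is mine:

```go
package main

import (
	"fmt"
	"net/http"
)

// headerDemo exercises Add, Set, Get, and Del on an http.Header,
// which is a map[string][]string with canonicalized keys.
func headerDemo() []string {
	h := http.Header{}
	h.Add("Accept", "text/html")
	h.Add("Accept", "application/json") // Add appends to any existing values
	h.Set("User-Agent", "go-crawler")   // Set replaces all values for the key
	got := h.Get("Accept")              // Get returns only the first value
	h.Del("User-Agent")                 // Del removes the key entirely
	return []string{got, h.Get("User-Agent"), fmt.Sprint(len(h["Accept"]))}
}

func main() {
	// Accept keeps two values; User-Agent was deleted
	fmt.Println(headerDemo())
}
```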

Above we requested the server address directly with Get. For finer-grained control, you can build a Request yourself and send it with the client's Do method:

```go
func (c *Client) Do(req *Request) (resp *Response, err error)

client := &http.Client{
	...
}
req, err := http.NewRequest("GET", "http://example.com", nil)
req.Header.Add("If-None-Match", `W/"wyzzy"`)
resp, err := client.Do(req)
```

The fields of the Request struct map directly onto the parts of an HTTP request, so they are not covered here.

So far we have only fetched the page content; the more important part is parsing it. We will use goquery, which offers a jQuery-like API and makes parsing HTML very easy. Let's look at an example that scrapes Qiushibaike (a Chinese joke site).

```go
package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
	doc, err := goquery.NewDocument("http://www.qiushibaike.com")
	if err != nil {
		log.Fatal(err)
	}
	doc.Find(".article").Each(func(i int, s *goquery.Selection) {
		// skip posts that contain a thumbnail image or a video
		if s.Find(".thumb").Nodes == nil && s.Find(".video_holder").Nodes == nil {
			content := s.Find(".content").Text()
			fmt.Printf("%s", content)
		}
	})
}

func main() {
	ExampleScrape()
}
```

The program works as follows.

The core of goquery is the Find function; its prototype is as follows:

```go
func (s *Selection) Find(selector string) *Selection
```

The returned Selection struct is as follows:

```go
type Selection struct {
	Nodes []*html.Node
}
```

The html package here is golang.org/x/net/html; its Node struct is as follows:

```go
type Node struct {
	Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node

	Type      NodeType
	DataAtom  atom.Atom
	Data      string
	Namespace string
	Attr      []Attribute
}
```

Find selects elements from the parsed HTML page, and the second if in the code above filters out posts that contain videos or pictures.
goquery's API is rich and convenient for parsing; you can explore the rest of it through godoc.
