I used to write crawlers in Python, but recently I found that Go is also very convenient for this job. Here is a brief introduction.
The crawler here is not the kind that continuously crawls countless resources across the network; it is simply a program that fetches certain pages to extract specific information. The job can be divided into two parts: crawling the web pages and parsing the pages.
Crawling a web page generally means sending an HTTP GET/POST request to the server and getting the response. The http package provided by Go handles this very well.
GET method:

resp, err := http.Get("http://www.legendtkl.com")
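To actually use the response, the body needs to be read and closed. A minimal, self-contained sketch of this (using the same URL as above, with only basic error handling) might look like:

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
)

func main() {
    resp, err := http.Get("http://www.legendtkl.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close() // the response body must be closed when we are done

    body, err := ioutil.ReadAll(resp.Body) // read the whole page into memory
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(body))
}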
POST method:

resp, err := http.Post("http://example.com/upload", "image/jpg", &buf)
resp, err := http.PostForm("http://example.com/form", url.Values{"key": {"Value"}, "id": {"123"}})
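The buf above is whatever io.Reader holds the request body. Below is a rough sketch of how it could be filled from a local file (the file name avatar.jpg is just a made-up example):

// assumes "bytes", "io/ioutil", "log" and "net/http" are imported
data, err := ioutil.ReadFile("avatar.jpg") // hypothetical local image file
if err != nil {
    log.Fatal(err)
}
buf := bytes.NewBuffer(data)
resp, err := http.Post("http://example.com/upload", "image/jpg", buf)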
Of course, if you want to configure the HTTP client in more detail, you can build a Client yourself.
client := &http.Client{
    CheckRedirect: redirectPolicyFunc,
}
The client structure is as follows:
type Client struct {
    Transport     RoundTripper                              // Transport specifies the mechanism by which HTTP requests are made
    CheckRedirect func(req *Request, via []*Request) error  // if nil, the default policy (follow up to 10 consecutive requests) is used
    Jar           CookieJar                                 // Jar specifies the cookie jar
    Timeout       time.Duration                             // request timeout
}
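For example, a client with a request timeout can be built and then used just like the package-level functions (the 10-second value here is only an illustration, not from the original article):

// assumes "log", "net/http" and "time" are imported
client := &http.Client{
    Timeout: 10 * time.Second, // give up if the server does not respond in time
}
resp, err := client.Get("http://www.legendtkl.com")
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()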
The HTTP Header holds the key-value pairs we need to submit with the request; it is defined as follows:
type Header map[string][]string
It provides operations such as Add, Del, Set, and so on.
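A quick sketch of those operations (the header names and values here are arbitrary examples):

// assumes "fmt" and "net/http" are imported
h := http.Header{}
h.Set("User-Agent", "my-crawler/1.0") // Set replaces any existing values for the key
h.Add("Accept", "text/html")          // Add appends another value for the key
fmt.Println(h.Get("User-Agent"))      // Get returns the first value for the key
h.Del("Accept")                       // Del removes the key entirely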
So far we have sent GET requests directly to the server address. For more control over the request, you can use the client's Do() method to send an HTTP request.
func (c *Client) Do(req *Request) (resp *Response, err error)

client := &http.Client{
    // ...
}
req, err := http.NewRequest("GET", "http://example.com", nil)
req.Header.Add("If-None-Match", `W/"wyzzy"`)
resp, err := client.Do(req)
The fields inside Request correspond to the contents of an HTTP request, which we will not go into here.
The above only fetches the content of the web page; the more important part is parsing that content. We will use goquery, a library similar to jQuery, which makes parsing HTML very easy. Let's look at an example that crawls Qiushibaike (qiushibaike.com).
package main

import (
    "fmt"
    "log"

    "github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
    doc, err := goquery.NewDocument("http://www.qiushibaike.com")
    if err != nil {
        log.Fatal(err)
    }
    doc.Find(".article").Each(func(i int, s *goquery.Selection) {
        // skip posts that contain a thumbnail image or a video
        if s.Find(".thumb").Nodes == nil && s.Find(".video_holder").Nodes == nil {
            content := s.Find(".content").Text()
            fmt.Printf("%s", content)
        }
    })
}

func main() {
    ExampleScrape()
}
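To run this example you first need to fetch the goquery package, for instance with go get github.com/PuerkitoBio/goquery.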
Running the program prints the text content of the matched posts.
The core of goquery is the Find function, whose prototype is as follows:
func (s *Selection) Find(selector string) *Selection
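Because Find returns a *Selection, calls can be chained. A small sketch (the selector here is just an example, assuming doc is the *goquery.Document from the code above):

title := doc.Find("head").Find("title").Text() // chained lookups, like jQuery
fmt.Println(title)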
The returned Selection data structure is as follows:

type Selection struct {
    Nodes []*html.Node
}
The html package here is golang.org/x/net/html, and its Node struct is as follows:
type Node struct {
    Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node

    Type     NodeType
    DataAtom atom.Atom
    Data     string
    Attr     []Attribute
}
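To see how the two structures connect: a Selection wraps these html.Node values, so they can be inspected directly. A rough sketch (the "a" selector and "href" attribute are only examples):

// assumes doc is a *goquery.Document as in the example above
doc.Find("a").Each(func(i int, s *goquery.Selection) {
    for _, n := range s.Nodes { // the underlying html.Node values
        for _, attr := range n.Attr { // raw attributes parsed by x/net/html
            if attr.Key == "href" {
                fmt.Println(attr.Val)
            }
        }
    }
})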
This package is what actually parses the HTML page. The second if in the code above is used to filter out the posts on Qiushibaike that contain videos or pictures.
goquery's functionality is very rich and convenient for parsing; the details can be looked up through godoc.