Golang web crawler framework gocolly/colly, Part Four
The more a crawler acts like a browser, the easier it is to crawl data; that is my takeaway from years of crawler work. Looking back, my personal crawler experience falls into three stages. The first stage began around 2009, when project needs had me visiting the major international social networking sites: Facebook, MySpace, Flickr, YouTube, and so on. I crawled all the big-name social networks. Most sites provided RESTful APIs; for features without an API, the only option was to analyze the protocol with an HTTP capture tool and crawl it yourself. Domestically I also crawled Youku, Tudou, Xiaonei, web QQ, web mail, and more. I wrote the demos in C#, but the project was in C++, so the code had to be converted to managed C++. The main lesson of the first stage was cookie management; the trickiest cookies were handled with the help of the WebBrowser control.
The second stage was around 2013, building financial data analysis software and web robots. The programming language was still C#: HttpWebRequest and HttpWebResponse wrapped the HTTP layer, CookieContainer handled cookie management, HtmlAgilityPack parsed the HTML, a self-packaged tesseract wrapper recognized captchas, Fiddler handled protocol analysis, and the browser debugger handled element selection. With this set of kung fu in hand you could swim through most of the network: robots that roamed freely across blogs and Weibo, automatically posting messages, threads, and comments, and crawlers for the major financial websites, the SSE, SZSE, the CNINFO (Juchao) disclosure site, the investor interaction platforms, and so on.
The third stage is now. More than four years on, coming back to crawler technology, I find the weapons have grown far more powerful: the Go language plus goquery, colly, chromedp, webloop, and other strong tools make crawlers simpler and more efficient.
Years of crawler experience are summed up in that opening sentence. The crawlers I know of come in three types: the first parses the HTTP protocol and constructs requests directly; the second uses a browser control to obtain cookies and page elements and to call JS scripts (phantomjs and webloop belong to this class); the third drives a real browser (chromedp belongs to this class, and a sketch of it follows below). Microsoft also provides a COM interface for controlling Internet Explorer; I wrote against it in C++ long ago, and it was hard to use, the code was disgusting, full of conditional checks. Constructing requests directly is fast, while browser controls and driven browsers are reliable and safe and spare you a lot of protocol analysis and JS script analysis, but they are slow and load plenty of useless data such as images. Mixing the first type with the second or third works best: as long as you behave more like a browser, or simply drive a browser outright, your IP won't be blocked, provided you stay under the server's thresholds for judging human behavior. If a single IP isn't enough, set up proxies and switch between them.
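To make the third class concrete, here is a minimal sketch of driving a real browser with chromedp. Note it uses the current chromedp API, which has evolved since this article was written; it simply loads the SSE listing page and grabs the rendered HTML:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	// A browser context starts (and later tears down) a headless Chrome.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()
	ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	// Navigate like a real user and capture the fully rendered document.
	var html string
	err := chromedp.Run(ctx,
		chromedp.Navigate("http://www.sse.com.cn/assortment/stock/list/share/"),
		chromedp.OuterHTML("html", &html),
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("rendered page: %d bytes", len(html))
}
```

Everything the page's JS produces is available this way, at the cost of launching a full Chrome instance, which is exactly the speed trade-off described above.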
Keep learning, and keep arming yourself with new weapons. Below I contribute a small example: crawling the A-share and B-share stock lists of the SSE, a simple demonstration of putting on the act. (hahaha)
The listing page provides a download function. A-share download address:

http://query.sse.com.cn/security/stock/downloadStockListFile.do?csrcCode=&stockCode=&areaName=&stocktype=1

B-share download address:

http://query.sse.com.cn/security/stock/downloadStockListFile.do?csrcCode=&stockCode=&areaName=&stocktype=2
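As an aside, if you would rather not hand-edit query strings, the same addresses can be assembled with the standard net/url package. A minimal sketch (parameter names are taken from the URLs above; Encode() sorts them alphabetically, which the server should not care about):

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// Query parameters as they appear in the download addresses above.
	params := url.Values{}
	params.Set("csrcCode", "")
	params.Set("stockCode", "")
	params.Set("areaName", "")
	params.Set("stocktype", "1") // 1 = A-shares, 2 = B-shares

	u := "http://query.sse.com.cn/security/stock/downloadStockListFile.do?" + params.Encode()
	fmt.Println(u)
}
```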
With the address in hand, start visiting:
```go
c.Visit("http://query.sse.com.cn/security/stock/downloadStockListFile.do?csrcCode=&stockCode=&areaName=&stocktype=1")
```
And set the UserAgent to Chrome's:
```go
c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"
```
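For context, the two lines above fit into a minimal runnable program like this (a sketch; the OnResponse callback and error logging are my additions so the outcome is visible):

```go
package main

import (
	"log"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()
	c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"

	// On success this would receive the downloaded file body.
	c.OnResponse(func(resp *colly.Response) {
		log.Printf("got %d bytes", len(resp.Body))
	})

	// Visit returns the HTTP error; for this URL it is the 403 status text.
	err := c.Visit("http://query.sse.com.cn/security/stock/downloadStockListFile.do?csrcCode=&stockCode=&areaName=&stocktype=1")
	if err != nil {
		log.Fatal(err)
	}
}
```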
But no luck: run it and the program reports an error,

2018/... Forbidden
Opening this URL directly in the browser address bar does not work either; it reports "Error 403: SRVE0190E: File not found: /error/error_cn.jsp".
The server is clearly imposing some restrictions. Open Fiddler and inspect the protocol.
The request carries a lot of cookies, so the first instinct is that the failure comes from missing cookies. I used chromedp to open the page and then fire the Ajax request, and to my surprise the Ajax request succeeded even without cookies. It turned out the key is the "Referer" header in the request: with Referer set, it works.
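That claim is easy to verify outside colly with a bare net/http request that sets nothing but the Referer. This is a hypothetical standalone check, not part of the original code:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	url := "http://query.sse.com.cn/security/stock/downloadStockListFile.do?csrcCode=&stockCode=&areaName=&stocktype=1"
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Fatal(err)
	}
	// The single header under test; no cookies, no other browser headers.
	req.Header.Set("Referer", "http://www.sse.com.cn/assortment/stock/list/share/")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Status, len(body))
}
```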
To be safe, fill in all the headers; the more it looks like the browser, the less grief you get:
```go
c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("Host", "query.sse.com.cn")
	r.Headers.Set("Connection", "keep-alive")
	r.Headers.Set("Accept", "*/*")
	r.Headers.Set("Origin", "http://www.sse.com.cn")
	// Key header: without Referer the server returns an error.
	r.Headers.Set("Referer", "http://www.sse.com.cn/assortment/stock/list/share/")
	r.Headers.Set("Accept-Encoding", "gzip, deflate")
	r.Headers.Set("Accept-Language", "zh-CN,zh;q=0.9")
})
```
Here is the full code, which saves the stock lists to CSV files.
```go
package sse

import (
	"encoding/csv"
	"os"
	"strings"

	"github.com/gocolly/colly"
)

// GetStockListA gets the Shanghai Stock Exchange stock list (A-shares).
func GetStockListA(saveFile string) (err error) {
	stocks, err := getStockList("http://query.sse.com.cn/security/stock/downloadStockListFile.do?csrcCode=&stockCode=&areaName=&stocktype=1")
	if err != nil {
		return err
	}
	err = saveStockList2CSV(stocks, saveFile)
	return
}

// GetStockListB gets the Shanghai Stock Exchange stock list (B-shares).
func GetStockListB(saveFile string) (err error) {
	stocks, err := getStockList("http://query.sse.com.cn/security/stock/downloadStockListFile.do?csrcCode=&stockCode=&areaName=&stocktype=2")
	if err != nil {
		return err
	}
	err = saveStockList2CSV(stocks, saveFile)
	return
}

// saveStockList2CSV splits the tab-separated download into rows and fields
// and writes them to a CSV file.
func saveStockList2CSV(stockList string, file string) (err error) {
	vals := strings.Split(stockList, "\n")
	f, err := os.Create(file)
	if err != nil {
		return err
	}
	defer f.Close()
	fw := csv.NewWriter(f)
	for _, row := range vals {
		rSplits := strings.Split(row, "\t")
		rSplitsRslt := make([]string, 0)
		for _, sp := range rSplits {
			trimSp := strings.Trim(sp, " ")
			if len(trimSp) > 0 {
				rSplitsRslt = append(rSplitsRslt, trimSp)
			}
		}
		if len(rSplitsRslt) > 0 {
			err = fw.Write(rSplitsRslt)
			if err != nil {
				return err
			}
		}
	}
	fw.Flush()
	return
}

// getStockList downloads the stock list, impersonating the browser request
// captured in Fiddler:
//
//	GET http://query.sse.com.cn/security/stock/downloadStockListFile.do?csrcCode=&stockCode=&areaName=&stocktype=1 HTTP/1.1
//	Host: query.sse.com.cn
//	Connection: keep-alive
//	Accept: */*
//	Origin: http://www.sse.com.cn
//	User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36
//	Referer: http://www.sse.com.cn/assortment/stock/list/share/
//	Accept-Encoding: gzip, deflate
//	Accept-Language: zh-CN,zh;q=0.9
func getStockList(url string) (stockList string, err error) {
	c := colly.NewCollector()
	c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("Host", "query.sse.com.cn")
		r.Headers.Set("Connection", "keep-alive")
		r.Headers.Set("Accept", "*/*")
		r.Headers.Set("Origin", "http://www.sse.com.cn")
		// Key header: without Referer the server returns an error.
		r.Headers.Set("Referer", "http://www.sse.com.cn/assortment/stock/list/share/")
		r.Headers.Set("Accept-Encoding", "gzip, deflate")
		r.Headers.Set("Accept-Language", "zh-CN,zh;q=0.9")
	})
	c.OnResponse(func(resp *colly.Response) {
		stockList = string(resp.Body)
	})
	c.OnError(func(resp *colly.Response, errHttp error) {
		err = errHttp
	})
	err = c.Visit(url)
	return
}
```
And a main function to drive it (the sse package import path depends on your project layout; the one below is a placeholder):

```go
package main

import (
	"log"

	sse "yourproject/sse" // placeholder import path; adjust to your layout
)

func main() {
	var err error
	err = sse.GetStockListA("E:\\ssea.csv")
	if err != nil {
		log.Fatal(err)
	}
	err = sse.GetStockListB("E:\\sseb.csv")
	if err != nil {
		log.Fatal(err)
	}
}
```
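One last note: the summary at the top mentioned switching proxies when a single IP is not enough. colly ships a round-robin proxy switcher for exactly this; a minimal sketch, with placeholder proxy addresses:

```go
package main

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	c := colly.NewCollector()

	// Rotate across the listed proxies round-robin. The addresses are
	// placeholders; substitute your own.
	rp, err := proxy.RoundRobinProxySwitcher(
		"http://127.0.0.1:8080",
		"http://127.0.0.1:8081",
	)
	if err != nil {
		log.Fatal(err)
	}
	c.SetProxyFunc(rp)

	if err := c.Visit("http://query.sse.com.cn/security/stock/downloadStockListFile.do?csrcCode=&stockCode=&areaName=&stocktype=1"); err != nil {
		log.Fatal(err)
	}
}
```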
When reprinting, please credit the source: http://www.cnblogs.com/majianguo/p/8186429.html