It was a boring weekend, so I had some fun...
# coding: utf-8
import random
import time

import requests
from bs4 import BeautifulSoup

# Crawl the required content.
# Pool of User-Agent strings; one is picked at random for each request.
user_agent = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64)",
    "Mozilla/5.0 (Windows NT 6.3; WOW64)",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12",
    "Opera/9.27 (Windows NT 5.2; U; zh-cn)",
    "Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0",
    "Opera/8.0 (Macintosh; PPC Mac OS X; U; en)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; QQBrowser/7.3.9825.400)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser 2.x)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11",
]

# Ranking list name -> end of its page range (exclusive, so pages 1..110 and 1..3)
moduledic = {"ranklist_a": 111, "ranklist_b": 4}

for module in moduledic:
    for page in range(1, moduledic[module]):
        url = "http://quote.stockstar.com/stock/" + str(module) + "_3_1_" + str(page) + ".html"
        try:
            # Custom request header
            response = requests.post(url, headers={"User-Agent": random.choice(user_agent)})
        except requests.RequestException:
            print("Continue")
            continue  # no response for this page, so move on to the next one
        response.encoding = "gb2312"
        html = response.text
        soup = BeautifulSoup(html, "lxml")
        # Sleep a random few seconds after each page; change the values to suit
        # the actual situation. The original bounds were lost in this copy, so
        # the range 1 to 5 here is an assumption.
        time.sleep(random.randrange(1, 5))
        for i in soup.find_all("tr"):
            datalist = []
            for j in i.find_all("td"):
                datalist.append(j.string)
            # Skip header rows and rows with missing cells: a data row needs
            # 12 cells, each holding plain text.
            if len(datalist) < 12 or None in datalist[:12]:
                continue
            data = " ".join(datalist[:12])
            print(data)
Part of the output:
[Screenshot: Securities star.png]
I originally wanted to save the data in a database and use it later for data analysis, but I suddenly lost interest, so that is it for now.
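For anyone who does want to take that step, here is a minimal sketch of what saving the rows might look like. It is not something I actually ran against the site. It assumes SQLite from the standard library, and the table name stock_rows, its columns, the helper names open_store and save_rows, and the sample cells are all made up for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with one table for scraped rows.

    The table name and columns are hypothetical; the page itself does not
    tell us what the 12 cells mean, so each row is stored as one text line.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stock_rows ("
        " module TEXT NOT NULL,"
        " page INTEGER NOT NULL,"
        " row_text TEXT NOT NULL)"
    )
    return conn


def save_rows(conn, module, page, rows):
    """Insert the rows of one page; each row is a list of cell strings."""
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO stock_rows (module, page, row_text) VALUES (?, ?, ?)",
            [(module, page, " ".join(cells)) for cells in rows],
        )


# Placeholder cells, not real quotes from the site.
conn = open_store()
save_rows(conn, "ranklist_a", 1, [["code-1", "name-1", "1.00"],
                                  ["code-2", "name-2", "2.00"]])
count = conn.execute("SELECT COUNT(*) FROM stock_rows").fetchone()[0]
first = conn.execute(
    "SELECT row_text FROM stock_rows ORDER BY rowid").fetchone()[0]
print(count, first)
```

In the crawler, the print call would be swapped for collecting each page's rows into a list and passing it to save_rows once per page, with a file path given to open_store instead of the in-memory default.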
I just want to say: most sites have basically no anti-crawler strategy in place. If I wanted to, I could probably crawl this whole site in a day or two; the code above took only half an hour to write. Isn't data worth money? Isn't crawling a site completely the equivalent of indirectly dumping its database?
This article is from the "Shangwei Super" blog; please keep this source link: http://9399369.blog.51cto.com/9389369/1954076
Python crawler crawls the Securities Star website