A Python crawler for the Securities Star website

Source: Internet
Author: User

It was a boring weekend, so I wrote this for fun ...

# coding: utf-8
import random
import time

import requests
from bs4 import BeautifulSoup

# Pool of User-Agent strings; one is picked at random for each request.
user_agent = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64)",
    "Mozilla/5.0 (Windows NT 6.3; WOW64)",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12",
    "Opera/9.27 (Windows NT 5.2; U; zh-cn)",
    "Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0",
    "Opera/8.0 (Macintosh; PPC Mac OS X; U; en)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; QQBrowser/7.3.9825.400)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser 2.x)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11",
]

# Module name -> exclusive upper bound of the page range to crawl.
moduledic = {"ranklist_a": 111, "ranklist_b": 4}

for module in moduledic:
    for page in range(1, moduledic[module]):
        url = "http://quote.stockstar.com/stock/" + str(module) + "_3_1_" + str(page) + ".html"
        try:
            # Custom request header with a randomly chosen User-Agent
            response = requests.post(url, headers={"User-Agent": random.choice(user_agent)})
        except requests.RequestException:
            print("Continue")
            continue  # the request failed, so there is no response to parse
        response.encoding = "gb2312"
        html = response.text
        soup = BeautifulSoup(html, "lxml")
        # Sleep a random few seconds after each page; 1 to 3 s is an assumed
        # range, change it to suit the actual situation.
        time.sleep(random.randrange(1, 4))
        for i in soup.find_all("tr"):
            # Collect the cell texts of this row only
            datalist = [j.string for j in i.find_all("td")]
            try:
                # Join the first 12 cells of the row
                data = "    ".join(datalist[k] for k in range(12))
                print(data)
            except (IndexError, TypeError):
                # Fewer than 12 cells, or a cell without a plain string
                continue
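The URL pattern and the row formatting used in the script can be pulled out into small pure functions that are easy to check without touching the network. This is only a sketch: the names iter_page_urls, format_row and pick_headers are made up for illustration and do not appear in the original script.

```python
import random

# Same URL prefix as in the script above
BASE_URL = "http://quote.stockstar.com/stock/"


def iter_page_urls(modules):
    """Yield (module, page, url) for every page the script's loops visit."""
    for module, stop in modules.items():
        # stop is exclusive, exactly like range(1, moduledic[module])
        for page in range(1, stop):
            yield module, page, "%s%s_3_1_%d.html" % (BASE_URL, module, page)


def format_row(cells, width=12, sep="    "):
    """Join the first `width` cells; return None for short or empty rows."""
    if len(cells) < width or any(c is None for c in cells[:width]):
        return None
    return sep.join(cells[:width])


def pick_headers(user_agents):
    """Build a request header dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(user_agents)}
```

With moduledic as defined in the script, iter_page_urls walks pages 1 to 110 of ranklist_a and pages 1 to 3 of ranklist_b, since the upper bound of range is exclusive.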

Part of the output:

[Screenshot: Securities Star.png]

I originally planned to save the results in a database and use them later for data analysis, but I suddenly lost interest, so that will wait.
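Saving the rows to a database could look roughly like the sketch below, which uses SQLite from the Python standard library. The table name quotes, the generic column names c0 to c11 and the helper save_rows are all assumptions; the post does not say which database or schema was intended, and the real column meanings depend on the page being crawled.

```python
import sqlite3


def save_rows(conn, rows, width=12):
    """Insert rows of `width` text cells and return the table's row count.

    The table and column names are hypothetical placeholders.
    """
    columns = ", ".join("c%d TEXT" % i for i in range(width))
    conn.execute("CREATE TABLE IF NOT EXISTS quotes (%s)" % columns)
    placeholders = ", ".join("?" * width)
    conn.executemany("INSERT INTO quotes VALUES (%s)" % placeholders, rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]


if __name__ == "__main__":
    # Dummy rows standing in for the 12 cells scraped from each table row
    conn = sqlite3.connect(":memory:")
    dummy = [tuple("r%d_c%d" % (r, c) for c in range(12)) for r in range(2)]
    print(save_rows(conn, dummy))
```

Replacing ":memory:" with a file path would keep the data between runs, which is what later analysis would need.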

I just want to say: most sites have essentially no anti-crawler measures in place. If I wanted to, I could probably crawl the whole site in a day or two; the run above took only half an hour. Isn't data worth money? Isn't crawling a site completely the equivalent of an indirect database dump?

This article is from the "Shangwei Super" blog, please make sure to keep this source http://9399369.blog.51cto.com/9389369/1954076
