As a graduate student ("grad dog") I want to study land-transfer information, and I need the data for every parcel of land sold. I would like to collect the announced land-deal results from the China Land Market Network (http://www.landchina.com/default.aspx?tabid=263&comName=default).
Clicking each parcel jumps to a detail page, from which I need to download fields such as "land use", "deal price", "supply way", and "project location". Since there are more than one million land-deal records in total, looking them up by hand is impossible. Can this be downloaded with a crawler? And roughly how difficult and time-consuming would it be? Thanks in advance!
Reply content:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import time
import random
import sys


def get_post_data(url, headers):
    # Visit the search page once to collect the hidden ASP.NET form fields the POST needs
    data = {
        'TAB_QuerySubmitSortData': '',
        'TAB_RowButtonActionControl': '',
    }
    try:
        req = requests.get(url, headers=headers)
    except Exception, e:
        print 'get base url failed, try again!', e
        sys.exit(1)
    try:
        soup = BeautifulSoup(req.text, "html.parser")
        TAB_QueryConditionItem = soup.find('input', id="TAB_QueryConditionItem270").get('value')
        data['TAB_QueryConditionItem'] = TAB_QueryConditionItem
        TAB_QuerySortItemList = soup.find('input', id="TAB_QuerySort0").get('value')
        data['TAB_QuerySortItemList'] = TAB_QuerySortItemList
        data['TAB_QuerySubmitOrderData'] = TAB_QuerySortItemList
        __EVENTVALIDATION = soup.find('input', id="__EVENTVALIDATION").get('value')
        data['__EVENTVALIDATION'] = __EVENTVALIDATION
        __VIEWSTATE = soup.find('input', id="__VIEWSTATE").get('value')
        data['__VIEWSTATE'] = __VIEWSTATE
    except Exception, e:
        print 'get post data failed, try again!', e
        sys.exit(1)
    return data


def get_info(url, headers):
    # Parse one detail page and pull out the fields we need
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.text, "html.parser")
    items = soup.find('table', id="mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1")
    info = {}
    # administrative division
    division = items.find('span', id="mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r1_c2_ctrl").get_text().encode('utf-8')
    info['xingzhengqu'] = division
    # project location
    location = items.find('span', id="mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r16_c2_ctrl").get_text().encode('utf-8')
    info['xiangmuweizhi'] = location
    # area (hectares)
    square = items.find('span', id="mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r2_c2_ctrl").get_text().encode('utf-8')
    info['mianji'] = square
    # land use
    purpose = items.find('span', id="mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r3_c2_ctrl").get_text().encode('utf-8')
    info['tudiyongtu'] = purpose
    # supply way
    source = items.find('span', id="mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r3_c4_ctrl").get_text().encode('utf-8')
    info['gongdifangshi'] = source
    # deal price (10,000 yuan)
    price = items.find('span', id="mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r20_c4_ctrl").get_text().encode('utf-8')
    info['chengjiaojiage'] = price
    # use the unique electronic supervision number as the key and the fields above as the value
    all_info = {}
    key_id = items.find('span', id="mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r1_c4_ctrl").get_text().encode('utf-8')
    all_info[key_id] = info
    return all_info


def get_pages(baseUrl, headers, post_data, date):
    print 'date', date
    # complete the POST data with the date condition
    post_data['TAB_QuerySubmitConditionData'] = post_data['TAB_QueryConditionItem'] + ':' + date
    page = 1
    while True:
        print 'page {0}'.format(page)
        # take a short break so the site does not flag us as a robot
        time.sleep(random.random() * 3)
        post_data['TAB_QuerySubmitPagerData'] = str(page)
        req = requests.post(baseUrl, data=post_data, headers=headers)
        soup = BeautifulSoup(req.text, "html.parser")
        items = soup.find('table', id="TAB_contentTable").find_all('tr', onmouseover=True)
        for item in items:
            print item.find('td').get_text()
            link = item.find('a')
            if link:
                print item.find('a').text
                url = 'http://www.landchina.com/' + item.find('a').get('href')
                print get_info(url, headers)
            else:
                print 'no content, this ten-day window is done'
                return
        page += 1


if __name__ == "__main__":
    baseUrl = 'http://www.landchina.com/default.aspx?tabid=263'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
        'Host': 'www.landchina.com'
    }
    post_data = get_post_data(baseUrl, headers)
    date = '2015-11-21~2015-11-30'
    get_pages(baseUrl, headers, post_data, date)
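The script above queries a single ten-day window (2015-11-21 to 2015-11-30). A rough sketch of how longer periods could be swept, assuming it is dropped into the __main__ block above in place of the last two lines (it reuses that block's baseUrl and headers, and the year range is only illustrative):

import calendar

for year in range(2014, 2016):                       # illustrative year range
    for month in range(1, 13):
        last = calendar.monthrange(year, month)[1]   # number of days in this month
        for start, end in ((1, 10), (11, 20), (21, last)):
            window = '{0}-{1:02d}-{2:02d}~{0}-{1:02d}-{3:02d}'.format(year, month, start, end)
            fresh_post_data = get_post_data(baseUrl, headers)   # fresh __VIEWSTATE etc. per window
            get_pages(baseUrl, headers, fresh_post_data, window)

Querying in small windows also keeps each result set well under the 200-page cap mentioned in the next answer.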
Chiming in uninvited; this is my first answer on Zhihu, from a fellow grad dog a few years ahead of you.
I crawled this data for my advisor before. From 1995 to 2015 there are more than 1.7 million records, and I reckoned crawling it all would take over 40 hours, so I only went back as far as 2000 and stopped there. When I wrote that code I had only just learned about crawlers and did not understand how things worked underneath: I found that when you click "next page" or change the date, the URL does not change, the URL does not change, the URL does not change, Orz. As a beginner I had no idea why. Later I managed to pick up a little Selenium and used it to simulate browser actions, so changing the date, clicking the next-page button and so on could all be done. The advantage is that it is simple and crude; the disadvantage is that it is overkill and hogs far too many system resources. Later still I learned a bit of packet capturing and realized that the date and page changes are actually sent as POST requests. This afternoon I reworked the program, replacing Selenium with plain POSTs. Enough talk; a rough sketch of the old Selenium-style approach comes first for comparison, then the new code.
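(This sketch is mine, not the answerer's original Selenium script; it only illustrates the browser-simulation idea described above. The element ids and the "next page" link label are assumptions and would have to be checked against the real page.)

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.landchina.com/default.aspx?tabid=263')

# fill in the query date range; this input id is an assumption for illustration
date_box = driver.find_element_by_id('TAB_QueryDateItem')
date_box.clear()
date_box.send_keys('2015-11-21~2015-11-30')
# submit the query; this button id is also an assumption
driver.find_element_by_id('TAB_QueryButtonControl').click()

for _ in range(5):                                       # walk a few result pages
    html = driver.page_source                            # hand this HTML to BeautifulSoup as usual
    driver.find_element_by_link_text(u'下一页').click()   # click the "next page" link (assumed label)

driver.quit()

The POST-based rewrite below fetches the same pages without starting a browser at all.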
# -*- coding: gb18030 -*-
'''Landchina, climb up!'''
import requests
import csv
from bs4 import BeautifulSoup
import datetime
import re
import os


class Spider():
    def __init__(self):
        self.url = 'http://www.landchina.com/default.aspx?tabid=263'
        # the data to submit with the POST request
        self.postData = {
            'TAB_QueryConditionItem': '9f2c3acd-0256-4da2-a659-6949c4671a2a',
            'TAB_QuerySortItemList': '282:False',
            # the query date gets appended to this value
            'TAB_QuerySubmitConditionData': '9f2c3acd-0256-4da2-a659-6949c4671a2a:',
            'TAB_QuerySubmitOrderData': '282:False',
            # page number
            'TAB_QuerySubmitPagerData': ''
        }
        self.rowName = [u'administrative region', u'electronic supervision number', u'project name',
                        u'project location', u'area (hectares)', u'land source', u'land use',
                        u'supply way', u'land use term', u'industry classification', u'land grade',
                        u'deal price (10,000 yuan)', u'land use right holder',
                        u'agreed plot-ratio lower limit', u'agreed hand-over time',
                        u'agreed start time', u'agreed completion time', u'actual start time',
                        u'actual completion time', u'approving authority', u'contract signing date']
        # span ids of the fields to grab; I grabbed everything except the four items
        # of the installment agreement
        self.info = [
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r1_c2_ctrl',   # 0
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r1_c4_ctrl',   # 1
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r17_c2_ctrl',  # 2
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r16_c2_ctrl',  # 3
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r2_c2_ctrl',   # 4
            # field 5 is the land source; it comes down as a code number that would need
            # converting to get the actual source; it does not matter much, so I left it alone
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r2_c4_ctrl',   # 5
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r3_c2_ctrl',   # 6
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r3_c4_ctrl',   # 7
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r19_c2_ctrl',  # 8
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r19_c4_ctrl',  # 9
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r20_c2_ctrl',  # 10
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r20_c4_ctrl',  # 11
            # 'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f3_r2_c1_0_ctrl',
            # 'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f3_r2_c2_0_ctrl',
            # 'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f3_r2_c3_0_ctrl',
            # 'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f3_r2_c4_0_ctrl',
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r9_c2_ctrl',   # 12
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f2_r1_c2_ctrl',
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f2_r1_c4_ctrl',
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r21_c4_ctrl',
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r22_c2_ctrl',
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r22_c4_ctrl',
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r10_c2_ctrl',
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r10_c4_ctrl',
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r14_c2_ctrl',
            'mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r14_c4_ctrl'
        ]

    # step 1
    def handleDate(self, year, month, day):
        '''return a datetime.date for the given year, month and day'''
        date = datetime.date(year, month, day)
        return date

    def timeDelta(self, year, month):
        # how many days are in this month
        date = datetime.date(year, month, 1)
        try:
            date2 = datetime.date(date.year, date.month + 1, date.day)
        except:
            date2 = datetime.date(date.year + 1, 1, date.day)
        dateDelta = (date2 - date).days
        return dateDelta

    def getPageContent(self, pageNum, date):
        '''open the result page for the given date and page number and return its content'''
        postData = self.postData.copy()
        # set the search date
        queryDate = date.strftime('%Y-%m-%d') + '~' + date.strftime('%Y-%m-%d')
        postData['TAB_QuerySubmitConditionData'] += queryDate
        # set the page number
        postData['TAB_QuerySubmitPagerData'] = str(pageNum)
        # request the page
        r = requests.post(self.url, data=postData, timeout=30)
        r.encoding = 'gb18030'
        pageContent = r.text
        # f = open('content.html', 'w')
        # f.write(pageContent.encode('gb18030'))
        # f.close()
        return pageContent

    # step 2
    def getAllNum(self, date):
        # 0: no content  1: exactly 1 page  n: 1-200 pages  200: more than 200 pages
        firstContent = self.getPageContent(1, date)
        if u'没有检索到相关数据' in firstContent:   # the site's "no relevant data found" prompt
            print date, 'have', '0 page'
            return 0
        pattern = re.compile(u'共(.*?)页.*?')       # "N pages in total"
        result = re.search(pattern, firstContent)
        if result == None:
            print date, 'have', '1 page'
            return 1
        if int(result.group(1)) <= 200:
            print date, 'have', int(result.group(1)), 'page'
            return int(result.group(1))
        else:
            print date, 'have', '200 page'
            return 200

    # step 3
    def getLinks(self, pageNum, date):
        '''get all detail-page links on one result page'''
        pageContent = self.getPageContent(pageNum, date)
        links = []
        # the regex body did not survive the paste; it captured, for each result row,
        # the tail of the detail link that follows "default.aspx?tabid=386"
        pattern = re.compile(u'', re.S)
        results = re.findall(pattern, pageContent)
        for result in results:
            links.append('http://www.landchina.com/default.aspx?tabid=386' + result)
        return links

    def getAllLinks(self, allNum, date):
        pageNum = 1
        allLinks = []
        while pageNum <= allNum:
            links = self.getLinks(pageNum, date)
            allLinks += links
            print 'scrapy link from page', pageNum, '/', allNum
            pageNum += 1
        print date, 'have', len(allLinks), 'link'
        return allLinks

    # step 4
    def getLinkContent(self, link):
        '''open the link and return the page content'''
        r = requests.get(link, timeout=30)
        r.encoding = 'gb18030'
        linkContent = r.text
        # f = open('linkContent.html', 'w')
        # f.write(linkContent.encode('gb18030'))
        # f.close()
        return linkContent

    def getInfo(self, linkContent):
        """get every item's info"""
        data = []
        soup = BeautifulSoup(linkContent, "html.parser")
        for item in self.info:
            if soup.find(id=item) == None:
                s = ''
            else:
                s = soup.find(id=item).string
                if s == None:
                    s = ''
            data.append(unicode(s.strip()))
        return data

    def saveInfo(self, data, date):
        fileName = 'landchina/' + datetime.datetime.strftime(date, '%Y') + '/' + \
                   datetime.datetime.strftime(date, '%m') + '/' + \
                   datetime.datetime.strftime(date, '%d') + '.csv'
        if os.path.exists(fileName):
            mode = 'ab'
        else:
            mode = 'wb'
        csvfile = open(fileName, mode)
        writer = csv.writer(csvfile)
        if mode == 'wb':
            # write the header row only when the file is new
            writer.writerow([name.encode('gb18030') for name in self.rowName])
        writer.writerow([d.encode('gb18030') for d in data])
        csvfile.close()

    def mkdir(self, date):
        # create the output directory
        path = 'landchina/' + datetime.datetime.strftime(date, '%Y') + '/' + \
               datetime.datetime.strftime(date, '%m')
        isExists = os.path.exists(path)
        if not isExists:
            os.makedirs(path)

    def saveAllInfo(self, allLinks, date):
        for (i, link) in enumerate(allLinks):
            linkContent = data = None
            linkContent = self.getLinkContent(link)
            data = self.getInfo(linkContent)
            self.mkdir(date)
            self.saveInfo(data, date)
            print 'save info from link', i + 1, '/', len(allLinks)
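The answer stops at the class definition. A small driver along these lines (my own sketch, not part of the original post) would chain the four steps together for a single day:

if __name__ == '__main__':
    spider = Spider()
    date = spider.handleDate(2015, 11, 21)            # step 1: build the date object
    allNum = spider.getAllNum(date)                   # step 2: count the result pages for that day
    if allNum:
        allLinks = spider.getAllLinks(allNum, date)   # step 3: collect the detail-page links
        spider.saveAllInfo(allLinks, date)            # step 4: scrape each link and append it to the day's CSV

Wrapping this in loops over year, month and day (timeDelta() gives the number of days in a month) covers longer periods one day at a time.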
You can have a look at the Shenjianshou cloud crawler development platform.
There, a crawler takes just a few lines of JavaScript; if even that is too much trouble you can contact the team for a custom job, and any site can be crawled. In short, it is a very convenient crawler infrastructure platform. This data is so cleanly structured that it is easy to capture. Drawing on years of data-processing experience, I can offer the following suggestions:
1. Multithreading
2. Prevent IP from being blocked
3. Store the large volume of unstructured data in MongoDB
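A rough illustrative sketch of those three suggestions together (not from any of the answers above), assuming requests and pymongo are installed and a MongoDB server is running locally; the detail-URL list and the database and collection names are placeholders:

import time
import random
import requests
from multiprocessing.dummy import Pool          # a simple thread pool
from pymongo import MongoClient

collection = MongoClient()['landchina']['deals']    # placeholder database/collection names

def fetch_and_store(url):
    time.sleep(random.uniform(0.5, 2.0))            # throttle requests to lower the risk of an IP ban
    r = requests.get(url, timeout=30)
    r.encoding = 'gb18030'
    collection.insert_one({'url': url, 'html': r.text})   # store the raw page for later parsing

detail_urls = ['http://www.landchina.com/default.aspx?tabid=386&comName=default&wmguid=...']  # placeholder list
pool = Pool(4)                                      # suggestion 1: a few worker threads
pool.map(fetch_and_store, detail_urls)
pool.close()
pool.join()

If the site starts refusing requests, longer delays or a pool of proxies would be the next step.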
More details are on the Kejida data introduction page: http://www.tanmer.com/bigdata
I scraped the contract section of this site; it is comparatively easy to grab. When scraping the generated tables, note that content such as the region selection columns is loaded asynchronously, so you need to pull down the site's JS and work out how it builds the asynchronous request before submitting your data; the request carries an ID that appears on the home page. At least that is roughly how it went: it was last year and I do not remember the details, but I have the source code, written in Java, and can share it with you.
I am a complete crawler newbie; may I ask, isn't it said that ASP pages cannot be crawled?
The address of the detail page is "default.aspx?tabid=386&comName=default&wmguid=75c725 ...". The site reads the database inside the default.aspx page to display the details; isn't it said that data in the database cannot be read directly?