There are many ways to write a web crawler, such as Node.js, Go, or even PHP. I chose Python because it has plenty of tutorials and can be learned systematically. Merely knowing how to use HTML selectors to scrape a page is not enough; I also wanted to learn about the common pitfalls in the crawling process and tips such as modifying the browser headers.
The code comments are very detailed; in fact, you could just read the source directly.
The purpose of this crawler is very simple: scrape the name, the price, and one image (a simple test of the file-download feature) for each property on a real estate site, in order to analyze price trends later. To avoid putting too much pressure on the target server, I chose to crawl only 3 pages.
Let me point out a few things that need attention:
# Remember to modify the request headers you send
I've heard that the headers sent by default carry Python information, which makes it easy for the target site to detect that you are a crawler bot and ban your IP. So it is best to make your crawler look a bit more human. That said, this code only provides basic camouflage; a site that seriously guards against crawlers will still catch you. Here is the code:
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
}
```
# For the HTML selector, I'm using pyquery instead of BeautifulSoup
Many books recommend BeautifulSoup, but for someone used to jQuery, BeautifulSoup's syntax is a bit of a mouthful, and it seems not to support advanced CSS selectors such as :first-child. Or maybe it does and I simply didn't find it; I did not read the documentation very carefully.
Then I searched around online and found that many people recommend the pyquery library. I tried it out myself, found it genuinely comfortable to use, and adopted it without hesitation.
# The crawler's approach
The idea is actually simple:
1. Find the list page of a property site and work out the URL structure of the second and third pages;
2. Get the URLs of all listing entries on each list page and store them in a Python set() to deduplicate them;
3. With each house's URL, visit its detail page and scrape the valuable fields there, such as the price text;
4. For now I only print the data; I have not yet saved it locally as JSON or CSV. That part is still to be done.
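Since step 4 is still a TODO, here is one way the persistence could look. This is only a sketch under assumed field names (`title`, `price`) and file names (`houses.csv`, `houses.json`); the real crawler would collect these dicts inside its detail-page function instead of hard-coding them.

```python
import csv
import json

# Hypothetical scraped records; the real code would append these while crawling
houses = [
    {'title': 'Sunny Gardens', 'price': '25000'},
    {'title': 'River View', 'price': '31000'},
]

# Save as CSV
with open('houses.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price'])
    writer.writeheader()
    writer.writerows(houses)

# Save as JSON (ensure_ascii=False keeps any Chinese text readable)
with open('houses.json', 'w', encoding='utf-8') as f:
    json.dump(houses, f, ensure_ascii=False, indent=2)
```

Both formats come straight from the standard library, so no extra dependencies are needed for this step.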
Here is the full code:
```python
# Download helper for images
from urllib.request import urlretrieve
from pyquery import PyQuery as pq
# requests lets us modify the request headers and simulate a real browser
import requests
import time
import os
# Your own profile: rename config-sample.py to config.py and fill in the values
import config

# A set of links, so that no URL is crawled twice
pages = set()
session = requests.Session()
baseUrl = 'http://pic1.ajkimg.com'
downloadDir = 'images'

# Get the links of all list pages
def getAllPages():
    pageList = []
    i = 1
    while i < 4:  # only crawl 3 pages
        newLink = 'http://sh.fang.anjuke.com/loupan/all/p' + str(i) + '/'
        pageList.append(newLink)
        i = i + 1
    return pageList

def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    if baseUrl not in url:
        return None
    return url

# The paths in this function follow my own setup; adjust them to ease data import later
def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

# Get all listing links on the current page
def getItemLinks(url):
    global pages
    # First check whether the page can be fetched at all
    try:
        req = session.get(url, headers=config.value['headers'])
    # This only catches 404/500-style errors; an unresolvable DNS name cannot be detected here
    except IOError as e:
        print('can not reach the page.')
        print(e)
    else:
        h = pq(req.text)
        # All the house modules on the page
        houseItems = h('.item-mod')
        # Extract what we need from each module: detail-page URL, price, thumbnail, etc.
        # I prefer to grab only the detail-page URL and fetch the rest on the detail page
        for houseItem in houseItems.items():
            houseUrl = houseItem.find('.items-name').attr('href')
            # print(houseUrl)
            pages.add(houseUrl)

# Get the fields of the detail page; edit these to taste
def getItemDetails(url):
    # First check whether the page can be fetched at all
    try:
        req = session.get(url, headers=config.value['headers'])
    # This only catches 404/500-style errors; an unresolvable DNS name cannot be detected here
    except IOError as e:
        print('can not reach the page.')
        print(e)
    else:
        time.sleep(1)
        h = pq(req.text)
        # get title
        houseTitle = h('h1').text() if h('h1') != None else 'none'
        # get price
        housePrice = h('.sp-price').text() if h('.sp-price') != None else 'none'
        # get image URL
        houseImage = h('.con a:first-child img').attr('src')
        houseImageUrl = getAbsoluteURL(baseUrl, houseImage)
        if houseImageUrl != None:
            urlretrieve(houseImageUrl, getDownloadPath(baseUrl, houseImageUrl, downloadDir))
        print(houseTitle, housePrice)

# start to run the code
allPages = getAllPages()
for i in allPages:
    getItemLinks(i)
# At this point, pages should be filled with many URLs
for i in pages:
    getItemDetails(i)
# print(pages)
```
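The URL-normalization helper in the listing is easy to get wrong, so it is worth exercising in isolation. Below is a standalone copy of that logic with a few sample inputs (the image filenames are made up); it turns relative and `www.`-prefixed sources into absolute URLs and returns `None` for anything outside the base host.

```python
def getAbsoluteURL(baseUrl, source):
    # Normalize an image URL found on the page to an absolute URL,
    # rejecting URLs that do not belong to baseUrl's host
    if source.startswith("http://www."):
        url = "http://" + source[11:]   # drop the "www." part
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source    # relative path
    if baseUrl not in url:
        return None
    return url

baseUrl = 'http://pic1.ajkimg.com'
print(getAbsoluteURL(baseUrl, 'a.jpg'))                           # http://pic1.ajkimg.com/a.jpg
print(getAbsoluteURL(baseUrl, 'http://www.pic1.ajkimg.com/b.jpg'))  # http://pic1.ajkimg.com/b.jpg
print(getAbsoluteURL(baseUrl, 'http://other.com/a.jpg'))          # None
```

The `None` return is what lets `getItemDetails` skip downloads for off-site images.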