Web crawlers can be written in many languages these days, such as Node.js, Go, or even PHP. The reason I chose Python is that there are plenty of tutorials and I can study it systematically: it is not enough to just use an HTML selector to scrape a page, I also want to learn about the common pitfalls and precautions in the crawling process, such as tricks like modifying the browser headers.
The code is commented in detail, so you really only need to read the source directly.
The goal of this crawler is very simple: scrape the name and price of listings on a real estate website, plus one image download (just to test the file download feature), for later analysis of house price trends. To avoid putting too much pressure on the other side's server, I chose to crawl only three pages.
Let me go over a few important points first:
# Remember to modify the request headers you send
I have heard that requests sent with the default headers carry Python-related information, which makes it easy for the target website to identify them as coming from a crawler bot and block your IP address. It is therefore best to make your crawler program look like a human. That said, the snippet below only provides basic disguise; if a website is serious about blocking crawlers, this alone won't fool it:
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
}
```
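Just to make the point concrete, here is a minimal sketch of attaching headers like these to a request with the requests library. The URL is only a placeholder, and I have trimmed the headers down to the User-Agent:

```python
import requests

# a trimmed-down version of the headers above; only the User-Agent is shown here
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome"
}

# http://example.com is a placeholder URL, not the site crawled below
resp = requests.get("http://example.com", headers=headers)
print(resp.status_code)
```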
# HTML selector: I use pyquery instead of BeautifulSoup
BeautifulSoup is recommended in many books, but as someone used to jQuery, I find BeautifulSoup's syntax a bit awkward, and it does not seem to support advanced CSS selector patterns such as first-child. Or perhaps it does and I simply did not find it, since I did not read the documentation carefully.
Then I looked around online and found that many people recommended the pyquery library. I tried it myself, found it really comfortable to use, and decided to go with it.
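To show why the jQuery-like syntax feels so natural, here is a small self-contained pyquery example. The HTML snippet is made up purely for illustration:

```python
from pyquery import PyQuery as pq

# a made-up HTML snippet, just to demonstrate the jQuery-like selector syntax
html = '''
<ul class="item-mod">
    <li><a class="items-name" href="/house/1">House One</a></li>
    <li><a class="items-name" href="/house/2">House Two</a></li>
</ul>
'''

d = pq(html)
# CSS selectors, including pseudo-classes like :first-child, work directly
print(d('.item-mod li:first-child a').attr('href'))   # -> /house/1
# iterate over matched elements the same way the crawler below does
for link in d('.items-name').items():
    print(link.text(), link.attr('href'))
```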
# Crawler ideas
The idea is actually very simple:
1. Find the listings page of a real estate site and work out the URL structure of the second and third pages;
2. Grab the URLs of all entries on each listings page and store them in a Python set() to remove duplicate URLs;
3. Visit each detail page via the collected house URLs and scrape the valuable fields, such as the text;
4. For now I simply print the data; saving it locally as JSON or CSV is still to be done (see the sketch after this list).
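When I do get around to saving the data, a minimal sketch with Python's built-in csv module might look like the following. The field names, sample row, and file name are all hypothetical:

```python
import csv

# hypothetical rows collected by the crawler: (title, price, image url)
rows = [
    ('Some Estate', '45000', 'http://pic1.ajkimg.com/some.jpg'),
]

# write them to a local CSV file for later analysis of price trends
with open('houses.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'price', 'image_url'])  # header row
    writer.writerows(rows)
```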
The following is all the code:
```python
# Obtain the page object
from urllib.request import urlopen
from urllib.request import urlretrieve
from pyquery import PyQuery as pq
# modify the request headers to simulate a real visitor
import requests
import time
# introduce the system object
import os
# Your own configuration file: rename config-sample.py to config.py and fill in the corresponding values
import config

# define the link set to avoid duplicate links
pages = set()
session = requests.Session()
baseUrl = 'http://pic1.ajkimg.com'
downLoadDir = 'images'

# collect all the list page links
def getAllPages():
    pageList = []
    i = 1
    while i < 2:
        newLink = 'http://sh.fang.anjuke.com/loupan/all/p' + str(i) + '/'
        pageList.append(newLink)
        i = i + 1
    return pageList

def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    if baseUrl not in url:
        return None
    return url

# Adjust the internal paths of this function to your own situation; it makes importing the data later easier
def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

# obtain all links on the current page
def getItemLinks(url):
    global pages
    # First, check whether the page can be fetched at all.
    try:
        req = session.get(url, headers=config.value['headers'])
    # this only catches errors such as 404 or 500; if DNS resolution fails, the cause cannot be determined here
    except IOError as e:
        print('Can not reach the page.')
        print(e)
    else:
        h = pq(req.text)
        # get all house modules on the current page
        houseItems = h('.item-mod')
        # extract what we need from each module, such as the detail page URL, price, and thumbnail
        # I prefer to grab only the detail page URL here and fetch the rest of the information on the detail page
        for houseItem in houseItems.items():
            houseUrl = houseItem.find('.items-name').attr('href')
            # print(houseUrl)
            pages.add(houseUrl)

# obtain the fields we want on the detail page; edit this part as needed
def getItemDetails(url):
    # First, check whether the page can be fetched at all.
    try:
        req = session.get(url, headers=config.value['headers'])
    # this only catches errors such as 404 or 500; if DNS resolution fails, the cause cannot be determined here
    except IOError as e:
        print('Can not reach the page.')
        print(e)
    else:
        time.sleep(1)
        h = pq(req.text)
        # get the title
        houseTitle = h('h1').text() if h('h1') != None else 'none'
        # get the price
        housePrice = h('.sp-price').text() if h('.sp-price') != None else 'none'
        # get the image URL
        houseImage = h('.con a:first-child img').attr('src')
        houseImageUrl = getAbsoluteURL(baseUrl, houseImage)
        if houseImageUrl != None:
            urlretrieve(houseImageUrl, getDownloadPath(baseUrl, houseImageUrl, downLoadDir))
        # leftovers from the earlier BeautifulSoup version:
        # if bsObj.find('em', {'class', 'sp-price'}) == None:
        #     housePrice = 'none'
        # else:
        #     housePrice = bsObj.find('em', {'class', 'sp-price'}).text
        # if bsObj.select('.con a:first-child .item img') == None:
        #     houseThumbnail = 'none'
        # else:
        #     houseThumbnail = bsObj.select('.con a:first-child .item img')

# start running the code
allPages = getAllPages()
for i in allPages:
    getItemLinks(i)
# by now, pages should be filled with plenty of URLs
for i in pages:
    getItemDetails(i)
# print(pages)
```
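One note on the config import: the config.py file itself is not shown above. Based on how the crawler uses config.value['headers'], a minimal sketch of what it might contain could look like this (rename config-sample.py to config.py and adjust the values as needed):

```python
# config.py -- a minimal sketch; only the 'headers' entry is actually used by the crawler above
value = {
    'headers': {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
    }
}
```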