Crawling Lianjia second-hand house listings with Python 3
Preface
As a newcomer to the Python web-crawling world, today I tried to crawl Lianjia's second-hand house listings. I have already crawled a similar property site, so let's see how Lianjia differs.
I. Analyze the structure of the target website
Here we take Guangzhou Lianjia second-hand houses as an example: http://gz.lianjia.com/ershoufang/
This is the first page. Looking at how the URL changes, the second page adds /pg2 and the third page adds /pg3. Does the first page correspond to /pg1? Let's test it: http://gz.lianjia.com/ershoufang/pg1/ shows the same content as http://gz.lianjia.com/ershoufang/, so there is no problem and we can continue the analysis.
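To double-check this quickly, a small script like the one below works (this is my own sanity check, not part of the original walkthrough; it simply fetches both URLs with requests and compares the responses):

import requests

# Fetch the listing page with and without the explicit /pg1/ suffix
base = requests.get('http://gz.lianjia.com/ershoufang/')
pg1 = requests.get('http://gz.lianjia.com/ershoufang/pg1/')

# Both should return 200, and the pages should be roughly the same size
print(base.status_code, pg1.status_code)
print(len(base.text), len(pg1.text))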
These listings contain the second-hand house information we want, but each one also links to a detail page. Let's take a look:
The detail page has more comprehensive information, so that is the page I want to scrape.
II. Write the crawler
1. Get the URLs
First we collect all the listing pages. There are 100 of them, so we can turn http://gz.lianjia.com/ershoufang/pg1/ ... pg2 ... pg3 ... pg100 into a list of URLs, then extract the detail-page URLs from each of those pages, and finally parse the HTML of each detail page to get the information we want.
2. Analyze the HTML of http://gz.lianjia.com/ershoufang/pg1/
First open the Network tab in Chrome's developer tools, tick Preserve log, clear the log, and then refresh the page.
We find a GET request to http://gz.lianjia.com/ershoufang/pg1/ that returns the page's HTML.
Now we can start by generating all the URLs:
def generate_allurl(user_in_nub):
    url = 'http://gz.lianjia.com/ershoufang/pg{}/'
    for url_next in range(1, int(user_in_nub)):
        yield url.format(url_next)

def main():
    user_in_nub = input('Enter the number of pages: ')
    for i in generate_allurl(user_in_nub):
        print(i)

if __name__ == '__main__':
    main()
Running result:
In this way we generate the URLs for all 100 pages.
Next we need to extract the detail-page URLs from each of these listing pages.
First, analyze the page structure.
We find that the URL we want sits in the href attribute of an <a> tag inside each listing item (the regex below matches the <a class="img" ...> link inside each <li class="clear">). We can use requests to fetch the HTML and a regular expression to pull out the detail-page URLs:
import requests
import re
Define a method that takes one of the page URLs produced by generate_allurl and prints the detail-page links found on it:
def get_allurl(generate_allurl):
    get_url = requests.get(generate_allurl)
    if get_url.status_code == 200:
        re_set = re.compile('<li.*?class="clear">.*?<a.*?class="img.*?".*?href="(.*?)"')
        re_get = re.findall(re_set, get_url.text)
        print(re_get)
The detail-page links are obtained correctly.
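To wire this up with the URL generator from earlier, here is a small sketch of my own (not the original code): it assumes we change get_allurl to return the matched links instead of printing them, so the caller can loop over them.

def get_allurl(page_url):
    # Same regex as above, but return the matches so the caller can use them
    get_url = requests.get(page_url)
    if get_url.status_code == 200:
        re_set = re.compile('<li.*?class="clear">.*?<a.*?class="img.*?".*?href="(.*?)"')
        return re.findall(re_set, get_url.text)
    return []

# Try it on the first few listing pages
for page_url in generate_allurl('5'):
    for detail_url in get_allurl(page_url):
        print(detail_url)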
The next step is to parse each detail page to get the information inside it. You can click the arrow tool in the built-in developer tools to pick out elements on the page:
The information is found under the class main. We can use the BeautifulSoup module to extract it:
from bs4 import BeautifulSoup
Define a method that takes a detail-page URL and parses it:
def open_url(re_get):
    res = requests.get(re_get)
    if res.status_code == 200:
        info = {}
        soup = BeautifulSoup(res.text, 'lxml')
        info['title'] = soup.select('.main')[0].text
        info['total price'] = soup.select('.total')[0].text + '000000'
        info['price per sqm'] = soup.select('.unitPriceValue')[0].text
        return info
Here the response from requests.get is assigned to res, res.text is handed to BeautifulSoup with the 'lxml' parser and the result is assigned to soup, and then soup.select is used to filter for the elements we want. Since the title above sits inside the element with class main,
we write soup.select('.main'). Because it is a class we prefix it with a dot; if we were filtering by id we would prefix it with # instead.
The following result is displayed:
The result is a list, so we add [0] to take the first element, and then use .text to get its text.
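To make the dot-versus-# rule concrete, here is a tiny standalone example (the HTML snippet is made up for illustration, it is not taken from Lianjia):

from bs4 import BeautifulSoup

html = '<h1 class="main">Some title</h1><div id="price">Some price</div>'  # toy HTML
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.main'))          # class selector: leading dot, returns a list of matching elements
print(soup.select('#price'))         # id selector: leading #
print(soup.select('.main')[0].text)  # take the first element of the list, then read its text

Back on the real page, a first test version of open_url just prints the title: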
def open_url(re_get):
    res = requests.get(re_get)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, 'lxml')
        title = soup.select('.main')[0].text
        print(title)
Expected result
Then we can put each field into a dictionary and have the function return that dictionary:
def open_url(re_get):
    res = requests.get(re_get)
    if res.status_code == 200:
        info = {}
        soup = BeautifulSoup(res.text, 'lxml')
        info['title'] = soup.select('.main')[0].text
        info['total price'] = soup.select('.total')[0].text + '000000'
        info['price per sqm'] = soup.select('.unitPriceValue')[0].text
        return info
Expected result:
The results can also be saved to an xlsx file:
import pandas as pd

def pandas_to_xlsx(info):
    pd_look = pd.DataFrame(info)
    pd_look.to_excel('Lianjia second-hand houses.xlsx', sheet_name='Lianjia second-hand houses')
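Finally, here is a minimal end-to-end sketch of my own showing how the pieces could fit together. It assumes the variant of get_allurl that returns the links instead of printing them, that open_url returns None for non-200 responses as above, and that an Excel writer such as openpyxl is installed for to_excel. pd.DataFrame works most naturally with a list of row dictionaries, so we collect every open_url result into a list before saving:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

def main():
    rows = []                                # one dictionary per listing
    for page_url in generate_allurl('3'):    # a few pages for a quick test
        for detail_url in get_allurl(page_url):
            info = open_url(detail_url)
            if info:                         # skip pages that did not return 200
                rows.append(info)
    pandas_to_xlsx(rows)                     # DataFrame built from the list of dicts, then written to xlsx

if __name__ == '__main__':
    main()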
OK, that is basically the whole thing. Is anything unclear? Leave a comment and I will keep updating this post.