Crawling Lianjia second-hand housing listings with Python 3


Preface

As a newcomer who has just entered the Python crawler field, today I tried to crawl Lianjia's second-hand housing listings. I have already crawled Fang.com (房天下), so let's see how Lianjia differs.

1. Analyze and observe the structure of the target website

Here we take Lianjia's Guangzhou second-hand houses as an example: http://gz.lianjia.com/ershoufang/

This is the first page. Looking at how the URL changes, the second page adds /pg2 and the third page adds /pg3. Does the first page correspond to /pg1? Let's test it: http://gz.lianjia.com/ershoufang/pg1/ returns the same listings as http://gz.lianjia.com/ershoufang/, so the pattern holds and we can continue the analysis.
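As a quick sanity check (a rough sketch of my own, assuming the listing markup matches what we analyse later in this post), we can collect the detail-page links from both URLs and compare them:

import re
import requests

# Collect the detail-page links from a list page with the same regex
# that is used later in this post.
link_re = re.compile('<li.*?class="clear">.*?<a.*?class="img.*?".*?href="(.*?)"', re.S)

def detail_links(url):
    resp = requests.get(url)
    return set(link_re.findall(resp.text)) if resp.status_code == 200 else set()

# True would confirm that /pg1/ is simply the first page.
print(detail_links('http://gz.lianjia.com/ershoufang/') ==
      detail_links('http://gz.lianjia.com/ershoufang/pg1/'))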

 

This page shows the second-hand house information we want, but each listing also links to a detail page. Let's take a look:

The detail page has more comprehensive information about each house, so that is the page I want to extract from.

2. Write the crawler

1. Get the URLs

First we need all the detail links. There are 100 list pages, so we can generate all the list-page URLs from http://gz.lianjia.com/ershoufang/pg1/ through .../pg100/, then extract the detail URLs from each list page, and finally parse the HTML of each detail page to get the information we want.

 

2. Analyze the HTML of http://gz.lianjia.com/ershoufang/pg1/

First open the Network panel in the Chrome developer tools, check Preserve log, clear the log, and then refresh the page.

We find a GET request to http://gz.lianjia.com/ershoufang/pg1/ that returns the page HTML.

Then we can start generating all URLs first:

def generate_allurl(user_in_nub):
    url = 'http://gz.lianjia.com/ershoufang/pg{}/'
    # +1 so that entering 100 really yields pg1 .. pg100
    for url_next in range(1, int(user_in_nub) + 1):
        yield url.format(url_next)

def main():
    user_in_nub = input('Number of pages to crawl: ')
    for i in generate_allurl(user_in_nub):
        print(i)

if __name__ == '__main__':
    main()

Running result:

 

In this way we generate the URLs of all 100 list pages.
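For example, asking generate_allurl for three pages produces:

>>> list(generate_allurl('3'))
['http://gz.lianjia.com/ershoufang/pg1/',
 'http://gz.lianjia.com/ershoufang/pg2/',
 'http://gz.lianjia.com/ershoufang/pg3/']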

 

Next we need to extract the detail-page URLs from each of these list pages:

Analyze the webpage structure first. We find that the URL we want is in the href of the <a> tag inside each listing item (the <li> elements with class "clear"). We can use requests to fetch the HTML and a regular expression to extract the detail-page URLs:

import requests
import re

Define a method that takes a URL produced by generate_allurl, fetches it, and prints the detail-page links it finds:

def get_allurl(generate_allurl):
    # Fetch one list page and pull out every detail-page link with a regex.
    get_url = requests.get(generate_allurl)
    if get_url.status_code == 200:
        re_set = re.compile('<li.*?class="clear">.*?<a.*?class="img.*?".*?href="(.*?)"')
        re_get = re.findall(re_set, get_url.text)
        print(re_get)

  

This obtains the detail-page links correctly.
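A small driver loop is enough to feed the generated list-page URLs into this function (this wiring is my own addition, not from the original post):

def main():
    pages = input('Number of pages to crawl: ')
    for list_page_url in generate_allurl(pages):
        get_allurl(list_page_url)   # prints the detail links found on each page

if __name__ == '__main__':
    main()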

In the next step we parse each detail page to extract its information. You can use the element-picker arrow in the browser's developer tools to select an element on the page:

  

The information we want is inside the element with the main class. We can use BeautifulSoup to obtain it:

from bs4 import BeautifulSoup

Define a method that takes a detail-page URL and parses it.


Inside it, the response from requests.get is assigned to res, res.text is handed to BeautifulSoup, which parses it with the lxml parser, and the result is assigned to soup. We then filter with soup.select, because the title above is inside the main class:

soup.select('.main') — because main is a class we prefix it with a dot (.); if we were filtering by id we would prefix it with # instead.
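A tiny self-contained illustration of the two selector forms (the HTML snippet is made up purely for this example, not taken from Lianjia):

from bs4 import BeautifulSoup

html = '<div class="main">selected by class</div><p id="main">selected by id</p>'
soup = BeautifulSoup(html, 'lxml')

print(soup.select('.main')[0].text)   # '.name' matches class="main"
print(soup.select('#main')[0].text)   # '#name' matches id="main"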

Running soup.select('.main') on the detail page displays the following result:

soup.select returns a list, so we add [0] to take the first element, and then use .text to get its text.

def open_url(re_get):
    res = requests.get(re_get)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, 'lxml')
        title = soup.select('.main')[0].text
        print(title)

Expected result

Then we can add the values (title, total price, price per square metre) to a dictionary and return it:

def open_url(re_get):
    res = requests.get(re_get)
    if res.status_code == 200:
        info = {}
        soup = BeautifulSoup(res.text, 'lxml')
        info['title'] = soup.select('.main')[0].text
        # The .total figure is in units of 10,000 RMB, so append four zeros
        info['total price'] = soup.select('.total')[0].text + '0000'
        info['price per sqm'] = soup.select('.unitPriceValue')[0].text
        return info

Expected result:

The results can also be stored in an xlsx document:

import pandas as pd

def pandas_to_xlsx(info):
    # info should be a list of dicts (one per house) or a dict of lists
    pd_look = pd.DataFrame(info)
    pd_look.to_excel('Lianjia second-hand houses.xlsx', sheet_name='Lianjia second-hand houses')
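Putting the pieces together, the whole crawler could look roughly like the sketch below. This is my own wiring of the functions defined above; get_detail_urls is a hypothetical variant of get_allurl that returns its links instead of printing them.

def get_detail_urls(list_page_url):
    # Same extraction as get_allurl above, but returning the links.
    resp = requests.get(list_page_url)
    if resp.status_code != 200:
        return []
    re_set = re.compile('<li.*?class="clear">.*?<a.*?class="img.*?".*?href="(.*?)"')
    return re.findall(re_set, resp.text)

def crawl(pages):
    # Walk every list page, follow every detail link, collect one dict per house.
    all_info = []
    for list_page_url in generate_allurl(pages):
        for detail_url in get_detail_urls(list_page_url):
            info = open_url(detail_url)
            if info:
                all_info.append(info)
    return all_info

if __name__ == '__main__':
    rows = crawl(input('Number of pages to crawl: '))
    pandas_to_xlsx(rows)   # a list of dicts is a valid DataFrame input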
        

OK, that's basically it. If anything is unclear, leave a message and I will keep updating this post.
