Python3 crawler: scraping Fangtianxia (fang.com) second-hand housing listings (multiprocessing version)


The first step is to analyze the structure of the site: http://esf.zs.fang.com/

Browse the listings and find the information we need; clicking an entry opens its detail page.

The detail pages behind those links carry the richer information, and that is what we want to collect.

1. First, collect all the detail-page links under http://esf.zs.fang.com/, for example http://esf.zs.fang.com/chushou/3_255784229.htm

2. Then parse each detail page and extract the data we need.

3. Finally, write the extracted data into MongoDB.

Open the developer tools (F12) on http://esf.zs.fang.com/ to inspect the detail links, and note the "next page" control for pagination.

The detail links live inside <dl> tags.

The markup is regular, so BeautifulSoup can pull these detail links out, but each extracted link is incomplete on its own.

Comparing an extracted link with a real detail page shows how they relate: the full address is http://esf.zs.fang.com + /chushou/3_255784229.htm
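A minimal illustration of that join (the href value here is just the example from above):

# Joining the site root with a relative detail link
base = 'http://esf.zs.fang.com'
href = '/chushou/3_255784229.htm'  # relative link taken from the <dl> markup
print(base + href)                 # http://esf.zs.fang.com/chushou/3_255784229.htm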

And then we can start writing.

First we grab the listing pages themselves. Compare two of their URLs:

http://esf.zs.fang.com/house/i33/

http://esf.zs.fang.com/house/i34/

The page number is encoded in the URL (i33 is page 3, i34 is page 4), so we can write a generator that produces these listing links:

def get_url(user_in_city, user_in_nub):
    # Build the listing-page URLs from the city code and page count
    url_home = 'http://esf.' + user_in_city + '.fang.com/house/i3{}/'
    for url_next in range(1, int(user_in_nub)):
        yield url_home.format(url_next)
user_in_city and user_in_nub are the city code and the page number: for http://esf.zs.fang.com/house/i34/ they would be zs and 4. Since both are passed in as parameters, the caller controls which city and how many pages are generated.
Calling it: because get_url is a generator, we iterate over it to get the URLs one by one.
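For example (a quick sketch, using zs for Zhongshan and 4 as the page count):

for url in get_url('zs', 4):
    print(url)
# http://esf.zs.fang.com/house/i31/
# http://esf.zs.fang.com/house/i32/
# http://esf.zs.fang.com/house/i33/

Note that range(1, int(user_in_nub)) excludes the upper bound, which is why the full script later adds 1 to the user's input.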

With a listing URL in hand, we parse that page to get the detail URLs.

Viewing the structure with the developer tools shows that each detail link is the href of an <a> tag, so we can define a method that extracts it and yields the complete detail link.

def open_url(url, user_in_city):
    # Yield the full detail-page URLs found on one listing page
    try:
        res = requests.get(url, headers=headers1)
        if res.status_code == 200:
            soup = BeautifulSoup(res.text, 'html5lib')
            url_start = 'http://esf.' + user_in_city + '.fang.com'
            for title in soup.select('.title'):  # detail-link elements
                url_end = title.select('a')[0]['href']
                yield url_start + url_end
    except RequestException:
        return print('check open_url')

First import requests and from bs4 import BeautifulSoup, which handle fetching and parsing.
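Concretely, the imports this function relies on (the same ones used in the full source below):

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException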

Status-code and exception handling are built in: if the status code is not 200, nothing is yielded, and if the request raises RequestException, 'check open_url' is printed.
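Chaining the two generators together then looks like this (a sketch; the output depends on the live site):

for page_url in get_url('zs', 2):                # each listing page
    for detail_url in open_url(page_url, 'zs'):  # detail links on that page
        print(detail_url)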

Now let's go straight to the full source.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: Tsukasa

import json
import time

import pymongo
import requests
import pandas as pd
from bs4 import BeautifulSoup
from requests.exceptions import RequestException
from fake_useragent import UserAgent
from multiprocessing import Pool

MONGO_URL = 'localhost'
MONGO_DB = 'fangtianxia'
MONGO_TABLE = 'fangtianxia_fs'

ua = UserAgent()
headers1 = {'User-Agent': ua.random}

client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]


def get_url(user_in_city, user_in_nub):
    # Build the listing-page URLs from the city code and page count
    url_home = 'http://esf.' + user_in_city + '.fang.com/house/i3{}/'
    for url_next in range(1, int(user_in_nub)):
        yield url_home.format(url_next)


def open_url(url, user_in_city):
    # Yield the full detail-page URLs found on one listing page
    try:
        res = requests.get(url, headers=headers1)
        if res.status_code == 200:
            soup = BeautifulSoup(res.text, 'html5lib')
            url_start = 'http://esf.' + user_in_city + '.fang.com'
            for title in soup.select('.title'):  # detail-link elements
                url_end = title.select('a')[0]['href']
                yield url_start + url_end
    except RequestException:
        return print('check open_url')


def one_page(house_url):
    # Parse one detail page into a dict of fields
    try:
        res = requests.get(house_url, headers=headers1)
        if res.status_code == 200:
            soup = BeautifulSoup(res.text, 'html5lib')
            info = {}
            info['web page'] = house_url
            info['title'] = soup.select('h1')[0].text.strip()           # listing title
            info['total price'] = soup.select('.red20b')[0].text + '万'  # total price (万 = 10,000 CNY)
            info['contact phone'] = soup.select('#mobilecode')[0].text  # phone number
            # now_time = time.strftime('%Y-%m-%d\t%H:%M', time.localtime(time.time()))
            # info['obj update time'] = now_time
            for sl in soup.select('span'):  # get the publish time
                if 'publish time' in sl.text.lstrip('<span>'):  # label translated from the Chinese original
                    key, value = sl.text.strip().rstrip('(').split(':')
                    info[key] = value + '*' + soup.select('#Time')[0].text
            for dd in soup.select('dd'):  # get the detailed attributes
                if ':' in dd.text.strip():
                    key, value = dd.text.strip().split(':')
                    info[key] = value
            print(info)
            return info
    except RequestException:
        return print('check one_page')


def writer_to_text(text):
    # Append one record as a JSON line
    with open('fangtianxia.text', 'a', encoding='utf-8') as f:
        f.write(json.dumps(text, ensure_ascii=False) + '\n')


def pandas_to_xlsx(pd_list):
    pd_look = pd.DataFrame(pd_list)
    pd_look.to_excel('fangtianxia.xlsx', sheet_name='fangtianxia')


def pandas_to_csv(pd_list):
    pd_look = pd.DataFrame(pd_list)
    pd_look.to_csv('fangtianxia.csv', mode='a+', header=False)


def save_to_mongodb(one_page):
    # Insert into MongoDB
    if db[MONGO_TABLE].insert_one(one_page):
        print('save to MongoDB ok!', one_page)
        return True
    return False


def update_to_mongodb(one_page):
    # Upsert into MongoDB, keyed on the page URL
    if db[MONGO_TABLE].update_one({'web page': one_page['web page']},
                                  {'$set': one_page}, upsert=True):
        print('save to MongoDB ok!')
        return True
    return False


def main(url):
    data = []
    save = one_page(url)
    data.append(save)
    pandas_to_csv(data)
    update_to_mongodb(save)
    # writer_to_text(one_page(url))


if __name__ == '__main__':
    user_in_city = input('Enter the city code, e.g. zs for Zhongshan, gz for Guangzhou\n'
                         '(enter nothing else, or the crawler cannot run): ')
    user_in_nub = 1 + int(input('Enter the number of pages to crawl: '))
    pool = Pool()
    for url in get_url(user_in_city, user_in_nub):
        pool.map(main, [url_open for url_open in open_url(url, user_in_city)])
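It is worth isolating how one_page turns the <dd> elements into dictionary entries. Here is a minimal self-contained sketch, with hypothetical English labels and the stdlib parser purely for illustration (the real pages use Chinese labels, and the script above uses html5lib):

from bs4 import BeautifulSoup

html = '<dl><dd>Layout: 3 rooms 2 halls</dd><dd>Area: 89.5 m2</dd><dd>no colon here</dd></dl>'
soup = BeautifulSoup(html, 'html.parser')

info = {}
for dd in soup.select('dd'):
    text = dd.text.strip()
    if ':' in text:                  # only 'key: value' entries are kept
        key, value = text.split(':')
        info[key] = value.strip()
print(info)  # {'Layout': '3 rooms 2 halls', 'Area': '89.5 m2'}

One caveat: split(':') unpacks into exactly two names, so a <dd> containing more than one colon would raise ValueError, which the RequestException handler in one_page does not catch.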

  
