Python3 crawler: scraping Fangtianxia (fang.com) second-hand housing listings (multiprocessing version)


The first step is to analyze the structure of the site: http://esf.zs.fang.com/

Browse the listings and find the information we need; clicking an entry opens its detail page.

The detail pages behind those links carry the richer information, and that is what we want to collect.

1. First, collect all the detail-page links under http://esf.zs.fang.com/, for example http://esf.zs.fang.com/chushou/3_255784229.htm

2. Then parse each detail page and extract the data we need.

3. Finally, write the extracted data into MongoDB.

Open the developer tools (F12) on http://esf.zs.fang.com/ to inspect the detail links, and note the "next page" control for pagination.

The detail links live inside <dl> tags.

The markup is regular, so BeautifulSoup can pull these detail links out, but each extracted link is incomplete on its own.

Comparing an extracted link with a real detail page shows how they relate: the full address is http://esf.zs.fang.com + /chushou/3_255784229.htm
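A minimal illustration of that join (the href value here is just the example from above):

# Joining the site root with a relative detail link
base = 'http://esf.zs.fang.com'
href = '/chushou/3_255784229.htm'  # relative link taken from the <dl> markup
print(base + href)                 # http://esf.zs.fang.com/chushou/3_255784229.htm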

And then we can start writing.

First we grab the listing pages themselves. Compare two of their URLs:

http://esf.zs.fang.com/house/i33/

http://esf.zs.fang.com/house/i34/

The page number is encoded in the URL (i33 is page 3, i34 is page 4), so we can write a generator that produces these listing links:

def get_url(user_in_city, user_in_nub):
    # Build the listing-page URLs from the city code and page count
    url_home = 'http://esf.' + user_in_city + '.fang.com/house/i3{}/'
    for url_next in range(1, int(user_in_nub)):
        yield url_home.format(url_next)
user_in_city and user_in_nub are the city code and the page number: for http://esf.zs.fang.com/house/i34/ they would be zs and 4. Since both are passed in as parameters, the caller controls which city and how many pages are generated.
Calling it: because get_url is a generator, we iterate over it to get the URLs one by one.
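For example (a quick sketch, using zs for Zhongshan and 4 as the page count):

for url in get_url('zs', 4):
    print(url)
# http://esf.zs.fang.com/house/i31/
# http://esf.zs.fang.com/house/i32/
# http://esf.zs.fang.com/house/i33/

Note that range(1, int(user_in_nub)) excludes the upper bound, which is why the full script later adds 1 to the user's input.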

With a listing URL in hand, we parse that page to get the detail URLs.

Viewing the structure with the developer tools shows that each detail link is the href of an <a> tag, so we can define a method that extracts it and yields the complete detail link.

def open_url(url, user_in_city):
    # Yield the full detail-page URLs found on one listing page
    try:
        res = requests.get(url, headers=headers1)
        if res.status_code == 200:
            soup = BeautifulSoup(res.text, 'html5lib')
            url_start = 'http://esf.' + user_in_city + '.fang.com'
            for title in soup.select('.title'):  # detail-link elements
                url_end = title.select('a')[0]['href']
                yield url_start + url_end
    except RequestException:
        return print('check open_url')

First import requests and from bs4 import BeautifulSoup, which handle fetching and parsing.
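Concretely, the imports this function relies on (the same ones used in the full source below):

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException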

Status-code and exception handling are built in: if the status code is not 200, nothing is yielded, and if the request raises RequestException, 'check open_url' is printed.
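Chaining the two generators together then looks like this (a sketch; the output depends on the live site):

for page_url in get_url('zs', 2):                # each listing page
    for detail_url in open_url(page_url, 'zs'):  # detail links on that page
        print(detail_url)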

Now let's go straight to the full source.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: Tsukasa

import json
import time

import pymongo
import requests
import pandas as pd
from bs4 import BeautifulSoup
from requests.exceptions import RequestException
from fake_useragent import UserAgent
from multiprocessing import Pool

MONGO_URL = 'localhost'
MONGO_DB = 'fangtianxia'
MONGO_TABLE = 'fangtianxia_fs'

ua = UserAgent()
headers1 = {'User-Agent': ua.random}

client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]


def get_url(user_in_city, user_in_nub):
    # Build the listing-page URLs from the city code and page count
    url_home = 'http://esf.' + user_in_city + '.fang.com/house/i3{}/'
    for url_next in range(1, int(user_in_nub)):
        yield url_home.format(url_next)


def open_url(url, user_in_city):
    # Yield the full detail-page URLs found on one listing page
    try:
        res = requests.get(url, headers=headers1)
        if res.status_code == 200:
            soup = BeautifulSoup(res.text, 'html5lib')
            url_start = 'http://esf.' + user_in_city + '.fang.com'
            for title in soup.select('.title'):  # detail-link elements
                url_end = title.select('a')[0]['href']
                yield url_start + url_end
    except RequestException:
        return print('check open_url')


def one_page(house_url):
    # Parse one detail page into a dict of fields
    try:
        res = requests.get(house_url, headers=headers1)
        if res.status_code == 200:
            soup = BeautifulSoup(res.text, 'html5lib')
            info = {}
            info['web page'] = house_url
            info['title'] = soup.select('h1')[0].text.strip()           # listing title
            info['total price'] = soup.select('.red20b')[0].text + '万'  # total price (万 = 10,000 CNY)
            info['contact phone'] = soup.select('#mobilecode')[0].text  # phone number
            # now_time = time.strftime('%Y-%m-%d\t%H:%M', time.localtime(time.time()))
            # info['obj update time'] = now_time
            for sl in soup.select('span'):  # get the publish time
                if 'publish time' in sl.text.lstrip('<span>'):  # label translated from the Chinese original
                    key, value = sl.text.strip().rstrip('(').split(':')
                    info[key] = value + '*' + soup.select('#Time')[0].text
            for dd in soup.select('dd'):  # get the detailed attributes
                if ':' in dd.text.strip():
                    key, value = dd.text.strip().split(':')
                    info[key] = value
            print(info)
            return info
    except RequestException:
        return print('check one_page')


def writer_to_text(text):
    # Append one record as a JSON line
    with open('fangtianxia.text', 'a', encoding='utf-8') as f:
        f.write(json.dumps(text, ensure_ascii=False) + '\n')


def pandas_to_xlsx(pd_list):
    pd_look = pd.DataFrame(pd_list)
    pd_look.to_excel('fangtianxia.xlsx', sheet_name='fangtianxia')


def pandas_to_csv(pd_list):
    pd_look = pd.DataFrame(pd_list)
    pd_look.to_csv('fangtianxia.csv', mode='a+', header=False)


def save_to_mongodb(one_page):
    # Insert into MongoDB
    if db[MONGO_TABLE].insert_one(one_page):
        print('save to MongoDB ok!', one_page)
        return True
    return False


def update_to_mongodb(one_page):
    # Upsert into MongoDB, keyed on the page URL
    if db[MONGO_TABLE].update_one({'web page': one_page['web page']},
                                  {'$set': one_page}, upsert=True):
        print('save to MongoDB ok!')
        return True
    return False


def main(url):
    data = []
    save = one_page(url)
    data.append(save)
    pandas_to_csv(data)
    update_to_mongodb(save)
    # writer_to_text(one_page(url))


if __name__ == '__main__':
    user_in_city = input('Enter the city code, e.g. zs for Zhongshan, gz for Guangzhou\n'
                         '(enter nothing else, or the crawler cannot run): ')
    user_in_nub = 1 + int(input('Enter the number of pages to crawl: '))
    pool = Pool()
    for url in get_url(user_in_city, user_in_nub):
        pool.map(main, [url_open for url_open in open_url(url, user_in_city)])
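It is worth isolating how one_page turns the <dd> elements into dictionary entries. Here is a minimal self-contained sketch, with hypothetical English labels and the stdlib parser purely for illustration (the real pages use Chinese labels, and the script above uses html5lib):

from bs4 import BeautifulSoup

html = '<dl><dd>Layout: 3 rooms 2 halls</dd><dd>Area: 89.5 m2</dd><dd>no colon here</dd></dl>'
soup = BeautifulSoup(html, 'html.parser')

info = {}
for dd in soup.select('dd'):
    text = dd.text.strip()
    if ':' in text:                  # only 'key: value' entries are kept
        key, value = text.split(':')
        info[key] = value.strip()
print(info)  # {'Layout': '3 rooms 2 halls', 'Area': '89.5 m2'}

One caveat: split(':') unpacks into exactly two names, so a <dd> containing more than one colon would raise ValueError, which the RequestException handler in one_page does not catch.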

  
