[Scrapy in Practice] Scraping Anjuke Guangzhou new-property data


Requirement: scrape the data under [Anjuke → Guangzhou → new properties], down to a number of fields on each property's detail page.

Pain point: property types vary widely (residential, villa, mixed commercial/residential, shops, offices), and different types label their fields differently. Within a single type, say residential, listings are further split into off-plan on sale, completed units on sale, awaiting sale, and final clearance, and the other types have similar variations. So the set of fields cannot be fixed in advance.

Solution: two options came to mind. Option 1: define no fields in items.py and let the spider pick up whatever fields each page happens to have (keep whatever is there), returning a dict to be stored. Option 2: write separate extraction rules for each page layout, which is clearly not worth the effort. I went with option 1; if you have a better approach, I'd be glad to hear about it.
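To make option 1 concrete, here is a minimal sketch of the idea, assuming (as the spider further below relies on) that each parameter on a detail page sits in a pair of elements with classes "name" and "des"; the function name is mine, not part of the project:

from bs4 import BeautifulSoup


def extract_fields(html):
    """Keep whatever fields this particular detail page happens to have."""
    soup = BeautifulSoup(html, "lxml")
    names = [tag.get_text(strip=True) for tag in soup.find_all(class_="name")]
    values = [tag.get_text(strip=True) for tag in soup.find_all(class_="des")]
    # The dict keys come from the page itself, so residential, villa and shop
    # listings can all be stored without a fixed schema.
    return dict(zip(names, values))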

Target URL: http://gz.fang.anjuke.com/ (the new-property listings under that page).

Sample detail page: http://gz.fang.anjuke.com/loupan/canshu-298205.html?from=loupan_tab

Now to the Scrapy code itself. Creating the project is the standard scrapy startproject routine, so those steps are skipped.

1. count.py

# -*- coding: utf-8 -*-
__author__ = 'Oscar_Yang'
"""
    count.py: watch how many documents the crawl has written to MongoDB.
"""
import time

import pymongo

client = pymongo.MongoClient("localhost", 27017)
db = client["SCRAPY_anjuke_gz"]
sheet = db["anjuke_doc1"]

while True:
    print(sheet.find().count())
    print("____________________________________")
    time.sleep(3)
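A small caveat: newer PyMongo releases have removed Cursor.count(), so on a recent driver the same monitor could be written roughly as follows (same database and collection names as above):

import time

import pymongo

client = pymongo.MongoClient("localhost", 27017)
sheet = client["SCRAPY_anjuke_gz"]["anjuke_doc1"]

while True:
    # count_documents({}) replaces the removed Cursor.count().
    print(sheet.count_documents({}))
    print("____________________________________")
    time.sleep(3)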

1 """2     entrypoint.py3 """4 from scrapy.cmdline import execute5 execute(['scrapy', 'crawl', 'anjuke_gz'])
Then settings.py. Apart from the generated defaults, the MongoDB connection details are defined here, the pipeline is enabled, ROBOTSTXT_OBEY is turned off and the HTTP cache is switched on:

# -*- coding: utf-8 -*-
"""
    settings.py
"""

# Scrapy settings for anjuke_gz project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'anjuke_gz'

SPIDER_MODULES = ['anjuke_gz.spiders']
NEWSPIDER_MODULE = 'anjuke_gz.spiders'
MONGODB_HOST = "127.0.0.1"
MONGODB_PORT = 27017
MONGODB_DBNAME = "SCRAPY_anjuke_gz"
MONGODB_DOCNAME = "anjuke_doc1"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'anjuke_gz (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'anjuke_gz.middlewares.AnjukeGzSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'anjuke_gz.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'anjuke_gz.pipelines.AnjukeGzPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Next, items.py. Since no fields are defined, it stays as the generated default.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AnjukeGzItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
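For contrast, a fixed-field items.py (the route not taken) would look roughly like the sketch below. The field names here are hypothetical examples rather than Anjuke's actual labels, which is precisely the problem: residential, villa and shop pages would each need a different set.

import scrapy


class FixedAnjukeItem(scrapy.Item):
    # Hypothetical fields, for illustration only.
    name = scrapy.Field()
    price = scrapy.Field()
    sale_status = scrapy.Field()
    url = scrapy.Field()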

Next, pipelines.py, which holds the MongoDB configuration.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings


class AnjukeGzPipeline(object):
    def __init__(self):
        # Pick up the MONGODB_* values defined in settings.py.
        host = settings["MONGODB_HOST"]
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        client = pymongo.MongoClient(port=port, host=host)
        tdb = client[dbname]
        self.post = tdb[settings["MONGODB_DOCNAME"]]

    def process_item(self, item, spider):
        # Every yielded dict goes straight into the collection.
        info = dict(item)
        self.post.insert(info)
        return item
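A note for newer versions: scrapy.conf has since been removed from Scrapy, and PyMongo has deprecated insert() in favour of insert_one(). A sketch of the equivalent pipeline on current APIs, reading the same MONGODB_* keys, might look like this:

import pymongo


class AnjukeGzPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        # Read the same MONGODB_* keys defined in settings.py.
        return cls(
            host=crawler.settings.get("MONGODB_HOST"),
            port=crawler.settings.getint("MONGODB_PORT"),
            dbname=crawler.settings.get("MONGODB_DBNAME"),
            docname=crawler.settings.get("MONGODB_DOCNAME"),
        )

    def __init__(self, host, port, dbname, docname):
        client = pymongo.MongoClient(host=host, port=port)
        self.post = client[dbname][docname]

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item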

Finally, the heart of it: the spider.

"""
    The spider itself.
"""
import re

import requests
import scrapy
from bs4 import BeautifulSoup


class Myspider(scrapy.Spider):
    name = 'anjuke_gz'
    # allowed_domains expects bare domain names, not URLs.
    allowed_domains = ['gz.fang.anjuke.com']
    start_urls = ["http://gz.fang.anjuke.com/loupan/all/p{}/".format(i) for i in range(39)]

    def parse(self, response):
        soup = BeautifulSoup(response.text, "lxml")
        content = soup.find_all(class_="items-name")  # one link per property on the index page
        for item in content:
            code = item["href"].split("/")[-1][:6]
            # Piece together the URL of the property's parameter (detail) page.
            real_href = "http://gz.fang.anjuke.com/loupan/canshu-{}.html?from=loupan_tab".format(code)
            res = requests.get(real_href)  # a blocking call inside parse(); see the sketch below
            soup = BeautifulSoup(res.text, "lxml")
            a = re.findall(r'<div class="name">(.*?)</div>', str(soup))  # field labels
            b = soup.find_all(class_="des")                              # field values
            data = {}
            for label, value in zip(a, b):
                data[label] = value.text.strip()
            data["url"] = real_href
            yield data
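One design note: the requests.get() call inside parse() is synchronous and blocks Scrapy's engine, so detail pages are fetched one at a time. A hedged sketch of the same two-step crawl pushed through Scrapy's own scheduler instead (the class name and the parse_detail callback are mine, not part of the original project):

import re

import scrapy
from bs4 import BeautifulSoup


class MyspiderAsync(scrapy.Spider):
    """Sketch: same logic as above, but detail pages go through Scrapy's scheduler."""
    name = 'anjuke_gz_async'
    allowed_domains = ['gz.fang.anjuke.com']
    start_urls = ["http://gz.fang.anjuke.com/loupan/all/p{}/".format(i) for i in range(39)]

    def parse(self, response):
        soup = BeautifulSoup(response.text, "lxml")
        for item in soup.find_all(class_="items-name"):
            code = item["href"].split("/")[-1][:6]
            url = "http://gz.fang.anjuke.com/loupan/canshu-{}.html?from=loupan_tab".format(code)
            # Hand the detail page back to Scrapy so requests run concurrently.
            yield scrapy.Request(url, callback=self.parse_detail)

    def parse_detail(self, response):
        soup = BeautifulSoup(response.text, "lxml")
        labels = re.findall(r'<div class="name">(.*?)</div>', str(soup))
        values = soup.find_all(class_="des")
        data = {label: value.get_text(strip=True) for label, value in zip(labels, values)}
        data["url"] = response.url
        yield data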

With that in place, the data ends up in MongoDB.

Because a single set of extraction rules has to cover several different page structures, the crawl cannot target each field individually, so the stored data will still need cleaning before any analysis.

Running MongoDB queries from Python and handing the results to pandas should make that cleaning fairly painless.
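A minimal sketch of that step, assuming the same local MongoDB configured in settings.py:

import pandas as pd
import pymongo

client = pymongo.MongoClient("localhost", 27017)
docs = list(client["SCRAPY_anjuke_gz"]["anjuke_doc1"].find({}, {"_id": 0}))

# Listings of different types simply get NaN for the fields they lack,
# which makes the gaps easy to spot and clean up.
df = pd.DataFrame(docs)
print(df.shape)
print(df.columns.tolist())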
