46. Building a Search Engine with a Distributed Python Crawler, Scrapy in Depth: Writing Scrapy Data into elasticsearch (Search Engine)

Last updated: 2018-01-03 Source: Internet

Uploaded by: User


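The elasticsearch (search engine) operations we covered earlier, such as create, delete, update, and query, all used elasticsearch's own command language, much like SQL commands. elasticsearch also officially provides a Python package for working with elasticsearch, an ORM-style framework comparable to SQLAlchemy for databases. With this module, elasticsearch-dsl-py, we no longer have to write raw commands; we simply work with a Python class.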

elasticsearch-dsl-py download: https://github.com/elastic/elasticsearch-dsl-py

Documentation: http://elasticsearch-dsl.readthedocs.io/en/latest/

First, install the elasticsearch-dsl-py module.

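1. Using the elasticsearch-dsl module

create_connection(hosts=['127.0.0.1']): connects to the elasticsearch (search engine) server; several servers can be listed
class Meta: sets the index name and the type (table) name
IndexClassName.init(): creates the index, the type (table), and the fields
index_class_instance.save(): writes the data into elasticsearch (search engine)

elasticsearch_orm.py: the file that handles elasticsearch (search engine) operations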

#!/usr/bin/env python
# -*- coding:utf8 -*-
from datetime import datetime
from elasticsearch_dsl import (DocType, Date, Nested, Boolean, analyzer,
                               InnerObjectWrapper, Completion, Keyword, Text, Integer)
# For more field types, see section 364 on elasticsearch (search engine) mapping management
from elasticsearch_dsl.connections import connections  # import the helper that connects to the elasticsearch server

connections.create_connection(hosts=['127.0.0.1'])


class lagouType(DocType):  # define a custom class that inherits from DocType
    # Text fields are analyzed, so a Chinese analyzer is required; ik_max_word is a Chinese analyzer
    title = Text(analyzer="ik_max_word")  # field name = field type; Text is an analyzed string type that builds an inverted index
    description = Text(analyzer="ik_max_word")
    keywords = Text(analyzer="ik_max_word")
    url = Keyword()  # field name = field type; Keyword is a plain string type that is not analyzed
    riqi = Date()  # field name = field type; Date is the date type

    class Meta:  # Meta is a fixed name
        index = "lagou"  # index name (comparable to a database name)
        doc_type = 'biao'  # type name (comparable to a table name)


if __name__ == "__main__":  # runs only when this file is executed directly, not when it is imported
    lagouType.init()  # create the index, type, and field mappings in elasticsearch

# Usage:
# Import this module wherever you need to work with elasticsearch (search engine)
# lagou = lagouType()           # instantiate the class
# lagou.title = 'value'         # field to write = value
# lagou.description = 'value'
# lagou.keywords = 'value'
# lagou.url = 'value'
# lagou.riqi = 'value'
# lagou.save()                  # write the data into elasticsearch (search engine)
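To make it easier to picture what lagouType.init() sets up, the sketch below writes out by hand the kind of index mapping that the class describes. It is only an approximation: the JSON that elasticsearch-dsl actually sends can differ between versions, and the nested "biao" level assumes an elasticsearch release that still supports mapping types. The names LAGOU_MAPPING and mapping_as_json are hypothetical and do not appear in the project.

```python
import json

# Hand-written approximation of the mapping described by lagouType.
# Assumption: an elasticsearch version that still has mapping types ("biao").
LAGOU_MAPPING = {
    "mappings": {
        "biao": {
            "properties": {
                "title": {"type": "text", "analyzer": "ik_max_word"},
                "description": {"type": "text", "analyzer": "ik_max_word"},
                "keywords": {"type": "text", "analyzer": "ik_max_word"},
                "url": {"type": "keyword"},   # stored as-is, not analyzed
                "riqi": {"type": "date"},
            }
        }
    }
}


def mapping_as_json():
    """Return the mapping as pretty-printed JSON for inspection."""
    return json.dumps(LAGOU_MAPPING, indent=2, sort_keys=True)


print(mapping_as_json())
```

Comparing this output with what a GET on the index mapping returns is one way to check that init() created the fields you expected.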

2. Writing Scrapy data into elasticsearch

Spider file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from adc.items import LagouItem, LagouItemLoader  # import the item container class and the ItemLoader class
import time


class LagouSpider(CrawlSpider):  # create the spider class
    name = 'lagou'  # spider name
    allowed_domains = ['www.luyin.org']  # allowed domain
    start_urls = ['http://www.luyin.org/']  # start URL

    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,  # overrides the same setting in settings.py; enables AutoThrottle
        "DOWNLOAD_DELAY": 5
    }

    rules = (
        # rule for crawling list pages
        Rule(LinkExtractor(allow=(r'ggwa/.*')), follow=True),
        # rule for crawling content pages
        Rule(LinkExtractor(allow=(r'post/\d+.html.*')), callback='parse_job', follow=True),
    )

    def parse_job(self, response):  # callback; note: CrawlSpider already defines parse internally, so never name your own method parse
        atime = time.localtime(time.time())  # get the current system time
        dqatime = "{0}-{1}-{2} {3}:{4}:{5}".format(
            atime.tm_year,
            atime.tm_mon,
            atime.tm_mday,
            atime.tm_hour,
            atime.tm_min,
            atime.tm_sec
        )  # take the individual date and time fields and join them into one complete timestamp
        url = response.url
        item_loader = LagouItemLoader(LagouItem(), response=response)  # fill the data into LagouItem from items.py
        item_loader.add_xpath('title', '/html/head/title/text()')
        item_loader.add_xpath('description', '/html/head/meta[@name="Description"]/@content')
        item_loader.add_xpath('keywords', '/html/head/meta[@name="keywords"]/@content')
        item_loader.add_value('url', url)
        item_loader.add_value('riqi', dqatime)
        article_item = item_loader.load_item()
        yield article_item
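The spider above joins the raw time fields with str.format(), so months, days, hours, minutes, and seconds are not zero-padded. If the date field in your index turns out to reject such values, a zero-padded ISO 8601 string is a common alternative. The sketch below only compares the two outputs; the function names format_like_spider and format_crawl_time are hypothetical, and whether your index accepts either form depends on its date format settings.

```python
from datetime import datetime


def format_like_spider(moment):
    """Reproduce the spider's str.format() approach (no zero padding)."""
    return "{0}-{1}-{2} {3}:{4}:{5}".format(
        moment.year, moment.month, moment.day,
        moment.hour, moment.minute, moment.second,
    )


def format_crawl_time(moment):
    """Return a zero-padded ISO 8601 timestamp."""
    return moment.strftime("%Y-%m-%dT%H:%M:%S")


sample = datetime(2018, 1, 3, 9, 5, 2)
print(format_like_spider(sample))  # 2018-1-3 9:5:2
print(format_crawl_time(sample))   # 2018-01-03T09:05:02
```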

items.py file

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
# items.py is dedicated to receiving the data the spider collects; think of it as the container file

import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader  # import ItemLoader, which loads the item container class and fills it with data
from adc.models.elasticsearch_orm import lagouType  # import the elasticsearch module


class LagouItemLoader(ItemLoader):  # custom loader that inherits from ItemLoader; the spider uses it to fill the Item class
    default_output_processor = TakeFirst()  # ItemLoader collects values as lists by default; TakeFirst() returns the first value in the list


def tianjia(value):  # custom preprocessing function
    return value  # return the processed value to the Item


class LagouItem(scrapy.Item):  # container class for the data the spider collects
    title = scrapy.Field(  # receives the title collected by the spider
        input_processor=MapCompose(tianjia),  # pass the preprocessing function to MapCompose; its value parameter receives the title field automatically
    )
    description = scrapy.Field()
    keywords = scrapy.Field()
    url = scrapy.Field()
    riqi = scrapy.Field()

    def save_to_es(self):
        lagou = lagouType()  # instantiate the elasticsearch (search engine) object
        lagou.title = self['title']  # field name = value
        lagou.description = self['description']
        lagou.keywords = self['keywords']
        lagou.url = self['url']
        lagou.riqi = self['riqi']
        lagou.save()  # write the data into elasticsearch (search engine)
        return
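The items.py file relies on two loader processors, MapCompose and TakeFirst. To show roughly what they do to the extracted values, here is a simplified pure-Python imitation. The functions map_compose and take_first are hypothetical stand-ins written for this illustration, not Scrapy's own implementation, and they skip details such as loader contexts.

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def map_compose(*functions):
    """Build a processor that pipes every value through each function in turn."""
    def process(values):
        for function in functions:
            next_values = []
            for value in values:
                result = function(value)
                if result is None:
                    continue  # a None result drops the value
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)  # flatten list results
                else:
                    next_values.append(result)
            values = next_values
        return values
    return process


clean = map_compose(str.strip, str.title)
extracted = ["  python crawler  ", "  second title "]
print(clean(extracted))              # ['Python Crawler', 'Second Title']
print(take_first(clean(extracted)))  # Python Crawler
```

In the real project the input processor (MapCompose) runs as values are added to the loader, and the output processor (TakeFirst) runs when load_item() is called, which is why each field ends up as a single string rather than a list.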

pipelines.py file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from adc.models.elasticsearch_orm import lagouType  # import the elasticsearch module


class AdcPipeline(object):
    def process_item(self, item, spider):
        # The data could also be written to elasticsearch right here; the drawback is that every item type is handled the same way
        # lagou = lagouType()
        # lagou.title = item['title']
        # lagou.description = item['description']
        # lagou.keywords = item['keywords']
        # lagou.url = item['url']
        # lagou.riqi = item['riqi']
        # lagou.save()
        item.save_to_es()  # call save_to_es from items.py to write the data into elasticsearch
        return item

settings.py file: register the pipeline

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'adc.pipelines.AdcPipeline': 300,
}

main.py: the crawler launch file

#!/usr/bin/env python
# -*- coding:utf8 -*-
from scrapy.cmdline import execute  # import the function that runs scrapy commands
import sys
import os

sys.path.append(os.path.join(os.getcwd()))  # add the current working directory (where main.py is run from) to the Python module search path
execute(['scrapy', 'crawl', 'lagou', '--nolog'])  # run the scrapy command without log output
# execute(['scrapy', 'crawl', 'lagou'])  # run the scrapy command with log output

Running the crawler

Data written to elasticsearch (search engine)

Supplement: create, delete, update, and query with elasticsearch-dsl

1. Adding data

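from adc.models.elasticsearch_orm import lagouType  # import the elasticsearch module configured earlier

# The lines below run inside LagouItem.save_to_es(), where self is the item
lagou = lagouType()  # instantiate the elasticsearch (search engine) object
lagou.meta.id = 1  # custom ID, set through meta.id; very important, because later operations look documents up by ID
lagou.title = self['title']  # field name = value
lagou.description = self['description']
lagou.keywords = self['keywords']
lagou.url = self['url']
lagou.riqi = self['riqi']
lagou.save()  # write the data into elasticsearch (search engine)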
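Because later operations depend on the document ID, it helps if the same page always gets the same ID. One possible approach, which is an assumption of this sketch and not something the original project does, is to derive the ID from the page URL with a hash. The function name doc_id_for_url is hypothetical.

```python
import hashlib


def doc_id_for_url(url):
    """Return a stable 32-character hexadecimal ID derived from the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


first = doc_id_for_url("http://www.luyin.org/post/1.html")
again = doc_id_for_url("http://www.luyin.org/post/1.html")
other = doc_id_for_url("http://www.luyin.org/post/2.html")
print(first == again)  # True: the same URL always maps to the same ID
print(first == other)  # False
print(len(first))      # 32
```

With an ID like this, saving a page that was crawled before would overwrite the existing document instead of adding a second copy, assuming the ID is assigned to the document before save() is called.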

2. Deleting a specific document

from adc.models.elasticsearch_orm import lagouType  # import the elasticsearch module configured earlier

sousuo_orm = lagouType()  # instantiate
sousuo_orm.get(id=1).delete()  # delete the document whose id is 1

3. Updating a specific document

from adc.models.elasticsearch_orm import lagouType  # import the elasticsearch module configured earlier

sousuo_orm = lagouType()  # instantiate
sousuo_orm.get(id=1).update(title='123456789')  # update the document whose id is 1

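Everything above uses the elasticsearch-dsl module.

Note that the examples below use the native elasticsearch module.

Deleting a specific index is equivalent to dropping a specific database.

Deleting a specific index with the native elasticsearch module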

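from elasticsearch import Elasticsearch  # import the native elasticsearch (search engine) client

# settings.Elasticsearch_hosts comes from the project's own configuration, which is not shown here
client = Elasticsearch(hosts=settings.Elasticsearch_hosts)  # connect with the native elasticsearch client

# Delete the specified index with the native elasticsearch module
# Handle errors here, because deleting an index that does not exist raises an exception
try:
    client.indices.delete(index='jxiou_zuopin')
except Exception as e:
    pass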
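Catching every exception also hides unrelated failures such as a refused connection. A narrower variant is to check whether the index exists before deleting it. The sketch below assumes a client object that offers indices.exists() and indices.delete(), as the native elasticsearch client does; FakeClient and FakeIndices are hypothetical in-memory stand-ins so the example runs without a server, and delete_index_if_exists is a hypothetical helper name.

```python
def delete_index_if_exists(client, index_name):
    """Delete index_name and return True, or return False if it is absent."""
    if client.indices.exists(index=index_name):
        client.indices.delete(index=index_name)
        return True
    return False


class FakeIndices:
    """In-memory stand-in for client.indices, for demonstration only."""

    def __init__(self, names):
        self._names = set(names)

    def exists(self, index):
        return index in self._names

    def delete(self, index):
        self._names.remove(index)


class FakeClient:
    """In-memory stand-in for an Elasticsearch client, for demonstration only."""

    def __init__(self, names):
        self.indices = FakeIndices(names)


client = FakeClient(["jxiou_zuopin"])
print(delete_index_if_exists(client, "jxiou_zuopin"))  # True
print(delete_index_if_exists(client, "jxiou_zuopin"))  # False
```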

Native query

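from elasticsearch import Elasticsearch  # import the native elasticsearch (search engine) client

# Elasticsearch_hosts, sousuoci (search keyword), page, tiaoshu (results per page) and jia_mi()
# are defined elsewhere in the project and are not shown here
client = Elasticsearch(hosts=Elasticsearch_hosts)  # connect with the native elasticsearch client
response = client.search(  # search() on the native client accepts raw elasticsearch query statements
    index="jxiou_zuopin",  # index name
    doc_type="zuopin",  # type (table) name
    body={  # the elasticsearch query statement
        "query": {
            "multi_match": {  # multi_match query
                "query": sousuoci,  # search keyword
                "fields": ["title"]  # fields to search
            }
        },
        "from": (page - 1) * tiaoshu,  # offset of the first result to return
        "size": tiaoshu,  # number of results to return
        "highlight": {  # highlight the search keyword
            "pre_tags": ['<span class="gaoliang">'],  # opening highlight tag
            "post_tags": ['</span>'],  # closing highlight tag
            "fields": {  # highlight settings
                "title": {}  # field to highlight
            }
        }
    }
)
# Start reading the results
total_nums = response["hits"]["total"]  # total number of matching results
hit_list = []  # list that collects the results to return to the HTML page
for hit in response["hits"]["hits"]:  # loop over the results
    hit_dict = {}  # dictionary that holds one result
    if "title" in hit.get("highlight", {}):  # check whether the title field has highlighted content
        hit_dict["title"] = "".join(hit["highlight"]["title"])  # use the highlighted title
    else:
        hit_dict["title"] = hit["_source"]["title"]  # otherwise use the plain title
    hit_dict["id"] = hit["_source"]["nid"]  # get the returned nid
    # encrypt the sample audio URL
    hit_dict["yangsrc"] = jia_mi(str(hit["_source"]["yangsrc"]))  # get the returned yangsrc
    hit_list.append(hit_dict)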

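The paging arithmetic and the highlight fallback in the native query can be tried out without a running server. The sketch below separates them into two small functions and feeds the second one a hand-made response. The names build_search_body and collect_titles and the sample data are hypothetical; only the query layout and the "hits" structure follow the code above.

```python
def build_search_body(keyword, page, page_size):
    """Build a multi_match query on title with paging and highlighting."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        "query": {"multi_match": {"query": keyword, "fields": ["title"]}},
        "from": (page - 1) * page_size,  # offset of the first result
        "size": page_size,               # results per page
        "highlight": {
            "pre_tags": ['<span class="gaoliang">'],
            "post_tags": ["</span>"],
            "fields": {"title": {}},
        },
    }


def collect_titles(response):
    """Prefer the highlighted title and fall back to the stored one."""
    titles = []
    for hit in response["hits"]["hits"]:
        highlight = hit.get("highlight", {})
        if "title" in highlight:
            titles.append("".join(highlight["title"]))
        else:
            titles.append(hit["_source"]["title"])
    return titles


# Hand-made sample shaped like a search response (assumption, not real data)
sample_response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_source": {"title": "python crawler"},
             "highlight": {"title": ['<span class="gaoliang">python</span> crawler']}},
            {"_source": {"title": "plain title"}},
        ],
    }
}

print(build_search_body("python", page=3, page_size=10)["from"])  # 20
print(collect_titles(sample_response))
```

Page 3 with 10 results per page starts at offset 20, which matches the (page - 1) * tiaoshu expression in the query above.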