Introduction to Python Search Engines and Crawler Frameworks


Full-Text Search Engines

1、Sphinx

1.1. What Is Sphinx

Sphinx is a full-text search engine developed by the Russian developer Andrew Aksyonoff. It aims to provide other applications with high-speed, low-footprint, highly relevant full-text search, and it integrates very easily with SQL databases and scripting languages. The current release has built-in support for MySQL and PostgreSQL data sources, and it can also read XML data in a specific format from standard input. By modifying the source code, users can add new data sources themselves (for example, native support for other kinds of DBMS).

Official APIs for PHP, Python, Java, Ruby and pure C are included in the Sphinx distribution.
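For a sense of how the Python API is used, here is a minimal sketch that queries a running searchd daemon through the sphinxapi.py module shipped in the api/ directory of the Sphinx distribution. The server address, match mode and index name ("test1") are placeholder assumptions, not values mandated by Sphinx:

    # Minimal sketch: query Sphinx from Python via the bundled sphinxapi.py.
    # Assumes searchd is listening on localhost:9312 and an index named
    # "test1" exists; both are placeholders for this example.
    import sphinxapi

    client = sphinxapi.SphinxClient()
    client.SetServer('localhost', 9312)           # default searchd port
    client.SetMatchMode(sphinxapi.SPH_MATCH_ALL)  # all query words must match
    client.SetLimits(0, 10)                       # return the first 10 matches

    result = client.Query('full-text search', 'test1')
    if not result:
        print('query failed: %s' % client.GetLastError())
    else:
        for match in result['matches']:
            print(match['id'], match['weight'])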

 

1.2. Sphinx Features
  • High-speed indexing (peak performance of up to 10 MB/s on modern CPUs);
  • High-performance searching (on 2–4 GB of text data, average query response time is under 0.1 s);
  • Handles very large data sets (known to handle more than 100 GB of text, or 100 million documents on a single-CPU system);
  • Excellent relevance ranking, using a combined ranking method based on phrase proximity and BM25 statistics;
  • Distributed search support;
  • Phrase search support;
  • Document excerpt (snippet) generation;
  • Can act as a MySQL storage engine to provide search;
  • Multiple query modes, including boolean, phrase and word-similarity search;
  • Up to 32 full-text fields per document;
  • Multiple extra attributes per document (for example, group information, timestamps);
  • Word segmentation (tokenization) support.
1.3. Chinese Word Segmentation in Sphinx

Full-text search for Chinese differs from English and other Latin-script languages: the latter split words on spaces and other delimiters, while Chinese must be segmented by meaning. Most databases, MySQL included, do not yet support Chinese full-text indexing, so a number of Chinese full-text search plugins for MySQL have appeared in China; one of the better ones is hightman's Chinese word segmenter. Sphinx likewise needs additional plugins to index Chinese full text; the ones I know of are coreseek and sfc.

2、Xapian

 

Xapian is an Open Source Search Engine Library, released under the GPL. It's written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C# and Ruby (so far!)

Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators.
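As a quick illustration of the Python bindings, the sketch below builds a small on-disk index and then runs a query against it using the probabilistic model. The database path and the sample documents are made up for the example:

    # Index two tiny documents with Xapian and search them.
    # The database directory name 'xapian_index' is an arbitrary example.
    import xapian

    db = xapian.WritableDatabase('xapian_index', xapian.DB_CREATE_OR_OPEN)
    term_gen = xapian.TermGenerator()
    term_gen.set_stemmer(xapian.Stem('english'))

    for text in ['the quick brown fox', 'a lazy dog sleeps']:
        doc = xapian.Document()
        doc.set_data(text)                 # store the raw text with the document
        term_gen.set_document(doc)
        term_gen.index_text(text)          # generate and add the index terms
        db.add_document(doc)
    db.commit()

    # Search the index using the query parser.
    enquire = xapian.Enquire(db)
    parser = xapian.QueryParser()
    parser.set_stemmer(xapian.Stem('english'))
    parser.set_database(db)
    enquire.set_query(parser.parse_query('quick fox'))

    for match in enquire.get_mset(0, 10):  # top 10 matches
        print(match.rank + 1, match.percent, match.document.get_data())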

 

Crawlers

1、Scrapy

1.1、What is Scrapy

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

1.2、Scrapy Features

Simple

        Scrapy was designed with simplicity in mind, by providing the features you need without getting in your way

Productive

        Just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you

Fast

        Scrapy is used in production crawlers to completely scrape more than 500 retailer sites daily, all in one server

Extensible

        Scrapy was designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core

Portable

        Scrapy runs on Linux, Windows, Mac and BSD

Open Source and 100% Python

        Scrapy is completely written in Python, which makes it very easy to hack

Well-tested

        Scrapy has an extensive test suite with very good code coverage
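The sketch below shows what a minimal spider looks like with the current Scrapy API: declare the start URLs, write a parse() callback that yields extracted items, and let the framework handle scheduling, downloading and link following. The target site (quotes.toscrape.com, the practice site used in the Scrapy tutorial) and the CSS selectors are assumptions chosen purely for illustration:

    # quotes_spider.py - a minimal Scrapy spider sketch.
    # Run with:  scrapy runspider quotes_spider.py -o quotes.json
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']   # example site

        def parse(self, response):
            # Extract structured data from each quote block on the page.
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                }
            # Follow the pagination link, if there is one.
            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)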

HTML Processing

1、Beautiful Soup

Beautiful Soup is an HTML/XML parser written in Python that copes well with malformed markup and produces a parse tree. It offers simple, commonly used operations for navigating, searching and modifying the parse tree, and can save you a great deal of programming time. For Ruby, use Rubyful Soup.
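The short sketch below, using the current bs4 package, parses a slightly malformed HTML snippet and demonstrates the three kinds of operations mentioned above: navigating, searching and modifying the parse tree. The markup and URLs are invented for the example:

    # Parse a messy HTML snippet with Beautiful Soup (the bs4 package)
    # and navigate/search/modify the resulting parse tree.
    from bs4 import BeautifulSoup

    html = """
    <html><body>
      <p class="title"><b>Example page</b></p>
      <p>Some links:
        <a href="http://example.com/one" id="link1">one</a>
        <a href="http://example.com/two" id="link2">two</a>
      <!-- note: the second <p> is never closed -->
    </body></html>
    """

    soup = BeautifulSoup(html, 'html.parser')

    print(soup.p.b.string)                    # navigating: prints 'Example page'
    for link in soup.find_all('a'):           # searching: every <a> in the tree
        print(link['href'], link.get_text())

    soup.find(id='link2')['href'] = 'http://example.com/2'   # modifying the tree
    print(soup.prettify())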

 

Interacting with Web Sites

1、mechanize

Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.

 

  • mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:

    • any URL can be opened, not just http:

    • mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().

  • Easy HTML form filling.

  • Convenient link parsing and following.

  • Browser history (.back() and .reload() methods).

  • The Referer HTTP header is added properly (optional).

  • Automatic observance of robots.txt.

  • Automatic handling of HTTP-Equiv and Refresh.
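Below is a brief sketch of the browser-style workflow mechanize offers: open a page, fill in and submit a form, then enumerate and follow links. The URL, form layout and control names are hypothetical, chosen only to illustrate the API:

    # A mechanize sketch: stateful browsing, form filling and link following.
    # The URL and the form control names ('username', 'password') are made up.
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)      # for this example only; normally keep robots.txt handling on
    br.addheaders = [('User-agent', 'Mozilla/5.0 (example)')]

    br.open('http://example.com/login')     # hypothetical page with a login form
    br.select_form(nr=0)                    # pick the first form on the page
    br['username'] = 'alice'
    br['password'] = 'secret'
    response = br.submit()
    print(response.geturl())

    for link in br.links():                 # convenient link parsing
        print(link.text, link.absolute_url)
    br.follow_link(text_regex=r'Next')      # follow a link whose text matches 'Next'
    br.back()                               # browser-style history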

 

 

 

 
