Full-Text Search Engines
1. Sphinx
1.1. What is Sphinx
Sphinx is a full-text search engine developed by the Russian developer Andrew Aksyonoff. It aims to provide other applications with high-speed, low-footprint, high-relevance full-text search, and it integrates very easily with SQL databases and scripting languages. The system currently has built-in support for MySQL and PostgreSQL data sources, and it can also read XML data in a specific format from standard input. By modifying the source code, users can add new data sources themselves (for example, native support for other types of DBMS).
Official APIs for PHP, Python, Java, Ruby, and pure C are included in the Sphinx distribution.
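As a quick illustration of those client APIs, here is a minimal sketch in Python using the sphinxapi module bundled with the distribution; the searchd address, port, and the index name "test1" are assumptions for this example.

```python
import sphinxapi  # client module shipped in the Sphinx distribution's api/ directory

client = sphinxapi.SphinxClient()
client.SetServer("127.0.0.1", 9312)                      # default searchd port in recent releases
client.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED2)       # extended query syntax

result = client.Query("hello world", "test1")            # search the "test1" index
if result:
    for match in result["matches"]:
        print(match["id"], match["weight"])
else:
    print(client.GetLastError())
```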
1.2. Sphinx Features
- High indexing speed (peaks at around 10 MB/sec on modern CPUs);
- High-performance searching (average query time under 0.1 seconds on 2-4 GB of text);
- Handles very large datasets (known to index over 100 GB of text, or 100 million documents on a single-CPU system);
- Excellent relevance ranking, using a combined method based on phrase proximity and BM25 statistical ranking;
- Distributed search support;
- Phrase search support;
- Document excerpt (snippet) generation;
- Can act as a MySQL storage engine to serve search queries;
- Multiple matching modes, including boolean, phrase, and word-proximity queries;
- Up to 32 full-text fields per document;
- Additional per-document attributes (e.g. grouping information, timestamps; see the sketch after this list);
- Word segmentation support;
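The per-document attributes can be filtered, grouped, and sorted on at query time. A minimal sketch with the same sphinxapi client follows; the attribute names ("group_id", "date_added") and the index name "test1" are assumptions for illustration.

```python
import sphinxapi

client = sphinxapi.SphinxClient()
client.SetServer("127.0.0.1", 9312)
client.SetFilter("group_id", [1, 2, 3])                          # keep only these group IDs
client.SetGroupBy("group_id", sphinxapi.SPH_GROUPBY_ATTR)        # one best match per group
client.SetSortMode(sphinxapi.SPH_SORT_ATTR_DESC, "date_added")   # newest documents first

result = client.Query("hello", "test1")
if result:
    for match in result["matches"]:
        print(match["id"], match["attrs"])
```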
1.3. Chinese Word Segmentation in Sphinx
Full-text indexing for Chinese is different from English and other Latin-script languages: the latter split words on spaces and other delimiter characters, whereas Chinese must be segmented by meaning. Most databases, such as MySQL, do not yet support Chinese full-text indexing, so a number of Chinese full-text search plugins for MySQL have appeared; one of the better ones is hightman's Chinese word segmenter. For Sphinx to do full-text indexing of Chinese, additional plugins are likewise required; the ones I know of are coreseek and sfc.
2. Xapian
Xapian is an Open Source Search Engine Library, released under the GPL. It's written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C#, and Ruby (so far!).
Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators.
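To give a feel for the toolkit, here is a minimal indexing-and-search sketch using the Python bindings; the database path "./xapian-db" and the sample documents are arbitrary choices for this example.

```python
import xapian

# Index a couple of documents into an on-disk database.
db = xapian.WritableDatabase("./xapian-db", xapian.DB_CREATE_OR_OPEN)
term_gen = xapian.TermGenerator()
term_gen.set_stemmer(xapian.Stem("en"))

for text in ["The quick brown fox", "A lazy dog sleeps"]:
    doc = xapian.Document()
    doc.set_data(text)
    term_gen.set_document(doc)
    term_gen.index_text(text)
    db.add_document(doc)

# Parse a query (boolean operators such as AND/OR/NOT are supported) and run it.
parser = xapian.QueryParser()
parser.set_stemmer(xapian.Stem("en"))
parser.set_database(db)
query = parser.parse_query("quick AND fox")

enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print(match.rank, match.percent, match.document.get_data())
```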
Crawlers
1. Scrapy
1.1. What is Scrapy
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
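A minimal spider sketch, assuming a reasonably recent Scrapy release and using the public demo site quotes.toscrape.com; the selectors match that site's markup and would need adjusting for any other page.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Crawl the demo site and yield one item of structured data per quotation."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract structured data with CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if present, so the whole site gets crawled.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

It can be run without creating a full project via `scrapy runspider quotes_spider.py -o quotes.json`.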
1.2. Scrapy Features
- Simple: Scrapy was designed with simplicity in mind, providing the features you need without getting in your way.
- Productive: Just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you.
- Fast: Scrapy is used in production crawlers to completely scrape more than 500 retailer sites daily, all on one server.
- Extensible: Scrapy was designed with extensibility in mind, providing several mechanisms to plug in new code without having to touch the framework core.
- Portable: Scrapy runs on Linux, Windows, Mac, and BSD.
- Open source and 100% Python: Scrapy is completely written in Python, which makes it very easy to hack.
- Well-tested: Scrapy has an extensive test suite with very good code coverage.
HTML Processing
1. Beautiful Soup
Beautiful Soup is an HTML/XML parser written in Python that handles malformed markup gracefully and builds a parse tree. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, and can save you a great deal of programming time. For Ruby, use Rubyful Soup.
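A short sketch of those navigate/search/modify operations, assuming the current bs4 package (older Beautiful Soup 3 releases import differently) and a made-up snippet of deliberately malformed HTML:

```python
from bs4 import BeautifulSoup

# A deliberately sloppy snippet: tags are left unclosed.
html = "<html><p class='title'><b>An example<p>Another <a href='http://example.com'>link</a>"
soup = BeautifulSoup(html, "html.parser")

print(soup.b.get_text())          # navigate: jump to the first <b> tag
print(soup.find("a")["href"])     # search: find the first <a> tag and read an attribute
soup.b.string = "A new title"     # modify: replace the tag's contents in the parse tree
print(soup.prettify())            # the repaired, well-formed tree
```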
Interacting with Websites
1. mechanize
Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize (a short usage sketch follows the feature list below).
- mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
  - any URL can be opened, not just http:
  - mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().
- Easy HTML form filling.
- Convenient link parsing and following.
- Browser history (.back() and .reload() methods).
- The Referer HTTP header is added properly (optional).
- Automatic observance of robots.txt.
- Automatic handling of HTTP-Equiv and Refresh.
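A minimal sketch tying several of these features together; the URL, form field names, and link text are hypothetical and only illustrate the Browser interface.

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(True)                      # honour robots.txt
br.addheaders = [("User-Agent", "example-bot/0.1")]

br.open("http://example.com/login")             # hypothetical URL
br.select_form(nr=0)                            # pick the first form on the page
br["username"] = "alice"                        # hypothetical field names
br["password"] = "secret"
response = br.submit()                          # HTML form filling + submission
print(response.geturl())

# Link parsing/following and browser history.
br.follow_link(text_regex=r"Profile")           # hypothetical link text
br.back()
```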