7 Open source search engines for big data processing

Last Update:2015-09-06 Source: Internet

Author: User

Tags apache solr solr elastic search

Big data is a broad term for datasets so large and complex that they require specially designed hardware and software tools. Such datasets are usually at terabyte scale or larger. They are created from a variety of sources, including sensors, meteorological data collection, and publicly available information such as magazines, newspapers, and articles. Other sources include purchase transactions, web logs, medical records, military reconnaissance, video and image archives, and large-scale e-commerce.

Analyzing this data requires specialized hardware and software. This article describes 7 open source search engines suitable for big data processing:

1. Apache Lucene

Lucene is an open source full-text search engine toolkit from the Apache Software Foundation. It is a full-text search architecture that provides a complete query engine and indexing engine, along with part of a text analysis engine. Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.
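To make the split between an indexing engine and a query engine concrete, here is a minimal sketch of an inverted index in Python. It is a toy illustration of the general idea only, not Lucene's API or implementation. The names `TinyIndex` and `tokenize` are invented for this example.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: maps each term to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Indexing side: record which documents each term occurs in.
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Query side: return the ids of documents that contain every query term.
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add(1, "Lucene is a full-text search library")
index.add(2, "Solr is a search server built on Lucene")
index.add(3, "Xapian is written in C++")
hits = index.search("lucene search")
print(sorted(hits))  # [1, 2]
```

A real engine adds text analysis, ranking, and on-disk index structures on top of this basic term-to-documents mapping.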

Characteristics:

Indexing process:
- Indexes more than 150GB of data per hour on modern hardware
- Small memory requirements: only 1MB of heap
- Incremental indexing is as fast as batch indexing
- The index is roughly 20-30% of the size of the indexed text
- Static index pruning
Search algorithm:
- Ranked search: the best results are returned first
- Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries, and more
- Fielded search: individual fields can be queried
- Results can be sorted by any field
- Multi-index search with merged results
- Indexes can be updated and searched simultaneously
- Flexible faceting, highlighting, joins, and result grouping
- Fast, memory-efficient, typo-tolerant suggesters
- Pluggable ranking models, including the Vector Space Model (VSM) and Okapi BM25
- Configurable storage engine
Cross-platform solution:
- 100% pure Java
- Index-compatible implementations are available in other programming languages
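The list above names Okapi BM25 as one of the pluggable ranking models. The sketch below shows how BM25 scoring can be computed, assuming pre-tokenized documents and the commonly used defaults `k1=1.2` and `b=0.75`. The function name `bm25_scores` is invented for this example. This is a plain-Python illustration of the formula, not Lucene's implementation.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "lucene is a search library".split(),
    "solr is a search server built on lucene".split(),
    "xapian is written in c++".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)  # [0, 1, 2]
```

Both of the first two documents contain each query term once, so the shorter one ranks first. The third document matches no query term and scores zero.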

2. Apache Solr

Apache Solr (pronounced "solar") is an open source search server. Solr is written in Java and is built mainly on HTTP and Apache Lucene. Resources in Solr are stored as Document objects. Each document consists of a series of fields, and each field represents one property of the resource. Every document in Solr needs a field that uniquely identifies it. By default this field is named id, and it is declared in the schema configuration file with <uniqueKey>id</uniqueKey>.
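As an illustration of documents made of fields with a unique `id`, the sketch below builds a JSON payload of the kind that can be posted to Solr's update handler. The host, port, and collection name `articles` are assumptions, and the helper `build_update_payload` is invented for this example. Nothing is sent over the network. The helper keeps only the last document per key, mimicking in memory how a later document with the same unique key replaces an earlier one.

```python
import json

# Hypothetical Solr location and collection name; adjust for a real deployment.
SOLR_UPDATE_URL = "http://localhost:8983/solr/articles/update?commit=true"


def build_update_payload(docs, unique_key="id"):
    """Return a JSON array of documents, keeping the last document per unique key."""
    by_key = {}
    for doc in docs:
        if unique_key not in doc:
            raise ValueError(f"document is missing the unique key field {unique_key!r}")
        # A later document with the same key replaces the earlier one.
        by_key[doc[unique_key]] = doc
    return json.dumps(list(by_key.values()))


docs = [
    {"id": "doc-1", "title": "Apache Lucene"},
    {"id": "doc-2", "title": "Apache Solr"},
    {"id": "doc-1", "title": "Apache Lucene (updated)"},
]
payload = build_update_payload(docs)
print(payload)
```

The resulting string would be sent as the body of an HTTP POST to `SOLR_UPDATE_URL` with a `Content-Type: application/json` header.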

3. Elasticsearch

Elasticsearch is an open source, distributed, RESTful search engine built on Lucene. Designed for cloud computing, it offers real-time search and is stable, reliable, fast, and easy to install and use. Data is indexed as JSON over HTTP.
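To show what indexing JSON over HTTP looks like, the sketch below builds an HTTP PUT request that would index one document, without sending it. The node address, the index name `articles`, and the helper `build_index_request` are assumptions for illustration. The `/{index}/_doc/{id}` path follows recent Elasticsearch versions; older releases also included a mapping type in the path.

```python
import json
import urllib.request

# Hypothetical node address; adjust for a real cluster.
ES_BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_id, document):
    """Build (but do not send) an HTTP PUT request that indexes one JSON document."""
    url = f"{ES_BASE_URL}/{index}/_doc/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request("articles", "1", {"title": "Search engines", "year": 2015})
print(request.get_method(), request.full_url)
# Sending it would be urllib.request.urlopen(request), which needs a running node.
```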

4. Sphinx

Sphinx is an SQL-based full-text search engine that can be combined with MySQL or PostgreSQL for full-text searching. It provides more specialized search functions than the database itself, making it easier for applications to implement specialized full-text search. Sphinx provides search API bindings designed for several scripting languages, such as PHP, Python, Perl, and Ruby, as well as a storage engine plugin for MySQL.

5. Xapian

Xapian is a full-text retrieval library written in C++, with a role similar to that of Lucene in Java. Lucene is already the standard full-text retrieval tool in the Java world, but the C++ world had no corresponding tool, and Xapian fills this gap.

6. Nutch

Nutch is an open source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.

Although web search is a basic requirement for navigating the Internet, the number of web search engines is declining. The field may well evolve further until a single company monopolizes almost all web search for its own commercial gain, which would clearly be detrimental to the vast majority of Internet users.

Nutch offers a different choice. As an open source search engine, Nutch is more transparent and more trustworthy than commercial search engines. All major search engines today use proprietary ranking algorithms and do not explain why a page is ranked in a particular position. In addition, some search engines rank sites according to how much they pay rather than on their own merit. Unlike them, Nutch has nothing to hide and no motive to distort search results. Nutch aims to give users the best search results it can.

Nutch is committed to making it easy for everyone to configure world-class web search engines at a fraction of the cost. To accomplish this ambitious goal, Nutch must be able to:

- Fetch billions of pages per month
- Maintain an index of these pages
- Handle thousands of searches per second against that index
- Provide high-quality search results
- Operate at minimal cost
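To get a feel for the crawl target above, the following back-of-the-envelope sketch converts a monthly page count into an average sustained fetch rate. The figure of one billion pages and the 30-day month are assumptions chosen for illustration, and `pages_per_second` is a name invented for this example.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def pages_per_second(pages_per_month):
    """Average sustained fetch rate needed to crawl the given number of pages in a month."""
    return pages_per_month / SECONDS_PER_MONTH


rate = pages_per_second(1_000_000_000)
print(f"{rate:.0f} pages/second")  # 386 pages/second
```

Even at the low end of "billions", the crawler has to sustain several hundred fetches per second around the clock.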

7. LGTE

LGTE is built on Lucene. It provides an extended Lucene API that integrates many services, such as snippet generation and query expansion, and it comes with a set of unit tests.

Features include:

- Provides an abstraction layer offering a simple and efficient Lucene API
- Supports integrated search and ranking by thematic, temporal, and geographic aspects
- Supports the standard Lucene retrieval model and provides more advanced probabilistic retrieval methods
- Supports Rocchio query expansion
- Provides a framework for IR evaluation experiments (for example, handling CLEF/TREC topics)
- Includes a Java port of the trec_eval tool
- Contains a simple test application for searching the Brown Corpus or the Cranfield Corpus
- TREC/CLEF evaluation framework: tools for indexing collections, running topic searches, and outputting results in trec_eval format
- Provides isolated fields by using different folders
- Provides hierarchical indexes through a foreign key field
- Provides classes for parsing documents with Yahoo! Placemaker
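The Rocchio query expansion mentioned in the feature list can be sketched as follows. This is a generic textbook-style illustration, not LGTE's code. Vectors are represented as plain dicts from term to weight. The function name `rocchio` and the weights `alpha=1.0`, `beta=0.75`, `gamma=0.15` are assumptions, using commonly cited values. The expanded query moves toward the centroid of the relevant documents and away from the centroid of the non-relevant ones.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the expanded query vector as a dict mapping term -> weight."""
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)

    def centroid(docs, term):
        # Mean weight of `term` across `docs`; 0 when the set is empty.
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    expanded = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * centroid(relevant, term)
                  - gamma * centroid(non_relevant, term))
        if weight > 0:  # negative weights are commonly clipped to zero
            expanded[term] = weight
    return expanded


query = {"search": 1.0}
relevant = [{"search": 1.0, "lucene": 1.0}, {"search": 1.0, "index": 1.0}]
non_relevant = [{"search": 1.0, "flights": 1.0}]
expanded = rocchio(query, relevant, non_relevant)
print(sorted(expanded))  # ['index', 'lucene', 'search']
```

The terms "lucene" and "index" enter the query because they occur in relevant documents, while "flights" ends up with a negative weight and is dropped.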

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
