7 Open source search engines for big data processing

Source: Internet
Author: User
Tags apache solr solr elastic search

Big data is an umbrella term for datasets so large and complex that they need specially designed hardware and software tools. Such datasets are usually at the terabyte scale or larger. They are created from a variety of sources, including sensors, meteorological data collection, and publicly available information such as magazines, newspapers, and articles. Other sources include purchase transactions, web logs, medical records, military reconnaissance, video and image archives, and large-scale e-commerce.

Analyzing this data requires specialized hardware and software. This article describes 7 open source search engines suitable for big data processing:

1. Apache Lucene

Lucene is an open source full-text search engine toolkit from the Apache Software Foundation. It is not a complete search application but a full-text search engine architecture that provides a complete query engine and indexing engine, along with part of a text analysis engine. Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.

Characteristics:

    • Indexing process:
      • Indexes more than 150GB of data per hour on modern hardware
      • Small memory requirements: only 1MB of heap
      • Incremental indexing is as fast as batch indexing
      • Index size is roughly 20-30% of the size of the text indexed
      • Static Index Pruning
    • Search algorithm:
      • Ranked search: best results are returned first
      • Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries, etc.
      • Fielded search: queries can target an individual field
      • Results can be sorted by any field
      • Multi-index search with merged results
      • Indexes can be updated and searched simultaneously
      • Flexible faceting, highlighting, joins, and grouping of result sets
      • Fast, memory-efficient, typo-tolerant suggesters
      • Pluggable ranking models, including the Vector Space Model (VSM) and Okapi BM25
      • Configurable storage engine
    • Cross-platform solution
      • 100% pure Java
      • Index-compatible implementations are available in other languages
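To make the ideas of an inverted index and ranked search with Okapi BM25 more concrete, here is a minimal sketch in Python. It is a toy illustration only, not Lucene's implementation or API: the `TinyIndex` class, the whitespace `tokenize` function, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


class TinyIndex:
    """Toy in-memory inverted index (hypothetical; not Lucene's format)."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of tokens

    def add(self, doc_id, text):
        """Index one document; documents can be added at any time."""
        tokens = tokenize(text)
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        """Return (doc_id, score) pairs, best match first."""
        n_docs = len(self.doc_len)
        if n_docs == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n_docs
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                # Longer-than-average documents are penalized via b.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = TinyIndex()
index.add("a", "open source search engine toolkit")
index.add("b", "search search search engine")
index.add("c", "weather sensor logs")
results = index.search("search engine")
print(results)
```

Document "c" shares no terms with the query, so it does not appear in the results at all; only documents found through the postings lists are scored, which is what keeps lookups cheap on large collections.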

2. Apache Solr

Apache Solr (pronounced "solar") is an open source search server. Solr is written in Java and is built mainly on HTTP and Apache Lucene. Resources in Solr are stored as Document objects. Each document consists of a series of fields, and each field represents a property of the resource. Each document in Solr needs a property that uniquely identifies it. By default this property is named id, and it is declared in the schema configuration file with <uniqueKey>id</uniqueKey>.
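The document model described above can be sketched in Python as follows. This is an illustrative sketch only: it builds the URLs and JSON body for Solr's update and select handlers without sending anything. The base URL, the core name `articles`, and the field names are assumptions, and the client-side duplicate check is a choice made for this example (Solr itself would typically replace an earlier document that has the same unique key).

```python
import json
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed local Solr install
CORE = "articles"                          # hypothetical core name


def check_unique_ids(docs, key="id"):
    """Raise ValueError if a document lacks the unique key or repeats it."""
    seen = set()
    for doc in docs:
        if key not in doc:
            raise ValueError(f"document is missing unique key {key!r}")
        if doc[key] in seen:
            raise ValueError(f"duplicate {key}: {doc[key]!r}")
        seen.add(doc[key])


def build_update(docs):
    """Return (url, body) for posting documents to the update handler."""
    check_unique_ids(docs)
    url = f"{SOLR_BASE}/{CORE}/update?" + urlencode({"commit": "true"})
    return url, json.dumps(docs).encode("utf-8")


def build_select(field, value, rows=10):
    """Return the URL of a fielded query against the select handler."""
    params = {"q": f"{field}:{value}", "rows": rows}
    return f"{SOLR_BASE}/{CORE}/select?" + urlencode(params)


# Each document is a set of fields; "id" plays the role of <uniqueKey>.
docs = [
    {"id": "doc-1", "title": "Lucene overview", "author": "User"},
    {"id": "doc-2", "title": "Solr overview", "author": "User"},
]
update_url, update_body = build_update(docs)
select_url = build_select("title", "lucene")
print(update_url)
print(select_url)
```

To actually send these requests, the body would be posted with a `Content-Type: application/json` header to a running Solr instance.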

3. Elasticsearch

Elasticsearch is an open source, distributed, RESTful search engine built on Lucene. Designed for cloud computing, it offers real-time search and is stable, reliable, fast, and easy to install and use. It supports indexing data as JSON over HTTP.
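As a rough illustration of indexing JSON over HTTP, the sketch below builds (but does not send) two requests using only the Python standard library. The node address, the index name `articles`, and the document fields are assumptions; the `_doc` and `_search` paths follow the REST conventions of recent Elasticsearch versions, and older versions may use different paths.

```python
import json
from urllib.request import Request

ES_BASE = "http://localhost:9200"  # assumed local node
INDEX = "articles"                 # hypothetical index name


def index_request(doc_id, doc):
    """Build a PUT request that stores `doc` under /<index>/_doc/<id>."""
    return Request(
        url=f"{ES_BASE}/{INDEX}/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(field, text):
    """Build a POST request running a match query on /<index>/_search."""
    query = {"query": {"match": {field: text}}}
    return Request(
        url=f"{ES_BASE}/{INDEX}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


put_req = index_request(1, {"title": "Big data search", "views": 42})
find_req = search_request("title", "search")
print(put_req.get_method(), put_req.full_url)
print(find_req.get_method(), find_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it to a running node; that step is left out here so the sketch runs without a server.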

4. Sphinx

Sphinx is an SQL-based full-text search engine that can be combined with MySQL or PostgreSQL for full-text searching. It provides more specialized search functionality than the database itself, making it easier for applications to implement specialized full-text search. Sphinx provides purpose-built search API interfaces for several scripting languages, such as PHP, Python, Perl, and Ruby, and also offers a storage engine plugin for MySQL.

5. Xapian

Xapian is a full-text retrieval library written in C++ that plays the same role as Lucene does in Java. Lucene is already the standard full-text retrieval library in the Java world, but the C++ world had no corresponding tool, and Xapian fills this gap.

6. Nutch

Nutch is an open source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.

Although web search is a basic requirement for navigating the Internet, the number of existing web search engines is declining. The market may well evolve further, to the point where a single company monopolizes almost all web search for its own commercial gain. This would clearly be detrimental to the vast majority of Internet users.

Nutch offers a different choice. As an open source search engine, Nutch is more transparent and more trustworthy than commercial search engines. All major search engines currently use proprietary ranking algorithms and do not explain why a page is ranked at a particular position. In addition, some search engines rank results according to how much a site pays rather than by the site's own value. Unlike them, Nutch has nothing to hide and has no motive to distort search results. Nutch does its utmost to provide users with the best search results.

Nutch is committed to making it easy for everyone to set up a world-class web search engine at a fraction of the cost. To accomplish this ambitious goal, Nutch must be able to:

    • Fetch billions of pages per month
    • Maintain an index of these pages
    • Search that index thousands of times per second
    • Provide high-quality search results
    • Operate at minimal cost
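To put the crawl target in the list above in perspective, a quick back-of-the-envelope calculation helps. The sketch below assumes a 30-day month and an even load across machines; the figure of 1 billion pages and the machine count of 100 are illustrative assumptions, not numbers from the Nutch project.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(pages_per_month, days_per_month=30):
    """Average pages per second needed to fetch a monthly page budget."""
    return pages_per_month / (days_per_month * SECONDS_PER_DAY)


def per_machine_rate(pages_per_month, machines, days_per_month=30):
    """Average pages per second each machine must fetch if load is even."""
    return sustained_rate(pages_per_month, days_per_month) / machines


rate = sustained_rate(1_000_000_000)
print(f"1 billion pages/month is about {rate:.0f} pages/second")
print(f"Across 100 machines: {per_machine_rate(1_000_000_000, 100):.2f} pages/second each")
```

Even a single billion pages per month means sustaining several hundred fetches every second around the clock, which is why a crawler at this scale has to be distributed.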

7. Lgte

Lgte is built on Lucene. It provides an extended Lucene API that integrates a number of services, such as snippet generation and query expansion, and it ships with a set of unit tests.

Features include:

    • Provides an abstraction layer offering a simple and efficient Lucene API
    • Supports integrated search and ranking by topic, time, and geography
    • Supports Lucene's standard retrieval model and provides more advanced probabilistic retrieval methods
    • Supports Rocchio query expansion
    • Provides a framework for IR evaluation experiments (for example, handling CLEF/TREC topics)
    • Includes a Java alternative to the trec_eval tool
    • Contains a simple test application for searching the Brown Corpus or the Cranfield Corpus
    • TREC/CLEF evaluation framework: tools for indexing collections, running topic searches, and writing results in trec_eval format
    • Provides isolated fields by using different folders
    • Provides hierarchical indexes through a foreign key field
    • Provides classes for parsing documents using Yahoo Placemaker
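Rocchio query expansion, mentioned in the feature list above, can be sketched in a few lines of Python. This is the textbook formulation, not Lgte's code: the expanded query is the original query vector, plus a fraction of the centroid of relevant documents, minus a fraction of the centroid of non-relevant ones. The function names and the weights `alpha=1.0`, `beta=0.75`, and `gamma=0.15` are assumptions chosen for the example.

```python
def centroid(vectors):
    """Average a non-empty list of sparse term-weight vectors (dicts)."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an expanded query vector from relevance feedback."""
    expanded = {term: alpha * weight for term, weight in query.items()}
    if relevant:
        for term, weight in centroid(relevant).items():
            expanded[term] = expanded.get(term, 0.0) + beta * weight
    if non_relevant:
        for term, weight in centroid(non_relevant).items():
            expanded[term] = expanded.get(term, 0.0) - gamma * weight
    # Terms whose weight ends up zero or negative are dropped.
    return {term: w for term, w in expanded.items() if w > 0}


query = {"jaguar": 1.0}
relevant = [
    {"jaguar": 1.0, "cat": 1.0},
    {"jaguar": 1.0, "wildlife": 1.0},
]
non_relevant = [{"jaguar": 1.0, "car": 1.0}]
expanded = rocchio(query, relevant, non_relevant)
print(sorted(expanded.items(), key=lambda item: item[1], reverse=True))
```

The expanded query picks up "cat" and "wildlife" from the relevant documents, while "car" receives a negative weight and is dropped, which pulls later searches toward the intended sense of the original term.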
