Search Engines (I): Lucene, a Full-Text Retrieval Framework

Source: Internet
Author: User

Today I am writing about search. I built most of the search features in my current project. My manager once asked whether I had ever used the Lucene search engine; I had heard of the toolkit but had never worked with it. I have been learning about search in my spare time on the project, and I now have a new understanding of it: I feel search is a direction well worth our research and application.

When learning, choosing the right direction matters. During our studies, Mr. Mi pointed the way for us, so we were never confused about what to learn next. Over time, though, you have to change roles and decide for yourself what to learn. The internet puts an enormous amount of information in front of you, and following one link after another never ends, so it is worth picking a good direction first. Consider that almost every application, on a PC or a mobile phone, has a search function. In the big data era, data is measured in terabytes, and how to quickly find what users want deserves more and more attention.

Big Data Definition:

Big data is a catch-all term for datasets so large and complex that they need specially designed hardware and software tools to process. Such a dataset is usually a terabyte or larger. These datasets come from a variety of sources, including sensors that collect meteorological information and publicly available material such as magazines, newspapers, and articles. They also include purchase transaction records, network logs, medical records, military reconnaissance, video and image files, and large-scale e-commerce.

You can think of Lucene this way: it is practically synonymous with search. When search comes up, Lucene is usually what comes to mind. There are also many articles about similar search tools, which you can look up on your own.

Lucene is an open-source full-text search engine toolkit from the Apache Software Foundation. It provides a full-text search engine architecture: a complete query engine and indexing engine, plus part of a text analysis engine. Lucene aims to give software developers a simple, easy-to-use toolkit for implementing full-text retrieval in a target system, or for building a complete full-text search engine on top of it.

Features:

  • Indexing Process:
    • Indexes over 150 GB of data per hour on modern hardware
    • Small memory footprint: only 1 MB of heap is required
    • Incremental indexing is as fast as batch indexing
    • The index size is about 20-30% of the size of the indexed text
    • Static index pruning
  • Search Algorithm:
    • Ranked search: the best results are returned first
    • Many powerful query types: phrase queries, wildcard queries, proximity queries, and range queries
    • Search on a specific field
    • Sort by any field
    • Multi-index search with merged results
    • Allows simultaneous index updates and searching
    • Flexible faceted search, highlighting, joins, and result grouping
    • Fast, memory-efficient, and typo-tolerant suggestions
    • Pluggable ranking models, including the Vector Space Model (VSM) and Okapi BM25
    • Configurable storage engine
  • Cross-platform Solutions
    • 100% pure Java
    • Index-compatible implementations are available in other programming languages

As an open-source project, Lucene has drawn a huge response from the open-source community since its release. Programmers use it not only to build specific full-text retrieval applications but also to integrate search into system software and Web applications. Even some commercial software uses Lucene as the core of its internal full-text retrieval subsystem.

For example, the well-known OSChina, Eclipse, MyEclipse, and JForum all use Lucene as the underlying framework for their own search features. Adding search capability to our own projects can likewise greatly improve the search experience of the systems we develop.

Additional reading:

Another open-source project that builds on Lucene, and whose search hides nothing, is Nutch. With many search engines today, whether your website can be found depends on payment: sites that pay a high price are the ones that appear in the results.

Nutch is an open-source search engine implemented in Java. It provides all the tools we need to run our own search engine, including full-text search and a web crawler.

Although Web search is a basic requirement for navigating the Internet, the number of Web search engines is declining, and the field may well end up with one company monopolizing almost all Web search for commercial gain. This is clearly not in the interest of Internet users.

Nutch offers a different choice. As an open-source search engine, Nutch is more transparent and trustworthy than commercial search engines. Currently, all major search engines use proprietary ranking algorithms and do not explain why a webpage is placed at a particular position. In addition, some search engines rank websites by how much they pay rather than by their own value. Nutch, by contrast, has no need to conceal or distort its search results, and it does its best to give users the best results.

To achieve this ambitious goal, Nutch must be able to:

  • Fetch billions of webpages every month
  • Maintain an index of these webpages
  • Search that index thousands of times per second
  • Provide high-quality search results
  • Operate at minimal cost

Relationship between Lucene and Applications



Principle Analysis

Indexing is the core of modern search engines. Indexing processes the source data into an index file that is very convenient to query. Why is indexing so important? Imagine searching a large collection of documents for those that contain a keyword. Without an index, you would have to read every document into memory in turn and check whether it contains the keyword, which takes a long time. Search engines return results within milliseconds precisely because they build an index. You can think of an index as a data structure that gives fast random access to the keywords stored in it, and from each keyword to the documents associated with it.

Lucene uses an inverted index. An inverted index maintains a table of words/phrases, and for each entry in the table a linked list records which documents contain that word or phrase. This lets the engine obtain results quickly when the user enters a query. We will introduce Lucene's indexing mechanism in the second part of this series. Because Lucene provides a simple, easy-to-use API, you can index your documents with it even if you are not yet familiar with how full-text indexing works.
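To make the word table idea concrete, here is a minimal sketch of an inverted index in plain Java. It is only an illustration of the concept and does not use Lucene or reflect how Lucene lays out its index on disk. The class name `InvertedIndexSketch`, its methods, and the naive tokenizer (lowercase, split on non-alphanumeric characters) are all assumptions made for this example, and an `ArrayList` stands in for the linked list mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy inverted index: term -> ascending list of document ids (a posting list).
// Hypothetical illustration only; this is not Lucene's API or storage format.
final class InvertedIndexSketch {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Adds a document, indexes its terms, and returns the id assigned to it.
    int addDocument(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : tokenize(text)) {
            List<Integer> list = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Ids only grow, so checking the last entry is enough to avoid duplicates.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
        return docId;
    }

    // Looks up one term with a single hash lookup instead of scanning every document.
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    // Naive analysis step (an assumption): lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("Lucene is a full-text search toolkit");   // id 0
        index.addDocument("Nutch is a web crawler built on Lucene"); // id 1
        index.addDocument("An index makes search fast");             // id 2
        System.out.println(index.lookup("lucene")); // prints [0, 1]
        System.out.println(index.lookup("search")); // prints [0, 2]
    }
}
```

The cost of a lookup depends on the number of distinct terms and the length of one posting list, not on the total size of the document collection, which is the point the paragraph above makes about avoiding a sequential scan.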


After you have created an index for your documents, you can search it. The search engine first parses the search keywords, then looks them up in the index, and finally returns the documents associated with the keywords the user entered.
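The three steps (parse the keywords, look them up in the index, return the matching documents) can be sketched in plain Java as follows. This is a simplified illustration, not Lucene's query pipeline: the class `SearchFlowSketch` and its methods are hypothetical, the query is treated as "all keywords must match", and results come back in insertion order, whereas Lucene ranks results by relevance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the search flow over a toy in-memory index.
final class SearchFlowSketch {
    private final List<String> documents = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index time: record, for every term, which document ids contain it.
    void addDocument(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : parse(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Step 1: parse raw text into normalized terms (same rule as at index time).
    static List<String> parse(String raw) {
        List<String> terms = new ArrayList<>();
        for (String t : raw.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Steps 2 and 3: intersect the posting sets of all terms, then return the documents.
    List<String> search(String rawQuery) {
        Set<Integer> hits = null;
        for (String term : parse(rawQuery)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (hits == null) {
                hits = new TreeSet<>(docs); // first term seeds the candidate set
            } else {
                hits.retainAll(docs);       // later terms narrow it down (AND semantics)
            }
        }
        List<String> results = new ArrayList<>();
        if (hits != null) {
            for (int docId : hits) {
                results.add(documents.get(docId));
            }
        }
        return results; // an empty query yields an empty result list
    }

    public static void main(String[] args) {
        SearchFlowSketch engine = new SearchFlowSketch();
        engine.addDocument("Lucene is a full-text search toolkit");
        engine.addDocument("Nutch is a web crawler built on Lucene");
        engine.addDocument("An index makes search fast");
        // Only the first document contains both keywords.
        System.out.println(engine.search("Lucene search"));
    }
}
```

Using the same `parse` rule for documents and queries is a deliberate choice in this sketch: if the two were normalized differently, a query term could fail to match a term that is actually in the index.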

