Chapter 1 Solr In Action

Source: Internet
Author: User
Tags apache solr processing text solr


    Chapter 1: Introduction to Solr

    This chapter covers:

    · Characteristics of the data that search engines handle

    · Common search engine use cases

    · An introduction to Solr's core modules

    · Reasons to choose Solr

    · A feature overview

     

    With the rapid growth of social media, cloud computing, the mobile Internet, and big data, we are entering an exciting era of computing. One of the main challenges software architects now face is how to handle the massive volume of data generated and consumed by a huge global user base. Users have also come to expect online applications to be always available and responsive, which places higher demands on scalability and stability. To meet these needs, specialized non-relational data storage and processing technologies, collectively known as NoSQL (Not Only SQL), are becoming increasingly popular. These systems do not force all data into the relational model that was once the de facto standard. Instead, they share a common design pattern: match the storage and processing engine to a specific type of data. In other words, each NoSQL technology is optimized for a particular class of data. Driven by growing demands for scalability and performance, architectures that mix NoSQL technologies with traditional relational databases are becoming more and more common. The era of one-size-fits-all data processing is over.


    This book is about one particular NoSQL technology, Apache Solr. Like its non-relational siblings, Solr is optimized for a specific class of problems. Specifically, Solr is a scalable, ready-to-deploy enterprise search engine that is optimized to search large volumes of text-centric data and return results sorted by relevance.

     

    That definition is a bit of a mouthful, so let's break it down into its key points:

     

    · Scalable: Solr can distribute indexing and query processing across multiple servers in a cluster.

    · Ready to deploy: Solr is open-source software that is easy to install and configure. You can start using Solr right away with the sample configuration that ships in the installation package.

    · Optimized for search: Solr is fast. It can execute complex queries in under a second, often in only tens of milliseconds.

    · Large volumes of documents: Solr is designed to handle indexes containing millions of documents.

    · Text-centric data: Solr is optimized for searching natural-language text, such as emails, web pages, resumes, PDF documents, and social media content such as tweets, Weibo posts, and blogs.

    · Results sorted by relevance: Solr sorts search results by how relevant each result document is to the user's query, so that the most relevant results are returned first.

     

    In this book, you will learn how to use Solr to design and implement a scalable search solution. We begin by looking at the types of data and the typical use cases that Solr supports. This will help you understand where Solr fits in the overall landscape of modern application architecture and which problems it is designed to solve.

     

     

    1.1 Do I need a search engine?

    We assume you already have some idea of why you want a search engine; otherwise you would not have opened this book. So we will not spend time working out why you are considering Solr. Instead, let's look at the questions about your data and use cases that you need to answer before deciding whether to use a search engine. In the end, it comes down to understanding your data and your users well enough to choose a technology that serves both. We start with the properties of data that make it well suited to a search engine.

    1.1.1 Managing text-centric data

    Choosing storage and processing engines that match your data is a hallmark of modern application architecture. If you are a good programmer, you already know to select the most appropriate data structure for the data an algorithm works on. For example, you would not store data in a linked list if you needed fast random lookups. The same principle applies to choosing a search engine. Here are four main characteristics of data that is well suited to a search engine like Solr:

    1. Text-centric

    2. Read-dominant (read far more often than written)

    3. Document-oriented

    4. Flexible schema

    A fifth characteristic could arguably be added here: massive data volume, that is, "big data". However, we are mainly concerned with what distinguishes Solr from other NoSQL technologies, and handling massive volumes of data is not one of those differences.

    These four characteristics of data that a search engine like Solr handles well are rough guidelines, not strict rules. Let's look at each one in more depth to see why it matters for search. For now we focus on concepts; implementation details come later.


    Text-centric data

    You have probably seen the term "unstructured data" used to describe the data a search engine handles. We think "unstructured" is misleading, because any document written in a human language has some structure. The term makes more sense from a computer's point of view: to a computer, a text document is just a stream of characters. That character stream must be parsed using language-specific rules to extract its structure before it can be searched, and this is exactly what a search engine does.

    We believe "text-centric data" is a better description of the data Solr handles, because a search engine is designed to extract the implicit structure of text and build indexes from it to make queries efficient. The term "text-centric" also implies that the text of a document contains the information users are interested in. A search engine supports non-text data as well, such as numeric data, but its main strength is handling text written in natural language.
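
    To make the idea of extracting structure from a character stream and indexing it more concrete, here is a minimal sketch of a toy inverted index. It is an illustration only and not how Solr is implemented; the function names and the sample documents are assumptions made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Resume: senior Java developer with search experience",
    2: "Email: the search cluster was upgraded last night",
    3: "Blog post: learning to cook Italian food",
}
index = build_inverted_index(docs)
print(sorted(search(index, "search")))
print(sorted(search(index, "Java search")))
```

    Because the index maps terms to documents, a query only has to look up its own terms instead of scanning every document.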

    So much for "text". The "centric" part matters just as much: if your users are not interested in the content of the text, a search engine may not be the best solution to your problem. Consider an application that employees use to create travel expense reports. Each report contains structured data such as dates, expense types, exchange rates, and amounts, and each expense may also carry a free-text note describing it. Such an application contains text, but it is not an example of text-centric data, because when the accounting department uses these expense reports to produce monthly expense summaries, nobody searches the text of the notes. The text is not the main content of interest. Simply put, not all data that contains text is a good fit for a search engine.

    So take a few minutes to decide whether your data is text-centric. The main question is whether users will search the text in your data. If the answer is yes, a search engine may be a good solution. In chapters 5 and 6, we will discuss how to use Solr's text analysis to extract the structure of text data.

     

    Read-dominant data

    Another characteristic of data that a search engine handles efficiently is that it is read far more often than it is written. To be clear, Solr does allow you to update existing documents in the index. "Read-dominant" means that documents are read much more frequently than they are created or updated. It does not mean that you cannot write data at all, or that you are limited to updating data at a particular frequency. In fact, a key feature of Solr 4 is near real-time search, which lets you index thousands of documents per second and query the newly added documents almost immediately.

    The key point behind "read-dominant" is that data written to Solr should be read many times over its lifetime. Think of a search engine as a tool primarily for querying stored data (a query request is a read operation), not primarily for storing it. So if you need to update data very frequently, a search engine may not be the right fit. Other NoSQL technologies, such as Cassandra, may be better suited to fast random writes.

     

    Document-oriented data

    So far we have used the general term "data", but in practice search engines work with documents. In a search engine, a document is a self-contained collection of fields. Each field holds only data values and cannot contain nested fields. In other words, in a search engine such as Solr, documents are flat and there are no dependencies between documents. Solr's notion of "flat" is somewhat relaxed: a field can hold multiple values, but it cannot contain subfields.

    This flat, document-oriented approach works well for data that is already in document form, such as web pages, blogs, and PDF documents. But what about structured data held in a relational database? In that case, you need to extract the data spread across tables, denormalize it, and put it into a flat, self-contained document structure. In chapter 3, we will learn how to deal with such problems.
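
    As a rough illustration of turning relational rows into a flat document, the sketch below joins three made-up tables in plain Python. The table layouts, field names, and the flatten_listing helper are all hypothetical; a real project would read the rows from a database and then send the resulting documents to Solr.

```python
def flatten_listing(listing, agent, features):
    """Build one flat, self-contained document from rows of related tables."""
    return {
        "id": listing["id"],
        "title": listing["title"],
        "city": listing["city"],
        # Copy the agent's name in, instead of keeping a foreign key.
        "agent_name": agent["name"],
        # A field may hold several values, but never nested fields.
        "features": [
            row["name"] for row in features if row["listing_id"] == listing["id"]
        ],
    }


# Hypothetical rows, as they might come out of three relational tables.
listings = [
    {"id": 10, "title": "Sunny two-bedroom flat", "city": "Denver", "agent_id": 7},
]
agents = {7: {"id": 7, "name": "Pat Lee"}}
listing_features = [
    {"listing_id": 10, "name": "balcony"},
    {"listing_id": 10, "name": "garage"},
    {"listing_id": 11, "name": "pool"},
]

documents = [
    flatten_listing(row, agents[row["agent_id"]], listing_features)
    for row in listings
]
print(documents[0])
```

    The resulting document carries everything needed to search and display the listing, so nothing has to be joined at query time.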

    You also need to consider which fields of your documents should be stored in Solr and which should be stored in another system (such as a database). Simply put, a search engine should store only the data that is searched and the data needed to display search results. For example, if you have a search index for online videos, you do not want to store the video files themselves in Solr; a sensible solution is to serve the large video files from a content delivery network (CDN). In general, store only the minimum data needed to meet your search requirements. The online video example makes it clear that Solr should not be used as a general-purpose data store. Solr's job is to find the videos users are interested in, not to store the video files.

     

    Flexible schema

    The last key characteristic of search engine data is a flexible schema. This means that documents in a search index do not need to share a uniform structure. In a relational database, every row in a table must have the same structure. In Solr, documents can have different fields. Documents in the same index should of course have at least some fields in common to search on, but not every document needs the same set of fields.

    For example, suppose we want to build a search application for finding homes to rent or buy. Each document will obviously have some common fields, such as location, number of rooms, and number of bathrooms. Depending on whether the listing is for rent or for sale, however, documents will also have different fields. A home for sale has a price and a property tax, while a home for rent has a monthly rent and a pet policy.
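
    The rental and sale example can be sketched with plain Python dictionaries standing in for documents. The field names below are assumptions chosen for illustration, not a schema taken from Solr or from this book.

```python
# Two hypothetical documents in the same index: they share some fields,
# and each carries extra fields that only make sense for its listing type.
listings = [
    {"id": "s1", "listing_type": "sale", "city": "Austin", "bedrooms": 3,
     "bathrooms": 2, "price": 350000, "property_tax": 4200},
    {"id": "r1", "listing_type": "rent", "city": "Austin", "bedrooms": 2,
     "bathrooms": 1, "monthly_rent": 1500, "pet_policy": "cats only"},
]

COMMON_FIELDS = {"id", "listing_type", "city", "bedrooms", "bathrooms"}


def find_by_city(docs, city, min_bedrooms=0):
    """Filter on fields every document shares, whatever else each one carries."""
    return [d for d in docs if d["city"] == city and d["bedrooms"] >= min_bedrooms]


for doc in find_by_city(listings, "Austin", min_bedrooms=2):
    extra = sorted(set(doc) - COMMON_FIELDS)
    print(doc["id"], "type-specific fields:", extra)
```

    Searching works on the shared fields, while the type-specific fields travel with each document and can be shown in the results.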

    To sum up, a search engine like Solr is optimized for data that is text-centric, read-dominant, document-oriented, and flexible in schema. Solr is not a general-purpose data storage and processing technology, and this is the main thing that distinguishes it from other NoSQL technologies.

    The advantage of having many different data storage and processing solutions to choose from is that you no longer need to search for one general-purpose technology that meets every need. A search engine performs very well at certain tasks and poorly at others. In most cases, this means you will use Solr as a powerful complement to relational databases and other NoSQL technologies rather than as a replacement for them.

    Now that we have covered the types of data Solr is optimized to handle, let's discuss the practical use cases that a search engine like Solr is designed for. Understanding them will help you see how search engine technology differs from other data processing technologies.

     

    1.1.2 Common search engine use cases

    In this section, we look at what a search engine like Solr can do. As in section 1.1.1, these are guidelines and should not be treated as strict rules. Before we start, be aware that the bar for a good search service is very high. Users are accustomed to fast, effective web search engines such as Google and Bing, and many popular websites have powerful search solutions of their own that help users find what they want quickly. As a result, users are familiar with search and will be demanding. When you evaluate a search engine like Solr or design your own search solution, you must keep this baseline in mind and make user experience a high priority.

     

     

    Basic keyword search

    Obviously, a search engine must first support basic keyword search; it is one of its main functions. Keyword search is still worth emphasizing here, because it is the most typical way users interact with a search engine. Few users want to fill out a complex search form. Because keyword search will be the most common way users interact with your search engine, this basic feature must deliver a good user experience.

    In general, users expect to type a few simple keywords and get good results. This may sound like a simple matching task: match the query terms against the documents. But consider the issues that must be addressed to deliver a good user experience:

    · Relevant results must be returned quickly, in most cases within one second.

    · Spelling mistakes in query terms should be corrected automatically.

    · Auto-completion suggestions should reduce the amount of typing, which is especially common in mobile apps.

    · Synonyms of query terms must be handled.

    · Documents containing linguistic variations of the query terms (for example, different grammatical forms of the same word) should match.

    · Phrases must be handled: does the user want to match all the words in the phrase or only some of them?

    · Very common words (stop words), such as "a", "an", "of", and "the", must be handled.

    · If the user is not satisfied with the top results, there must be a way to return more.

    As you can see, without dedicated techniques, this pile of issues makes such a seemingly simple feature hard to implement. With a search engine like Solr, however, these features are easy to achieve. Once you have given users a powerful keyword search tool, you need to consider how to present the results. That leads to the next use case: sorting the results returned by a search by their relevance to the query.
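
    To give a feel for a few of the issues listed above, here is a small sketch that removes stop words, corrects a misspelled term, and expands synonyms before a query is matched. It uses only Python's standard library; the stop-word list, synonym table, and vocabulary are invented for the example, and Solr's own text analysis works differently and is far more capable.

```python
import difflib

STOP_WORDS = {"a", "an", "of", "the"}
# Hypothetical synonym table and index vocabulary, for illustration only.
SYNONYMS = {"flat": {"apartment"}, "apartment": {"flat"}}
VOCABULARY = ["apartment", "flat", "garden", "house", "balcony"]


def correct(term):
    """Replace a term missing from the vocabulary with its closest known spelling."""
    if term in VOCABULARY:
        return term
    matches = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else term


def prepare_query(query):
    """Return one set of acceptable terms per meaningful word in the query."""
    groups = []
    for word in query.lower().split():
        if word in STOP_WORDS:
            continue  # very common words add little to the match
        term = correct(word)
        groups.append({term} | SYNONYMS.get(term, set()))
    return groups


print(prepare_query("the apartmnet garden"))
```

    A document would then match if it contains at least one term from every group, so a search for a misspelled "apartmnet" can still find listings that say "flat".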

     

    Ranked search results

    A search engine returns the "top" results for a query. In an SQL query against a relational database, a row either matches the query and is returned or does not match and is left out, and the results are sorted by some attribute of the rows. In a search engine, result documents are sorted in descending order by a score that indicates how well each document matches the query. The score is calculated from a number of factors, but in general, the higher the score, the more relevant the document is to the query.
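
    The difference between a yes-or-no match and a ranked result list can be sketched with a simplified TF-IDF style score. This is a teaching example under stated assumptions, not the scoring formula that Solr or Lucene actually uses; the documents and function names are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Split text into lowercase whitespace-separated terms."""
    return text.lower().split()


def rank(docs, query):
    """Return (doc_id, score) pairs for matching documents, best match first."""
    tokens = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for term in tokenize(query):
        doc_freq = sum(1 for counts in tokens.values() if term in counts)
        if doc_freq == 0:
            continue
        idf = math.log(1 + n_docs / doc_freq)  # rarer terms weigh more
        for doc_id, counts in tokens.items():
            if term in counts:
                # Term frequency times inverse document frequency.
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents keyed by id.
docs = {
    "a": "solr search engine",
    "b": "solr solr scalable search with solr",
    "c": "relational database tuning",
}
for doc_id, score in rank(docs, "solr search"):
    print(doc_id, round(score, 3))
```

    Document "b" mentions the query terms more often than "a" and therefore ranks first, while "c" contains none of the terms and is not returned at all.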

    Several factors make relevance ranking necessary. First, modern search engines typically store a massive number of documents, millions or even billions. Without relevance ranking, users would be overwhelmed by the number of results and unable to work through them effectively. Second, experience with other search engines has accustomed users to getting good results from just a few keywords, and it has also made them less patient. They expect the search engine to do what they mean, whether or not what they typed is entirely correct. For example, if you are building the search service behind a mobile app, users will expect correct results after typing a few short query terms that may contain spelling mistakes.

    If you want to influence the ranking manually, you can boost specific documents, fields, or query terms, or directly raise a document's relevance score. For example, to push newly added documents toward the top, you can boost documents by creation time. In chapter 3, we will learn about document ranking.


    Beyond keyword search

    With a search engine like Solr, users can type a few keywords and get some results. For many users, though, this is only the first step of an interactive session in which they go on to browse and explore the results. Driving this kind of information discovery is another major use case for search engines. Users often do not know exactly what they are looking for before they search, and they do not know what information your system actually holds. A good search engine helps users refine their information need and reach the information they want step by step.

    The core idea is to return, along with the documents that match the user's initial query, tools the user can use to keep refining the query and find more information. In other words, in addition to the matching documents, you should return something that tells users what they can do next. For example, you can group the results by attribute so that users can browse further according to their needs. This feature is called faceted search, and it is one of Solr's highlights. We will see an example of faceted search for real estate in section 1.2, and chapter 8 covers faceted search in detail.
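
    The idea behind faceting, counting how many results fall under each value of an attribute and letting the user narrow down by one of them, can be sketched in a few lines. The field names and helper functions are hypothetical, and this is not how Solr computes facets internally.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many matching documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


def drill_down(docs, field, value):
    """Keep only the documents whose field equals the selected facet value."""
    return [doc for doc in docs if doc.get(field) == value]


# Hypothetical documents returned by an initial keyword query.
results = [
    {"id": 1, "listing_type": "sale", "bedrooms": 3},
    {"id": 2, "listing_type": "rent", "bedrooms": 2},
    {"id": 3, "listing_type": "sale", "bedrooms": 2},
]

print(facet_counts(results, "listing_type"))
narrowed = drill_down(results, "listing_type", "sale")
print(facet_counts(narrowed, "bedrooms"))
```

    Showing the counts next to the results tells users what is in the index and gives them an obvious next step for refining the search.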

     

    What search engines are not suited for

    Finally, let's discuss some use cases where a search engine is not a good fit. First, search engines are generally designed to return a small set of documents per query, usually 10 to 100. More results can be retrieved through the result paging that Solr provides. If a query matches millions of documents and you ask for all of them at once, you will wait a long time. The query itself executes quickly, but reconstructing millions of documents from the underlying index structure is time-consuming. The way Solr and other search engines store fields on disk is designed for quickly producing a small number of result documents; producing a very large result set in one go takes a long time with this storage layout.
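
    Result paging can be pictured as taking a small slice of an already-ranked list of matches. In the sketch below, the parameter names start and rows mirror Solr's paging request parameters, but the page function itself and the list of ids are illustrative assumptions, not Solr code.

```python
def page(ranked_ids, start=0, rows=10):
    """Return one page of an already-ranked list of document ids."""
    return ranked_ids[start:start + rows]


# 35 hypothetical matching document ids, best match first.
ranked_ids = list(range(1, 36))

first_page = page(ranked_ids, start=0, rows=10)
last_page = page(ranked_ids, start=30, rows=10)
print(first_page)
print(last_page)
```

    Only the documents on the requested page need to be reconstructed and sent back, which is why asking for a handful of results at a time stays fast.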

    Another poor fit is deep analysis that needs to read a large subset of the index. Even if you use result paging to avoid the problem just described, an analysis that has to read a large portion of the index will run into serious performance problems, because the underlying index data structures are not designed for reading large amounts of data at once.

    We touched on this earlier, but it is worth repeating: search engine technology is not suited to querying across relationships between documents. Solr does support queries based on parent-child relationships, but it does not support navigating complex relational data structures. In chapter 3, you will learn how to adapt relational data to a flat document structure that Solr can search.

    Finally, most search engines have no direct support for document-level security; at least, Solr does not. If you need fine-grained control over access to documents, you will have to handle it outside the search engine.

    We have now covered the use cases and the types of data that are well suited to a search engine. Next, we discuss what Solr can do and how it does it. In the next section, you will learn about Solr's major features and how it addresses software design principles such as integration with external systems, scalability, and high availability.

