Solr In Action Chinese Version: Chapter 1 (Part 2)

1.1 What is Solr?

In this section, we design a search application from scratch to introduce the key components of Solr. This process will help you understand what Solr does and the motivation behind its features. Before we get to what Solr is, though, we need to be clear about what Solr is not:

1) Solr is not a web search engine like Google or Bing.

2) Solr has nothing to do with the search engine optimization (SEO) commonly discussed in website marketing.

Now, let's assume that we are going to design a real estate search web application for potential home buyers. The core use case is searching for homes across the United States from a web browser. Figure 1.1 depicts a mock-up of the application's interface. Don't worry about the simplicity of the UI; it is just a visual model to frame the discussion. The focus is on the kinds of search experiences Solr can provide for an application like this.

Let's take a quick tour of the key Solr features called out in Figure 1.1, starting in the upper-left corner and moving clockwise. First, Solr provides powerful support for the keyword search box. As we discussed in section 1.1.2, a great keyword search experience requires a powerful and complex underlying architecture, and Solr supplies that architecture out of the box. Specifically, Solr provides spell checking, auto-suggestions for user input, synonym handling, phrase queries, and text analysis that treats language variations such as "buying a house" and "purchase a home" as related.
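
To make this concrete, here is a minimal SolrJ sketch of a keyword query with spell checking enabled. It assumes a recent SolrJ client, a local Solr server, and a core named realestate with a spellcheck component configured; those names are our own assumptions, not from the book's example.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class KeywordSearchExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical core name; assumes Solr runs locally on the default port.
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/realestate").build();

        SolrQuery query = new SolrQuery("buying a house");
        query.set("defType", "edismax");         // forgiving keyword-query parser
        query.set("spellcheck", "true");         // assumes a spellcheck component is configured
        query.set("spellcheck.collate", "true"); // ask Solr to suggest a corrected query

        QueryResponse response = client.query(query);
        System.out.println("Matches: " + response.getResults().getNumFound());
        client.close();
    }
}
```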

In addition, Solr provides powerful geospatial query support. In Figure 1.1, the listings that match the query are plotted on a map according to their distance from the latitude/longitude of a center point. With Solr's geospatial support, you can filter documents by geographic distance and even sort results by distance. For location-based search it is important that results come back quickly, so that users can issue new queries by zooming or panning the map.
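
Below is a hedged sketch of what such a geospatial query could look like with SolrJ. The location field name and the coordinates are hypothetical; the {!geofilt} filter and the geodist() sort function are standard Solr spatial syntax.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class GeoSearchExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/realestate").build();

        SolrQuery query = new SolrQuery("*:*");
        // Keep only listings within 10 km of the given point.
        query.addFilterQuery("{!geofilt sfield=location pt=39.7392,-104.9903 d=10}");
        // Sort by distance from the same point, nearest first; geodist()
        // reads the global sfield/pt parameters set below.
        query.set("pt", "39.7392,-104.9903");
        query.set("sfield", "location");
        query.set("sort", "geodist() asc");

        System.out.println(client.query(query).getResults().getNumFound()
                + " listings nearby");
        client.close();
    }
}
```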

Once a user initiates a query, the results can be grouped by attributes of the matching documents using Solr's faceting support, which makes it easier for users to browse what the search returned. Faceted search organizes the result set into feature categories and helps users refine the query until it meets their information need. In Figure 1.1, the results are faceted by home features, home style, and listing type.
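
As a sketch, here is how the faceted search from Figure 1.1 might be requested through SolrJ. The facet field names (home_style, listing_type, features) are assumptions for our real estate schema, not fields from the Solr distribution.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/realestate").build();

        SolrQuery query = new SolrQuery("denver");
        query.setFacet(true);
        // Hypothetical facet fields matching the categories in Figure 1.1.
        query.addFacetField("home_style", "listing_type", "features");

        QueryResponse response = client.query(query);
        for (FacetField field : response.getFacetFields()) {
            System.out.println(field.getName());
            field.getValues().forEach(count ->
                    System.out.println("  " + count.getName()
                            + " (" + count.getCount() + ")"));
        }
        client.close();
    }
}
```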

Now we have a basic idea of the features our real estate search application should support. Next we will look at how to implement these features with Solr. First, we need to understand how Solr matches the property listings in its index against a user's query, because this matching process is the foundation of every search application.

1.1.1 Information retrieval engine

Solr is built on Lucene, Apache's well-known open-source Java information retrieval library. Chapter 3 discusses information retrieval in detail; for now, let's look at how an authoritative text on modern search engines defines it, because the definition captures several key concepts:

 

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). --- Introduction to Information Retrieval, 2008

 

In our real estate application, the user's primary information need is to find a house to buy based on factors such as location, home style, features, and price. Our index contains listings for homes across the United States, which certainly qualifies as a large collection. In short, Solr uses Lucene's core machinery to index documents, process queries, and find matching documents.

Under the hood, Lucene is a Java library that builds and manages an inverted index, a specialized data structure for matching text queries against documents. Figure 1.2 gives a simplified view of the Lucene inverted index used in our real estate application.

Chapter 3 explains in detail how inverted indexes work. For now it is enough to understand, from Figure 1.2, how a new listing (document 44 in the figure) is added to the index and how queries are matched against documents through the inverted index.
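
To give a feel for the idea, here is a toy Java sketch of an inverted index: each term maps to a sorted posting list of document IDs. This illustrates the concept only; Lucene's real data structures are far more sophisticated.

```java
import java.util.*;

// A toy illustration (not Lucene's actual implementation) of the idea behind
// an inverted index: each term maps to a posting list of document IDs.
public class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // A single-term query is just a lookup; a multi-term AND query would
    // intersect the posting lists of its terms.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(),
                Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(44, "Spacious home near downtown Denver");
        index.add(45, "Cozy bungalow near the park");
        System.out.println(index.search("near")); // prints [44, 45]
    }
}
```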

You may be wondering: couldn't we do this with SQL in a relational database? For a simple example like this, perhaps. But a key difference between a Lucene query and a database query is that Lucene returns results ranked by how well each document matches the query, whereas a database query can only be sorted by some column in a table. In other words, ranking results by relevance is a defining feature of information retrieval and an important way it differs from other kinds of querying.

Building an inverted index for web-scale data

It may not surprise you that web search engines like Google also rely on inverted indexes to search web pages. In fact, the need to build an inverted index over the entire web is what gave rise to MapReduce.

MapReduce is a programming model that distributes large-scale data processing across a cluster of commodity servers in two phases, map and reduce. Google used MapReduce to build the enormous inverted index behind its web search. In the map phase, each document is scanned and every term is emitted together with the unique ID of the document it appears in. In the reduce phase, the emitted pairs are sorted so that all pairs for the same term arrive at the same reducer, which appends the document IDs to that term's posting list, producing the inverted index.
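
The following plain-Java sketch illustrates those two phases in a single process; a real MapReduce system such as Hadoop would distribute the map and reduce work across many machines.

```java
import java.util.*;

// A plain-Java sketch of how MapReduce builds an inverted index; real
// frameworks distribute these phases across a cluster.
public class MapReduceInvertedIndexSketch {

    // Map phase: emit a (term, docId) pair for every term occurrence.
    static List<Map.Entry<String, Integer>> map(int docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(term, docId));
        }
        return pairs;
    }

    // Shuffle + reduce phase: group pairs by term (all pairs for the same
    // term reach the same reducer) and merge doc IDs into a posting list.
    static Map<String, SortedSet<Integer>> reduce(
            List<Map.Entry<String, Integer>> pairs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            index.computeIfAbsent(pair.getKey(), t -> new TreeSet<>())
                 .add(pair.getValue());
        }
        return index;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        emitted.addAll(map(1, "buying a house"));
        emitted.addAll(map(2, "purchase a home"));
        // {a=[1, 2], buying=[1], home=[2], house=[1], purchase=[2]}
        System.out.println(reduce(emitted));
    }
}
```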

The Apache Hadoop project provides an open-source implementation of MapReduce, and the Apache Nutch open-source project uses Hadoop to build Lucene inverted indexes for web-scale search. Hadoop and Nutch are beyond the scope of this book, but we encourage you to study them if you need to build very large search indexes.

Now that you know Lucene provides the core machinery for search, let's see what value Solr adds on top of Lucene, starting with Solr's flexible schema.xml configuration file for defining the index structure.

1.1.2 Flexible schema management

Although Lucene provides the core machinery for indexing documents and executing queries, it offers no convenient way to configure how an index should be built. To use Lucene directly, you write Java code that defines fields and specifies how those fields are analyzed. Solr instead provides a simple declarative way to define the structure of your index and how you want fields to be analyzed: an XML configuration file named schema.xml. Solr translates schema.xml into a Lucene index in its underlying implementation. This saves programming time and makes your index structure easier to read, and the index Solr creates remains 100% compatible with an index built programmatically against Lucene.

In addition, Solr adds some nice features on top of Lucene's core indexing functionality. Specifically, Solr offers two field constructs that Lucene lacks: copy fields and dynamic fields. A copy field populates a separate field with the raw text content of one or more other fields. A dynamic field applies the same field type to many fields by name pattern, without each one being declared explicitly in schema.xml, which is useful when modeling documents with many fields. We dig into schema.xml in chapters 5 and 6; a hypothetical fragment follows.
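
The fragment below shows roughly what field, dynamic field, and copy field declarations look like in a classic schema.xml. The field names here are hypothetical, chosen to fit the real estate example rather than copied from Solr's sample schema, and the string/text_general types are assumed to be declared in the schema's types section.

```xml
<!-- A hypothetical schema.xml fragment for the real estate example. -->
<fields>
  <field name="id"          type="string"       indexed="true" stored="true" required="true"/>
  <field name="description" type="text_general" indexed="true" stored="true"/>
  <!-- Catch-all destination for copied text, searched by default. -->
  <field name="text"        type="text_general" indexed="true" stored="false" multiValued="true"/>
  <!-- Dynamic field: any field whose name ends in _s is indexed as a
       string without an explicit declaration. -->
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
</fields>

<!-- Copy field: the raw content of description is also indexed into "text". -->
<copyField source="description" dest="text"/>
```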

For our real estate application, you may be surprised to learn that we used the schema.xml shipped with the Solr example without any changes. That shows just how flexible Solr's schema configuration is: although the example schema was designed for product search, our real estate search application can use it as-is.

So far we have seen that Lucene provides a powerful library for indexing documents, executing queries, and ranking results, and that with schema.xml you can define your index structure flexibly by editing an XML file instead of programming against Lucene's Java API. Next, these services need to be reachable over the network, so in the following section we look at how Solr runs as a Java web application and integrates with other systems using standard technologies such as XML, JSON, and HTTP.

1.1.3 Java web application

Solr is a Java web application that runs in any modern Java servlet engine, such as Jetty or Tomcat, or in a full J2EE application server such as JBoss or Oracle's application server. Figure 1.3 shows the main software components of a Solr server.

Figure 1.3 may look a little overwhelming at first glance. Take a moment to get familiar with the terminology, and don't worry about terms or concepts you don't recognize yet: by the time you finish this book, you will have a solid understanding of everything in Figure 1.3.

As we mentioned at the start of this chapter, Solr's designers intended it to integrate into existing systems and complement the technologies you already run. In fact, it is hard to find a system Solr cannot be integrated with. As you will see in chapter 2, it takes only a few minutes to download Solr and start the example server.

To make integration easy, Solr's core services are accessible from virtually any application or programming language: Solr exposes simple REST-like services based on standards such as XML, JSON, and HTTP. We deliberately avoid calling Solr's HTTP API "RESTful", because it does not strictly follow all REST (Representational State Transfer) principles; for example, Solr deletes documents with an HTTP POST rather than an HTTP DELETE.
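
For example, here is the standard XML update message that deletes a document; it is POSTed to a core's /update handler. The core name in the URL comment is our hypothetical realestate core.

```xml
<!-- POST this body to http://localhost:8983/solr/realestate/update
     (followed by a commit) to delete document 44; note that deletion
     uses HTTP POST, not HTTP DELETE. -->
<delete>
  <id>44</id>
</delete>
```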

HTTP-based services are fine for basic operations, but developers usually prefer client libraries in their favorite language that wrap the service calls and response handling. The good news is that Solr client libraries exist for many popular languages, including Python, Java, .NET, and Ruby.
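
As a quick sketch of what such a library looks like, here is how a document could be added with SolrJ, Solr's Java client; the field names and core name are our running assumptions.

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexingExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/realestate").build();

        // Add the new listing from Figure 1.2 as document 44.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "44");
        doc.addField("description", "Spacious home near downtown Denver");
        client.add(doc);
        client.commit();   // make the new document visible to searches
        client.close();
    }
}
```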

1.1.4 Creating multiple indexes on the same server

A hallmark of modern software architecture is the flexibility to respond to rapidly changing requirements. One way Solr helps here is that you don't have to cram all of your data into a single index: Solr supports running multiple cores in a single engine. In Figure 1.3, multiple Solr cores run side by side within the same Java web application.

Each core has its own index and its own configuration, and many cores can live in one Solr instance. This lets you manage multiple indexes with a single server, sharing hardware resources and simplifying monitoring and maintenance. Solr provides a dedicated API for creating and managing cores.
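
As a hedged sketch, here is how a new core might be created through SolrJ's core-admin wrapper. The core name and instance directory are hypothetical, and the instance directory with its configuration files is assumed to already exist on the server.

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateCoreExample {
    public static void main(String[] args) throws Exception {
        // Core-admin requests go to the server root, not to a specific core.
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr").build();

        // Hypothetical names; the "farmland" instance directory with its
        // conf/ files must already exist on the server.
        CoreAdminRequest.createCore("farmland", "farmland", client);
        client.close();
    }
}
```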

One application of Solr's multicore support is data partitioning, for example, one core for recent documents and another core for older documents, an approach known as chronological sharding.

In our real estate application we could also use separate cores for different types of listings, each managed in its own index. For example, because buying rural farmland is a very different process from buying a home, we could manage farmland listings in a separate index housed in its own core.

 

1.1.5 Extensibility (via plug-ins)

Figure 1.3 depicts three main Solr subsystems: document management, query processing, and text analysis. These are, of course, high-level abstractions over complex subsystems inside Solr; we study each of them later in the book. Each subsystem is built as a pipeline of smaller components, and you can plug new components into the pipeline. This means that if you want to add a feature to Solr, you don't rewrite the whole query-processing engine; you insert your new component at the appropriate point in the pipeline. The result is that Solr's core can be extended and customized to fit your application's exact requirements.
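
For a flavor of this plug-in model, here is a minimal sketch of a custom search component in Java. The class and registration names are hypothetical, and the exact set of methods to override varies somewhat across Solr versions.

```java
import java.io.IOException;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

// A minimal sketch of a pluggable search component. It would be registered
// in solrconfig.xml with <searchComponent name="audit" class="..."/> and
// added to a request handler's component list; names here are hypothetical.
public class AuditComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // Runs before the query executes; could validate or rewrite the request.
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // Runs as one stage of the pipeline; here we just tack a flag
        // onto the response.
        rb.rsp.add("audited", true);
    }

    @Override
    public String getDescription() {
        return "Example component that marks responses as audited";
    }
}
```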

1.1.6 Scalability

Lucene is an extremely fast search library, and Solr takes full advantage of Lucene's performance. But regardless of how fast Lucene is, a single server can serve only so many concurrent users and requests before CPU and I/O become bottlenecks.

Solr's first tool for scalability is flexible cache management, which keeps the server from repeatedly recomputing expensive results. Specifically, Solr maintains preconfigured caches that save the cost of repeated computation; for example, Solr caches the document sets computed for query filters. Chapter 4 looks at Solr's cache management in detail.
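
For illustration, here is a typical filterCache declaration from solrconfig.xml; the sizes shown are arbitrary example values, not recommendations.

```xml
<!-- A hypothetical filterCache tuning fragment from solrconfig.xml: cached
     filter results let repeated fq clauses skip recomputation. -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="128"/>
```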

Caching only goes so far; to handle more documents and a higher query throughput, you need to scale out by adding servers. Let's consider the two most common dimensions of scaling Solr. The first is query throughput: the number of queries per second your engine can handle. Even though Lucene executes each query quickly, a single server can process only a limited number of concurrent requests. For higher throughput, you add replicas of your index on additional query servers so that more servers can answer requests in parallel. If your index is replicated across three servers, you can serve roughly three times as many queries per second, because each server handles about one third of the load. In practice, such perfectly linear scaling is rare, so three servers might yield about 2.5 times the original throughput.

The other dimension is the number of documents indexed. Once the document count in a single Solr instance grows large enough, query performance degrades. The solution is to split the index into smaller pieces called shards and distribute each query across the shards.
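
Here is a hedged SolrJ sketch of a classic manually sharded query using the shards parameter; the host names are hypothetical, and SolrCloud (discussed later in this chapter) now automates this distribution.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ShardedQueryExample {
    public static void main(String[] args) throws Exception {
        // Send the query to any one node; host names are hypothetical.
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://shard1:8983/solr/realestate").build();

        SolrQuery query = new SolrQuery("denver");
        // Classic manual sharding: the shards parameter lists every shard
        // the query should fan out to before results are merged.
        query.set("shards",
                "shard1:8983/solr/realestate,shard2:8983/solr/realestate");

        System.out.println(client.query(query).getResults().getNumFound());
        client.close();
    }
}
```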

Scaling with virtualized commodity hardware

A trend in modern computing is building software architectures that scale horizontally on virtualized commodity hardware: simply put, you add more commodity servers to handle more traffic. Cloud providers such as Amazon EC2 supply virtualized commodity hardware that supports this approach. Solr runs fine on virtualized hardware, but be aware that it is sensitive to I/O and memory, so if search performance is a top priority, consider deploying Solr on higher-end hardware with fast disks (such as SSDs). Hardware considerations for deploying Solr are discussed in chapter 13.

Scalability is important, but so is the ability to recover from failures automatically. In the next section we look at how Solr handles software and hardware failures.


1.1.7 Fault tolerance

Beyond scalability, you need to consider what happens when one or more of your servers fails, particularly if you plan to deploy Solr on virtualized commodity hardware. At a minimum, you must plan for failures: even the best-architected system running on high-end hardware will fail eventually.

Suppose your index is split across four shards and the server hosting shard 2 loses power. Solr can no longer index new documents or serve queries for that portion of the index, so your search engine is effectively down. To avoid this, you can add a replica of each shard. Back in our example, when shard 2 fails, Solr reroutes all indexing and query requests bound for shard 2 to its replica, which is still healthy, so the whole search service keeps running. Indexing and querying continue after the failure, though possibly more slowly, since one fewer server is available to share the load. We discuss failure scenarios in more detail in chapter 16.

By now you have seen that Solr has a well-designed, modern software architecture that scales horizontally and tolerates failures. These are essential properties once you have decided to adopt Solr, but you may still be unsure whether Solr is the right choice in the first place. In the next section we look at the benefits of Solr from the perspective of different roles in an organization, including software architects, system administrators, and the CEO.

1.2 Why Solr?

In this section, we offer some key information to help you decide whether Solr is the right technology choice for your company. Let's start with Solr's appeal to software architects.

1.2.1 Solr in the eyes of the software architect

When evaluating a new technology, software architects weigh factors such as stability, scalability, and fault tolerance. Solr scores well on all three.

On stability, Solr is a mature technology maintained by a vibrant open-source community and seasoned committers. New Solr and Lucene users are often surprised to learn that many organizations run builds pulled straight from a development branch rather than waiting for an official release. Whether or not your company would accept that practice (we are not suggesting you adopt it), the point is that the depth and breadth of the automated test suites in the Lucene and Solr projects are trustworthy. Put simply, if you grab a nightly build from the branch and all the automated tests pass, you can be confident that the core functionality works.

You encountered Solr's approach to scalability in section 1.1.6 and to fault tolerance in section 1.1.7. As an architect, you are probably most curious about the limits of those capabilities. First, know that sharding and replication were rewritten in Solr 4, greatly improving robustness and ease of management. The new approach to scaling is called SolrCloud. Under the hood, SolrCloud uses Apache ZooKeeper to synchronize configuration across the Solr cluster and monitor the cluster's state. Here are some highlights of the new SolrCloud features:

• Centralized configuration

• Distributed indexing with no single point of failure (SPOF)

• Automatic failover, with a new shard leader elected automatically

• Distributed full queries across all shards of the cluster, triggerable from any node, with integrated failover and load balancing

This doesn't mean Solr's scalability has no room for improvement. SolrCloud still has two notable limitations. First, not every feature works in distributed mode; joins are one example. Second, once an index is created, the number of shards cannot be changed dynamically; to change it, you must reindex all documents. We cover SolrCloud in depth in chapter 16, but software architects should know that Solr's scalability has come a long way in recent years and continues to improve.

1.2.2 Solr in the eyes of the system administrator

As a system administrator evaluating a new technology like Solr, your top question is whether it fits into your existing infrastructure. For Solr the answer is an easy yes. Solr is pure Java, so it runs on any operating system with a J2SE 6.x/7.x JVM. Solr ships with Jetty, an open-source Java servlet engine, and since Solr is a standard Java web application it also deploys readily to servlet containers and Java application servers such as JBoss or Oracle's application server.

All Solr operations are performed over HTTP, and Solr is designed to work with HTTP reverse proxies and caches such as Squid and Varnish. Solr also supports JMX, so you can hook it into your favorite monitoring application, such as Nagios.

Finally, Solr ships with a capable administration console for checking configuration, viewing statistics, issuing test queries, and monitoring the health of SolrCloud. Figure 1.4 shows a screenshot of the Solr 4 console; we will learn how to use it in chapter 2.

1.2.3 Solr in the eyes of the CEO

Although CEOs are unlikely to read this book, here are a few talking points to keep handy in case yours corners you in the hallway. First, executives like to hear that today's investment in a technology will keep paying off for years. You can point out that many companies still run their products on Solr 1.4, a version released in 2009, which shows that Solr has a proven commercial track record while continuing to improve.

CEOs also like technologies that are controllable and predictable. As you will see in the coming sections, Solr delivers here too: you can stand up a simple Solr server in minutes. Another likely question: if the employee who runs Solr leaves, will the business suffer? Will the whole service fail? Solr is genuinely complex technology, but its open-source community is very active, which means help is available quickly when you ask for it. You also have direct access to the source code, so if something looks wrong you can fix it yourself. In addition, many commercial vendors can help you plan, implement, and maintain your Solr deployment, and many also offer Solr training courses.

Your CFO, meanwhile, will care about the cost of adopting Solr, and the investment is modest. Without knowing anything about the size of your environment, we can confidently say that you can stand up a simple Solr server in a few minutes and be indexing documents shortly thereafter. A cloud-hosted server can handle an index of millions of documents while serving queries with subsecond response times.

 


