Expert interview: Searching for the power of open source: prospects for Lucene technology

Reporter: Why are search engines so dominant in today's Web technology?

Wu Zhongxin: What kind of information is most abundant on the Web? Unstructured information. That information needs to be integrated, and the search engine (SE) came into being to do it. New things keep emerging on the Web: structured blogs, FeedMesh, XMTP (Extensible Markup Transport Protocol), XML serialization and deserialization, a wider range of browser applications, and, needless to say, massive amounts of data. Without search engines, you would still not know how to find these information sources. The SE is now part of the Web's infrastructure, but note that it is not itself the fundamental change; the browser and markup languages were. The SE is an enabling technology. It is very sensitive to emerging things, so it has to stay out in front. If the information stored on the network is likened to a library, the administrator's role is very important: if I cannot find a book at Jiao Tong University, I have to ask the librarian. Thank you for your attention.

Reporter: What do you think of open-source search engine projects?

Wu Zhongxin: Open-source projects are like a treasure chest. You can use them, learn from them, and build on them. What can you learn? Frameworks, programming techniques, and algorithms. For open-source search engine projects, I would focus on the algorithms. The preface to "Beautiful Code" mentions the formula algorithms + data structures = programs. The idea is somewhat old, yet people still reach for algorithm books and for "The Art of Computer Programming". Mr. Knuth has devoted a lifetime to guiding everyone, yet many people still think a little cleverness is enough (that an algorithm is something you come up with by slapping your forehead on the spot), while students who understand neither algorithms nor programming technique never manage to meet programming standards. Each side laughs at the other, and the subject ends up sounding like "mystery upon mystery, the gateway to all wonders". What a thing shows us is its surface, and using it is enough to understand it at that level. To see beneath the surface, you have to go "inside", work "from the simple to the deep", and finally reach the point of "writing it yourself". This is especially true of open-source search engine projects.

Reporter: What websites currently use Lucene technology?

Wu Zhongxin: A list is provided under the PoweredBy link on the Lucene website, sorted alphabetically. Eclipse uses Lucene for documentation search, PSNC uses it in the architecture of its digital library, and the USAJobs job search is a search engine for job hunting. There are dozens of other desktop search tools and development sites. If we widen the scope to the Hadoop project, which evolved out of the Lucene-based Nutch, there is also the giant site Yahoo!, which uses this GFS-like architecture. So I recommend taking a look at Lucene's derivatives, which are quite interesting. There you can find some of the latest technologies, such as MapReduce and BigTable.

Reporter: Can you introduce the technical features of Lucene?

Wu Zhongxin: I will not go through all of Lucene's technical features here. Let me talk about just one of them, the Lucene query process. The benchmarks on the Lucene website mostly concern the indexing process, and books cover the index structure as well. Although indexing and querying are inverse processes, querying has its own characteristics. The most obvious one is the balance among hard disk, memory, and CPU.

We all know that compression costs CPU time at decompression, but it saves disk space and reduces I/O time. That is the balance between CPU processing time and disk access time.

Because Lucene supports range queries, it builds a skip table over the term dictionary instead of using a hash. Before queries can use the skip table, it has to be loaded into memory, much as a car engine needs warming up in winter. The size of the table held in memory depends on the size of your dictionary file and on the term interval used to generate the table: Lucene records one term in the skip table for every interval of terms in the dictionary file. If the interval is too small, the table uses too much memory; if it is too large, scanning the terms inside one interval on disk takes too long. That is the balance between disk access and memory access.

With multiple query terms, you also have to consider that the partial score contributed by each term is held in memory: until the last query term has been processed, you do not have the final score. To bound memory, you can first keep only the partial scores of the first 1024 documents in memory. The CPU computes the total scores of documents 1 to 1024 first and then repeats the computation about N/1024 times, where N is the number of documents in the collection. That way, even if an earlier query term is a high-frequency term, memory consumption stays modest, because only part of its scores is held at any one time. That is the balance between CPU and memory.
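
The skip table described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Lucene's actual code: the class name `SparseTermIndex`, its methods, and the `interval` parameter are hypothetical, and a Python list stands in for the on-disk dictionary file. It shows why a sorted dictionary with sampled terms supports both exact lookup and range queries, which a hash would not.

```python
import bisect


class SparseTermIndex:
    """Toy model of a sorted term dictionary with a sparse in-memory table.

    `terms` stands in for the on-disk dictionary file; `samples` is the
    small table that would be loaded into memory before querying.
    All names are hypothetical, not Lucene's API.
    """

    def __init__(self, sorted_terms, interval=4):
        self.terms = list(sorted_terms)
        self.interval = interval
        # Keep only every `interval`-th term in memory.
        self.samples = self.terms[::interval]

    def _block_start(self, term):
        # Dictionary position of the last sampled term <= `term`
        # (or position 0 if `term` sorts before every sample).
        k = bisect.bisect_right(self.samples, term) - 1
        return max(k, 0) * self.interval

    def lookup(self, term):
        """Return the position of `term` in the dictionary, or -1."""
        start = self._block_start(term)
        end = min(start + self.interval, len(self.terms))
        for pos in range(start, end):  # at most `interval` "disk" reads
            if self.terms[pos] == term:
                return pos
            if self.terms[pos] > term:
                break
        return -1

    def term_range(self, low, high):
        """Return all terms t with low <= t <= high, in sorted order."""
        pos = self._block_start(low)
        found = []
        while pos < len(self.terms) and self.terms[pos] <= high:
            if self.terms[pos] >= low:
                found.append(self.terms[pos])
            pos += 1
        return found


terms = ["apple", "banana", "cherry", "date", "elder",
         "fig", "grape", "hazel", "iris", "juniper"]
index = SparseTermIndex(terms, interval=4)
print(index.samples)                       # ['apple', 'elder', 'iris']
print(index.lookup("fig"))                 # 5
print(index.term_range("cherry", "fig"))   # ['cherry', 'date', 'elder', 'fig']
```

A smaller `interval` makes `samples` larger (more memory) but shortens the scan inside `lookup`; a larger one does the opposite, which is the disk-versus-memory trade-off described in the answer.
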
Hard disk, memory, CPU: choosing two out of three gives three pairings, and, well, I have now covered all of them.
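
The windowed scoring idea from the answer above (hold partial scores for only 1024 documents at a time and repeat roughly N/1024 times) might look like the following sketch. It is an illustration under assumptions, not Lucene's scorer: the function name `score_in_windows`, the posting format of `(doc_id, partial_score)` pairs, and the simple additive scoring are all hypothetical.

```python
WINDOW = 1024  # window size quoted in the interview


def score_in_windows(postings, num_docs, window=WINDOW):
    """Yield (doc_id, total_score) for every document hit by any term.

    postings: dict mapping term -> list of (doc_id, partial_score),
              each list sorted by doc_id (hypothetical format).
    Only `window` accumulator slots exist at any time, no matter how
    frequent a term is.
    """
    cursors = {term: 0 for term in postings}
    for base in range(0, num_docs, window):
        limit = base + window
        scores = [0.0] * window      # partial scores for this window only
        touched = [False] * window
        for term, plist in postings.items():
            i = cursors[term]
            # Consume this term's postings that fall inside the window.
            while i < len(plist) and plist[i][0] < limit:
                doc_id, partial = plist[i]
                scores[doc_id - base] += partial
                touched[doc_id - base] = True
                i += 1
            cursors[term] = i        # resume here in the next window
        for offset in range(window):
            if touched[offset]:
                yield base + offset, scores[offset]


postings = {
    "lucene": [(0, 1.0), (5, 0.5), (2000, 2.0)],
    "search": [(5, 1.5), (1500, 1.0), (2000, 0.25)],
}
for doc_id, score in score_in_windows(postings, num_docs=3000):
    print(doc_id, score)
```

The cost is extra CPU work, since every term's posting list is revisited once per window, in exchange for a fixed-size score buffer.
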

Lucene uses a variety of algorithms from the modern search field. Once you dig into that field, you find that what matters is not how well you know a programming language but how you solve the problem. In the past, many people wanted to port Lucene to C++. The performance differs slightly, but the port was not really necessary.

Reporter: Some readers often ask the following question: What should I do if I want to create a search engine?

Wu Zhongxin: I have answered this question on Baidu Zhidao ("Baidu Knows"), where many people have asked it. Clearly, everyone wants to get started quickly. Here is my answer from Baidu:

http://www.ir.iit.edu/~dagr/cs529/files/ir_book/chap%204%20inverted%20index.pdf

The document at this address provides simple code for a retrieval prototype based on an inverted index, so getting started is not difficult. If you want to understand how open-source search engines such as Lucene, MG4J, and stylenx are implemented, you can refer to "Lucene Analysis and Application", which I wrote with Jia Li; the underlying layers of these engines are much the same. Note, however, that this does not give you a complete search engine. You also need to study the web crawling process in Nutch. I recommend reading the book "Java Robot Programming" first and then coming back to Nutch; you will get up to speed quickly.
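
To give a feel for how small such a prototype can be, here is a minimal sketch of an inverted index with AND queries. It is not the code from the linked chapter or from Lucene; the function names `build_index`, `intersect`, and `search` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of the doc IDs that contain it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() so that each document appears once per term.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)


def intersect(a, b):
    """Merge-intersect two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, query):
    """Return IDs of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start with the shortest posting list to keep intermediate results small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result


docs = [
    "Lucene is a search library",
    "Nutch is a crawler built on Lucene",
    "Hadoop grew out of Nutch",
]
index = build_index(docs)
print(search(index, "nutch lucene"))   # [1]
```

A real engine adds analysis (tokenizers), scoring, and on-disk posting lists, but the term-to-documents mapping is the same underlying idea.
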

Reporter: What are your main work and research fields? (Ideally, please mention any related development experience.)

Wu Zhongxin: My current research field is service composition, which is a rather interesting problem. By analogy, it is like a factory assembly line that integrates various components into a complete product, except that there are more variables along the way. But that is outside the scope of this interview.

As for development experience related to SE: I have written a Chinese word segmentation parser in C#, with an algorithm based on Dijkstra. It is an empirical parser, though, and not a perfect one. I have also recovered indexed documents from inverted indexes, which requires a good understanding of the indexing process and the index structure. In addition, I indexed more than 100 M records of Dublin Core documents from Guotu (the National Library). The tokenizers included a CJK analyzer, a standard tokenizer, and a natural-language tokenizer, producing three kinds of indexes, and I tested Lucene's Chinese query performance thoroughly. Haha, if someone gave me the distribution of Chinese query terms typical of the general public, I could tell you which segmentation works better; Zipf's law does not necessarily hold!
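
The interview gives no details of the Dijkstra-based segmenter, so the following is only a sketch of the general idea: treat character positions as graph nodes, dictionary words as weighted edges, and take the cheapest path from the start of the sentence to its end. The dictionary, its costs, the `UNKNOWN_COST` penalty, and the function name `segment` are all hypothetical and are not taken from the author's C# parser.

```python
import heapq

# Hypothetical toy dictionary: word -> cost (lower means more likely).
TOY_DICT = {"研究": 1.0, "研究生": 1.2, "生命": 1.0,
            "生": 2.5, "命": 2.5, "起源": 1.0}
UNKNOWN_COST = 5.0  # assumed penalty for a single character not in the dictionary


def segment(text, dictionary=TOY_DICT, max_len=4):
    """Split `text` into words along the cheapest path (Dijkstra)."""
    n = len(text)
    best = {0: 0.0}   # cheapest known cost to reach each position
    prev = {}         # back-pointers for path recovery
    heap = [(0.0, 0)]
    done = set()
    while heap:
        cost, i = heapq.heappop(heap)
        if i in done:
            continue
        done.add(i)
        if i == n:
            break
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = text[i:j]
            if word in dictionary:
                step = dictionary[word]
            elif j == i + 1:
                step = UNKNOWN_COST  # always allow a single-character fallback
            else:
                continue
            new_cost = cost + step
            if new_cost < best.get(j, float("inf")):
                best[j] = new_cost
                prev[j] = i
                heapq.heappush(heap, (new_cost, j))
    # Walk the back-pointers from the end of the text to the start.
    words = []
    j = n
    while j > 0:
        i = prev[j]
        words.append(text[i:j])
        j = i
    return words[::-1]


print(segment("研究生命起源"))   # ['研究', '生命', '起源']
```

With these toy costs, the path "研究 / 生命 / 起源" (cost 3.0) beats "研究生 / 命 / 起源" (cost 4.7). In practice the costs would come from word frequencies, which is where the question about the real distribution of query terms becomes relevant.
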

Reporter: You recently published a book on Lucene technology. What kind of book is it? What does "Lucene Analysis and Application" cover?

Wu Zhongxin: "Lucene Analysis and Application" is an in-depth analysis of the Lucene source code, and so far it is the most thorough one. It covers the main processes and algorithms from indexing to querying, and the Lucene features mentioned above are explained in detail in the book. The errata are also posted on China-Pub, with the aim of giving readers a book of assured quality. At the same time, I hope readers will not be content with just using Lucene, but will also think about how to master the fundamentals: can we innovate ourselves? Zhao ke wrote "Linux Source Code Analysis" early on, so why has China made no breakthrough in operating systems? At what appears to be a top international operating systems conference, China has not published a paper in 15 years. Is it that none of us has put in enough effort?

Reporter: What are your predictions for how search technology will evolve, branch out, and change in the future?

Wu Zhongxin: I cannot make predictions. Search technology has changed how people understand the associations between pieces of information, from the strong associations of relational algebra to loose associations: one large, simple table. Elastic frameworks are what everyone has been pursuing lately. Cloud computing, too, is a model of the grid in its "primary stage of socialism". All of these problems arise at a certain scale of data. Efficiency matters more and more, and solutions are becoming more and more practical. If you want to see the trend clearly, you have to dig thoroughly into the nature behind these technologies, starting from the details and focusing on their essence! To borrow a programmer's saying!

You can also read "The Course of Science", "The Theory of Relativity", "The Language of Architectural Patterns", "The History of Ancient Chinese Science and Technology", and "The Study of Ming-Style Furniture", as well as "A Sample of Miscellaneous Articles". These not only break out of our Eastern way of thinking, but also help us retain some of our own cultural substance under the impact of Western culture!
