Qiandu Search and Language Processing Experimental Platform: Opening Chapter

Source: Internet
Author: User

 

Starting today, I will write a series of articles about my graduation project system. I had planned to start earlier, but I always felt I lacked some of the necessary knowledge. After more than two months of accumulation, my ideas have gradually taken shape. During this period I completed the debug version of the system, gained some experience, and learned a lot. I hope these articles will be useful to you.

 

Overture:

More than two months ago, I joined my current lab, the Knowledge Engineering Center of the Shenyang Institute of Aviation Industry. The lab focuses on natural language processing, including information retrieval, machine translation, artificial intelligence, and knowledge engineering. My supervisor, Professor Cai Dongfeng, leads the lab. Its strong atmosphere of study and research has fired my enthusiasm. This is my last stretch of time at school before graduation, and I cherish it. The graduation project is naturally the main thread of my schedule during this period and the main result of my efforts. Through hard work and research, I have carefully built a search system for scientific research experiments, as a small token of thanks for the care shown to me by Professor Cai and my fellow lab students. I hope that my efforts at this stage will bring inspiration, joy, or comfort to everyone who cares about me, and that they will bring my college life to a satisfying conclusion.

Origin of the idea:

The development of the Internet has brought about a worldwide information revolution. Through the Internet, people can obtain massive amounts of information quickly and easily. However, the world has not become "flat", and people still cannot obtain information equally. One of the most prominent reasons is the "information explosion" problem: faced with such a massive amount of information, how can we extract the "useful" information we actually need?

Search engines, as a mature solution, have received wide attention. With the commercial success of the second generation of search engines, represented by Google, a search engine boom has swept the world, and search-related technologies have remained a hot topic in both industry and academia.

A search engine is in fact an information retrieval system. In essence, it is a computer software system that finds information relevant to a user's needs in collections of unstructured information [1. Introduction to information retrieval systems]. Today's mature search systems combine a large number of optimization techniques. The performance and effectiveness of the whole system depend not only on any single part, but also on how the parts work together. Researchers working on retrieval systems therefore cannot solve only the part they study; they must also take the construction of the entire system into account. In experiments especially, researchers and developers have to build not only their own optimization techniques but also the remaining parts of the retrieval system. This prevents many of them from concentrating on the hard problems they actually want to solve.
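
To make the definition above concrete, here is a minimal sketch in Python of the core of such a system: an inverted index over a few unstructured text documents, with Boolean AND matching. It is not taken from Lemur or from the system described in this article. The function names (tokenize, build_index, search), the sample documents, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real tokenization."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical sample collection, for illustration only.
docs = {
    1: "search engines retrieve documents",
    2: "machine translation of documents",
    3: "search and language processing",
}
index = build_index(docs)
print(search(index, "search documents"))
```

A real retrieval system adds ranking models, index compression, query parsing, and much more on top of this core, which is why building the "other parts" takes so much of a researcher's time.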

In industry, well-known search development tools include Apache Lucene, Hyper Estraier (which supports a P2P architecture), the SQL-based full-text search engine Sphinx, and the structured text indexing and retrieval engine Zebra. Lucene is the most widely used. It is written in Java, has simple interfaces, and is well structured and extensible. However, it has weaknesses: its performance is poor, it provides few retrieval algorithms, and it cannot support large-scale data processing. As a result, industry-oriented search tools of the kind Lucene represents are not well suited to scientific research.

Researchers often need a flexible, robust, and scalable experimental platform that supports large-scale data, so that they can test and verify new optimization techniques and algorithms. For this reason, many research institutions abroad have developed their own retrieval tool platforms. Representative examples include Lemur, the language modeling and information retrieval toolkit jointly developed by Carnegie Mellon University (CMU) and UMass; Zettair, from the Royal Melbourne Institute of Technology (RMIT); and Wumpus, from the University of Waterloo, Canada. Lemur stands out among them. It provides good support for experiments on retrieval effectiveness and is used by many research institutions. However, Lemur does not offer good extension support for experiments on retrieval efficiency and indexing. In China, the only comparable tool at present is FirteX, developed by the Chinese Academy of Sciences. Although its system architecture and ideas are good, its support for core retrieval models still lags behind the foreign tools. As a result, no system built on a domestic retrieval tool has appeared at the current international retrieval evaluation conferences.

Lemur is not a complete retrieval system but a collection of retrieval tools. It provides only the components and APIs needed to build a search system. The Lemur project later launched the open-source project Indri, which is a complete retrieval system. Even so, using these tools to develop a system runs into the following problems:

1. The documentation is rough, and parts of it are out of sync with the latest software version.

2. There are very few reference materials, and currently none in Chinese.

3. Most users are research institutions, so there are few developers and the community is inactive.

4. Lemur's implementation is complicated, and development requires a certain amount of domain knowledge.

5. Support for Chinese is weak. Only word segmentation is supported.

6. Bugs found in some components of the open-source software are not fixed in a timely manner.

Given the problems above, I came up with the idea of building a retrieval and language processing platform of my own.

System implementation goals:

1. Study Lemur in depth, including its specific functions, API usage, development methods, and internal implementation mechanisms.

2. Develop a search platform based on Lemur. The platform must have the following features:

• The overall architecture keeps the Lemur style and aims for flexibility and for powerful, complete functionality. At the same time, the system structure strives to be concise and to reduce the complexity of Lemur's design as much as possible.

• System modules are clearly defined and the architecture is highly extensible, so that every component can be easily extended, customized, and maintained.

• Good language processing support for both English and Chinese.

• A friendly interactive interface that lets researchers "click a button to run an experiment".

• System response speed improved as much as possible.

• System code kept as simple and robust as possible.

• Detailed development documentation, along with introductory documents on retrieval system design and on Lemur for beginners to learn from.

 
