Full-text retrieval system for machine translation

Source: Internet
Author: User

Abstract: This article introduces the design and implementation of a full-text retrieval system for machine translation. The system uses a bucket structure and, in addition to common full-text retrieval functions such as Boolean retrieval, position retrieval, and relevance ranking, provides multi-level and cross-language retrieval for machine translation. For the fuzzy retrieval of chapters and paragraphs in machine translation, the article proposes a two-step method of contraction checking followed by precision checking. It analyzes the features of the documents and selects an appropriate search expression model to solve the problem of relevance discrimination in machine translation retrieval. Sentence similarity is calculated using the idea of dynamic programming.
Keywords: machine translation, full-text retrieval, paragraph retrieval, chapter retrieval
I. Introduction
With a deeper understanding of linguistics and the development of computer technology, machine translation has advanced rapidly and a number of practical machine translation systems have emerged. The development of the Internet in particular has given rise to network machine translation systems. Machine translation draws on many fields, such as linguistics, computational mathematics, computer technology, and cognitive science. Because of the inherent complexity of language and the limits of current artificial intelligence, there is still a gap between translation quality and practical needs, and manual post-editing is often required. Translation speed also falls short of users' needs, because a large amount of syntactic and semantic analysis must be performed using dictionaries and rules. Improving the accuracy of machine translation is therefore extremely difficult.
Because the same text often needs to be translated repeatedly, especially web pages on the Internet, we propose saving translation results that were previously edited by hand or are otherwise of high quality, and using this accumulated translation experience to continuously improve the speed and quality of machine translation. Following the general design principles of full-text retrieval systems, we designed and implemented a full-text retrieval system for machine translation. The system not only provides full-text retrieval but also provides multi-level and cross-language retrieval for machine translation.
II. Functions and overall structure
The system provides information retrieval functions both for users and for machine translation. User-oriented retrieval offers the basic functions of a retrieval system, lets users make full use of the collected bilingual information, and supports cross-language retrieval. In machine-translation-oriented retrieval, if a similar text has already been translated, the system can directly call the translation stored in the bilingual information library, which speeds up translation. In addition, because the translations stored in the information library have been manually edited to different degrees, the translation results provided to users are more accurate.
The system is designed and implemented according to the following principles: (1) inherit the functions of a general full-text retrieval system, provide a relevance feedback mechanism, and add the retrieval functions needed by the machine translation system; (2) open models and support ...; (3) easy maintenance, with a consistent index structure for Chinese and English; (4) real-time translation and query processing, and large amounts of information, on the domestic network.
On the basis of an inverted file, the system adopts a Boolean retrieval model that matches users' query habits, providing fast and accurate results for both user-oriented and machine-translation-oriented retrieval.
[Figure: system structure]
Functions of each module:
* Information document preprocessing module
The preprocessing module filters the formats of non-plain-text documents from different sources. The system saves both the original document and the corresponding plain text document, so that text in different formats can be retrieved.
* Index module
The index module analyzes the documents in the document library and builds the index information that retrieval depends on. Its main tasks are: creating inverted records of document feature information; establishing the correspondence between bilingual documents and between their internal paragraphs; and analyzing the text to extract the external features of each document.
* User-oriented retrieval module
According to the user's query, this module reads the feature records of the documents to find the information the user needs. Its main tasks are search expression processing, search processing, search extension processing, relevance ranking, and relevance feedback.
User-oriented retrieval is also the basis for machine translation retrieval. The system first analyzes the input search expression and checks it for errors, then searches for each individual search term. The per-term results are combined according to the operators in the search expression, and the final results are sorted and output.
* Machine translation retrieval module
This module meets the machine translation system's need to query chapters and paragraphs. It either retrieves the same chapter or paragraph together with its translation, or concludes that the queried object is not stored in the bilingual database. This is the core module of the system.
III. Retrieval for machine translation
An exact match of a chapter or paragraph is unlikely. How to quickly and accurately find "similar" chapters and paragraphs that meet the needs of machine translation is the focus and the key problem of the whole retrieval process. This article uses a method of gradual refinement. For chapter retrieval, the system first matches by external features. If a match exists, the matching results are refined directly. If not, the system extracts the chapter's keywords (a keyword set), builds a search expression from them for the contraction check, and then performs a fuzzy-matching precision check on the results to obtain the final search result. Paragraph retrieval uses the same two steps: contraction check and precision check.
3.1 Contraction check
The contraction check extracts the keywords (keyword set) that represent the features of the chapter or paragraph to be retrieved, searches the inverted file for related chapters and paragraphs, and thereby quickly narrows the range to be processed.
3.1.1 Subject word extraction
Network information retrieval has high real-time requirements, and subject words are extracted here only to construct a search expression. It is therefore impractical to perform detailed syntactic and semantic analysis when extracting subject words, and it is not suitable for inverse text ... The system therefore uses the following statistical method. It gives priority to the following words as index words: 1) Words that appear in the title, subtitles, and headings at different levels. A heading at a higher level is given greater weight. Hierarchical headings are extracted by the machine-translation-oriented hierarchical retrieval system. 2) Words in special positions such as the summary and the keyword list. 3) Words at the beginning and end of a paragraph. 4) Other things being equal, index words with higher frequency and greater length are given larger values.
The calculation formula of the key word weighting function is:
Here PW is the cumulative position weight, freq is the word frequency, Len is the word length, Lmin is the lower limit of the word length, and C is a constant. For Chinese words, long words are highly specific, so C can be larger. For English words, the difference is not as pronounced as in Chinese, so C can be smaller.
The initial value of PW is 0. It is then updated for each occurrence of the keyword according to the cases above: 1) in the title, PW = (...); in a level heading, PW = PW + 10 * i (where i is the level); 2) PW = PW + 5; 3) PW = PW + 1. If the keyword occurs in any other sentence, PW = PW + 1 / (total number of words in the sentence).
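The accumulation rules for PW can be sketched as follows. This is only an illustration: the function name, the occurrence labels, and especially title_bonus are assumptions, since the increment for a keyword in the title is not legible in the text above.

```python
def position_weight(occurrences, title_bonus=20.0):
    """Accumulate the position weight PW for one keyword.

    occurrences is a list of (kind, value) pairs, one per occurrence:
      ("title", None)    keyword in the document title
      ("heading", i)     keyword in a heading of level i
      ("special", None)  keyword in the summary or keyword list
      ("edge", None)     keyword at the beginning or end of a paragraph
      ("other", n)       keyword in any other sentence of n words
    title_bonus is an ASSUMED value; the article's own value is not known.
    """
    pw = 0.0  # PW starts at 0
    for kind, value in occurrences:
        if kind == "title":
            pw += title_bonus      # assumed increment
        elif kind == "heading":
            pw += 10 * value       # PW = PW + 10 * i
        elif kind == "special":
            pw += 5                # PW = PW + 5
        elif kind == "edge":
            pw += 1                # PW = PW + 1
        elif kind == "other":
            pw += 1 / value        # PW = PW + 1 / words in sentence
        else:
            raise ValueError(f"unknown occurrence kind: {kind}")
    return pw
```

For a keyword that appears in a level-2 heading, in the keyword list, at the start of a paragraph, and in one ordinary four-word sentence, the sketch gives 20 + 5 + 1 + 0.25 = 26.25.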
3.1.2 Relevance search
The full-text retrieval system supports searching for keywords within the same paragraph, so the search expression for paragraph retrieval is relatively simple to construct: add the same-paragraph position operator between the extracted keywords, and then use this search expression to find the relevant paragraphs.
Chapter retrieval is a kind of relevance judgment. Currently, systems that achieve good results in relevance judgment mostly adopt the vector space model, such as the SMART experimental system developed under Salton's leadership. However, this retrieval model has not yet been applied in practical systems. Some systems connect all extracted keywords with OR operations to narrow the search range, and then generate a space vector for every remaining document to determine its relevance to the query document. However, we believe this method is inefficient and its response time is too long, so it does not meet the real-time requirements of our system.
In this system, the chapter search expression uses a weighted query, which overcomes the inability of the Boolean retrieval model to express feature word weights and is easy to implement on the chosen model. The method is to give the weight of each keyword in the search expression, and to decide whether a document satisfies the search by checking whether the correlation between the document and the query exceeds a threshold.
The similarity measure is:
The document keyword weights are assigned using the TF * IDF rule. M is the total number of documents in the database, Nt is the number of documents that contain the word t, and fdt is the frequency of the word t in document d. The document length is obtained by counting the number of index words.
3.2 Precision check
The precision check is the process of further matching within the set of candidate documents produced by the contraction check to obtain the final search result. The system compares important features first, so that documents that cannot match are eliminated as early as possible and the amount of later processing is reduced.

The chapter to be retrieved is first divided into paragraphs, which are retrieved using the paragraph retrieval method. The paragraph precision check is fuzzy. When the structural features of two paragraphs basically match, the paragraphs are further divided into sentences and sentence similarity is calculated, which finally determines whether the paragraphs match. The system uses dynamic programming to calculate sentence similarity.
The words of the sentence to be translated are laid out along the I axis of the I-J plane, and the words of the example sentence along the J axis. The value at (i, j) is the similarity between word i and word j. A comparison of the two sentences corresponds to a path from the origin to (I, J), and the sentence similarity is the sum of the matching degrees of the vertices along the path. Calculating the similarity between sentences is thus transformed into finding an optimal path in the I-J plane that maximizes the similarity between the two sentences.
In pursuit of speed and accuracy, the similarity query currently does not perform operations such as synonym expansion. The similarity d(ik, jk) is defined as 1 if words i and j are the same, and 0 otherwise. The state transition equation is: (ik, jk) = uk(ik-1, jk-1).
In addition, matching paths between similar sentences are subject to certain restrictions: (1) Monotonicity: the path must extend to the right or upward from the starting point. (2) Global path restriction: a diagonal path is preferred over a vertical or horizontal one. (3) Local path restriction: the only successors considered for (ik, jk) are (ik+1, jk), (ik, jk+1), and (ik+1, jk+1), and the path contains no right angle.
The similarity S along the full path from the origin to (I, J) is:
The optimal recurrence formula for dynamic programming is:
The similarity between sentences is defined as:
N is the number of words in the sentence to be translated. The sentence with the highest similarity is returned as the search result. If no sentence has a similarity above the threshold, a query failure flag is returned.
In this way, the relevance of a paragraph can be defined from the relevance of its sentences, so that the required paragraph, or even chapter, can be retrieved.
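The path search over the I-J plane can be sketched with a small dynamic programming table. This is a simplified illustration, not the article's exact recurrence: it uses the 0/1 word similarity and the three monotonic moves described above, but it counts a match only on a diagonal step (an assumption, so that each word is matched at most once), which makes the path score equal to the length of the longest common subsequence. The function names and the threshold value are hypothetical.

```python
def sentence_similarity(query_words, example_words):
    """Best monotonic path score divided by N, the number of query words."""
    n, m = len(query_words), len(example_words)
    if n == 0:
        return 0.0
    # best[i][j]: best score using the first i query words and the
    # first j example words.
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1 if query_words[i - 1] == example_words[j - 1] else 0
            best[i][j] = max(
                best[i - 1][j],          # step along the I axis
                best[i][j - 1],          # step along the J axis
                best[i - 1][j - 1] + d,  # diagonal step, may count a match
            )
    return best[n][m] / n


def best_match(query_words, examples, threshold=0.6):
    """Return (example, score) for the most similar example sentence,
    or None (the query failure flag) if no score reaches the threshold."""
    scored = [(ex, sentence_similarity(query_words, ex)) for ex in examples]
    if not scored:
        return None
    top = max(scored, key=lambda pair: pair[1])
    return top if top[1] >= threshold else None
```

For the sentence "the cat sat on the mat" and the example "the cat lay on the mat", five of the six words lie on the best path, so the similarity is 5/6.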
3.3 Relevance performance analysis
First, we use an example to introduce the principle of weighted search.
For example, to query documents about network machine translation in natural language processing, the weighted query is:
Natural Language Processing (1) Machine Translation (3) network (2)
If a document contains all three terms, its weight is 1 + 3 + 2 = 6. If it contains natural language processing and machine translation, its weight is 1 + 3 = 4, and so on. If the minimum threshold is set to 4, a matching document must contain either all three terms or any two of them other than the combination of natural language processing and network.
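The weighted query above can be checked with a few lines of code. This is a minimal sketch: the function name and the representation of a document as a set of terms are assumptions made for illustration.

```python
def weighted_match(doc_terms, weighted_query, threshold):
    """Sum the weights of the query terms present in the document and
    compare the total with the threshold."""
    score = sum(w for term, w in weighted_query.items() if term in doc_terms)
    return score, score >= threshold


# The example query from the text, with a minimum threshold of 4.
query = {
    "natural language processing": 1,
    "machine translation": 3,
    "network": 2,
}
all_three = {"natural language processing", "machine translation", "network"}
nlp_and_mt = {"natural language processing", "machine translation"}
nlp_and_network = {"natural language processing", "network"}
mt_and_network = {"machine translation", "network"}
```

With a threshold of 4, the first, second, and fourth documents match (scores 6, 4, and 5), while the combination of natural language processing and network scores 3 and is rejected.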
Next we compare this with the vector space model.
In the vector space model, both documents and queries are expressed as vectors. Assume that the document set contains m different index words T1, T2, ..., Tm; then each document in the set can be represented by some of these m index words. Any document can be expressed as a vector in the index word vector space:
D = (t11, t12, ..., t1m)
Similarly, a question Q can be expressed

----------------------------------------------------------------------------------------------

How does a search engine work behind the scenes?
Author: Zhu Jie, Institute of Software, Chinese Academy of Sciences
The amount of data processed by computers is growing exponentially. As the data and topics accumulated in information libraries keep increasing, how to retrieve all the information on a topic quickly, effectively, and economically has become a pressing question. One way to address it is intelligent search technology. This article summarizes the natural language processing and full-text retrieval technology behind a search engine, which ultimately helps network users find information.
Information retrieval
Information retrieval studies the representation, storage, organization, and access of information; that is, relevant information is retrieved from an information library according to the user's query. Information retrieval has evolved from manual keyword indexing to computer-based automatic indexing with full-text retrieval, automatic summarization, and automatic classification, and is moving toward natural language processing.
English information retrieval has developed rapidly. For example, the SMART information retrieval system developed by Salton and others represents the information to be searched in a vector space and applies natural language processing to retrieval, which greatly improves query accuracy. Chinese information retrieval systems have developed more slowly. Most existing Chinese retrieval systems are still keyword retrieval systems, and many are still at the stage of indexing single characters, which is inefficient and gives poor retrieval precision and recall. The reason is that Chinese information retrieval has its own characteristics. For example, there are no spaces between Chinese words, so word segmentation is required before indexing. In addition, syntactic analysis and semantic understanding are more difficult for Chinese than for English, which has slowed the development of Chinese information retrieval.
Information Retrieval Model
The core of an information retrieval system is the search engine, which must filter the information that meets the user's requirements out of a large and complicated body of information. For example, if a user queries the information library for the sales of computer network products, results that do not concern this topic cannot meet the user's needs. According to the search engine used, information retrieval can be divided into the Boolean logic model, the fuzzy logic model, the vector space model, and the probabilistic model.
The Boolean model is the simplest information retrieval model. The user submits a query expressed as Boolean logical relationships among search terms, and the search engine determines the result using an inverted file built in advance. The standard Boolean model is two-valued: a retrieved document is either relevant to the query or not, and the results are generally not sorted by relevance. For example, every document in which the keyword "computer" appears is included in the results. To overcome this lack of ordering, fuzzy logic operations are introduced when processing the results: the retrieved documents are compared with the user's query, and the results are sorted by priority. For example, for the query "computer", documents in which "computer" occurs more often are ranked nearer the top.
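Boolean retrieval over an inverted file can be sketched as set operations. This is a minimal illustration under simplifying assumptions: documents are whitespace-separated English text, and the function names are hypothetical.

```python
def build_inverted_file(docs):
    """Map each term to the set of ids of the documents that contain it.
    docs: {doc_id: text}"""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, *terms):
    """Documents that contain every term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index, *terms):
    """Documents that contain at least one term."""
    return set().union(*(index.get(t, set()) for t in terms))


def boolean_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())
```

The result of each operation is an unordered set, which reflects the point made above: a pure Boolean match says whether a document qualifies, not how well.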
Unlike the Boolean model, the vector space model represents both the user's query and the documents in the database as vectors in the space of search terms, and sorts the query results by vector similarity. The vector space model not only makes it easy to produce useful query results, but can also provide summaries of relevant documents and classify the results, helping users locate the information they need accurately.
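Ranking by vector similarity can be sketched as follows. The choice of raw term counts as vector components and of cosine similarity as the measure is an assumption made for illustration; practical systems usually weight the components, for example with TF * IDF.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (doc_id, similarity) pairs, most similar first.
    docs: {doc_id: text}"""
    q = Counter(query.lower().split())
    scored = [
        (doc_id, cosine(q, Counter(text.lower().split())))
        for doc_id, text in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Unlike the Boolean case, a document that shares only some of the query terms still receives a score, so partial matches are ranked rather than dropped.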
The probabilistic model, based on Bayesian probability theory, differs from the Boolean and vector space models. It uses inductive learning from relevance feedback to obtain its matching function.
Although different search models use different methods, the goal is the same, providing the information required by users according to user requirements. In fact, most search systems often mix the above models to achieve the best search results.
Information Retrieval System Structure
The search engine is the core of an information retrieval system. However, the system also involves preprocessing of document formats, analysis of the information to be indexed, information indexing, and user information retrieval.
Information preprocessing
Information preprocessing includes two levels: format conversion and filtering. An information gateway, as the component that accesses different information sources, can access data organized in different forms, such as various databases, different file systems, and web pages. Preprocessing can also filter documents in different formats, such as Microsoft Word, WPS, plain text, and HTML. This allows the search engine to retrieve not only plain text documents but also documents in their original formats.
Information Index
An information index is used to create a feature record of document information, which enables users to easily retrieve required information. Indexing requires the following processing:
Word Segmentation and lexical analysis
A word is the smallest unit of information expression. Unlike Western languages, Chinese has no separator (space) between the words in a sentence, so word segmentation is required. Chinese word segmentation can be ambiguous. For example, the sentence "satisfying users" can be segmented as "satisfying/user/satisfaction", but may also be mistakenly segmented as "using/user/satisfaction". Contextual knowledge is therefore needed to resolve segmentation ambiguities. In addition, lexical analysis is required to identify the stem of each word, so that the index can be built on stems.
Part-of-speech tagging and related natural language processing
On the basis of segmentation, part-of-speech tagging is performed using rule-based and statistical (Markov chain) methods. N-gram statistical analysis based on a Markov chain random process has been shown to be highly accurate for part-of-speech tagging. On this basis, syntactic rules are also used to identify important phrase structures.
Create a search index
Generally, an inverted file is used to record information about search items, as shown in Table 1. The recorded information includes the search item, the positions of the search item in the files, and the weight of the search item. For example, the position of the search item "computer" might be "word W of sentence M in section N of document D". This makes it possible for a query to require that item T1 and item T2 occur in the same sentence or paragraph. The criterion for building the index is that it must be easy to update as document information changes.
Table 1: A typical inverted file of search items
Term 1: doc_i, wt_i1; doc_j, wt_j1; ...; doc_m, wt_m1
Term 2: doc_i, wt_i2; doc_k, wt_k2; ...; doc_n, wt_n2
...
Term s: doc_j, wt_js; doc_m, wt_ms; ...; doc_p, wt_ps
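An inverted file that stores positions, so that a query can require two items to occur in the same sentence or paragraph, can be sketched like this. The input layout (each document as a list of paragraphs, each paragraph as a list of sentence strings) and the function names are assumptions for illustration; weights are omitted to keep the sketch short.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs: {doc_id: [paragraph, ...]}, each paragraph a list of sentences.
    Each posting is (doc_id, paragraph_no, sentence_no, word_no)."""
    index = defaultdict(list)
    for doc_id, paragraphs in docs.items():
        for p, sentences in enumerate(paragraphs):
            for s, sentence in enumerate(sentences):
                for w, word in enumerate(sentence.lower().split()):
                    index[word].append((doc_id, p, s, w))
    return index


def same_paragraph(index, t1, t2):
    """(doc_id, paragraph_no) pairs in which both terms occur."""
    a = {(d, p) for d, p, _, _ in index.get(t1, [])}
    b = {(d, p) for d, p, _, _ in index.get(t2, [])}
    return sorted(a & b)


def same_sentence(index, t1, t2):
    """(doc_id, paragraph_no, sentence_no) triples in which both terms occur."""
    a = {(d, p, s) for d, p, s, _ in index.get(t1, [])}
    b = {(d, p, s) for d, p, s, _ in index.get(t2, [])}
    return sorted(a & b)
```

Because each posting carries the paragraph and sentence numbers, the same index answers both kinds of proximity constraint without rescanning the documents.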
Query expansion processing
Information retrieval is evaluated by precision and recall. Precision is the ratio of the number of relevant documents in the search results to the total number of documents in the search results. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the information library.
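The two measures can be computed directly from sets of document ids. This sketch uses the standard definitions (precision is relevant retrieved documents over all retrieved documents; recall is relevant retrieved documents over all relevant documents in the library); the function name is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if a query returns four documents of which two are relevant, and the library holds three relevant documents in total, precision is 2/4 and recall is 2/3.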
To improve recall, query expansion is required. It expands the search items using a synonym dictionary and a dictionary of semantic implication. Synonym expansion handles different words for the same concept: for example, when two different words both mean "computer", a query for one must also retrieve the other, and vice versa. Topic implication expansion queries not only the search term but also the subconcepts it contains. For example, the subject word "art" includes "movie", "dance", and "painting", and "movie" in turn includes "story films" and "documentary films". A query for "art" therefore also covers "movie", "dance", "painting", and their subconcepts.
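Both kinds of expansion can be sketched with two small dictionaries. The dictionary contents, the function name, and the use of "PC" as a synonym are illustrative assumptions; a real system would load these from a thesaurus.

```python
def expand_query(term, synonyms, subconcepts):
    """Return the term, its synonyms, and all subconcepts reachable
    from any of them.
    synonyms:    {term: iterable of synonyms}
    subconcepts: {term: iterable of direct subconcepts}"""
    expanded = {term} | set(synonyms.get(term, ()))
    stack = list(expanded)
    while stack:
        current = stack.pop()
        for child in subconcepts.get(current, ()):
            if child not in expanded:  # also guards against cycles
                expanded.add(child)
                stack.append(child)
    return expanded


# Hypothetical dictionaries built from the examples in the text.
synonyms = {"computer": {"PC"}, "PC": {"computer"}}
subconcepts = {
    "art": ["movie", "dance", "painting"],
    "movie": ["story film", "documentary film"],
}
```

The expanded terms would then be combined with OR in the search expression, which raises recall at some cost in precision.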
To improve precision, the vector space model can be used for relevance feedback. The user selects important documents or document fragments from the results of the first query, and the search engine queries again based on the features of the selected documents, which improves precision.
Information Classification and summary
To help you select the required information from the query results, the search engine classifies the document information provided to you according to the document content and generates a brief summary for each document.
The search engine classifies and summarizes the query results based on the statistical features of the search items in the text. For example, if a user queries the search item "computer", the results may be classified as "Category 1": "network", "system", and "vro"; "Category 2": "market", "product", and "sales"; and other categories. The purpose of classification is to help users find relevant information.
Intelligent agent
In addition to passive search, a search engine can use intelligent agent technology to search for information actively. According to retrieval requirements defined in advance, the intelligent user agent of an information retrieval system monitors information sources on the network in real time, such as updates to specified web pages, online news, emails, and changes in database information, and delivers the required information to the user by email or other means. The user does not need to search repeatedly, which greatly reduces the time spent on retrieval.
Currently, commercial information retrieval systems primarily use the Boolean, fuzzy logic, and vector space models, supplemented by natural language processing. Natural language processing, and especially the application of natural language understanding to information retrieval, will greatly improve the precision and relevance of retrieval.
