"C + + Curriculum design" Report
-- Template-based text feature extractor program
I. Topic description
1. Design objectives
To make text easier to process, the character sequence in a text often needs to be converted into a sequence of feature vectors. Usually a set of feature templates is given, such as the following three:
(a) Cn (n = -2, -1, 0, 1, 2)
(b) CnCn+1 (n = -2, -1, 0, 1)
(c) C-1C1
Here C0 is the character under consideration and Cn is the character n positions away from it.
For example, given the character sequence "新华社记者" ("Xinhua News Agency reporter"), the three templates are applied to each character in turn to extract the corresponding features. For the character "社", template (a) yields C-2=新, C-1=华, C0=社, C1=记, C2=者. Template (b) yields C-2C-1=新华, C-1C0=华社, C0C1=社记, C1C2=记者. Template (c) yields C-1C1=华记. Thus every character in a character sequence corresponds to one feature vector under the three templates, and a character sequence corresponds to a sequence of feature vectors.
To simplify further processing, all features produced by the templates are also stored in a feature dictionary, in which each feature has an ordinal number. For example, if the fifth entry in the feature dictionary is C-1C0=华社, the ordinal of that feature is 5. In this way the feature vector of each character can be converted to a numeric vector (each element is the ordinal of a feature in the dictionary), and any piece of text can be converted to a sequence of ordinal vectors.
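The dictionary step described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the class name `FeatureDict`, the helper `toOrdinals`, and the tab-separated file format used by `save` and `load` are all assumptions, and the sketch assumes feature strings contain no tab or newline characters.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>   // only needed for the in-memory usage shown below
#include <string>
#include <vector>

// Hypothetical feature dictionary: maps a feature string to its ordinal.
class FeatureDict {
public:
    // Returns the ordinal of a feature, adding it if unseen. Ordinals start at 1.
    int idOf(const std::string& feature) {
        auto it = ids_.find(feature);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(ids_.size()) + 1;
        ids_.emplace(feature, id);
        return id;
    }

    std::size_t size() const { return ids_.size(); }

    // Writes one "ordinal<TAB>feature" line per entry (assumed format).
    void save(std::ostream& out) const {
        for (const auto& kv : ids_) out << kv.second << '\t' << kv.first << '\n';
    }

    // Reads the format written by save(), replacing the current contents.
    void load(std::istream& in) {
        ids_.clear();
        std::string line;
        while (std::getline(in, line)) {
            std::size_t tab = line.find('\t');
            if (tab == std::string::npos) continue;  // skip lines without a tab
            ids_[line.substr(tab + 1)] = std::stoi(line.substr(0, tab));
        }
    }

private:
    std::map<std::string, int> ids_;
};

// Converts the feature strings of one character into a vector of ordinals.
std::vector<int> toOrdinals(const std::vector<std::string>& features,
                            FeatureDict& dict) {
    std::vector<int> ids;
    for (const auto& f : features) ids.push_back(dict.idOf(f));
    return ids;
}
```

Because `save` and `load` take streams, the same code works with a `std::ofstream`/`std::ifstream` for files or a `std::stringstream` for a quick in-memory round trip.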
2. Functional design requirements
(a) For a specified text file, convert the character sequence in the file into the corresponding sequence of feature vectors (each feature vector being an ordinal vector) according to the three feature templates described above.
(b) Support saving and opening both the feature dictionary and the feature vector sequence.
(c) Each template should correspond to one class, and the offset values of templates (a) and (b) must be parameterized.
(d) The text is processed sentence by sentence, and each sentence produces one sequence of feature vectors. Each sentence must correspond to one instance class (Instance), and the whole text to one instance sequence class (Instancesequence).
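One possible shape for the classes required above is sketched here. It is an illustration under stated assumptions, not the report's code: the names `FeatureTemplate`, `UnigramTemplate`, `BigramTemplate`, `SkipTemplate`, and `makeDefaultTemplates` are hypothetical, characters are held one per `std::string` so that multi-byte characters stay intact, and "_" as padding outside the sentence is an assumption because the report does not say how sentence boundaries are handled. The sketch uses C++11 features that the original VC7 toolchain would not have had.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

using Chars = std::vector<std::string>;  // one element per character

// Base class shared by all feature templates.
class FeatureTemplate {
public:
    virtual ~FeatureTemplate() = default;
    // Returns the feature string this template yields at position i of sentence s.
    virtual std::string extract(const Chars& s, std::size_t i) const = 0;

protected:
    // Character at i + offset; "_" marks positions outside the sentence (an assumption).
    static std::string at(const Chars& s, std::size_t i, int offset) {
        long p = static_cast<long>(i) + offset;
        if (p < 0 || p >= static_cast<long>(s.size())) return "_";
        return s[static_cast<std::size_t>(p)];
    }
    static std::string label(int n) { return "C" + std::to_string(n); }
};

// Template (a): Cn, with the offset n as a constructor parameter.
class UnigramTemplate : public FeatureTemplate {
public:
    explicit UnigramTemplate(int n) : n_(n) {}
    std::string extract(const Chars& s, std::size_t i) const override {
        return label(n_) + "=" + at(s, i, n_);
    }
private:
    int n_;
};

// Template (b): CnCn+1, with the offset n as a constructor parameter.
class BigramTemplate : public FeatureTemplate {
public:
    explicit BigramTemplate(int n) : n_(n) {}
    std::string extract(const Chars& s, std::size_t i) const override {
        return label(n_) + label(n_ + 1) + "=" + at(s, i, n_) + at(s, i, n_ + 1);
    }
private:
    int n_;
};

// Template (c): C-1C1, the characters on either side of the current one.
class SkipTemplate : public FeatureTemplate {
public:
    std::string extract(const Chars& s, std::size_t i) const override {
        return "C-1C1=" + at(s, i, -1) + at(s, i, 1);
    }
};

using TemplateList = std::vector<std::unique_ptr<FeatureTemplate>>;

// Builds the ten templates from the topic description: 5 of (a), 4 of (b), 1 of (c).
TemplateList makeDefaultTemplates() {
    TemplateList v;
    for (int n = -2; n <= 2; ++n) v.emplace_back(new UnigramTemplate(n));
    for (int n = -2; n <= 1; ++n) v.emplace_back(new BigramTemplate(n));
    v.emplace_back(new SkipTemplate());
    return v;
}

// One sentence: the feature strings extracted at each character position.
class Instance {
public:
    Instance(const Chars& sentence, const TemplateList& templates) {
        for (std::size_t i = 0; i < sentence.size(); ++i) {
            std::vector<std::string> row;
            for (const auto& t : templates) row.push_back(t->extract(sentence, i));
            rows_.push_back(row);
        }
    }
    const std::vector<std::vector<std::string>>& rows() const { return rows_; }
private:
    std::vector<std::vector<std::string>> rows_;
};

// The whole text: one Instance per sentence, in order.
class InstanceSequence {
public:
    void add(const Instance& inst) { instances_.push_back(inst); }
    std::size_t size() const { return instances_.size(); }
    const Instance& operator[](std::size_t k) const { return instances_[k]; }
private:
    std::vector<Instance> instances_;
};
```

With the sentence 新, 华, 社, 记, 者 stored as five elements, `BigramTemplate(-1)` applied at position 2 produces the string `C-1C0=华社`. Adding a new template then only requires one more subclass of the base class.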
II. Design description
1. Design overview
(a) Development platform: Microsoft Visual Studio .NET 2003 (VC7.0)
(b) Reference books: "The C++ Standard Library", "C++ Primer", "TCPL", and others
(c) Development cycle: five days (conception, prototype, revision, correction, polishing)
2. Processing flow
......
(See the PDF version of this report.)
III. Advantages and disadvantages of the program
1. Advantages of the program
(a) Supports mixed Chinese and English text, as well as long articles (preferably no more than 10 million characters).
(b) Supports sentence splitting on multiple symbols. Sentences are currently split at the period "。", the question mark "？", and the exclamation mark "！", and further sentence-ending symbols can be added as needed.
(c) A base class is defined for feature templates, and all values of the current three templates (a), (b), and (c) are fully parameterized, which makes it easier to add templates in the future.
(d) The Sentenceins class can be instantiated independently of this program, which should greatly reduce the effort of reusing it in the future.
(e) The program is compact: all the code together is only 9 KB.
2. Problems encountered
During development, about three quarters of the time went into handling the edge cases exposed by testing. After validation against a large amount of text, more than 90% of the likely failure cases have been taken into account.
The following examples come from debugging the sentence-splitting function:
① While testing a passage, serious garbled output appeared after a sentence containing "really cheap. This order". Careful investigation traced the problem to the sentence-splitting function: the last byte of one character and the first byte of the next happened to be exactly the two bytes of the period "。", so the text was split in the wrong place. Once the cause was found, the problem was fixed.
② While testing a passage and checking its vector sequence, the number of sentences did not match the actual text. Investigation showed that at one point in the program the pos value should be -1 in order to cover all possible cases.
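The false sentence break in case ① can be reproduced in a few lines. The report does not state the text encoding, so this sketch assumes GB2312/GBK, in which every Chinese character and full-width symbol occupies two bytes and the period "。" is the byte pair 0xA1 0xA3. The function names are hypothetical. A byte-wise search can match across a character boundary, while a search that steps one whole character at a time cannot.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte-wise search for the period: may match across a character boundary.
std::size_t naiveFindPeriod(const std::string& gbk) {
    return gbk.find("\xA1\xA3");
}

// Character-aware search: advances one whole character at a time, so the
// trailing byte of one character is never paired with the next lead byte.
std::size_t findPeriod(const std::string& gbk) {
    std::size_t i = 0;
    while (i < gbk.size()) {
        unsigned char b = static_cast<unsigned char>(gbk[i]);
        if (b < 0x80) { ++i; continue; }   // single-byte ASCII character
        if (i + 1 >= gbk.size()) break;    // truncated double-byte character
        if (b == 0xA1 && static_cast<unsigned char>(gbk[i + 1]) == 0xA3) return i;
        i += 2;                            // skip the whole double-byte character
    }
    return std::string::npos;
}
```

For instance, "啊！" is encoded in GB2312 as B0 A1 A3 A1. It contains no period, yet the byte-wise search reports one at offset 1, whereas the character-aware search returns `std::string::npos`.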
3. Known defects
(a) Sentences that end with two symbols are not split correctly. Example one: “You're awesome!” The correct split point is after the closing quotation mark, but the program currently splits after the exclamation mark. Example two: “You really!?” The correct split point is after the question mark, but the program currently splits twice, once at the exclamation mark and once at the question mark.
(b) The program can process text that mixes Chinese and English, but it cannot deal effectively with text that is entirely in English.
(c) The program runs inefficiently, and processing text of more than 300,000 characters is very slow.
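One way defect (a) could be addressed is sketched below. This is a suggestion rather than anything the report's program does: after a sentence-ending symbol, the splitter keeps absorbing any further terminators or closing marks before it cuts. The function name and both symbol sets are hypothetical parameters, and characters are again held one per `std::string`. The sketch assumes opening and closing quotation marks are distinct characters, as with “ and ”.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

using Chars = std::vector<std::string>;  // one element per character

// Splits text into sentences. A sentence ends at a terminator, and any
// terminators or closing marks that immediately follow stay attached to it.
std::vector<Chars> splitSentences(const Chars& text,
                                  const std::set<std::string>& terminators,
                                  const std::set<std::string>& closers) {
    std::vector<Chars> sentences;
    Chars current;
    std::size_t i = 0;
    while (i < text.size()) {
        current.push_back(text[i]);
        bool isEnd = terminators.count(text[i]) > 0;
        ++i;
        if (!isEnd) continue;
        // Absorb trailing terminators ("!?") and closing marks (a closing quote).
        while (i < text.size() &&
               (terminators.count(text[i]) > 0 || closers.count(text[i]) > 0)) {
            current.push_back(text[i]);
            ++i;
        }
        sentences.push_back(current);
        current.clear();
    }
    if (!current.empty()) sentences.push_back(current);  // no final terminator
    return sentences;
}
```

Passing the terminators as a set keeps the list of sentence-ending symbols extensible, and the closers set would typically hold the closing quotation mark and closing brackets.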
VI. Code download
Program code: http://i1984.com/cpluspluscourse_05.rar
PDF version of this report: http://i1984.com/cpluspluscourse_05.pdf
Qing Xiang Rabbit
February 20, 2006