Automatic Chinese text correction applied to movie and TV drama search, with a Java implementation

1. Background:

This week the project required correcting misspelled movie names typed into the search box, in order to improve the search hit rate and the user experience. I therefore looked into automatic Chinese text correction (more formally, proofreading) and implemented an initial version of the feature, which I record here.

2. Introduction:

Chinese input error proofing and correction means that when unusual or incorrect text is entered, the system prompts that the text is wrong; the simplest example is the red underline that appears as you type in Word. There are two main ideas for implementing this feature:

(1) Segmentation based on a large dictionary: the Chinese character string to be analyzed is matched against a large "machine dictionary"; if a matching entry is found, the check succeeds. This method is easy to implement and is well suited to input strings that are nouns or names belonging to one or a few specific domains;

(2) Segmentation based on statistical information: the most commonly used approach is the N-gram language model, which is in fact an (N-1)-order Markov model. Here is a brief introduction to the model:

By the chain rule of probability (the Bayes-style decomposition), the probability of the string x1 x2 ... xm appearing is the product of the conditional probabilities of each character given the characters before it:

P(x1 x2 ... xm) = P(x1) * P(x2 | x1) * P(x3 | x1 x2) * ... * P(xm | x1 ... x(m-1))

To simplify the calculation, assume that the appearance of character xi depends only on the preceding N-1 characters; the formula then becomes:

P(x1 x2 ... xm) ≈ ∏ P(xi | x(i-N+1) ... x(i-1))

This is the (N-1)-order Markov model. The probability computed this way is compared with a threshold, and if it falls below the threshold the string is flagged as misspelled.
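For the bigram case (N = 2) used later in this article, the scoring and threshold check can be sketched as follows. This is an illustrative sketch of mine, assuming a count map of the kind built in section 3.2, not code from the original project.

    import java.util.Map;

    // Illustrative bigram (2-gram) scorer; assumes a single map that holds both unigram
    // counts and counts of two adjacent tokens concatenated together (as built in 3.2).
    public class BigramScorer {
        private final Map<String, Integer> counts;

        public BigramScorer(Map<String, Integer> counts) {
            this.counts = counts;
        }

        // P(w2 | w1) ≈ count(w1w2) / count(w1), floored to avoid log(0)
        private double conditionalProbability(String w1, String w2) {
            int pair = counts.getOrDefault(w1 + w2, 0);
            int single = counts.getOrDefault(w1, 0);
            if (single == 0 || pair == 0) {
                return 1e-6;
            }
            return (double) pair / single;
        }

        // P(x1 x2 ... xm) ≈ product of P(xi | x(i-1)); flag the string if the product
        // falls below the given threshold
        public boolean looksMisspelled(String[] tokens, double threshold) {
            double logProb = 0.0;
            for (int i = 1; i < tokens.length; i++) {
                logProb += Math.log(conditionalProbability(tokens[i - 1], tokens[i]));
            }
            return Math.exp(logProb) < threshold;
        }
    }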

3. Implementation:

Because the Chinese strings entered in my project are basically names of movies, TV dramas and animation programs, the corpus is relatively stable, so here I use 2-gram, i.e. a bigram language model, combined with the dictionary segmentation method.

First, the idea:

Segment the corpus -> compute the frequency of bigram entries (with this corpus sample, frequency stands in for probability) -> segment the Chinese string to be analyzed and find its largest and second-largest continuous strings -> use the largest and second-largest continuous strings to match movie names in the corpus -> a partial match means the input is misspelled, and the corrected string is returned (so the dictionary is important).

Note: word segmentation here uses the ICTCLAS Java API.

On to the code:

Create the class ChineseWordProofread.
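Only the individual methods are shown below, so the class frame here is an assumed skeleton; the fields wordSeg and totalTokensCount are inferred from the method bodies in 3.1 and 3.2.

    import java.io.*;
    import java.util.*;

    // ICTCLAS2011 comes from the ICTCLAS Java API mentioned above; its import path
    // depends on the distribution you use.
    public class ChineseWordProofread {
        // ICTCLAS handle used by wordSegmentate in 3.1
        private ICTCLAS2011 wordSeg;
        // running total of unigram and bigram tokens counted in 3.2
        private int totalTokensCount = 0;

        // the methods from sections 3.1 to 3.4 go here
    }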

3.1 Initialize the word segmenter and segment the movie corpus

    public ICTCLAS2011 initWordSegmentation() {
        ICTCLAS2011 wordSeg = new ICTCLAS2011();
        try {
            String argu = "F:\\java\\workspace\\wordproofread"; // set your project path
            System.out.println("ICTCLAS_Init");
            if (ICTCLAS2011.ICTCLAS_Init(argu.getBytes("GB2312"), 0) == false) {
                System.out.println("Init fail!");
                return null;
            }
            // Set the POS tag set: 0 = ICT two-level tag set, 1 = ICT first-level tag set,
            // 2 = Peking University two-level tag set, 3 = Peking University first-level tag set
            wordSeg.ICTCLAS_SetPOSmap(2);
        } catch (Exception ex) {
            System.out.println("Word segmentation initialization failed");
            System.exit(-1);
        }
        return wordSeg;
    }

    public boolean wordSegmentate(String argu1, String argu2) {
        boolean ictclasFileProcess = false;
        try {
            // file segmentation: argu1 is the input file path, argu2 the output file path
            ictclasFileProcess = wordSeg.ICTCLAS_FileProcess(argu1.getBytes("GB2312"), argu2.getBytes("GB2312"), 0);
            ICTCLAS2011.ICTCLAS_Exit();
        } catch (Exception ex) {
            System.out.println("File process segmentation failed");
            System.exit(-1);
        }
        return ictclasFileProcess;
    }
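A possible way to wire these two methods together (my own usage sketch; prepareCorpus and the file names are placeholders, with movie_result.txt matching the file read in 3.2):

    // inside ChineseWordProofread (sketch): initialize ICTCLAS, then segment the raw corpus file
    public boolean prepareCorpus(String rawCorpusFile, String segmentedFile) {
        wordSeg = initWordSegmentation();   // assign the field used by wordSegmentate
        if (wordSeg == null) {
            return false;
        }
        // e.g. prepareCorpus("movie.txt", "movie_result.txt")
        return wordSegmentate(rawCorpusFile, segmentedFile);
    }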

3.2 Calculate the frequency of entries (tokens)

    public Map<String, Integer> calculateTokenCount(String afterWordSegFile) {
        Map<String, Integer> wordCountMap = new HashMap<String, Integer>();
        File movieInfoFile = new File(afterWordSegFile);
        BufferedReader movieBr = null;
        try {
            movieBr = new BufferedReader(new FileReader(movieInfoFile));
        } catch (FileNotFoundException e) {
            System.out.println("movie_result.txt file not found");
            e.printStackTrace();
        }
        String wordsLine = null;
        try {
            while ((wordsLine = movieBr.readLine()) != null) {
                String[] words = wordsLine.trim().split(" ");
                for (int i = 0; i < words.length; i++) {
                    // unigram count
                    int wordCount = wordCountMap.get(words[i]) == null ? 0 : wordCountMap.get(words[i]);
                    wordCountMap.put(words[i], wordCount + 1);
                    totalTokensCount += 1;
                    // bigram count: the current token concatenated with the next one
                    if (words.length > 1 && i < words.length - 1) {
                        StringBuffer wordStrBuf = new StringBuffer();
                        wordStrBuf.append(words[i]).append(words[i + 1]);
                        int wordStrCount = wordCountMap.get(wordStrBuf.toString()) == null ? 0 : wordCountMap.get(wordStrBuf.toString());
                        wordCountMap.put(wordStrBuf.toString(), wordStrCount + 1);
                        totalTokensCount += 1;
                    }
                }
            }
        } catch (IOException e) {
            System.out.println("Read movie_result.txt file failed");
            e.printStackTrace();
        }
        return wordCountMap;
    }
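A quick usage note (my own sketch, with made-up example tokens): because calculateTokenCount stores unigram counts and concatenated-bigram counts in the same map, the bigram frequency needed by the 2-gram model can be read off directly:

    // usage sketch inside ChineseWordProofread: estimate P(next | prev) from the count map
    Map<String, Integer> counts = calculateTokenCount("movie_result.txt");
    int prevCount = counts.getOrDefault("甄嬛", 0);    // unigram count of the example token "甄嬛"
    int pairCount = counts.getOrDefault("甄嬛传", 0);   // count of "甄嬛" immediately followed by "传"
    double bigramFreq = prevCount == 0 ? 0.0 : (double) pairCount / prevCount;
    System.out.println("P(传 | 甄嬛) ≈ " + bigramFreq);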

3.3 Find the correct tokens in the string to be parsed

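A minimal sketch of this step (an illustrative reconstruction of mine, not the original code), assuming the count map from 3.2 and an input string that has already been segmented into tokens: a token is taken to be "correct" when it, or its bigram with the following token, appears in the corpus counts.

    // Illustrative sketch: mark which tokens of the segmented input are supported by the corpus
    public boolean[] findCorrectTokens(String[] inputTokens, Map<String, Integer> wordCountMap) {
        boolean[] correct = new boolean[inputTokens.length];
        for (int i = 0; i < inputTokens.length; i++) {
            boolean unigramSeen = wordCountMap.getOrDefault(inputTokens[i], 0) > 0;
            // also accept a token if it forms a known bigram with its right-hand neighbour
            boolean bigramSeen = i + 1 < inputTokens.length
                    && wordCountMap.getOrDefault(inputTokens[i] + inputTokens[i + 1], 0) > 0;
            correct[i] = unigramSeen || bigramSeen;
        }
        return correct;
    }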

3.4 Get the largest and second-largest continuous strings (which may also be single characters)
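As described in the idea above, this step joins the longest and second-longest runs of correct tokens into candidate strings and matches them against the movie-title dictionary. The sketch below is my own illustration of that idea, not the original code; movieTitles is an assumed list of known titles.

    // Illustrative sketch: extract the largest and second-largest continuous strings and
    // use them to match a corrected movie title from the dictionary
    public String suggestTitle(String[] tokens, boolean[] correct, List<String> movieTitles) {
        List<String> runs = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (correct[i]) {
                current.append(tokens[i]);          // extend the current continuous string
            } else if (current.length() > 0) {
                runs.add(current.toString());       // close the run at an unsupported token
                current = new StringBuilder();
            }
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        runs.sort((a, b) -> b.length() - a.length());   // longest run first
        String largest = runs.isEmpty() ? "" : runs.get(0);
        String second = runs.size() > 1 ? runs.get(1) : "";
        for (String title : movieTitles) {
            // a partial match suggests the input was misspelled; return the corrected title
            if (!largest.isEmpty() && title.contains(largest)
                    && (second.isEmpty() || title.contains(second))) {
                return title;
            }
        }
        return null;   // no correction found
    }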
