How can I tell if a string is Java code or an English word?

Last Update:2015-03-05 Source: Internet

Author: User

Tags java keywords


Consider the following two strings:

1. for (int i = 0; i < b.size(); i++) {
2. do something in English (not necessarily a sentence).

A person can see at once that the first string is Java code and the second is English. So how can a computer program tell the two apart?

The Java code may not be parsable, because it is not necessarily a complete method, declaration, or expression. This article provides a workaround for that problem. Because the line between Java code and English is not always clear-cut, the solution is not 100% accurate. However, it needs only slight adjustment to fit your own requirements, and the relevant code can be downloaded from GitHub.

The basic idea is to convert a string into a sequence of tokens. For example, the code line above might become "key, separator, id, assign, number, separator, ...". We can then use some simple rules to differentiate Java code from English.
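To make the token idea concrete, here is a minimal sketch that prints token type names for a code fragment. It is not the article's Tokenizer: the class name `TokenNameSketch`, the rule names, and the deliberately tiny keyword list are assumptions chosen for illustration, and it uses only the JDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: the class name, rule names, and the short
// keyword list are assumptions for this sketch, not the article's code.
final class TokenNameSketch {

    // Ordered rules: the first pattern that matches at the start of the input wins.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("key", Pattern.compile("^(for|int|if|while|return)\\b"));
        RULES.put("number", Pattern.compile("^[0-9]+"));
        RULES.put("id", Pattern.compile("^[a-zA-Z][a-zA-Z0-9_]*"));
        RULES.put("assign", Pattern.compile("^="));
        RULES.put("separator", Pattern.compile("^[(){};,.]"));
    }

    // Repeatedly strips the leading token and records the name of the rule that matched it.
    static List<String> tokenNames(String input) {
        List<String> names = new ArrayList<>();
        String rest = input.trim();
        while (!rest.isEmpty()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(rest);
                if (m.find()) {
                    names.add(rule.getKey());
                    rest = rest.substring(m.end()).trim();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // No rule applies: record the character as unknown and skip it.
                names.add("unknown");
                rest = rest.substring(1).trim();
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints: key,separator,key,id,assign,number,separator
        System.out.println(String.join(",", tokenNames("for (int i = 0;")));
    }
}
```

The `\\b` word boundary in the keyword rule is a choice made for this sketch, so that a word such as `integer` is treated as one identifier rather than the keyword `int` followed by an identifier.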

A Tokenizer class that converts a string to a set of tokens is provided below.

    package lexical;

    import java.util.LinkedList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Tokenizer {

        // A token type: the pattern that recognizes it and its numeric code.
        private class TokenInfo {
            public final Pattern regex;
            public final int token;

            public TokenInfo(Pattern regex, int token) {
                super();
                this.regex = regex;
                this.token = token;
            }
        }

        // A recognized token: its numeric code and the matched text.
        public class Token {
            public final int token;
            public final String sequence;

            public Token(int token, String sequence) {
                super();
                this.token = token;
                this.sequence = sequence;
            }
        }

        private LinkedList<TokenInfo> tokenInfos;
        private LinkedList<Token> tokens;

        public Tokenizer() {
            tokenInfos = new LinkedList<TokenInfo>();
            tokens = new LinkedList<Token>();
        }

        // Registers a token type; the pattern is anchored to the start of the input.
        public void add(String regex, int token) {
            tokenInfos.add(new TokenInfo(Pattern.compile("^(" + regex + ")"), token));
        }

        public void tokenize(String str) {
            String s = str.trim();
            tokens.clear();
            while (!s.equals("")) {
                // System.out.println(s);
                boolean match = false;
                // Try each token type in registration order; the first match wins.
                for (TokenInfo info : tokenInfos) {
                    Matcher m = info.regex.matcher(s);
                    if (m.find()) {
                        match = true;
                        String tok = m.group().trim();
                        s = m.replaceFirst("").trim();
                        tokens.add(new Token(info.token, tok));
                        break;
                    }
                }
                if (!match) {
                    // throw new ParserException("Unexpected character in input: " + s);
                    tokens.clear();
                    System.out.println("Unexpected character in input: " + s);
                    return;
                }
            }
        }

        public LinkedList<Token> getTokens() {
            return tokens;
        }

        // Concatenates the numeric codes of all tokens, e.g. "1444".
        public String getTokensString() {
            StringBuilder sb = new StringBuilder();
            for (Tokenizer.Token tok : tokens) {
                sb.append(tok.token);
            }
            return sb.toString();
        }
    }

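We can define token types for Java keywords, separators and operators, numbers, and identifiers, and map each type to a digit (1 for keywords, 2 for separators and operators, 3 for numbers, 4 for identifiers). With this mapping, simple rules on the resulting digit string make it easy to distinguish Java code from English: a run of three or more identifiers, or a string made up only of identifiers, is treated as English.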

    package lexical;

    import org.apache.commons.lang.StringUtils;

    public class EnglishOrCode {

        private static Tokenizer tokenizer = null;

        public static void initializeTokenizer() {
            tokenizer = new Tokenizer();

            // keywords ("todo" is not a Java keyword; it is part of the original list)
            String keyString = "abstract assert boolean break byte case catch "
                    + "char class const continue default do double else enum "
                    + "extends false final finally float for goto if implements "
                    + "import instanceof int interface long native new null "
                    + "package private protected public return short static "
                    + "strictfp super switch synchronized this throw throws true "
                    + "transient try void volatile while todo";
            String[] keys = keyString.split(" ");
            String keyStr = StringUtils.join(keys, "|");

            tokenizer.add(keyStr, 1);
            // separators, operators, etc.
            tokenizer.add("\\(|\\)|\\{|\\}|\\[|\\]|;|,|\\.|=|>|<|!|~|"
                    + "\\?|:|==|<=|>=|!=|&&|\\|\\||\\+\\+|--|"
                    + "\\+|-|\\*|/|&|\\||\\^|%|\'|\"|\n|\r|\\$|\\#", 2);
            tokenizer.add("[0-9]+", 3); // number
            tokenizer.add("[a-zA-Z][a-zA-Z0-9_]*", 4); // identifier
            tokenizer.add("@", 4);
        }

        public static void main(String[] args) {
            initializeTokenizer();

            String s = "do something in English";
            if (isEnglish(s)) {
                System.out.println("English");
            } else {
                System.out.println("Java Code");
            }

            s = "for (int i = 0; i < b.size(); i++) {";
            if (isEnglish(s)) {
                System.out.println("English");
            } else {
                System.out.println("Java Code");
            }
        }

        // English if the token string contains three consecutive identifiers
        // or consists of identifiers only.
        private static boolean isEnglish(String replaced) {
            tokenizer.tokenize(replaced);
            String patternString = tokenizer.getTokensString();

            if (patternString.matches(".*444.*") || patternString.matches("4+")) {
                return true;
            } else {
                return false;
            }
        }
    }

Output:

    English
    Java Code
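The decision rule can also be tried in isolation, without the tokenizer. The following is a minimal sketch that applies the same two regular expressions to token strings written out by hand; the class and method names are hypothetical, and the sample token strings are assumptions about what the tokenizer above would produce for the inputs named in the comments.

```java
// Hypothetical helper: the class and method names are assumptions for this sketch.
final class TokenRuleSketch {

    // Token-type digits as used in the article: 1 = keyword,
    // 2 = separator or operator, 3 = number, 4 = identifier.
    static boolean looksLikeEnglish(String tokenString) {
        // Three identifiers in a row, or nothing but identifiers, suggests English.
        return tokenString.matches(".*444.*") || tokenString.matches("4+");
    }

    public static void main(String[] args) {
        // "do something in English": a keyword followed by three identifiers.
        System.out.println(looksLikeEnglish("1444")); // true
        // "hello world": identifiers only.
        System.out.println(looksLikeEnglish("44"));   // true
        // "i = 0;": identifier, operator, number, separator.
        System.out.println(looksLikeEnglish("4232")); // false
        // Empty token string, as left behind when tokenizing is abandoned.
        System.out.println(looksLikeEnglish(""));     // false
    }
}
```

The last case shows a consequence of the design: when the tokenizer hits an unexpected character it clears its token list, so the token string is empty and the input is reported as Java code. Depending on your data, you may prefer to handle that case separately.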


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
