Consider the following two strings:
1. for (int i = 0; i < b.size(); i++) {
2. do something in English (not necessarily a sentence).
It's easy to see that the first is Java code and the second is English. So how can a program tell the two apart?
The Java code may not be parseable, because it is not necessarily a complete method (or declaration or expression), so a parser cannot settle the question. The approach below works around this problem. Sometimes there is no clear line between Java code and English, so the solution is not 100% accurate. However, it takes only slight adjustments to fit your own needs, and the code can also be downloaded from GitHub.
The basic idea is to turn a string into a sequence of tokens. For example, the code above might become "KEY,SEPARATOR,ID,ASSIGN,NUMBER,SEPARATOR,...". Some simple rules can then be used to tell Java code from English.
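As a rough illustration of this idea, the following sketch labels each token of an input string with a category name. The class name TokenLabelSketch, the shortened keyword list, and the exact set of labels are assumptions made for this example only. They are not the article's code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: maps a string to a list of token labels.
class TokenLabelSketch {

    // Rules are tried in insertion order; the first one that matches wins.
    // The keyword list is deliberately shortened for illustration.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("KEY", Pattern.compile("^(for|int|if|while|return)\\b"));
        RULES.put("NUMBER", Pattern.compile("^[0-9]+"));
        RULES.put("ID", Pattern.compile("^[a-zA-Z][a-zA-Z0-9_]*"));
        RULES.put("ASSIGN", Pattern.compile("^=(?!=)"));
        RULES.put("SEPARATOR", Pattern.compile("^[(){};,.]"));
    }

    /** Returns one label per token, or an empty list if some character matches no rule. */
    public static List<String> labels(String input) {
        List<String> out = new ArrayList<>();
        String rest = input.trim();
        while (!rest.isEmpty()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(rest);
                if (m.find()) {
                    out.add(rule.getKey());
                    // Drop the matched token and any whitespace that follows it.
                    rest = rest.substring(m.end()).trim();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                return new ArrayList<>();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints KEY,SEPARATOR,KEY,ID,ASSIGN,NUMBER,SEPARATOR
        System.out.println(String.join(",", labels("for (int i = 0;")));
    }
}
```

In this sketch the word-boundary check `\b` keeps a word such as "format" from being split into the keyword "for" plus an identifier. Whether that behaviour is wanted depends on the rules applied afterwards.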
A Tokenizer class that converts a string to a set of tokens is provided below.
package lexical;

import java.util.LinkedList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokenizer {

    // A token rule: a regular expression and the numeric token type it produces.
    private class TokenInfo {
        public final Pattern regex;
        public final int token;

        public TokenInfo(Pattern regex, int token) {
            super();
            this.regex = regex;
            this.token = token;
        }
    }

    // A recognized token: its numeric type and the matched text.
    public class Token {
        public final int token;
        public final String sequence;

        public Token(int token, String sequence) {
            super();
            this.token = token;
            this.sequence = sequence;
        }
    }

    private LinkedList<TokenInfo> tokenInfos;
    private LinkedList<Token> tokens;

    public Tokenizer() {
        tokenInfos = new LinkedList<TokenInfo>();
        tokens = new LinkedList<Token>();
    }

    // Registers a rule. The pattern is anchored so it only matches at the start of the input.
    public void add(String regex, int token) {
        tokenInfos.add(new TokenInfo(Pattern.compile("^(" + regex + ")"), token));
    }

    // Splits str into tokens, trying the rules in the order they were added.
    public void tokenize(String str) {
        String s = str.trim();
        tokens.clear();
        while (!s.equals("")) {
            // System.out.println(s);
            boolean match = false;
            for (TokenInfo info : tokenInfos) {
                Matcher m = info.regex.matcher(s);
                if (m.find()) {
                    match = true;
                    String tok = m.group().trim();
                    s = m.replaceFirst("").trim();
                    tokens.add(new Token(info.token, tok));
                    break;
                }
            }
            if (!match) {
                // throw new ParserException("Unexpected character in input: " + s);
                tokens.clear();
                System.out.println("Unexpected character in input: " + s);
                return;
            }
        }
    }

    public LinkedList<Token> getTokens() {
        return tokens;
    }

    // Concatenates the numeric token types, e.g. "1444".
    public String getTokensString() {
        StringBuilder sb = new StringBuilder();
        for (Tokenizer.Token tok : tokens) {
            sb.append(tok.token);
        }
        return sb.toString();
    }
}
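Token types can be defined for Java keywords, separators and operators, numbers, and identifiers. Each type is assigned a numeric code (1 for keywords, 2 for separators and operators, 3 for numbers, 4 for identifiers), so a string becomes a sequence of digits from which Java code and English are easy to tell apart.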
package lexical;

public class EnglishOrCode {

    private static Tokenizer tokenizer = null;

    public static void initializeTokenizer() {
        tokenizer = new Tokenizer();

        // keywords
        String keyString = "abstract assert boolean break byte case catch "
                + "char class const continue default do double else enum "
                + "extends false final finally float for goto if implements "
                + "import instanceof int interface long native new null "
                + "package private protected public return short static "
                + "strictfp super switch synchronized this throw throws true "
                + "transient try void volatile while todo";
        String[] keys = keyString.split(" ");
        String keyStr = String.join("|", keys);

        // Keywords are matched as prefixes, so "double" is read as the keyword "do"
        // followed by the identifier "uble".
        tokenizer.add(keyStr, 1);
        tokenizer.add("\\(|\\)|\\{|\\}|\\[|\\]|;|,|\\.|=|>|<|!|~|"
                + "\\?|:|==|<=|>=|!=|&&|\\|\\||\\+\\+|--|"
                + "\\+|-|\\*|/|&|\\||\\^|%|'|\"|\n|\r|\\$|\\#", 2); // separators, operators, etc.
        tokenizer.add("[0-9]+", 3); // number
        tokenizer.add("[a-zA-Z][a-zA-Z0-9_]*", 4); // identifier
        tokenizer.add("@", 4);
    }

    public static void main(String[] args) {
        initializeTokenizer();

        String s = "do something in English";
        if (isEnglish(s)) {
            System.out.println("English");
        } else {
            System.out.println("Java Code");
        }

        s = "for (int i = 0; i < b.size(); i++) {";
        if (isEnglish(s)) {
            System.out.println("English");
        } else {
            System.out.println("Java Code");
        }
    }

    // English if there are three identifiers in a row, or nothing but identifiers.
    private static boolean isEnglish(String replaced) {
        tokenizer.tokenize(replaced);
        String patternString = tokenizer.getTokensString();
        return patternString.matches(".*444.*") || patternString.matches("4+");
    }
}
Output:
English
Java Code
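Because the decision is made on the string of token-type digits alone, the rule is the natural place to adjust the program. The sketch below contrasts the rule used in isEnglish with a hypothetical alternative based on the share of identifier tokens. The class RuleTuningSketch, the method isEnglishByShare, and the 0.5 threshold are assumptions for illustration. The digit strings in main are what the tokenizer rules above should yield for the given inputs, assuming "++" is read as a single operator token.

```java
// Hypothetical sketch: two rules applied to a token-type string such as "1444".
class RuleTuningSketch {

    /** The rule from isEnglish: three identifiers in a row, or identifiers only. */
    public static boolean isEnglishByRun(String tokenTypes) {
        return tokenTypes.matches(".*444.*") || tokenTypes.matches("4+");
    }

    /** Assumed alternative: English if identifiers make up at least minShare of all tokens. */
    public static boolean isEnglishByShare(String tokenTypes, double minShare) {
        if (tokenTypes.isEmpty()) {
            return false; // nothing was tokenized
        }
        long identifiers = tokenTypes.chars().filter(c -> c == '4').count();
        return (double) identifiers / tokenTypes.length() >= minShare;
    }

    public static void main(String[] args) {
        String english = "1444";                // "do something in English"
        String code = "1214232424242224222";    // "for (int i = 0; i < b.size(); i++) {"
        String mixed = "414";                   // "the for loop"

        System.out.println(isEnglishByRun(english));        // true
        System.out.println(isEnglishByRun(code));           // false
        System.out.println(isEnglishByRun(mixed));          // false
        System.out.println(isEnglishByShare(mixed, 0.5));   // true
        System.out.println(isEnglishByShare(code, 0.5));    // false
    }
}
```

The two rules disagree on short English phrases that contain a Java keyword, such as "the for loop". Which behaviour is preferable depends on the kind of text being classified.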