Regular Expressions written yesterday

Source: Internet
Author: User

Yesterday at my internship I spent the whole day writing regular expressions to match the data returned by several models of ONU switches. The test data is a text document of 200 MB+, containing more than 2 million records in total, returned by switches from 4 manufacturers.

To write a regular expression you first have to look at the data format. Notepad++ could not open the 200 MB+ text file, and after Sublime Text 2 opened it the computer became very slow, so testing matches inside Sublime Text was impossible; the editor kept freezing. I found a piece of Java code on the Internet for splitting a text file (thanks to stackoverflow.com; Source URL: look here). The code is included below.

After splitting the file, I pasted part of the data into the editor and started writing the regular expression. The basic idea is to first get the expression to match that part, then use Java to test it against the entire document, find the data that did not match, and iterate.
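The test-and-iterate loop described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used: the class name `UnmatchedLineFinder`, the draft pattern, and the three sample lines are all hypothetical, and a real run over a 200 MB+ file would read lines one at a time with a `BufferedReader` rather than hold them in a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: collects every line that the current draft pattern fails to match.
class UnmatchedLineFinder {

    // Returns the lines that the pattern does not match in full.
    static List<String> findUnmatched(List<String> lines, Pattern pattern) {
        List<String> unmatched = new ArrayList<>();
        for (String line : lines) {
            if (!pattern.matcher(line).matches()) {
                unmatched.add(line);
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // Assumed draft rule: exactly three fields separated by 2 or more spaces.
        Pattern draft = Pattern.compile("\\S+( {2,}\\S+){2}");

        // Made-up sample lines standing in for one chunk of the split file.
        List<String> sample = Arrays.asList(
                "a  b  c",     // matches the draft rule
                "a\tb\tc",     // tab-delimited, not covered yet
                "a  b");       // too few fields

        for (String line : findUnmatched(sample, draft)) {
            System.out.println("unmatched: " + line);
        }
    }
}
```

Each line reported as unmatched points at a format the draft rule does not cover yet, which is what drives the next revision of the expression.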

The whole process was fairly painful, as new formatting problems kept turning up. In the end I had to use two regular expressions for matching, because some vendors did not follow a unified format for their messages:

    1. Separation between fields: some vendors separate fields with a tab character, others with 2 or more space characters;
    2. Inside a field: some vendors join the parts of a field with "-", while other vendors' fields contain single space characters;
    3. Number of fields: the messages returned by most vendors have 12 fields, but one model of switch from a certain vendor returns ... 11 fields, using 2 space characters to separate the fields and single spaces inside a field.

With data like that in item 3 it is hard to tell how many fields there are; that vendor ... come on. After I had solved problems 1 and 2, I found that some records still could not be matched, and only after careful comparison with the data did I find that they had just 11 fields. While solving 1 and 2, I found that because the delimiters differ, "\s" alone cannot separate the fields of all the data:
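To make the delimiter problem concrete, here is a small sketch. The class name `DelimiterDemo` and the sample values ("port 1", "online", "model-x") are invented for illustration and are not taken from the real switch data. It shows that splitting on a bare `\s+` also cuts a field at its internal single space, while a format-specific delimiter keeps the field whole.

```java
import java.util.Arrays;

// Hypothetical demo of why one whitespace rule cannot serve both message formats.
class DelimiterDemo {

    // Splits a line on the given delimiter regex, keeping trailing empty fields.
    static String[] splitOn(String line, String delimiterRegex) {
        return line.split(delimiterRegex, -1);
    }

    public static void main(String[] args) {
        // Made-up lines: the first field contains a single internal space.
        String tabLine = "port 1\tonline\tmodel-x";
        String spaceLine = "port 1  online  model-x";

        // A bare \s+ also splits inside "port 1", giving 4 pieces instead of 3.
        System.out.println(Arrays.toString(splitOn(tabLine, "\\s+")));

        // Format-specific delimiters give 3 fields each.
        System.out.println(Arrays.toString(splitOn(tabLine, "\\t")));
        System.out.println(Arrays.toString(splitOn(spaceLine, "\\s{2,}")));
    }
}
```

This is why the two formats end up with separate rules: one keyed on tab characters, the other on runs of 2 or more whitespace characters.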

    1. Tab-delimited message format: the fields are separated by "\t". To match the single spaces inside a field, the rule '\t[\s\S]*?\t' is used: between two tab characters, lazily match any characters, whitespace or not.
    2. Message format delimited by 2 or more spaces: the fields are told apart by the number of consecutive whitespace characters, so the delimiter rule is '\s{2,}'. Matching the single spaces inside a field follows the same idea as in 1, and the rule becomes '\s{2,}[\s\S]*?\s{2,}'.
    3. Message data with only 11 fields: this is a subset of case 2, so the approach is to modify the matching rule for case 2. Fortunately the missing field is always the same column. For example, if the b field in the middle of "a  b  c" is missing, then a and c are separated by 4 space characters. Solution: after matching the a field, replace the rule '\s{2,}\S+\s{2,}' with '\s{2}[\s\S]*?\s{1,}'. This matches exactly 2 whitespace characters after the a field, then lazily matches any string of non-whitespace and whitespace characters, up to a run of at least 1 whitespace character. A clearer example:
#1 Data with the normal number of fields (fields separated by 2 spaces):
    a  b  cadkc  kdlsk  kslakd
#2 Data with field b missing (fields separated by 4 spaces):
    a    cadkc    kslakd
Matching rule:
    \S+\s{2}([\s\S]*?)\s{1,}\S+

The first \S+ matches field a, and \s{2} matches the 2 space characters after field a. [\s\S]*? lazily matches non-whitespace and whitespace characters any number of times, until it reaches "at least one whitespace character", which is the rule expressed by \s{1,}. At this point #1 has matched field b, while #2 has matched the empty string, because the 2 whitespace characters are immediately followed by another space character. The final \S+ matches field c.
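The rule from this example can be checked with a few lines of Java. This is only a sketch: the class name `MissingFieldDemo` and the helper `secondField` are made up for illustration, and the two sample lines are the ones from the example. Note that the simplified rule stops at the first whitespace after field b, so it does not by itself handle a field b that contains single spaces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the rule \S+\s{2}([\s\S]*?)\s{1,}\S+ on the two sample lines.
class MissingFieldDemo {

    // Group 1 captures field b, or the empty string when field b is missing.
    static final Pattern RULE = Pattern.compile("\\S+\\s{2}([\\s\\S]*?)\\s{1,}\\S+");

    // Returns the text captured for field b, or null if the rule does not match at all.
    static String secondField(String line) {
        Matcher m = RULE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String normal = "a  b  cadkc  kdlsk  kslakd";   // #1: all fields present
        String missing = "a    cadkc    kslakd";        // #2: field b missing

        System.out.println("[" + secondField(normal) + "]");   // prints [b]
        System.out.println("[" + secondField(missing) + "]");  // prints []
    }
}
```

An empty capture therefore signals a missing field, so the same expression can cover both the 12-field and the 11-field messages.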

Java source code for splitting a text file:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class TextFileSplit {

        public static void main(String[] args) {
            if (args.length != 2) {
                System.out.println("Invalid input!");
                return; // stop here, otherwise args[0] below would throw
            }

            // The first parameter is the file path
            File file = new File(args[0]);

            // The second parameter is the number of lines per output file
            int numLinesPerChunk = Integer.parseInt(args[1]);
            if (numLinesPerChunk <= 0) {
                System.out.println("Invalid input!");
                return; // a chunk size of 0 would never advance through the file
            }

            long start = System.currentTimeMillis();

            // try-with-resources closes the reader even if an exception is thrown
            try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                String line = reader.readLine();
                for (int i = 1; line != null; i++) {
                    // One writer per chunk, closed before the next chunk is opened
                    try (PrintWriter writer = new PrintWriter(
                            new FileWriter(args[0] + "_part" + i + ".txt"))) {
                        for (int j = 0; j < numLinesPerChunk && line != null; j++) {
                            writer.println(line);
                            line = reader.readLine();
                        }
                    }
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
                return;
            } catch (IOException e) {
                e.printStackTrace();
            }

            long end = System.currentTimeMillis();
            System.out.println("Taken Time[sec]:");
            System.out.println((end - start) / 1000);
        }
    }
