Introduction to and analysis of the standard CSV format, with an example parsing algorithm


Last Update:2017-01-18 Source: Internet

Author: User



CSV is an old data interchange format; the name stands for comma-separated values. It grew up without any standard: until 2005 there was no specification at all, so many CSV dialects exist, each self-consistent but incompatible with the others. For example, the name suggests that CSV is at least comma-delimited, yet some CSV variants use a semicolon (;) as the separator. A format without a standard tends to develop slowly, or even decline, because of fragmentation. The CSV format discussed in this article is based on the RFC 4180 specification released in 2005. Now that this specification exists, I think we should consciously follow it when we develop, although it still has some serious flaws.

The document that defines the CSV format is available from the IETF. If reading the English specification is inconvenient, you can read my summary of the rules below.

1. When a record contains no line break (CRLF, i.e. \r\n) in its data, the record is kept on one line and ends with \r\n.

aaa,bbb,ccc,dddCRLF    Legal

aaa,b    Illegal: the data contains no line break, yet the single record is wrapped onto a second line
bb,ccc,dddCRLF

2. The last record may omit its trailing line break (ending it with a line break is also legal).

aaa,bbb,ccc,dddCRLF
eee,fff,ggg,hhh    Legal

aaa,bbb,ccc,dddCRLF
eee,fff,ggg,hhhCRLF    Legal
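
Rules 1 and 2 can be sketched as a small record splitter. This is only an illustration under a stated assumption: no field contains a quoted line break (that case needs the quoting rules further down). The name `split_records` is hypothetical.

```cpp
#include <string>
#include <vector>

// Split CSV text into records on CRLF.
// Assumption: no quoted field contains a line break.
std::vector<std::string> split_records(const std::string& text) {
    std::vector<std::string> records;
    std::string::size_type start = 0;
    while (start < text.size()) {
        std::string::size_type end = text.find("\r\n", start);
        if (end == std::string::npos) {
            // Rule 2: the last record may omit its trailing CRLF.
            records.push_back(text.substr(start));
            break;
        }
        records.push_back(text.substr(start, end - start));
        start = end + 2;  // skip the CRLF itself
    }
    return records;
}
```

Both `"aaa,bbb\r\neee,fff"` and `"aaa,bbb\r\neee,fff\r\n"` yield the same two records, which is exactly what rule 2 allows.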

3. The first record may be a header. The header has the same format as the records that follow it, and every following record has the same number of fields as the header (in the examples, aaa, bbb, ccc and ddd are each one field). (Personally, I think this is a flaw in the RFC's design of the CSV format, because the rule gives us no way to tell from the data alone whether the first record is a header or an ordinary record. Of course, the RFC's designers must have had their reasons.)

index,characterCRLF    Legal. Read literally this could be a header, but it could equally be ordinary data
1,aCRLF
2,bCRLF

indexCRLF    Illegal: the number of fields is not uniform
1,aCRLF
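
Because the data cannot say whether its first record is a header, the reader of the file has to be told from outside. A minimal sketch, assuming the records have already been split into fields; `Record`, `Table` and `make_table` are hypothetical names, not part of the RFC.

```cpp
#include <string>
#include <vector>

typedef std::vector<std::string> Record;

struct Table {
    Record header;             // stays empty when there is no header
    std::vector<Record> rows;  // data records
};

// The caller states explicitly whether the first record is a header,
// since rule 3 offers no way to detect it from the file.
Table make_table(const std::vector<Record>& records, bool has_header) {
    Table table;
    std::vector<Record>::const_iterator first = records.begin();
    if (has_header && !records.empty()) {
        table.header = *first;
        ++first;
    }
    table.rows.assign(first, records.end());
    return table;
}
```

With the three-record example above, `has_header = true` gives a header plus two rows, while `has_header = false` gives three rows and an empty header.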

4. Each record consists of several fields separated by half-width (ASCII) commas (,). Every record has the same number of fields. The last field of a record must not be followed by a comma. Spaces are part of a field's content and cannot be ignored. (This rule packs in a relatively large amount of information.)

aaa,bbbCRLF    Legal
ccc,ddd,CRLF    Illegal: the last field of a record must not be followed by a comma
eee;ffffCRLF    Illegal: the separator must be a half-width comma, not a semicolon
ggg, h h h CRLF    Legal. Note the spaces in the second field; they are part of the field and cannot be ignored
iii,jjj,kkkkCRLF    Illegal: the number of fields differs from the records above
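
The checks in rule 4 can be sketched as follows. This is a simplified illustration that assumes unquoted records with the CRLF already removed; `split_fields` and `has_uniform_field_count` are hypothetical names.

```cpp
#include <string>
#include <vector>

// Split one unquoted record on ASCII commas. Spaces are kept verbatim,
// because rule 4 makes them part of the field.
std::vector<std::string> split_fields(const std::string& record) {
    std::vector<std::string> fields;
    std::string current;
    for (std::string::size_type i = 0; i < record.size(); ++i) {
        if (record[i] == ',') {
            fields.push_back(current);
            current.clear();
        } else {
            current += record[i];
        }
    }
    fields.push_back(current);  // the text after the last comma
    return fields;
}

// True when every record has the same number of fields as the first one.
bool has_uniform_field_count(const std::vector<std::string>& records) {
    for (std::vector<std::string>::size_type i = 1; i < records.size(); ++i) {
        if (split_fields(records[i]).size() != split_fields(records[0]).size()) {
            return false;
        }
    }
    return true;
}
```

In this sketch a trailing comma, as in `ccc,ddd,`, shows up as an extra empty field, so the record is caught by the field-count check.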

5. Each field may be enclosed in double quotes (but does not have to be). If a field is not enclosed in double quotes, it must not contain a double quote. (Implication: if a double quote appears in a field, the field must be enclosed in double quotes.)

"aaa",bbbCRLF    Legal
a"aa,bbbCRLF    Illegal: a"aa contains a double quote, but the field is not enclosed in double quotes
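
Rule 5 can be checked on one raw field (the text between two separators) with a sketch like the one below. The function name `obeys_rule5` is hypothetical, and the sketch assumes that the escaping inside a quoted field is validated elsewhere.

```cpp
#include <string>

// Rule 5 check on one raw field.
// A field that is not wrapped in double quotes must not contain any.
bool obeys_rule5(const std::string& raw) {
    const bool quoted = raw.size() >= 2 &&
                        raw[0] == '"' &&
                        raw[raw.size() - 1] == '"';
    if (quoted) {
        return true;  // quoted fields are allowed to contain quotes
    }
    return raw.find('"') == std::string::npos;
}
```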

6. If a field contains a double quote, a half-width comma, or a line break, the field must be enclosed in double quotes.

"a\r\naa",bbbCRLF    Legal: the first field contains a line break and is enclosed in double quotes
"a,aa",bbbCRLF    Legal

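When writing CSV, rule 6 turns into a simple test on each field. A minimal sketch with a hypothetical function name, `needs_quoting`:

```cpp
#include <string>

// Rule 6: a field holding a double quote, a comma, or a line-break
// character has to be enclosed in double quotes when written out.
bool needs_quoting(const std::string& field) {
    return field.find_first_of("\",\r\n") != std::string::npos;
}
```

Spaces alone do not force quoting here, because rule 4 already treats them as ordinary field content.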
7. When a double quote appears in a field, enclose the field in double quotes and write each double quote inside the field as a pair of double quotes.

"a""aa",bbbCRLF    Legal: the raw data is a"aa and bbb

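The escaping in rule 7 works in both directions. The sketch below shows a writer-side and a reader-side helper; `quote_field` and `unquote_field` are hypothetical names, and the reader side assumes its input is a well-formed quoted field.

```cpp
#include <string>

// Rule 7, writer side: wrap the field in quotes and double every inner quote.
std::string quote_field(const std::string& field) {
    std::string out = "\"";
    for (std::string::size_type i = 0; i < field.size(); ++i) {
        if (field[i] == '"') {
            out += '"';  // escape by emitting the quote twice
        }
        out += field[i];
    }
    out += '"';
    return out;
}

// Rule 7, reader side: strip the enclosing quotes and collapse "" into ".
// Assumption: `quoted` is a well-formed quoted field.
std::string unquote_field(const std::string& quoted) {
    std::string out;
    for (std::string::size_type i = 1; i + 1 < quoted.size(); ++i) {
        if (quoted[i] == '"' && i + 2 < quoted.size() && quoted[i + 1] == '"') {
            ++i;  // skip the first quote of the escaped pair
        }
        out += quoted[i];
    }
    return out;
}
```

Applied to the example above, `quote_field` turns the raw data a"aa into "a""aa", and `unquote_field` reverses it.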
With the rules above, we can write a matching extraction algorithm. Below is the core code I wrote at work for extracting the records from a CSV file.

// Core parsing routine (Win32). IsDoubleQuotes, IsCRLF, IsSep and IsX0D are
// helpers that are not shown in this excerpt; judging from how they are used,
// IsCRLF presumably matches '\r' or '\n' and IsX0D matches '\r' only.
BOOL CCsv2Json::Parse() {
    BOOL bSuc = FALSE;
    do {
        if (INVALID_HANDLE_VALUE == m_hFile) {
            break;
        }
        OVERLAPPED ov;
        memset(&ov, 0, sizeof(OVERLAPPED));
        BYTE lpBuffer[BUFFERSIZE] = {0};
        DWORD dwHaveRead = 0;
        std::string strSingle;               // field currently being built
        BOOL bFirstDoubleQuotes = FALSE;     // the field started with "
        BOOL bBeforeIsDoubleQuotes = FALSE;  // previous byte was " inside a quoted field
        BOOL bBeforeIsX0D = FALSE;           // previous byte was \r
        ListString listStr;                  // fields of the current record
        while (ReadFile(m_hFile, lpBuffer, sizeof(lpBuffer), &dwHaveRead, &ov)) {
            ov.Offset += dwHaveRead;

            for (DWORD dwIndex = 0; dwIndex < dwHaveRead; dwIndex++) {
                BYTE& by = *(lpBuffer + dwIndex);
                if (bFirstDoubleQuotes) {
                    // The field has an opening "
                    if (IsDoubleQuotes(by)) {
                        bBeforeIsX0D = FALSE;
                        if (bBeforeIsDoubleQuotes) {
                            // "" inside a quoted field is one literal "
                            strSingle.append(1, (char)(by));
                            bBeforeIsDoubleQuotes = FALSE;
                        } else {
                            bBeforeIsDoubleQuotes = TRUE;
                        }
                    } else {
                        if (bBeforeIsDoubleQuotes) {
                            // A single " followed by another byte closes the field
                            bFirstDoubleQuotes = FALSE;
                        }
                        bBeforeIsDoubleQuotes = FALSE;
                        if (IsCRLF(by)) {
                            if (bFirstDoubleQuotes) {
                                strSingle.append(1, (char)(by));
                            } else if (FALSE == bBeforeIsX0D) {
                                listStr.push_back(strSingle);
                                m_listListStr.push_back(listStr);
                                listStr.clear();
                                strSingle.clear();
                                bFirstDoubleQuotes = FALSE;
                            }
                            bBeforeIsX0D = IsX0D(by);
                        } else if (IsSep(by)) {
                            bBeforeIsX0D = FALSE;
                            if (bFirstDoubleQuotes) {
                                strSingle.append(1, (char)(by));
                            } else {
                                listStr.push_back(strSingle);
                                strSingle.clear();
                            }
                        } else {
                            bBeforeIsX0D = FALSE;
                            strSingle.append(1, (char)(by));
                        }
                    }
                } else {
                    if (IsDoubleQuotes(by)) {
                        bBeforeIsX0D = FALSE;
                        if (strSingle.empty()) {
                            // Empty field so far: this " opens a quoted field
                            bFirstDoubleQuotes = TRUE;
                            bBeforeIsDoubleQuotes = FALSE;
                        } else {
                            strSingle.append(1, (char)(by));
                        }
                        continue;
                    } else {
                        bBeforeIsDoubleQuotes = FALSE;
                        if (IsCRLF(by)) {
                            if (FALSE == bBeforeIsX0D) {
                                listStr.push_back(strSingle);
                                m_listListStr.push_back(listStr);
                                listStr.clear();
                                strSingle.clear();
                                bFirstDoubleQuotes = FALSE;
                                bBeforeIsDoubleQuotes = FALSE;
                            } else {
                                // The \n of a \r\n pair does not start a new row
                            }
                            bBeforeIsX0D = IsX0D(by);
                        } else if (IsSep(by)) {
                            bBeforeIsX0D = FALSE;
                            listStr.push_back(strSingle);
                            strSingle.clear();
                        } else {
                            bBeforeIsX0D = FALSE;
                            strSingle.append(1, (char)(by));
                        }
                    }
                }
            }
            memset(lpBuffer, 0, sizeof(lpBuffer));
        }
        if (!strSingle.empty() || !listStr.empty()) {
            // Flush a last record that has no trailing line break. Checking listStr
            // as well keeps a record whose final field is empty, such as "a,".
            listStr.push_back(strSingle);
            m_listListStr.push_back(listStr);
            listStr.clear();
            strSingle.clear();
        }
        bSuc = TRUE;
    } while (0);
    if (NULL != m_hFile && INVALID_HANDLE_VALUE != m_hFile) {
        CloseHandle(m_hFile);
        m_hFile = NULL;
    }
    return bSuc;
}
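
For readers who want to experiment without the Win32 file API, the same byte-by-byte idea can be tried on a string held in memory. The following is a minimal sketch and not the code from my project: `parse_csv` is a hypothetical name, the separator is assumed to be a comma, quoting is assumed to be well formed, and a bare \r or \n outside quotes is treated as the end of a record.

```cpp
#include <list>
#include <string>

typedef std::list<std::string> ListString;

// Parse CSV text held in memory into a list of records, each a list of fields.
std::list<ListString> parse_csv(const std::string& text) {
    std::list<ListString> rows;
    ListString row;
    std::string field;
    bool in_quotes = false;     // currently inside a double-quoted field
    bool row_has_data = false;  // something was read for the current record

    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (in_quotes) {
            if (c != '"') {
                field += c;  // commas and line breaks are literal here
            } else if (i + 1 < text.size() && text[i + 1] == '"') {
                field += '"';  // "" is an escaped quote
                ++i;
            } else {
                in_quotes = false;  // closing quote
            }
        } else if (c == '"' && field.empty()) {
            in_quotes = true;  // opening quote at the start of a field
            row_has_data = true;
        } else if (c == ',') {
            row.push_back(field);
            field.clear();
            row_has_data = true;
        } else if (c == '\r' || c == '\n') {
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') {
                ++i;  // consume the LF of a CRLF pair
            }
            row.push_back(field);
            rows.push_back(row);
            row.clear();
            field.clear();
            row_has_data = false;
        } else {
            field += c;
            row_has_data = true;
        }
    }
    if (row_has_data) {  // last record without a trailing line break
        row.push_back(field);
        rows.push_back(row);
    }
    return rows;
}
```

Keeping a single `in_quotes` flag and looking one character ahead replaces the separate "previous byte" flags, at the cost of needing the whole input in memory.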

The Parse function extracts a CSV file into a std::list<std::list<std::string>> structure. As the class name CCsv2Json suggests, its purpose is to convert a CSV file to JSON format, and I have also written code that converts a JSON file back to CSV format. The code is in the works.
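
To give an idea of what the JSON side might look like, here is a minimal sketch that renders such a std::list<std::list<std::string>> as a JSON array of arrays of strings. It is an illustration, not the conversion code from my project: `json_escape` and `rows_to_json` are hypothetical names, and bytes of 0x80 and above are passed through unchanged on the assumption that the input is UTF-8.

```cpp
#include <cstdio>
#include <list>
#include <string>

typedef std::list<std::string> ListString;

// Escape one string for use inside a JSON string literal.
std::string json_escape(const std::string& s) {
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {  // other control characters need \u00XX
                    char buf[8];
                    std::snprintf(buf, sizeof(buf), "\\u%04x",
                                  static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += s[i];
                }
        }
    }
    return out;
}

// Render the parsed rows as a JSON array of arrays of strings.
std::string rows_to_json(const std::list<ListString>& rows) {
    std::string out = "[";
    for (std::list<ListString>::const_iterator r = rows.begin();
         r != rows.end(); ++r) {
        if (r != rows.begin()) out += ",";
        out += "[";
        for (ListString::const_iterator f = r->begin(); f != r->end(); ++f) {
            if (f != r->begin()) out += ",";
            out += "\"" + json_escape(*f) + "\"";
        }
        out += "]";
    }
    out += "]";
    return out;
}
```

Every field is emitted as a JSON string, because a CSV field carries no type information of its own.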

Thank you for reading. I hope this article helps, and thank you for supporting this site!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
