Introduction to and analysis of the standard CSV format, with an example parsing algorithm (C/C++)

Source: Internet
Author: User
Tags: extend, format, definition, rfc

CSV is a long-established data interchange format; the name stands for comma-separated values. It grew up without any standard, and until 2005 it had no formal specification at all: many CSV dialects exist, each self-consistent but incompatible with the others. For example, the name suggests that CSV is at least comma-delimited, yet some dialects use semicolons (;) as the separator. A format without a standard tends to develop slowly, or even decline, because of this fragmentation. The CSV format discussed in this article follows RFC 4180, published in 2005. Now that this specification exists, I think we should consciously follow it, even though it still has some serious flaws.

The document that defines the CSV format is available from the IETF. If you would rather not read the specification itself, you can go straight to my summary of its rules below.

1. Each record occupies one line and ends with a line break (CRLF, i.e. \r\n). A record whose content contains no line break must not be split across lines.

aaa,bbb,ccc,dddCRLF    legal

aaa,b                  illegal: the content contains no line break, yet the record is split across two lines

bb,ccc,dddCRLF

2. The last record may omit the trailing line break (including it is also legal).

aaa,bbb,ccc,dddCRLF

eee,fff,ggg,hhh        legal

aaa,bbb,ccc,dddCRLF
eee,fff,ggg,hhhCRLF    legal
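
Rules 1 and 2 can be sketched as a simple split on CRLF. This is only a minimal illustration, not the parsing code of this article: the function name splitRecords is hypothetical, and the sketch assumes that no field contains a quoted line break (quoted fields need a real state machine).

```cpp
#include <string>
#include <vector>

// Split raw CSV text into records on CRLF (rules 1 and 2).
// Assumption: no field contains a quoted line break.
std::vector<std::string> splitRecords(const std::string& text) {
    std::vector<std::string> records;
    std::string::size_type start = 0;
    while (start < text.size()) {
        const std::string::size_type end = text.find("\r\n", start);
        if (end == std::string::npos) {
            // Rule 2: the last record may lack a trailing CRLF.
            records.push_back(text.substr(start));
            break;
        }
        records.push_back(text.substr(start, end - start));
        start = end + 2;  // skip the CRLF itself
    }
    return records;
}
```

With this sketch, "aaa,bbb\r\nccc,ddd" and "aaa,bbb\r\nccc,ddd\r\n" both yield the same two records.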

3. The first record may be a header. The header has the same format as the records that follow, and it has the same number of fields as they do (in the examples, aaa, bbb, ccc and ddd are each one field). (Personally I think this is a flaw in the RFC's design of the CSV format, because the rule gives us no way to tell from the data itself whether the first record is a header or an ordinary record. Of course, the RFC authors must have had their reasons.)

index,characterCRLF    legal: we may read this line as a header, but we may equally read it as an ordinary record

1,aCRLF
2,bCRLF

indexCRLF              illegal: the number of fields is not uniform
1,aCRLF

4. Each record consists of one or more fields separated by ASCII (half-width) commas (,). Every record has the same number of fields. The last field of a record must not be followed by a comma. Spaces are part of a field's content and must not be ignored. (This rule packs in a lot of information.)

aaa,bbbCRLF            legal
ccc,ddd,CRLF           illegal: the last field of a record must not be followed by a comma
eee;ffffCRLF           illegal: fields must be separated by commas, not semicolons
ggg, h h h CRLF        legal: note the spaces in the " h h h " field; they are part of the field and must not be ignored
iii,jjj,kkkkCRLF       illegal: the number of fields differs from the records above
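
The field-count part of rules 3 and 4 can be checked with a short sketch like the one below. The names splitFields and hasUniformFieldCount are hypothetical, and the sketch assumes unquoted records whose CRLF has already been removed. Note that a trailing comma shows up here as an extra empty field, which is how the mismatch in the ccc,ddd, example would be detected.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split one unquoted record on commas. Spaces are kept as field content.
std::vector<std::string> splitFields(const std::string& record) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type comma = record.find(',', start);
        if (comma == std::string::npos) {
            fields.push_back(record.substr(start));
            return fields;
        }
        fields.push_back(record.substr(start, comma - start));
        start = comma + 1;
    }
}

// True when every record has as many fields as the first one.
bool hasUniformFieldCount(const std::vector<std::string>& records) {
    if (records.empty()) return true;
    const std::size_t expected = splitFields(records.front()).size();
    for (const std::string& record : records) {
        if (splitFields(record).size() != expected) return false;
    }
    return true;
}
```

Because the header has the same format as any other record, it passes through the same check; nothing in the data marks it as a header.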

5. Each field may be enclosed in double quotes (or left unquoted). A field that is not enclosed in double quotes must not contain double quotes. (Implication: if double quotes appear in a field, the field must be enclosed in double quotes.)

"aaa",bbbCRLF          legal
a"aa,bbbCRLF           illegal: the field a"aa contains a double quote but is not enclosed in double quotes

6. If a field contains double quotes, commas, or line breaks, the field should be enclosed in double quotes.

"a\r\na",bbbCRLF       legal: the first field contains a line break and is enclosed in double quotes
"a,aa",bbbCRLF         legal

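For a program that writes CSV, rule 6 comes down to one test per field. The following is a minimal sketch; the function name needsQuoting is hypothetical, and treating a lone \r or \n as a line break is my own conservative assumption.

```cpp
#include <string>

// True when a field must be enclosed in double quotes (rule 6):
// it contains a double quote, a comma, or a line-break character.
bool needsQuoting(const std::string& field) {
    return field.find_first_of("\",\r\n") != std::string::npos;
}
```

A field such as " h h h " does not need quotes, because spaces are ordinary content.
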
7. When a double quote appears inside a field, enclose the field in double quotes and write each double quote inside the field as a pair of double quotes.

"a""aa",bbbCRLF        legal: the raw data is a"aa and bbb

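The escaping in rule 7 and its inverse can be sketched as follows. Both function names (quoteField and unquoteField) are hypothetical, and unquoteField assumes its input is a well-formed quoted field.

```cpp
#include <string>

// Enclose a raw value in double quotes, doubling each embedded quote.
std::string quoteField(const std::string& raw) {
    std::string out = "\"";
    for (char c : raw) {
        if (c == '"') out += '"';  // escape by writing the quote twice
        out += c;
    }
    out += '"';
    return out;
}

// Reverse of quoteField. Assumes `quoted` is a well-formed quoted field.
std::string unquoteField(const std::string& quoted) {
    std::string out;
    // Start after the opening quote and stop before the closing one.
    for (std::string::size_type i = 1; i + 1 < quoted.size(); ++i) {
        if (quoted[i] == '"' && i + 2 < quoted.size() && quoted[i + 1] == '"') {
            ++i;  // collapse "" into a single "
        }
        out += quoted[i];
    }
    return out;
}
```

Applied to the raw value a"aa, quoteField produces "a""aa", and unquoteField turns that back into a"aa.
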
With the rules above, we can write the corresponding extraction algorithm. Below is the core code I wrote at work for extracting the data from a CSV file.

// Identifier capitalization is restored by convention. BUFFERSIZE, ListString,
// m_hFile, m_listListStr and the helpers IsDoubleQuotes, IsCRLF, IsX0D and
// IsSep are defined elsewhere in the project and are not shown here.
BOOL CCSV2Json::Parse() {
    BOOL bSuc = FALSE;
    do {
        if (INVALID_HANDLE_VALUE == m_hFile) {
            break;
        }
        OVERLAPPED ov;
        memset(&ov, 0, sizeof(OVERLAPPED));
        BYTE lpBuffer[BUFFERSIZE] = {0};
        DWORD dwHaveRead = 0;
        std::string strSingle;               // field currently being accumulated
        BOOL bFirstDoubleQuotes = FALSE;     // whether the first character of the field is "
        BOOL bBeforeIsDoubleQuotes = FALSE;  // previous byte was " inside a quoted field
        BOOL bBeforeIsX0D = FALSE;           // previous byte was \r
        ListString listStr;                  // fields of the current record
        while (ReadFile(m_hFile, lpBuffer, sizeof(lpBuffer), &dwHaveRead, &ov)) {
            ov.Offset += dwHaveRead;

            for (DWORD dwIndex = 0; dwIndex < dwHaveRead; dwIndex++) {
                BYTE& by = *(lpBuffer + dwIndex);
                if (bFirstDoubleQuotes) {
                    // The field opened with "
                    if (IsDoubleQuotes(by)) {
                        bBeforeIsX0D = FALSE;
                        if (bBeforeIsDoubleQuotes) {
                            // "" inside a quoted field is one literal "
                            strSingle.append(1, (char)(by));
                            bBeforeIsDoubleQuotes = FALSE;
                        } else {
                            bBeforeIsDoubleQuotes = TRUE;
                        }
                    } else {
                        if (bBeforeIsDoubleQuotes) {
                            // A single " followed by another byte closes the field
                            bFirstDoubleQuotes = FALSE;
                        }
                        bBeforeIsDoubleQuotes = FALSE;
                        if (IsCRLF(by)) {
                            if (bFirstDoubleQuotes) {
                                // Line break inside quotes is field content
                                strSingle.append(1, (char)(by));
                            } else if (FALSE == bBeforeIsX0D) {
                                listStr.push_back(strSingle);
                                m_listListStr.push_back(listStr);
                                listStr.clear();
                                strSingle.clear();
                                bFirstDoubleQuotes = FALSE;
                            }
                            bBeforeIsX0D = IsX0D(by);
                        } else if (IsSep(by)) {
                            bBeforeIsX0D = FALSE;
                            if (bFirstDoubleQuotes) {
                                // Separator inside quotes is field content
                                strSingle.append(1, (char)(by));
                            } else {
                                listStr.push_back(strSingle);
                                strSingle.clear();
                            }
                        } else {
                            bBeforeIsX0D = FALSE;
                            strSingle.append(1, (char)(by));
                        }
                    }
                } else {
                    if (IsDoubleQuotes(by)) {
                        bBeforeIsX0D = FALSE;
                        if (strSingle.empty()) {
                            // Empty field so far: this " opens a quoted field
                            bFirstDoubleQuotes = TRUE;
                            bBeforeIsDoubleQuotes = FALSE;
                        } else {
                            strSingle.append(1, (char)(by));
                        }
                        continue;
                    } else {
                        bBeforeIsDoubleQuotes = FALSE;
                        if (IsCRLF(by)) {
                            if (FALSE == bBeforeIsX0D) {
                                listStr.push_back(strSingle);
                                m_listListStr.push_back(listStr);
                                listStr.clear();
                                strSingle.clear();
                                bFirstDoubleQuotes = FALSE;
                                bBeforeIsDoubleQuotes = FALSE;
                            } else {
                                // The \n of a \r\n pair does not start a new row
                            }
                            bBeforeIsX0D = IsX0D(by);
                        } else if (IsSep(by)) {
                            bBeforeIsX0D = FALSE;
                            listStr.push_back(strSingle);
                            strSingle.clear();
                        } else {
                            bBeforeIsX0D = FALSE;
                            strSingle.append(1, (char)(by));
                        }
                    }
                }
            }
            memset(lpBuffer, 0, sizeof(lpBuffer));
        }
        // Flush the last record when the file does not end with a line break.
        // The listStr check also keeps a record whose final field is empty.
        if (false == strSingle.empty() || false == listStr.empty()) {
            //while (IsCRLF(strSingle.at(strSingle.length() - 1)) && strSingle.length() > 0) {
            //    strSingle = strSingle.substr(0, strSingle.length() - 1);
            //}
            listStr.push_back(strSingle);
            m_listListStr.push_back(listStr);
            listStr.clear();
            strSingle.clear();
        }
        bSuc = TRUE;
    } while (0);
    // Do not pass INVALID_HANDLE_VALUE to CloseHandle
    if (NULL != m_hFile && INVALID_HANDLE_VALUE != m_hFile) {
        CloseHandle(m_hFile);
        m_hFile = NULL;
    }
    return bSuc;
}

This code extracts a CSV file into a std::list<std::list<std::string>> structure. As the class name above suggests, its purpose is to convert a CSV file to JSON, and I have also written code that converts a JSON file back to CSV. That code is part of the project.
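
For readers who are not on Windows, the same idea can be sketched over an in-memory string. This is not the code from my project, only a compact illustration: the names Record and parseCsv are hypothetical, the input is assumed to be well-formed, a lone \r or \n outside quotes is accepted as a record terminator, and empty lines are skipped.

```cpp
#include <string>
#include <vector>

using Record = std::vector<std::string>;

// Parse CSV text into records of fields with a small state machine.
std::vector<Record> parseCsv(const std::string& text) {
    std::vector<Record> records;
    Record record;
    std::string field;
    bool inQuotes = false;      // currently inside a quoted field
    bool fieldStarted = false;  // the current record has pending content

    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (inQuotes) {
            if (c != '"') {
                field += c;        // commas and line breaks are data here
            } else if (i + 1 < text.size() && text[i + 1] == '"') {
                field += '"';      // "" is an escaped double quote
                ++i;
            } else {
                inQuotes = false;  // closing quote
            }
        } else if (c == '"' && field.empty()) {
            inQuotes = true;       // opening quote
            fieldStarted = true;
        } else if (c == ',') {
            record.push_back(field);
            field.clear();
            fieldStarted = true;   // a comma implies a following field
        } else if (c == '\r' || c == '\n') {
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') {
                ++i;               // treat \r\n as one terminator
            }
            if (fieldStarted) {
                record.push_back(field);
                records.push_back(record);
            }
            record.clear();
            field.clear();
            fieldStarted = false;
        } else {
            field += c;
            fieldStarted = true;
        }
    }
    if (fieldStarted) {            // last record without a trailing line break
        record.push_back(field);
        records.push_back(record);
    }
    return records;
}
```

The sketch keeps a single inQuotes flag and looks one character ahead for the doubled quote, instead of remembering the previous byte as the code above does. Lookahead is only convenient because the whole text is in memory; when reading a file in chunks, remembering the previous byte avoids trouble at buffer boundaries.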
