An optimized approach to reading Excel tables saved as TXT (the most concise algorithm I currently know)

As far as I know, all Chinese companies that make large-scale games use Excel for their data: the programmers abstract a data interface, and the designers fill in the data in Excel. Apart from differences in editors, the way the data is used is almost the same everywhere.
Saving an Excel sheet as a TXT file gives a simple text table. For a program to read it, only two problems need to be solved:
1. Reading and parsing the data correctly by row and column
2. Storing the data

I intend to keep this post as short as possible. Let's go.

I. Table reading problems

Define two functions (declared in C++):
// Replace function. Purpose: replace every occurrence of pattern in the string s with repl.
void Replace(string& s, const string& pattern, const string& repl);
// Split function. Purpose: split the string s into N strings at the separator spl and put the result in the array v.
void Split(const string& s, const string& spl, vector<string>& v);
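
The original gives only the declarations of the two helpers. The following is a minimal sketch of how they might be implemented with std::string::find; the bodies, the pass-by-reference signatures and the handling of empty arguments are assumptions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Replace every occurrence of `pattern` in `s` with `repl`, in place.
// `s` is taken by reference so that the caller sees the change.
void Replace(std::string& s, const std::string& pattern, const std::string& repl) {
    if (pattern.empty()) return;  // an empty pattern would never advance
    std::size_t pos = 0;
    while ((pos = s.find(pattern, pos)) != std::string::npos) {
        s.replace(pos, pattern.size(), repl);
        pos += repl.size();       // skip the inserted text so it is not rescanned
    }
}

// Split `s` at every occurrence of `spl` and append the pieces to `v`.
// Empty pieces are kept, so "a\t\tb" yields three cells.
void Split(const std::string& s, const std::string& spl, std::vector<std::string>& v) {
    if (spl.empty()) { v.push_back(s); return; }
    std::size_t start = 0, pos;
    while ((pos = s.find(spl, start)) != std::string::npos) {
        v.push_back(s.substr(start, pos - start));
        start = pos + spl.size();
    }
    v.push_back(s.substr(start));  // the piece after the last separator
}
```

Because Replace does not rescan the text it has just inserted, a run of four quotation marks collapses to two rather than one, which is the behaviour the quote-halving step relies on.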

Algorithm:
// First read the entire file into s. It must be read in binary mode; otherwise the underlying layer translates the line breaks and the format is no longer the one described here. (Omitted.)
// Step 1:
Replace(s, "\x0d\x0a", "\x0d");
// Step 2:
Replace(s, "\t\"", "\t");
// Step 3:
Replace(s, "\"\t", "\t");
// Step 4:
Replace(s, "\"\"", "\"");
// At this point, special cases such as line breaks and quotation marks inside cells have been almost completely handled. What remains is a simple table with \x0d as the row separator and \t as the column separator.
// Step 5: split the file into lines:
Split(s, "\x0d", lines);
// Step 6: split each line into cells:
Split(line, "\t", cells);

Explanation:
Step 1 unifies the line endings. It is there to accommodate the Mac version of Excel, so that files from the Mac and Windows versions look the same. If you only care about the Windows version, you can omit step 1 and change the second parameter in step 5 to "\x0d\x0a".
The difficulty with line breaks is that a cell can itself contain special characters such as line breaks. I discovered by accident that in a file saved by Excel, the line break inside a cell is \x0a, while the line break at the end of a row is \x0d (followed by \x0a on Windows). This makes the two easy to tell apart, with no context analysis needed.
Steps 2 and 3 solve the following problem: when a cell contains special characters (such as a comma), Excel wraps the cell in a pair of double quotation marks. Such a quotation mark always comes right after or right before a \t, so this simple method is enough to remove it.
Step 4 solves the following problem: to distinguish the quotation marks it added itself in the previous case from the ones the user actually typed, Excel converts every quotation mark typed by the user into two consecutive quotation marks (for example, three quotation marks typed by the user become six). Because the automatically added quotation marks have already been removed by the previous steps, this step only needs to turn every two consecutive quotation marks into one.

Steps 5 and 6 need no explanation.
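
Put together, the six steps might look like the sketch below. ReadWholeFile, ParseTable and the Rows alias are hypothetical names that do not appear in the original, and dropping the empty piece after the final row ending is an added assumption. Replace and Split are repeated in compact form so that the block compiles on its own.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

using Rows = std::vector<std::vector<std::string>>;

static void Replace(std::string& s, const std::string& pattern, const std::string& repl) {
    if (pattern.empty()) return;
    for (std::size_t pos = 0; (pos = s.find(pattern, pos)) != std::string::npos; pos += repl.size())
        s.replace(pos, pattern.size(), repl);
}

static void Split(const std::string& s, const std::string& spl, std::vector<std::string>& v) {
    if (spl.empty()) { v.push_back(s); return; }
    std::size_t start = 0, pos;
    while ((pos = s.find(spl, start)) != std::string::npos) {
        v.push_back(s.substr(start, pos - start));
        start = pos + spl.size();
    }
    v.push_back(s.substr(start));
}

// Reads the file byte for byte. std::ios::binary stops the runtime from
// translating line endings. Returns an empty string if the file cannot be opened.
std::string ReadWholeFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Applies steps 1 to 6 to the raw bytes of a saved TXT table.
Rows ParseTable(std::string s) {
    Replace(s, "\x0d\x0a", "\x0d");  // step 1: unify row endings
    Replace(s, "\t\"", "\t");        // step 2: drop the opening quote after a tab
    Replace(s, "\"\t", "\t");        // step 3: drop the closing quote before a tab
    Replace(s, "\"\"", "\"");        // step 4: doubled quotes become one
    std::vector<std::string> lines;
    Split(s, "\x0d", lines);         // step 5: rows
    // Assumption: the file ends with a row ending, which leaves one empty piece.
    if (!lines.empty() && lines.back().empty()) lines.pop_back();
    Rows rows;
    for (const std::string& line : lines) {
        std::vector<std::string> cells;
        Split(line, "\t", cells);    // step 6: cells
        rows.push_back(cells);
    }
    return rows;
}
```

A call such as ParseTable(ReadWholeFile("items.txt")) would return one vector of cells per row, where items.txt is a placeholder file name. The sketch follows the steps exactly as stated, so a quoted cell in the first or last column, whose quotation mark sits next to a row boundary rather than a \t, keeps its quotation marks.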

A general solution for the special cases:
In my experience, the Replace calls in the algorithm above solve 99.9% of reading problems. What is left?
1. A \t or a quotation mark may appear inside a cell, although someone who does not know the special way to type them cannot enter them.
2. Mac Excel has a bug in which the line break inside a cell and the line break at the end of a row can both become \x0d (such a file also produces errors when opened in Windows Excel; the best remedy is to restore the file promptly).
For these special cases, I suggest using a separate table-checking tool, outside the main program of the project, that checks the correctness of the tables from several angles. Besides logic errors, it can also catch the special Excel cases above. You only need to remember the following:

-- Whatever appears between two standalone double quotation marks (ones that are not part of a doubled pair) must be a single cell.

With this rule, a single pass over the file is enough to find the special formats. I used this algorithm with Python's regular expression library, and it is quite interesting.
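
As an illustration of that rule, here is a minimal hand-written scanner in C++. It is a swapped-in technique, not the author's Python regular expression, which the original does not show. FindQuotedCells and FindSpecialCells are hypothetical names, and treating an unterminated quote as running to the end of the input is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collects the content of every quoted cell in the raw file text `s`.
// Inside a quoted cell a doubled quote stands for one literal quote;
// a quote that is not doubled closes the cell.
std::vector<std::string> FindQuotedCells(const std::string& s) {
    std::vector<std::string> cells;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] != '"') { ++i; continue; }  // outside any quoted cell
        std::string cell;
        std::size_t j = i + 1;               // first character after the opening quote
        while (j < s.size()) {
            if (s[j] != '"') { cell += s[j++]; continue; }
            if (j + 1 < s.size() && s[j + 1] == '"') { cell += '"'; j += 2; continue; }
            break;                           // standalone quote: end of the cell
        }
        cells.push_back(cell);               // an unterminated cell runs to the end of input
        i = j + 1;
    }
    return cells;
}

// Quoted cells that contain a column separator (\t) or a row separator (\x0d).
// These are the ones the Replace-based steps cannot read correctly.
std::vector<std::string> FindSpecialCells(const std::string& s) {
    std::vector<std::string> special;
    for (const std::string& cell : FindQuotedCells(s))
        if (cell.find_first_of("\t\x0d") != std::string::npos) special.push_back(cell);
    return special;
}
```

A checking tool could run FindSpecialCells on the raw file contents, before any Replace step, and report every cell it returns.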

II. Data format for storage

In my previous project, the data structure was this: a table contains lines, and a line contains cells, which matches intuition. The disadvantage is that a cell can have several data types (integer, decimal, string, and the array form of each), six in total. To represent them in a line struct, a complicated data structure has to be designed to accommodate them all. The first problem is that the code becomes hard to understand; the second is that many irrelevant variables are introduced, so when the table is large the memory usage is considerable.
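
To make the memory cost concrete, here is a minimal sketch of what such a row-oriented cell might look like. The Cell struct, its field names and the kOverheadPerIntCell constant are hypothetical; the original does not show the old project's actual layout.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical "one size fits all" cell: every cell carries a slot for each
// of the six possible types, although only one slot is ever used.
struct Cell {
    int                      int_value;
    double                   decimal_value;
    std::string              string_value;
    std::vector<int>         int_array;
    std::vector<double>      decimal_array;
    std::vector<std::string> string_array;
};

// Bytes that a plain integer cell carries beyond the int it actually needs.
constexpr std::size_t kOverheadPerIntCell = sizeof(Cell) - sizeof(int);
```

The exact figure depends on the standard library, but on common 64-bit implementations such a struct is typically on the order of a hundred bytes per cell, while the integer it holds needs only sizeof(int).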

The simple approach: store the data by column. I will not deprive readers of the pleasure of working out why, so think it over. It is one of those ideas that, once it clicks, solves several problems at a stroke.

The C++ way of storing values whose type is not known in advance is worth writing down.
class table
{
    vector<void*> data_cols; // the data of all columns; each column is a void*
};
Depending on the column type, data_cols is in effect treated as one of the following (for example):
vector<vector<int>*>;             // integer
vector<vector<string>*>;          // string
vector<vector<vector<int>>*>;     // integer array
vector<vector<vector<string>>*>;  // string array

In real code the three-level vector never actually appears; a column is generally used like this:
vector<vector<int>>* d = static_cast<vector<vector<int>>*>(data_cols[i]);
// assume a is a vector<int>
d->push_back(a);

Pitfalls:
new gets called a great many times to allocate memory.
Members such as data_cols cannot simply be cleared or deleted. In the destructor, each pointer must first be cast back to its real type and only then deleted or cleared; otherwise memory leaks, because the destructor of the underlying object is never called.
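
The following is a minimal sketch of a column-oriented table that handles the second pitfall. The col_type tag, the col_types member and all the member functions are assumptions added for illustration; the original shows only the data_cols member. Decimal columns are left out for brevity.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type tag (not in the original): remembers what each void*
// really points to, so that it can be cast back later.
enum class col_type { integer, string, integer_array, string_array };

class table {
public:
    table() = default;
    table(const table&) = delete;             // owns raw pointers: forbid copying
    table& operator=(const table&) = delete;

    ~table() {
        // Deleting a void* would not run the vector's destructor, so every
        // column is cast back to its real type before it is deleted.
        for (std::size_t i = 0; i < data_cols.size(); ++i) {
            switch (col_types[i]) {
                case col_type::integer:       delete as<std::vector<int>>(i); break;
                case col_type::string:        delete as<std::vector<std::string>>(i); break;
                case col_type::integer_array: delete as<std::vector<std::vector<int>>>(i); break;
                case col_type::string_array:  delete as<std::vector<std::vector<std::string>>>(i); break;
            }
        }
    }

    // Allocates an empty column of the given type and returns its index.
    std::size_t add_column(col_type t) {
        void* p = nullptr;
        switch (t) {
            case col_type::integer:       p = new std::vector<int>(); break;
            case col_type::string:        p = new std::vector<std::string>(); break;
            case col_type::integer_array: p = new std::vector<std::vector<int>>(); break;
            case col_type::string_array:  p = new std::vector<std::vector<std::string>>(); break;
        }
        data_cols.push_back(p);
        col_types.push_back(t);
        return data_cols.size() - 1;
    }

    // The caller must ask for the type the column was created with.
    std::vector<int>& int_col(std::size_t i) { return *as<std::vector<int>>(i); }
    std::vector<std::string>& string_col(std::size_t i) { return *as<std::vector<std::string>>(i); }
    std::vector<std::vector<int>>& int_array_col(std::size_t i) { return *as<std::vector<std::vector<int>>>(i); }
    std::vector<std::vector<std::string>>& string_array_col(std::size_t i) { return *as<std::vector<std::vector<std::string>>>(i); }

private:
    template <class T>
    T* as(std::size_t i) { return static_cast<T*>(data_cols[i]); }

    std::vector<void*>    data_cols;  // one pointer per column
    std::vector<col_type> col_types;  // real type behind each pointer
};
```

Here new is called once per column rather than once per cell, and the switch in the destructor is the single place that has to know every column type.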
