Dream Code: A Programmer's Confession (2)

Source: Internet
Author: User
Tags: xml parser

This article is original, not a repost.


Airmax is a huge project planned to take 3 to 5 years. The name is an abbreviation built from the first letters and characteristic words of the company's main products, which also signals that the project will affect all of these pillar products. The products are design software for different fields. The project aims to consolidate resources in three areas by providing a unified library or framework for all products: program appearance (GUI), rendering (mainly the 3D engines), and file format (saving design results). These products are the company's current cash cows, and each is in a very strong position. This makes Airmax a very challenging project, one that needs superb organization and coordination to succeed.

The airdata project is one of the three pillars of Airmax and aims to unify the file format and its related libraries. "Data" in the name indicates the nature of the project. Later it was also called ADP, short for Axxx Data Package, although that abbreviation did not come into wide use until a year later. On the U.S. side, one architect was in charge of ADP, working with three other programmers. On our side, I believe we started with four people; one colleague moved to another project group after only a few days for lack of interest.

At the beginning of the project, what I most wanted to know was how the project was positioned and which problems it was meant to solve. I put these questions to the American designers and got only empty answers. I never obtained a clear picture of the project's blueprint, and the designers did not seem to care. "Unified file format and library" is not an adequate answer; it cannot serve as a beacon for a project. Stuck in this situation, we could only work on the tasks for which we had clear instructions. The first was to understand OPC. OPC, the Open Packaging Conventions, is a standard proposed by Microsoft for organizing the contents of a file package. It was to be supported directly by the operating system in Windows Vista, which had not yet been released. Office 2007 had already adopted OPC: docx, pptx, and the related formats are OPC packages.

Although I was only a foot soldier on ADP, I still hoped it would succeed and that I would one day be proud of my contribution. By my own simple reasoning, the general direction of ADP was correct. There was no technical risk or great difficulty. It was a new project with no historical baggage. Many programmers dream of such a project, but two hidden worries made me uneasy. One of them was file storage. As mentioned above, our files had to comply with the OPC specification. Simply put, an OPC package is a ZIP file, and OPC specifies how the files inside the ZIP are organized. Our data would be packed into the ZIP file as XML files. As for XML-based data, the company was said to have tried it before and failed because of performance problems. I have never liked XML. Apart from the fact that even a third-rate programmer can write a "usable" XML parser, I cannot see any technical advantage in it. People who write configuration files in XML seem to me to have lost their minds: is XML readable? Likewise, compared with writing JavaScript, isn't HTML's clumsy style uncomfortable? Unfortunately, XML still holds a dominant position in the market. Why? I don't know. I have always uncharitably suspected that XML caters to the large number of inferior programmers who do not understand lexical analyzers. JSON, by comparison, is at least readable. But I digress.

Unlike Office products, our software usually produces large files; sizes measured in megabytes are common. With a very verbose representation, file sizes would balloon to a dangerous level. Another problem is that the existing design files have complicated internal structures. For example, one product keeps something similar to a database inside its design file. If we simply converted such files to XML and compressed them into ZIP files, we would not only reduce performance to an unacceptable level but also break the existing file access workflow. Therefore, "unified file storage" cannot mean a single, all-purpose file format. In my opinion, we need at least two file storage mechanisms. One is a high-performance mechanism used while the product is working. It is tied to the machine the product runs on, and its goal is to squeeze every resource on that machine for performance. The other is for preserving design results. Such files exist for the purpose of exchange: a file generated by one product must be readable by other products, on other operating systems, and on other CPU architectures. For exchange, the performance requirements can be relaxed, but the format must be compact and strictly backward compatible.

Neither problem is easy to solve, but with the goals fixed, working through the problems one by one is a steady path. Unfortunately, the performance-oriented design was never considered separately. What ADP actually worked on was always the exchange-oriented part, while also trying to make that part fast. To be interchangeable, what ADP needed was a specification, or "protocol", for the package format. OPC only settles the high-level structure of the package. We still had to think carefully about how the individual files should be interpreted and how the data should be stored. At that time, data (corresponding to the basic data types of C++) was stored in two formats, text and binary. How exactly should each be read and written? Given what "interchangeable" implies, without a designed specification the stored files would inevitably be riddled with holes. Once files containing such errors are in wide use, fixing them becomes very expensive or even impossible. I wrote a very long email to raise my concerns. The issue did not surface immediately, but it went on to plague development and testing again and again.
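To make the point about a specification concrete, here is a minimal sketch of what pinning down the binary form of two basic types could look like. The helper names (`put_u32_le`, `get_u32_le`, `put_f32_le`, `get_f32_le`) and the choice of little-endian IEEE 754 are assumptions made for illustration; nothing here reflects what ADP actually specified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers, not part of any real ADP API.
// Assumption: the exchange format fixes little-endian byte order and
// 32-bit IEEE 754 floats, regardless of the machine that writes the file.

// Append a 32-bit unsigned integer as 4 bytes, least significant byte first.
void put_u32_le(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        out.push_back(static_cast<std::uint8_t>((v >> (8 * i)) & 0xFFu));
    }
}

// Read a 32-bit unsigned integer stored least significant byte first.
std::uint32_t get_u32_le(const std::vector<std::uint8_t>& in, std::size_t offset) {
    assert(offset + 4 <= in.size());
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<std::uint32_t>(in[offset + i]) << (8 * i);
    }
    return v;
}

// Store a float by copying its bit pattern into an integer first,
// so the byte order on disk does not depend on the host CPU.
void put_f32_le(std::vector<std::uint8_t>& out, float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);
    put_u32_le(out, bits);
}

// Read a float written by put_f32_le.
float get_f32_le(const std::vector<std::uint8_t>& in, std::size_t offset) {
    const std::uint32_t bits = get_u32_le(in, offset);
    float f = 0.0f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

Because the byte order is produced by shifts rather than by copying the integer's memory directly, a file written on one CPU architecture reads back identically on another, which is the property an exchange format needs.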

Besides file storage, ADP wanted to do one more thing: manage data at runtime. I did not understand this at first. The conventional ways to store objects are serialization, ORM, and hand-written read/write code. ADP did none of these and instead offered an approach that I considered unsound. Look at runtime data management from the product's point of view. To write data, the product first creates a dataset through the ADP API, then creates a memory view, then writes the data (and types) of its own objects into the view; ADP then scans the view and writes it to an external file. Reading is the reverse. For the same logical data, such as a point, each product has long had its own implementation, and because the code is not shared, the C++ point of product A is clearly different from that of product B, sometimes even in its member functions. At the data model level, however, the two points are the same: both represent a point in space, and both have X, Y, and Z of type float. This means ADP cannot provide one point class that matches every product's. ADP can only define its own point, and each product must convert the ADP point into its own. In that case the ADP data type is only an intermediate product. If all products agreed on the same storage format for a point, the ADP memory block layer would not be needed at all; each product could simply read the file data and convert it into its own types. The objects ADP manages are purely redundant: an extra layer that is both slow and wasteful of memory. At the time I sensed the waste, but I also felt that ADP had no other choice. In other words, intuition and experience told me there was a problem, but I could not find its root cause.
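The situation with points can be sketched as follows. All type and function names here (`PointA`, `PointB`, `AdpPoint`, `to_adp`, `to_point_a`, `to_point_b`) are hypothetical stand-ins, not code from any product or from ADP; the sketch only shows why an intermediate type forces one extra conversion in each direction.

```cpp
#include <cassert>

// All names below are hypothetical illustrations, not real product or ADP types.

// Product A's point: plain public members.
struct PointA {
    float x, y, z;
};

// Product B's point: same data model, different C++ shape and accessors.
class PointB {
public:
    PointB(float x, float y, float z) : c_{x, y, z} {}
    float operator[](int i) const {
        assert(i >= 0 && i < 3);
        return c_[i];
    }
private:
    float c_[3];
};

// The intermediate point that a shared layer has to define for itself.
struct AdpPoint {
    float x, y, z;
};

// Every product needs a conversion into the intermediate type...
AdpPoint to_adp(const PointA& p) { return AdpPoint{p.x, p.y, p.z}; }
AdpPoint to_adp(const PointB& p) { return AdpPoint{p[0], p[1], p[2]}; }

// ...and another one back out of it.
PointA to_point_a(const AdpPoint& p) { return PointA{p.x, p.y, p.z}; }
PointB to_point_b(const AdpPoint& p) { return PointB(p.x, p.y, p.z); }
```

A point written by product A and read by product B passes through `to_adp` and then `to_point_b`; the `AdpPoint` in between carries no information that the file itself would not carry.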

ADP also defined some basic numeric types corresponding to the basic types of C++. A memory block was treated as contiguous storage for data. To access a value, you had to know its offset and type within the memory block, and an ADP function then read it for you. This way of reading is of course inconvenient and dangerous. Because there was no metadata describing the data types and layout, the application had to keep precise control over the data itself. By then all the fixed-length data types had been written, but one type, the string, was still unfinished because its length is not fixed. How should a string be stored? This was what colleague O in the United States called a "challenge".
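The access style described above can be sketched roughly like this. `MemoryBlock`, `write_at`, and `read_at` are invented names for illustration and are not the ADP API; the sketch assumes values are copied in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical sketch; MemoryBlock, write_at and read_at are invented names.
// The block is just raw bytes: it stores no record of which type lives where.
class MemoryBlock {
public:
    explicit MemoryBlock(std::size_t size) : bytes_(size, 0) {}

    // Copy a fixed-size value into the block at a caller-chosen offset.
    template <typename T>
    void write_at(std::size_t offset, const T& value) {
        static_assert(std::is_trivially_copyable<T>::value, "fixed-size types only");
        assert(offset + sizeof(T) <= bytes_.size());
        std::memcpy(bytes_.data() + offset, &value, sizeof(T));
    }

    // Copy a fixed-size value out of the block. The caller must supply both
    // the offset and the type; the block cannot check either against reality.
    template <typename T>
    T read_at(std::size_t offset) const {
        static_assert(std::is_trivially_copyable<T>::value, "fixed-size types only");
        assert(offset + sizeof(T) <= bytes_.size());
        T value;
        std::memcpy(&value, bytes_.data() + offset, sizeof(T));
        return value;
    }

    std::size_t size() const { return bytes_.size(); }

private:
    std::vector<std::uint8_t> bytes_;
};
```

After `block.write_at<std::int32_t>(0, 42)`, the call `block.read_at<std::int32_t>(0)` returns 42, but `block.read_at<float>(0)` also compiles and silently reinterprets the same bytes. That is the danger of having no metadata about type and layout.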

There were two obvious options at the time. One was to write the string into the memory block itself, prefixed by its length. The other was to store strings in a separate string array. Each has advantages and disadvantages. Consider a C++ struct that contains string data members. The advantage of the first option is that the data is stored contiguously, so copying is easy: a single memcpy. The disadvantage is that you can no longer simply look up the offset of a given data member in the memory block. Because strings are variable-length, you must scan from the beginning to compute the offset of a member. With the second option it is the other way around: locating data members in the memory block stays easy, but a single memcpy clearly cannot work because an extra, separately allocated piece of memory is involved. ADP wanted memcpy so that data could be written to files directly, and it also wanted data members to be easy to locate. This dilemma was the source of the "challenge". I chose the second option.
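The trade-off between the two options can be sketched with a small record. Everything here is hypothetical: the record shape `{ int32 id; string name; float weight; }`, the helper names, and the use of host byte order are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical record used for illustration: { int32 id; string name; float weight; }
// Multi-byte values are copied in host byte order to keep the sketch short.
using Bytes = std::vector<std::uint8_t>;

// Append the raw bytes of a fixed-size value.
template <typename T>
void append(Bytes& out, const T& v) {
    const auto* p = reinterpret_cast<const std::uint8_t*>(&v);
    out.insert(out.end(), p, p + sizeof(T));
}

// Read a fixed-size value at a byte offset.
template <typename T>
T load(const Bytes& in, std::size_t offset) {
    assert(offset + sizeof(T) <= in.size());
    T v;
    std::memcpy(&v, in.data() + offset, sizeof(T));
    return v;
}

// Option 1: the string lives inline in the block, prefixed by its length.
// The whole record is one contiguous run of bytes.
Bytes pack_inline(std::int32_t id, const std::string& name, float weight) {
    Bytes out;
    append(out, id);
    append(out, static_cast<std::uint32_t>(name.size()));
    out.insert(out.end(), name.begin(), name.end());
    append(out, weight);
    return out;
}

// With option 1 the offset of `weight` depends on the string length,
// so it has to be computed by walking the fields that precede it.
std::size_t weight_offset_inline(const Bytes& block) {
    const std::size_t len_offset = sizeof(std::int32_t);
    const auto len = load<std::uint32_t>(block, len_offset);
    return len_offset + sizeof(std::uint32_t) + len;
}

// Option 2: the block holds a fixed-size index into a separate string array.
struct TableRecord {
    Bytes block;                       // fixed layout: id, name index, weight
    std::vector<std::string> strings;  // separately allocated string storage
};

// With option 2 the offset of `weight` is a compile-time constant.
constexpr std::size_t kWeightOffsetTable =
    sizeof(std::int32_t) + sizeof(std::uint32_t);

TableRecord pack_table(std::int32_t id, const std::string& name, float weight) {
    TableRecord r;
    r.strings.push_back(name);
    append(r.block, id);
    append(r.block, static_cast<std::uint32_t>(r.strings.size() - 1));
    append(r.block, weight);
    return r;
}
```

With the inline layout, changing the name from "bolt" to "long bolt" moves `weight` from offset 12 to offset 17. With the string array, `weight` always sits at offset 8, but copying the record now means copying the block and the strings separately.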

One reason for choosing the second option was that the way users accessed data was already poor (troublesome and dangerous) and could not be allowed to get any more complicated. In addition, memcpy was meant for the I/O path, and I/O is slow anyway, so handling a few extra pieces of memory there makes little difference. Another, implicit reason was that ADP in practice had to know the layout of the objects in a memory block; otherwise it could not reconstruct the memory block from text when reading. ADP had not recognized this at the time, so there was no related API or dedicated abstraction, yet it was effectively doing this already. I believed that abstracting data layout, and even data types, would become necessary in the future, so work done on that assumption would not be wasted. So I added a string type, which was actually backed by a C++ character array. Then I carefully handled file reading and writing, memory copying, allocation, and release, and felt the work was nearly done. A week later, I sent the result to O in the United States.
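The claim that a block cannot be rebuilt from text without knowing its layout can be illustrated with a small sketch. The layout description below (`FieldType`, `FieldDesc`, `block_from_text`) is entirely hypothetical; the article says ADP had no such API at the time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical layout description; none of these names come from ADP.
enum class FieldType { Int32, Float32 };

struct FieldDesc {
    std::string name;    // member name, for documentation only in this sketch
    FieldType type;      // how to parse the text and how many bytes to write
    std::size_t offset;  // where the value lives inside the memory block
};

// Size in bytes of each supported field type.
std::size_t field_size(FieldType t) {
    switch (t) {
        case FieldType::Int32:   return sizeof(std::int32_t);
        case FieldType::Float32: return sizeof(float);
    }
    return 0;
}

// Rebuild a memory block from text values, one value per field.
// Without the layout there would be no way to know the type or
// the position of each value, so this function could not exist.
std::vector<std::uint8_t> block_from_text(const std::vector<FieldDesc>& layout,
                                          const std::vector<std::string>& values) {
    assert(layout.size() == values.size());

    // The block must be large enough to hold the field that ends last.
    std::size_t total = 0;
    for (const FieldDesc& f : layout) {
        const std::size_t end = f.offset + field_size(f.type);
        if (end > total) total = end;
    }

    std::vector<std::uint8_t> block(total, 0);
    for (std::size_t i = 0; i < layout.size(); ++i) {
        const FieldDesc& f = layout[i];
        if (f.type == FieldType::Int32) {
            const auto v = static_cast<std::int32_t>(std::stol(values[i]));
            std::memcpy(block.data() + f.offset, &v, sizeof v);
        } else {
            const float v = std::stof(values[i]);
            std::memcpy(block.data() + f.offset, &v, sizeof v);
        }
    }
    return block;
}
```

Given a layout of an `Int32` at offset 0 and a `Float32` at offset 4, the text values "7" and "1.5" become an 8-byte block. The same table could also drive typed, checked access, which is why a layout abstraction seemed worth anticipating.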

Meanwhile, the China side seemed to be designing the I/O implementation. Designer G in the United States apparently planned to access data through istream/ostream interfaces, but these are widely known to be terrible interfaces. Besides istream and ostream, there are the corresponding streambuf interfaces and localization support to deal with. For that reason G was set on implementing it with boost iostream. This was a technology-driven choice rather than a problem-driven one. I did not want any part of it. As far as I remember, it never worked as well as hoped (mainly a performance problem, it seems; @Tony, please fill in the details, I can't remember), and in the end it disappeared from the code completely.

During the same period I was also actively reading the existing ADP code, and I found that all of the infrastructure had been copied from the DwF project without any improvement. My heart sank. The problems that existed in the original were all still there. Fortunately, the string type from that source had not been copied; it seems my earlier report had had some effect. This also gave me the idea of writing a brand new string, because the string type is too important among the ADP types, and I had never considered building on C strings. Out of dissatisfaction with the code quality, I planned to do some preliminary work in this area.
