Data Organization for a report: a file or a database?

Last Update:2014-08-05 Source: Internet

Author: User



In a report development project, the source data for a report can be stored in a database or in files. Take the website operations report system of an internet company as an example: basic information about registered users comes from the website system and is stored in an Oracle database, while user operation data comes from the website's system log files, which are text files. The common practice is to import the user operation data from the text files into Oracle and then use SQL statements to extract and calculate the data.

Is it best practice to put all of a report's data in the database? Could all or part of the report data be kept in the file system instead? What are the pros and cons of the two approaches?

Here we compare two approaches on several dimensions: a reporting tool that uses a Java program to access data files (file reports for short), and a reporting tool that accesses a relational database directly (database reports for short). In the file-report approach, the Java program is used to compute the report's source data.

I. Data read speed

A report application reads data, performs calculations, and renders pages. For reading, a file is read directly through the operating system, so the speed depends on the I/O speed of the hard disk. A database has to be read through JDBC, and because the data stream must be converted into objects, the JDBC drivers of today's mainstream databases are relatively slow. This problem has never been fully solved.

In this respect, file reports score 5 points and database reports score 3 points.
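To make the file-read path concrete, here is a minimal Java sketch of reading an operation log directly from disk, with no JDBC layer in between. The class name `LogFileReader`, the `LogRecord` type, and the tab-separated layout (user ID, action, timestamp) are assumptions for illustration; the article does not specify the log format.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for a tab-separated operation log: userId, action, timestamp. */
final class LogFileReader {

    /** One parsed log line; the three-field layout is an assumption. */
    static final class LogRecord {
        final String userId;
        final String action;
        final String timestamp;

        LogRecord(String userId, String action, String timestamp) {
            this.userId = userId;
            this.action = action;
            this.timestamp = timestamp;
        }
    }

    /** Parses raw lines into records, skipping lines that do not have exactly three fields. */
    static List<LogRecord> parseLines(List<String> lines) {
        List<LogRecord> records = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t", -1);
            if (fields.length != 3) {
                continue; // blank or malformed line
            }
            records.add(new LogRecord(fields[0], fields[1], fields[2]));
        }
        return records;
    }

    /** Reads the whole file through the operating system and parses it. */
    static List<LogRecord> readAll(Path logFile) {
        try {
            // readAllLines loads the file into memory; a very large log would be streamed instead.
            return parseLines(Files.readAllLines(logFile, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The parsing is kept separate from the file access so that the same logic can be exercised without touching the disk.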

II. Data computing performance

From a computational standpoint, simple SQL statements execute quickly, but complex SQL is hard to optimize. If a stored procedure uses a for loop to fetch rows and compute on them one at a time, it may be slower than Java.

In this respect, both score 3 points.

III. Data consistency

A relational database has metadata (a data dictionary describing the structure) and transaction management, so consistency under random writes is well guaranteed, at the cost of some performance. In a report application, however, most operations are reads, and writes are generally sequential, so the data consistency requirements are not high.

In this respect, file reports score 3 points and database reports score 5 points.

IV. Ease of management

Files can be organized by business category, module relationship, or time order in a multi-level directory tree. A database has a flat structure: it cannot organize data in multi-level directories and is only suited to managing a small number of tables. A database easily accumulates a large number of confusingly named tables, which makes it less manageable.

In this respect, file reports score 5 points and database reports score 3 points.
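As an illustration of the multi-level organization described above, the following Java sketch builds the path of a daily data file from a business category, a module, and a date. The layout `root/business/module/yyyy/MM/yyyy-MM-dd.txt` and the class name `ReportDataLayout` are hypothetical; any directory scheme that mirrors the business structure would serve the same purpose.

```java
import java.nio.file.Path;
import java.time.LocalDate;

/** Hypothetical directory layout: root/business/module/yyyy/MM/yyyy-MM-dd.txt */
final class ReportDataLayout {

    /** Returns the path of the data file that holds one day's records for a module. */
    static Path dataFile(Path root, String business, String module, LocalDate day) {
        return root.resolve(business)
                   .resolve(module)
                   .resolve(String.format("%04d", day.getYear()))
                   .resolve(String.format("%02d", day.getMonthValue()))
                   .resolve(day + ".txt"); // LocalDate prints as ISO yyyy-MM-dd
    }
}
```

With such a layout, everything for one month of one module sits under a single directory, so it can be copied, archived, or removed as a unit.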

V. Ease of development

Simple SQL is easy to write, but complex SQL, stored procedures, and Java are all hard to write. The problems with SQL are that it does not support step-by-step calculation, its set operations are incomplete, it lacks ordered sets, and it does not support object references. As a result, complex SQL and stored procedures do not match natural ways of thinking and are difficult to program. Java provides no basic class library for relational operations, so operations such as grouping must be written by programmers themselves, and programming is not easy either.

In this respect, file reports score only 1 point and database reports score 3 points.
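The grouping point can be seen in a small example. In SQL, counting actions per user is one statement along the lines of `SELECT user_id, COUNT(*) FROM user_actions GROUP BY user_id` (the table and column names are hypothetical). A hand-written Java equivalent, sketched below under the assumption that each row is a `{userId, action}` string pair, has to maintain the grouping map itself.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hand-written GROUP BY: counts rows per user ID. */
final class ManualGroupBy {

    /** Each row is assumed to be {userId, action}; the result keeps first-seen key order. */
    static Map<String, Integer> countByUser(List<String[]> rows) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] row : rows) {
            String userId = row[0];
            Integer current = counts.get(userId);
            counts.put(userId, current == null ? 1 : current + 1);
        }
        return counts;
    }
}
```

Every additional aggregate, sort, or join adds more code of this kind, which is the development burden the low score reflects.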

VI. Low cost

The overall cost of keeping data in files is much lower than that of a database. In this respect, file reports score 5 points and database reports score 1 point.

VII. Scalability

Data files are easier to scale at the storage layer. With HDFS in particular, the file system is very easy to scale out horizontally. Scaling out a database cluster is technically more complex, harder to configure and maintain, and costly.

In this respect, file reports score 5 points and database reports score 2 points.

VIII. Security

A database stores data in its own private format; even a user with operating system permissions, or an external system, can access the data only through the secure channel the database provides. Data files are exposed to operating system users, so operating system user permissions must be strictly managed to ensure security.

In this respect, file reports score 3 points and database reports score 5 points.
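As one example of managing operating system permissions for data files, the sketch below restricts a file to its owner using the standard `java.nio.file` API. It assumes a POSIX file system (Linux or Unix); the class name `DataFilePermissions` and the choice of `rw-------` are illustrative, and a real deployment would pick permissions to match its own accounts and groups.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Restricts report data files to the owning operating system user. */
final class DataFilePermissions {

    /** Owner may read and write; group and others get no access. */
    static Set<PosixFilePermission> ownerOnly() {
        return PosixFilePermissions.fromString("rw-------");
    }

    /** Applies the owner-only permissions; throws UnsupportedOperationException on non-POSIX file systems. */
    static void restrict(Path dataFile) {
        try {
            Files.setPosixFilePermissions(dataFile, ownerOnly());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```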

Overall, file reports and database reports compare as follows:

[Figure 1.jpg: score comparison of file reports and database reports, http://s3.51cto.com/wyfs02/M02/44/43/wKioL1PgguOg_PbcAADczZd9U1A027.jpg]

As the diagram above shows, for a report application, organizing data in files is on the whole slightly more advantageous than using a database. File reports also have weaknesses, which stem not from the file system itself but from Java as the development language. The most obvious weakness is that computing report source data in Java is hard to develop, because Java's built-in support for this kind of calculation is weak.
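The eight scores given in the sections above can be tallied directly. The sketch below uses a plain unweighted sum, which is an assumption: the article does not say how the individual scores should be combined, and a real evaluation would likely weight the criteria by project needs.

```java
/** Unweighted tally of the per-criterion scores stated in sections I to VIII. */
final class ScoreTally {

    // Order: read speed, computing performance, consistency, management,
    // development, cost, scalability, security.
    static final int[] FILE_REPORT = {5, 3, 3, 5, 1, 5, 5, 3};
    static final int[] DATABASE_REPORT = {3, 3, 5, 3, 3, 1, 2, 5};

    /** Sums the scores; equal weighting is an assumption, not the article's method. */
    static int total(int[] scores) {
        int sum = 0;
        for (int score : scores) {
            sum += score;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("File reports:     " + total(FILE_REPORT));     // 30
        System.out.println("Database reports: " + total(DATABASE_REPORT)); // 25
    }
}
```

Under equal weighting, file reports total 30 points against 25 for database reports, with ease of development as the single lowest file-report score.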

If you use Raqsoft's latest collection report together with file storage, the advantages over other reporting tools combined with database storage are very obvious.

The collection report has the data calculation engine collector (esProc) built in, which provides easy access to file data. Let's compare the collection report + file storage scheme (the collection report scheme for short) with other reporting tools + database storage (other reporting schemes for short) on each dimension.

Data read speed

The collection report calls the collector to read the data files and, after calculation, receives the results through the JDBC interface that the collector provides. Because the collector is also Java-based, its JDBC interface does not need to convert the data stream into objects, which makes it much faster than the JDBC of a traditional database.

In this respect, the collection report scheme scores 5 points.

Data computing performance

Both the collection report and the collector are developed in Java, and a single collector process does not compute as fast as a traditional database. If the collection report invokes the collector's single-machine multithreading mechanism or multi-machine parallel mechanism to perform distributed computing, it can be faster than a traditional database. For detailed test data, see the collector's test report.

In this respect, the collection report scheme scores 5 points.
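The single-machine multithreading idea can be illustrated in plain Java. The sketch below is not the collector's API, which the article does not show; it is a generic `ExecutorService` example, with the hypothetical name `ParallelSegmentSum`, that splits an array of values into segments, sums each segment on its own thread, and combines the partial results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Generic segment-parallel sum; a stand-in for multithreaded aggregation, not the collector's own mechanism. */
final class ParallelSegmentSum {

    /** Splits values into the given number of segments and sums them on a thread pool. */
    static long sum(final long[] values, int segments) {
        if (segments <= 0) {
            throw new IllegalArgumentException("segments must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(segments);
        try {
            int chunk = (values.length + segments - 1) / segments; // ceiling division
            List<Future<Long>> partials = new ArrayList<>();
            for (int i = 0; i < segments; i++) {
                final int from = Math.min(i * chunk, values.length);
                final int to = Math.min(from + chunk, values.length);
                Callable<Long> task = () -> {
                    long partial = 0;
                    for (int j = from; j < to; j++) {
                        partial += values[j];
                    }
                    return partial;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // waits for each segment to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The same split, compute, and merge pattern extends to multiple machines when each segment is a separate data file handled by a different node.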

Data consistency

The collection report accesses files through the collector, and the collector's code encapsulates the data files, which guarantees data consistency to some extent. Because report computation mainly reads and analyzes data, the collector does not provide transaction management. The benefit is better performance when reading and sequentially writing data; the drawback is that data consistency is slightly weaker than in a database.

In this respect, the collection report scheme scores 3 points.

Ease of management

With the collection report's file data source, data files naturally support multi-level directories, so copying, moving, and splitting data is much simpler and more efficient than in a database. Users can manage data by business module, time order, and similar rules, and an application can delete the corresponding data directory by directory. Data management thus becomes simple and clear, and the workload drops significantly.

In this respect, the collection report scheme scores 5 points.

Ease of development

The built-in collector is a development language specialized for computing on structured and semi-structured data. It offers step-by-step calculation, complete set operations, ordered-set operations, and an object reference mechanism, so it is closer to natural thinking, simpler to program, and needs less code. The collector provides a rich set of built-in objects and library functions for implementing complex business logic, and compared with SQL it lowers the threshold for turning business logic into program code.

In this respect, the collection report scheme scores 5 points.

Low cost

The collection report sells for much less than a database. Its software and hardware requirements are also much lower: a typical report project can run without purchasing dedicated data storage devices and the corresponding licenses, and it supports Windows/Linux/Unix on both high-end servers and low-cost PCs. The overall cost of the collection report is therefore much lower than that of a database.

In this respect, the collection report scheme scores 3 points.

Scalability

Data files are easier to scale at the storage layer.

On the computing side, the advanced edition of the collector provides powerful distributed computing management: a distributed computing engine, controllable task assignment, node selection and fault tolerance, intra-node data sharing, and inter-node data exchange, so that fault tolerance and performance can be balanced according to the characteristics of each task.

In this respect, the collection report scheme scores 5 points.

Security

The collection report can call the advanced edition of the collector, which provides a standalone server. If the collector server and the collection report are installed on different physical servers and the collector server opens only a specific port to external machines, data security can be effectively guaranteed.

The collector implements login and permission management through programming, which takes more work than with a database but is also more flexible.

In this respect, the collection report scheme scores 4 points.

The comparison between the collection report scheme and other reporting schemes is as follows:

[Figure 2.jpg: score comparison of the collection report scheme and other reporting schemes, http://s3.51cto.com/wyfs02/M00/44/43/wKioL1PgguPj8kBwAADGkpqlZHs349.jpg]

As the comparison shows, in report projects the advantages of the collection report + file scheme are obvious, and in a considerable number of cases it can replace the traditional solution of other reporting tools + database. This is especially true for report projects whose main source is historical data: large data volumes demand greater scalability and computing performance but place much lower demands on data consistency and security.

In fact, the collection report supports more than file data sources; it also supports database data sources well. We can therefore combine a database and the file system to supply data to the collection report and get the best result: the database holds the most recent data, which changes frequently, and the file system stores the historical data, which changes little. For example:

[Figure 3.jpg: database and file system combined as data sources for the collection report, http://s3.51cto.com/wyfs02/M01/44/43/wKiom1PggcvgM1TfAAES3_kZb1I826.jpg]

The most recent data is small in volume but changes frequently and has relatively high consistency requirements; keeping it in the database ensures data consistency without putting much load on the database. The historical data is large in volume and changes little, so it can take full advantage of the file system. This solution gains the benefits of both file and database data sources and can reach 5 points in all of the areas above.
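The split between recent and historical data can be expressed as a simple routing rule. The sketch below is one possible way to decide, per day of data, which source a report query should read; the class name `HybridSourceRouter` and the idea of a fixed window of recent days are assumptions for illustration, since the article does not define where the boundary lies.

```java
import java.time.LocalDate;

/** Routes a day's data to the database (recent) or the file system (historical). */
final class HybridSourceRouter {

    enum Source { DATABASE, FILE }

    /**
     * Days within the last recentDays (inclusive of the cutoff day) are read from the database;
     * older days are read from files. The window length is a hypothetical tuning parameter.
     */
    static Source sourceFor(LocalDate day, LocalDate today, int recentDays) {
        LocalDate cutoff = today.minusDays(recentDays);
        return day.isBefore(cutoff) ? Source.FILE : Source.DATABASE;
    }
}
```

For example, with a 30-day window and a current date of 2014-08-05, data for 2014-08-01 is read from the database and data for 2014-01-01 from files. A periodic job that moves data past the cutoff from the database into files would keep the database small.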

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
