Yesterday morning at work, a customer brought in 7 GB of data for analysis. What we were handed at first was an Oracle export file (a DMP file) measured in mere megabytes, and I got started on it.
Then the customer told me it would be 7 GB after decompression, which gave me a scare. I unpacked it and set about tidying up my hard disk: my machine only has a 120 GB drive, and after deleting everything that could be deleted I managed to free up 20 GB of space.
Once there was enough space, the import began. I'll cut the account of the import process short, since it took the better part of a day to get any result; the figures I recorded are as follows:
The E drive started with 12.4 GB of free space, and the job was counted as finished once three tables had been imported.
Table1: 331,398 rows
Table2: 2,820,669 rows
Table3: 15,554,217 rows
Start time: 10:42:43
End time: 12:10:10
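For anyone curious, a run like the one above can be scripted so that the start/end times and free disk space are recorded automatically. The snippet below is only a sketch of that idea, not what I actually ran: it assumes Oracle's classic imp utility is on the PATH, and the user/password/SID, DMP path, and drive letter are all placeholders.

import shutil
import subprocess
from datetime import datetime

DMP_FILE = r"E:\data\customer_export.dmp"   # placeholder path, not the real file
LOG_FILE = r"E:\data\import.log"
TARGET_DRIVE = "E:\\"                       # drive the dump is imported onto

def free_gb(path: str) -> float:
    """Free space on the given drive, in GB."""
    return shutil.disk_usage(path).free / (1024 ** 3)

start = datetime.now()
print(f"Start time {start:%H:%M:%S}, free space {free_gb(TARGET_DRIVE):.2f} GB")

# Oracle's classic import utility; user/password/SID are placeholders.
subprocess.run(
    ["imp", "scott/tiger@orcl", f"file={DMP_FILE}", f"log={LOG_FILE}", "full=y"],
    check=True,
)

end = datetime.now()
print(f"End time {end:%H:%M:%S}, free space {free_gb(TARGET_DRIVE):.2f} GB")
print(f"Elapsed: {end - start}")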
The data from that first 330,000-row table took up ... MB of hard disk space, and after the import the E drive had 5.05 GB left. The import probably took as long as it did because of my machine's hardware. I had never dealt with this much data before, but the import itself went smoothly: the CPU wasn't heavily loaded, and of my 2 GB of physical memory only about 1 GB was in use...
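A quick back-of-the-envelope check on those figures, using nothing but the numbers above:

# Rough per-row cost implied by the figures above.
rows = 331398 + 2820669 + 15554217          # the three imported tables
consumed_gb = 12.4 - 5.05                   # free space before minus after on the E drive
bytes_per_row = consumed_gb * 1024**3 / rows
print(f"{rows} rows, {consumed_gb:.2f} GB consumed, ~{bytes_per_row:.0f} bytes/row")
# -> 18706284 rows, 7.35 GB consumed, ~422 bytes/row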
What came next was even more interesting. Our product doesn't support connecting to Oracle directly for analysis, so the data first had to be imported into SQL Server. For the first analysis we used the table with 330,000 records. We began by building an analysis model in our software: the model joins two entities, and all of the objects' attributes together add up to only about 20. No professional data-extraction (ETL) tool was used. The result of this import surprised me: when 197,000 of the 330,000 rows had been loaded, the program stopped because the D drive had run out of space, and those 197,000 rows had actually consumed ... MB of it. I won't go into the software's reasons here, but along the way a single SQL Server process ate 1 GB of memory; SQL Server really could learn a thing or two from Oracle. Haha. Next we used a visual analysis tool, with the model built earlier, to analyze those 197,000 rows. Because that analysis tool caches everything, it consumed another 2 GB of memory, and it wasn't until after six in the evening that any results came out......
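Since no ETL tool was involved, the copy from Oracle into SQL Server is essentially done by hand, batch by batch. Below is a rough sketch of what such a hand-rolled copy can look like; the cx_Oracle/pyodbc drivers, connection strings, table name and columns are all assumptions for illustration, not how our product actually does it.

import cx_Oracle
import pyodbc

BATCH = 10_000  # copy in batches so neither side has to hold everything in memory

# Placeholder connection details, not the real environment.
ora = cx_Oracle.connect("scott/tiger@orcl")
mss = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=analysis;UID=sa;PWD=secret"
)

src = ora.cursor()
dst = mss.cursor()
dst.fast_executemany = True  # bulk parameter binding on the SQL Server side

src.execute("SELECT col1, col2, col3 FROM table1")  # hypothetical table and columns
copied = 0
while True:
    rows = src.fetchmany(BATCH)
    if not rows:
        break
    dst.executemany(
        "INSERT INTO dbo.table1 (col1, col2, col3) VALUES (?, ?, ?)", rows
    )
    mss.commit()
    copied += len(rows)
    print(f"{copied} rows copied")

src.close()
dst.close()
ora.close()
mss.close()

Committing after each batch keeps both memory use and the transaction log bounded; whether that would have avoided my D-drive problem I can't say.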
What I learned that day is really valuable. First, caching: when writing software, the first instinct is always to cache things, but after a while on this project I've come to think it's better not to load all the data into cache the moment the user logs into the software; otherwise, how could the software ever cope with large volumes of data? Second, execution efficiency: I've heard people say that one ETL tool can be six to eight times as efficient as another. What can we do to raise processing efficiency? That is exactly where the price gap between software products comes from....
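On the caching point, the difference boils down to pulling an entire table into memory up front versus fetching it a chunk at a time as it is needed. Here is a toy comparison of the two approaches; the table name, query, and driver are made up.

import pyodbc  # placeholder driver; any DB-API connection behaves the same way

def load_everything(conn):
    """The 'cache it all' instinct: fine for small tables, fatal for millions of rows."""
    cur = conn.cursor()
    cur.execute("SELECT * FROM dbo.table1")   # hypothetical table
    return cur.fetchall()                     # every row lives in memory at once

def stream_rows(conn, chunk=5_000):
    """Lazy alternative: yield one chunk at a time, so memory stays roughly constant."""
    cur = conn.cursor()
    cur.execute("SELECT * FROM dbo.table1")
    while True:
        rows = cur.fetchmany(chunk)
        if not rows:
            break
        yield from rows

# The caller's loop looks the same either way,
# but only the generator version survives an 18-million-row table:
# for row in stream_rows(conn):
#     process(row)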