Design a structure that uses Berkeley DB to provide big-data storage, backup, and query.
Existing groundwork:
1. Basic Berkeley DB operations.
2. Data is not lost after a dump/restart (persistence verified).
3. Storage of more than a hundred GB of data.
In what follows I simply call the incoming data a data stream; any conflict with other uses of the term is ignored.
The role of each component:
A: Responsible for converting the source data into a format Berkeley DB can store and inserting it into Berkeley DB, so that it is convenient to use later. A's disk space is limited, and A consumes a lot of system resources during continuous insertion; once it goes down the consequences are serious (no good way to handle this has been found yet). Also, while one process on A inserts data, another process could retrieve data from A, but retrieval efficiency is not necessarily high in that case. So the structure has to be adjusted accordingly.
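A minimal sketch of the kind of insert loop A might run, assuming Berkeley DB Java Edition (com.sleepycat.je); the environment path, database name, and record format are illustrative placeholders, not the actual system's values.

```java
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class NodeAIngest {
    public static void main(String[] args) throws Exception {
        // Open (or create) the environment on A's local disk.
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("/data/bdb/envA"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "source_data", dbConfig);

        // Convert each source record into a key/value pair and insert it.
        // The key/value layout here is only an example.
        for (long i = 0; i < 1_000_000; i++) {
            String sourceRecord = "record-" + i;   // stand-in for a real source record
            DatabaseEntry key = new DatabaseEntry(
                    String.valueOf(i).getBytes(StandardCharsets.UTF_8));
            DatabaseEntry value = new DatabaseEntry(
                    sourceRecord.getBytes(StandardCharsets.UTF_8));
            db.put(null, key, value);
        }

        db.close();
        env.close();   // flushes the .jdb log files to disk
    }
}
```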
Node[]: Periodically pulls the data from A; the pull rules depend on the actual situation, a bit like the map phase in MapReduce. In our tests only the first node in Node[] could access the data; when the other nodes opened the environment they got an error. So a merge node collects all of the node data, after which the environment can be opened without problems and the data can be accessed.
Merge: As stated above, the merge node is responsible for merging the data into one Berkeley DB that can be accessed; the goal is to be able to run searches on the merge node.
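A rough sketch of how the merge step could copy every record out of one node's environment into the merge environment, again assuming Berkeley DB JE; the paths and database names are made up for illustration.

```java
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

import java.io.File;

public class MergeNodeData {
    public static void main(String[] args) throws Exception {
        // Source: a node environment copied over (e.g. via SSH/scp), opened read-only.
        EnvironmentConfig roConfig = new EnvironmentConfig();
        roConfig.setReadOnly(true);
        Environment nodeEnv = new Environment(new File("/data/bdb/node1"), roConfig);
        DatabaseConfig roDbConfig = new DatabaseConfig();
        roDbConfig.setReadOnly(true);
        Database nodeDb = nodeEnv.openDatabase(null, "source_data", roDbConfig);

        // Target: the merge environment that will serve queries.
        EnvironmentConfig rwConfig = new EnvironmentConfig();
        rwConfig.setAllowCreate(true);
        Environment mergeEnv = new Environment(new File("/data/bdb/merge"), rwConfig);
        DatabaseConfig rwDbConfig = new DatabaseConfig();
        rwDbConfig.setAllowCreate(true);
        Database mergeDb = mergeEnv.openDatabase(null, "merged_data", rwDbConfig);

        // Walk the node database with a cursor and re-insert every record.
        DatabaseEntry key = new DatabaseEntry();
        DatabaseEntry value = new DatabaseEntry();
        Cursor cursor = nodeDb.openCursor(null, null);
        while (cursor.getNext(key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            mergeDb.put(null, key, value);
        }
        cursor.close();

        nodeDb.close();
        nodeEnv.close();
        mergeDb.close();
        mergeEnv.close();
    }
}
```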
Problems:
1. If A goes down, the node's .jdb files end up out of order. If A terminates abnormally and a .jdb file in the environment is lost, the environment fails to open when the program is restarted. Even if the environment is wiped clean, the newly generated files start numbering from scratch and can no longer be merged with the earlier data. One idea is to rename the newly generated files so their numbers continue from the previous maximum file name and then merge them into the merge node. Testing showed this scheme is not feasible: files from two different environments cannot simply be stitched together in file-name order into one environment and still be readable; after all, an environment in memory is more than the files on disk. Another option is to use a second merge node to collect the files produced after the outage into merge2; to retrieve data we then query merge and, if nothing is found, the next merge in turn (a retrieval sketch for this fallback is shown after this list). For now all inter-machine communication and connections are assumed to go over SSH. This approach works in theory and still has to be verified in practice.
2. Personally I think the biggest problem is that this layout consumes too many resources. The Node[] layer can probably be removed. Its role is to guard against losing everything if the data on the merge node is lost, but in practice the data on the merge node is hard to lose: if the merge node is used only for retrieval there is basically no way for it to disappear, unless someone manually deletes the files on its disk. So I am now thinking of changing the structure to the one below, with the corresponding explanation.
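Regarding the merge/merge2 fallback described in problem 1: retrieval could simply try the environments in order. The sketch below assumes Berkeley DB JE, and the paths, database name, and key format are placeholders.

```java
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class MultiMergeLookup {
    public static void main(String[] args) throws Exception {
        // Open both merge environments read-only and query them in order.
        Database[] databases = {
                openReadOnly("/data/bdb/merge", "merged_data"),
                openReadOnly("/data/bdb/merge2", "merged_data")
        };

        DatabaseEntry key = new DatabaseEntry("12345".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry();

        // Check merge first, then fall back to merge2.
        for (Database db : databases) {
            if (db.get(null, key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                System.out.println(new String(value.getData(), StandardCharsets.UTF_8));
                break;
            }
        }
    }

    private static Database openReadOnly(String path, String dbName) throws Exception {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setReadOnly(true);
        Environment env = new Environment(new File(path), envConfig);
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setReadOnly(true);
        return env.openDatabase(null, dbName, dbConfig);
    }
}
```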
A works just as before, so there is not much to add. Data still flows to the nodes, but the condition for switching to the next node is now that the previous node's disk is exhausted or A has gone down. We guarantee that the data stored on each node forms one complete environment; when one goes down we switch to the next node. On the lab machines the program can handle at least 200 GB before terminating abnormally, and a machine with good hardware lets Berkeley DB manage terabyte-scale files without strain. A's disk should be as large as possible; for smaller data volumes you may not even have to delete the data on A.
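One way the "switch to the next node when the previous node's disk is exhausted" condition might be checked is a simple free-space test before each transfer; the threshold and directory paths below are only assumptions.

```java
import java.io.File;

public class NodeSwitchCheck {
    // Hypothetical minimum free space (bytes) to keep on a node's data disk.
    private static final long MIN_FREE_BYTES = 10L * 1024 * 1024 * 1024; // 10 GB

    /** Returns true if the node data directory still has room for more .jdb files. */
    static boolean hasRoom(File nodeDataDir) {
        return nodeDataDir.getUsableSpace() > MIN_FREE_BYTES;
    }

    public static void main(String[] args) {
        File[] nodeDirs = {
                new File("/data/bdb/node1"),
                new File("/data/bdb/node2"),
                new File("/data/bdb/node3")
        };

        // Pick the first node whose disk is not (nearly) exhausted;
        // in the real system the "A is down" condition would also trigger a switch.
        for (File dir : nodeDirs) {
            if (dir.exists() && hasRoom(dir)) {
                System.out.println("use node directory: " + dir);
                return;
            }
        }
        System.out.println("all node disks exhausted");
    }
}
```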
The requirements on the search entry point are modest: basically any machine that can run the program will do, and of course one of the nodes can be used directly as the search portal.
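If a node is used directly as the search portal, an exact lookup is a single get, and a range scan is a cursor positioned with getSearchKeyRange. This sketch assumes Berkeley DB JE and UTF-8 string keys, both assumptions about the actual data format.

```java
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class SearchPortal {
    public static void main(String[] args) throws Exception {
        // Open the node's environment read-only so the portal never modifies data.
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setReadOnly(true);
        Environment env = new Environment(new File("/data/bdb/node1"), envConfig);
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setReadOnly(true);
        Database db = env.openDatabase(null, "source_data", dbConfig);

        // Range scan: position the cursor at the first key >= "2014-09"
        // and read forward a few records (example key prefix).
        DatabaseEntry key = new DatabaseEntry("2014-09".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry();
        Cursor cursor = db.openCursor(null, null);
        OperationStatus status = cursor.getSearchKeyRange(key, value, LockMode.DEFAULT);
        for (int i = 0; i < 10 && status == OperationStatus.SUCCESS; i++) {
            System.out.println(new String(key.getData(), StandardCharsets.UTF_8) + " -> "
                    + new String(value.getData(), StandardCharsets.UTF_8));
            status = cursor.getNext(key, value, LockMode.DEFAULT);
        }
        cursor.close();

        db.close();
        env.close();
    }
}
```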
Personally I think this plan is better than the first one. There may be hidden problems I have not found yet; after all, I lack experience.
Reprinted from: http://www.cnblogs.com/ickelin/p/3975676.html
Berkeley DB data processing