Background:
1) At present, our program analyzes 100G of XML data per day on a single computer. The selected data that must be stored in the database (SQL Server 2008 R2, 64-bit) comes to nearly 100 million records, and a computer with 128G of memory and 32 cores is barely able to complete the task;
2) As the market expands, we now receive about 1T of XML data per day, and the time a single computer needs to finish the analysis has become the bottleneck: one day of data can take 10 days or more.
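The 10-day figure follows directly from the two numbers above. A minimal sketch of the arithmetic, assuming 1T is counted as 1000G and that throughput scales linearly with the number of computers (both are simplifying assumptions, and the function names are made up for illustration):

```python
import math


def days_to_process(daily_input_gb, gb_per_node_per_day, nodes=1):
    """Days of computation needed to analyze one day's worth of input."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return daily_input_gb / (gb_per_node_per_day * nodes)


def nodes_to_keep_up(daily_input_gb, gb_per_node_per_day):
    """Smallest number of computers that finishes one day's input within a day."""
    return math.ceil(daily_input_gb / gb_per_node_per_day)


single_machine_days = days_to_process(1000, 100)        # one computer
ten_machine_days = days_to_process(1000, 100, nodes=10)  # ten computers
needed = nodes_to_keep_up(1000, 100)
```

In practice the database and the network add overhead, so the real numbers will be somewhat worse than this linear estimate.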
Solution:
To make our product more resilient and attract more users, the project team discussed the following options:
Scenario 1) Use Hadoop for this big data processing. However, the company's understanding of Hadoop is currently limited, and applying it reliably in the product will still take some time, so the Hadoop option has a low implementation priority for now. That does not mean we will never do it; it is only a matter of time.
Scenario 2) Expand further on the basis of our current platform. How do we expand?
2.1) Run our tool on multiple computers and split the work among them. Assuming 1T of data and 10 computers, each computer receives an even share of 100G, which reduces the pressure on each database and each computer. Scaling out is something we must do, although we cannot put it into production immediately;
2.2) Once the tool is scaled out, the storage database must be expanded as well. Ideally each computer has its own storage database, which benefits both the business implementation and the offloading of database pressure.
2.3) Once the database is split, how the application side merges the data becomes a problem we must consider. So how do we plan to handle the merge? First, while the service on each compute node inserts its data, it also inserts the data that the business needs most into a summary database, and the details are saved only to the database that belongs to that compute node. The application side accesses the summary database directly. To view the details of a particular record, it reads from the summary record which database holds them, and then fetches the detailed information from that database.
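The layout described in 2.1 to 2.3 can be sketched roughly as follows. This is only an illustration under stated assumptions: in-memory SQLite stands in for the real database servers, the table and column names (summary, detail, record_id, node_id) are invented for the example, and files are dealt out round-robin, which is just one simple way to get an even split.

```python
import sqlite3


def assign_round_robin(files, node_count):
    """Deal input files out to nodes in turn so each gets a near-equal share."""
    shares = [[] for _ in range(node_count)]
    for i, name in enumerate(files):
        shares[i % node_count].append(name)
    return shares


class Cluster:
    """Hypothetical layout: one detail database per node plus one summary database."""

    def __init__(self, node_count):
        # In-memory SQLite is an assumption made to keep the sketch runnable.
        self.summary = sqlite3.connect(":memory:")
        self.summary.execute(
            "CREATE TABLE summary "
            "(record_id TEXT PRIMARY KEY, title TEXT, node_id INTEGER)"
        )
        self.details = []
        for _ in range(node_count):
            db = sqlite3.connect(":memory:")
            db.execute("CREATE TABLE detail (record_id TEXT PRIMARY KEY, body TEXT)")
            self.details.append(db)

    def store(self, node_id, record_id, title, body):
        """Worker side: details stay on the node, the summary row records the owner."""
        self.details[node_id].execute(
            "INSERT INTO detail VALUES (?, ?)", (record_id, body)
        )
        self.summary.execute(
            "INSERT INTO summary VALUES (?, ?, ?)", (record_id, title, node_id)
        )

    def list_titles(self):
        """Application side: list views read only the summary database."""
        rows = self.summary.execute("SELECT title FROM summary ORDER BY record_id")
        return [row[0] for row in rows]

    def fetch_detail(self, record_id):
        """Application side: look up the owning node, then read its detail database."""
        row = self.summary.execute(
            "SELECT node_id FROM summary WHERE record_id = ?", (record_id,)
        ).fetchone()
        if row is None:
            return None
        detail = self.details[row[0]].execute(
            "SELECT body FROM detail WHERE record_id = ?", (record_id,)
        ).fetchone()
        return detail[0] if detail else None


files = [f"part-{i:02d}.xml" for i in range(10)]
shares = assign_round_robin(files, 3)
cluster = Cluster(3)
for node_id, share in enumerate(shares):
    for name in share:
        cluster.store(node_id, name, f"summary of {name}", f"details parsed from {name}")
```

The point of the design is that the application never needs to query every node: the summary row carries the node id, so a detail lookup touches exactly one detail database.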
Decision:
Scenario 2 passed the review. How do we implement it? How difficult is the implementation? What are the technical problems?
Setting other problems aside, let us first talk about the technical difficulties. Since the work is distributed across multiple computers, there must be a scheduler, and everyone knows how difficult a scheduler is: heartbeat monitoring, task execution status, communication stability, efficiency, and accuracy. And how should the message queue be planned?
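To make the heartbeat part concrete, here is a minimal sketch of how a scheduler might track workers and recover tasks from a worker that has gone silent. It is not the platform's actual design: the class name, the 30-second timeout, and the node and task ids are all assumptions, and real heartbeats would arrive over the network or a message queue rather than as direct method calls.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per worker and the tasks each worker holds."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock      # injectable so the sketch can be tested without sleeping
        self.last_seen = {}     # worker_id -> time of last heartbeat
        self.assigned = {}      # worker_id -> set of unfinished task ids

    def assign(self, worker_id, task_id):
        """Record that a task was handed to a worker."""
        self.assigned.setdefault(worker_id, set()).add(task_id)
        self.last_seen.setdefault(worker_id, self.clock())

    def heartbeat(self, worker_id, finished=()):
        """Called when a worker reports in, optionally with finished task ids."""
        self.last_seen[worker_id] = self.clock()
        self.assigned.setdefault(worker_id, set()).difference_update(finished)

    def reap(self):
        """Forget workers that missed the timeout and return their unfinished tasks."""
        now = self.clock()
        orphaned = []
        for worker_id, seen in list(self.last_seen.items()):
            if now - seen > self.timeout:
                orphaned.extend(sorted(self.assigned.pop(worker_id, set())))
                del self.last_seen[worker_id]
        return orphaned


# A fake clock stands in for real time so the example runs instantly.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=30, clock=lambda: fake_now[0])
monitor.assign("node-a", "task-1")
monitor.assign("node-b", "task-2")
fake_now[0] = 20.0
monitor.heartbeat("node-a")   # node-a reports in, node-b stays silent
fake_now[0] = 40.0
lost = monitor.reap()         # tasks to hand to another worker
```

The scheduler would call reap() periodically and put the returned task ids back on the queue. Choosing the timeout is a trade-off: too short and a slow network causes duplicate work, too long and a dead node delays the whole batch.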
Distributed multi-computer scheduling platform