GFS (published in 2003) stores massive amounts of data on a cluster of commodity hardware, replicating data redundantly across nodes. MapReduce (2004) complements the GFS architecture by exploiting the large number of CPUs contributed by the cluster's low-cost servers. Together, the two systems form the core of Google's machinery for processing massive data sets, including building its search index. But neither system can access data in real time, so they are not sufficient for serving web applications.
Another drawback of GFS is that it is designed for a modest number of very large files rather than for enormous numbers of small files, such as the images on a social platform: the metadata for every file is ultimately kept in the master's memory, so the more files there are, the greater the pressure on the master.
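A back-of-envelope sketch makes the pressure concrete. The per-file and per-chunk byte costs below are illustrative assumptions, not figures from the GFS paper; only the 64 MB chunk size comes from GFS itself:

```python
# Rough estimate of GFS master memory: metadata for every file and every
# chunk lives in the master's RAM. Byte costs here are assumed values.
BYTES_PER_FILE = 150        # assumed metadata cost per file (name, chunk list)
BYTES_PER_CHUNK = 64        # assumed in-memory cost per chunk record
CHUNK_SIZE = 64 * 2**20     # GFS's 64 MB chunk size

def master_memory(num_files: int, avg_file_size: int) -> int:
    """Approximate bytes of master RAM needed to track the namespace."""
    chunks_per_file = max(1, -(-avg_file_size // CHUNK_SIZE))  # ceil division
    return num_files * (BYTES_PER_FILE + chunks_per_file * BYTES_PER_CHUNK)

# 1 PB stored as 10,000 files of 100 GB vs. 10 billion photos of 100 KB:
few_large = master_memory(10_000, 100 * 2**30)           # ~1 GB of master RAM
many_small = master_memory(10_000_000_000, 100 * 2**10)  # ~2 TB of master RAM
```

Under these assumptions, the same petabyte of data costs the master roughly a gigabyte of RAM when stored as large files, but terabytes when stored as billions of small ones.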
What was needed was a solution that could drive an interactive application while taking advantage of both of these infrastructures, relying on GFS storage for its data redundancy and availability. The stored data would be split into very small entries, which the system would aggregate into very large files, along with some kind of index that lets the user reach the data with as few disk seeks as possible. Such a system could store the crawler's results in a timely manner and cooperate with MapReduce to generate the search index. Relational features were then abandoned in favor of a simple API for creating, updating, and deleting entries, plus a scan function for iterating over a larger key range or an entire table. The result was BigTable (2006), a distributed storage system for managing structured data.
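The shape of that API can be sketched roughly as follows. This is a toy in-memory model, not BigTable's actual interface; the class and method names are illustrative:

```python
import bisect

class SortedTable:
    """Toy sketch of a BigTable-like interface: row keys are kept sorted,
    so a scan over a key range touches one contiguous run of keys,
    mirroring how an index keeps lookups to a few disk seeks."""

    def __init__(self):
        self._keys = []    # sorted row keys
        self._rows = {}    # row key -> value

    def put(self, key, value):
        """Insert or overwrite a row."""
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        """Look up a single row, or None if absent."""
        return self._rows.get(key)

    def delete(self, key):
        """Remove a row if it exists."""
        if key in self._rows:
            del self._rows[key]
            self._keys.remove(key)

    def scan(self, start=None, end=None):
        """Iterate rows with start <= key < end in key order;
        omit both bounds to iterate the entire table."""
        lo = 0 if start is None else bisect.bisect_left(self._keys, start)
        hi = len(self._keys) if end is None else bisect.bisect_left(self._keys, end)
        for k in self._keys[lo:hi]:
            yield k, self._rows[k]
```

Keeping rows sorted by key is the design choice that makes range scans cheap; BigTable exploits the same property, for example by using reversed domain names as row keys so that pages from one site sit adjacent in the table.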
It is worth mentioning the CAP theorem, which states that a distributed system can guarantee at most two of three properties at the same time: consistency, availability, and partition tolerance. Relaxing consistency requirements is therefore one way to increase a system's availability.