Consider the technology needed to support the search box on Google's home page. Behind the algorithms, the cached search terms, and the other features that come to life as you type a query sits a data store that is essentially a full-text snapshot of most of the web. While you and thousands of other people submit searches at the same time, that snapshot is constantly being updated with changes. Meanwhile, the data is processed by thousands of independent server processes, each with its own responsibility, from working out which ads to show you to determining the order in which the search results are listed.
The storage system behind Google's search engine must serve millions of read and write requests every day from thousands of separate processes running on thousands of servers. It can almost never be taken down for backup or maintenance, and it must keep growing to accommodate the ever-increasing number of pages added by Google's web crawler. Overall, Google processes more than 20 PB of data a day.
This is not something Google could do with an off-the-shelf storage architecture. The same is true of the other web and cloud-computing giants that run huge data centers, such as Amazon and Facebook. Most data centers have scaled storage by adding more disk capacity to a storage area network, more storage servers, and usually more database servers, but these approaches fail to scale because of the performance limits of a cloud environment. In the cloud there may be tens of thousands of active users at any one time, and the data being read and written at any given moment can reach thousands of terabytes.
This is not simply a question of disk read and write speeds. With data flowing at these volumes, the main problem is the throughput of the storage network; even with the best switches and storage servers, a traditional SAN architecture can become a performance bottleneck for data processing.
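To give a rough sense of scale, the 20 PB per day figure quoted earlier can be turned into an average sustained rate. The sketch below is only a back-of-the-envelope calculation: it assumes decimal petabytes, a perfectly even load across the day, and a hypothetical 10 Gb/s network link chosen purely for illustration. Real traffic is bursty, so peak demand would be higher.

```python
import math

PB = 10**15                 # decimal petabyte, in bytes (assumption)
SECONDS_PER_DAY = 86_400

def average_throughput(bytes_per_day: float) -> float:
    """Average bytes per second if the load were spread evenly over a day."""
    return bytes_per_day / SECONDS_PER_DAY

rate = average_throughput(20 * PB)        # bytes per second
link_bits_per_second = 10e9               # hypothetical 10 Gb/s link
links = math.ceil(rate * 8 / link_bits_per_second)

print(f"{rate / 1e9:.1f} GB/s on average")          # 231.5 GB/s on average
print(f"{links} fully saturated 10 Gb/s links")     # 186 fully saturated 10 Gb/s links
```

Even under these generous assumptions, the traffic is far more than a single storage network path can carry, which is why throughput rather than disk speed becomes the limiting factor.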
Then there is the familiar problem of what it costs to expand storage. Given the rate at which the largest web companies add capacity (Amazon, for example, now adds as much capacity to its data centers every day as the entire company ran on in 2001, according to Amazon Vice President James Hamilton), rolling out the required storage the way most data centers do would be hugely expensive in management, hardware, and software costs. The cost is even higher when relational databases are added to the mix, depending on how an organization handles their segmentation and replication.
To meet this demand for ever-expanding, durable storage, the internet giants (Google, Amazon, Facebook, Microsoft, and others) have adopted a different storage solution: distributed file systems built on object-based storage. These systems were at least partially inspired by other distributed and clustered file systems, such as Red Hat's Global File System and IBM's General Parallel File System.
The architecture of the cloud giants' distributed file systems separates the metadata (the data about the content) from the stored data itself. This allows large numbers of parallel reads and writes across multiple replicas, and it throws away concepts like "file locking".
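The separation of metadata from stored data can be illustrated with a small in-memory sketch. This is not the design of any particular system: the class names `MetadataServer` and `ChunkServer`, the tiny chunk size, and the round-robin placement are all assumptions made for illustration. The point is only that the metadata service records where each chunk lives, while the bytes themselves go directly to and from the storage servers, so chunks can be read in parallel from different replicas.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes; deliberately tiny for illustration


class ChunkServer:
    """Holds raw chunk data and nothing else."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}

    def put(self, chunk_id, data):
        self.chunks[chunk_id] = data

    def get(self, chunk_id):
        return self.chunks[chunk_id]


class MetadataServer:
    """Knows which servers hold each chunk; never stores file contents."""

    def __init__(self, servers, replicas=2):
        # Assumes replicas <= len(servers), so each copy lands on a distinct server.
        self.servers = servers
        self.replicas = replicas
        self.files = {}  # filename -> list of (chunk_id, [ChunkServer, ...])

    def allocate(self, filename, n_chunks):
        layout = []
        for i in range(n_chunks):
            start = i % len(self.servers)  # simple round-robin placement
            holders = [self.servers[(start + r) % len(self.servers)]
                       for r in range(self.replicas)]
            layout.append((f"{filename}#{i}", holders))
        self.files[filename] = layout
        return layout

    def lookup(self, filename):
        return self.files[filename]


def write_file(meta, filename, data):
    """Split data into chunks and store every chunk on each of its replicas."""
    pieces = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    layout = meta.allocate(filename, len(pieces))
    for (chunk_id, holders), piece in zip(layout, pieces):
        for server in holders:
            server.put(chunk_id, piece)


def read_file(meta, filename):
    """Fetch all chunks in parallel, spreading the reads across replicas."""
    def fetch(entry):
        index, (chunk_id, holders) = entry
        return holders[index % len(holders)].get(chunk_id)

    with ThreadPoolExecutor() as pool:
        # pool.map returns results in input order, so chunks rejoin correctly.
        return b"".join(pool.map(fetch, enumerate(meta.lookup(filename))))


servers = [ChunkServer(f"cs{i}") for i in range(3)]
meta = MetadataServer(servers, replicas=2)
write_file(meta, "crawl.txt", b"hello distributed world")
print(read_file(meta, "crawl.txt"))  # b'hello distributed world'
```

Because a reader consults the metadata service only for chunk locations, the heavy data traffic never passes through it, and no file-level lock is needed for readers to pull different chunks at the same time.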
The impact of these distributed file systems extends well beyond the large data centers they were built for: they directly affect how companies that use public cloud services (such as Amazon's EC2, Google's App Engine, and Microsoft's Azure) develop and deploy applications. Companies, universities, and government agencies looking for a fast way to store and provide access to massive amounts of data are increasingly turning to a new class of data storage systems inspired by those of the cloud giants. It is therefore worth understanding their history and the engineering compromises made along the way.