Facebook shifts 30PB of data to a new data center


As the world's largest social network, Facebook accumulates more data in a day than many large companies do in a year. According to Facebook, its cluster currently stores about 30 petabytes (30,000 terabytes) of data, roughly 3,000 times the amount of information held in the Library of Congress. Facebook's data warehouse has grown by more than one third over the past year.

To accommodate this breakneck pace of data growth, Facebook took a major step early this year and began moving its ever-expanding Hadoop cluster to a new, much larger Facebook data center in Prineville, Oregon. According to Facebook, the largest data migration in the company's history was completed last month.

Paul Yang, an engineer on Facebook's data infrastructure team, described details of the move this week on the company's blog. Yang said the move to the new data center was necessary because the company had run out of power and space to add new nodes to the Hadoop cluster.

Yang, however, did not discuss the move further in an interview with Computerworld.

Facebook's experience is of growing interest because more and more companies are starting to use the Apache open-source software to capture and analyze large volumes of structured and unstructured data.

The attraction of Hadoop is its ability to break very large data sets into smaller chunks, which are then distributed across a collection of commodity hardware systems for faster processing.
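As a rough illustration of this split-and-distribute model, below is a minimal sketch of a Hadoop Streaming word-count job written in Python. The scripts, the word-count task, and the file names are illustrative assumptions; they are not anything described in the article.

```python
#!/usr/bin/env python
# mapper.py -- illustrative Hadoop Streaming mapper: emits (word, 1) per word.
# Hadoop splits the input into blocks and runs one mapper per split, so each
# commodity node only ever processes a small chunk of the full data set.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- illustrative Hadoop Streaming reducer: sums the counts per word.
# Hadoop sorts mapper output by key before handing it to the reducer.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

A pair of scripts like this would typically be launched via Hadoop's streaming jar, which handles splitting the input, scheduling mappers on the nodes holding each block, and sorting mapper output before the reducers run.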

An increasing number of business users are using Hadoop to collect and analyze large volumes of unstructured and machine-generated information, such as log and event data, search engine results, and text and multimedia content from social networking sites, according to a Ventana Research report released this week.

Facebook says it uses Hadoop to collect and store the millions of files its members generate every day. That data is then analyzed using the open-source Apache Hive data warehouse toolset.
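To give a sense of what querying such a warehouse can look like, here is a minimal, hypothetical sketch using the PyHive client library. The host, table, and column names are invented for illustration; the article does not describe Facebook's actual tooling or schema.

```python
# Hypothetical sketch of querying a Hive warehouse from Python via PyHive.
# Requires the PyHive package and a reachable HiveServer2 endpoint.
# Table and column names (user_actions, action_type, ds) are illustrative only.
from pyhive import hive

conn = hive.connect(host="warehouse.example.com", port=10000)
cursor = conn.cursor()

# Hive compiles this SQL-like query into distributed jobs over the cluster.
cursor.execute(
    "SELECT action_type, COUNT(*) AS events "
    "FROM user_actions WHERE ds = '2011-07-01' "
    "GROUP BY action_type"
)
for action_type, events in cursor.fetchall():
    print(action_type, events)
```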

Other companies with high data volumes that use Hadoop for similar work include eBay, Amazon and Yahoo. Yahoo is the main contributor to the Hadoop code base.

According to a March 2010 blog post, Facebook's Hadoop cluster had become the largest such cluster in the world. It consists of 2,000 machines: 800 16-core systems and 1,200 8-core systems, each storing approximately 12 to 24 terabytes of data.
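A quick back-of-the-envelope calculation (a sketch using only the figures quoted above) shows how that hardware adds up to tens of petabytes of raw capacity and more than 22,000 processor cores:

```python
# Back-of-the-envelope totals from the figures quoted in the 2010 blog post.
machines_16_core = 800
machines_8_core = 1200
cores = machines_16_core * 16 + machines_8_core * 8   # 12,800 + 9,600 = 22,400 cores

storage_low_pb = (2000 * 12) / 1000.0    # 12 TB per node -> 24 PB raw
storage_high_pb = (2000 * 24) / 1000.0   # 24 TB per node -> 48 PB raw

print("total cores:", cores)
print("raw storage: %.0f-%.0f PB" % (storage_low_pb, storage_high_pb))
# Consistent with the roughly 30 PB the warehouse held at migration time.
```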

Facebook considered a couple of approaches for moving the cluster to the new data center, Yang wrote in his blog post.

One option, according to Yang, was to physically move each node to the new location, a task that could have been completed within a few days with enough staff working on it. The company decided against this route, however, to avoid an unexpectedly long period of downtime.

Instead, Facebook decided to build a new, larger Hadoop cluster and simply replicate the data from the old cluster onto it. The chosen method was more complicated because Facebook had to replicate the source data from a live system in which files were constantly being created and deleted, Yang wrote in his blog.

Facebook's engineers therefore built a new replication system to cope with the unprecedented size of the cluster and its data load. "Because replication minimized downtime, we decided to choose this approach to complete the mass migration," Yang said.

According to Yang, the data replication was carried out in two steps (a simplified sketch follows the description below):

First, most of the data and directories on the original Hadoop cluster were copied to the new cluster using DistCp, an open-source bulk-copy tool that ships with Hadoop.

Second, any file changes made after the bulk copy completed were replicated to the new cluster using a file replication system Facebook developed for the migration; the file changes were captured by a Hive plugin that Facebook's developers also built.
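In simplified form, that two-step scheme looks roughly like the sketch below. The helper names, HDFS URIs, and change-record format are assumptions made for illustration; only the use of DistCp for the bulk copy is taken from the article.

```python
import subprocess

def bulk_copy(src="hdfs://old-cluster/warehouse", dst="hdfs://new-cluster/warehouse"):
    """Step 1: bulk-copy existing files between clusters with DistCp,
    the standard Hadoop tool for large inter-cluster copies.
    The HDFS URIs above are placeholders."""
    subprocess.check_call(["hadoop", "distcp", src, dst])

def replay_changes(changes, apply_copy, apply_delete):
    """Step 2: catch up on files created or deleted while the bulk copy ran.

    `changes` is assumed to yield (operation, path) records captured by a
    change-logging hook (the article mentions a Hive plugin); apply_copy and
    apply_delete stand in for per-file copy and delete operations. These names
    are illustrative only, not Facebook's actual replication system."""
    for operation, path in changes:
        if operation == "create":
            apply_copy(path)
        elif operation == "delete":
            apply_delete(path)
```

The design point is the same one Yang describes: do the heavy lifting once with a bulk copy, then keep the destination cluster caught up by replaying only the changes that arrived afterward.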

During the final switchover, Facebook temporarily disabled the creation of new files on the old Hadoop cluster, allowing the replication system to finish copying all remaining data to the new cluster. Facebook then changed its DNS settings so that they pointed to the new cluster.
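The cutover ordering can be summarized as in the following sketch; the functions are placeholders for the operational steps described above, not real Hadoop or Facebook APIs.

```python
# Illustrative cutover sequence; each function stands in for an operational
# step described in the article, not a real API.
def block_new_file_creation():
    print("old cluster: creation of new files temporarily disabled")

def drain_remaining_changes():
    print("replication system: remaining file changes copied to the new cluster")

def switch_dns(new_cluster):
    print("DNS updated to point Hadoop clients at", new_cluster)

def cut_over(new_cluster="hadoop.new-dc.example.internal"):
    # Order matters: writes stop first, the replica catches up, and only then
    # does DNS send traffic to the new cluster.
    block_new_file_creation()
    drain_remaining_changes()
    switch_dns(new_cluster)

if __name__ == "__main__":
    cut_over()
```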

According to Yang, being able to quickly build the data replication tools in-house was key to the success of the migration project.

Beyond data migration, the replication tools can also be used to provide new disaster recovery capabilities for the Hadoop cluster.

"We think it is possible to effectively retain a large cluster of activities that are properly replicated, with only a small number of delays," Yang said. By replicating systems, when an emergency occurs, the entire business operation uses relatively little work to be transferred to the replication cluster.
