Prism: Facebook's Real Big Data Project

Source: Internet
Author: User

Any website today that has to deal with the huge volumes of online data Facebook was tackling five years ago has a much easier time of it than Facebook did, says Facebook technologist Jay Parikh.

This is because many internet companies, including Facebook, have spent the past few years investing heavily in software platforms that can analyze and process online data across tens of thousands of servers. Once their "big data" software was finished, these companies released it publicly so that anyone interested could use it.

Facebook, like Yahoo, is a pioneer in the development of Hadoop, a powerful software platform for processing and analyzing the vast amounts of data generated by the modern web. Yahoo started the software as a way to build the index for its own search engine, but other companies quickly adopted it for their own online data analysis and kept improving Hadoop to that end.

As a result of these efforts, the Hadoop platform can handle more than 100PB (about 100 million GB) of data. "Five years ago when we started using these technologies, there were a lot of limitations on the type and speed of computing," Parikh said. "With the efforts of the open source community, these limitations and barriers have been removed, so people can accomplish tasks faster than we did." He now oversees the vast hardware and software infrastructure that Facebook needs to run.

But Facebook now faces far more data than it used to, and existing platforms like Hadoop have a number of limitations that need to be addressed. At a press briefing this week at Facebook's Menlo Park headquarters, Parikh revealed that the company has developed two new platforms that scale further than Hadoop, and that Facebook intends to open source both of them.

The first platform, called Corona, lets you run a large number of tasks across many Hadoop servers without the risk of a single task crashing the entire cluster. The other, more intriguing platform, called Prism, makes it possible to run a Hadoop cluster so large that it spans data centers around the world.

Parikh said the system would "allow the data to move wherever we want it, whether that is Prineville, Oregon; Forest City, North Carolina; or Sweden."

Hadoop is modeled on two papers Google published roughly a decade ago describing the massive software platforms it had built: GFS and MapReduce. GFS, short for Google File System, stores data across thousands of servers, and MapReduce lets you compute the desired results using the processing power of all those servers. Hadoop works the same way, with its counterparts to GFS and MapReduce called HDFS and Hadoop MapReduce.
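To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real cluster would run the map and reduce steps on many servers in parallel.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # On a real cluster, each document (or file block) would be mapped
    # on a different server; here the loop stands in for that fan-out.
    emitted = []
    for doc in documents:
        emitted.extend(map_phase(doc))
    grouped = shuffle(emitted)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["big data big cluster", "data center"])
print(counts)
```

The point of the split is that map calls are independent of each other, and so are reduce calls for different keys, which is what lets the work spread across thousands of servers.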

The two Hadoop platforms have been used for years by companies such as Yahoo and Facebook, but they are not perfect, especially now that Facebook has more than 900 million users. The most prominent problem is the "single point of failure": if the master server that manages the cluster goes down, the entire cluster goes down with it (at least temporarily).

In recent months, Facebook has developed a technology called AvatarNode to eliminate the single point of failure in HDFS, and the Hadoop open source community has implemented a similar solution, HA NameNode, that improves availability. MapReduce, however, still had a single point of failure. Facebook has now solved that problem with Corona.

Traditionally, MapReduce uses a single job tracker to manage tasks across a server cluster, while Corona creates multiple job trackers. Parikh says this lets Facebook run more work on the same MapReduce platform with higher overall throughput, so that more teams and products can run tasks on the cluster.

In the past, a problem with the job tracker would kill every task in the system and force you to restart everything. A single server failure affected the entire system. Now the system contains many small job trackers, each responsible only for its own tasks.
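The difference between one shared tracker and many small trackers can be illustrated with a toy model. This is only a sketch of the failure-isolation idea, not Corona's implementation, which the article does not describe in detail; the function names and job names are hypothetical.

```python
def run_with_single_tracker(jobs, tracker_fails):
    """One shared tracker: if it fails, every job in the cluster is lost."""
    if tracker_fails:
        return {job: "failed" for job in jobs}
    return {job: "done" for job in jobs}

def run_with_per_job_trackers(jobs, failed_trackers):
    """One tracker per job: a failure only affects the job it was managing."""
    return {
        job: ("failed" if job in failed_trackers else "done")
        for job in jobs
    }

# Hypothetical job names, used only for illustration.
jobs = ["ads_report", "feed_ranking", "photo_stats"]

single = run_with_single_tracker(jobs, tracker_fails=True)
isolated = run_with_per_job_trackers(jobs, failed_trackers={"feed_ranking"})

print(single)
print(isolated)
```

In the shared-tracker case one failure marks all three jobs as failed, while in the per-job case only the job whose tracker failed needs to be rerun.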

Tomer Shiran, one of the earliest employees of Silicon Valley start-up MapR, says the company's own Hadoop distribution has a similar feature, and points out that the open source version of Hadoop has no comparable multi-tracker implementation. He has seen a version of Corona, and found that MapReduce tasks on the platform start much faster.

Jay Parikh says very little about the Corona platform, but the system is clearly already in use inside Facebook, and Facebook really needs it. Parikh says Facebook runs the world's largest Hadoop cluster, which holds more than 100PB of data and can process 105TB of data every half hour.
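For a sense of scale, the figures above can be converted into more familiar units. The sketch below assumes decimal prefixes (1PB = 10^15 bytes); the article does not say whether decimal or binary units are meant, so treat the results as approximate.

```python
# Decimal (SI) unit sizes in bytes; an assumption, see the note above.
GB = 10**9
TB = 10**12
PB = 10**15

# Cluster size: 100PB expressed in GB.
cluster_bytes = 100 * PB
cluster_gb = cluster_bytes // GB

# Processing rate: 105TB every half hour, expressed in GB per second.
processed_bytes = 105 * TB
window_seconds = 30 * 60
throughput_gb_per_s = processed_bytes / GB / window_seconds

print(f"Cluster size: {cluster_gb:,} GB")
print(f"Throughput:   {throughput_gb_per_s:.1f} GB/s")
```

Under these assumptions the cluster holds 100 million GB and sustains roughly 58GB per second.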

But even this cluster will not be enough for Facebook. With 900 million users constantly updating their status, posting photos and videos, and writing comments, you can imagine how fast the data grows. This is why Parikh and his colleagues are building Prism, a cluster that spans data centers.

Because the network links between data centers are not fast enough, Hadoop jobs generally do not run across geographically separated data centers. "One of the big drawbacks of Hadoop is that all the servers must sit next to each other," he said. "The system is very tightly coupled, and if the latency between servers increases by even tens of microseconds, the entire system slows to a crawl."

Prism is different. In short, its specialty is the ability to automatically replicate and move data between computing nodes on different networks as needed. "This allows us to build multiple separate data centers that the system still sees as a single system," he said. "We can move data based on cost, performance, and technical factors ... We are no longer limited to the computing power of a single data center."

Prism is reminiscent of Google's Spanner platform. Little is known about Spanner, since Google keeps quiet about the design of its infrastructure, but in 2009 Google publicly described the system as "a storage and computing facility that leverages all of our data centers (disk and computing power), automatically replicating and redistributing data and computation based on resource constraints and usage patterns."

Google claims the platform provides "the ability to automatically allocate resources on any server", covering 36 data centers around the world.

Parikh admits Prism is similar to Spanner, but he cautions that he doesn't know much about Spanner. Prism may also be able to redistribute data to other data centers immediately when one data center goes down.

Tomer Shiran says such platforms are used only inside Google or Facebook and have no open source implementation. But he also points out that few companies need anything this advanced: "No other company's data reaches the scale of what Google processes."

Facebook has not yet deployed Prism, and Parikh has not given a clear timeline. But he said it might be open sourced, and Corona could be as well. It's true that no company today needs to deal with as much online data as Google and Facebook, but that will change. "They need to face the challenge of the next wave of data-scale growth," Parikh said.

(Responsible editor: Lu Guang)
