Wikipedia's description of Hadoop (English)

From Wikipedia, the free encyclopedia

Apache Hadoop
Developed by: Apache Software Foundation
Latest release: 0.18.2 / 3 November 2008
Written in: Java
OS: Cross-platform
Type: Distributed file system
License: Apache License 2.0
Website: http://hadoop.apache.org/

Apache Hadoop is a free Java software framework that supports data-intensive distributed applications.[1] It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.

Hadoop is a top-level Apache project, being built and used by a community of contributors from all over the world.[2] Yahoo! has been the largest contributor[3] to the project and uses Hadoop extensively in its web search and advertising businesses.[4] IBM and Google have announced a major initiative to use Hadoop to support university courses in distributed computer programming.[5]

Hadoop was created by Doug Cutting (now a Yahoo! employee), who named it after his child's stuffed elephant. It was originally developed to support distribution for the Nutch search engine project.[6]

Architecture

Hadoop consists of the Hadoop Core, which provides access to the filesystems that Hadoop supports. As of June 2008, the list of supported filesystems includes:

HDFS: Hadoop's own filesystem. This is designed to scale to petabytes of storage, and runs on top of the filesystems of the underlying operating systems.
Amazon S3 filesystem: targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure. There is no rack-awareness in this filesystem, as it is all remote.
Kosmos Distributed File System: like HDFS, this is rack-aware.
FTP filesystem: all the data are stored on remotely accessible FTP servers.
Read-only HTTP and HTTPS filesystems.
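
All of these are reached through the same client-side abstraction, so switching backends is a matter of configuration rather than code. The following is a minimal sketch, not taken from the article, of how a client resolves a filesystem from a URI scheme via Hadoop's FileSystem API; the host names and paths are illustrative placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The same client code works against any supported scheme:
            // hdfs://, s3://, kfs://, ftp://, http://, https://
            String[] uris = { "hdfs://namenode:9000/", "ftp://ftp.example.com/" };
            for (String u : uris) {
                FileSystem fs = FileSystem.get(URI.create(u), conf);
                // List the root directory regardless of which backend serves it.
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath());
                }
            }
        }
    }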

Hadoop Distributed File System

The HDFS filesystem is a pure-Java filesystem, which stores large files (an ideal file size is a multiple of 64 MB[7]) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack.
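
Replication is a per-file attribute, and can be changed after a file has been written; the extra block copies are made in the background. A minimal sketch, assuming a reachable name node at the placeholder address hdfs://namenode:9000:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

            // Write a file with the cluster's default replication (typically 3).
            Path file = new Path("/user/demo/data.txt");   // placeholder path
            FSDataOutputStream out = fs.create(file);
            out.writeUTF("hello hdfs");
            out.close();

            // Raise the replication factor of the existing file to 4;
            // HDFS copies the extra replicas in the background.
            fs.setReplication(file, (short) 4);
        }
    }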

The filesystem is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS. They also serve the data over HTTP, allowing access to all content from a web browser or other client. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.

A filesystem requires one unique server, the name node. This is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. To reduce the impact of such an event, some sites use a secondary name node for failover. Many sites stick to a single name node, relying on the name node to replay all outstanding operations when it comes back up. This replay process can take over half an hour for a big cluster.[8]

Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS filesystem is an action that often needs to be performed before and after executing a job, which can be inconvenient. A filesystem in userspace has been developed to address this problem, at least for Linux and some other Unix systems.
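
In the meantime, the copy-in/copy-out step is typically scripted against the FileSystem API or the equivalent command-line tools. A minimal sketch, with purely illustrative paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StagingDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Uses the default filesystem named in the configuration.
            FileSystem fs = FileSystem.get(conf);

            // Before the job: stage local input into HDFS.
            fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                                 new Path("/jobs/in/input.txt"));

            // After the job: retrieve the results.
            fs.copyToLocalFile(new Path("/jobs/out/part-00000"),
                               new Path("/tmp/result.txt"));
        }
    }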

Job Tracker and Task Tracker: the map/reduce engine

Above the filesystems comes the map/reduce engine, which consists of one Job Tracker, to which client applications submit map/reduce jobs. The Job Tracker pushes work out to available Task Tracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the Job Tracker knows which node the data lives on, and which other machines are nearby. If the work cannot be hosted on the actual node where the data lives, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a Task Tracker fails or times out, that part of the job is rescheduled. If the Job Tracker fails, the entire job is lost and must be resubmitted.
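
The canonical example of such a job is word counting. The sketch below, which follows the org.apache.hadoop.mapred API of this era rather than any code from the article, is submitted by the client to the Job Tracker, which then distributes the map and reduce tasks to Task Trackers; input and output paths are taken from the command line:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> out,
                            Reporter reporter) throws IOException {
                // Emit (word, 1) for every token in the input line.
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    out.collect(word, ONE);
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> out,
                               Reporter reporter) throws IOException {
                // Sum the partial counts produced by the mappers.
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                out.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);  // local pre-aggregation
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);  // submit to the Job Tracker and wait
        }
    }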

Known limitations of this approach are:

The Job Tracker is a single point of failure for submitted work. There is (currently) no checkpointing or recovery within a single map/reduce job.
The allocation of work to Task Trackers is very simple. Every Task Tracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The Job Tracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current active load of the allocated machine, and hence of its actual availability. If one Task Tracker is very slow, it can delay the entire job.

Other applications

The HDFS filesystem is not restricted to map/reduce jobs. It can be used for other applications, many of which are under way at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and matrix operations. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, very data-intensive, and able to work on pieces of the data in parallel.

Prominent users

Hadoop at Yahoo!

In February 2008, Yahoo! launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is now used in every Yahoo! web search query.[9]

There are multiple Hadoop clusters at Yahoo!, each occupying a single datacenter (or a fraction thereof). No HDFS filesystems or map/reduce jobs are split across multiple datacenters; instead, each datacenter has a separate filesystem and workload. The cluster servers run Linux, and are configured on boot using Kickstart. Every machine bootstraps the Linux image, including the Hadoop distribution. Cluster coordination is also aided through a program called ZooKeeper. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine.

Other users

Besides Yahoo!, many other organizations are using Hadoop to run large distributed computations. Some of them include:[10]

A9.com, ADSDAQ by ContextWeb, Facebook, Fox Interactive Media, IBM, ImageShack, ISI, Joost, Last.fm, Powerset, The New York Times, Rackspace, Veoh

Hadoop on Amazon EC2/S3 services

It is possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).[11] As an example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240 (not including bandwidth).[12]

There is support for the S3 filesystem in Hadoop distributions, and the Hadoop team generates EC2 machine images after every release. From a pure performance perspective, Hadoop on S3/EC2 is inefficient, as the S3 filesystem is remote and every write operation is delayed until the data are guaranteed not to be lost. This removes the locality advantages of Hadoop, which schedules work near data to save on network load. However, as Hadoop-on-EC2 is the primary mass-market way to run Hadoop without one's own private cluster, the performance penalty is clearly felt to be acceptable to the users.
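
From the client's point of view, using S3 instead of HDFS is a configuration change, which is what makes this deployment style convenient despite the performance cost. A minimal sketch, with a placeholder bucket name and credentials:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3Demo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");      // placeholder
            conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder

            // The s3:// scheme stores file blocks in the bucket instead of
            // on local datanodes, so every write crosses the network.
            FileSystem s3 = FileSystem.get(URI.create("s3://example-bucket/"), conf);
            System.out.println(s3.exists(new Path("/data/input")));
        }
    }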

Hadoop with Sun Grid Engine

Hadoop can also be used in compute farms and high-performance computing environments. Integration with Sun Grid Engine was released, and running Hadoop on Sun Grid (Sun's on-demand utility computing service) is possible.[13] Note that, as with EC2/S3, the CPU-time scheduler appears to be unaware of the locality of the data. A key feature of the Hadoop runtime, doing the work on the same server or rack as the data, is therefore lost.

Sun also has the Hadoop Live CD OpenSolaris project, which allows running a fully functional Hadoop cluster using a live CD.[14] Sun plans to enhance the Grid Engine/Hadoop integration in the near future.[15]

References

^ "Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework." Hadoop Overview
^ Hadoop Users List
^ Hadoop Credits Page
^ Yahoo! Launches World's Largest Hadoop Production Application
^ Google Press Center: Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges
^ "Hadoop contains the distributed computing platform that was formerly a part of Nutch. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of map/reduce." About Hadoop
^ The Hadoop Distributed File System: Architecture and Design
^ Improve Namenode startup performance. "Default scenario for 20 million files with the max Java heap size set to 14GB: 40 minutes. Tuning various Java options such as young size, parallel garbage collection, initial Java heap size: 14 minutes."
^ Yahoo! Launches World's Largest Hadoop Production Application (Hadoop and Distributed Computing at Yahoo!)
^ PoweredBy
^ Running Hadoop on Amazon EC2/S3, http://aws.typepad.com/aws/2008/02/taking-massive.html
^ Self-service, Prorated Super Computing Fun! - Open - Code - New York Times Blog
^ "Creating Hadoop pe under SGE". Sun Microsystems (2008-01-16).
^ "OpenSolaris Project: Hadoop Live CD". Sun Microsystems (2008-08-29).
^ "OpenSolaris Live Hadoop with HPC Stack". Sun Microsystems (2008-09-03).

See also

HBase - BigTable-model database. Sub-project of Hadoop.
Cloud computing

External links

Video and podcast (49:44) - Yahoo!'s Parand Darugar explains Hadoop
Hadoop website
Hadoop wiki
Hadoop Distributed File System requirements
Yahoo's bet on Hadoop, an article about Yahoo's investment in Hadoop from Tim O'Reilly
Yahoo's Doug Cutting on MapReduce and the Future of Hadoop
An article on MapReduce programming with Apache Hadoop
A NYT blog post mentioning that Hadoop and EC2 were used to reprocess all of The New York Times archive content
Mention of Nutch and Hadoop in an article about Google
IBM MapReduce Tools for Eclipse
Problem solving in large-scale clusters using Hadoop
Pig, a high-level language over the Hadoop platform
Hive, a data warehousing framework for Hadoop that includes a SQL-like declarative query language
Cloudera, a company started by Hadoop and open source veterans from Google, Yahoo, Facebook and Oracle to provide commercial support and training for Hadoop
CloudBase, a data warehousing system built on top of Hadoop that uses ANSI SQL as its query language. CloudBase creates a database system directly on flat files and converts input ANSI SQL expressions to map/reduce. CloudBase comes with a JDBC driver, so one can use any JDBC database manager application (e.g. SQuirreL) as a client to connect to CloudBase.

 
