The rise of Hadoop: from a small elephant to an industry giant
With the rapid development of the Internet, the mobile Internet, the Internet of Things, and cloud computing, data in every industry is growing explosively. This flood of information is once again reshaping the cloud era: the age of big data has arrived.
How can users extract the information that is useful to them from this vast sea of data? Doing so requires big data analysis techniques and tools, and traditional business intelligence (BI) tools cannot cope with data at this scale. Any discussion of big data quickly turns to the technologies associated with it: Hadoop, MapReduce, HBase, NoSQL, and so on. Many vendors in the industry have begun building their own big data solutions, and Hadoop, much like the Linux open source operating system in its day, has quickly become the mainstream platform for researching and designing them.
Gorgeous Metamorphosis
Hadoop's development has followed a familiar path: it began as an open source Apache Foundation project, and as more and more users adopted it, used it, contributed to it, and improved it, a strong ecosystem gradually formed around it.
With the growth of cloud computing and big data, Hadoop has become a distributed computing platform that users can navigate and use with ease. Users can develop and run applications that process massive amounts of data on Hadoop without knowing the underlying details of the distributed system, and can harness the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant, is designed to be deployed on inexpensive hardware, and provides high-throughput access to application data, which suits applications with very large datasets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream.
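To make the streaming access concrete, here is a minimal sketch of reading a file out of HDFS with Hadoop's Java FileSystem API. The NameNode address and file path are hypothetical placeholders, not values from this article; in practice the address usually comes from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally supplied by core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        // Open the file and consume it as an ordinary stream of lines.
        try (FSDataInputStream in = fs.open(new Path("/data/sample.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

The same FileSystem interface works against a local file system, which is why applications written this way can move between a developer machine and a cluster without code changes.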
Hadoop is best known as the tool behind keyword classification for Internet search, but it can also solve many problems that demand extreme scalability. For example, what would happen if you had to grep a 100 TB file? On a traditional system this would take a very long time. Hadoop was designed with exactly these problems in mind: its parallel execution model can improve efficiency dramatically.
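The "grep at scale" idea maps naturally onto MapReduce. Below is a minimal sketch of a distributed grep written as a map-only job: every mapper scans its own slice of the input in parallel and emits the lines that match a regular expression. The class name, paths, and default pattern are illustrative assumptions, not part of Hadoop's bundled examples.

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistributedGrep {
    public static class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private Pattern pattern;

        @Override
        protected void setup(Context context) {
            // The search pattern is passed in through the job configuration.
            pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", "ERROR"));
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit every input line that matches the pattern.
            if (pattern.matcher(value.toString()).find()) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("grep.pattern", args.length > 2 ? args[2] : "ERROR");
        Job job = Job.getInstance(conf, "distributed grep");
        job.setJarByClass(DistributedGrep.class);
        job.setMapperClass(GrepMapper.class);
        job.setNumReduceTasks(0);          // map-only: no aggregation step needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the input file is split into blocks spread across the cluster, each mapper works on a local block, which is where the speedup over a single-machine grep comes from.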
Today, applications built on Hadoop have sprung up everywhere: Yahoo runs Hadoop clusters to support research on its advertising systems and web search; Facebook runs Hadoop to support data analysis and machine learning; Baidu uses Hadoop to analyze search logs and mine web data; and Taobao's Hadoop system stores and processes e-commerce transaction data.
After nine years of long-distance running, Hadoop has transformed from a fledgling little elephant into an industry giant, but it must still guard against complacency and keep improving.
Great performance improvement
Hadoop is also a software framework for the distributed processing of large amounts of data, and it handles that data in a reliable, efficient, and scalable way.
Hadoop is reliable because it assumes that compute elements and storage will fail, so it maintains multiple copies of the working data and ensures that processing can be redistributed around failed nodes.
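That fault tolerance rests on HDFS block replication. As a minimal sketch, the replication factor can be set for a client through the dfs.replication property or changed per file through the FileSystem API; the values and the file path below are illustrative assumptions, not recommendations from this article.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // keep three copies of each block (the common default)
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Raise the replication factor of an existing (hypothetical) file to 5,
        // e.g. for a hot dataset that many tasks will read concurrently.
        fs.setReplication(new Path("/data/hot/dataset.csv"), (short) 5);
    }
}

When a node holding one copy of a block fails, the NameNode schedules new replicas on healthy nodes, which is what allows processing to be redistributed without data loss.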
Hadoop is efficient because it works in parallel, speeding up processing by spreading it across nodes. Hadoop can bring thousands of nodes into a computation, which gives it enormous performance potential. But not all work can be parallelized, such as data analysis that depends on user interaction. And if an application is not designed with a Hadoop cluster in mind, performance will not be ideal, because each map/reduce task must wait for the previous job to complete.
Intel has launched the Intel Hadoop distribution as part of its open architecture core product line for big data, aiming to deliver "hardware-software synergy, experience first" to users. For example, by optimizing network and I/O technology on the Intel Xeon processor platform and pairing it with the Intel Hadoop distribution, an analysis of 1 TB of data that previously took 4 hours to process can now be completed in just 7 minutes, greatly improving the speed of big data analysis.