Ubuntu system (the version I use is 14.04)
Ubuntu is a desktop-oriented Linux operating system built on the Debian distribution and the GNOME desktop environment. The goal of Ubuntu is to provide an up-to-date, yet fairly stable, operating system built primarily from free software for the general user, free of charge and with both community and professional support.
For a Hadoop big data development and test environment, it is recommended that you do not install Cygwin on Windows to learn or experiment; instead, learn directly with VMware + Ubuntu.
Download VMware from www.vmware.com and Ubuntu from www.ubuntu.com.
Introduction to Hadoop (the version I use is 1.2.1)
Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of the distribution, and take advantage of the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant, is designed to be deployed on inexpensive (low-cost) hardware, and provides high-throughput access to application data, which makes it suitable for applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access). The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides computation over massive amounts of data.
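To give a feel for how an application talks to HDFS, here is a minimal sketch using the Hadoop 1.x Java filesystem API. The path /user/demo/hello.txt and the file contents are made up for illustration, and the cluster address is assumed to come from a core-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a small file into HDFS, then print its size.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up fs.default.name from core-site.xml
        FileSystem fs = FileSystem.get(conf);            // client handle to HDFS
        Path file = new Path("/user/demo/hello.txt");    // illustrative path, not from the original text

        FSDataOutputStream out = fs.create(file, true);  // create the file, overwriting if it exists
        out.writeBytes("hello hdfs\n");
        out.close();

        System.out.println("size = " + fs.getFileStatus(file).getLen() + " bytes");
        fs.close();
    }
}
```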
The idea of MapReduce programming
MapReduce is a programming model for parallel processing of large data sets (larger than 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it much easier for programmers to run their own programs on a distributed system without any experience in distributed parallel programming. In the current software implementation, you specify a Map function that maps a set of key-value pairs into a new set of intermediate key-value pairs, and a Reduce function that merges all intermediate values associated with the same key.
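As a concrete illustration of this model, here is a minimal word-count sketch against the Hadoop 1.x org.apache.hadoop.mapreduce API: the mapper emits (word, 1) for every word in a line, and the reducer receives all counts for the same word and sums them. The class names are assumptions made for this example, not part of the original text.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each input line, emit (word, 1) for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: all (word, 1) pairs with the same word arrive together; sum them up.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```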
What can Hadoop do?
Many people may never have had to work with large amounts of data. Suppose a website gets tens of millions of visits per day, so its servers generate huge volumes of logs of all kinds, and one day the boss asks: which region's visitors access the site the most, and what are the exact numbers? I asked this in a Hadoop group; many people said they could write a program to do it, and some said they would write a dedicated distributed system for the calculation. Writing your own certainly proves your ability, but what happens when the boss then asks which age group visits the most? Write another distributed system for that? It would be a waste of manpower and resources, and even a very well-written system has never been tested by real users in the market, so its reliability is uncertain. Hadoop can help you solve all of these problems: you only need to write the specific Java business-logic code, and the system stays stable and keeps scaling as the business and the data grow. Hadoop is commonly used for data statistics, for example counting how many times a word occurs in files of tens of gigabytes, finding the largest value among countless numbers, or computing marketing statistics from the logs your programs collect, helping you decide market positioning and promotion direction.
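To show what "writing only the business-logic code" looks like in practice, the following driver sketch wires the mapper and reducer from the earlier example into a job and submits it with Hadoop 1.x. The jar name, class names, and input/output paths are illustrative assumptions, and the classes are assumed to live in the same package.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the word-count job and submits it to the cluster, e.g.
//   hadoop jar wordcount.jar WordCountDriver /logs/input /logs/output
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");           // Job constructor used in Hadoop 1.x
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);    // optional local pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```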