To know and learn Hadoop, we first have to understand how Hadoop is composed. Based on my own experience, I will introduce Hadoop from three aspects: its components, the big-data processing flow, and the Hadoop core:
Hadoop
1. Introduction: import the source code into Eclipse to make it easy to read and modify the source.
2. Environment: Mac; Maven tooling (Apache Maven 3.3.3); Hadoop (CDH 5.4.2).
3. Steps: go to the Hadoop root directory and execute:
mvn org.apache.maven.plugins:maven-eclipse-plugin:2.6:eclipse -DdownloadSources=true -DdownloadJavadocs=true
Note: if you do not specify the plugin's version number, you will get an error.
As a matter of fact, you can easily configure the distributed framework's runtime environment by following the official Hadoop documentation. Still, it is worth writing a bit more here and paying attention to some details, because these details would otherwise take a long time to discover. Hadoop can run on a single machine in standalone mode, or you can configure a pseudo-distributed cluster that runs on a single machine.
This section explains the capabilities of each component. The Hadoop ecosystem contains more than ten components or sub-projects, which brings challenges in installation, configuration, cluster-scale deployment, and management. The main Hadoop components include: Hadoop: a software framework for distributed processing of large data sets.
I. The core components of Hadoop
The components of Hadoop are shown in the figure, but the core components are MapReduce and HDFS.
1. The architecture of HDFS
We first introduce the architecture of HDFS, which uses a master/slave architecture model.
HDFS is not only one of Hadoop's basic components but also the underlying distributed file system for Bigtable-like systems (such as HBase and Hypertable). HDFS adopts a master/slave architecture. An HDFS cluster is composed of one NameNode and a certain number of DataNodes. The NameNode is a central server responsible for managing the file system namespace and client access to files. A DataNode is typically one node in the cluster, responsible for managing the storage attached to that node. Internally, a file is split into one or more blocks, and these blocks are stored on a set of DataNodes.
The Secondary NameNode assists in recovering the NameNode, but it is not a hot standby for the NameNode.
3. MapReduce (distributed computing framework)
Hadoop MapReduce is a clone of Google MapReduce, which Google described in a paper published in December 2004. MapReduce is a computational model for processing large volumes of data: map performs a specified operation on each independent element of the data set, producing intermediate results as key-value pairs, and reduce then merges all intermediate values that share the same key to obtain the final result.
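To make the map/reduce model concrete, here is a minimal word-count sketch in Java against the Hadoop mapreduce API. This is an illustrative sketch of the standard pattern, not code taken from this article:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // map: apply an operation to each independent element (here: each word in a line)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE); // emit an intermediate key-value pair
            }
        }
    }
    // reduce: merge all intermediate values that share the same key
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(key, new IntWritable(sum));
        }
    }
}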
Several columns form a column family, and the columns under one family are stored together in an HFile, which makes data caching easier. Tables are sparsely stored, so users can define different columns for different rows. In HBase, data is sorted by row key, and a table is divided into multiple HRegions by row key, as shown in the following figure (HBase data table structure chart):
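(The figure itself is not reproduced here.) To make the row-key/column-family model concrete, here is a minimal sketch using the standard HBase client API; the table name "demo_table" and the column family "cf" are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo_table"))) {
            // rows are addressed by row key; every cell lives under a column family
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("city"), Bytes.toBytes("Beijing"));
            table.put(put);
        }
    }
}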
OK, having written this far, the article has grown rather long; if it has become a burden to read, that was not my intention.
In addition to coping better with possibly sparse n-gram data, using only the outermost words of an n-gram also helps avoid duplicate computation. In general, we compute over the 2-, 3-, 4-, and 5-gram data sets.
The MapReduce pseudocode to implement this solution looks like this:
def map(record):
    (ngram, year, count) = unpack(record)
    # ensure word1 comes first alphabetically, so that (a, b) and (b, a)
    # are counted under the same key
    (word1, word2) = sorted((ngram[0], ngram[-1]))
    key = (word1, word2, year)
    emit(key, count)

def reduce(key, values):
    emit(key, sum(values))
Hadoop's new MapReduce framework YARN explained in detail: http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop-yarn/
Launched in 2005, Apache Hadoop provides the core MapReduce processing engine to support distributed processing of large-scale data workloads. Seven years later, Hadoop ...
Hardware
These MapReduce jobs were executed on a random subset of approximately 20 GB of data. The complete data set contains 1,500 files; we used a script to select a random subset. It is important to keep the file names intact, because the file name determines the value of n for the n-grams in that block of data. The Hadoop cluster consists of five virtual nodes.
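The selection script itself is not included in this excerpt; a minimal sketch of what such a random-subset selection could look like (in Java, with a hypothetical input directory and a fixed seed for reproducibility) is:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class SampleFiles {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args[0]);               // directory holding the input files
        int sampleSize = Integer.parseInt(args[1]);  // how many files to sample
        List<Path> files = Files.list(dir).collect(Collectors.toList());
        Collections.shuffle(files, new Random(42));  // fixed seed: same subset every run
        // keep the file names intact: the name encodes the n of each n-gram block
        for (Path f : files.subList(0, Math.min(sampleSize, files.size()))) {
            System.out.println(f.getFileName());
        }
    }
}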
for analysis and processing. (5) /app: non-data files, such as configuration files, JAR files, SQL files, and so on. Mastering the steps above for applying HDFS is important and meaningful, but you should proceed gradually according to your own situation and focus on hands-on practice in order to keep improving. I usually like to work through case analyses to sharpen my skills, much like what the "Big Data CN" service platform offers. But real understanding comes from practice.
Hadoop is a distributed storage and computing platform for big data.
Architecture of HDFS: a master/slave architecture.
There is only one master node, the NameNode; there can be many slave nodes, the DataNodes.
The NameNode is responsible for:
(1) Receiving user requests
(2) Maintaining the directory structure of the file system
(3) Managing the mapping between files and blocks, and between blocks and DataNodes
The DataNode is responsible for:
(1) Storing files
(2) Splitting files into blocks that are stored on disk
(3) Keeping multiple replicas of each block to guarantee data safety
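As an illustration of this division of labor, the sketch below asks the NameNode for a directory listing and for the block locations of one file; the blocks themselves would then be read from the DataNodes. The fs.defaultFS address and the /hello path are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMetadataDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        // directory structure: metadata served by the NameNode
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        // file-to-block mapping: which DataNodes hold each block
        FileStatus file = fs.getFileStatus(new Path("/hello")); // assumed file
        for (BlockLocation loc : fs.getFileBlockLocations(file, 0, file.getLen())) {
            System.out.println(loc);
        }
        fs.close();
    }
}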
Remote debugging is very useful in application development. For example, you may be developing a program for a low-end machine that cannot host the development platform, or debugging a program on a dedicated machine, such as a web server whose service cannot be interrupted. Other scenarios include Java applications running on devices with little memory or low CPU performance (such as mobile devices), or developers who want to separate the application from the development environment.
To perform remote debugging, you must start the target JVM with the JDWP debug agent enabled, for example: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000.
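A quick way to confirm from inside a running JVM that the debug agent was actually picked up is to print the JVM's startup arguments. This is a small illustrative sketch, not part of the original article:

import java.lang.management.ManagementFactory;

public class DebugAgentCheck {
    public static void main(String[] args) {
        // when remote debugging is active, an -agentlib:jdwp=... option
        // should appear in this list
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            System.out.println(arg);
        }
    }
}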
// Define the input path
private static final String INPUT_PATH = "hdfs://liaozhongmin:9000/hello";
// Define the output path
private static final String OUT_PATH = "hdfs://liaozhongmin:9000/out";

public static void main(String[] args) {
    try {
        // Create the configuration
        Configuration conf = new Configuration();
        /**********************************************/
        // Compress the map-side output
        // conf.setBoolean("mapred.compress.map.output", true);
        // Set the compression codec class used for the map-side output
        // (likely completion of the truncated original):
        // conf.setClass("mapred.map.output.compression.codec",
        //         GzipCodec.class, CompressionCodec.class);
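Note that mapred.compress.map.output is the old property name; on Hadoop 2.x the equivalent settings are mapreduce.map.output.compress and mapreduce.map.output.compress.codec. A sketch with the newer names (SnappyCodec is chosen here only as an example; it requires the native Snappy library):

// Hadoop 2.x property names for map-output compression
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
        org.apache.hadoop.io.compress.SnappyCodec.class,
        org.apache.hadoop.io.compress.CompressionCodec.class);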
1. Resource management in Hadoop 2.0: http://dongxicheng.org/mapreduce-nextgen/hadoop-1-and-2-resource-manage/
Hadoop 2.0 refers to the Apache Hadoop 0.23.x and 2.x versions, or the CDH4 series. Its core consists of three systems: HDFS, MapReduce, and YARN, where YARN is a resource management system in charge of cluster resource management and scheduling.
A program developed with VS2005 + VB.NET + Oracle + ADO.NET throws an error on a customer's machine:
.NET Framework data providers require Microsoft Data Access Components (MDAC). Install Microsoft Data Access Components (MDAC) 2.6 or later.
When installing MDAC 2.8, the system prompts that the installation cannot be performed on the current version of the operating system.
with no intermediate state.
6. Sequential: for all servers, the same messages are published in a consistent order.
Basic principle
[Figure: ZooKeeper architecture diagram]
There are many servers with master and slave roles: one of them is the leader and the others are followers. Each server holds a copy of the data in memory; when ZooKeeper starts, a leader is elected from among the instances.
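To see the ordering guarantee from a client's point of view, here is a minimal sketch with the standard ZooKeeper Java client; the connection string localhost:2181 and the /demo path are assumptions:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        // connect to the ensemble; writes are forwarded to the leader, which
        // broadcasts them to the followers in a single consistent order
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
        zk.create("/demo", "hello".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        System.out.println(new String(zk.getData("/demo", false, null)));
        zk.close();
    }
}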
YARN's design originates from the shortcomings of MRv1 (classic Hadoop MapReduce) described above, such as:
Limited scalability;
The JobTracker is a single point of failure;
It is difficult to support computation models other than MapReduce;
Multiple computing frameworks contend with one another and data sharing is difficult, for example between MR (the offline computing framework) and Storm (the real-time computing framework).