Recently I considered using Hadoop MapReduce to analyze data stored in MongoDB. I found some demos on the Internet, pieced them together, and finally got one running. The process is shown below.

Environment:
Ubuntu 14.04 64bit
Hadoop 2.6.4
MongoDB 2.4.9
Java 1.8
mongo-hadoop-core-1.5.2.jar
mongo-java-driver-3.0.4.jar
Download and configure mongo-hadoop-core-1.5.2.jar and mongo-java-driver-3.0.4.jar.
Compiling mongo-hadoop-co
Today I started working through the examples in the book MapReduce Design Patterns. I think this book is very good for learning MapReduce programming; after finishing it, you should be able to handle most of the MapReduce problems you meet. Let's start with the first piece: a program that counts the frequency of a word in comment.xm
Label: Summary
The previous article introduced several simple aggregation operations, COUNT, GROUP, and DISTINCT, of which group was a bit more troublesome. This article covers MapReduce.
Related articles: Getting started with MongoDB; MongoDB inserts, deletes, and updates; MongoDB count, group, distinct.
Batch scripts: Today I found that opening the MongoDB server and client by hand every time is too tedious, so I wrote batch files to start them. Open the server:
@echo off
cd /d C:\Prog
The following screen appears; configure the Hadoop cluster information here. It is important to fill in the cluster information correctly. Because I was developing against a fully distributed Hadoop cluster through an Eclipse remote connection from Windows, the host here is the IP address of the master node. If Hadoop is pseudo-distributed, localhost can be filled in. For "User name", fill in the user name of the Windows computer (right-click "My Computer" > "Manage" > "Local Users and Groups" to view or modify the user name).
MapReduce Working Principles
Body: 1. The MapReduce job running process
Process analysis:
1. Start a job on the client.
2. Request a job ID from the JobTracker.
3. Copy the resource files required to run the job to HDFS, including the jar file packaged from the MapReduce program, the configuration files, and the input split information computed by the client.
The input data is as follows, separated by \t (the left side is the search term, the right side is the category), for example:
0-3 years old parenting encyclopedia book	5
V Liquid Level Sensor	5
0-5 bearings	2
…
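As a rough illustration, the map side would split each such line at the tab character. A minimal plain-Java sketch (the class and method names are my own for illustration, not from the article):

```java
public class QueryRecord {
    // Parse one tab-separated line: search term on the left, category on the right.
    static String[] parse(String line) {
        int tab = line.indexOf('\t');
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parse("0-5 bearings\t2");
        System.out.println(kv[0] + " -> " + kv[1]); // prints: 0-5 bearings -> 2
    }
}
```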
MapReduce and Spark are the two cores of the data-processing layer and a key link that anyone learning big data must focus on; I will share this knowledge based on my own experience. First, look at MapReduce and its two most essential process
This section mainly analyzes the principles and processes of MapReduce.
You must at least know the following points about MapReduce:
1.
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
FileSystem local = FileSystem.getLocal(conf);
// Set input directory and output file
Path inputDir = new Path(args[0]);
Path hdfsFile = new Path(args[1]);
try {
    // Get a list of local files
    FileStatus[] inputFiles = local.listStatus(inputDir);
    // Generate HDFS output stream
    FSDataOutputStream out = hdfs.create(hdfsFile);
    for (int i = 0; i < inputFiles.length; i++) {
        System.out.println(inputFiles[i].getPath().getName());
        // Open local input stream
        FSDataInputStream in = local.open(inputFiles[i].getPath());
        byte[] buffer = new byte[256];
        int bytesRead = 0;
        while ((bytesRead = in.read(buffer)) > 0) {
            out.write(buffer, 0, bytesRead);
        }
        in.close();
    }
    out.close();
} catch (IOException e) {
    e.printStackTrace();
}
MapReduce implements a simple word-counting function.
1. Preparation: install the Hadoop plugin for Eclipse. Download the matching version of hadoop-eclipse-plugin-2.2.0.jar into eclipse/plugins.
2. Implementation: create a new MapReduce project. The map step tokenizes words; the reduce step counts them.

package tank.demo;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.h
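The map/reduce division of labor just described can be sketched in plain Java with stdlib collections, simulating the two phases without a cluster (class and method names here are illustrative, not Hadoop's actual Mapper/Reducer API):

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // "map" phase: split one line into (word, 1) pairs
    static Stream<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(line);
        while (tok.hasMoreTokens()) {
            pairs.add(Map.entry(tok.nextToken(), 1));
        }
        return pairs.stream();
    }

    // "reduce" phase: group by word and sum the counts
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(WordCountSketch::map)
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hello world", "hello hadoop")));
    }
}
```

In real Hadoop code the same grouping-by-key step is performed by the shuffle between the Mapper and the Reducer.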
Hadoop MapReduce sorting principle. Hadoop case 3: a simple problem, sorting data (entry level). "Data sorting" is often the first step when many real tasks are executed, such as student grade evaluation and data indexing. Like the data-deduplication example, it performs initial processing of raw data and lays the groundwork for further operations. Let's enter this example. 1. Requirements description: sort the data in
The command to run a MapReduce jar package is: hadoop jar **.jar
The command to run a jar package with an ordinary main function is: java -classpath **.jar
Because I did not know the difference between the two commands, I stubbornly used java -classpath **.jar to start MapReduce jobs, until errors appeared today.
java -classpath **.jar makes the jar package run locally, then
Abstract: A MapReduce program processes a patent dataset.
Keywords: MapReduce program, patent dataset
Data Source: the patent citation dataset Cite75_99.txt (the dataset can be downloaded from http://www.nber.org/patents/).
Problem Description:
Read the patent citation dataset and invert it: for each patent, find the patents that cite it and merge them. The top 5 rows of the output are as follows:
1 3964859
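A plain-Java sketch of the inversion step, assuming comma-separated citing,cited pairs as in Cite75_99.txt (the class name and the sample patent numbers are made up for illustration; the real program would do this with a Hadoop map/reduce pair):

```java
import java.util.*;

public class CitationInverter {
    // Invert "citing,cited" pairs: for each patent, collect the patents that cite it.
    static Map<String, List<String>> invert(List<String> lines) {
        Map<String, List<String>> citedBy = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            String citing = parts[0], cited = parts[1];
            citedBy.computeIfAbsent(cited, k -> new ArrayList<>()).add(citing);
        }
        return citedBy;
    }

    public static void main(String[] args) {
        List<String> input = List.of("3858241,956203", "3858241,1324234", "3858242,956203");
        System.out.println(invert(input));
    }
}
```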
Sorting can be divided into four kinds:
1. General sort
2. Partial sort
3. Global sort
4. Secondary sort (for example, with two columns of data, when the first column is equal you need to sort by the second column)

General sort: MapReduce sorts by itself by default. The Text object is not suitable as a sort key; IntWritable, LongWritable, and other types that implement WritableComparable can be sorted.

Partial sort: The order of ke
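To illustrate the two-column secondary-sort idea described above outside Hadoop, here is a plain-Java sketch that compares first by the first column and, on ties, by the second (the class name is illustrative):

```java
import java.util.*;

public class SecondarySortSketch {
    // Compare by the first column; when it ties, compare by the second column.
    static void sortPairs(List<int[]> pairs) {
        pairs.sort(Comparator.<int[]>comparingInt(p -> p[0])
                             .thenComparingInt(p -> p[1]));
    }

    public static void main(String[] args) {
        List<int[]> pairs = new ArrayList<>(List.of(
                new int[]{2, 9}, new int[]{1, 5}, new int[]{2, 3}));
        sortPairs(pairs);
        for (int[] p : pairs) System.out.println(p[0] + " " + p[1]);
        // prints: 1 5, then 2 3, then 2 9
    }
}
```

In Hadoop the same effect is usually achieved with a composite key plus custom grouping and sort comparators.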
(Sample rows, partly garbled in the source: each is a 10-character key, partly non-printable, followed by an 80-character value made of character runs such as "iiiiiiiiiijjjjjjjjjjkkkkkkkkkk…".)
Description: each line is one record consisting of two parts: the first 10 characters form the key, followed by an 80-character value.
Sort task: order the records by key.
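The sort task can be illustrated in plain Java by ordering records on their first ten characters (illustrative only; the real 1 TB benchmark of course runs as a cluster job):

```java
import java.util.*;

public class KeySort {
    // Order records by their 10-character key prefix.
    static List<String> sortByKey(List<String> records) {
        List<String> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing(r -> r.substring(0, 10)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> out = sortByKey(List.of("bbbbbbbbbb value-1", "aaaaaaaaaa value-2"));
        System.out.println(out.get(0)); // prints: aaaaaaaaaa value-2
    }
}
```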
So where does the 1 TB of data come from? The answer: it is generated by a program, with a MapReduce job. (A related configuration parameter indicates the maximum number of parallel tasks in the cluster.)
3. My understanding: for details, see the source code of FileInputFormat.java. The number of map tasks depends on splitSize: a file is divided into map tasks according to splitSize. The splitSize calculation (see the FileInputFormat source) is:
splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
and minSize = Math.max(getFormatM
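The splitSize formula above can be checked with a small stand-alone sketch (the wrapper class is my own, but the arithmetic mirrors the FileInputFormat expression):

```java
public class SplitSize {
    // splitSize = max(minSize, min(maxSize, blockSize)), as in FileInputFormat
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // With defaults minSize = 1 and maxSize = Long.MAX_VALUE,
        // a 128 MB block yields 128 MB splits.
        System.out.println(splitSize(1L, Long.MAX_VALUE, 128L * 1024 * 1024));
    }
}
```

Raising minSize above the block size produces larger (fewer) splits; lowering maxSize below it produces smaller (more) splits, hence more map tasks.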
Case six: map outputs directly, with no reduce. I had never used this map-only output mode; even for simple output I would always go through reduce. But I found that the map-only output was a bit different from what I expected. I had always thought the shuffle process, at the end of map and the beginning of reduce, would merge the records, but shuffle only does partitioning and sorting and then writes the records straight out. That was an eye-opener; my earlier understanding of the merging was wrong.
I rewrote the partition rule using year % 2: even years are handled by one reducer, odd years by the other. The results:

part-r-00000:
2014 17
2012 32
2010 17
2008 37

part-r-00001:
2015 99
2013 29
2007 99
2001 29

I also did a secondary sort on the reduce side. Secondary sort here determines how the values within one key group are output; the default rule is dictionary order, following the alphabet, though of course you can rewrite the output rules yourself.
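A minimal sketch of the year % 2 partition rule described above (illustrative plain Java, not the article's actual Partitioner subclass):

```java
public class YearPartitioner {
    // Even years go to partition 0, odd years to partition 1,
    // mirroring Hadoop's Partitioner.getPartition contract.
    static int getPartition(int year, int numPartitions) {
        return (year % 2) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(getPartition(2014, 2)); // prints 0
        System.out.println(getPartition(2015, 2)); // prints 1
    }
}
```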
MapReduce is a pattern borrowed from functional programming languages, and in some scenarios it can greatly simplify code. First, look at what MapReduce is:
MapReduce is a software architecture proposed by Google for parallel computation over large-scale datasets (larger than 1 TB). The concepts of "map" and "reduce", and their main ideas, are borrowed from functional programming languages.
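The borrowed functional idea is easy to see with Java streams: a minimal example (my own, not from the article) that "maps" each number to its square and "reduces" the squares by summing:

```java
import java.util.List;

public class MapReduceIdea {
    // "map" each number to its square, then "reduce" by summing the squares
    static int sumOfSquares(List<Integer> nums) {
        return nums.stream()
                   .map(x -> x * x)
                   .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(List.of(1, 2, 3, 4))); // prints 30
    }
}
```

Hadoop applies the same two-step idea at cluster scale, with a shuffle between the map and reduce phases.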
Configuring a Hadoop MapReduce development environment with Eclipse on Windows
1. System environment and required files
Windows 8.1 64bit
Eclipse (Version: Luna Release 4.4.0)
hadoop-eclipse-plugin-2.7.0.jar
hadoop.dll, winutils.exe
2. Modify the hdfs-site.xml of the master node and add the following content:

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

This is designed to disable permission checks, because I configure