The command to run a MapReduce jar is hadoop jar **.jar.
The command to run an ordinary jar through its main function is java -classpath **.jar.
Because I did not know the difference between the two commands, I stubbornly used java -classpath **.jar to launch MapReduce jobs, until errors showed up today.
java -classpath **.jar runs the jar locally, so the MapReduce job executes only on the current node, which is why it was so slow.
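As a sketch of the two launch styles (the jar name, driver class, and paths below are hypothetical, and these commands need a Hadoop installation to actually run):

```shell
# Plain JVM launch: everything, including all mappers, runs on this one node.
java -classpath wordcount.jar com.example.WordCountDriver /input /output

# Hadoop launcher: submits the job so tasks are distributed across the cluster.
hadoop jar wordcount.jar com.example.WordCountDriver /input /output
```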
Back when I was trying to work out why it was so slow, I first suspected that there were too many mappers, so I added a lot of log statements in the code and launched the job with java -classpath **.jar to watch the logs.
After correcting this and launching the job with hadoop jar **.jar, I found that those logs were no longer produced immediately: the mappers are dispatched to many machines, so the results do not come back right away.
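When the job runs distributed, the task logs live on the worker nodes rather than the launching machine. On clusters running YARN (an assumption; in the hadoop-core era this post targets, you would browse task logs through the JobTracker web UI instead), the aggregated logs can be fetched after the job finishes:

```shell
# <application_id> is a placeholder for the id printed when the job is submitted.
yarn logs -applicationId <application_id>
```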
Also, because the job had been launched locally, all the intermediate files were generated on that one node (which has only 50 GB of space); the operations staff later noticed this and deleted the intermediate files.
One thing to note when building the jar: when packaging with Maven's Run As, the dependency jars all end up under lib, and the project's own jar contains only the project's classes. So you need to open the jar with an archive tool, create a lib directory inside it, and put the dependency jars the project needs in there, so that you only have to copy the one jar to the server to start the job.
Since java -classpath **.jar runs locally, the jars the project needs only have to be placed in the same directory;
hadoop jar **.jar, however, runs on the cluster, so the jars the project needs must be packed inside the project's jar.
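As a sketch of the manual packing step described above (myjob.jar and the deps/ directory are hypothetical names, and this needs a JDK on the path), the dependencies can be added under lib/ with the JDK's jar tool and then verified:

```shell
# Assumes myjob.jar already exists and deps/ holds the dependency jars.
mkdir -p lib
cp deps/*.jar lib/
jar uf myjob.jar lib/            # 'u' updates the existing jar, adding the lib/ entries
jar tf myjob.jar | grep '^lib/'  # list the jar's contents to confirm they are inside
```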
Pulling the dependency jars into the project jar to form one overall jar can be done in Maven with the following configuration in pom.xml:
<build>
  <sourceDirectory>src</sourceDirectory>
  <resources>
    <resource>
      <directory>conf</directory>
      <excludes>
        <exclude>**/*.java</exclude>
      </excludes>
    </resource>
  </resources>
  <pluginManagement>
    <plugins>
      <!-- ignore/execute plugin execution -->
      <plugin>
        <groupId>org.eclipse.m2e</groupId>
        <artifactId>lifecycle-mapping</artifactId>
        <version>1.0.0</version>
        <configuration>
          <lifecycleMappingMetadata>
            <pluginExecutions>
              <!-- copy-dependencies plugin -->
              <pluginExecution>
                <pluginExecutionFilter>
                  <groupId>org.apache.maven.plugins</groupId>
                  <artifactId>maven-dependency-plugin</artifactId>
                  <versionRange>[1.0.0,)</versionRange>
                  <goals>
                    <goal>copy-dependencies</goal>
                  </goals>
                </pluginExecutionFilter>
                <action>
                  <ignore/>
                </action>
              </pluginExecution>
            </pluginExecutions>
          </lifecycleMappingMetadata>
        </configuration>
      </plugin>
    </plugins>
  </pluginManagement>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-dependency-plugin</artifactId>
      <executions>
        <execution>
          <id>copy-dependencies</id>
          <phase>test</phase>
          <goals>
            <goal>copy-dependencies</goal>
          </goals>
          <configuration>
            <excludeArtifactIds>hadoop-core</excludeArtifactIds>
            <excludeGroupIds>org.slf4j</excludeGroupIds>
            <!-- here is the key: copy the dependencies into the jar's lib directory -->
            <outputDirectory>target/classes/lib</outputDirectory>
          </configuration>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>2.3.2</version>
      <configuration>
        <source>1.6</source>
        <target>1.6</target>
        <encoding>UTF-8</encoding>
      </configuration>
    </plugin>
  </plugins>
</build>
However, you have to do Run As twice for this to work; I do not know why.
If you are simply running MapReduce without using any other jars, there is no need to pack other jars into the project. In that case the lib directory of the project's jar must not contain Hadoop's jars, because the runtime environment already provides them; lib should hold only the extra jars, such as ones you have written yourself.