Hadoop and Spark on Ubuntu 16.04


To create a new user:

$ sudo useradd -m hadoop -s /bin/bash
To set the user's password:
$ sudo passwd hadoop
To add administrator privileges:
$ sudo adduser hadoop sudo

Install SSH and configure passwordless SSH login:

To install SSH Server:

$ sudo apt-get install openssh-server
Use SSH to log in to this machine:
$ ssh localhost
After logging in to localhost, type exit to return to the original shell.
Generate a key pair with ssh-keygen and authorize it:
cd ~/.ssh/                              # if this directory does not exist, run "ssh localhost" once first
ssh-keygen -t rsa                       # press Enter at every prompt
cat ./id_rsa.pub >> ./authorized_keys   # append id_rsa.pub to the end of authorized_keys
After this, "ssh localhost" should log in without asking for a password.

Installing the Java Environment

sudo apt-get install openjdk-8-jre openjdk-8-jdk
Locate the installation directory:
dpkg -L openjdk-8-jdk | grep '/bin/javac'
(dpkg installs, builds and manages software packages; -L lists the files belonging to an installed package; grep is a text search tool used to filter for a specific string.)

The output should be /usr/lib/jvm/java-8-openjdk-amd64/bin/javac. (For some reason there was no output on this machine, but the directory can be found manually.)
To edit the user's environment variables:
gedit ~/.bashrc
Set JAVA_HOME to the directory found above (everything before /bin/javac):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # note: no spaces around the "="
Make the environment variable take effect:
source ~/.bashrc
Verify the variable values:
echo $JAVA_HOME                  # check the variable value
java -version
$JAVA_HOME/bin/java -version     # should print the same output as running java -version directly

Hadoop Installation:

Download URL: http://mirror.bit.edu.cn/apache/hadoop/common/

$ sudo tar -zxf ~/download/hadoop-3.0.0.tar.gz -C /usr/local   # extract to /usr/local
$ cd /usr/local/
$ sudo mv ./hadoop-3.0.0/ ./hadoop                             # rename the folder to hadoop
$ sudo chown -R hadoop ./hadoop                                # change the owner (chown changes file ownership; -R applies it recursively to the directory and everything under it; see "man chown" for details)

Hadoop stand-alone configuration:

Hadoop is a stand-alone configuration by default

To see all the bundled examples, run ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar with no arguments.
Here we use the grep example: it takes all the files in the input folder as input, counts the occurrences of words matching the regular expression dfs[a-z.]+, and writes the result to the output folder.
cd /usr/local/hadoop
mkdir ./input
cp ./etc/hadoop/*.xml ./input   # use the configuration files as input
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ./input ./output 'dfs[a-z.]+'
cat ./output/*   # (my run hit a permissions issue at this point, so no output files were generated)

Note that Hadoop does not overwrite result files by default, so running the above example again will report an error; ./output must be deleted first.

Hadoop Pseudo-distributed configuration

Hadoop can run in a pseudo-distributed manner on a single node: the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and jobs read their files from HDFS.

Hadoop's configuration files are located in /usr/local/hadoop/etc/hadoop/. Pseudo-distributed mode requires modifying two of them, core-site.xml and hdfs-site.xml. The configuration files are in XML format, and each setting is declared as a property with a name and a value.

Modify the configuration file core-site.xml (gedit is convenient: gedit ./etc/hadoop/core-site.xml):
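The XML itself was lost from this copy of the post. The standard pseudo-distributed core-site.xml from the referenced tutorials and the official single-node guide sets hadoop.tmp.dir and fs.defaultFS as below; the tmp path is a typical choice, adjust it to your own installation:

(XML)
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>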






Similarly, modify the configuration file Hdfs-site.xml:
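Again the original XML was lost. A typical hdfs-site.xml for this setup sets dfs.replication along with the explicit name/data directories discussed below; the exact directory values here follow the referenced tutorial and are a reasonable default rather than a requirement:

(XML)
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>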



Notes on the Hadoop configuration files
Hadoop's run mode is determined by its configuration files (they are read whenever Hadoop runs), so to switch back from pseudo-distributed mode to non-distributed mode you need to remove the added configuration items from core-site.xml.

In addition, although pseudo-distributed mode only requires fs.defaultFS and dfs.replication to run (as in the official tutorial), if the hadoop.tmp.dir parameter is not configured the default temporary directory /tmp/hadoop-hadoop is used. That directory may be cleaned out by the system on reboot, forcing the format step to be run again. So we set it explicitly, and we also specify dfs.namenode.name.dir and dfs.datanode.data.dir, otherwise the next steps may report errors.

After the configuration is complete, format the NameNode:

./bin/hdfs namenode -format

Then start the NameNode and DataNode daemons:

./sbin/start-dfs.sh
Note that JAVA_HOME must also be set in ./etc/hadoop/hadoop-env.sh (for example, export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64), otherwise startup will report an error.

In addition, if the DataNode does not start, try the fix described below (note that this deletes all the data in HDFS; if the existing data matters, do not do this):

After startup completes, the jps command can be used to check whether it succeeded. If successful, the following processes are listed: "NameNode", "DataNode" and "SecondaryNameNode". (If SecondaryNameNode did not start, run ./sbin/stop-dfs.sh to stop the processes and then try starting again. If NameNode or DataNode is missing, the configuration did not succeed; double-check the previous steps or inspect the startup logs to find the cause.)

Solution for a DataNode that cannot be started

./sbin/stop-dfs.sh              # stop HDFS
rm -r ./tmp                     # delete the tmp directory; note that this removes all data in HDFS
./bin/hdfs namenode -format     # reformat the NameNode
./sbin/start-dfs.sh             # restart
This alone did not work for me, so I also added the following to hdfs-site.xml (0.0.0.0 means all local addresses; adjust to your machine's specific IP if needed):
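The exact property the author added was not preserved. One setting that matches the description, binding the NameNode web interface to all local addresses, is shown below; the property name and port are an assumption (9870 is the Hadoop 3 default HTTP port), not taken from the original post:

(XML)
<property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:9870</value>
</property>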



Running Hadoop pseudo-distributed instances

In the stand-alone mode above, the grep example reads local data; in pseudo-distributed mode it reads data from HDFS. To use HDFS, first create a user directory in HDFS:

./bin/hdfs dfs -mkdir -p /user/hadoop

Next, copy the XML files in ./etc/hadoop into the distributed file system as input files, i.e. copy /usr/local/hadoop/etc/hadoop to /user/hadoop/input in HDFS. Since we are using the hadoop user and have already created the corresponding user directory /user/hadoop, we can use relative paths such as input in the commands; the corresponding absolute path is /user/hadoop/input:

./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input
(This step reported an error in my run; check the logs. The cause was a clusterID incompatibility; in that case stop HDFS again, delete the tmp directory, and re-format the NameNode as described above.)

Note that the following must also be added to ~/.bashrc, otherwise errors occur. The same configuration is also best set in hadoop-env.sh (I did not add the snippet below to hadoop-env.sh, because doing so caused an error for me):
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"

After the copy is complete, you can view the list of files with the following command:

./bin/hdfs dfs -ls input

Pseudo-distributed mode runs MapReduce jobs the same way as stand-alone mode, except that it reads its files from HDFS (you can verify this by deleting the local ./input folder created in the stand-alone step, together with the local ./output folder).

./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'

Command to view the results (the output in HDFS):

./bin/hdfs dfs -cat output/*

We can also retrieve the running results locally:

rm -r ./output                        # delete the local output folder first (if present)
./bin/hdfs dfs -get output ./output   # copy the output folder from HDFS to the local machine
cat ./output/*

When Hadoop runs a program, the output directory must not already exist, otherwise it reports the error "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists". So to run the job again, first delete the output folder:

./bin/hdfs dfs -rm -r output   # delete the output folder
The output directory must not exist when you run the program
When running a Hadoop program, the output directory specified by the program (such as output) must not already exist; this prevents results from being overwritten, and Hadoop reports an error otherwise. The output directory therefore needs to be deleted before each run. When developing a real application, consider adding code like the following to the program so that the output directory is deleted automatically on each run, avoiding the tedious command-line step:

(JAVA)
Configuration conf = new Configuration();
Job job = new Job(conf);

/* delete the output directory */
Path outputPath = new Path(args[1]);
outputPath.getFileSystem(conf).delete(outputPath, true);

To shut down Hadoop, run:

./sbin/stop-dfs.sh

Attention
The next time you start Hadoop you do not need to initialize the NameNode again; just run ./sbin/start-dfs.sh (change into the Hadoop directory, $HADOOP_HOME, first).

Start YARN

(Pseudo-distributed mode can also run without starting YARN; this generally does not affect program execution.)

Some readers may wonder why, after starting Hadoop, they cannot see the JobTracker and TaskTracker mentioned in older books: newer versions of Hadoop use the new MapReduce framework (MapReduce V2, also known as YARN, Yet Another Resource Negotiator).

YARN was separated out of MapReduce and is responsible for resource management and task scheduling; MapReduce now runs on top of YARN, which provides high availability and high scalability. YARN is not introduced further here; interested readers can consult the related material.

Starting Hadoop with ./sbin/start-dfs.sh as above only starts the MapReduce environment. We can additionally start YARN and let it take charge of resource management and task scheduling.

First modify the configuration file mapred-site.xml, which first needs to be renamed:

mv ./etc/hadoop/mapred-site.xml.template ./etc/hadoop/mapred-site.xml   # mv renames the file
Then edit it (again, gedit is convenient): gedit ./etc/hadoop/mapred-site.xml:
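The XML content was also lost here; the standard setting for running MapReduce on YARN (as in the official docs and the referenced tutorials) is:

(XML)
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>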



Then modify the configuration file yarn-site.xml:
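Likewise, the usual yarn-site.xml setting, which enables the MapReduce shuffle service on the NodeManager, is:

(XML)
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>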



Then you can start YARN (run ./sbin/start-dfs.sh first):

./sbin/start-yarn.sh                                 # start YARN
./sbin/mr-jobhistory-daemon.sh start historyserver   # start the history server, so task runs can be viewed in the web UI

After starting, jps shows two additional background processes, NodeManager and ResourceManager.

After enabling YARN, examples are run in exactly the same way; only resource management and task scheduling differ. Looking at the log information you can see that without YARN jobs run as "mapred.LocalJobRunner", while with YARN enabled they run as "mapred.YARNRunner". One benefit of starting YARN is that you can view task status through the web interface: http://localhost:8088/cluster.

However, YARN mainly provides better resource management and task scheduling for clusters; on a single machine this brings no visible benefit and makes programs run slightly slower. Whether to enable YARN on a single machine therefore depends on the actual situation.

If you do not start YARN, rename mapred-site.xml
If you do not want to start YARN, be sure to rename the configuration file mapred-site.xml back to mapred-site.xml.template; changing the name back is enough. Otherwise, if the configuration file exists but YARN is not started, programs fail with the error "Retrying connect to server: 0.0.0.0/0.0.0.0:8032", which is also why the file's initial name is mapred-site.xml.template.

Similarly, the script to close YARN is as follows:

./sbin/stop-yarn.sh
./sbin/mr-jobhistory-daemon.sh stop historyserver
When running this, a message suggested that mr-jobhistory-daemon.sh has been replaced by "mapred --daemon stop", but the mr-jobhistory-daemon.sh script is still present in sbin, so the commands above were used.

Spark Installation

Http://spark.apache.org/downloads.html
On this machine spark-2.3.0-bin-hadoop2.7 was installed; the commands below come from the reference tutorial and use spark-1.6.0-bin-without-hadoop, so substitute the file name you actually downloaded.

sudo tar -zxf ~/download/spark-1.6.0-bin-without-hadoop.tgz -C /usr/local/
cd /usr/local
sudo mv ./spark-1.6.0-bin-without-hadoop/ ./spark
sudo chown -R hadoop:hadoop ./spark   # hadoop here is your username

After installation, Spark's classpath needs to be set in ./conf/spark-env.sh. Execute the following commands to copy the template configuration file:

cd /usr/local/spark
cp ./conf/spark-env.sh.template ./conf/spark-env.sh

Edit ./conf/spark-env.sh (vim ./conf/spark-env.sh) and add the following line at the end:

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)

Running the Spark sample
Note that Hadoop must be installed for this Spark setup to work, but if your Spark jobs do not use HDFS you do not have to start Hadoop. In addition, unless stated otherwise, the commands and paths in the rest of this tutorial assume the current directory is Spark's installation directory (/usr/local/spark); pay attention to the distinction.

There are Spark sample programs in the ./examples/src/main directory, in Scala, Java, Python and R. We can run the SparkPi example (which computes an approximation of π) by executing:

cd /usr/local/spark
./bin/run-example SparkPi

Execution prints a lot of log information and the result is not easy to spot, so it can be filtered with grep (the 2>&1 redirects all output, including log messages on stderr, to stdout so that grep can filter it; otherwise part of the log would still go straight to the screen):

./bin/run-example SparkPi 2>&1 | grep "Pi is roughly"
The filtered output shows an approximation of π to 5 decimal places.

The Python version of SparkPi must be run through spark-submit:

./bin/spark-submit examples/src/main/python/pi.py

Interactive analysis with Spark Shell
The Spark shell provides an easy way to learn the API and also provides an interactive way to analyze the data. The Spark Shell supports Scala and Python, and this tutorial chooses to use Scala for introductions.

Scala
Scala is a modern, multi-paradigm programming language that expresses common programming patterns in a concise, elegant and type-safe way. It seamlessly integrates the features of object-oriented and functional languages. Scala runs on the Java platform (the JVM, Java Virtual Machine) and is compatible with existing Java programs.

Scala is Spark's main programming language. If you only want to write Spark applications you do not have to use Scala; Java or Python work as well. The advantage of using Scala is higher development efficiency and more concise code, and the Spark shell allows interactive real-time queries, which makes troubleshooting easier.

Execute the following command to start the Spark Shell:

./bin/spark-shell

Connect Jupyter Notebook and Spark via PySpark

Add the following environment variables (again in ~/.bashrc, then source it):


export SPARK_HOME=/usr/local/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.6-src.zip:$PYTHONPATH
export PATH=$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3

export HIVE_HOME=/usr/local/hive
export PATH=\(PATH:\)HIVE_HOME/bin

Run PySpark:
$SPARK_HOME/bin/pyspark

Reference documents:
https://wangchangchung.github.io/2017/09/28/Ubuntu-16-04%E4%B8%8A%E5%AE%89%E8%A3%85Hadoop%E5%B9%B6%E6%88%90%E5%8A%9F%E8%BF%90%E8%A1%8C/
http://www.powerxing.com/install-hadoop/
http://www.powerxing.com/hadoop-build-project-by-shell/
http://dblab.xmu.edu.cn/blog/1689-2/
Python MapReduce programs can also be written using Hadoop Streaming.
