The previous several posts covered mainly Spark RDD fundamentals, and they used textFile to work with files on the local machine. In practice you rarely operate on plain local files; far more often you work with Kafka streams and with files on Hadoop.
So let's build a Hadoop environment on the local machine.

1. Install and configure Hadoop
First download the Hadoop package: http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.8.3/hadoop-2.8.3.tar.gz
I am using version 2.8.3; download it and unzip it to a folder.
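A minimal command-line sketch of the download-and-unpack step (the Apache archive URL and the ~/Downloads target directory are just the paths assumed later in this post; use whatever mirror and folder you prefer):

curl -L -o hadoop-2.8.3.tar.gz https://archive.apache.org/dist/hadoop/common/hadoop-2.8.3/hadoop-2.8.3.tar.gz
tar -xzf hadoop-2.8.3.tar.gz -C ~/Downloads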
Hadoop depends on Java, so you first need Java installed and its environment variables in place.

Configure the Hadoop environment variables
Hadoop's executables live in the sbin and bin directories, and both need to be added to the PATH environment variable.
Taking a Mac as the example, edit the environment variable configuration with vi ~/.bash_profile:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home
export HADOOP_HOME=/Users/wuwf/Downloads/hadoop-2.8.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
When the configuration is done, run source ~/.bash_profile to make the environment variables take effect, then execute hadoop version:
weifengdeMacBook-Pro:~ wuwf$ hadoop version
Hadoop 2.8.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r b3fe56402d908019d99af1f1f4fc65cb1d1436a2
Compiled by jdu on 2017-12-05T03:43Z
Compiled with protoc 2.5.0
From source with checksum 9ff4856d824e983fa510d3f843e3f19d
This command was run using /Users/wuwf/Downloads/hadoop-2.8.3/share/hadoop/common/hadoop-common-2.8.3.jar
Modify Hadoop's individual configuration files
All of the files to modify live under etc/hadoop in the Hadoop installation directory.

hadoop-env.sh
Add: export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home
Here the path is your own JAVA_HOME.

Modify core-site.xml
<configuration>
<!-- set up a temporary directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/wuwf/Hadoop/hadoop-2.8.3/data</value>
</property>
<!-- set up the file system -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.1.55:9999</value>
</property>
</property>
</configuration>
The temporary directory above is a local directory. The IP below must be the machine's own IP; using localhost caused errors here, so fill in your actual IP.
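On a Mac, one quick way to look up the machine's IP (assuming the active network interface is en0; check with ifconfig if unsure) is:

ipconfig getifaddr en0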
Modify hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
There is only one local node, so replication is set to 1.

Add mapred-site.xml
This file does not exist by default; create it under etc/hadoop (a sketch of one way, copying the shipped template, follows the XML) and fill it with:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
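In Hadoop 2.x the etc/hadoop directory usually ships only mapred-site.xml.template, so one way to create the file (assuming that template is present) is:

cd $HADOOP_HOME/etc/hadoop
cp mapred-site.xml.template mapred-site.xml

Then edit the copied file and add the property above.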
Configure yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>192.168.1.55:9999</value>
</property>
</property>
</configuration>
Start Hadoop
First execute: hadoop namenode -format
Then start HDFS with start-dfs.sh. If a Mac reports "localhost port 22: Connection refused", open System Preferences > Sharing, tick Remote Login, and allow access for the current user.
You will be asked to enter your password three times while start-dfs.sh runs; setting up passwordless SSH to localhost (a sketch follows) avoids this.
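A minimal sketch, assuming you do not already have a key pair (adjust the file names if you do):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost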
Then: start-yarn.sh
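As a sanity check, jps (shipped with the JDK) should list the Hadoop daemons once both scripts have run; on a single-node setup you would expect roughly the following entries (each preceded by a process ID, omitted here):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager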
After both commands have run, open localhost:50070 in your browser; if the NameNode page loads, the Hadoop configuration was successful.

Push files into Hadoop
Execute the following commands:
hdfs dfs -mkdir /wc
hdfs dfs -put a /wc/1.log
hdfs dfs -put a /wc/2.log
hdfs dfs -put a /wc/3.log
This first creates a directory on HDFS, then pushes the local file a to HDFS, renaming it each time.
hdfs dfs -ls /wc lists the files in that directory.

Spark reads Hadoop files
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.List;

/**
 * @author wuweifeng wrote on 2018/4/27.
 */
public class Test {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("JavaWordCount").master("local").getOrCreate();
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
        JavaRDD<String> javaRdd = javaSparkContext.textFile("hdfs://192.168.1.55:9999/wc/1.log");
        // take 10% of the data; set the random seed yourself, or omit it
        JavaRDD<String> sample = javaRdd.sample(false, 0.1, 1234);
        long sampleDataSize = sample.count();
        long rawDataSize = javaRdd.count();
        System.out.println(rawDataSize + " and after the sampling: " + sampleDataSize);
        // take a specified number of random elements
        List<String> list = javaRdd.takeSample(false, 10);
        System.out.println(list);
        // take the first elements in order
        List<String> orderList = javaRdd.takeOrdered(10);
        System.out.println(orderList);
    }
}
Reading the HDFS file uses the same textFile method as reading a local file.
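Since the example above is named JavaWordCount, here is a minimal word-count sketch over the same HDFS file, assuming whitespace-separated words; the class name HdfsWordCount and the output handling are just illustrative, not part of the original post:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.util.Arrays;

public class HdfsWordCount {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("HdfsWordCount").master("local").getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(sparkSession.sparkContext());
        // read the file that was pushed to HDFS earlier
        JavaRDD<String> lines = sc.textFile("hdfs://192.168.1.55:9999/wc/1.log");
        // split each line into words, pair each word with 1, then sum the counts per word
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        counts.collect().forEach(t -> System.out.println(t._1 + ": " + t._2));
        sparkSession.stop();
    }
}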