10. Build a Hadoop standalone environment and use Spark to manipulate Hadoop files

Source: Internet
Author: User
Tags: hdfs dfs kafka streams

The previous posts covered mainly Spark RDD basics, and used textFile to operate on local files. In practice you rarely work with plain local files; far more often you work with Kafka streams and with files on Hadoop.

Let's build a Hadoop environment on this machine.

1. Install and configure Hadoop

Download the Hadoop package first, http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.8.3/hadoop-2.8.3.tar.gz

I'm using version 2.8.3; download it and unzip it to a folder.

Hadoop depends on Java, so install Java on your machine first and make sure the Java environment variables are in place.

Configure the Hadoop environment variables

Hadoop's executables live in the sbin and bin directories, and both directories need to be added to the PATH environment variable.

Taking the Mac as an example, edit the environment variables with vi ~/.bash_profile:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home
export HADOOP_HOME=/Users/wuwf/Downloads/hadoop-2.8.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

When the configuration is complete, run source ~/.bash_profile to make the environment variables take effect, then run hadoop version:

weifengdeMacBook-Pro:~ wuwf$ hadoop version
Hadoop 2.8.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r b3fe56402d908019d99af1f1f4fc65cb1d1436a2
Compiled by jdu on 2017-12-05T03:43Z
Compiled with protoc 2.5.0
From source with checksum 9ff4856d824e983fa510d3f843e3f19d
This command was run using /Users/wuwf/Downloads/hadoop-2.8.3/share/hadoop/common/hadoop-common-2.8.3.jar
Modify the individual configuration files of Hadoop

All of the files below are under etc/hadoop in the Hadoop installation directory.

hadoop-env.sh

Add: export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home

Replace the path with your own JAVA_HOME.

Modify core-site.xml

<configuration>
    <!-- set the temporary directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/Users/wuwf/Hadoop/hadoop-2.8.3/data</value>
    </property>
    <!-- set the file system -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.1.55:9999</value>
    </property>
</configuration>
The temporary directory above is a local directory. The IP below is the machine's own IP; note that using localhost here caused errors, so fill in your own IP.

Modify hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Since there is only the one local node, replication is set to 1.

Add mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Create this file in the same directory and fill it with the content above.

Configure yarn-site.xml

<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>192.168.1.55:9999</value>
    </property>
</configuration>
Start Hadoop

First execute: hadoop namenode -format

Then start HDFS with start-dfs.sh. If the Mac reports "localhost port 22: Connection refused", go to Settings > Sharing, tick Remote Login, and add the current user to the allowed users.

You will be asked to enter the password 3 times after executing start-dfs.sh.
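The prompts come from SSH connections to localhost. A common way to avoid them (not part of the original walkthrough) is to set up passwordless SSH to localhost first:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

After that, start-dfs.sh and stop-dfs.sh should no longer ask for a password.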

Then: start-yarn.sh

After both commands have run, open localhost:50070 in a browser.
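You can also run jps (it ships with the JDK) as a quick check; on a single-node setup you would normally expect to see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager among the listed processes.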


If the overview page loads, Hadoop is configured successfully.

Push files into Hadoop

Execute the following command

hdfs dfs -mkdir /wc
hdfs dfs -put a /wc/1.log
hdfs dfs -put a /wc/2.log
hdfs dfs -put a /wc/3.log

This first creates a directory on HDFS, then pushes the local file a to HDFS under new names.
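If you would rather do this from code instead of the command line, the HDFS Java API can do the same thing. A minimal sketch, reusing the address and file names from this article (the class name is my own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        // Connect to the same HDFS instance configured in core-site.xml
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.1.55:9999"), new Configuration());
        // Equivalent of: hdfs dfs -mkdir /wc
        fs.mkdirs(new Path("/wc"));
        // Equivalent of: hdfs dfs -put a /wc/1.log
        fs.copyFromLocalFile(new Path("a"), new Path("/wc/1.log"));
        fs.close();
    }
}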

hdfs dfs -ls /wc lists the files under that directory.

Spark reads Hadoop files

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.List;

/**
 * @author wuweifeng wrote on 2018/4/27.
 */
public class Test {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("JavaWordCount").master("local").getOrCreate();
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
        JavaRDD<String> javaRdd = javaSparkContext.textFile("hdfs://192.168.1.55:9999/wc/1.log");

        // Take 10% of the data; the random seed can be set yourself, or omitted
        JavaRDD<String> sample = javaRdd.sample(false, 0.1, 1234);
        long sampleDataSize = sample.count();
        long rawDataSize = javaRdd.count();
        System.out.println(rawDataSize + " and after the sampling: " + sampleDataSize);

        // Take a specified number of random elements
        List<String> list = javaRdd.takeSample(false, 10);
        System.out.println(list);

        // Take a specified number of elements in order
        List<String> orderList = javaRdd.takeOrdered(10);
        System.out.println(orderList);
    }
}

It's still the textFile method, used exactly the same way as with a local file.
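Since the directory is called /wc and the app is named JavaWordCount, here is what a simple word count over all the files in that directory might look like. This is only a sketch on top of the code above; the split-on-spaces tokenization and the /wc/output path are my own assumptions, not from the original article:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.util.Arrays;

public class WordCount {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("JavaWordCount").master("local").getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(sparkSession.sparkContext());
        // textFile accepts a directory, so this reads 1.log, 2.log and 3.log together
        JavaRDD<String> lines = sc.textFile("hdfs://192.168.1.55:9999/wc");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        // Print the result, and write it back to HDFS (the output path is just an example)
        counts.collect().forEach(System.out::println);
        counts.saveAsTextFile("hdfs://192.168.1.55:9999/wc/output");
        sparkSession.stop();
    }
}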


