The previous several posts covered mainly Spark RDD fundamentals, and they used textFile to work with files on the local machine. In practice you rarely operate on plain local files; far more often you work with Kafka streams and with files on Hadoop.
So let's build a Hadoop environment on the local machine.

1. Install and configure Hadoop
First download the Hadoop package: http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.8.3/hadoop-2.8.3.tar.gz
I am using version 2.8.3; download it and unzip it to a folder.
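A minimal command-line sketch of the download-and-unpack step (the Apache archive URL and the ~/Downloads target directory are just the paths assumed later in this post; use whatever mirror and folder you prefer):

curl -L -o hadoop-2.8.3.tar.gz https://archive.apache.org/dist/hadoop/common/hadoop-2.8.3/hadoop-2.8.3.tar.gz
tar -xzf hadoop-2.8.3.tar.gz -C ~/Downloads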
Hadoop depends on Java, so you first need Java installed and its environment variables in place.

Configure the Hadoop environment variables
Hadoop's executables live in the sbin and bin directories, and both need to be added to the PATH environment variable.
Taking a Mac as the example, edit the environment variable configuration with vi ~/.bash_profile:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home
export HADOOP_HOME=/Users/wuwf/Downloads/hadoop-2.8.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
When the configuration is done, run source ~/.bash_profile to make the environment variables take effect, then execute hadoop version:
weifengdeMacBook-Pro:~ wuwf$ hadoop version
Hadoop 2.8.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r b3fe56402d908019d99af1f1f4fc65cb1d1436a2
Compiled by jdu on 2017-12-05T03:43Z
Compiled with protoc 2.5.0
From source with checksum 9ff4856d824e983fa510d3f843e3f19d
This command was run using /Users/wuwf/Downloads/hadoop-2.8.3/share/hadoop/common/hadoop-common-2.8.3.jar
Modify Hadoop's individual configuration files
All of the files to modify live under etc/hadoop in the Hadoop installation directory.

hadoop-env.sh
Add: export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home
Here the path is your own JAVA_HOME.

Modify core-site.xml
<configuration>
<!-- set up a temporary directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/wuwf/Hadoop/hadoop-2.8.3/data</value>
</property>
<!-- set up the file system -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.1.55:9999</value>
</property>
</property>
</configuration>
The temporary directory above is a local directory. The IP below must be the machine's own IP; using localhost caused errors here, so fill in your actual IP.
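On a Mac, one quick way to look up the machine's IP (assuming the active network interface is en0; check with ifconfig if unsure) is:

ipconfig getifaddr en0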
Modify hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
There is only one local node, so replication is set to 1.

Add mapred-site.xml
This file does not exist by default; create it under etc/hadoop (a sketch of one way, copying the shipped template, follows the XML) and fill it with:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
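In Hadoop 2.x the etc/hadoop directory usually ships only mapred-site.xml.template, so one way to create the file (assuming that template is present) is:

cd $HADOOP_HOME/etc/hadoop
cp mapred-site.xml.template mapred-site.xml

Then edit the copied file and add the property above.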
Configure yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>192.168.1.55:9999</value>
</property>
</property>
</configuration>
Start Hadoop
First execute: hadoop namenode -format
Then start HDFS with start-dfs.sh. If a Mac reports "localhost port 22: Connection refused", open System Preferences > Sharing, tick Remote Login, and allow access for the current user.
You will be asked to enter your password three times while start-dfs.sh runs; setting up passwordless SSH to localhost (a sketch follows) avoids this.
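A minimal sketch, assuming you do not already have a key pair (adjust the file names if you do):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost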
Then: start-yarn.sh
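As a sanity check, jps (shipped with the JDK) should list the Hadoop daemons once both scripts have run; on a single-node setup you would expect roughly the following entries (each preceded by a process ID, omitted here):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager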
After both commands have run, open localhost:50070 in your browser; if the NameNode page loads, the Hadoop configuration was successful.

Push files into Hadoop
Execute the following commands:
hdfs dfs -mkdir /wc
hdfs dfs -put a /wc/1.log
hdfs dfs -put a /wc/2.log
hdfs dfs -put a /wc/3.log
This first creates a directory on HDFS, then pushes the local file a to HDFS, renaming it each time.
hdfs dfs -ls /wc lists the files in that directory.

Spark reads Hadoop files
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.List;

/**
 * @author wuweifeng wrote on 2018/4/27.
 */
public class Test {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("JavaWordCount").master("local").getOrCreate();
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
        JavaRDD<String> javaRdd = javaSparkContext.textFile("hdfs://192.168.1.55:9999/wc/1.log");
        // take 10% of the data; set the random seed yourself, or omit it
        JavaRDD<String> sample = javaRdd.sample(false, 0.1, 1234);
        long sampleDataSize = sample.count();
        long rawDataSize = javaRdd.count();
        System.out.println(rawDataSize + " and after the sampling: " + sampleDataSize);
        // take a specified number of random elements
        List<String> list = javaRdd.takeSample(false, 10);
        System.out.println(list);
        // take the first elements in order
        List<String> orderList = javaRdd.takeOrdered(10);
        System.out.println(orderList);
    }
}
Reading the HDFS file uses the same textFile method as reading a local file.
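Since the example above is named JavaWordCount, here is a minimal word-count sketch over the same HDFS file, assuming whitespace-separated words; the class name HdfsWordCount and the output handling are just illustrative, not part of the original post:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.util.Arrays;

public class HdfsWordCount {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("HdfsWordCount").master("local").getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(sparkSession.sparkContext());
        // read the file that was pushed to HDFS earlier
        JavaRDD<String> lines = sc.textFile("hdfs://192.168.1.55:9999/wc/1.log");
        // split each line into words, pair each word with 1, then sum the counts per word
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        counts.collect().forEach(t -> System.out.println(t._1 + ": " + t._2));
        sparkSession.stop();
    }
}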