Objective
Apache Zeppelin is a web-based notebook (similar to the IPython notebook) that supports interactive data analysis and query in the form of web notes. You can use Scala and SQL online to query and analyze data and generate reports. It natively supports Spark, Scala, SQL, Shell, Markdown, and more. It is fully open source and, at the time of writing, still in the Apache incubation stage; it has already been adopted by major companies.
Zeppelin's backend data engine can be Spark, and you can add further data engines by implementing additional interpreters. Building Zeppelin locally makes Spark easier to use, and it is easy to showcase your work to customers.
Prepare
sudo apt-get update  # update apt
Installing the JDK
sudo apt-get install openjdk-8-jre openjdk-8-jdk
Installing Hadoop, Spark, and Git
Install Hadoop and Spark separately (the paths used later assume /usr/local/hadoop and /usr/local/spark), then install Git:
sudo apt-get install git
Installing Maven
sudo apt-get install maven
Installing npm
sudo apt-get install npm  # npm home: /usr/share/npm
Installing PhantomJS
Download phantomjs-1.9.8-linux-x86_64.tar.bz2 and extract it to /usr/local/phantomjs:
tar -xjvf phantomjs-1.9.8-linux-x86_64.tar.bz2
sudo mv phantomjs-1.9.8-linux-x86_64 /usr/local/phantomjs
Installing Apache Zeppelin
GitHub: https://github.com/apache/incubator-zeppelin
Downloads: http://zeppelin.apache.org/download.html
Apache Zeppelin officially provides both source packages and binary packages; download whichever you need.
- Binary package: download http://ftp.meisei-u.ac.jp/mirror/apache/dist/incubator/zeppelin/0.5.6-incubating/zeppelin-0.5.6-incubating-bin-all.tgz, then unpack it:
tar -xzvf zeppelin-0.5.6-incubating-bin-all.tgz
- Source build: clone the latest source from the Zeppelin Git repository into /usr/local/zeppelin and compile it. I compile from the latest source here:
git clone https://github.com/apache/incubator-zeppelin /usr/local/zeppelin
Compiling Apache Zeppelin
- Local mode: mvn clean package -DskipTests
- Cluster mode: mvn package -Pspark-2.0 -Dhadoop.version=2.7.1 -Phadoop-2.7 -DskipTests -X
Various problems may occur during compilation, but they are usually caused by network issues; simply re-run the compile command. If the compilation fails with an OOM error, however, set the following before retrying:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m"
- Configure environment variables
$ vim ~/.bashrc
Add the following lines (adjust paths to match your installation):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/usr/local/spark
export HADOOP_HOME=/usr/local/hadoop
export PHANTOMJS_HOME=/usr/local/phantomjs
export ZEPPELIN_HOME=/usr/local/zeppelin
export PATH=.:$PATH:/usr/local/hadoop/bin:/usr/local/phantomjs/bin:/usr/local/spark/bin:/usr/local/zeppelin/bin:/usr/lib/jvm/java-8-openjdk-amd64/bin
Then reload the file:
$ source ~/.bashrc
- Cluster-mode compilation
$ cd /usr/local/zeppelin
$ mvn package -Pspark-2.0 -Dhadoop.version=2.7.1 -Phadoop-2.7 -DskipTests -X
If you need to use YARN, you must add the -Pyarn profile to the mvn command when compiling Zeppelin.
Configuration
Zeppelin has two configuration files: an environment variable file (conf/zeppelin-env.sh) and a Java properties file (conf/zeppelin-site.xml). Configure them according to your requirements.
- Copy /usr/local/zeppelin/conf/zeppelin-env.sh.template and /usr/local/zeppelin/conf/zeppelin-site.xml.template to /usr/local/zeppelin/conf/zeppelin-env.sh and /usr/local/zeppelin/conf/zeppelin-site.xml.
- Edit conf/zeppelin-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/usr/local/spark
export HADOOP_CONF_DIR=/usr/local/hadoop
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0"
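Putting the two configuration steps together, a minimal sketch (assuming Zeppelin is installed under /usr/local/zeppelin, as above):

```shell
# Create the real config files from the shipped templates
cd /usr/local/zeppelin/conf
cp zeppelin-env.sh.template zeppelin-env.sh
cp zeppelin-site.xml.template zeppelin-site.xml

# Append the environment settings described above to zeppelin-env.sh
cat >> zeppelin-env.sh <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/usr/local/spark
export HADOOP_CONF_DIR=/usr/local/hadoop
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0"
EOF
```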
Start
Execute the following command in the $ZEPPELIN_HOME directory:
$ cd /usr/local/zeppelin
$ ./bin/zeppelin-daemon.sh start
The daemon is controlled with bin/zeppelin-daemon.sh start and bin/zeppelin-daemon.sh stop.
After startup, open http://localhost:8080 to access the Zeppelin home page.
Test
- Configuring the Spark Interpreter
- Getting started with Zeppelin
1. Text
Text content is output using the default Scala interpreter:
println("Hello Yuan siping!")
2. HTML
Use the %html display directive to output HTML from a paragraph:
%html
3. Table
Scala:
print(s"""%table name\tsize\nsun\t100\nmoon\t10""")
4.Tutorial with Local File
Data Refine:
Download the bank data from http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip, convert the semicolon-separated CSV data into an RDD of Bank objects, and filter out the header row:
val bankText = sc.textFile("/usr/data/bank/bank-full.csv")
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
val bank = bankText.map(s => s.split(";"))
  .filter(s => s(0) != "\"age\"")
  .map(s => Bank(s(0).toInt,
                 s(1).replaceAll("\"", ""),
                 s(2).replaceAll("\"", ""),
                 s(3).replaceAll("\"", ""),
                 s(5).replaceAll("\"", "").toInt))
bank.toDF().registerTempTable("bank")
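Outside Zeppelin, the same parsing logic can be sanity-checked with plain Scala, no SparkContext needed. The two sample lines below are made up for illustration, in the same format as bank-full.csv:

```scala
// Mirrors the bank-full.csv parsing above, on a plain List instead of an RDD
case class Bank(age: Int, job: String, marital: String, education: String, balance: Int)

// One header line plus one made-up data line in bank-full.csv's format
val lines = List(
  "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"",
  "58;\"management\";\"married\";\"tertiary\";\"no\";2143"
)

val banks = lines
  .map(_.split(";"))
  .filter(f => f(0) != "\"age\"")   // skip the header row
  .map(f => Bank(
    f(0).toInt,
    f(1).replaceAll("\"", ""),
    f(2).replaceAll("\"", ""),
    f(3).replaceAll("\"", ""),
    f(5).replaceAll("\"", "").toInt))

println(banks)  // List(Bank(58,management,married,tertiary,2143))
```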
Data Retrieval:
You can see the age distribution of customers under 30 by executing the following statement:
%sql select age, count(1) from bank where age < 30 group by age order by age
To make the maximum age a dynamic input parameter maxAge (default 30), and view the age distribution of customers younger than maxAge:
%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age
To check the age distribution depending on the selected marital status:
%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age
Installing Zeppelin on Ubuntu with Spark