This week's work is divided into two parts.
Part one: Build the JStorm environment (a three-machine cluster)
Since the Microsoft Azure virtual machines have not been provisioned yet, I built the environment on lab machines first.
1. Build Zookeeper Cluster
a) Download ZooKeeper 3.4.5 and unzip it to /xxx/xxx/zookeeper-3.4.5
b) Configure environment variables (in ~/.bashrc)
export ZOOKEEPER_HOME=/xxx/xxx/zookeeper-3.4.5
export PATH=$PATH:$HOME/bin:$ZOOKEEPER_HOME/bin
export CLASSPATH=$CLASSPATH:$ZOOKEEPER_HOME/lib
c) Configure $ZOOKEEPER_HOME/conf/zoo.cfg, mainly:
dataDir=/home/yangrenkai/data/zookeeper/data
clientPort=5181
server.1=blade5:2881:3881
server.2=blade7:2881:3881
server.3=blade8:2881:3881
d) Create a myid file under dataDir whose content is 1, 2, or 3, matching the x of the corresponding server.x line.
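The myid step above can be sketched as a few shell commands. The paths are illustrative (the report used /home/yangrenkai/data/zookeeper/data); dataDir here must match the dataDir value in zoo.cfg, and the number written must match this host's server.x entry:

```shell
# Create the myid file for this node. DATADIR must equal the dataDir
# value configured in zoo.cfg.
DATADIR="${DATADIR:-$HOME/data/zookeeper/data}"
mkdir -p "$DATADIR"

# Write the id matching this host's server.x entry:
# 1 on blade5, 2 on blade7, 3 on blade8.
echo 1 > "$DATADIR/myid"
cat "$DATADIR/myid"
```

Run this once per machine, changing the number to match that machine's server.x line.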
2. Install Java 1.7 and Python 2.6, because JStorm is written largely in Java and Python.
3. Install JStorm 0.9.3.1
a) Download JStorm 0.9.3.1 and unzip it to /xxx/xxx/jstorm-0.9.3.1
b) Configure environment variables (in ~/.bashrc)
export JSTORM_HOME=/xxx/xxx/jstorm-0.9.3.1
export PATH=$PATH:$JSTORM_HOME/bin
c) Configure $JSTORM_HOME/conf/storm.yaml
storm.zookeeper.servers: the addresses of the ZooKeeper servers
storm.zookeeper.port: the ZooKeeper client port
nimbus.host: the address of the Nimbus node
storm.zookeeper.root: JStorm's root directory in ZooKeeper; when multiple JStorm clusters share one ZooKeeper, you need to set this option; the default is "/jstorm"
storm.local.dir: the directory for JStorm's local temporary data; make sure the JStorm processes have write permission to it, and if a machine runs Storm and JStorm at the same time, the two must not share this directory
java.library.path: the installation directories of ZeroMQ and the Java ZeroMQ library; the default is "/usr/local/lib:/opt/local/lib:/usr/lib"
supervisor.slots.ports: the list of port slots provided by a supervisor; make sure they do not conflict with other ports; the default is 68xx, while Storm uses 67xx
supervisor.disk.slot: a data directory; when a machine has more than one disk, this can provide disk read/write slots, which helps applications with heavy IO
topology.enable.classloader: false; the classloader is off by default; if the application jar conflicts with one of JStorm's dependency jars (for example, the application uses Thrift 9 but JStorm uses Thrift 7), you need to enable the classloader
nimbus.groupfile.path: if you need resource isolation (for example, limiting how many resources the data warehouse, the technology department, and the wireless department each use), enable the grouping feature by setting the absolute path of a configuration file, formatted like the group_file.ini example in the source code
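Putting the options above together, a minimal storm.yaml for this cluster might look like the following sketch. The hostnames and port are the lab values from above; nimbus.host being blade5 and the storm.local.dir path are my assumptions and should be adjusted to your setup:

```yaml
# Sketch only; adjust hostnames, port, and directories to your cluster.
storm.zookeeper.servers:
    - "blade5"
    - "blade7"
    - "blade8"
storm.zookeeper.port: 5181
storm.zookeeper.root: "/jstorm"
nimbus.host: "blade5"            # assumed; use your actual Nimbus node
storm.local.dir: "/xxx/xxx/jstorm-data"
supervisor.slots.ports:
    - 6800
    - 6801
    - 6802
    - 6803
```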
d) On the node from which topologies are submitted, run:
mkdir ~/.jstorm
cp -f $JSTORM_HOME/conf/storm.yaml ~/.jstorm
e) Start ZooKeeper first, then start Nimbus and the supervisors. Nimbus and a supervisor should preferably not run on the same node. I run 1 Nimbus and 2 supervisors, with each supervisor configured with four ports.
4. JStorm requires Tomcat to serve its UI, so you need to install Tomcat.
a) Download Tomcat 8.0.9 and unzip it to /xxx/xxx/tomcat-8.0.9
b) Run the commands:
cd /xxx/xxx/tomcat-8.0.9/webapps/
cp $JSTORM_HOME/jstorm-ui-0.9.3.war ./
mv ROOT ROOT.old
ln -s jstorm-ui-0.9.3 ROOT
c) Start Tomcat: /xxx/xxx/tomcat-8.0.9/bin/startup.sh
Part two: Finished the first version of topk_on_jstorm (project address)
1. Set up the jstorm-topk project
2. The whole project implements a simple top-K computation pipeline: a ScoreProduceSpout with parallelism 1 generates random (id, score) data, ComputeBolts with parallelism 4 compute partial top-K results, and a PrintAndStoreBolt with parallelism 1 rolls them up and prints them.
3. Build the TopKServerTopology
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new ScoreProduceSpout(), 1);
builder.setBolt("Compute", new ComputeBolt(), 4).shuffleGrouping("spout");
builder.setBolt("Print", new PrintAndStoreBolt(), 1).shuffleGrouping("Compute");
4. Build ScoreProduceSpout, which implements IRichSpout (details in next week's report)
_collector.emit(new Values(tupleId, id, score), tupleId);
Here tupleId is a long that increments from 0, id is a four-character string over [0-9a-zA-Z], and score is a random number below 1,000,000. Passing tupleId as the last argument to emit() ties the tuple to the acker, which guarantees at the record level that no tuple is lost. (The ack mechanism is described in other posts.)
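The data the spout emits can be sketched as below. The names here are illustrative, not the project's actual code; the real spout wraps this generation logic in IRichSpout's nextTuple():

```java
import java.util.Random;

// Sketch of the values ScoreProduceSpout emits: a monotonically
// increasing tupleId, a 4-character id over [0-9a-zA-Z], and a score
// below 1,000,000. Class and method names are hypothetical.
public class ScoreGenerator {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private final Random random = new Random();
    private long tupleId = 0;  // incremented on every emit

    public Object[] next() {
        StringBuilder id = new StringBuilder(4);
        for (int i = 0; i < 4; i++) {
            id.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        int score = random.nextInt(1000000);
        return new Object[] { tupleId++, id.toString(), score };
    }
}
```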
5. Build ComputeBolt, which implements IRichBolt (details in next week's report)
The original data stream is split into 4 partitions processed in parallel; each task computes the top K on its own partition, and even if a task goes down or a tuple fails, the accumulated computation is redone. The execute() method implements the top-K algorithm, which is fairly involved; see the source code at the project address.
Implementing IRichBolt lets you control whether a tuple is acked or anchored and sent on to the next bolt (that is, letting the next bolt control the ack).
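The per-task top-K step can be sketched with a size-K min-heap, a standard approach: the smallest of the current top K sits at the heap root and is evicted when a larger score arrives. This is illustrative only; the project's actual execute() may differ (see the source at the project address):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the top-K computation each ComputeBolt task runs on its
// own share of the stream.
public class TopK {
    private final int k;
    private final PriorityQueue<Integer> heap = new PriorityQueue<>();

    public TopK(int k) { this.k = k; }

    public void offer(int score) {
        if (heap.size() < k) {
            heap.add(score);
        } else if (score > heap.peek()) {
            heap.poll();      // drop the smallest of the current top K
            heap.add(score);
        }
    }

    public List<Integer> result() {
        List<Integer> out = new ArrayList<>(heap);
        out.sort((a, b) -> b - a);   // highest score first
        return out;
    }
}
```

Each offer() is O(log K), so a task handles a high-rate stream while keeping only K scores in memory.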
6. Build PrintAndStoreBolt, which implements IRichBolt (details in next week's report)
All results are aggregated here; execute() implements an algorithm similar to ComputeBolt's, except the data volume is smaller (already filtered by ComputeBolt), and the results are printed (an interface is left for later persistence/output).
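The rollup step can be sketched as merging the partial top-K lists from the four ComputeBolt tasks and keeping the global top K. This is illustrative only; the real bolt consumes tuples rather than plain lists:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the final rollup PrintAndStoreBolt performs: combine the
// partial top-K lists and keep the K largest scores overall.
public class TopKMerge {
    public static List<Integer> merge(List<List<Integer>> partials, int k) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> p : partials) {
            all.addAll(p);               // at most 4 * K scores in total
        }
        Collections.sort(all, Collections.reverseOrder());
        return all.subList(0, Math.min(k, all.size()));
    }
}
```

Because each partial list has at most K entries, the merge handles at most 4K values regardless of the original stream size.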
7. Run it on the JStorm cluster: jstorm jar Topk.jar com.msopentech.jstorm.topk.topology.TopKServerTopology
This completes the basic top-K requirement.
Next week's plan
1. Build a cluster on Microsoft Azure and run the top-K algorithm there (I do not have an account yet and am asking my mentor for help).
2. Continue improving the top-K algorithm.
3. Implement REST API input.
Thanks to the CSDN open-source summer camp and Teacher Shang for their guidance and support!
(topk_on_jstorm) Second week work report: 2014-07-14 ~ 2014-07-20