R, an open-source language for statistical analysis, is quietly expanding its influence in the enterprise. Free extensions now allow the R engine to run on a Hadoop cluster, and Oracle's big data solution also ships with an R package.
R is a language and environment used mainly for statistical analysis and graphics. It was originally developed by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand (hence the name R) and is now developed by the R Development Core Team. R is a GNU project based on the S language and can be regarded as an implementation of S; much code written in S runs unmodified in R. R's evaluation semantics, notably lexical scoping, are borrowed from Scheme.
R's source code is free to download, and precompiled binaries are available for a variety of platforms, including UNIX (FreeBSD and Linux among them), Windows and Mac OS. R is operated mainly from the command line, although several graphical user interfaces have been developed.
Statisticians can now use R to analyze unstructured data stored in the Hadoop Distributed File System (HDFS). R can also work with HBase, a non-relational, column-oriented distributed data store modeled on Google's Bigtable. HBase is a subproject of the Apache Software Foundation's Hadoop project, and using it essentially amounts to using Hadoop to hold a database of structured data. R can likewise be used with Cassandra.
Reading data from Cassandra
The prerequisites are the RJDBC module, Cassandra version 1.0 or later, and the Cassandra JDBC driver. In the following example, the driver and Cassandra are in the same directory.
The example code assumes you have run through the Portfolio Manager demo that comes with DSC/DSE.
    # Load the RJDBC library
    library(RJDBC)

    # Load the Cassandra JDBC driver, with every jar in the lib directory on the classpath
    cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
                    list.files("/users/jake/workspace/bdp/resources/cassandra/lib/",
                               pattern = "jar$", full.names = TRUE))

    # Connect to the Cassandra node and keyspace
    casscon <- dbConnect(cassdrv, "jdbc:cassandra://localhost:9160/PortfolioDemo")

    # Query the time series data
    res <- dbGetQuery(casscon, "select * from StockHist limit 10")

    # Transpose so that each row key becomes a column
    tres <- t(res[2:10])

    # Plot one box per row key
    boxplot(tres, names = res$key, col = topo.colors(length(res$key)))
    title("BoxPlot of the stock price histories")
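The transpose-and-plot step can be tried without a running Cassandra node. The following is a minimal sketch on synthetic data, assuming the query result has the row key in the first column and nine price columns after it; the ticker names and values are made up for illustration.

```r
# Synthetic stand-in for the query result: one row per ticker,
# first column is the row key, remaining nine columns are prices.
# Ticker names and prices are invented for illustration only.
set.seed(42)
res <- data.frame(
  key = c("AAA", "BBB", "CCC"),
  matrix(runif(3 * 9, min = 10, max = 100), nrow = 3)
)

# Drop the key column and transpose, so each ticker becomes a column
tres <- t(res[2:10])

# boxplot() on a matrix draws one box per column, i.e. one per ticker
boxplot(tres, names = res$key, col = topo.colors(length(res$key)))
title("BoxPlot of the stock price histories")
```

The transpose matters because boxplot() treats each matrix column as a group; without it, the plot would show one box per price column rather than one per ticker.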
The RCassandra package is another good option.
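As a rough sketch of what the RCassandra route might look like, the helper below connects over Thrift, selects a keyspace and reads a column family into a data frame. The function name read_column_family is hypothetical, and the host, port, keyspace and column family defaults are assumptions carried over from the RJDBC example. RC.read.table expects every row to have the same columns, so it may not suit every column family.

```r
# Hypothetical helper: read a whole column family into a data frame.
# Host, port, keyspace and column family defaults are assumptions
# taken from the RJDBC example; adjust them for your cluster.
read_column_family <- function(host = "localhost", port = 9160L,
                               keyspace = "PortfolioDemo",
                               c.family = "StockHist") {
  if (!requireNamespace("RCassandra", quietly = TRUE)) {
    stop("the RCassandra package is not installed")
  }
  conn <- RCassandra::RC.connect(host, port)
  on.exit(RCassandra::RC.close(conn))   # always release the connection
  RCassandra::RC.use(conn, keyspace)
  # Assumes all rows share the same set of columns
  RCassandra::RC.read.table(conn, c.family)
}
```

Calling read_column_family() with a Cassandra node listening on localhost would return the column family as a data frame, with no JDBC driver jars required.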
R, Cassandra and Hive
To access Hive and Cassandra from R, first start the Hive server using DataStax Enterprise: dse hive --service hiveserver
    # Load the RJDBC library
    library(RJDBC)

    # Load the Hive JDBC driver, with the Hadoop and Hive jars on the classpath
    hivedrv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
                    c(list.files("/users/jake/workspace/bdp/resources/hadoop",
                                 pattern = "jar$", full.names = TRUE),
                      list.files("/users/jake/workspace/bdp/resources/hive/lib",
                                 pattern = "jar$", full.names = TRUE)))

    # Connect to the Hive service
    hivecon <- dbConnect(hivedrv, "jdbc:hive://localhost:10000/default")

    # Create a Hive table mapping to the Cassandra column family
    tmp <- dbSendQuery(hivecon, paste(
      "create external table stockhist",
      "(row_key string, column_name string, value double)",
      "stored by 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'",
      "with serdeproperties ('cassandra.ks.name' = 'PortfolioDemo')"))

    # Run a Hive query to get the average 10-day return per ticker
    hres <- dbGetQuery(hivecon, paste(
      "select a.row_key ticker, avg(b.value - a.value) ret",
      "from stockhist a join stockhist b",
      "on (a.row_key = b.row_key and date_add(a.column_name, 10) = b.column_name)",
      "group by a.row_key order by ret"))

    # Plot
    barplot(hres[, 2], names.arg = hres[, 1],
            col = topo.colors(length(hres[, 2])), border = NA)
    title("Avg 10 day return for all stocks")
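To make the Hive query easier to follow, the same self-join can be reproduced in plain R. This is only an illustrative sketch on made-up prices, assuming (as the query does) that the column names are dates: each price is paired with the price for the same ticker 10 days later, and the differences are averaged per ticker.

```r
# Made-up long-format prices: one row per (ticker, date), like the Hive table
dates <- as.Date("2012-01-01") + 0:19
prices <- data.frame(
  row_key     = rep(c("AAA", "BBB"), each = length(dates)),
  column_name = rep(dates, times = 2),
  value       = c(100 + 0:19,          # AAA rises by 1 per day
                  50 - 0.5 * (0:19))   # BBB falls by 0.5 per day
)

# Self-join: shift the dates of copy "a" forward by 10 days, the
# equivalent of date_add(a.column_name, 10), then match on ticker and date
a <- prices
a$column_name <- a$column_name + 10
joined <- merge(a, prices, by = c("row_key", "column_name"),
                suffixes = c(".a", ".b"))

# Average 10-day difference per ticker, ordered by return
joined$diff <- joined$value.b - joined$value.a
ret <- aggregate(diff ~ row_key, data = joined, FUN = mean)
ret <- ret[order(ret$diff), ]
```

With these synthetic prices, ret lists BBB first with an average difference of -5, followed by AAA with 10, which mirrors the "group by ... order by ret" part of the query.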
Conclusion
The examples above show that accessing Cassandra data from R is straightforward, and that combining the two brings powerful statistical methods to data stored in Cassandra.