R and Cassandra: A Strong Combination for Large-Scale Data Analysis

Last Update:2015-03-17 Source: Internet

Author: User


R, an open-source language for statistical analysis, is quietly expanding its influence in the enterprise. Free extensions allow the R engine to run on a Hadoop cluster, and Oracle's Big Data solution now also includes an R package.

R is a language and operating environment used mainly for statistical analysis and graphics. It was originally developed by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand (hence the name R, after their first initials) and is now developed by the R Development Core Team. R is a GNU project based on the S language, so it can be regarded as an implementation of S, and much code written in S runs without modification in the R environment. R's semantics are derived from Scheme.

R's source code is free to download, and precompiled executables are available for a variety of platforms, including UNIX (as well as FreeBSD and Linux), Windows, and Mac OS. R is operated primarily from the command line, and several graphical user interfaces have also been developed.

Statisticians can now use R to analyze unstructured data stored in the Hadoop Distributed File System (HDFS). R can also work with HBase, a non-relational, column-oriented distributed data store modeled mainly on Google's Bigtable and developed as a subproject of the Apache Software Foundation's Hadoop project. This is essentially equivalent to using Hadoop to hold a database of structured data. R can likewise be combined with Cassandra.

Reading data from Cassandra

The prerequisites are the RJDBC package, Cassandra version 1.0 or later, and the Cassandra JDBC driver. In the following example, the driver and Cassandra are in the same directory.

The example code assumes you have run through the Portfolio Manager demo that comes with DSC/DSE.

    # Load the RJDBC library
    library(RJDBC)

    # Load the Cassandra JDBC driver
    cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
                    list.files("/Users/jake/workspace/bdp/resources/cassandra/lib/",
                               pattern = "jar$", full.names = TRUE))

    # Connect to the Cassandra node and keyspace
    casscon <- dbConnect(cassdrv, "jdbc:cassandra://localhost:9160/PortfolioDemo")

    # Query time-series data
    res <- dbGetQuery(casscon, "select * from StockHist limit 10")

    # Transpose so that each stock becomes a column
    tres <- t(res[2:10])

    # Plot one box per stock
    boxplot(tres, names = res$key, col = topo.colors(length(res$key)))
    title("BoxPlot of the stock price histories")
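The transpose step is easier to follow on a small example. The sketch below uses a made-up data frame in place of the query result, so it runs without a Cassandra node. The tickers, column names, and prices are all hypothetical, and the real result has nine price columns rather than three.

```r
# Hypothetical stand-in for the result of dbGetQuery(): one row per ticker,
# a key column followed by one price column per date (all values made up).
res <- data.frame(
  key = c("AAA", "BBB", "CCC"),
  d1  = c(10.0, 20.0, 30.0),
  d2  = c(11.0, 19.5, 31.0),
  d3  = c(10.5, 21.0, 29.0),
  stringsAsFactors = FALSE
)

# Drop the key column and transpose, so that each ticker becomes a column.
# boxplot() draws one box per column of a matrix.
tres <- t(res[2:4])
colnames(tres) <- res$key

# Per-ticker medians: the centre line of each box in the plot.
stats <- apply(tres, 2, median)

# Draw to a null PDF device so the sketch also runs non-interactively.
pdf(NULL)
boxplot(tres, names = res$key, col = topo.colors(length(res$key)))
title("Box plot of made-up price histories")
invisible(dev.off())
```

Without the transpose, boxplot() would draw one box per date column rather than one per stock.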

The RCassandra package is also a good choice.
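As a rough sketch of that route, the helper below reads the same column family through RCassandra instead of JDBC. It assumes the package is installed and reuses the host, Thrift port (9160), keyspace, and column family names of the demo. The function name read_stock_hist is hypothetical, and the calls have not been checked against a live cluster here.

```r
# Hypothetical helper: read the StockHist column family with RCassandra
# instead of RJDBC. Host, port, keyspace, and column family names are
# assumptions carried over from the Portfolio Manager demo.
read_stock_hist <- function(host = "localhost", port = 9160L) {
  if (!requireNamespace("RCassandra", quietly = TRUE)) {
    stop("the RCassandra package is not installed")
  }
  conn <- RCassandra::RC.connect(host, port)
  # Close the connection when the function exits, even on error.
  on.exit(RCassandra::RC.close(conn), add = TRUE)
  RCassandra::RC.use(conn, "PortfolioDemo")
  RCassandra::RC.read.table(conn, "StockHist")
}
```

Calling read_stock_hist() requires a reachable Cassandra node; defining the function does not.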

R, Cassandra and Hive

To access Hive and Cassandra from R, first start the Hive server with DataStax Enterprise:

    dse hive --service hiveserver

    # Load the RJDBC library
    library(RJDBC)

    # Load the Hive JDBC driver
    hivedrv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
                    c(list.files("/Users/jake/workspace/bdp/resources/hadoop",
                                 pattern = "jar$", full.names = TRUE),
                      list.files("/Users/jake/workspace/bdp/resources/hive/lib",
                                 pattern = "jar$", full.names = TRUE)))

    # Connect to the Hive service
    hivecon <- dbConnect(hivedrv, "jdbc:hive://localhost:10000/default")

    # Create a Hive table mapped to the Cassandra column family
    tmp <- dbSendQuery(hivecon,
      "create external table StockHist (row_key string, column_name string, value double)
       stored by 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
       with serdeproperties ('cassandra.ks.name' = 'PortfolioDemo')")

    # Run a Hive query to get the average 10-day return per stock
    hres <- dbGetQuery(hivecon,
      "select a.row_key ticker, avg(b.value - a.value) ret
       from StockHist a join StockHist b on
       (a.row_key = b.row_key and date_add(a.column_name, 10) = b.column_name)
       group by a.row_key order by ret")

    # Plot
    barplot(hres[, 2], names.arg = hres[, 1], col = topo.colors(length(hres[, 2])), border = NA)
    title("Avg 10 day return for all stocks")
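The Hive query pairs each price with the price of the same stock 10 days later and averages the differences per stock. The sketch below reproduces that logic in plain R on a made-up table, which may help when checking results. The tickers, dates, and prices are hypothetical, and it assumes the column names hold dates.

```r
# Hypothetical long-format data mirroring the Hive table's three columns
# (row_key, column_name, value); tickers, dates, and prices are made up.
stockhist <- data.frame(
  row_key     = rep(c("AAA", "BBB"), each = 3),
  column_name = as.Date(rep(c("2011-01-01", "2011-01-11", "2011-01-21"), times = 2)),
  value       = c(10, 12, 11, 50, 45, 48),
  stringsAsFactors = FALSE
)

# Self-join: pair each row (a) with the row for the same ticker dated
# 10 days later (b), like date_add(a.column_name, 10) = b.column_name.
a <- stockhist
a$join_date <- a$column_name + 10
joined <- merge(a, stockhist,
                by.x = c("row_key", "join_date"),
                by.y = c("row_key", "column_name"),
                suffixes = c(".a", ".b"))

# Average the 10-day price change per ticker and sort ascending,
# like avg(b.value - a.value) ... group by a.row_key order by ret.
joined$diff <- joined$value.b - joined$value.a
hres <- aggregate(diff ~ row_key, data = joined, FUN = mean)
names(hres) <- c("ticker", "ret")
hres <- hres[order(hres$ret), ]
```

Rows with no observation 10 days later drop out of the join, just as they do in the inner join of the Hive query.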

Conclusion

The examples above show that accessing Cassandra data from R is straightforward, and that combining the two brings R's powerful statistical methods to data stored in Cassandra.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
