R and Cassandra: a strong combination for large-scale data analysis

Source: Internet
Author: User

R, an open-source language for statistical data analysis, is quietly expanding its influence in the enterprise. Freely available extensions add functionality, and some allow the R engine to run on a Hadoop cluster. R now also appears in Oracle's big data solution.

R is a language and operating environment used mainly for statistical analysis and graphics. It was originally developed by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand (the name R comes from their first initials) and is now developed by the R Development Core Team. R is a GNU project based on the S language and can be regarded as an implementation of S; much code written in S runs without modification in the R environment. R's semantics are derived from Scheme.

R's source code is free to download, and precompiled executables are available for a variety of platforms, including UNIX (FreeBSD and Linux among them), Windows and macOS. R is primarily operated from the command line, although several graphical user interfaces have been developed.

Statisticians can now apply R, which excels at analysis, to unstructured data stored in the Hadoop Distributed File System. R can also work with HBase, a non-relational, column-oriented distributed data store modeled mainly on Google's Bigtable and developed as a subproject of the Apache Software Foundation's Hadoop project. In effect, this uses Hadoop to hold a database of structured data. R can likewise be combined with Cassandra.

Reading data from Cassandra

The prerequisites are the RJDBC package, Cassandra 1.0 or later, and the Cassandra JDBC driver. In the following example, the driver is in the same directory as the Cassandra libraries.

The example code assumes you have run through the Portfolio Manager demo that ships with DSC/DSE.

# Load RJDBC
library(RJDBC)

# Load the Cassandra JDBC driver, putting every jar in the lib directory on the classpath
cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
                list.files("/users/jake/workspace/bdp/resources/cassandra/lib/",
                           pattern = "jar$", full.names = TRUE))

# Connect to the Cassandra node and keyspace
casscon <- dbConnect(cassdrv, "jdbc:cassandra://localhost:9160/PortfolioDemo")

# Query time series data
res <- dbGetQuery(casscon, "select * from StockHist limit 10")

# Transpose so that each stock becomes a column (column 1 holds the row key)
tres <- t(res[2:10])

# Plot one box per stock
boxplot(tres, names = res$key, col = topo.colors(length(res$key)))
title("Box plot of the stock price histories")
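The transpose step can be tried without a Cassandra node. The following is only a sketch: the data frame stands in for the query result, and its column names (d1 to d3) and prices are invented for illustration. It shows why the key column is dropped before transposing: boxplot() draws one box per matrix column, so each stock must end up as a column.

```r
# Hypothetical stand-in for the result of dbGetQuery(); the tickers,
# column names and prices are invented for illustration only.
res <- data.frame(
  key = c("AAA", "BBB", "CCC"),
  d1  = c(10.0, 20.0, 30.0),
  d2  = c(11.0, 19.0, 33.0),
  d3  = c(12.0, 21.0, 31.0),
  stringsAsFactors = FALSE
)

# Drop the key column and transpose: rows become dates, columns become stocks.
tres <- t(res[2:4])

# boxplot() on a matrix draws one box per column, i.e. one per stock.
boxplot(tres, names = res$key, col = topo.colors(length(res$key)))
title("Box plot of the stock price histories")
```

With real data, the column range in res[2:10] has to match the number of price columns the query actually returns.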

The RCassandra package is another good option.

R, Cassandra and Hive

R can also reach Cassandra through Hive. With DataStax Enterprise, start the Hive server first: dse hive --service hiveserver

# Load RJDBC
library(RJDBC)

# Load the Hive JDBC driver, with the Hadoop and Hive jars on the classpath
hivedrv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
                c(list.files("/users/jake/workspace/bdp/resources/hadoop",
                             pattern = "jar$", full.names = TRUE),
                  list.files("/users/jake/workspace/bdp/resources/hive/lib",
                             pattern = "jar$", full.names = TRUE)))

# Connect to the Hive service
hivecon <- dbConnect(hivedrv, "jdbc:hive://localhost:10000/default")

# Create a Hive table mapped to the Cassandra column family
tmp <- dbSendQuery(hivecon, "create external table stockhist (row_key string, column_name string, value double)
    stored by 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
    with serdeproperties ('cassandra.ks.name' = 'PortfolioDemo')")

# Run a Hive query to get the average 10-day return per stock
hres <- dbGetQuery(hivecon, "select a.row_key ticker, avg(b.value - a.value) ret
    from stockhist a join stockhist b
      on (a.row_key = b.row_key and date_add(a.column_name, 10) = b.column_name)
    group by a.row_key order by ret")

# Plot
barplot(hres[, 2], names.arg = hres[, 1], col = topo.colors(length(hres[, 2])), border = NA)
title("Avg 10 day return for all stocks")
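To see what the Hive query computes, the same self-join can be reproduced in base R on a small in-memory table. This is a sketch under stated assumptions, not part of the original example: the tickers, dates and prices are invented, and merge() and aggregate() stand in for Hive's join and group by.

```r
# Hypothetical long-format price table mirroring the Hive table's columns
# (row_key, column_name, value); the tickers, dates and prices are invented.
dates <- seq(as.Date("2012-01-01"), by = "day", length.out = 15)
stockhist <- rbind(
  data.frame(row_key = "AAA", column_name = dates, value = 100 + 0:14),
  data.frame(row_key = "BBB", column_name = dates, value = 50 - 2 * (0:14))
)

# Shift each row's date forward by 10 days: the analogue of date_add(a.column_name, 10).
a <- stockhist
a$column_name <- a$column_name + 10

# Self-join: pair each price with the same ticker's price 10 days later.
joined <- merge(a, stockhist, by = c("row_key", "column_name"),
                suffixes = c(".a", ".b"))

# Average the 10-day change per ticker and sort ascending, like "order by ret".
hres <- aggregate(ret ~ row_key,
                  data = transform(joined, ret = value.b - value.a),
                  FUN = mean)
hres <- hres[order(hres$ret), ]
```

Dates with no counterpart 10 days later drop out of the join, which is also how the inner join in the Hive query behaves.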

Conclusion

The examples above show that accessing Cassandra data from R is straightforward, and that combining the two brings powerful statistical methods to data stored in Cassandra.
