1. Cassandra Preparation
Start cqlsh:

CQLSH_HOST=172.16.163.131 bin/cqlsh
cqlsh> CREATE KEYSPACE productlogs WITH REPLICATION = { 'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2' };

cqlsh> CREATE TABLE productlogs.logs (
           ids uuid,
           app_name text,
           app_version text,
           city text,
           client_time timestamp,
           country text,
           created_at timestamp,
           cs_count int,
           device_id text,
           id int,
           modle_name text,
           province text,
           remote_ip text,
           updated_at timestamp,
           PRIMARY KEY (ids)
       );
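To confirm that the keyspace and table came out as intended, you can describe the table from cqlsh (note that the modle_name column name is kept exactly as defined above, since the rest of the walkthrough uses it):

cqlsh> DESCRIBE TABLE productlogs.logs;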
2. Spark Cassandra Connector jar Package
Create a new empty SBT project, add the connector as a dependency, and package it as spark-cassandra-connector-full.jar.

The point of this step is to avoid relying on the official connector package directly: the official jar does not bundle its dependencies, so using it as-is means tracking down everything it depends on, and different connector versions pull in different packages and versions. For simplicity, build a single full (assembly) jar, as sketched below.
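A minimal sketch of such a build with sbt-assembly; the Scala, plugin, and connector version numbers here are assumptions chosen to match Spark 1.5 and should be adjusted to your cluster:

// project/plugins.sbt: the sbt-assembly plugin produces the bundled "fat" jar
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

// build.sbt
name := "spark-cassandra-connector-full"

scalaVersion := "2.10.5"

// connector version is an assumption; pick the release matching your Spark
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M3"

Running sbt assembly then produces the bundled jar under target/scala-2.10/, which you can rename to spark-cassandra-connector-full.jar.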
3. Start spark-shell
/opt/db/spark-1.5.2-bin-hadoop2.6/bin/spark-shell --master spark://u1:7077 --jars ~/spark-cassandra-connector-full.jar
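As an alternative (a sketch using spark-shell's standard --conf flag), the Cassandra host used in step 4 below can also be supplied at launch, so the SparkContext configuration never needs to be touched afterwards:

/opt/db/spark-1.5.2-bin-hadoop2.6/bin/spark-shell --master spark://u1:7077 \
    --jars ~/spark-cassandra-connector-full.jar \
    --conf spark.cassandra.connection.host=172.16.163.131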
The commands that follow are all entered at the spark-shell prompt.
4. Prepare the data source:
// Many write-ups stop the current sc and start a new one; there is no need:
// just add the Cassandra parameter to the existing SparkContext configuration
scala> sc.getConf.set("spark.cassandra.connection.host", "172.16.163.131")

// Read the data source from HDFS
scala> val lines = sc.textFile("/data/logs")

// Import the required namespaces
scala> import org.apache.spark.sql._
scala> import org.apache.spark.sql.types._
scala> import com.datastax.spark.connector._
scala> import java.util.UUID

// Define the schema
scala> val schema = StructType(
           StructField("ids", StringType, true) ::
           StructField("id", IntegerType, true) ::
           StructField("app_name", StringType, true) ::
           StructField("app_version", StringType, true) ::
           StructField("client_time", TimestampType, true) ::
           StructField("device_id", StringType, true) ::
           StructField("modle_name", StringType, true) ::
           StructField("cs_count", IntegerType, true) ::
           StructField("created_at", TimestampType, true) ::
           StructField("updated_at", TimestampType, true) ::
           StructField("remote_ip", StringType, true) ::
           StructField("country", StringType, true) ::
           StructField("province", StringType, true) ::
           StructField("city", StringType, true) :: Nil)

// Map the raw lines into Rows that match the schema
scala> val rowRDD = lines.map(_.split("\t")).map(p => Row(
           UUID.randomUUID().toString(), p(0).toInt, p(1), p(2),
           java.sql.Timestamp.valueOf(p(3)), p(4), p(5), p(6).toInt,
           java.sql.Timestamp.valueOf(p(7)), java.sql.Timestamp.valueOf(p(8)),
           p(9), p(10), p(11), p(12)))

scala> val df = sqlContext.createDataFrame(rowRDD, schema)
scala> df.registerTempTable("logs")

// Look at the result
scala> sqlContext.sql("SELECT * FROM logs LIMIT 1").show
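One caveat: the map above assumes every line has exactly 13 tab-separated fields, with integers at positions 0 and 6 and yyyy-mm-dd hh:mm:ss timestamps at positions 3, 7 and 8, so a single malformed line makes the whole job fail at runtime. A defensive variant (a sketch, same field layout assumed) simply drops lines with the wrong arity:

scala> val rowRDD = lines.map(_.split("\t")).
           filter(_.length == 13).  // skip lines that do not have all 13 fields
           map(p => Row(UUID.randomUUID().toString(), p(0).toInt, p(1), p(2),
             java.sql.Timestamp.valueOf(p(3)), p(4), p(5), p(6).toInt,
             java.sql.Timestamp.valueOf(p(7)), java.sql.Timestamp.valueOf(p(8)),
             p(9), p(10), p(11), p(12)))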
5. Write the data into Cassandra
scala> import org.apache.spark.sql.cassandra._
scala> df.write.format("org.apache.spark.sql.cassandra").
           options(Map("table" -> "logs", "keyspace" -> "productlogs")).
           save()
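For completeness, the connector also offers a lower-level RDD write path via saveToCassandra. A minimal sketch with a hypothetical two-column row (it reuses the com.datastax.spark.connector._ import from step 4 and relies on the connector converting the String to the uuid column type):

scala> sc.parallelize(Seq((UUID.randomUUID().toString, "demo_app"))).
           saveToCassandra("productlogs", "logs", SomeColumns("ids", "app_name"))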
6. Read back the data that was just saved:
scala> import org.apache.spark.sql.cassandra._
scala> val cdf = sqlContext.read.
           format("org.apache.spark.sql.cassandra").
           options(Map("table" -> "logs", "keyspace" -> "productlogs")).
           load()

// "logs" is already taken by the temp table from step 4, so register a new name
scala> cdf.registerTempTable("logs_just_saved")
scala> sqlContext.sql("SELECT * FROM logs_just_saved LIMIT 1").show
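The same read is also available through the connector's RDD API (a sketch; cassandraTable is brought in by the com.datastax.spark.connector._ import from step 4):

scala> val logsRdd = sc.cassandraTable("productlogs", "logs")
scala> logsRdd.first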