Last time in a Spark discussion group, the gurus were debating: will DataSet replace RDD?
Guru 1: I heard MLlib will be implemented with DataSet later on. Sigh, RDD is going to the dogs.
Guru 2: DataSet is mainly for implementing SQL and doesn't have much to do with MLlib, so why would they use DataSet?
Guru 3: Because the boss likes it. Look at the job market: someone who can only write SQL and someone who can do Spark development are two different salary grades. Two words: "save money".
Conclusion: The above really is how it goes; much of the time the outcome we see is, to some degree, the result of market choice.
------------------------------------------------------------ Gorgeous split line ------------------------------------------------------------
As for my own approach to learning SparkSQL: I prefer to get things working first and only then dig into the underlying principles. For SparkSQL, though, you should understand the data types first, otherwise things may not get done.
The classes inside SparkSQL:
These are documented at: http://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql
The key points: SQLContext is the container for DataFrames; a DataFrame is equivalent to a table; the Row format is used very often.
The rest you can look up online, such as the difference and relationship between DataFrame and RDD; at present MLlib is mostly written with RDDs.
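To make that relationship concrete, here is a minimal sketch (assuming `sc` is the SparkContext already available in the pyspark shell; the column names sid and age are just illustrative): SQLContext wraps the SparkContext, a DataFrame is built from an RDD of Row objects, and it can be turned back into an RDD.

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)                                          # SQLContext wraps the existing SparkContext
rdd = sc.parallelize([Row(sid="1", age=20), Row(sid="2", age=30)])   # an RDD of Row objects
df = sqlContext.createDataFrame(rdd)                                 # RDD[Row] -> DataFrame (a table)
backToRdd = df.rdd                                                   # DataFrame -> RDD[Row] again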
Here is how to write it in PySpark:
### first table
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)   # sc is the SparkContext from the pyspark shell
ccData = sc.textFile("/home/srtest/spark/spark-1.3.1/examples/src/main/resources/cc.txt")
ccPart = ccData.map(lambda le: le.split(","))   # my table uses a comma as the delimiter
cc1 = ccPart.map(lambda p: Row(sid=p[0], age=int(p[1]), yz=p[2], yf=p[3], yc=p[4], hf=p[5], hk=p[6]))   # turn each record into Row format, which also fixes the data types
schemaCC1 = sqlContext.createDataFrame(cc1)   # in the source this is createDataFrame(data, schema), so if the previous step did not convert to Row the conversion to a DataFrame is incomplete
schemaCC1.registerTempTable("cc1")   # register a temporary table
xx = sqlContext.sql("SELECT * FROM cc1 WHERE age = 20")   # table operations can be written directly in SQL
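The result xx is itself a DataFrame; a quick way to inspect it (a minimal sketch using methods available in the 1.3 API):

xx.show()            # print the first rows of the result
print(xx.count())    # number of rows where age = 20
rows = xx.collect()  # pull the result back to the driver as a list of Row objects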
Point 1: In the example above, you will probably also want to use relations like IN and EXISTS. Spark versions before 2.0 do not support IN/EXISTS; from 2.0 onward you can use them however you like.
Someone will surely ask: if you do want IN/EXISTS on an older version, what can you do? All I can say is build one more table and implement it with a JOIN, as sketched below.
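As an illustration, suppose there were a second, hypothetical table vip with an sid column, and you wanted the rows of cc1 whose sid appears in vip (what IN would express); a sketch of the JOIN rewrite:

# instead of: SELECT * FROM cc1 WHERE sid IN (SELECT sid FROM vip)
xx2 = sqlContext.sql("SELECT c.* FROM cc1 c JOIN vip v ON c.sid = v.sid")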
Point 2: In the next blog post I plan to use the DataFrame API directly to implement the SQL, without registering a temporary table.
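As a small preview (a sketch only, using the schemaCC1 DataFrame built above), the same age = 20 query written against the DataFrame directly, with no temporary table:

xxDf = schemaCC1.filter(schemaCC1.age == 20)                           # same as: SELECT * FROM cc1 WHERE age = 20
xxCols = schemaCC1.select("sid", "age").filter(schemaCC1.age == 20)    # pick specific columns as well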