Spark SQL: implemented with PySpark


Last time, in a Spark discussion group, the experts argued: will the DataSet replace the RDD?

Expert 1: I heard that MLlib will be implemented with DataSets later on. Sigh, the RDD is on its way out.

Expert 2: DataSets are mainly for implementing SQL and have little to do with MLlib, so why would it use DataSets?

Expert 3: Because the boss likes it. Look at the job market: someone who writes SQL and someone who does Spark development are two different salary grades. In two words: "save money".

Conclusion: The above really is how it goes. Many of the results we see are, to some degree, the outcome of market choices.

--------------------------------------------------------------------------------

As for my own way of learning Spark SQL, I prefer to get something working first and understand the underlying principles afterwards. However, the Spark SQL data types are worth understanding up front, or things simply may not work.

The classes inside Spark SQL:

These are all documented at: http://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql

The key one is SQLContext, the entry point for creating DataFrames; a DataFrame is equivalent to a table, and the Row format is used all the time.

For the rest, you can look up online the differences and connections between DataFrame and RDD; at present, MLlib is still mostly written with RDDs.
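As a quick illustration of that connection (a minimal sketch, assuming the PySpark shell where sc and sqlContext already exist; the column names here are made up):

from pyspark.sql import Row

# an RDD of Row objects can become a DataFrame ...
rdd = sc.parallelize([Row(sid="s1", age=20), Row(sid="s2", age=21)])
df = sqlContext.createDataFrame(rdd)
# ... and a DataFrame can go back to an RDD of Rows, e.g. for MLlib-style code
back = df.rdd.map(lambda r: (r.sid, r.age))
print(back.collect())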

Here is how to write it in PySpark:

### first table
from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)  # sc is the SparkContext provided by the PySpark shell
ccdata = sc.textFile("/home/srtest/spark/spark-1.3.1/examples/src/main/resources/cc.txt")
ccpart = ccdata.map(lambda le: le.split(","))  # my table is comma-separated
# turn each record into a Row; this is also where the data types get fixed
cc1 = ccpart.map(lambda p: Row(sid=p[0], age=int(p[1]), yz=p[2], yf=p[3], yc=p[4], hf=p[5], hk=p[6]))
# in the source, createDataFrame(data, schema): if the previous step had not
# produced Rows, the conversion to a DataFrame could not be completed
schemacc1 = sqlContext.createDataFrame(cc1)
schemacc1.registerTempTable("cc1")  # register a temporary table
xx = sqlContext.sql("SELECT * FROM cc1 WHERE age = 20")  # table operations by writing SQL directly
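To see what came back (show(), count() and collect() are all available on DataFrame as of Spark 1.3):

xx.show()            # prints the first rows as a table
print(xx.count())    # how many rows have age = 20
rows = xx.collect()  # brings the matching Rows back to the driver as a list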

Point 1: With the example above, you will probably reach for relations like IN and EXISTS. Spark versions before 2.0 do not support IN/EXISTS subqueries; from 2.0 onward you can use them however you like.

Then someone will surely ask: if you want to use IN or EXISTS, what can you do? I can only say: build one more table and implement it with a JOIN, as sketched below;
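A minimal sketch of that workaround, reusing the cc1 table from above; the vip table and its sid values are hypothetical, purely for illustration:

# hypothetical second table holding the sids we want to filter on
vip = sc.parallelize([Row(sid="1001"), Row(sid="1002")])
schemavip = sqlContext.createDataFrame(vip)
schemavip.registerTempTable("vip")

# instead of: SELECT * FROM cc1 WHERE sid IN (SELECT sid FROM vip)
# an inner join keeps only the rows whose sid appears in vip, which is the
# same effect as IN, provided vip.sid has no duplicates
yy = sqlContext.sql("SELECT cc1.sid, cc1.age FROM cc1 JOIN vip ON cc1.sid = vip.sid")
yy.show()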

Point 2: In the next blog post, I intend to use the DataFrame API directly to implement the SQL, without registering a temporary table.
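As a small preview (a sketch only, using the 1.3 DataFrame API; the same age = 20 query as above, with no temporary table involved):

xx2 = schemacc1.filter(schemacc1.age == 20)  # same as: SELECT * FROM cc1 WHERE age = 20
xx2.show()
schemacc1.select("sid", "age").show()        # same as: SELECT sid, age FROM cc1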
