Getting started with SparkSQL

SparkSQL: manipulating text files

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class PageViews(track_time: String, url: String, session_id: String, referer: String,
                     ip: String, end_user_id: String, city_id: String)

val page_views = sc.textFile("hdfs://hadoop000:8020/sparksql/page_views.dat")
  .map(_.split("\t"))
  .map(p => PageViews(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

page_views.registerTempTable("page_views")

val sql1 = sql("SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 LIMIT 10")
sql1.collect()

val sql2 = sql("SELECT session_id, count(*) c FROM page_views GROUP BY session_id ORDER BY c DESC LIMIT 10")
sql2.collect()

SparkSQL: manipulating Parquet files

SparkSQL supports reading data from Parquet and preserves the schema as metadata when writing to Parquet; because Parquet is a columnar format, queries avoid reading columns they don't need, which improves query efficiency and reduces GC pressure.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Person(name: String, age: Int)

val people = sc.textFile("hdfs://hadoop000:8020/sparksql/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// save as Parquet
people.saveAsParquetFile("hdfs://hadoop000:8020/sparksql/resources/people.parquet")

// read back
val parquetFile = sqlContext.parquetFile("hdfs://hadoop000:8020/sparksql/resources/people.parquet")
parquetFile.registerAsTable("parquetFile")

val teenagers = sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect()
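As a quick check that the schema really is carried along in the Parquet metadata, a minimal sketch that just reuses the parquetFile value read back above:

parquetFile.printSchema()           // prints the name/age columns recovered from the Parquet file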

SparkSQL: manipulating JSON files

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val path = "hdfs://hadoop000:8020/sparksql/resources/people.json"
val people = sqlContext.jsonFile(path)

people.printSchema()
people.registerTempTable("people")

val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.collect()

val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
anotherPeople.collect()
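Since the second record contains a nested address object, it can also be registered as a table and the nested fields reached with dotted names; a hedged sketch (the table name another_people is made up for illustration, and it assumes your Spark SQL version supports dotted access to struct fields):

anotherPeople.registerTempTable("another_people")
sql("SELECT address.city, address.state FROM another_people").collect().foreach(println)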

SparkSQL: using the DSL

With the DSL we can run SQL-style operations directly on the RDD we have read, without registering it as a table; Scala symbols (such as 'name) stand for the columns of the table.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Person(name: String, age: Int)

val people = sc.textFile("hdfs://hadoop000:8020/sparksql/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

val teenagers = people.where('age >= 13).where('age <= 19).select('name)
teenagers.toDebugString
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
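For comparison, a hedged sketch of the same query written the register-a-table way (the table name people_tbl is only for illustration):

people.registerTempTable("people_tbl")
sql("SELECT name FROM people_tbl WHERE age >= 13 AND age <= 19").collect().foreach(println)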

SparkSQL: operating on existing Hive tables

spark-shell mode access:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

sql("SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 LIMIT 10").collect().foreach(println)
sql("SELECT session_id, count(*) c FROM page_views GROUP BY session_id ORDER BY c DESC LIMIT 10").collect().foreach(println)

spark-sql mode access:

You first need to copy hive-site.xml to $SPARK_HOME/conf, as sketched below.
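A minimal sketch of that setup and of launching the spark-sql CLI (the $HIVE_HOME path is an assumption; point it at wherever your Hive configuration actually lives):

cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
cd $SPARK_HOME/bin
./spark-sql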

SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 LIMIT 10;
SELECT session_id, count(*) c FROM page_views GROUP BY session_id ORDER BY c DESC LIMIT 10;

hive-thriftserver mode access:

1) Start Hive-thriftserver:

cd $SPARK_HOME/sbin
./start-thriftserver.sh

To start it on a specific port: ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=14000

2) Start the Beeline client:

cd $SPARK_HOME/bin
./beeline -u jdbc:hive2://hadoop000:10000/default -n spark

SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 LIMIT 10;
SELECT session_id, count(*) c FROM page_views GROUP BY session_id ORDER BY c DESC LIMIT 10;
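If the thrift server was started with the custom port from step 1, the JDBC URL has to match it; a hedged sketch:

cd $SPARK_HOME/bin
./beeline -u jdbc:hive2://hadoop000:14000/default -n spark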

SparkSQL: caching tables

Notes for Spark 1.2 and later:

1) Both SchemaRDD.cache and sqlContext.cacheTable now cache the data in memory in columnar format;

2) sqlContext.cacheTable/uncacheTable are eager rather than lazy: there is no need to trigger an action manually before the data is cached;

3) You can choose lazy or eager behavior explicitly with CACHE [LAZY] TABLE tb1 [AS SELECT ...];

Observe how the Storage tab of the web UI changes after cacheTable:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

sql("CACHE TABLE page_views")
sql("SELECT session_id, count(session_id) as c FROM page_views GROUP BY session_id ORDER BY c DESC LIMIT 10").collect().foreach(println)
sql("UNCACHE TABLE page_views")

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

sql("CACHE TABLE page_views_cached_eager AS SELECT * FROM page_views")
sql("SELECT session_id, count(session_id) as c FROM page_views_cached_eager GROUP BY session_id ORDER BY c DESC LIMIT 10").collect().foreach(println)
uncacheTable("page_views_cached_eager")

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

sql("CACHE LAZY TABLE page_views_cached_lazy AS SELECT * FROM page_views")
sql("SELECT count(*) as c FROM page_views_cached_lazy").collect().foreach(println)
sql("SELECT session_id, count(session_id) as c FROM page_views_cached_lazy GROUP BY session_id ORDER BY c DESC LIMIT 10").collect().foreach(println)
uncacheTable("page_views_cached_lazy")
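The same caching can be driven programmatically rather than through SQL statements; a minimal sketch using the cacheTable/uncacheTable methods mentioned in the notes above:

hiveContext.cacheTable("page_views")      // eager: the columnar cache is built right away
sql("SELECT count(*) FROM page_views").collect().foreach(println)
hiveContext.uncacheTable("page_views")    // release the cached columnar buffers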
