Getting started with SparkSQL

SparkSQL: manipulating text files

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class PageViews(track_time: String, url: String, session_id: String, referer: String,
                     ip: String, end_user_id: String, city_id: String)

val page_views = sc.textFile("hdfs://hadoop000:8020/sparksql/page_views.dat")
  .map(_.split("\t"))
  .map(p => PageViews(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

page_views.registerTempTable("page_views")

val sql1 = sql("SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 LIMIT 10")
sql1.collect()

val sql2 = sql("SELECT session_id, count(*) c FROM page_views GROUP BY session_id ORDER BY c DESC LIMIT 10")
sql2.collect()

SparkSQL: manipulating Parquet files

SparkSQL supports reading data from Parquet and preserves the schema as metadata when writing to Parquet; because Parquet is a columnar format, queries avoid reading columns they don't need, which improves query efficiency and reduces GC pressure.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Person(name: String, age: Int)

val people = sc.textFile("hdfs://hadoop000:8020/sparksql/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// save as Parquet
people.saveAsParquetFile("hdfs://hadoop000:8020/sparksql/resources/people.parquet")

// read back
val parquetFile = sqlContext.parquetFile("hdfs://hadoop000:8020/sparksql/resources/people.parquet")
parquetFile.registerAsTable("parquetFile")

val teenagers = sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect()
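As a quick check that the schema really is carried along in the Parquet metadata, a minimal sketch that just reuses the parquetFile value read back above:

parquetFile.printSchema()           // prints the name/age columns recovered from the Parquet file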

SparkSQL: manipulating JSON files

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val path = "hdfs://hadoop000:8020/sparksql/resources/people.json"
val people = sqlContext.jsonFile(path)

people.printSchema()
people.registerTempTable("people")

val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.collect()

val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
anotherPeople.collect()
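Since the second record contains a nested address object, it can also be registered as a table and the nested fields reached with dotted names; a hedged sketch (the table name another_people is made up for illustration, and it assumes your Spark SQL version supports dotted access to struct fields):

anotherPeople.registerTempTable("another_people")
sql("SELECT address.city, address.state FROM another_people").collect().foreach(println)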

SparkSQL: using the DSL

With the DSL we can run SQL-style operations directly on the RDD we have read, without registering it as a table; Scala symbols (such as 'name) stand for the columns of the table.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Person(name: String, age: Int)

val people = sc.textFile("hdfs://hadoop000:8020/sparksql/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

val teenagers = people.where('age >= 13).where('age <= 19).select('name)
teenagers.toDebugString
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
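For comparison, a hedged sketch of the same query written the register-a-table way (the table name people_tbl is only for illustration):

people.registerTempTable("people_tbl")
sql("SELECT name FROM people_tbl WHERE age >= 13 AND age <= 19").collect().foreach(println)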

SparkSQL: operating on existing Hive tables

spark-shell mode access:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

sql("SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 LIMIT 10").collect().foreach(println)
sql("SELECT session_id, count(*) c FROM page_views GROUP BY session_id ORDER BY c DESC LIMIT 10").collect().foreach(println)

spark-sql mode access:

You first need to copy hive-site.xml to $SPARK_HOME/conf, as sketched below.
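A minimal sketch of that setup and of launching the spark-sql CLI (the $HIVE_HOME path is an assumption; point it at wherever your Hive configuration actually lives):

cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
cd $SPARK_HOME/bin
./spark-sql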

SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 LIMIT 10;
SELECT session_id, count(*) c FROM page_views GROUP BY session_id ORDER BY c DESC LIMIT 10;

hive-thriftserver mode access:

1) Start Hive-thriftserver:

cd $SPARK_HOME/sbin
./start-thriftserver.sh

To start it on a specific port: ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=14000

2) Start the Beeline client:

cd $SPARK_HOME/bin
./beeline -u jdbc:hive2://hadoop000:10000/default -n spark

SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 LIMIT 10;
SELECT session_id, count(*) c FROM page_views GROUP BY session_id ORDER BY c DESC LIMIT 10;
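If the thrift server was started with the custom port from step 1, the JDBC URL has to match it; a hedged sketch:

cd $SPARK_HOME/bin
./beeline -u jdbc:hive2://hadoop000:14000/default -n spark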

SparkSQL: caching tables

Notes for Spark 1.2 and later:

1) Both SchemaRDD.cache and sqlContext.cacheTable now cache the data in memory in columnar format;

2) sqlContext.cacheTable/uncacheTable are eager rather than lazy: there is no need to trigger an action manually before the data is cached;

3) You can choose lazy or eager behavior explicitly with CACHE [LAZY] TABLE tb1 [AS SELECT ...];

Observe how the Storage tab of the web UI changes after cacheTable:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

sql("CACHE TABLE page_views")
sql("SELECT session_id, count(session_id) as c FROM page_views GROUP BY session_id ORDER BY c DESC LIMIT 10").collect().foreach(println)
sql("UNCACHE TABLE page_views")

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

sql("CACHE TABLE page_views_cached_eager AS SELECT * FROM page_views")
sql("SELECT session_id, count(session_id) as c FROM page_views_cached_eager GROUP BY session_id ORDER BY c DESC LIMIT 10").collect().foreach(println)
uncacheTable("page_views_cached_eager")

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

sql("CACHE LAZY TABLE page_views_cached_lazy AS SELECT * FROM page_views")
sql("SELECT count(*) as c FROM page_views_cached_lazy").collect().foreach(println)
sql("SELECT session_id, count(session_id) as c FROM page_views_cached_lazy GROUP BY session_id ORDER BY c DESC LIMIT 10").collect().foreach(println)
uncacheTable("page_views_cached_lazy")
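The same caching can be driven programmatically rather than through SQL statements; a minimal sketch using the cacheTable/uncacheTable methods mentioned in the notes above:

hiveContext.cacheTable("page_views")      // eager: the columnar cache is built right away
sql("SELECT count(*) FROM page_views").collect().foreach(println)
hiveContext.uncacheTable("page_views")    // release the cached columnar buffers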
