Sun Qiqung Accompanies You to Learn Spark: Regular Expressions and Spark SQL


The program in this post reads a file from Hadoop HDFS, uses a regular expression to parse the data into a specified format, and then loads it into Spark SQL.

If you are not familiar with regular expressions, see the tutorial "Regular Expressions: a 30-Minute Getting Started Guide".
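Before diving in, here is a minimal sketch (not from the original post) of how regular expressions are used in Scala: calling .r on a string compiles it into a scala.util.matching.Regex, and findAllIn returns every match in the input.

```scala
object RegexDemo {
  def main(args: Array[String]): Unit = {
    // .r turns the string into a scala.util.matching.Regex
    val pattern = """\d+""".r
    // findAllIn yields an iterator over every match in the input
    val matches = pattern.findAllIn("id=42, uid=7").toList
    println(matches) // List(42, 7)
  }
}
```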

The contents of the file are:

CREATE TABLE IF NOT EXISTS `rs_user` (
  `id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
  `uid` mediumint(8) unsigned DEFAULT NULL,
  `url` varchar(255) DEFAULT NULL,
  `title` varchar(1024) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=GBK AUTO_INCREMENT=59573;

INSERT INTO `rs_user` (`id`, `uid`, `url`, `title`) VALUES
(1, 269781, 'http://rs.xidian.edu.cn/forum.php?mod=viewthread&tid=721360', '[sports][other][2002 Asian Games badminton men''s singles final Tawfik vs Li Yu][RMVB][Mandarin]'),
(2, 256188, 'http://rs.xidian.edu.cn/forum.php?mod=viewthread&tid=721360', '[sports][other][2002 Asian Games badminton men''s singles final Tawfik vs Li Yu][RMVB][Mandarin]'),
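To see how the pattern used in the program below picks a row out of this dump, here is a sketch that applies it to one such line (the regex mirrors the one in the program; the sample line is abbreviated for readability):

```scala
object ExtractDemo {
  def main(args: Array[String]): Unit = {
    // Same shape as the pattern in the program: id, uid, quoted URL, quoted title
    val r = """\d+, \d+, 'http://[a-z/.?&=0-9]*', '[^']+'""".r
    val line = "(1, 269781, 'http://rs.xidian.edu.cn/forum.php?mod=viewthread&tid=721360', '[sports][RMVB]'),"
    // findAllIn strips the surrounding parentheses and trailing comma
    val matched = r.findAllIn(line).toList
    matched.foreach(println)
  }
}
```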

package com.spark.firstApp

import org.apache.spark.SparkContext
import org.apache.spark._
import org.apache.log4j.{Level, Logger}

object HelloSpark {
  case class Person(id: Int, uid: String, url: String, title: String)

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF) // suppress verbose log output
    val conf = new SparkConf().setAppName("HelloSpark")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    // Match one row of the dump: id, uid, quoted URL, quoted title
    val r = """\d*, \d*, 'http://[a-z/.?&=0-9]*', '[^']+'""".r
    val data = sc.textFile("/user/root/home/rs_user.sql").map(s => s.mkString).
      map(z => r.findAllIn(z).toList).filter(_.length > 0).map(_.head.split(",").toList)
    val people = data.map(p => Person(p(0).toInt, p(1), p(2), p(3))).toDF()
    people.registerTempTable("people")
    val teen = sqlContext.sql("SELECT title FROM people WHERE uid = '199988'")
    teen.map(t => "title: " + t).collect().foreach(println)
    sc.stop()
  }
}
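One caveat with the approach above: splitting the matched row on "," breaks if a title itself contains a comma. A hedged alternative (not in the original post) is to use capture groups, so each field is extracted directly from the regex match:

```scala
import scala.util.matching.Regex

object CaptureDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical variant: one capture group per field instead of split(",")
    val row: Regex = """(\d+), (\d+), '([^']*)', '([^']*)'""".r
    "1, 269781, 'http://example.com/x', 'a, comma-containing title'" match {
      // Regex pattern matching binds each group to a name
      case row(id, uid, url, title) => println(s"$id | $uid | $url | $title")
      case _                        => println("no match")
    }
  }
}
```

Because the title is captured as a single group, the embedded comma survives intact, which the split-based version would mangle.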

Submit the task:

[email protected]:/# spark-submit --master spark://192.168.0.10:7077 --class com.spark.firstApp.HelloSpark --executor-memory 100m /root/ideaprojects/firstsparkapp/out/artifacts/firstsparkappjar/firstsparkappjar.jar

Output Result:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/04/15 21:53:56 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/04/15 21:53:56 INFO Remoting: Starting remoting
15/04/15 21:53:57 INFO Remoting: Remoting started; listening on addresses: [akka.tcp://[email protected]:52584]
15/04/15 21:53:57 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/04/15 21:53:57 INFO server.AbstractConnector: Started [email protected]:54183
15/04/15 21:54:03 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/04/15 21:54:03 INFO server.AbstractConnector: Started [email protected]:4040
15/04/15 21:54:12 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
15/04/15 21:54:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/15 21:54:21 WARN snappy.LoadSnappy: Snappy native library not loaded
15/04/15 21:54:21 INFO mapred.FileInputFormat: Total input paths to process : 1
title: ['[Other][Video][LOL][Smile Curl January 13 double-queue set][Smile Curl commentary][MP4]
title: ['[Other][Video][LOL][smz24 commentary: S5 blind monk Li Qing's full gank tour _ HD][smz24 commentary][MP4]


