This blog post walks through a program that reads a file from Hadoop HDFS, parses the data into a fixed format with regular expressions, and then loads it into Spark SQL.
If you are not familiar with regular expressions, see the Regular Expressions 30-Minute Getting Started Tutorial.
The contents of the file are:
CREATE TABLE IF NOT EXISTS `rs_user` (
  `id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
  `uid` mediumint(8) unsigned DEFAULT NULL,
  `url` varchar(255) DEFAULT NULL,
  `title` varchar(1024) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=GBK AUTO_INCREMENT=59573;

INSERT INTO `rs_user` (`id`, `uid`, `url`, `title`) VALUES
(1, 269781, 'http://rs.xidian.edu.cn/forum.php?mod=viewthread&tid=721360', '[sports][other][2002 Asian Games badminton men's singles final, Tawfik vs Li Yu, Part 1][RMVB][Mandarin]'),
(2, 256188, 'http://rs.xidian.edu.cn/forum.php?mod=viewthread&tid=721360', '[sports][other][2002 Asian Games badminton men's singles final, Tawfik vs Li Yu, Part 1][RMVB][Mandarin]'),
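Before wiring the regex into Spark, it is worth sanity-checking it against a single line of the dump. Below is a minimal standalone sketch of that check; the object name is just for illustration, and the sample line is abbreviated from the data above:

object RegexCheck {
  def main(args: Array[String]): Unit = {
    // Same pattern as in the Spark program below: id, uid, 'url', 'title'
    val r = """\d*, \d*, 'http://[a-z/.?&=0-9]*', '[^']+'""".r
    val line = "(1, 269781, 'http://rs.xidian.edu.cn/forum.php?mod=viewthread&tid=721360', '[sports][other][RMVB][Mandarin]'),"
    // Prints the matched record without the surrounding parentheses
    r.findAllIn(line).foreach(println)
  }
}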
package com.spark.firstApp

import org.apache.spark._
import org.apache.spark.SparkContext
import org.apache.log4j.{Level, Logger}

object HelloSpark {
  case class Person(id: Int, uid: String, url: String, title: String)

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF) // silence noisy logs
    val conf = new SparkConf().setAppName("HelloSpark")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // Match one record of the dump: id, uid, 'url', 'title'
    val r = """\d*, \d*, 'http://[a-z/.?&=0-9]*', '[^']+'""".r
    // Keep only lines that contain a record, then split the match into fields
    val data = sc.textFile("/user/root/home/rs_user.sql").map(s => s.mkString).
      map(z => r.findAllIn(z).toList).filter(_.length > 0).map(_.head.split(",").toList)
    val people = data.map(p => Person(p(0).toInt, p(1), p(2), p(3))).toDF()
    people.registerTempTable("people")
    val teen = sqlContext.sql("SELECT title FROM people WHERE uid = '199988'")
    teen.map(t => "title: " + t).collect().foreach(println)
    sc.stop()
  }
}
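One thing to note about the parse: split(",") will mangle any title that itself contains a comma, because the extra commas create extra fields. A more defensive variant is to let the regex capture the four fields directly. The sketch below is my own suggestion rather than part of the original program; it assumes the same sc, Person case class, and sqlContext.implicits._ import as above:

// Capture groups pull out id, uid, url, and title in one pass,
// so commas inside the title cannot break the parse.
val rec = """(\d+), (\d+), '([^']*)', '([^']*)'""".r
val parsed = sc.textFile("/user/root/home/rs_user.sql").flatMap { line =>
  rec.findFirstMatchIn(line).map(m =>
    Person(m.group(1).toInt, m.group(2), m.group(3), m.group(4)))
}
val peopleDF = parsed.toDF()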
Submit the job:
[email protected]:/# spark-submit --master spark://192.168.0.10:7077 --class com.spark.firstApp.HelloSpark --executor-memory 100m /root/ideaprojects/firstsparkapp/out/artifacts/firstsparkappjar/firstsparkappjar.jar
Output:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/04/15 21:53:56 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/04/15 21:53:56 INFO Remoting: Starting remoting
15/04/15 21:53:57 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:52584]
15/04/15 21:53:57 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/04/15 21:53:57 INFO server.AbstractConnector: Started [email protected]:54183
15/04/15 21:54:03 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/04/15 21:54:03 INFO server.AbstractConnector: Started [email protected]:4040
15/04/15 21:54:12 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
15/04/15 21:54:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/15 21:54:21 WARN snappy.LoadSnappy: Snappy native library not loaded
15/04/15 21:54:21 INFO mapred.FileInputFormat: Total input paths to process : 1
title: ['[Other][Video][LOL][Smile Curl January 13 double-row three-game set][Smile Curl commentary][MP4]
title: ['[Other][Video][LOL][smz24 commentary: S5 Blind Monk Li Qing's full Gank tour _HD][smz24 commentary][MP4]
Sun Qiqung accompanies you in learning Spark: regular expressions and Spark SQL.