Yesterday I spent an afternoon installing Spark and hooking the PySpark shell up to a Jupyter notebook, then followed the book "Learning Spark: Lightning-Fast Big Data Analysis" to get a first taste of Spark's power.
My setup is Windows 7, Spark 1.6, Anaconda 3, and Python 3. The code is as follows:
lines = sc.textFile("D://program files//spark//spark-1.6.0-bin-hadoop2.6//README.md")
print("Number of lines of text", lines.count())

from pyspark import SparkContext

logFile = "D://program files//spark//spark-1.6.0-bin-hadoop2.6//README.md"  # should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
pythonLines = lines.filter(lambda line: "Python" in line)
print("lines with a: %i, lines with b: %i" % (numAs, numBs))
Running this produces the following ValueError:
Number of lines of text
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-70ecab39b7ea> in <module>()
      5
      6 logFile = "D://program files//spark//spark-1.6.0-bin-hadoop2.6//README.md"  # should be some file on your system
----> 7 sc = SparkContext("local", "Simple App")
      8 logData = sc.textFile(logFile).cache()
      9

D:\spark\spark-1.6.0-bin-hadoop2.6\python\pyspark\context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    111         self._callsite = first_spark_call() or CallSite(None, None, None)
--> 112         SparkContext._ensure_initialized(self, gateway=gateway)
    113         try:
    114             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

D:\spark\spark-1.6.0-bin-hadoop2.6\python\pyspark\context.py in _ensure_initialized(cls, instance, gateway)
    259                         " created by %s at %s:%s"
    260                         % (currentAppName, currentMaster,
--> 261                            callsite.function, callsite.file, callsite.linenum))
    262                 else:
    263                     SparkContext._active_spark_context = instance

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by <module> at D:\Program Files\anaconda3\lib\site-packages\IPython\utils\py3compat.py:186
I Googled the problem and finally found the answer on Stack Overflow. The message ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by <module> at D:\Program Files\anaconda3\lib\site-packages\IPython\utils\py3compat.py:186 means that only one SparkContext (sc) may exist at a time: the PySpark shell has already created one on startup, so creating a new one raises an error. The fix is to shut down the existing context before creating a new one. How do we shut it down? With the sc.stop() method.
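To see why the second SparkContext fails, it helps to know that pyspark enforces the one-context rule with a class-level guard (the _ensure_initialized check visible in the traceback). Below is a minimal pure-Python sketch of that pattern; the class and attribute names are hypothetical, not pyspark's actual internals, and no Spark installation is needed to run it:

```python
class Context:
    """Sketch of a one-instance-at-a-time guard, similar in spirit
    to pyspark's SparkContext._ensure_initialized check."""
    _active = None  # class-level slot holding the single live instance

    def __init__(self, app_name):
        if Context._active is not None:
            # Refuse to create a second context while one is alive.
            raise ValueError(
                "Cannot run multiple contexts at once; existing context "
                "(app=%s)" % Context._active.app_name)
        self.app_name = app_name
        Context._active = self

    def stop(self):
        # Release the slot so a new context may be created.
        Context._active = None


first = Context("PySparkShell")
try:
    Context("Simple App")        # second context -> ValueError
except ValueError as e:
    print("error:", e)
first.stop()                     # stop the existing context...
second = Context("Simple App")   # ...and now creation succeeds
print("new context:", second.app_name)
```

This is exactly the situation in the notebook: the shell's sc occupies the slot, and sc.stop() frees it.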
Let's change the code and run it to see the result:
lines = sc.textFile("D://program files//spark//spark-1.6.0-bin-hadoop2.6//README.md")
print("Number of lines of text", lines.count())
sc.stop()  # stop the existing SparkContext

from pyspark import SparkContext

logFile = "D://program files//spark//spark-1.6.0-bin-hadoop2.6//README.md"  # should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
pythonLines = lines.filter(lambda line: "Python" in line)
print("lines with a: %i, lines with b: %i" % (numAs, numBs))
The results are as follows:
Number of lines of text
lines with a: 58, lines with b: 26
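As a sanity check, the same a/b counting logic can be run in plain Python on a small inline sample (no Spark needed); this makes clear what the filter/count pipeline actually computes:

```python
# A few sample lines standing in for README.md contents.
text = """Apache Spark is a fast and general engine
for large-scale data processing.
It combines SQL, streaming, and complex analytics.
Run programs up to 100x faster."""

lines = text.splitlines()
# Count lines containing the letter 'a' and the letter 'b',
# mirroring logData.filter(lambda s: 'a' in s).count() etc.
num_as = sum(1 for s in lines if 'a' in s)
num_bs = sum(1 for s in lines if 'b' in s)
print("lines with a: %i, lines with b: %i" % (num_as, num_bs))
# -> lines with a: 4, lines with b: 1
```

The Spark version does the same thing, except the filter and count are distributed across the cluster and evaluated lazily.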
The problem turned out to be easy to fix. I'm noting it down here in case it comes up again.