From Pandas to Apache Spark's DataFrame (August), by Olivier Girardot
This was a cross-post from the blog of Olivier Girardot. Olivier is a software engineer and the co-founder of Lateral Thoughts, where he works on machine learning, Big Data, and DevOps solutions.
With the introduction of window operations in Spark 1.4, you can finally port pretty much any relevant piece of Pandas DataFrame computation to Apache Spark.
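To make that concrete, here is a minimal sketch (not from the original post) of a window operation in PySpark; the column names and the use of the Spark 2.x SparkSession entry point are my own assumptions for brevity:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-demo").getOrCreate()  # assumes Spark 2.x; on 1.4 a HiveContext was needed for window functions
df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2), ("b", 5)], ["key", "value"])
w = Window.partitionBy("key").orderBy("value")
# Roughly what a Pandas groupby + rank / cumulative sum would give you
df.withColumn("rank", F.row_number().over(w)) \
  .withColumn("running_sum", F.sum("value").over(w)) \
  .show()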
Resilient Distributed Dataset (RDD). Spark is built around the concept of the RDD, a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create an RDD: parallelizing a collection that already exists in your driver program, or referencing a dataset in an external storage system. One of the most important features of the RDD is distributed storage, whose greatest benefit is that data can be stored and processed in parallel across the nodes of a cluster.
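As a quick illustration (mine, not the author's) of the two creation paths just described, assuming an existing SparkContext named sc and a purely hypothetical file path:

# 1) Parallelize a collection that already exists in the driver
rdd_from_collection = sc.parallelize([1, 2, 3, 4, 5])
# 2) Reference a dataset in an external storage system (the path is a placeholder)
rdd_from_storage = sc.textFile("hdfs:///tmp/input.txt")
print(rdd_from_collection.map(lambda x: x * 2).collect())  # the map runs in parallel across partitions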
A minimalist development environment built under Windows. Rather than contributing code to the Apache Spark open source project, the Spark development environment here refers to developing big data projects based on Spark. Spark offers two interactive shells: pyspark (based on Python) and spark-shell (based on Scala). These two environments are in fact separate and not interdependent, so if you're just using one of them, the other is not required.
Execute: pyspark
If the shell starts, the installation is complete and you can enter Python code here to run operations.
Using Pyspark in Python
Of course, we won't be doing real development inside this interpreter, so what we're going to do next is let Python load the Spark library.
So we need to add pyspark to Python's search path, as sketched below.
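One common way to do this is the following sketch; the SPARK_HOME default and the py4j zip name are assumptions based on the Spark 1.6.1 install path used later in this article, so adjust them to your own installation:

import os
import sys

# Assumed install location; change to wherever your Spark lives
spark_home = os.environ.get("SPARK_HOME", "/usr/lib/spark/spark-1.6.1-bin-hadoop2.6")
sys.path.append(os.path.join(spark_home, "python"))
# The py4j zip name varies by Spark release; check your $SPARK_HOME/python/lib directory
sys.path.append(os.path.join(spark_home, "python", "lib", "py4j-0.9-src.zip"))

from pyspark import SparkContext  # should now import without error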
The previous section describes the most basic entity classes. This section describes the HQL statement constructor, including query and update.
Advantages: HQL statements can be constructed faster in an object-oriented manner, with no manual HQL string concatenation required.
Disadvantages: Encapsulation may reduce performance and only supports common and simple HQL structures.
Some functions are not complete and need to be developed.
1. HQL statement Constructor
package cn.fansunion.hibernate.
Introduction: Spark was developed at the AMPLab. It is essentially an in-memory framework for fast iterative computation, and since iteration is the most important characteristic of machine learning, Spark is well suited to machine learning.
Thanks to its strong performance in data science, the Python language has fans all over the world, and it now meets Spark, a powerful distributed in-memory computing framework. When two strong fields come together, they naturally strike an even more powerful spark.
-bin-hadoop2.6.tgz -C /usr/lib/spark
Configure in /etc/profile:
export SPARK_HOME=/usr/lib/spark/spark-1.6.1-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:$PATH
source /etc/profile. After that, run pyspark; if it starts, the installation is complete and you can enter Python code here to run operations.
(_.contains("Spark")).count
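For comparison, here is the same check done from pyspark rather than spark-shell, as a sketch; the README.md path assumes you launched the shell from the Spark root directory:

lines = sc.textFile("README.md")
print(lines.filter(lambda line: "Spark" in line).count())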
If you find the log output too verbose, you can create conf/log4j.properties from the template file:
$ mv conf/log4j.properties.template conf/log4j.properties
Then modify the log output level to warn:
log4j.rootCategory=WARN, console
If you leave the log4j level at INFO, you will see a log line such as "INFO SparkUI: Started SparkUI at http://10.9.4.165:4040", which means that Spark has started a web server, and you can open http://10.9.4.165:4040 in a browser to view it.
# -*- coding: utf-8 -*-
# Chapter 9 of "Python for Data Analysis"
# Data aggregation and grouping operations
import pandas as pd
import numpy as np
import time

# The group operation process: split-apply-combine
start = time.time()
np.random.seed(10)

# 1. GroupBy technology
# 1.1 Introductory example
df = pd.DataFrame({'key1': ['a', 'b', 'a', 'b', 'a'],
                   'key2': ['one', 'one', 'one', 'one', 'one'],
                   'data1': np.random.randint(1, 10
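Since the snippet above is cut off, here is a minimal, runnable version of the same split-apply-combine idea, my own sketch with the random data replaced by small literal values:

import pandas as pd

df = pd.DataFrame({'key1': ['a', 'b', 'a', 'b', 'a'],
                   'data1': [1, 4, 2, 8, 5]})
grouped = df['data1'].groupby(df['key1'])  # split by key1
print(grouped.mean())                      # apply the mean and combine the results per key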
Add the related library to the system PATH variable: D:\hadoop-2.6.0\bin; create a new HADOOP_HOME variable with the value D:\hadoop-2.6.0. Go to GitHub and download a component called winutils; the address is https://github.com/srccodes/hadoop-common-2.2.0-bin. If it does not have your version of Hadoop (here the version is 2.6), download it from CSDN at http://download.csdn.net/detail/luoyepiaoxin/8860033. My practice is to copy all the files in this CSDN package into the HADOOP_HOME bin directory.
Build an Ubuntu machine on VirtualBox, install Anaconda, Java 8, Spark, and IPython Notebook, and run a WordCount example program as the "Hello World".
Build a Spark Environment
In this section we learn to build a Spark environment:
Create an isolated development environment on an Ubuntu 14.04 virtual machine without affecting any existing systems
Install Spark 1.3.0 and its dependencies.
Install the Anaconda Python 2.7 environment, which contains required libraries such as pandas and scikit-learn.
ASPxGridView prerequisites for implementing data grouping: ASPxGridViewBehaviorSettings.AllowGroup = true must be set.
First, data grouping from the server side
1. Using the GroupBy method for data grouping
Syntax 1: int GroupBy(GridViewColumn column);
Syntax 2: int GroupBy(GridViewColumn column, int value);
Where the parameter value represents the hierarchy level of the grouping.
1. Preface
After a full day of struggling, I was deeply frustrated: configuring PySpark in a virtual environment kept throwing errors. Because I really did not want to uninstall my Python 3.6, I pushed on for a whole day and finally found a configuration method that works. Enough complaining, let's start.
2. Required Environment
Anaconda3 (mine is the newest version, Anaconda 4.3.1 (64-bit)).
3. Set up the virtual environment
1. Create a Python virtual environment.
steps, then open a new CMD window again; if everything is normal, you should be able to run Spark by directly entering spark-shell. The normal startup output should look like the following. As you can see, when you enter spark-shell directly, Spark starts and outputs some log information, most of which can be ignored, but two lines are worth noting:
Spark context available as sc.
SQL context available as sqlContext.
As for what the difference between the Spark context and the SQL context is, we will follow up on that later; for now, just note that both are created for you, as in the quick sketch below.
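A quick illustration (my own, assuming the Spark 1.x shell shown above) of how the two objects work together:

rdd = sc.parallelize([("Alice", 1), ("Bob", 2)])      # sc builds RDDs
df = sqlContext.createDataFrame(rdd, ["name", "id"])  # sqlContext builds DataFrames on top of them
df.show()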
Environment:
Spark 2.0.0, Anaconda2
1. Spark IPython and Notebook installation and configuration
Method One: with this method you access the IPython notebook through a web page; alternatively, opening a terminal gets you pyspark. If Anaconda is installed, you can get the IPython interface directly in the following way; if you have not installed Anaconda, refer to the link at the bottom to install the IPython-related packages yourself.
vi ~/.bashrc
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="
# and b.C# = '01'
left join SC c on a.S# = c.S# and c.C# = '02'
where b.score > isnull(c.score, 0)
-- 2. Query the information and scores of students whose "01" course score is lower than their "02" course score.
-- 2.1 Check whether the "01" course and the "02" course both exist.
select a.*, b.score [score of course '01'], c.score [score of course '02']
from Student a, SC b, SC c
where a.S# = b.S# and a.S# = c.S# and b.C# = '01' and c.C# = '02' and b.score
-- 2.2 Check whether the "01" course
Core use of the Druid query interface
The Druid query interface is an HTTP REST style query method: you query the data of a node (broker, historical, or realtime) over HTTP REST, with the query parameters in JSON format, and every node type exposes the same REST query interface.
curl -X POST '
queryable_host: the broker node IP; port: the broker node port, 8082 by default.
curl -L -H 'Content-Type: application/json' -X POST --data-binary @quickstart/aa.json http://10.20.23.41:8082/druid/v2/?pretty
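The same call can be made from Python with the requests library; this is only a sketch, and the timeseries query body (datasource name, interval, aggregation) is a generic example rather than the contents of quickstart/aa.json:

import json
import requests

query = {
    "queryType": "timeseries",
    "dataSource": "wikiticker",  # hypothetical datasource name
    "granularity": "day",
    "aggregations": [{"type": "count", "name": "rows"}],
    "intervals": ["2015-09-12/2015-09-13"],
}
resp = requests.post("http://10.20.23.41:8082/druid/v2/?pretty",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(query))
print(resp.json())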
The types of queries
# PYTHONPATH: add Spark's pySpark module to the Python environment
export PYTHONPATH=/opt/spark-hadoop/python
Restart the computer to make /etc/profile take effect permanently; to make it take effect temporarily in the current window, open a command window and execute source /etc/profile.
Test the installation result
Open a command window and switch to the Spark root directory.
Run ./bin/spark-shell to open the console.
=${SCALA_HOME}/bin:$PATH
# Set the Spark environment variable
export SPARK_HOME=/opt/spark-hadoop/