Abstract: Pandas is a powerful Python data analysis toolkit. Its two main data structures, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. In Spark, a Python program can be modified easily, with no need for Java or Scala packaging, and if you want to export files you can convert the data to pandas and save it as CSV or Excel.
1.What is Pandas?
Pandas is a powerful Python data analysis toolkit: a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data simple and intuitive. It is designed to be the fundamental high-level building block for practical, real-world data analysis in Python, and its broader goal is to become the most powerful and flexible open-source data analysis/manipulation tool available in any language.
2.Pandas Installation
Pandas is installed with the pip package manager (Python 2.7.13 is used here). On Windows, open CMD, change to the Scripts directory under the Python installation path, and run:
pip install pandas
Once the installation completes, a "Successfully installed pandas" message is shown. NumPy is installed at the same time as a dependency.
3.Pandas Data types
Pandas is ideal for many different types of data:
- Tabular data with heterogeneously typed columns, as in a SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels
- Any other form of observational/statistical data set; the data does not actually need to be labeled at all to be placed in a pandas data structure
4.Pandas Foundation
A simple way to learn the basics of pandas is interactively at the Python prompt. First import the pandas and NumPy packages; NumPy is used here mainly for its NaN value and for generating random numbers:
import pandas as pd
import numpy as np
4.1 Pandas Series
Create a Series by passing a list of values, letting pandas create a default integer index:
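For instance, a minimal illustrative snippet (the values are arbitrary, and the pd and np imports above are assumed):
s = pd.Series([1, 3, 5, np.nan, 6, 8])  # NaN marks a missing value
s  # shows the values with the default integer index 0..5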
4.2 Pandas DataFrame
Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
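For example, a sketch along these lines (the dates, shape, and column names are illustrative):
dates = pd.date_range('20170901', periods=6)  # 6-day datetime index
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))  # random values, labeled columns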
View the head and tail rows of the DataFrame:
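For example, with the illustrative df above:
df.head()   # first 5 rows by default
df.tail(3)  # last 3 rows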
Display the index, columns, and underlying NumPy data:
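For example:
df.index    # the datetime index
df.columns  # the column labels
df.values   # the underlying NumPy array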
Show a quick statistical summary of the data:
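For example:
df.describe()  # count, mean, std, min, quartiles, and max for each column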
Sort by value:
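For example, sorting by the illustrative column B:
df.sort_values(by='B')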
Selecting a single column yields a Series:
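For example:
df['A']  # equivalent to df.A; returns a Series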
Select rows by slicing with []:
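For example, slicing by position or by index label (the labels follow the illustrative dates above):
df[0:3]                    # first three rows
df['20170902':'20170904']  # rows between these index labels, inclusive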
4.2.1 DataFrame read/write CSV files
Save the DataFrame data to a CSV file:
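For example (the file name data.csv is illustrative):
df.to_csv('c:/data.csv')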
The file is saved to the C drive here; opening it shows the saved contents.
To read data from a CSV file:
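For example:
pd.read_csv('c:/data.csv')  # returns a new DataFrame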
4.2.2 DataFrame read/write Excel files
To save data to an Excel file:
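For example (the file name and sheet name are illustrative):
df.to_excel('c:/data.xlsx', sheet_name='Sheet1')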
The file is saved to the C drive here as well; opening it shows the saved contents.
Note: openpyxl is required for writing Excel files; install it the same way as pandas, with pip install openpyxl.
Read from Excel file:
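For example:
pd.read_excel('c:/data.xlsx', 'Sheet1', index_col=None, na_values=['NA'])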
Note: reading Excel files requires a separate module, xlrd; install it the same way as pandas, with pip install xlrd.
5.Pandas in Spark Python
This test reads an existing parquet data set. The directory is /data/parquet/20170901/, and the files under it whose names start with part-r-00000 are read. Two columns are selected from the file contents and saved to a CSV file. The code is as follows:
# coding=utf-8
import sys
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext


class ReadSpark(object):
    def __init__(self, paramdate):
        self.parquetroot = '/data/parquet/%s'  # HDFS path
        self.thedate = paramdate
        self.conf = SparkConf()
        self.conf.set("spark.shuffle.memoryFraction", "0.5")
        self.sc = SparkContext(appName='ReadSparkData', conf=self.conf)
        self.sqlContext = SQLContext(self.sc)

    def getTypeData(self):
        basepath = self.parquetroot % self.thedate
        parqfile = self.sqlContext.read.option("mergeSchema", "true") \
            .option('basePath', basepath) \
            .parquet('%s/part-r-00000*' % basepath)
        resdata = parqfile.select('appId', 'OS')
        respd = resdata.toPandas()  # convert the Spark DataFrame to a pandas DataFrame
        respd.to_csv('/data/20170901.csv')  # local Linux filesystem path
        print("--------------------Data count: " + str(resdata.count()))


if __name__ == "__main__":
    reload(sys)  # Python 2 only
    sys.setdefaultencoding('utf-8')
    rs = ReadSpark('20170901')
    rs.getTypeData()
The code is saved as testsparkpython.py and submitted on the cluster; the command used is shown below (the parameter values depend on the cluster environment):
spark-submit --master yarn --deploy-mode client --driver-memory 6g --executor-memory 9g --executor-cores 3 --num-executors 50 testsparkpython.py
After execution, the first five lines of the output file can be viewed with head -5 /data/20170901.csv.
Summary: Python is very convenient for writing Spark programs, and the advantages of the pandas package for data processing are obvious. As Python grows more and more popular, it is well worth studying in depth, as the Zen of Python says ...