A preliminary study of pandas basics and Spark Python

Source: Internet
Author: User
Tags: install, pandas, pyspark

Abstract: pandas is a powerful Python data analysis toolkit. Its two main data structures, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many engineering fields. In Spark, a Python program can be modified easily, with no need for Java or Scala packaging, and if you want to export files you can convert the data to pandas and save it as CSV or Excel.

1. What is Pandas?

pandas is a powerful Python data analysis toolkit: a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data simple and intuitive. It aims to be the fundamental high-level building block for practical, real-world data analysis in Python. More broadly, it aims to become the most powerful and flexible open-source data analysis/manipulation tool available in any language.

2. Pandas Installation

pandas is installed here with the pip package manager (Python version 2.7.13). On Windows, open CMD, change to the Scripts directory under the Python installation path, and execute:

pip install pandas

This installs pandas. After the installation completes, a prompt like the following appears:

The prompt indicates that pandas was installed successfully; NumPy is installed at the same time.

3. Pandas Data Types

Pandas is ideal for many different types of data:

    • Tabular data with heterogeneously typed columns, as in a SQL table or an Excel spreadsheet
    • Ordered and unordered (not necessarily fixed-frequency) time series data
    • Arbitrary matrix data with row and column labels (homogeneously or heterogeneously typed)
    • Any other form of observational/statistical data set. The data does not actually need to be labeled at all to be placed into a pandas data structure
4. Pandas Basics

Here are the basics of pandas, demonstrated in the interactive interpreter. First import the pandas and NumPy packages; NumPy is used here mainly for its NaN value and for generating random numbers:

import pandas as pd
import numpy as np

4.1 Pandas Series

Create a Series by passing a list of values, letting pandas create a default integer index:
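A minimal illustrative sketch (the values are invented for demonstration; the article's original screenshot output is not reproduced here):

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)   # printed with an automatically created integer index 0..5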

4.2 The pandas DataFrame

Create a DataFrame with a datetime index and labeled columns by passing a NumPy array:
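A sketch of what this step could look like; the start date, shape, and column names are assumptions chosen for illustration:

dates = pd.date_range('20170901', periods=6)        # datetime index
df = pd.DataFrame(np.random.randn(6, 4), index=dates,
                  columns=['A', 'B', 'C', 'D'])      # labeled columns
print(df)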

View the head and tail rows of the DataFrame:
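For example (the row counts are arbitrary):

df.head()     # first 5 rows by default
df.tail(3)    # last 3 rows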

Display the index, the columns, and the underlying NumPy data:
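For example:

df.index      # the datetime index
df.columns    # the column labels
df.values     # the underlying NumPy array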

Show a quick statistical summary of the data:
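For example:

df.describe()   # count, mean, std, min, quartiles, and max for each numeric column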

Sort by value:
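For example, using the assumed column 'B' from the sketch above:

df.sort_values(by='B')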

Select a single column, which yields a Series:
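For example:

df['A']    # same as df.A; returns a Series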

Select rows by slicing with []:
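For example, slicing by position or by index label (the dates come from the sketch above):

df[0:3]                      # rows 0, 1, 2 by position
df['20170902':'20170904']    # rows selected by datetime label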

4.2.1 DataFrame: Reading and Writing CSV Files

Save the DataFrame data to a CSV file:
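A sketch; the file name is an assumption, chosen only to match the C-drive location mentioned below:

df.to_csv('C:/test.csv')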

Here the file is saved to the C drive; you can open it to view its contents:

To read data from a CSV file:
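For example, using the same hypothetical file name:

df2 = pd.read_csv('C:/test.csv')
print(df2)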

4.2.2 DataFrame: Reading and Writing Excel Files

To save data to an Excel file:
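A sketch with an assumed file name on the C drive:

df.to_excel('C:/test.xlsx', sheet_name='Sheet1')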

Here the file is saved to the C drive; you can open it to view its contents:

Note: openpyxl is required here; install it the same way as pandas: pip install openpyxl.
Read the data back from the Excel file:
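For example, again with the hypothetical file name:

df3 = pd.read_excel('C:/test.xlsx', 'Sheet1', index_col=None, na_values=['NA'])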

Note: reading Excel requires a separate module, xlrd; install it the same way as pandas: pip install xlrd.
5. Pandas in Spark Python

Here the test reads an existing Parquet data set under the directory /data/parquet/20170901/, matching the files in that directory whose names start with part-r-00000. Two columns are selected from the file contents and saved to a file. The code is as follows:

# coding=utf-8
import sys
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext


class ReadSpark(object):
    def __init__(self, paramdate):
        self.parquetroot = '/data/parquet/%s'  # this is the HDFS path
        self.thedate = paramdate
        self.conf = SparkConf()
        self.conf.set("spark.shuffle.memoryFraction", "0.5")
        self.sc = SparkContext(appName='ReadSparkData', conf=self.conf)
        self.sqlContext = SQLContext(self.sc)

    def getTypeData(self):
        basepath = self.parquetroot % self.thedate
        parqfile = self.sqlContext.read.option("mergeSchema", "true") \
            .option('basePath', basepath) \
            .parquet('%s/part-r-00000*' % (basepath))
        resdata = parqfile.select('appId', 'OS')
        respd = resdata.toPandas()
        respd.to_csv('/data/20170901.csv')  # this is a Linux filesystem path
        print("--------------------Data count: " + str(resdata.count()))


if __name__ == "__main__":
    reload(sys)
    sys.setdefaultencoding('utf-8')
    rs = ReadSpark('20170901')
    rs.getTypeData()

The code is saved as testsparkpython.py and submitted to the cluster; the command used here is as follows (the parameter values depend on the cluster environment):

spark-submit --master yarn --driver-memory 6g --deploy-mode client --executor-memory 9g --executor-cores 3 --num-executors 50 testsparkpython.py

After execution, view the first five lines of the output file with head -5 /data/20170901.csv.

Summary: Python is very handy for writing Spark programs, and the advantages of the pandas package for data processing are obvious. As Python becomes more and more popular, it is worth learning Python in greater depth, as the Zen of Python puts it ...
