A preliminary study of pandas basics and Spark Python

Source: Internet
Author: User
Tags: install, pandas, pyspark

Abstract: Pandas is a powerful Python data analysis toolkit. Its two main data structures, Series (one-dimensional) and DataFrame (two-dimensional), handle most of the typical use cases in finance, statistics, social science, and many engineering fields. In Spark, a Python program is easy to modify, with no need to package Java or Scala code, and if you want to export files you can convert the data to pandas and save it as CSV or Excel.

1. What is Pandas?

Pandas is a powerful Python data analysis toolkit: a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data simple and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. More broadly, it aims to become the most powerful and flexible open-source data analysis/manipulation tool available in any language.

2. Pandas Installation

Pandas is installed here with the pip package manager (Python version 2.7.13). On Windows, open cmd, change to the Scripts directory under the Python installation path, and execute:

pip install pandas

This installs pandas. After the installation completes, pip reports that pandas was installed successfully; NumPy is installed at the same time as a dependency.
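As a quick check that the installation worked (a minimal sketch; the version numbers printed depend on what pip actually installed), you can import both packages and print their versions:

    import pandas as pd
    import numpy as np

    print(pd.__version__)   # pandas version installed by pip
    print(np.__version__)   # NumPy version pulled in as a dependency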

3. Pandas Data Types

Pandas is ideal for many different types of data:

    • Tabular data with heterogeneously typed columns, as in an SQL table or Excel spreadsheet
    • Ordered and unordered (not necessarily fixed-frequency) time series data
    • Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels
    • Any other form of observational/statistical data set; the data does not actually need to be labeled at all to be placed into a pandas data structure

4. Pandas Basics

The following is a simple walkthrough of the pandas basics in interactive (command-line) mode. First import the pandas and NumPy packages; NumPy is mainly used here for its NaN value and for generating random numbers:

import pandas as pd
import numpy as np

4.1 Pandas Series

Create a Series by passing a list of values, letting pandas create a default integer index:
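For example, something like the following (the values, including the NaN, are only illustrative):

    s = pd.Series([1, 3, 5, np.nan, 6, 8])
    print(s)   # float64 dtype, default integer index 0..5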

4.2 The pandas DataFrame

Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
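For example (the start date and the column labels A-D are illustrative choices):

    dates = pd.date_range('20170901', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)   # 6 rows of random values indexed by date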

View the head and tail rows of the DataFrame:
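With the df created above, for example (the row counts passed to head/tail are illustrative):

    df.head()    # first 5 rows by default
    df.tail(3)   # last 3 rows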

Display the index, columns, and underlying NumPy data:
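For example:

    df.index     # the DatetimeIndex
    df.columns   # the column labels
    df.values    # the underlying NumPy array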

Show a quick statistical summary of the data:
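For example:

    df.describe()   # count, mean, std, min, quartiles and max for each numeric column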

Sort by value:
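For example, sorting by one column (column 'B' is an illustrative choice):

    df.sort_values(by='B')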

Select a single column, which yields a Series:
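For example:

    df['A']      # equivalent to df.A; returns a Series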

Select rows by slicing with []:
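For example, with the date-indexed df from above (the label slice bounds are illustrative):

    df[0:3]                      # rows by position
    df['20170902':'20170904']    # rows by index label (the dates used above)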

4.2.1 DataFrame: reading and writing CSV files

Save the DataFrame to a CSV file:
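For example (the file path is illustrative; the article saves the file to the C drive):

    df.to_csv('C:/pandas_test.csv')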

Here it is saved to the C drive; you can then open the file to view its contents.

To read data from a CSV file:
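For example, reading the file written above back in; index_col=0 restores the saved index column as the index:

    pd.read_csv('C:/pandas_test.csv', index_col=0)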

4.2.2 DataFrame: reading and writing Excel files

To save data to an Excel file:
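For example (the file path and sheet name are illustrative):

    df.to_excel('C:/pandas_test.xlsx', sheet_name='Sheet1')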

Here it is saved to the C drive; you can then open the file to view its contents.

Note: openpyxl is required here; install it the same way as pandas, with pip install openpyxl.

Read from an Excel file:
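For example, reading the file written above back in:

    pd.read_excel('C:/pandas_test.xlsx', 'Sheet1', index_col=0, na_values=['NA'])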

Note: reading Excel files requires a separate module, xlrd; install it the same way as pandas, with pip install xlrd.

5. Pandas in Spark Python

As a test, the program below reads an existing Parquet data set from the directory /data/parquet/20170901/, matching only the part files whose names start with part-r-00000. It selects two columns from the data, converts the result to pandas, and saves it to a file. The code is as follows:

#coding=utf-8
import sys
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext


class ReadSpark(object):
    def __init__(self, paramdate):
        self.parquetroot = '/data/parquet/%s'  # here is the HDFS path
        self.thedate = paramdate
        self.conf = SparkConf()
        self.conf.set("spark.shuffle.memoryFraction", "0.5")
        self.sc = SparkContext(appName='ReadSparkData', conf=self.conf)
        self.sqlContext = SQLContext(self.sc)

    def gettypedata(self):
        basepath = self.parquetroot % self.thedate
        # read only the part files whose names start with part-r-00000
        parqfile = self.sqlContext.read.option("mergeSchema", "true") \
            .option('basePath', basepath) \
            .parquet('%s/part-r-00000*' % basepath)
        resdata = parqfile.select('appId', 'os')
        respd = resdata.toPandas()
        respd.to_csv('/data/20170901.csv')  # this is a local Linux directory
        print("--------------------Data count: " + str(resdata.count()))


if __name__ == "__main__":
    reload(sys)
    sys.setdefaultencoding('utf-8')
    rs = ReadSpark('20170901')
    rs.gettypedata()

The code is saved as testsparkpython.py and submitted to the cluster; the command used here is (the parameter values depend on the cluster environment):

spark-submit --master yarn --driver-memory 6g --deploy-mode client --executor-memory 9g --executor-cores 3 --num-executors 50 testsparkpython.py

After execution, view the first five lines of the output file with head -5 /data/20170901.csv:

Summary: writing Spark programs in Python is very convenient, and the advantages of the pandas package for data processing are obvious. As Python becomes more and more popular, it is worth studying in depth, as the Zen of Python says ...
