# -*- coding: utf-8 -*-
# Python for Data Analysis, Chapter 9
# Data aggregation and group operations
import pandas as pd
import numpy as np
import time

# Group operations follow the split-apply-combine process
start = time.time()
np.random.seed(10)

# 1. GroupBy technology
# 1.1 Introduction
# (the source truncates mid-dict; the remaining values are assumed
#  to follow the book's example)
df = pd.DataFrame({'key1': ['a', 'b', 'a', 'b', 'a'],
                   'key2': ['one', 'one', 'one', 'one', 'two'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
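A minimal sketch of split-apply-combine on that frame (using the columns assumed above):

grouped = df['data1'].groupby(df['key1'])   # split data1 into groups by key1
print(grouped.mean())                       # apply mean per group, then combine
print(df.groupby(['key1', 'key2']).size())  # grouping by two keys at once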
Having used pandas for some basic operations, we now take a closer look at manipulating the data itself.
Data cleansing has always been a very important part of data analysis.
Data merge
In pandas, you can merge datasets with the merge function.
import numpy as np
import pandas as pd

data1 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'number': [1, 3, 5, 7]})
# (the source truncates here; a second frame sharing the 'level' key is assumed)
data2 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'number2': [2, 3, 6, 10]})
print(pd.merge(data1, data2))   # inner join on the shared 'level' column
A related operation patches one object by filling its missing values with values from another object.
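A brief sketch, continuing from the imports above and assuming the patching operation meant here is combine_first:

a = pd.Series([1, np.nan, 3])
b = pd.Series([10, 20, 30])
print(a.combine_first(b))   # missing value in a is filled from b -> [1.0, 20.0, 3.0]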
2. Database-style DataFrame merging
A merge or join operation combines datasets by linking rows on one or more keys. These operations are at the core of relational databases. The pandas merge function is the main entry point for applying these algorithms to your data.

In [4]: import pandas as pd
In [5]: import numpy as np
In [6]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
   ...:                     'data1': range(7)})  # (completed from the book's example)
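The example then merges the two frames; a sketch of how it typically continues (df2's contents are an assumption based on the book):

In [7]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
In [8]: pd.merge(df1, df2)   # joins on the overlapping 'key' column by default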
Summary: 1. Create objects; 2. View data; 3. Selection and setting; 4. Missing-value handling; 5. Related operations; 6. Aggregation; 7. Reshaping; 8. Time series; 9. Categoricals; 10. Plotting; 11. Importing and saving data.

# coding=utf-8
import pandas as pd
import numpy as np

### 1. Create objects
# 1. Pass a list to create a Series; pandas creates a default integer index
s = pd.Series([1, 3, 5, np.nan, 6, 8])  # (completed per the standard tutorial example)
In many ways, a Series is similar to the Python dict object.
a = pd.Series()                               # an empty Series
b = pd.Series([2, 5, 8])                      # from a list
c = pd.Series([3, 'x', b])                    # elements may be of mixed types
d = pd.Series({'name': 'xufive', 'age': 50})  # from a dict; keys become the index
Series has a dazzling number of methods. A quick experiment with add(): I originally thought it would insert a new element, but it actually performs an element-wise addition, just like numpy.array broadcasting.
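A quick sketch of that behaviour, using b from above:

print(b.add(10))   # -> 12, 15, 18: adds 10 to every element, nothing is appended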
(Printed DataFrame of daily housing-contract records, one row per project: name, district, counts, area in ㎡ and price, ending with a 'Total contract: main city ... 21755.55㎡' summary row; about 90 rows x 7 columns.)

2. The DataFrame object
df.to_json() converts a DataFrame to a JSON string.
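A small sketch of that serialization (the frame here is illustrative):

import pandas as pd
df = pd.DataFrame({'level': ['a', 'b'], 'number': [1, 2]})
print(df.to_json())                   # column-oriented JSON by default
print(df.to_json(orient='records'))   # one JSON object per row instead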
DataFrame, which makes data operations simpler.

II. Installing pandas
pandas is a third-party Python library, so it must be installed before use; pip install pandas installs pandas and its dependencies automatically.
III. Using pandas
Note: the following operations are carried out in IPython.
1. Import the pandas module under an alias, together with the Series class; everything below assumes these imports.
In [1]: from pandas import Series
In [2]: import pandas as pd
Pandas basics
pandas is a data analysis package built on top of NumPy that provides more advanced data structures and tools.
Just as NumPy is centered on the ndarray, pandas is centered on two core data structures, Series and DataFrame, which correspond to one-dimensional sequences and two-dimensional tables respectively. pandas is conventionally imported as follows:

import pandas as pd
from pandas import Series, DataFrame
A DataFrame is a tabular structure that contains an ordered set of columns (labelled much like an index), where each column can hold a different value type (unlike an ndarray, which has a single dtype). You can basically think of a DataFrame as a collection of Series that share the same index. A DataFrame is constructed much like a Series, except that it can accept multiple one-dimensional data sources at once, each of which becomes a separate column.
stu_dic = {'Name': [...], 'Sex': [...], 'Age': [...],
           'Height': [...], 'Weight': [..., 85, 112]}  # student records (partly lost in the source)
student = pd.DataFrame(stu_dic)  # create a DataFrame from the dict

# Query the first or last 5 rows
print(student)                       # print the whole frame
print('First five rows:\n', student.head())
print('Last five rows:\n', student.tail())

# Query specific rows
print(student.loc[[0, 2, 4, 5, 7]])  # loc takes index labels inside square brackets

# Query specific columns
print(student[['Name', 'Height', 'Weight']].head())  # pass a list to select multiple columns
..., line-delimited JSON, and remote versions of the above categories
HDF5 (in both its standard and pandas formats), Bcolz, SAS, SQL databases (via SQLAlchemy), MongoDB
The into project can efficiently migrate data between any two of the formats above, by routing each conversion through a network of pairwise converters (the bottom of the article gives an intuitive explanation).
How to use it
The into function takes two arguments, source and target, and converts the data in source into the target.
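A sketch of typical calls, assuming the target-first convention of the into project (file names and database URIs are placeholders):

from into import into
import pandas as pd

df = into(pd.DataFrame, 'accounts.csv')    # CSV file -> DataFrame
into('sqlite:///db.sqlite::accounts', df)  # DataFrame -> SQL table via a SQLAlchemy URI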
Organize Pandas Operations
This article is original content; when reproducing it, please credit the source: http://www.cnblogs.com/xiaoxuebiye/p/7223774.html
Importing data:
pd.read_csv(filename): import data from a CSV file
pd.read_table(filename): import data from a delimited text file
pd.read_excel(filename): import data from an Excel file
pd.read_sql(query, connection_object): import data from a SQL table/database
pd.read_json(json_string): import data from a JSON-formatted string
pd.read_html(url): parse the tables in an HTML page (URL, string, or file) into a list of DataFrames; a short sketch of these readers follows.
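A minimal sketch of the most common readers (file names and the URL are placeholders):

import pandas as pd

df_csv = pd.read_csv('data.csv')             # comma-separated values
df_tsv = pd.read_table('data.tsv')           # tab-delimited by default
df_xls = pd.read_excel('data.xlsx')          # first sheet by default
tables = pd.read_html('http://example.com')  # one DataFrame per <table> found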
1. Merging data sets
① Many-to-one merges
We use the merge function in pandas. By default, merge keeps the intersection of the keys in the two datasets (an inner join). The how parameter accepts inner, outer, left, and right, which select, respectively, the intersection of the keys, their union, or all keys from the left or right participating DataFrame, as the sketch below shows. When the column names in the two objects are the same: df1 = ...
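A hedged sketch of these join types (the frames and column names are illustrative):

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'a', 'c'], 'value': [1, 2, 3, 4]})
right = pd.DataFrame({'key': ['a', 'b'], 'group': ['x', 'y']})
print(pd.merge(left, right))               # inner join on 'key': intersection
print(pd.merge(left, right, how='outer'))  # union of keys; the missing side gets NaN
print(pd.merge(left, right, how='left'))   # keep every row of the left frame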
training = pd.read_csv('test.csv', index_col=0)
# Convert gender to 0/1 (the values were lost in the source; 0/1 matches the comment)
sexcode = pd.DataFrame([0, 1], index=['female', 'male'], columns=['sexcode'])
training = training.join(sexcode, how='left', on='sex')
# Drop the variables that do not take part in modelling: name, ticket number,
# port of embarkation, cabin, and the original sex column
training = training.drop(['name', 'ticket', 'embarked', 'cabin', 'sex'], axis=1)
vectorized computation: Python and the JVM share the same Arrow data structures, which avoids serialization overhead.
The number of records in each vectorized batch is controlled by the spark.sql.execution.arrow.maxRecordsPerBatch parameter, which defaults to 10,000. If rows have a particularly large number of columns, this value can be reduced appropriately.

Some restrictions
Not all Spark SQL data types are supported: BinaryType, MapType, ArrayType of TimestampType, and nested StructType are currently unsupported.
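A minimal sketch of an Arrow-backed pandas UDF (assumes PySpark 3.x with pyarrow installed; names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
import pandas as pd

spark = SparkSession.builder.getOrCreate()
# use smaller batches if rows are very wide
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")

@pandas_udf("long")
def plus_one(v: pd.Series) -> pd.Series:   # receives a whole batch as a pandas Series
    return v + 1

df = spark.range(10)
df.select(plus_one(df["id"])).show()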
Pandas UDFs and
anything, but some languages can only work in a particular domain. SQL is such a language: it can only describe data operations. Broadly classified, however, it still counts as a programming language, since it requires lexical analysis and syntax analysis; readers unfamiliar with that process can look it up.

0x02 Prepare the data
Since the data has already been prepared this time, all we need is a small script to read it in; I have packaged everything we need for download.

# -*- coding: utf-8 -*-
Data structures and indexes are must-learn content for getting started with pandas, and this article explains them in detail; after reading it, you should have a clear understanding of both.

I. Introduction to the data structures
pandas has two very important data structures: Series and DataFrame. A Series is similar to a one-dimensional array in NumPy, except that, in addition to the functionality of an ndarray, it also carries an index of labels.
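A small sketch of that labelled index in action:

import pandas as pd
s = pd.Series([1, 3, 5], index=['a', 'b', 'c'])
print(s['b'])     # access an element by label rather than position
print(s[s > 2])   # boolean selection keeps the labels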