This article shares the basic operations of pandas, the Python data analysis library. It should serve as a useful reference. Let's take a look.
What is Pandas?
Is it this?
... Apparently pandas is not as cute as this guy.
Let's see how the official pandas website defines it:
pandas is an open source library providing easy-to-use data structures and data analysis tools for the Python programming language.
Obviously, pandas is a very powerful data analysis library for Pyth
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object RDD2DataFrameByReflectionScala {
  case class Person(name: String, age: Int)
  def main(args: Array[String]): Unit = {
    val conf = new
Spark SQL's createDataFrame offers a variety of overloads; I use these two:
createDataFrame(java.util.List<Row> rows, StructType schema)
For a well-formed RDD:
val schemaString = "id Name"
val schema = StructType(schemaString.split(" "
case class Record(ts: Long, id: Int, value: Int)
If the data is an RDD, we often use reduceByKey to get the record with the latest timestamp, using the following method:
def findLatest(records: RDD[Record])(implicit spark: SparkSession) = {
  records.keyBy(_.id).
During data processing, especially in big data competitions, a common problem is merging multiple tables. For example, one table has the two fields user_id and age, another table has the two fields user_id and sex, and to merge these two
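The merge described above can be sketched with pandas as follows. This is a minimal illustration, not the original article's code: the table contents are made up, and only the column names user_id, age, and sex come from the text.

```python
import pandas as pd

# Two hypothetical tables that share the user_id field.
ages = pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 31, 47]})
sexes = pd.DataFrame({"user_id": [1, 2, 4], "sex": ["F", "M", "F"]})

# Inner join: keep only the user_ids present in both tables.
inner = pd.merge(ages, sexes, on="user_id", how="inner")

# Outer join: keep every user_id; missing fields become NaN.
outer = pd.merge(ages, sexes, on="user_id", how="outer")

print(inner)
print(outer)
```

The how argument decides which keys survive the merge; "left" and "right" are also available when one table should be kept in full.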
Using Shiny to update annual, quarterly, and monthly value chains
Goals
Clicking the "Annual budget update" button updates all promotion percentages.
Clicking the "Quarterly budget update" button updates the percentage of the corresponding quarter
concil_set:
    if each in ans_attend_set:
        concil_attend_set.add(each)
    elif each in ans_notatt_set:
        concil_notatt_set.add(each)
    else:
        concil_notans_set.add(each)

# 3. Display result
def disp(ss, cap, num=True):
    # ss: list or set
    # cap: opening description
    print(cap, '({})'.format(len(ss)))
    for i in range(np.ceil(len(ss) / 5).astype(int)):
        pre = i * 5
        nex = (i + 1) * 5
        # adjust the display format
        dd = ''
        for each in list(ss)[pre:nex]:
            if len(each) == 2:
                dd = dd + ' ' + each
            elif len(each) == 3:
                dd = dd + ' ' + eac
following lists the various data that the DataFrame constructor can accept.
Index objects
# -*- encoding: utf-8 -*-
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# The pandas Index object is responsible for managing axis labels and other metadata. When building a Series or DataFrame, any array or other sequence of labels used is converted to In
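As a small illustration of the Index behavior described above, the sketch below builds a Series from a plain list of labels and inspects the resulting index. The data is invented. The immutability check is not stated in the excerpt; it reflects standard pandas behavior.

```python
import pandas as pd

# Labels passed as a plain Python list are converted to an Index object.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
labels = s.index
is_index = isinstance(labels, pd.Index)

# Index objects are immutable: item assignment raises TypeError.
try:
    labels[0] = "z"
    mutated = True
except TypeError:
    mutated = False

print(labels, is_index, mutated)
```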
-dimensional array, consisting of a set of data (of various NumPy data types) and a set of associated labels (that is, the index). Creating a Series
In most cases, a Series is obtained directly from a DataFrame, but we can also create a Series ourselves. The syntax is as follows:
s = pd.Series(data, index=index)
Here data can be one of several things: a dictionary, an ndarray, or a scalar
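The three kinds of data mentioned above can be tried out as follows. This is a minimal sketch with made-up values and labels.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From an ndarray: the index must have the same length as the data.
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["x", "y", "z"])

# From a scalar: the value is repeated for every index label.
from_scalar = pd.Series(5, index=["a", "b", "c"])

print(from_dict, from_array, from_scalar, sep="\n")
```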
Index
I. Introduction
The data needed for data mining is often spread across different datasets; data integration is the process of merging multiple datasets into one consistent data store.
For a DataFrame, the join key is sometimes in the index.
III. Code example
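A join whose key lives in the index might look like the sketch below. The frames and column names are hypothetical and are not taken from the original article's example.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
# The join key of the right-hand table is its index, not a column.
right = pd.DataFrame({"group": [10, 20]}, index=["a", "b"])

# Match the left column "key" against the right index.
merged = pd.merge(left, right, left_on="key", right_index=True, how="left")

# DataFrame.join aligns on the index of both frames.
joined = left.set_index("key").join(right, how="inner")

print(merged)
print(joined)
```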
# coding: utf-8
# In[2]:
from pandas import DataFrame
import pandas as pd
import numpy as np
# #
result object, together with the original object's index
df.groupby('smoker', group_keys=False).apply(mean)
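The effect of group_keys can be seen in a small sketch. The data and the helper top are hypothetical; the excerpt applies a function named mean whose definition is not shown, so a different function is used here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "smoker": ["No", "Yes", "No", "Yes"],
    "tip": [1.0, 3.0, 2.0, 4.0],
})

def top(group, n=1):
    """Return the n rows with the largest tip in a group (hypothetical helper)."""
    return group.sort_values("tip", ascending=False).head(n)

# With group keys: the result index combines the group key and the original index.
with_keys = df.groupby("smoker")[["tip"]].apply(top)

# group_keys=False: only the original index is kept.
without_keys = df.groupby("smoker", group_keys=False)[["tip"]].apply(top)

print(with_keys)
print(without_keys)
```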
Turning the group index into a column of the DataFrame
In some cases groupby's as_index=False parameter is not used and the result is a Series. This generally happens when the data is grouped but the calculation involves several columns; the resulting Series has a hierarchical index. This turns the series into a data
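One way to turn such a Series with a hierarchical index back into a flat table is reset_index, sketched below. The data, the column names, and the result name tip_pct are assumptions for illustration. The truncated excerpt may have used a different method.

```python
import pandas as pd

df = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "day": ["Sun", "Sun", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 10.0, 40.0],
    "tip": [1.0, 4.0, 2.0, 4.0],
})

# A calculation involving several columns yields a Series
# whose index is hierarchical: (smoker, day).
ratio = df.groupby(["smoker", "day"])[["tip", "total_bill"]].apply(
    lambda g: (g["tip"] / g["total_bill"]).mean()
)

# reset_index moves the index levels into ordinary columns.
flat = ratio.reset_index(name="tip_pct")

print(ratio)
print(flat)
```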
calculate the mean absolute deviation, a robust statistic similar to the standard deviation
median: This method returns the median
min: This method returns the minimum value
max: This method returns the maximum value
mode: This method returns the mode (the most frequent value)
std: This method returns the standard deviation
var: This method returns the variance
skew: This method returns the skewness coefficient, which represents the degree of symmetry of the data distributio
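The statistics listed above can be tried on a small Series. The values are invented. The mean absolute deviation is computed by hand here because recent pandas versions no longer provide a mad method.

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 7])

stats = {
    "median": s.median(),
    "min": s.min(),
    "max": s.max(),
    "mode": s.mode().tolist(),   # mode() returns a Series; ties give several values
    "std": s.std(),              # sample standard deviation (ddof=1)
    "var": s.var(),              # sample variance (ddof=1)
    "skew": s.skew(),            # positive here: the tail is on the right
}

# Mean absolute deviation around the mean, computed manually.
mean_abs_dev = (s - s.mean()).abs().mean()

print(stats, mean_abs_dev)
```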
['2001'].describe())
# Slice by year
print(ts['2001/01'].describe())
# Time slices
print(ts['2002/05/01':'2002/05/06'])
print('\n')
# The indexing and slicing methods above also apply to DataFrame
# 2.2 Time series with a duplicate index
date = ['2001/02/01', '2001/02/01', '2001/02/02']
ts = pd.Series(range(3), index=date)
print(ts)
print(ts['2001/02/01'])  # Duplicate index return
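The slicing shown above can be reproduced end to end with a sketch like the following. The daily series is made up, and .loc with a DatetimeIndex is used here instead of plain brackets. This is a choice made for the sketch, not something the excerpt shows.

```python
import pandas as pd

# Hypothetical daily series covering 2001 and 2002.
idx = pd.date_range("2001-01-01", periods=730, freq="D")
ts = pd.Series(range(730), index=idx)

year = ts.loc["2001"]                       # every row in 2001
month = ts.loc["2001-01"]                   # every row in January 2001
window = ts.loc["2002-05-01":"2002-05-06"]  # both endpoints are included

# A time series whose index contains a repeated date.
dates = pd.to_datetime(["2001/02/01", "2001/02/01", "2001/02/02"])
dup = pd.Series(range(3), index=dates)
same_day = dup.loc["2001-02-01"]            # returns every matching row

print(len(year), len(month), len(window))
print(same_day)
```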
Data retrieval, processing, and storage
1. Writing CSV files with NumPy and pandas
To write a CSV file, NumPy provides the savetxt() function, the counterpart of loadtxt(); it can save an array in a delimited file format such as CSV:
np.savetxt('np.csv', a, fmt='%.2f', delimiter=',', header='#1, #2, #3, #4')
In the function call above, we specify the name of the file that will hold the array, the array itself, an optional format, the delimiter (a space by default), and an optional header.
Use
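A round trip through savetxt() and loadtxt() can be sketched as follows, together with a pandas equivalent. The array values, the temporary directory, and the pandas file name pd.csv are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

a = np.array([[1.0, 2.5, 3.25, 4.0],
              [5.0, 6.5, 7.75, 8.0]])

out_dir = tempfile.mkdtemp()  # write into a scratch directory

# NumPy: savetxt writes the header as a comment line starting with "# ".
np_path = os.path.join(out_dir, "np.csv")
np.savetxt(np_path, a, fmt="%.2f", delimiter=",", header="#1, #2, #3, #4")
loaded = np.loadtxt(np_path, delimiter=",")  # comment lines are skipped

# pandas: DataFrame.to_csv writes the column labels as the first row.
pd_path = os.path.join(out_dir, "pd.csv")
pd.DataFrame(a).to_csv(pd_path, index=False)
reloaded = pd.read_csv(pd_path)

print(loaded)
print(reloaded)
```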
I. Getting to know pandas
pandas is a very useful library built on NumPy. It has two basic data structures of its own, Series (one-dimensional) and DataFrame (two-dimensional), which make data operations simpler. Although pandas has these two data structures, it is still a Python library, so Python's own data types remain available, and you can also define your own data types with classes.
In the field of financial data analysi
shape produces a random array (numbers between 0 and 1)
randint: produces random integers for a given shape
choice: makes a random selection for a given shape
shuffle: the same as random.shuffle
uniform: produces a random array for a given shape
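The functions listed above appear to come from numpy.random. The sketch below assumes that, and also assumes that the truncated first entry refers to np.random.rand. The shapes and ranges are arbitrary.

```python
import numpy as np

np.random.seed(0)  # fixed seed so repeated runs give the same numbers

r = np.random.rand(2, 3)                       # floats in [0, 1), shape (2, 3)
ints = np.random.randint(0, 10, size=(2, 3))   # integers in [0, 10)
picks = np.random.choice([1, 2, 3, 4, 5], size=(2, 2))  # sampled with replacement

arr = np.arange(5)
np.random.shuffle(arr)                         # shuffles in place, like random.shuffle

u = np.random.uniform(-1.0, 1.0, size=(3,))    # floats in [-1, 1)

print(r, ints, picks, arr, u, sep="\n")
```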
Pandas: Data analysis
Pandas is a powerful toolkit for data analysis in Python.
Pandas is built on top of NumPy.
Main features of pandas
Data structures with built-in functionality
values in the data
name or index.name can rename the data
The DataFrame, also a data structure, is similar to the data frame in R
data = {'year': [2000, 2001, 2002, 2003], 'income': [3000, 3500, 4500, 6000]}
data = pd.DataFrame(data)
print(data)
The result is:
   income  year
0    3000  2000
1    3500  2001
2    4500  2002
3    6000  2003
data1 = pd.
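A runnable version of the steps above might look like this. The variable names frame and s are chosen for the sketch. Note that recent pandas versions keep the dictionary's insertion order for the columns, so year is printed before income, unlike the output quoted above.

```python
import pandas as pd

# Naming a Series and its index.
s = pd.Series([3000, 3500], index=[2000, 2001])
s.name = "income"
s.index.name = "year"

# Building a DataFrame from a dictionary of equal-length lists.
data = {"year": [2000, 2001, 2002, 2003],
        "income": [3000, 3500, 4500, 6000]}
frame = pd.DataFrame(data)

print(s)
print(frame)
```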
import matplotlib
from pandas import DataFrame
import numpy as np
import pandas as pd
import MySQLdb
import matplotlib.pyplot as plt
# df = pandas DataFrame object (two-dimensional labeled array)
# s = pandas Series object (one-dimensional labeled array)
db = MySQLdb.connect(host="localhost", port=3306, user="root", passwd="1234", db='SPJ', charset="utf8")  # connect to the database
filename = 'Count_day.csv'  # file path
query = 'select * FROM J'  # SQL query statement
# import data
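The excerpt stops before the query is run. One plausible next step is loading the query result into a DataFrame, sketched below. An in-memory SQLite database stands in for the MySQL server so the example runs on its own. The columns and rows of table J are invented.

```python
import sqlite3

import pandas as pd

# Stand-in database: the original connects to MySQL instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE J (jno TEXT, jname TEXT)")  # hypothetical columns
conn.executemany("INSERT INTO J VALUES (?, ?)",
                 [("J1", "alpha"), ("J2", "beta")])
conn.commit()

query = "select * FROM J"
df = pd.read_sql_query(query, conn)  # query result as a DataFrame
conn.close()

print(df)
```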
number, as the number of rows, and can be added directly by indexing plus assignment.
To find the maximum value of a column:
max_calories = food_info["energ_kcal"].max()
First select the column whose maximum is needed, then call the max method on it directly.
4. Sorting in pandas
food_info.sort_values("Sodium_(mg)", inplace=True)
print(food_info["Sodium_(mg)"])
Call the sort_values method on the DataFrame
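The two operations above can be tried on a small stand-in table. The rows below are invented; only the column names follow the excerpt.

```python
import pandas as pd

# Hypothetical stand-in for the food_info table.
food_info = pd.DataFrame({
    "energ_kcal": [717, 876, 353],
    "Sodium_(mg)": [714, 2, 1395],
})

# Maximum of a single column.
max_calories = food_info["energ_kcal"].max()

# Sort the rows in place by sodium content, ascending by default.
food_info.sort_values("Sodium_(mg)", inplace=True)

print(max_calories)
print(food_info["Sodium_(mg)"])
```

Sorting keeps the original row labels, so the index is no longer in order afterwards; pass ascending=False to sort in descending order.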