Sometimes you need to transform the values in a pandas Series in a way that no built-in function covers. In that case you can write the function yourself and pass it to the Series's apply method, which calls the function on each value and returns a new Series.
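A minimal runnable sketch of this pattern, using the add_one function from the example (expected output shown in comments):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

def add_one(x):
    return x + 1

# apply calls add_one on every element and returns a new Series
print(s.apply(add_one))
# 0    2
# 1    3
# 2    4
# 3    5
# 4    6
# dtype: int64
```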
Data conversion: removing duplicate rows. The duplicated() method of a DataFrame object detects duplicate rows and returns a Series of booleans, one element per row: the element is True if the row repeats an earlier row (that is, it is not the first occurrence) and False if it does not repeat anything before it. A boolean Series like this is of great use, and is particularly handy for filtering.
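A minimal runnable sketch of duplicated() on a small made-up frame; drop_duplicates(), not mentioned above, does the filtering in one step:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Bob", "Bob", "Mary"], "births": [968, 968, 77]})

# duplicated() marks every row that repeats an earlier row
print(df.duplicated())
# 0    False
# 1     True
# 2    False
# dtype: bool

# the boolean Series filters out the repeats; drop_duplicates() is the one-step equivalent
print(df[~df.duplicated()])
print(df.drop_duplicates())
```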
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
Use the zip function to merge the two lists together.
# Check the zip function's help
zip?
BabyDataSet = list(zip(names, births))
BabyDataSet
[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
We have completed the creation of a basic dataset. We now use Pandas to export this data to a
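A sketch of a plausible continuation (an assumption, since the target of the export is not stated): wrap the zipped list in a DataFrame and write it out:

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
BabyDataSet = list(zip(names, births))

# build a DataFrame from the list of (name, births) tuples
df = pd.DataFrame(data=BabyDataSet, columns=['Names', 'Births'])
print(df)

# export it, for example to CSV (the format is an assumption)
df.to_csv('births.csv', index=False)
```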
A new column whose length matches the number of rows can be added directly by indexing plus assignment. To find the maximum value of a column:
max_calories = food_info["Energ_Kcal"].max()
First select the column whose maximum you need, then call its max() method to get the maximum of that column.
4. Sorting in pandas
food_info.sort_values("Sodium_(mg)", inplace=True)
print(food_info["Sodium_(mg)"])
Call the sort_values method on the DataFrame
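A self-contained sketch of the two operations above, using a small made-up food_info frame with the same column names:

```python
import pandas as pd

food_info = pd.DataFrame({
    "Energ_Kcal": [250, 717, 25],
    "Sodium_(mg)": [627, 643, 2],
})

# maximum of one column: select the column, then call max() on it
max_calories = food_info["Energ_Kcal"].max()
print(max_calories)  # 717

# sort the whole frame in place by one column, then inspect that column
food_info.sort_values("Sodium_(mg)", inplace=True)
print(food_info["Sodium_(mg)"])
```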
This article describes how pandas Series with an index behave in vectorized operations:
1. The two Series have identical index arrays:
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s1 + s2)
a    11
b    22
c    33
d    44
dtype: int64
The values that share an index label are added directly.
2. The index arrays hold the same labels, but in a different order:
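A short sketch of case 2, assuming the same values with the index labels in a shuffled order; pandas aligns on the labels, not on position:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['d', 'c', 'b', 'a'])

# values are matched by index label before adding: 'a' pairs with 40, 'b' with 30, ...
print(s1 + s2)
# a    41
# b    32
# c    23
# d    14
# dtype: int64
```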
The pandas Series is much more powerful than the NumPy array in many ways. First, a pandas Series has extra methods; for example, the describe method gives some summary statistics of the Series:
import pandas as pd
s = pd.Series([1, 2, 3, 4])
d = s.describe()
print(d)
count    4.000000
mean     2.500000
std      1.290994
min      1.000000
25%      1.750000
50%      2.500000
75%      3.250000
max      4.000000
dtype: float64
Second, the biggest
In [57]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ....:                           'foo', 'bar', 'foo', 'foo'],
   ....:                    'B' : ['one', 'one', 'two', 'three',
   ....:                           'two', 'two', 'one', 'three'],
   ....:                    'C' : np.random.randn(8),
   ....:                    'D' : np.random.randn(8)})
   ....:

In [58]: df
Out[58]:
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  thre
"Original" 10 minutes to fix pandasThis article is a simple translation of "Ten Minutes to Pandas" on the official website of Pandas, the original is here. This article is a simple introduction to pandas, detailed introduction please refer to:Cookbook . As a rule, we will introduce the required packages in the following format:First, create the objectYou can view
It has been nothing but red error messages all afternoon.
Python2 and Python3 version conflicts
pip version issue: pip -V
Update: sudo apt-get update
sudo apt-get install python-dev
In the end I am not sure which step actually fixed the installation; I suspect it was one of the following two commands:
sudo easy_install -U setuptools
sudo pip install --upgrade setuptools
(When I first tried, neither of them worked on its own, and I do not know why it suddenly started working. If it still fails, run both; at least one answer I found suggests running both.)
The following shares a Python solution to the problem of pandas treating missing values as empty strings. It should be a useful reference; I hope it helps you.
Pitfall record:
While using pandas to handle missing values in a CSV file I ran into a strange bug: when the CSV file is opened in Excel, there obviously is
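What this usually means (an assumption based on the description above) is that cells which look empty come back as empty strings rather than NaN, so isnull() does not see them. A sketch of normalising them, with made-up data:

```python
import numpy as np
import pandas as pd
from io import StringIO

# hypothetical CSV: the score for Jessica is an empty field
csv_text = "name,score\nBob,968\nJessica,\nMary,77\n"

# keep_default_na=False leaves the empty field as '' instead of converting it to NaN
df = pd.read_csv(StringIO(csv_text), keep_default_na=False)
print(df["score"].eq("").sum())     # 1: the empty string is invisible to isnull()

# convert empty strings to NaN so that dropna()/fillna() can handle them
df = df.replace("", np.nan)
print(df["score"].isnull().sum())   # 1
```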
df_obj['user number'].isin(alist)  # put the values to filter into a list; isin checks each row and returns a boolean result per row, True when the value matches
df_obj[df_obj['user number'].isin(alist)]  # keep only the rows whose match result is True
Fuzzy filtering on a DataFrame (like LIKE in SQL):
df_obj[df_obj['package'].str.contains(r'.*voice cdma.*')]  # fuzzy matching with a regular expression; * matches 0 or more times, ? matches 0 or 1 time
Data conversion with a DataFrame (supplementary notes later)
df_obj['branches_maintenance
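A self-contained sketch of the two filters above, with made-up data and the same column names:

```python
import pandas as pd

df_obj = pd.DataFrame({
    "user number": [1001, 1002, 1003, 1004],
    "package": ["voice cdma 2G", "data only", "voice cdma 4G", "sms bundle"],
})

alist = [1001, 1003]

# isin: exact-match filtering, like SQL IN
print(df_obj[df_obj["user number"].isin(alist)])

# str.contains: fuzzy filtering with a regular expression, like SQL LIKE
print(df_obj[df_obj["package"].str.contains(r".*voice cdma.*")])
```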
was only 85.9 seconds.
The next step is to process the null values in the remaining rows. After testing, using an empty string in DataFrame.replace() saves some space compared with the default NaN value, but in the CSV file itself an empty column only costs one extra ",", so removing the 6 columns across roughly 98 million rows saved only about 200 MB. Further data cleaning still comes down to removing useless data and merging.
Dropping data columns: in addition to invalid values and columns outside the requirements, some of the table's own redundant columns can be discarded as well.
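A hedged sketch of the two steps described above, with made-up column names: write missing values out as empty strings via DataFrame.replace() and drop the columns that are no longer needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "keep_me": [1, 2, 3],
    "col_a": [np.nan, "x", np.nan],
    "useless_1": [0, 0, 0],
    "useless_2": ["", "", ""],
})

# represent missing values as empty strings; in the CSV an empty field is just a bare ","
df = df.replace(np.nan, "")

# discard columns that carry no useful information
df = df.drop(columns=["useless_1", "useless_2"])

df.to_csv("cleaned.csv", index=False)
```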
http://www.cnblogs.com/batteryhp/p/5006274.html
pandas is the preferred library for the rest of the content in this book. pandas can meet the following requirements:
Data structures with automatic or explicit data alignment along an axis. This prevents many common errors caused by misaligned data and by data coming from differently indexed sources.
Integrated time series capabilities
Data structures that can handle both time series and non-time-series data
Recently my work has involved a lot of Hive SQL. Occasionally I run into problems that are not easy to solve in SQL, so I download the data to a file and process it with pandas. Because the data volume is large, I have picked up some relevant experience that I can share; I hope it helps you learn pandas. Reading and writing large text data: sometimes we get a lot of very large text files, and reading one of them fully into memory
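A common workaround for that (a hedged sketch; the file name, separator, and chunk size are made up) is to read the file in chunks via the chunksize parameter of read_csv:

```python
import pandas as pd

# read a large delimited file in chunks instead of loading it all at once
reader = pd.read_csv("big_dump.txt", sep="\t", chunksize=100000)

total_rows = 0
for chunk in reader:
    # each chunk is an ordinary DataFrame; process it, then let it be garbage-collected
    total_rows += len(chunk)

print(total_rows)
```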
Conversions
CSV data set read
Structured data file read
HDF5 read
JSON data set read
Excel read
Hive table read
External database read
Index
Automatically created
No index; an additional column needs to be created if one is required
Row structure
Series structure, belonging to the pandas DataFrame
Most people who do data analysis start with Excel, which is probably the most highly regarded tool in the Microsoft Office suite. But when the amount of data gets very large, Excel is powerless. The Python third-party package pandas greatly extends what Excel can do; getting started takes a little time, but it really is an essential tool for working with big data!
1. Read data from a file
pandas supports reading data in many formats, and of course the most common are
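Presumably CSV and Excel files; a minimal sketch of reading both (the file names are made up):

```python
import pandas as pd

# read a CSV file into a DataFrame
df_csv = pd.read_csv("sales.csv")

# read the first sheet of an Excel workbook (needs an engine such as openpyxl installed)
df_xlsx = pd.read_excel("sales.xlsx", sheet_name=0)

print(df_csv.head())
print(df_xlsx.head())
```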