"""Return object with labels on the given axis omitted where alternately any or all of the data are missing

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, or tuple/list thereof
    Pass tuple or list to drop on multiple axes
how : {'any', 'all'}
    * any : if any NA values are present, drop that label
    * all : if all values are NA, drop that label
thresh : int, default None
    int value : require that many non-NA values
subset : array-like
    Labels along other axis to consider, e.g. if you are dropping rows these would be a
deviation of the correlation coefficient between them to estimate the overall standard deviation. Under this premise, the correlation coefficient of the user in different sample sizes is calculated, and the standard deviation is observed.
First, you need to find the pair of users with the most overlapping scores. Create a new user-by-user matrix foo, and then fill in the number of overlapping scores for each pair of users one by one:
>>> foo = DataFrame(np.empty((len(data.index),len(data.index)),dtype=i
is to remove the feature (column) or sample (row) containing the identified data from the dataset. The dropna method can be used to delete rows containing missing values in the dataset (dropna() is a method of the DataFrame data structure). Similarly, we can set the axis parameter to 1 to delete columns with at least one NaN value in the dataset. The D
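As a rough illustration of the dropna options mentioned on this page (axis, how, thresh, subset), here is a minimal sketch. The frame `df` and its column names are made up for the example, and it assumes a recent pandas release in which `axis` takes a single value.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 has one NaN, row 2 is entirely NaN.
df = pd.DataFrame(
    {"a": [1.0, np.nan, np.nan], "b": [2.0, 5.0, np.nan], "c": [3.0, 6.0, np.nan]}
)

any_rows = df.dropna(how="any")          # keep rows with no NaN at all
all_rows = df.dropna(how="all")          # drop only rows that are entirely NaN
two_plus = df.dropna(thresh=2)           # keep rows with at least 2 non-NaN values
by_subset = df.dropna(subset=["b"])      # only column "b" decides which rows go
cols_kept = df.dropna(axis=1, thresh=2)  # keep columns with at least 2 non-NaN values

print(any_rows.index.tolist())
print(list(cols_kept.columns))
```

Each call returns a new object and leaves `df` unchanged, so the variants can be compared side by side.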
, such as empty values and so on, so that we can have a general understanding of the data as a whole.
4. Data Cleansing
Because the source data usually contains some empty values or even empty columns, which can affect the time and efficiency of data analysis, this invalid data needs to be processed after previewing the data summary. In general, null data can be removed with the dropna method; when using the method, inspection afterwards showed that
Series, DataFrame
import matplotlib.pyplot as plt
import time
from numpy import nan as NA
data = Series([1, NA, 3.5, 7, NA])
# Note: the result keeps the original index labels of the non-NA values, not a renumbered index after removal
# The reset_index method can reset the index; its drop=True option discards the original index and sets a new 0-based index
print(data.dropna())
# The following gives the same result
print(data[data.notnull()])
data1 = DataFrame([[1, 2, 3], [NA, 2.3, 4], [NA, NA, NA]])
# Note: Because of the DataFrame settin
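To make the point about index labels concrete, the following sketch reuses the same five values and shows the index before and after reset_index(drop=True). The variable names `kept` and `renumbered` are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, 7, np.nan])

kept = data.dropna()                      # original labels 0, 2, 3 survive
renumbered = kept.reset_index(drop=True)  # fresh labels 0, 1, 2

print(kept.index.tolist())
print(renumbered.index.tolist())
```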
compromise value.
Here we measure the standard deviation of the scoring system by selecting a pair of users with the most overlapping ratings in data and using the standard deviation of the correlation coefficient between them to estimate the overall standard deviation. On this premise, the correlation coefficients of the two users under different sample sizes are statistically analyzed and their standard deviation changes are observed.
First, find a single user with the most overlapping scor
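The overlap counts described here can also be computed in one step rather than filled in one by one. The sketch below is an assumption-laden illustration, not the article's own code: it invents a tiny ratings table with users as rows (matching the `len(data.index)` sizing used above) and items as columns, and swaps the element-by-element fill for a matrix product.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are items, NaN = not rated.
data = pd.DataFrame(
    {"item1": [5.0, 4.0, np.nan], "item2": [3.0, np.nan, 2.0], "item3": [4.0, 5.0, 1.0]},
    index=["u1", "u2", "u3"],
)

rated = data.notnull().astype(int)  # 1 where a rating exists, 0 otherwise
overlap = rated.dot(rated.T)        # overlap.loc[a, b] = number of items rated by both

# Blank out the diagonal (a user compared with itself) and pick the best pair.
off_diagonal = overlap.where(~np.eye(len(overlap), dtype=bool))
best_pair = off_diagonal.stack().idxmax()

print(overlap)
print(best_pair)
```

When several pairs tie, idxmax returns the first one in row order.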
obj.value_counts() calculates the number of occurrences of each value
pd.value_counts(obj.values) can also be used to calculate the counts; this is the top-level function
isin([]) determines whether each value of the Series is contained in the sequence of values passed in
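A small sketch of the two methods on a made-up Series (the values in `obj` are illustrative only):

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

counts = obj.value_counts()   # occurrences of each distinct value
mask = obj.isin(["b", "c"])   # True where the value is "b" or "c"
subset = obj[mask]            # boolean indexing with the mask

print(counts)
print(subset.tolist())
```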
IV. Processing of missing data
NaN processing methods
dropna: delete null values
fillna: assign values to null values
isnull d
Data source acquisition:
https://www.kaggle.com/datasets
1,
Look at some basic stats for the 'imdb_score' column: data.imdb_score.describe()
Select a column: data['movie_title']
Select the first 10 rows of a column: data['duration'][:10]
Select multiple columns: data[['budget', 'gross']]
Select all movies over two hours long: data[data['duration'] > 120]
data.country = data.country.fillna('')
data.duration = data.duration.fillna(data.duration.mean())
data = pd.read_csv('movie_metadata.csv', dtype
seconds, but inspection showed that after dropna() all the rows were gone. The pandas manual explains that, without arguments, dropna() removes every row that contains a null value. If you want to remove only the columns that are entirely null, you need to add the axis and how parameters: df.dropna(axis=1, how='all'). A total of 6 of the 14 columns were removed, a
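The pitfall can be reproduced on a toy table. In this sketch the frame and its column names are invented: one column is completely empty, so the default call removes every row, while axis=1 with how='all' removes only that column.

```python
import numpy as np
import pandas as pd

# Hypothetical table: column "empty" has no data, so every row contains a NaN.
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0],
        "y": [np.nan, 5.0, 6.0],
        "empty": [np.nan, np.nan, np.nan],
    }
)

rows_left = df.dropna()                   # default: axis=0, how='any'
cols_left = df.dropna(axis=1, how="all")  # drop only columns that are entirely NaN

print(len(rows_left))
print(list(cols_left.columns))
```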
This article shares a Python solution for the case where the missing values handled with pandas are empty strings. It is a useful reference, and I hope it helps. Let's take a look.
Pitfall record:
While using pandas to process missing values in a CSV file, I ran into a strange bug: when the CSV file is opened in Excel, some cells clearly contain nothing, so naturally I expected pandas dropna() or fillna() to deal with the missing values.
But pandas read the C
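One common way to handle this kind of situation, not necessarily the fix the article goes on to describe, is to convert blank strings into real NaN values before calling dropna() or fillna(). The frame below is made up for the illustration, and the regular expression `^\s*$` (empty or whitespace-only) is an assumption about what counts as blank.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "missing" cells hold '' or ' ' rather than NaN.
df = pd.DataFrame({"name": ["Ann", "", "Cy"], "city": ["Oslo", "Rome", " "]})

before = int(df.isnull().sum().sum())  # blank strings are not counted as missing

# Turn strings that are empty or whitespace-only into NaN.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)
after = int(cleaned.isnull().sum().sum())

complete_rows = cleaned.dropna()       # now behaves as expected

print(before, after)
print(complete_rows)
```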
false false"
print('--------output the missing values and index of the dataframe---------')
data = df[df.isnull().values == True]
print(data[~data.index.duplicated()])
'''
--------output the missing values and index of the dataframe---------
   one  two  three
b  NaN  NaN    NaN
d  NaN  NaN    NaN
g  NaN  NaN    NaN
'''
print('--------output dataframe columns with missing values---------')
print(df.isnull().any())
'''
--------output dataframe columns with missing values---------
one      True
two      True
three    True
dtype: bo
. Interpolation method: the interpolation method is based on Monte Carlo simulation, combined with linear models, generalized linear models, decision trees, and other methods, to compute predicted values that replace the missing values.
import pandas as pd, numpy as np
stu_score = {'Score': [88.0, 76.0, 89.0, 67.0, 79.0, None, None, None, 90.0, None, None, 92.0, None, None, 86.0, 73.0, None, None, 77.0]}
stu_score2 = pd.DataFrame(stu_score)
s = stu_score2['Score']
print(s)
# Combine the sum function and the isnull function to detect that the data contains
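As a simplified sketch of counting and then replacing missing scores, the example below uses a shortened, made-up score list. Note that it substitutes mean filling and plain linear interpolation (Series.interpolate) for the model-based interpolation described above, purely to keep the example small.

```python
import pandas as pd

s = pd.Series([88.0, 76.0, None, 67.0, None, 90.0], name="Score")

n_missing = int(s.isnull().sum())  # True counts as 1, so the sum is the number of NaN
filled_mean = s.fillna(s.mean())   # replace NaN with the mean of the observed scores
filled_linear = s.interpolate()    # linear interpolation between neighbouring values

print(n_missing)
print(filled_mean.tolist())
print(filled_linear.tolist())
```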
rows of data) and row and column statistics. Because the source data usually contains some empty values or even empty columns, which can affect the time and efficiency of data analysis, this invalid data needs to be processed after previewing the data summary.
First call the DataFrame.isnull() method to see which values in the data table are null; its opposite is DataFrame.notnull(). pandas checks every value in the table for nullness and returns True/False as the result. As shown in the f
NA in pandas is np.nan, and Python's built-in None is also treated as NA.
There are four methods for dealing with NA: dropna, fillna, isnull, notnull.
is(not)null
This pair of methods is applied element-wise to the object and returns a Boolean array, which is typically used for Boolean indexing.
dropna
For a Series, dropna returns a Series that contains only the non-null data and their index values.
The problem is how to deal with DataFrame, because
.
Methods for handling missing data:
dropna() filters out entries whose value is NaN
fillna() fills in missing data
isnull() returns a Boolean array with True where a value is missing
notnull() returns a Boolean array with False where a value is missing
Filtering missing data:
sr.dropna()
sr[sr.notnull()]
Fill missing data: fillna(0)
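The four methods listed above can be tried on a small Series. The values and labels in `sr` are invented for the sketch.

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, np.nan, 3.0, None], index=["a", "b", "c", "d"])

missing = sr.isnull()    # True at "b" and "d" (None is treated as NaN)
dropped = sr.dropna()    # keeps "a" and "c"
same = sr[sr.notnull()]  # boolean indexing, same result as dropna()
filled = sr.fillna(0)    # NaN replaced with 0

print(missing.tolist())
print(dropped.index.tolist())
print(filled.tolist())
```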
pandas: DataFrame
condition is that the key is tested, not the value
'b' in sr  # True
1 in sr  # False
Taking a value works much like a dictionary: sr.get('a', 0)
Testing, slicing, taking values
sr = pd.Series([1, 2, 3, 4], index=['b', 'c', 'd', 'a'])
b    1
c    2
d    3
a    4
dtype: int64
sr.iloc[1]  # take the element at position 1 == 2
sr.iloc[2]  # take the element at position 2 == 3
Fetch index
sr = pd.Series([1, 2, 3, 4], index=['b', 'c', 'd', 'a'])
sr1 = pd.Series([5, 6, 7, 8, 9], index=['a', 'b', 'c', 'd', 'e'])
sr2 = pd.Series([5, 6, 7, 8, 9, 10], index=['a', 'b', 'c', 'd', 'e', 'f'])
sr + sr1 ==
a     9.0
b     7.0
c     9.0
d    11.0
e     NaN
dtyp
The previous article, Pandas array (Pandas Series)-(3) Vectorization, explained that when two pandas Series are combined in a vectorized operation, any key that appears in only one of the Series yields NaN in the result. So how can NaN be handled?
1. dropna() method:
This method discards all values in the result that are NaN, which is equivalent to computing only over the keys common to both Series:
import pandas as pd
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b
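The behaviour described here can be sketched as follows. The second Series `s2` and its values are made up for the example, since the original code is cut off, and the fill_value variant is shown only as an alternative for comparison.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
s2 = pd.Series([10, 20, 30, 40], index=["c", "d", "e", "f"])  # hypothetical values

total = s1 + s2                    # labels found in only one Series give NaN
common_only = total.dropna()       # keep only the shared labels "c" and "d"
as_zero = s1.add(s2, fill_value=0) # alternative: treat a missing side as 0

print(total)
print(common_only)
print(as_zero)
```

dropna() keeps just the overlap, whereas add() with fill_value keeps every label; which one is appropriate depends on whether a missing key should count as zero.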
column
Sort and rank
Sort:
sort_index() sorts the index of the rows or columns (in dictionary order)
sort_index(by=) sorts by the values in one or more columns
A Series is sorted by value with the order method
Ranking:
rank()
Axis index with duplicate values
The is_unique property of the index tells you whether its values are unique
Summary and calculation of descriptive statistics
sum()
mean()
describe()
Descriptive and summary statistics functions
Correlation coefficients and covariance
The Series and Da
data summaries, including data viewing (by default a total of 60 rows of data are output) and row and column statistics. Since the source data usually contains some null values or even empty columns, which can affect the time and efficiency of data analysis, the invalid data needs to be processed after previewing the data summary.
The DataFrame.isnull() method is called first to see which values in the data table are null; its opposite is DataFrame.notnull(), where pandas checks all the data in the table i