Summary of methods for traversing pandas data in Python
Preface
Pandas is a Python data analysis package that provides a large number of functions and methods for fast, convenient data processing. It defines two main data types, Series and DataFrame, which make data operations easier. A Series is a one-dimensional data structure that pairs a list of values with index labels. A DataFrame is a two-dimensional data structure, similar to a spreadsheet worksheet or a MySQL database table.
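As a minimal sketch of the two data types, the snippet below builds one Series and one DataFrame. The labels and numbers are made up for illustration; only the column names open and high match the example used later in this article.

```python
import pandas as pd

# A Series: one-dimensional values paired with index labels.
prices = pd.Series([10.0, 11.5, 9.8], index=["a", "b", "c"])

# A DataFrame: two-dimensional, one column per key, all sharing a row index.
# The values are invented sample data.
df = pd.DataFrame({
    "open": [10.0, 20.0, 40.0],
    "high": [11.0, 25.0, 50.0],
})

print(prices["b"])   # label-based lookup on a Series
print(df["high"])    # selecting a column returns a Series
print(df.shape)      # (number of rows, number of columns)
```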
In data analysis, traversing, querying, and processing data is unavoidable. For example, suppose we need to divide the values in one DataFrame column by those in another and store the results in a new list. This article uses that task as a running example to introduce several ways of traversing pandas data.
for ... in loop iteration method
The for statement is Python's built-in iteration tool. It reads elements one by one from an iterable container object (such as a list, tuple, dictionary, set, or file) until the container has no more elements. Iteration works as long as the tool and the object both follow the iterator protocol.
The iteration process in detail: the iterable object returns an iterator through its __iter__ method, and the iterator has a __next__ method. The for loop calls __next__ repeatedly and receives the iterator's values in order, one per call. When no more elements remain, a StopIteration exception is raised, which the for loop handles automatically. The advantage of iteration is that the elements do not all have to be loaded into memory at once: they are returned one at a time as __next__ is called, which avoids running out of memory.
>>> x = [1, 2, 3]
>>> its = x.__iter__()  # a list is iterable; a non-iterable object would raise an error here
>>> its
<list_iterator object at 0x100f32198>
>>> next(its)  # its has a __next__ method, so it is an iterator
1
>>> next(its)
2
>>> next(its)
3
>>> next(its)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
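The same protocol can be written out by hand. The following is a rough sketch of what a for loop does with the calls described above; the variable names are arbitrary.

```python
# What "for value in x:" does, spelled out with explicit protocol calls.
x = [1, 2, 3]

collected = []
its = iter(x)              # equivalent to x.__iter__()
while True:
    try:
        value = next(its)  # equivalent to its.__next__()
    except StopIteration:
        break              # the for statement handles this exception silently
    collected.append(value)

print(collected)
```

The built-in functions iter() and next() are the usual way to call __iter__ and __next__ outside of a for loop.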
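Applied to the running example, the loop implementation is as follows:
def haversine_looping(df):
    # Divide high by open for every row, one positional lookup at a time.
    distance_list = []
    for i in range(0, len(df)):
        distance_list.append(df.iloc[i]['high'] / df.iloc[i]['open'])
    return distance_list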
The range used in the code above can itself be reproduced. Following the iterator protocol, we can write our own iterator (with __iter__ and __next__ methods) that behaves the same way and use it in a for loop. The code is as follows:
class MyRange:
    def __init__(self, num):
        self.i = 0
        self.num = num

    def __iter__(self):
        # The object is its own iterator.
        return self

    def __next__(self):
        if self.i < self.num:
            i = self.i
            self.i += 1
            return i
        else:
            raise StopIteration()

for i in MyRange(10):
    print(i)
We can also use a list comprehension to do the same processing with less code.
distance_list = [df.iloc[i]['high'] / df.iloc[i]['open'] for i in range(0, len(df))]
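To try the loop and the list comprehension on concrete data, here is a small self-contained sketch. The DataFrame contents and the function name ratio_loop are hypothetical; only the column names come from the article.

```python
import pandas as pd

# Invented sample data with the two columns used in the article.
df = pd.DataFrame({"open": [10.0, 20.0, 40.0], "high": [11.0, 25.0, 50.0]})

def ratio_loop(frame):
    """Divide high by open row by row using positional lookups."""
    result = []
    for i in range(len(frame)):
        row = frame.iloc[i]   # one row lookup per pass instead of two
        result.append(row["high"] / row["open"])
    return result

# The same computation as a list comprehension.
ratio_comprehension = [df.iloc[i]["high"] / df.iloc[i]["open"] for i in range(len(df))]

print(ratio_loop(df))
print(ratio_comprehension)
```

Both forms still perform one Python-level lookup per row, so the comprehension is mainly shorter, not fundamentally different.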
iterrows() generator method
iterrows is a generator that iterates over the rows of a DataFrame. For each row it returns the row's index and a Series object containing the row's data. A generator is a special kind of iterator that supports the iterator protocol internally. Python provides generator functions and generator expressions for creating generators. Each request returns one result, so the whole result list never has to be built at once, which saves memory.
Generator function: written as a regular def statement, but it returns one result at a time using the yield statement. After each result the function's state is suspended, and execution resumes from that point on the next request.
def gensquares(N):
    for i in range(N):
        yield i**2

print(gensquares(5))
for i in gensquares(5):
    print(i)

Output:
<generator object gensquares at 0xb3d37fa4>
0
1
4
9
16
Generator expression: similar to a list comprehension, but it produces an object that generates results on demand.
print((x**2 for x in range(5)))
print(list(x**2 for x in range(5)))

Output:
<generator object <genexpr> at 0xb3d31fa4>
[0, 1, 4, 9, 16]
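The memory saving can be observed directly. The sketch below compares a list with a generator expression over the same range; the size of 100,000 is an arbitrary choice.

```python
import sys

squares_list = [x ** 2 for x in range(100_000)]   # all values built up front
squares_gen = (x ** 2 for x in range(100_000))    # values produced on demand

# The generator object stays small no matter how many values it will yield.
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))

total = sum(squares_gen)       # consumes the generator
leftover = list(squares_gen)   # a generator can only be iterated once
print(total == sum(squares_list), leftover)
```

Note that sys.getsizeof reports only the size of the container object itself, not of the integers it refers to, so the real difference is larger than the reported one.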
The iterrows() implementation of the running example is as follows:
def haversine_looping(df):
    # Each iteration yields the row index and the row as a Series.
    distance_list = []
    for index, row in df.iterrows():
        distance_list.append(row['high'] / row['open'])
    return distance_list
The source code of iterrows itself is shown below. The yield statement suspends the function and sends a pair of values (the index and the row) back to the caller:
def iterrows(self):
    columns = self.columns
    klass = self._constructor_sliced
    for k, v in zip(self.index, self.values):
        s = klass(v, index=columns, name=k)
        yield k, s
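Because each row is packed into a single Series, values from columns with different dtypes are converted to one common dtype. The following sketch shows this on a made-up frame; the volume column is an assumption added for illustration and is not part of the article's data.

```python
import pandas as pd

# One float column and one integer column (invented values).
df = pd.DataFrame({"open": [10.0, 20.0], "volume": [100, 200]})

for index, row in df.iterrows():
    # row is a Series whose index holds the column names
    print(index, type(row).__name__, row["volume"])

first_index, first_row = next(df.iterrows())
print(df["volume"].dtype)   # integer dtype inside the frame
print(first_row.dtype)      # the row Series uses a single common dtype
```

When the original dtypes matter, DataFrame.itertuples() is an alternative that returns each row as a namedtuple instead of a Series.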
apply() method
The apply() method applies a function along the rows or columns of a DataFrame. The function is often passed inline as a lambda. The axis parameter, given after the lambda, tells pandas whether to apply the function to each row (axis=1) or to each column (axis=0).
The implementation code is as follows:
df.apply(lambda row: row['high'] / row['open'], axis=1)
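A short sketch of the two axis settings, using invented sample data. With axis=1 the lambda receives one row at a time; with axis=0 it receives one column at a time.

```python
import pandas as pd

df = pd.DataFrame({"open": [10.0, 20.0, 40.0], "high": [11.0, 25.0, 50.0]})

# axis=1: the function is called once per row (a Series indexed by column name).
rate = df.apply(lambda row: row["high"] / row["open"], axis=1)

# axis=0 (the default): the function is called once per column.
column_max = df.apply(lambda col: col.max(), axis=0)

print(rate)
print(column_max)
```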
Vectorization with pandas Series
The basic units of a pandas DataFrame and Series are built on arrays. Functions can therefore be vectorized, operating on the whole array at once instead of processing each value in turn. Pandas includes a rich library of vectorized functions, and we can pass an entire Series (column) as an argument so that the calculation covers the whole array.
The implementation code is as follows:
dftest4['rate'] = dftest4['high']/dftest4['open']
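One detail worth knowing about Series arithmetic is that it aligns operands by index label rather than by position. A minimal sketch with made-up data (the reversed column is constructed only to demonstrate the alignment):

```python
import pandas as pd

df = pd.DataFrame({"open": [10.0, 20.0, 40.0], "high": [11.0, 25.0, 50.0]})

# A single expression divides the two columns element by element.
df["rate"] = df["high"] / df["open"]

# Same labels in reversed order: the division still pairs values by label.
shuffled_open = df["open"].iloc[::-1]
aligned = df["high"] / shuffled_open

print(df["rate"])
print((aligned.sort_index() == df["rate"]).all())
```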
Vectorization with NumPy arrays
Because the vectorized function uses only the values of the Series, we can use the values attribute to convert each pandas Series to a NumPy array and pass the NumPy arrays as arguments, so the calculation covers the whole array.
The implementation code is as follows:
dftest5['rate'] = dftest5['high'].values/dftest5['open'].values
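The sketch below shows the conversion on invented sample data. It uses both the values attribute from the article and the to_numpy() method, which the pandas documentation recommends for obtaining a NumPy array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"open": [10.0, 20.0, 40.0], "high": [11.0, 25.0, 50.0]})

high = df["high"].values        # attribute access, no parentheses
open_ = df["open"].to_numpy()   # method form recommended by the pandas docs

print(type(high).__name__)      # a plain NumPy array with no index attached
df["rate"] = high / open_       # element-wise division purely by position
print(np.allclose(df["rate"], df["high"] / df["open"]))
```

Since the arrays carry no index, the division pairs values strictly by position, so both columns must already be in the same row order.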
Summary
The timeit method was used to measure the execution time of the traversal methods above. The measurements show that the plain loop is the slowest. iterrows() is optimized for pandas DataFrames and is significantly faster than the plain loop. The apply() method also loops over rows, but it is much more efficient than iterrows() because it uses a series of internal optimizations, such as iterators implemented in Cython. NumPy arrays are the fastest, followed by pandas Series. Because a vectorized operation acts on the whole sequence at once, it saves far more time than scalar operations. NumPy uses precompiled C code at the lower level and also avoids much of the overhead of pandas Series operations, such as index and data type handling. As a result, operations on NumPy arrays are considerably faster than operations on pandas Series.
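A comparison of this kind can be reproduced with a sketch like the one below. The data is synthetic, and the row count, random seed, and repeat count are arbitrary assumptions, so absolute timings will differ from machine to machine and between pandas versions.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; size and value ranges are assumptions for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "open": rng.uniform(1.0, 100.0, size=2_000),
    "high": rng.uniform(1.0, 100.0, size=2_000),
})

candidates = {
    "for loop + iloc": lambda: [df.iloc[i]["high"] / df.iloc[i]["open"] for i in range(len(df))],
    "iterrows": lambda: [row["high"] / row["open"] for _, row in df.iterrows()],
    "apply": lambda: df.apply(lambda row: row["high"] / row["open"], axis=1),
    "pandas Series": lambda: df["high"] / df["open"],
    "NumPy arrays": lambda: df["high"].values / df["open"].values,
}

timings = {}
for name, func in candidates.items():
    # number=3 keeps the slow variants tolerable; raise it for steadier numbers.
    timings[name] = timeit.timeit(func, number=3) / 3

# Print the methods from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:>16}: {seconds * 1e3:.3f} ms per run")
```

Increasing the number of rows usually widens the gap between the row-by-row methods and the vectorized ones, since the per-row Python overhead grows with the data while the vectorized cost stays comparatively small.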