There are two common ways to solve this problem, both based on existing solutions found on the Internet.
Scenario Description:
There is a data file, saved as plain text, with three columns: user_id, plan_id, mobile_id. The goal is to produce a new file containing only mobile_id and plan_id.
Solutions
Solution One: Open the files with Python's built-in open, loop over the data line by line with a for loop, and write the selected columns to the new file.
The code is as follows:
def readwrite1(input_file, output_file):
    f = open(input_file, 'r')
    out = open(output_file, 'w')
    for line in f.readlines():
        a = line.split(",")              # split each line on commas
        x = a[0] + "," + a[1] + "\n"     # keep two columns and rebuild the line
        out.writelines(x)
    f.close()
    out.close()
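As a side note, the same loop can be written with context managers so both files are closed automatically even if an exception is raised; this is just a sketch of the same approach (the name readwrite1_with is illustrative, not from the original post):

def readwrite1_with(input_file, output_file):
    # Same logic as readwrite1, but 'with' closes both files automatically.
    with open(input_file, 'r') as f, open(output_file, 'w') as out:
        for line in f:
            a = line.split(",")
            out.write(a[0] + "," + a[1] + "\n")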
Solution Two: Read the data into a pandas DataFrame, slice out the wanted columns, and write them to the new file directly with the DataFrame's write function.
The code is as follows:
import pandas as pd

def readwrite2(input_file, output_file):
    date_1 = pd.read_csv(input_file, header=0, sep=',')
    date_1[['mobile_id', 'plan_id']].to_csv(output_file, sep=',', header=True, index=False)
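For large files, pandas can also be told to parse only the needed columns while reading; a minimal sketch, assuming the header names mobile_id and plan_id from the scenario description (this variant is not part of the original post):

import pandas as pd

def readwrite2_usecols(input_file, output_file):
    # Parse only the two columns we need, then write them straight back out.
    df = pd.read_csv(input_file, sep=',', usecols=['mobile_id', 'plan_id'])
    df.to_csv(output_file, sep=',', index=False)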
From a code perspective, the pandas logic is clearer.
Let's take a look at the efficiency of execution!
import time

def getRunTimes(fun, input_file, output_file):
    begin_time = int(round(time.time() * 1000))
    fun(input_file, output_file)
    end_time = int(round(time.time() * 1000))
    print("Read and write run time:", (end_time - begin_time), "ms")

getRunTimes(readwrite1, input_file, output_file)    # process the data directly with the for loop
getRunTimes(readwrite2, input_file, output_file1)   # read and write the data with a DataFrame
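Incidentally, a monotonic clock such as time.perf_counter() is usually preferred for measuring elapsed time; here is a minimal alternative sketch (the name get_run_times_v2 is illustrative, not from the original post):

import time

def get_run_times_v2(fun, input_file, output_file):
    # perf_counter is monotonic and has higher resolution than time.time().
    start = time.perf_counter()
    fun(input_file, output_file)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print("Read and write run time:", int(round(elapsed_ms)), "ms")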
Read and write run time: 976 ms
Read and write run time: 777 ms
input_file contains about 270,000 rows of data, and the DataFrame version is already somewhat faster than the for loop. Will the difference be even more obvious with a larger amount of data?
Let's increase the number of records in input_file and run the test again; the results are below.
Input_file (W = 10,000 rows) | readwrite1 (ms) | readwrite2 (ms)
27W | 976 | 777
456 | 1989 | 1509
110W | 4312 | 3158
Judging from the test results above, the DataFrame approach improves efficiency by roughly 30% (for example, at 110W rows: (4312 - 3158) / 4312 ≈ 27%).