Search Tool
The first step after determining a task is to find a library to work with. XLRD, XLWT, Xlutils are listed on Python Excel, but
They're older, XLWT don't even support Excel after version 07
Their documents are not very friendly, may need to read the source code, and the elder sister's task is relatively tight, plus I was at the end of the period, there is no time to read the source code
After a search I found OPENPYXL, support 07+ Excel, has been maintained, the document is clear and easy to read, reference tutorial and API documentation will soon be able to get started, is it ~
Installation
This is easy, direct pip install OPENPYXL, OH hehe ~
Because I do not need to deal with the picture, there is no pillow.
Some considerations
The source file is about one in 1~2MB, which is smaller, so it can be read directly into memory processing.
Since is the processing Excel, moreover their entire group obviously is win under the work (the data all uses Excel to save = =, the business person ah ... ), this script is still done under win
This task does not require me to make changes to the existing files at all! I'm so embarrassed ... I just need to read, process, and write another file.
Learn to use
Well, just open cmd and use the Python shell to play this module to get started ... (Win does not install Ipython, embarrassing)
To do this little script basically I just need to import two things
From OPENPYXL import workbookfrom OPENPYXL import Load_workbook
Load_workbook as the name implies is to import files into memory, workbook is the most basic class, used in memory to create a file last written into the disk.
Work
First I need to import this file
INWB = Load_workbook (filename)
What you get is a workbook object.
Then I need to create a new file
OUTWB = Workbook ()
Then in this new file, create several worksheets with Create_sheet, such as
Careersheet = Outwb.create_sheet (0, ' career ')
A worksheet called career is inserted from the head (i.e. insert with Python list)
Next I need to iterate through each worksheet of the input file and do some work according to the table name (e.g. if the table name is not a number and I do not need to process it), OPENPYXL supports getting worksheets in the same way as dictionaries, and getting a table name for a workbook is Get_sheet_names
For SheetName in Inwb.get_sheet_names (): if not sheetname.isdigit (): continue sheet = Inwb[sheetname]
After the worksheet has been obtained, it is processed by columns and rows. OPENPYXL determines the number of rows and columns based on the area in which the table actually has data, and the methods for getting rows and columns are sheet.rows and sheet.columns, which can all be used as lists. For example, if I want to skip a table with less than 2 columns of data, you can write
If Len (Sheet.columns) < 2: continue
If I want to get the first two columns of this worksheet, you can write
ColA, ColB = Sheet.columns[:2]
In addition to using columns and rows to get the list of worksheets, you can also use Excel's cell encoding to get an area, such as
Cells = sheet[' A1 ': ' B20 ']
A bit like Excel's own function, you can pull out a two-dimensional area ~
For ease of handling, encountered a worksheet without column C, I want to create a column and a columns such as the empty C column out, then I can use Sheet.cell this method, by passing in the cell number and add null value to create a new column.
Alen = Len (ColA) for I in range (1, Alen + 1): Sheet.cell (' c%s '% (i)). Value = None
Note: Excel cell naming is starting from 1 ~
The above code is also shown, to get the value of the cell is Cell.value (can be left or right), its type can be a string, floating-point number, Integer, or Time (datetime.datetime), the Excel file will also generate the corresponding type of data.
Once you get the value of each cell, you can do it. ~OPENPYXL automatically encodes the string in Unicode, so the string is Unicode type.
In addition to using Cell.value to modify values one by one, you can also append a row to the work table
Sheet.append (Stra, Dateb, NUMC)
Finally, wait for the new file to be written and save it directly with Workbook.save.
Outwb.save ("Test.xlsx")
This overwrites the current file, even the one you read to the memory.
Some places to watch out for
To get the subscript of the current cell in this Column object when traversing each cell of a column
For IDX, cell in Enumerate (ColA): # do something ...
To prevent the acquisition of data at both ends of the invisible space (a very common pit in Excel files), remember strip ()
If the cell in the worksheet has no data, OPENPYXL will let it have a value of none, so if you want to do it based on the value of the cell, you can't pre-assume its type, the most useful
If not cell.value continue
Such statements to judge
If there are many noise in the Excel file you want to work with, such as when you expect a cell to be time, some of the table's data may be a string, which can be used
If Isinstance (Cell.value, Unicode): Break
such as the statement processing.
Win under the CMD does not seem very good set with the Utf-8 code page, if it is Simplified Chinese words can be used 936 (GBK), print will automatically convert from Unicode to GBK output to the terminal.
Some small functions to help with Chinese problems
The table I work with has some characters that go beyond the GBK range, and when I need to print out some information to monitor the progress of the process is very troublesome, fortunately they can be ignored, I directly with the space to replace the print also line, so add some I would have to replace the delimiter, I can:
# Annoying Seperatorsdot = U ' \u00b7 ' dash = U ' \u2014 ' emph = U ' \u2022 ' dot2 = U ' \u2027 ' seps = (U '. ', Dot, dash, emph, Dot2) de F get_clean_ch_string (chstring): "" " Remove annoying seperators from the Chinese string. Usage: cleanstring = get_clean_ch_string (chstring) "" " cleanstring = chstring for Sep in SEPs: cleanstring = Cleanstring.replace (Sep, u ") return cleanstring
In addition, I have a need to name English names in English, English, Chinese and Chinese names.
First I need to be able to split the English and Chinese, my approach is to use regular matching, according to common Chinese and English characters in the range of Unicode to set. The regular pattern matching English and Chinese is as follows:
# regex pattern matching all ascii charactersasciipattern = ur ' [%s]+ '% '. Join (Chr (i) for I in range (+ 127)) # regex Pat Tern matching all common Chinese characters and seporatorschinesepattern = ur ' [\u4e00-\u9fff.%s]+ '% (". Join (SEPs))
English is replaced with the range of ASCII printable characters, the range of common Chinese characters is \u4e00-\u9fff, and that SEPs is a number of characters above the GBK range mentioned above. In addition to the simple division, I also need to deal with only Chinese name without English name, only English name no Chinese name, etc., the logic of the judgment is as follows:
def split_name (name): "" " split [中文版 name, Chinese name]. If one of them is missing, None would be returned instead. Usage: engname, chname = split_name (name) "" " matches = Re.match (' (%s) (%s) '% (Asciipattern, Chinesepattern), name) if matches: # 中文版 name + Chinese name return Matches.group (1). Strip (), Matches.group ( 2). Strip () else: matches = Re.findall (' (%s) '% (Chinesepattern), name) matches = '. Join (matches). Strip ( if matches: # Chinese name only return None, Matches else: # 中文版 name only matches = Re.findall ( ' (%s) '% (Asciipattern), name) return '. Join (matches). Strip (), None
After getting the Chinese name, I need to split into a surname and first name, because the task requires not to split the name is very clear, I will follow the common Chinese name division method to divide--two characters or three words of the first word is the last name, four characters of two words is the last name, The delimiter name (minority name) delimiter is preceded by a last name (where the previous get_clean_ch_string function is used to remove the delimiter), and the name is longer without a separator, assuming the entire string is a name. (Note that the first name in English refers to the first name, last name refers to the surname, 2333)
def split_ch_name (chname): "" "split the Chinese name into first name and last name. * If The name is XY or XYZ, X would be returned as the last name. * If The name is WXYZ, WX would be returned as the last name. * If the name is ... WXYZ, the whole name is returned as the last name. * If the name is.. ABC * XYZ ..., the part before the seperator is returned as the last name. Usage: chfirstname, chlastname = Split_ch_name (chname) "" " If Len (Chname) < 4: # XY or XYZ Chlastname = chname[0] chfirstname = chname[1:] elif len (chname) = = 4: # wxyz chlastname = chname[:2]
chfirstname = chname[2:] Else: # longer cleanname = get_clean_ch_string (chname) nameparts = Cleanname.split () print U '. Join (nameparts) if Len (Nameparts) < 2: # ... WXYZ return None, Nameparts[0] chlastname, chfirstname = Nameparts[:2] #. ABC * XYZ ... Return Chfirstname, Chlastname
Division English name is very simple, the space is divided, the first part is the name, the second part is the surname, other circumstances for the time being on the line.