Search tools
The first step in tackling a task is finding a handy library to work with. Searching for Python Excel libraries turns up xlrd, xlwt, and xlutils, but
they are older, and xlwt does not even support the Excel formats introduced with Excel 2007.
Their documentation is not very friendly, and I might have had to read the source code. My senior's task was on a fairly tight schedule, and I was at the end of term myself, so I had no time for that.
After another search I found openpyxl. It supports Excel 2007 and later, is actively maintained, and its documentation is clear and easy to read; with the tutorial and the API docs you can get started quickly. That settled it.
Installation
This is easy: just run pip install openpyxl.
Because I do not need to handle images, I did not install pillow.
Some considerations
Each source file is about 1 to 2 MB, which is small, so it can be read straight into memory for processing.
Since this is about Excel, and their whole group obviously works on Windows (all the data is stored in Excel... business people), this script was written and run on Windows too.
This task does not require me to change the existing files at all. All I have to do is read, process, and write to another file.
Learning to use it
Just open cmd and play with the module in the Python shell to get started (I had not installed IPython on Windows).
For this little script I basically only need to import two things
from openpyxl import Workbook
from openpyxl import load_workbook
load_workbook, as the name suggests, loads a file into memory. Workbook is the most basic class; it is used to build a file in memory that is finally written to disk.
Work
First I need to load the input file
inwb = load_workbook(filename)
This gives a workbook object.
Then I need to create a new file
outwb = Workbook()
Then, in this new file, use create_sheet to create a few new worksheets, such as
careersheet = outwb.create_sheet(0, 'career')
This inserts a worksheet called career at the head (internally it uses a Python list insert).
Next I need to walk through each worksheet of the input file and decide what to do based on the sheet name (for example, if the sheet name is not a number, I do not need to process it). openpyxl supports getting a worksheet by name the same way you look up a key in a dictionary, and the method for getting a workbook's sheet names is get_sheet_names
for sheetname in inwb.get_sheet_names():
    if not sheetname.isdigit():
        continue
    sheet = inwb[sheetname]
Once you have the worksheet, you work through it by columns and rows. openpyxl determines the number of rows and columns from the area of the worksheet that actually contains data. Rows and columns are obtained through sheet.rows and sheet.columns, which can be used like lists. For example, to skip a sheet with fewer than 2 columns of data inside the loop, you can write
if len(sheet.columns) < 2:
    continue
If I want the first two columns of the worksheet, I can write
cola, colb = sheet.columns[:2]
Besides using columns and rows, you can also use Excel cell coordinates to get an area, such as
cells = sheet['A1':'B20']
A bit like Excel's own functions, this pulls out a two-dimensional area.
To make processing easier, when I run into a worksheet without a column C, I want to create an empty column C that is as long as column A. I can do this with the sheet.cell method, passing in the cell coordinate and assigning an empty value to create the new column.
alen = len(cola)
for i in range(1, alen + 1):
    sheet.cell('C%s' % i).value = None
Note: Excel cell numbering starts at 1.
The code above also shows that a cell's value is accessed through cell.value (it can be both read and assigned). Its type can be a string, a float, an integer, or a time (datetime.datetime), and data of the corresponding type is generated in the Excel file.
Once you have each cell's value, you can process it. openpyxl automatically decodes strings to Unicode, so all strings are of the unicode type.
Besides modifying values one at a time with cell.value, you can also append a whole row to the worksheet by passing the values as a list
sheet.append([stra, dateb, numc])
Finally, when the new file is ready, just save it with Workbook.save.
Saving overwrites any existing file with the same name, even the one you loaded into memory earlier.
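A side note for anyone reading this with a newer openpyxl: as far as I know, later releases changed several of the calls used above (get_sheet_names() became the sheetnames property, create_sheet takes the title first, sheet.columns is no longer a list, and sheet.cell expects row and column numbers). Below is a minimal sketch of the same read, filter, append, save flow under that assumption, written for Python 3. The names wanted_sheet_titles and copy_first_two_columns and both path parameters are made up for illustration; this is not my actual script.

```python
def wanted_sheet_titles(titles):
    """Keep only the sheet titles that consist solely of digits."""
    return [title for title in titles if title.isdigit()]


def copy_first_two_columns(src_path, dst_path):
    """Copy columns A and B of every digit-named sheet into one new workbook.

    Assumes a recent openpyxl (third-party: pip install openpyxl). It is
    imported here so the helper above stays usable without it.
    """
    from openpyxl import Workbook, load_workbook

    inwb = load_workbook(src_path)
    outwb = Workbook()
    # Newer signature: title first, index second (keywords avoid confusion).
    outsheet = outwb.create_sheet(title='career', index=0)

    for title in wanted_sheet_titles(inwb.sheetnames):
        sheet = inwb[title]
        if sheet.max_column < 2:      # skip sheets with fewer than 2 columns
            continue
        # values_only=True yields plain values instead of Cell objects.
        for a, b in sheet.iter_rows(min_col=1, max_col=2, values_only=True):
            outsheet.append([title, a, b])   # append takes one iterable per row

    outwb.save(dst_path)              # overwrites dst_path if it exists
```

Calling copy_first_two_columns('in.xlsx', 'out.xlsx') with two different paths keeps the input file untouched, which is what this task needs.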
Some things to watch out for
If you want the index of the current cell within the column object while traversing each cell in a column, use enumerate
for idx, cell in enumerate(cola):
    # do something ...
To guard against invisible spaces at both ends of the data (a common pitfall in Excel files), remember to call strip().
If a cell in the worksheet has no data, openpyxl sets its value to None, so if you want to act on the cell's value you cannot assume its type. It is best to use
if not cell.value:
    continue
or something similar, to check first.
If the Excel file you are processing has a lot of noise, for example when you expect a cell to hold a time but some sheets have strings there instead, you can use
if isinstance(cell.value, unicode):
    break
or a similar statement to handle it.
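The three precautions above (strip, the None check, and the type check) can be folded into a couple of small helpers. This is only a sketch for Python 3, where the text type is str rather than unicode; clean_text and as_datetime are hypothetical names, not functions from my script.

```python
import datetime


def clean_text(value):
    """Strip surrounding whitespace from strings and map blanks to None.

    Non-string values (numbers, datetimes, None) are returned unchanged.
    """
    if isinstance(value, str):
        value = value.strip()
        return value or None
    return value


def as_datetime(value):
    """Return the value only if it really is a datetime, otherwise None."""
    return value if isinstance(value, datetime.datetime) else None
```

One thing to keep in mind: a plain `if not cell.value` test also skips cells holding 0 or an empty string, so compare with `is None` if zeros matter in your data.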
On Windows, cmd does not seem to cope well with setting the code page to UTF-8. For Simplified Chinese you can use code page 936 (GBK); print then converts automatically from Unicode to GBK when writing to the terminal.
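If a string contains characters that GBK cannot represent, that automatic conversion fails. One generic way around it, sketched here for Python 3, is to encode with errors='replace' so that anything outside GBK becomes a question mark. gbk_safe is a hypothetical helper name, and this is an alternative I am assuming would suit a progress log, not what my script does.

```python
def gbk_safe(text):
    """Return text with every character GBK cannot encode replaced by '?'."""
    return text.encode('gbk', errors='replace').decode('gbk')
```

The result can then be passed to print on a code page 936 console without raising an encoding error.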
Some small functions to help with Chinese text
Some of the sheets I was processing contain characters outside the GBK range, which is troublesome when I need to print information to monitor progress. These characters can be ignored, though, so I simply replace them with spaces before printing. Adding a few separators that I would have to replace anyway, I can write:
# annoying separators
dot = u'\u00b7'
dash = u'\u2014'
emph = u'\u2022'
dot2 = u'\u2027'

seps = (u'.', dot, dash, emph, dot2)

def get_clean_ch_string(chstring):
    """Remove annoying separators from the Chinese string.

    Usage:
        cleanstring = get_clean_ch_string(chstring)
    """
    cleanstring = chstring
    for sep in seps:
        # replace each separator with a space
        cleanstring = cleanstring.replace(sep, u' ')
    return cleanstring
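For Python 3, the same cleanup can be written as a single str.translate call with a table that maps each separator to a space. This is a sketch of an equivalent, not the code I ran; SEPARATOR_TABLE and clean_separators are names I made up here.

```python
# The same five separators as above, each mapped to a space.
SEPARATOR_TABLE = {ord(sep): u' ' for sep in u'.\u00b7\u2014\u2022\u2027'}


def clean_separators(text):
    """Replace every known separator in text with a space."""
    return text.translate(SEPARATOR_TABLE)
```

Because the separators become spaces, calling split() on the result yields the individual name parts.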
In addition, I had one more requirement: turn "English name[space]Chinese name" into English last name, English first name, Chinese last name, and Chinese first name.
First I need to be able to separate the English part from the Chinese part. My approach is regular-expression matching based on the Unicode ranges of common English and Chinese characters. The patterns are as follows:
import re

# regex pattern matching all ASCII characters
asciipattern = ur'[%s]+' % ''.join(chr(i) for i in range(127))
# regex pattern matching all common Chinese characters and separators
chinesepattern = ur'[\u4e00-\u9fff.%s]+' % (''.join(seps))
English is matched using the ASCII character range, which covers the printable characters; the range of common Chinese characters is \u4e00-\u9fff; and seps holds the separators mentioned earlier, some of which are outside the GBK range. Beyond the simple split, I also need to handle names that have only a Chinese part or only an English part. The logic is as follows:
def split_name(name):
    """Split [English name, Chinese name].

    If one of them is missing, None will be returned instead.

    Usage:
        engname, chname = split_name(name)
    """
    matches = re.match('(%s)(%s)' % (asciipattern, chinesepattern), name)
    if matches:  # English name + Chinese name
        return matches.group(1).strip(), matches.group(2).strip()
    else:
        matches = re.findall('(%s)' % (chinesepattern), name)
        matches = ''.join(matches).strip()
        if matches:  # Chinese name only
            return None, matches
        else:  # English name only
            matches = re.findall('(%s)' % (asciipattern), name)
            return ''.join(matches).strip(), None
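Here is a sketch of the same splitting logic for Python 3, with the character classes written as explicit ranges instead of being built from chr(). I am assuming printable ASCII (\x20 to \x7e) is what the English part needs; ASCII_PATTERN and CHINESE_PATTERN are illustrative names.

```python
import re

# Printable ASCII, including the space between first and last name.
ASCII_PATTERN = r'[\x20-\x7e]+'
# Common Chinese characters plus the separators used in minority names.
CHINESE_PATTERN = '[\u4e00-\u9fff.\u00b7\u2014\u2022\u2027]+'


def split_name(name):
    """Return (english_name, chinese_name); a missing part is None."""
    both = re.match('(%s)(%s)' % (ASCII_PATTERN, CHINESE_PATTERN), name)
    if both:  # English name followed by Chinese name
        return both.group(1).strip(), both.group(2).strip()
    chinese = ''.join(re.findall(CHINESE_PATTERN, name)).strip()
    if chinese:  # Chinese name only
        return None, chinese
    english = ''.join(re.findall(ASCII_PATTERN, name)).strip()
    return english, None  # English name only
```

A caveat that applies to this approach in general: the period belongs to both character classes, so an English-only name containing a period (an initial, for example) would be misread. For data like mine that did not matter, but it is worth checking on yours.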
After getting the Chinese name, I need to split it into last name and first name. Since the task does not require the split to be very precise, I follow the common conventions of Chinese names: for two or three characters, the first character is the last name; for four characters, the first two characters are the last name; for names with a separator (ethnic-minority names), the part before the separator is the last name (the earlier get_clean_ch_string function is used here to remove the separator); and for longer names without a separator, the whole string is kept as one name and returned as the last name. (Note that in English, first name means the given name and last name means the surname, haha.)
def split_ch_name(chname):
    """Split the Chinese name into first name and last name.

    * If the name is XY or XYZ, X will be returned as the last name.
    * If the name is WXYZ, WX will be returned as the last name.
    * If the name is ...WXYZ, the whole name will be returned
      as the last name.
    * If the name is ..ABC * XYZ..., the part before the separator
      will be returned as the last name.

    Usage:
        chfirstname, chlastname = split_ch_name(chname)
    """
    if len(chname) < 4:  # XY or XYZ
        chlastname = chname[0]
        chfirstname = chname[1:]
    elif len(chname) == 4:  # WXYZ
        chlastname = chname[:2]
        chfirstname = chname[2:]
    else:  # longer
        cleanname = get_clean_ch_string(chname)
        nameparts = cleanname.split()
        print u' '.join(nameparts)
        if len(nameparts) < 2:  # ...WXYZ
            return None, nameparts[0]
        chlastname, chfirstname = nameparts[:2]  # ..ABC * XYZ...
    return chfirstname, chlastname
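To make the return order concrete (first name first, last name second), here is a compact Python 3 sketch of the same rules with the progress print left out. The SEPARATORS table is my own shorthand for the separator cleanup, not a name from the script.

```python
# Separators seen in minority names, each mapped to a space.
SEPARATORS = {ord(c): ' ' for c in '.\u00b7\u2014\u2022\u2027'}


def split_ch_name(chname):
    """Return (first_name, last_name) for a Chinese name."""
    if len(chname) < 4:               # XY or XYZ: first character is the last name
        return chname[1:], chname[:1]
    if len(chname) == 4:              # WXYZ: first two characters are the last name
        return chname[2:], chname[:2]
    parts = chname.translate(SEPARATORS).split()
    if len(parts) < 2:                # long name without a separator
        return None, (parts[0] if parts else None)
    last, first = parts[:2]           # the part before the separator is the last name
    return first, last
```

Note that a four-character string is always split two and two, even if one of the four characters is a separator; that matches the rules above.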
Splitting the English name is very simple: split on the space, and the first part is the first name and the second part is the last name. Other cases are ignored for now.
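As a sketch of that last step (split_en_name is a hypothetical name, and returning None for a missing part is my assumption, chosen to match how split_name reports missing parts):

```python
def split_en_name(engname):
    """Return (first_name, last_name) from 'First Last'.

    Anything after the second word is ignored; a missing part is None.
    """
    parts = engname.split()
    if not parts:
        return None, None
    if len(parts) == 1:
        return parts[0], None
    return parts[0], parts[1]
```

For example, split_en_name('John Smith') gives ('John', 'Smith'), and a single-word name comes back as the first name with None as the last name.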