Using Python + OPENPYXL to process excel2007 document ideas and insights

Source: Internet
Author: User
Tags printable characters python list
Search Tool

The first step after determining a task is to find a library to work with. XLRD, XLWT, Xlutils are listed on Python Excel, but

They're older, XLWT don't even support Excel after version 07
Their documents are not very friendly, may need to read the source code, and the elder sister's task is relatively tight, plus I was at the end of the period, there is no time to read the source code
After a search I found OPENPYXL, support 07+ Excel, has been maintained, the document is clear and easy to read, reference tutorial and API documentation will soon be able to get started, is it ~

Installation

This is easy, direct pip install OPENPYXL, OH hehe ~

Because I do not need to deal with the picture, there is no pillow.

Some considerations

The source file is about one in 1~2MB, which is smaller, so it can be read directly into memory processing.
Since is the processing Excel, moreover their entire group obviously is win under the work (the data all uses Excel to save = =, the business person ah ... ), this script is still done under win
This task does not require me to make changes to the existing files at all! I'm so embarrassed ... I just need to read, process, and write another file.

Learn to use

Well, just open cmd and use the Python shell to play this module to get started ... (Win does not install Ipython, embarrassing)

To do this little script basically I just need to import two things

From OPENPYXL import workbookfrom OPENPYXL import Load_workbook

Load_workbook as the name implies is to import files into memory, workbook is the most basic class, used in memory to create a file last written into the disk.

Work

First I need to import this file

INWB = Load_workbook (filename)

What you get is a workbook object.

Then I need to create a new file

OUTWB = Workbook ()

Then in this new file, create several worksheets with Create_sheet, such as

Careersheet = Outwb.create_sheet (0, ' career ')

A worksheet called career is inserted from the head (i.e. insert with Python list)

Next I need to iterate through each worksheet of the input file and do some work according to the table name (e.g. if the table name is not a number and I do not need to process it), OPENPYXL supports getting worksheets in the same way as dictionaries, and getting a table name for a workbook is Get_sheet_names

For SheetName in Inwb.get_sheet_names ():  if not sheetname.isdigit ():    continue  sheet = Inwb[sheetname]

After the worksheet has been obtained, it is processed by columns and rows. OPENPYXL determines the number of rows and columns based on the area in which the table actually has data, and the methods for getting rows and columns are sheet.rows and sheet.columns, which can all be used as lists. For example, if I want to skip a table with less than 2 columns of data, you can write

If Len (Sheet.columns) < 2:  continue

If I want to get the first two columns of this worksheet, you can write

ColA, ColB = Sheet.columns[:2]

In addition to using columns and rows to get the list of worksheets, you can also use Excel's cell encoding to get an area, such as

Cells = sheet[' A1 ': ' B20 ']

A bit like Excel's own function, you can pull out a two-dimensional area ~

For ease of handling, encountered a worksheet without column C, I want to create a column and a columns such as the empty C column out, then I can use Sheet.cell this method, by passing in the cell number and add null value to create a new column.

Alen = Len (ColA) for I in range (1, Alen + 1):  Sheet.cell (' c%s '% (i)). Value = None

Note: Excel cell naming is starting from 1 ~

The above code is also shown, to get the value of the cell is Cell.value (can be left or right), its type can be a string, floating-point number, Integer, or Time (datetime.datetime), the Excel file will also generate the corresponding type of data.

Once you get the value of each cell, you can do it. ~OPENPYXL automatically encodes the string in Unicode, so the string is Unicode type.

In addition to using Cell.value to modify values one by one, you can also append a row to the work table

Sheet.append (Stra, Dateb, NUMC)

Finally, wait for the new file to be written and save it directly with Workbook.save.

Outwb.save ("Test.xlsx")

This overwrites the current file, even the one you read to the memory.

Some places to watch out for
To get the subscript of the current cell in this Column object when traversing each cell of a column

For IDX, cell in Enumerate (ColA):  # do something ...

To prevent the acquisition of data at both ends of the invisible space (a very common pit in Excel files), remember strip ()

If the cell in the worksheet has no data, OPENPYXL will let it have a value of none, so if you want to do it based on the value of the cell, you can't pre-assume its type, the most useful

If not cell.value  continue

Such statements to judge

If there are many noise in the Excel file you want to work with, such as when you expect a cell to be time, some of the table's data may be a string, which can be used

If Isinstance (Cell.value, Unicode): Break  

such as the statement processing.

Win under the CMD does not seem very good set with the Utf-8 code page, if it is Simplified Chinese words can be used 936 (GBK), print will automatically convert from Unicode to GBK output to the terminal.

Some small functions to help with Chinese problems
The table I work with has some characters that go beyond the GBK range, and when I need to print out some information to monitor the progress of the process is very troublesome, fortunately they can be ignored, I directly with the space to replace the print also line, so add some I would have to replace the delimiter, I can:

# Annoying Seperatorsdot = U ' \u00b7 ' dash = U ' \u2014 ' emph = U ' \u2022 ' dot2 = U ' \u2027 ' seps = (U '. ', Dot, dash, emph, Dot2) de F get_clean_ch_string (chstring): "" "  Remove annoying seperators from the Chinese string.  Usage:    cleanstring = get_clean_ch_string (chstring) ""  "  cleanstring = chstring for  Sep in SEPs:    cleanstring = Cleanstring.replace (Sep, u ")  return cleanstring


In addition, I have a need to name English names in English, English, Chinese and Chinese names.

First I need to be able to split the English and Chinese, my approach is to use regular matching, according to common Chinese and English characters in the range of Unicode to set. The regular pattern matching English and Chinese is as follows:

# regex pattern matching all ascii charactersasciipattern = ur ' [%s]+ '% '. Join (Chr (i) for I in range (+ 127)) # regex Pat Tern matching all common Chinese characters and seporatorschinesepattern = ur ' [\u4e00-\u9fff.%s]+ '% (". Join (SEPs))

English is replaced with the range of ASCII printable characters, the range of common Chinese characters is \u4e00-\u9fff, and that SEPs is a number of characters above the GBK range mentioned above. In addition to the simple division, I also need to deal with only Chinese name without English name, only English name no Chinese name, etc., the logic of the judgment is as follows:

def split_name (name): "" "  split [中文版 name, Chinese name].    If one of them is missing, None would be returned instead.  Usage:    engname, chname = split_name (name) "" "  matches = Re.match (' (%s) (%s) '% (Asciipattern, Chinesepattern), name)  if matches: # 中文版 name + Chinese name    return Matches.group (1). Strip (), Matches.group ( 2). Strip ()  else:    matches = Re.findall (' (%s) '% (Chinesepattern), name)    matches = '. Join (matches). Strip (    if matches: # Chinese name only      return None, Matches    else: # 中文版 name only      matches = Re.findall ( ' (%s) '% (Asciipattern), name)      return '. Join (matches). Strip (), None

After getting the Chinese name, I need to split into a surname and first name, because the task requires not to split the name is very clear, I will follow the common Chinese name division method to divide--two characters or three words of the first word is the last name, four characters of two words is the last name, The delimiter name (minority name) delimiter is preceded by a last name (where the previous get_clean_ch_string function is used to remove the delimiter), and the name is longer without a separator, assuming the entire string is a name. (Note that the first name in English refers to the first name, last name refers to the surname, 2333)

def split_ch_name (chname): "" "split the Chinese name into first name and last  name.    * If The name is XY or XYZ, X would be returned as the last name.    * If The name is WXYZ, WX would be returned as the last name.    * If the name is ... WXYZ, the whole name is returned as the last     name.    * If the name is.. ABC * XYZ ..., the part before the seperator is returned as the last     name.  Usage:    chfirstname, chlastname = Split_ch_name (chname) "" "  If Len (Chname) < 4: # XY or XYZ    Chlastname = chname[0]    chfirstname = chname[1:]  elif len (chname) = = 4: # wxyz    chlastname = chname[:2]
  
   chfirstname = chname[2:]  Else: # longer    cleanname = get_clean_ch_string (chname)    nameparts = Cleanname.split ()    print U '. Join (nameparts)    if Len (Nameparts) < 2: # ... WXYZ      return None, Nameparts[0]    chlastname, chfirstname = Nameparts[:2] #. ABC * XYZ ...  Return Chfirstname, Chlastname
  

Division English name is very simple, the space is divided, the first part is the name, the second part is the surname, other circumstances for the time being on the line.

  • Related Article

    Contact Us

    The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

    If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

    A Free Trial That Lets You Build Big!

    Start building with 50+ products and up to 12 months usage for Elastic Compute Service

    • Sales Support

      1 on 1 presale consultation

    • After-Sales Support

      24/7 Technical Support 6 Free Tickets per Quarter Faster Response

    • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.