Use Perl to read Excel files

Source: Internet
Author: User

1. Task

To implement some mechanical word segmentation algorithms, we prepare to use the word segmentation vocabulary of the "National Language Commission corpus". The word segmentation file downloaded online is an Excel file. The task in this article is to use Perl to extract all words from the Execl file.

The Word Table file format is as follows:

The expected word is located in all cells in column B starting from row 8th. There are a total of 14629 words. (PS: The word segmentation vocabulary of the corpus contains more than 80 thousand words, but it is downloaded online to more than 50 words that appear more than 50 times ).

2. modules used

After reading some blog posts, we found that the Spreadsheet: ParseExcel module of PERL supports Excel reading.

3. How to download the module (Strawberry PERL on windows xp)

Enter cpan Spreadsheet: ParseExcel in the command line to automatically install the tool.

After the installation is complete, enter perldoc Spreadsheet: ParseExcel to check whether the installation is successful. (If the installation fails, the installation will fail)

4. Sample Code

It seems that the sample code of perldoc is very laborious to read. It is better to go directly to the cpan website to see the sample code or download the sample code of the module.

Log on to the cpan website at http://search.cpan.org/and check the spreadsheet::parseexcel.pdf on its homepage.
Http://search.cpan.org /~ Jmcnamara/Spreadsheet-ParseExcel-0.59/lib/Spreadsheet/ParseExcel. pm # NAME
The sample code and some explanations can be found above. The sample code is as follows: traverse all the worksheet and traverse the cells under each worksheet.

#! /Usr/bin/perl-w

Use strict;
Use Spreadsheet: ParseExcel;

My $ parser = Spreadsheet: ParseExcel-> new ();
My $ workbook = $ parser-> parse('Book1.xls ');

If (! Defined $ workbook ){
Die $ parser-> error (), ". \ n ";
}

For my $ Worksheet ($ workbook-> worksheets () {My ($ row_min, $ row_max) = $ worksheet-> row_range ();
My ($ col_min, $ col_max) = $ worksheet-> col_range (); for my $ row ($ row_min .. $ row_max ){
For my $ Col ($ col_min... $ col_max ){
My $ cell = $ worksheet-> get_cell ($ row, $ col );
Next unless $ cell;
Print "Row, Col = ($ row, $ col) \ n ";
Print "Value =", $ cell-> value (), "\ n ";
Print "Unformatted =", $ cell-> unformatted (), "\ n ";
Print "\ n ";
}
}
}

In addition, you can find the module file on the webpage:

Http://search.cpan.org/CPAN/authors/id/J/JM/JMCNAMARA/Spreadsheet-ParseExcel-0.59.tar.gz

This compressed package contains sample code of many modules.

5. Read the sample file

Create an excel file with only four rows and one column:

Use the previous sample code to replace 'book1.xls 'with the target file name. Chinese garbled characters are displayed.

According to the data on the Internet, the character encoding of excel is unicode, which is generally solved using the following code:

My $ formatter = Spreadsheet: ParseExcel: FmtUnicode-> new (Unicode_Map => "CP936 ");
My $ workbook = $ parser-> parse('example.xls ', $ formatter );

The complete code is as follows:

#! /Usr/bin/perl-w

Use Spreadsheet: ParseExcel;
Use Spreadsheet: ParseExcel: FmtUnicode;

My $ parser = Spreadsheet: ParseExcel-> new ();
My $ formatter = Spreadsheet: ParseExcel: FmtUnicode-> new (Unicode_Map => "CP936 ");
My $ workbook = $ parser-> parse('example.xls ', $ formatter );

If (! Defined $ workbook ){
Die $ parser-> error (), ". \ n ";
}

For my $ worksheet ($ workbook-> worksheets ()){

My ($ row_min, $ row_max) = $ worksheet-> row_range ();
My ($ col_min, $ col_max) = $ worksheet-> col_range ();

For my $ row ($ row_min .. $ row_max ){
For my $ col ($ col_min... $ col_max ){
My $ cell = $ worksheet-> get_cell ($ row, $ col );
Next unless $ cell;
Print "Row, Col = ($ row, $ col) \ n ";
Print "Value =", $ cell-> value (), "\ n ";
Print "\ n ";
}
}
}
<STDIN>;

Note that the following modules must be installed:

Spreadsheet: ParseExcel: first installed.
Unicode: Map: This must be installed for character encoding.
IO-stringy: This is already installed. It is unclear.
OLE-Storage_Lite: This is the package needed to access the office suite. The Spreadsheet: ParseExcel file is installed.

Here you only need to install UNICODE: map.

The above Code shows that the result is normal:

We can see that the subscript of the row and column cells starts from 0.

6. Task implementation

The row of a word starts from 8th rows (subscript: 7), and all columns are in 2nd columns (subscript: 1 ). In this way, the code is slightly modified, so that $ row_min = 7 and $ col_min = $ col_max = 1. Modify the target file name to 'corpuswordlist.xls '. The output result is as follows:

From 7th rows to 14635 rows, just 14629 rows.

7. File

/Files/pangxiaodong/learningperl/perlreader to obtain exceldictionary file. Zip

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.