Use Perl to read Excel files

Last Update:2018-12-05 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

1. Task

To implement some mechanical word segmentation algorithms, we prepare to use the word segmentation vocabulary of the "National Language Commission corpus". The word segmentation file downloaded online is an Excel file. The task in this article is to use Perl to extract all words from the Execl file.

The Word Table file format is as follows:

The expected word is located in all cells in column B starting from row 8th. There are a total of 14629 words. (PS: The word segmentation vocabulary of the corpus contains more than 80 thousand words, but it is downloaded online to more than 50 words that appear more than 50 times ).

2. modules used

After reading some blog posts, we found that the Spreadsheet: ParseExcel module of PERL supports Excel reading.

3. How to download the module (Strawberry PERL on windows xp)

Enter cpan Spreadsheet: ParseExcel in the command line to automatically install the tool.

After the installation is complete, enter perldoc Spreadsheet: ParseExcel to check whether the installation is successful. (If the installation fails, the installation will fail)

4. Sample Code

It seems that the sample code of perldoc is very laborious to read. It is better to go directly to the cpan website to see the sample code or download the sample code of the module.

Log on to the cpan website at http://search.cpan.org/and check the spreadsheet::parseexcel.pdf on its homepage.
Http://search.cpan.org /~ Jmcnamara/Spreadsheet-ParseExcel-0.59/lib/Spreadsheet/ParseExcel. pm # NAME
The sample code and some explanations can be found above. The sample code is as follows: traverse all the worksheet and traverse the cells under each worksheet.

#! /Usr/bin/perl-w

Use strict;
Use Spreadsheet: ParseExcel;

My $ parser = Spreadsheet: ParseExcel-> new ();
My $ workbook = $ parser-> parse('Book1.xls ');

If (! Defined $ workbook ){
Die $ parser-> error (), ". \ n ";
}

For my $ Worksheet ($ workbook-> worksheets () {My ($ row_min, $ row_max) = $ worksheet-> row_range ();
My ($ col_min, $ col_max) = $ worksheet-> col_range (); for my $ row ($ row_min .. $ row_max ){
For my $ Col ($ col_min... $ col_max ){
My $ cell = $ worksheet-> get_cell ($ row, $ col );
Next unless $ cell;
Print "Row, Col = ($ row, $ col) \ n ";
Print "Value =", $ cell-> value (), "\ n ";
Print "Unformatted =", $ cell-> unformatted (), "\ n ";
Print "\ n ";
}
}
}

In addition, you can find the module file on the webpage:

Http://search.cpan.org/CPAN/authors/id/J/JM/JMCNAMARA/Spreadsheet-ParseExcel-0.59.tar.gz

This compressed package contains sample code of many modules.

5. Read the sample file

Create an excel file with only four rows and one column:

Use the previous sample code to replace 'book1.xls 'with the target file name. Chinese garbled characters are displayed.

According to the data on the Internet, the character encoding of excel is unicode, which is generally solved using the following code:

My $ formatter = Spreadsheet: ParseExcel: FmtUnicode-> new (Unicode_Map => "CP936 ");
My $ workbook = $ parser-> parse('example.xls ', $ formatter );

The complete code is as follows:

#! /Usr/bin/perl-w

Use Spreadsheet: ParseExcel;
Use Spreadsheet: ParseExcel: FmtUnicode;

My $ parser = Spreadsheet: ParseExcel-> new ();
My $ formatter = Spreadsheet: ParseExcel: FmtUnicode-> new (Unicode_Map => "CP936 ");
My $ workbook = $ parser-> parse('example.xls ', $ formatter );

If (! Defined $ workbook ){
Die $ parser-> error (), ". \ n ";
}

For my $ worksheet ($ workbook-> worksheets ()){

My ($ row_min, $ row_max) = $ worksheet-> row_range ();
My ($ col_min, $ col_max) = $ worksheet-> col_range ();

For my $ row ($ row_min .. $ row_max ){
For my $ col ($ col_min... $ col_max ){
My $ cell = $ worksheet-> get_cell ($ row, $ col );
Next unless $ cell;
Print "Row, Col = ($ row, $ col) \ n ";
Print "Value =", $ cell-> value (), "\ n ";
Print "\ n ";
}
}
}
<STDIN>;

Note that the following modules must be installed:

Spreadsheet: ParseExcel: first installed.
Unicode: Map: This must be installed for character encoding.
IO-stringy: This is already installed. It is unclear.
OLE-Storage_Lite: This is the package needed to access the office suite. The Spreadsheet: ParseExcel file is installed.

Here you only need to install UNICODE: map.

The above Code shows that the result is normal:

We can see that the subscript of the row and column cells starts from 0.

6. Task implementation

The row of a word starts from 8th rows (subscript: 7), and all columns are in 2nd columns (subscript: 1 ). In this way, the code is slightly modified, so that $ row_min = 7 and $ col_min = $ col_max = 1. Modify the target file name to 'corpuswordlist.xls '. The output result is as follows:

From 7th rows to 14635 rows, just 14629 rows.

7. File

/Files/pangxiaodong/learningperl/perlreader to obtain exceldictionary file. Zip

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Use Perl to read Excel files

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Use Perl to read Excel files

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support