1. Task
To implement some mechanical word segmentation algorithms, we prepare to use the word segmentation vocabulary of the "National Language Commission corpus". The word segmentation file downloaded online is an Excel file. The task in this article is to use Perl to extract all words from the Execl file.
The Word Table file format is as follows:
The expected word is located in all cells in column B starting from row 8th. There are a total of 14629 words. (PS: The word segmentation vocabulary of the corpus contains more than 80 thousand words, but it is downloaded online to more than 50 words that appear more than 50 times ).
2. modules used
After reading some blog posts, we found that the Spreadsheet: ParseExcel module of PERL supports Excel reading.
3. How to download the module (Strawberry PERL on windows xp)
Enter cpan Spreadsheet: ParseExcel in the command line to automatically install the tool.
After the installation is complete, enter perldoc Spreadsheet: ParseExcel to check whether the installation is successful. (If the installation fails, the installation will fail)
4. Sample Code
It seems that the sample code of perldoc is very laborious to read. It is better to go directly to the cpan website to see the sample code or download the sample code of the module.
Log on to the cpan website at http://search.cpan.org/and check the spreadsheet::parseexcel.pdf on its homepage.
Http://search.cpan.org /~ Jmcnamara/Spreadsheet-ParseExcel-0.59/lib/Spreadsheet/ParseExcel. pm # NAME
The sample code and some explanations can be found above. The sample code is as follows: traverse all the worksheet and traverse the cells under each worksheet.
#! /Usr/bin/perl-w
Use strict;
Use Spreadsheet: ParseExcel;
My $ parser = Spreadsheet: ParseExcel-> new ();
My $ workbook = $ parser-> parse('Book1.xls ');
If (! Defined $ workbook ){
Die $ parser-> error (), ". \ n ";
}
For my $ Worksheet ($ workbook-> worksheets () {My ($ row_min, $ row_max) = $ worksheet-> row_range ();
My ($ col_min, $ col_max) = $ worksheet-> col_range (); for my $ row ($ row_min .. $ row_max ){
For my $ Col ($ col_min... $ col_max ){
My $ cell = $ worksheet-> get_cell ($ row, $ col );
Next unless $ cell;
Print "Row, Col = ($ row, $ col) \ n ";
Print "Value =", $ cell-> value (), "\ n ";
Print "Unformatted =", $ cell-> unformatted (), "\ n ";
Print "\ n ";
}
}
}
In addition, you can find the module file on the webpage:
Http://search.cpan.org/CPAN/authors/id/J/JM/JMCNAMARA/Spreadsheet-ParseExcel-0.59.tar.gz
This compressed package contains sample code of many modules.
5. Read the sample file
Create an excel file with only four rows and one column:
Use the previous sample code to replace 'book1.xls 'with the target file name. Chinese garbled characters are displayed.
According to the data on the Internet, the character encoding of excel is unicode, which is generally solved using the following code:
My $ formatter = Spreadsheet: ParseExcel: FmtUnicode-> new (Unicode_Map => "CP936 ");
My $ workbook = $ parser-> parse('example.xls ', $ formatter );
The complete code is as follows:
#! /Usr/bin/perl-w
Use Spreadsheet: ParseExcel;
Use Spreadsheet: ParseExcel: FmtUnicode;
My $ parser = Spreadsheet: ParseExcel-> new ();
My $ formatter = Spreadsheet: ParseExcel: FmtUnicode-> new (Unicode_Map => "CP936 ");
My $ workbook = $ parser-> parse('example.xls ', $ formatter );
If (! Defined $ workbook ){
Die $ parser-> error (), ". \ n ";
}
For my $ worksheet ($ workbook-> worksheets ()){
My ($ row_min, $ row_max) = $ worksheet-> row_range ();
My ($ col_min, $ col_max) = $ worksheet-> col_range ();
For my $ row ($ row_min .. $ row_max ){
For my $ col ($ col_min... $ col_max ){
My $ cell = $ worksheet-> get_cell ($ row, $ col );
Next unless $ cell;
Print "Row, Col = ($ row, $ col) \ n ";
Print "Value =", $ cell-> value (), "\ n ";
Print "\ n ";
}
}
}
<STDIN>;
Note that the following modules must be installed:
Spreadsheet: ParseExcel: first installed.
Unicode: Map: This must be installed for character encoding.
IO-stringy: This is already installed. It is unclear.
OLE-Storage_Lite: This is the package needed to access the office suite. The Spreadsheet: ParseExcel file is installed.
Here you only need to install UNICODE: map.
The above Code shows that the result is normal:
We can see that the subscript of the row and column cells starts from 0.
6. Task implementation
The row of a word starts from 8th rows (subscript: 7), and all columns are in 2nd columns (subscript: 1 ). In this way, the code is slightly modified, so that $ row_min = 7 and $ col_min = $ col_max = 1. Modify the target file name to 'corpuswordlist.xls '. The output result is as follows:
From 7th rows to 14635 rows, just 14629 rows.
7. File
/Files/pangxiaodong/learningperl/perlreader to obtain exceldictionary file. Zip