Reading Excel Files with Perl

Last updated: 2018-12-05  Source: Internet


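1. Task

To implement some mechanical (dictionary-based) word segmentation algorithms, I plan to use the segmentation word list of the "國家語委語料庫" (State Language Commission corpus). The word-list file downloaded from its website is an Excel file. The task in this article is to use Perl to extract all the words from that Excel file.

The word-list file has the following layout:

The words we need are in column B, in every cell from row 8 downward. There are 14629 words in total. (PS: the corpus's full segmentation word list contains more than 80,000 words, but the online download only includes words that occur 50 times or more, which leaves just these 10,000-odd.)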

2. Which module to use

After reading a few blog posts, I found that Perl's Spreadsheet::ParseExcel module supports reading Excel files in the binary .xls format.

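3. How to install the module (Strawberry Perl on Windows XP)

At the command line, enter: cpan Spreadsheet::ParseExcel, and the module is installed automatically.

Once the installation has finished, enter perldoc Spreadsheet::ParseExcel to check whether it succeeded. (If the module is not installed, perldoc reports that no documentation was found instead of showing the manual.)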
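The perldoc check can also be done from a script. The following is a minimal sketch, not part of the original article: module_version is a hypothetical helper name, and the list of modules to probe is simply the two that this article installs by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# or undef if the module cannot be loaded at all.
sub module_version {
    my ($module) = @_;

    # require needs a file path when the name is only known at run time,
    # so turn Some::Module into Some/Module.pm.
    ( my $file = "$module.pm" ) =~ s{::}{/}g;
    return unless eval { require $file; 1 };

    my $version = $module->VERSION;
    return defined $version ? $version : 'unknown';
}

for my $module (qw(Spreadsheet::ParseExcel Unicode::Map)) {
    my $version = module_version($module);
    print defined $version
        ? "$module $version is installed\n"
        : "$module is NOT installed\n";
}
```

Wrapping require in eval keeps the script running when a module is missing, so every module in the list gets reported in one pass.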

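4. Sample code

I find the sample code in perldoc tiring to read. It is easier to read it on the CPAN website, or to download the sample programs that ship with the module.

Go to the CPAN website, http://search.cpan.org/, and search for the Spreadsheet::ParseExcel module. Its home page,
http://search.cpan.org/~jmcnamara/Spreadsheet-ParseExcel-0.59/lib/Spreadsheet/ParseExcel.pm#NAME
has sample code along with some explanation. The sample code is shown below. It iterates over every worksheet, and over every cell within each worksheet.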

#!/usr/bin/perl -w

use strict;
use Spreadsheet::ParseExcel;

my $parser   = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse('Book1.xls');

# parse() returns undef on failure; report the parser's error message
if ( !defined $workbook ) {
    die $parser->error(), ".\n";
}

# Walk every worksheet, then every cell inside its used row/column range
for my $worksheet ( $workbook->worksheets() ) {

    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        for my $col ( $col_min .. $col_max ) {

            my $cell = $worksheet->get_cell( $row, $col );
            next unless $cell;    # skip empty cells

            print "Row, Col    = ($row, $col)\n";
            print "Value       = ", $cell->value(),       "\n";
            print "Unformatted = ", $cell->unformatted(), "\n";
            print "\n";
        }
    }
}

The same page also links to the module's distribution archive:

http://search.cpan.org/CPAN/authors/id/J/JM/JMCNAMARA/Spreadsheet-ParseExcel-0.59.tar.gz

This archive contains many sample programs for the module.

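5. Reading a sample file

First, create a small Excel file with just 4 rows and 1 column to experiment with.

Then run the sample code from the previous section, replacing 'Book1.xls' with the name of the target file. The Chinese text comes out garbled.

According to material found online, Excel stores its strings as Unicode, and the usual fix is the following code: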

my $formatter = Spreadsheet::ParseExcel::FmtUnicode->new(Unicode_Map=>"CP936");
my $workbook = $parser->parse('example.xls', $formatter);

The complete code is as follows:

#!/usr/bin/perl -w

use strict;
use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::FmtUnicode;

my $parser = Spreadsheet::ParseExcel->new();

# Convert cell text to CP936 (GBK) so that a Chinese Windows console can display it
my $formatter = Spreadsheet::ParseExcel::FmtUnicode->new(Unicode_Map=>"CP936");
my $workbook = $parser->parse('example.xls', $formatter);

# parse() returns undef on failure; report the parser's error message
if ( !defined $workbook ) {
    die $parser->error(), ".\n";
}

for my $worksheet ( $workbook->worksheets() ) {

    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        for my $col ( $col_min .. $col_max ) {
            my $cell = $worksheet->get_cell( $row, $col );
            next unless $cell;    # skip empty cells
            print "Row, Col    = ($row, $col)\n";
            print "Value       = ", $cell->value(),       "\n";
            print "\n";
        }
    }
}

# Wait for Enter so that the console window stays open
<STDIN>;

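Note that the following modules are needed in total:

    Spreadsheet::ParseExcel: the one installed at the start.
    Unicode::Map: has to be installed separately; it handles the character encoding.
    IO-stringy: was already installed; I am not sure exactly what it is used for.
    OLE-Storage_Lite: needed for accessing Office files. It was installed along with Spreadsheet::ParseExcel.

So the only additional module to install here is Unicode::Map.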
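A possible alternative to installing Unicode::Map is Perl's core Encode module. The sketch below is not from the original article and rests on one assumption that should be checked against the installed release: that $cell->value() yields decoded Perl character strings when no formatter is passed to parse(). To stay runnable without a spreadsheet, it uses a literal two-character string as a stand-in for a cell value.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for a cell value (assumption: value() returns decoded characters).
# The two characters are U+8BCD and U+8BED.
my $value = "\x{8BCD}\x{8BED}";

# Option 1: encode explicitly into CP936 bytes
my $bytes = encode( 'cp936', $value );
printf "%d characters -> %d bytes in CP936\n", length $value, length $bytes;

# Option 2: attach an encoding layer so that everything printed is converted
binmode STDOUT, ':encoding(cp936)';
print $value, "\n";
```

The output layer is usually the more convenient choice, because the target encoding is then named in exactly one place and can be swapped for UTF-8 on a non-Windows terminal.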

With the code above, the output is displayed correctly.

The output also shows that row and column indices of cells both start from 0.
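Because the indices are 0-based while Excel labels cells in a 1-based "A1" style, it is easy to be off by one. The sketch below shows one way to do the conversion in plain Perl. The helper a1_to_index is a hypothetical name introduced for illustration and is not part of Spreadsheet::ParseExcel.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: convert an Excel reference such as "B8"
# into 0-based (row, col) indices as used by get_cell().
sub a1_to_index {
    my ($ref) = @_;
    my ( $letters, $digits ) = $ref =~ /\A([A-Za-z]+)([1-9][0-9]*)\z/
        or die "Not a cell reference: $ref\n";

    # Column letters form a base-26 number without a zero digit: A=1 .. Z=26, AA=27
    my $col = 0;
    $col = $col * 26 + ( ord( uc $_ ) - ord('A') + 1 ) for split //, $letters;

    return ( $digits - 1, $col - 1 );
}

my ( $row, $col ) = a1_to_index('B8');
print "B8 -> row $row, col $col\n";    # prints "B8 -> row 7, col 1"
```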

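6. Implementing the task

The words start at row 8 (index 7) and all sit in the second column (index 1). So the code needs only a small change: set $row_min = 7 and $col_min = $col_max = 1, and change the target file name to 'CorpusWordlist.xls'.

The output runs from row index 7 to row index 14635, which is exactly 14629 rows (14635 - 7 + 1).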
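The modification described above can be sketched as a small script. This is one possible arrangement rather than the author's exact code: collect_column is a hypothetical helper, and reading only the first worksheet and printing UTF-8 are assumptions made for the example. The file name is taken from the command line, so the script does nothing when run without arguments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the values of one column, starting at a
# 0-based row. $worksheet is any object offering row_range() and get_cell(),
# like the worksheet objects used in the listings of this article.
sub collect_column {
    my ( $worksheet, $first_row, $col ) = @_;
    my ( undef, $row_max ) = $worksheet->row_range();

    my @words;
    for my $row ( $first_row .. $row_max ) {
        my $cell = $worksheet->get_cell( $row, $col );
        next unless $cell;    # skip empty cells
        push @words, $cell->value();
    }
    return @words;
}

# Only touch a spreadsheet when a file name is given, e.g. CorpusWordlist.xls
if (@ARGV) {
    require Spreadsheet::ParseExcel;
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse( $ARGV[0] )
        or die $parser->error(), ".\n";

    # Assumption: the word list is on the first worksheet
    my ($worksheet) = $workbook->worksheets();
    my @words = collect_column( $worksheet, 7, 1 );    # row 8, column B

    binmode STDOUT, ':encoding(UTF-8)';    # assumption: UTF-8 output is wanted
    print "$_\n" for @words;
    print STDERR scalar(@words), " words extracted\n";
}
```

If the layout is as described, the count written to STDERR should match the 14629 words mentioned earlier, which gives a quick sanity check on the row and column indices.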

7. Files

/Files/pangxiaodong/LearningPerl/Perl讀取EXCEL詞典檔案.zip
