Mahout 0.6: Data Format Conversion

Last updated: 2018-08-22 Source: Internet

Uploaded by: User

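Before an algorithm in Mahout can run, text data (numbers or strings) must be converted to the SequenceFile format, which Mahout algorithms take as input. When an algorithm finishes, its result files are also in SequenceFile format. SequenceFile is a binary format specific to Hadoop, so results have to be converted back into human-readable text. This conversion workflow has already appeared in passing in the preceding chapters.

This chapter summarizes the interfaces Mahout provides for converting input and output formats. Some of them were already used in earlier chapters; others are new.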

| Method name | Description |
|---|---|
| InputDriver | Converts a numeric file to SequenceFile format |

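1 Converting numeric files to SequenceFile format

1.1 Introduction

A numeric file is a file whose contents are numbers, either integers or floating-point values. One example is synthetic_control.data, in which each line is one sample and each sample has 60 attributes.

As noted above, Mahout reads input in SequenceFile format, so numeric text must first be converted to SequenceFile format. SequenceFile is Hadoop's own binary data file format and can store its records in compressed form.

Mahout already wraps this conversion. The class org.apache.mahout.clustering.conversion.InputDriver, in the package mahout-integration-0.6.jar, has a main function that accepts command-line arguments and performs the file format conversion.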

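Table 1-1 Command-line parameters for converting a numeric file to a SequenceFile

| Parameter | Description |
|---|---|
| --input (-i) input | Input path of the text data file |
| --output (-o) output | Output path of the SequenceFile |
| --vector (-v) v | |

In the SequenceFile produced by InputDriver, the key and value types are Text and VectorWritable respectively.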

Note: the clustering examples below wrap the InputDriver functionality once more, so those clustering algorithms can be run directly on text data.

1.2 Usage

1) Copy synthetic_control.data from the local file system to HDFS

$HADOOP_HOME/bin/hadoop fs -mkdir DataTrans/Numeric

$HADOOP_HOME/bin/hadoop fs -put /home/zhongchao/workspace/data/DataTrans/Numeric/synthetic_control.data DataTrans/Numeric

2) Run the conversion command

$MAHOUT0P6_HOME/bin/mahout org.apache.mahout.clustering.conversion.InputDriver -i DataTrans/Numeric/synthetic_control.data -o DataTrans/Numeric/seq_synthetic_control.data

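After the command finishes, the generated SequenceFile output is stored under the seq_synthetic_control.data directory. As the figure below shows, the converted data is in part-m-00000, the file that step 3) reads.

Figure 1.2-1 Files in seq_synthetic_control.data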

3) Read the result files in seq_synthetic_control.data

$MAHOUT0P6_HOME/bin/mahout vectordump -s DataTrans/Numeric/seq_synthetic_control.data/part-m-00000 -o /home/zhongchao/workspace/data/DataTrans/Numeric/res_text -p -c -ac

Reading the result with the vectordump command gives output in the following format:

60 28.7812,34.4632,31.3381,31.2834,28.9207,33.7596,25.3969,27.7849,35.2479,27.1159,32.8717,29.2171,36.0253,32.337,34.5249,32.8717,34.1173,26.5235,27.6623,26.3693,25.7744,29.27,30.7326,29.5054,33.0292,25.04,28.9167,24.3437,26.1203,34.9424,25.0293,26.6311,35.6541,28.4353,29.1495,28.1584,26.1927,33.3182,30.9772,27.0443,35.5344,26.2353,28.9964,32.0036,31.0558,34.2553,28.0721,28.9402,35.4973,29.747,31.4333,24.5556,33.7431,25.0466,34.9318,34.9879,32.4721,33.3759,25.4652,25.8717

60 24.8923,25.741,27.5532,32.8217,27.8789,31.5926,31.4861,35.5469,27.9516,31.6595,27.5415,31.1887,27.4867,31.391,27.811,24.488,27.5918,35.6273,35.4102,31.4167,30.7447,24.1311,35.1422,30.4719,31.9874,33.6615,25.5511,30.4686,33.6472,25.0701,34.0765,32.5981,28.3038,26.1471,26.9414,31.5203,33.1089,24.1491,28.5157,25.7906,35.9519,26.5301,24.8578,25.9562,32.8357,28.5322,26.3458,30.6213,28.9861,29.4047,32.5577,31.0205,26.6418,28.4331,33.6564,26.4244,28.4661,34.2484,32.1005,26.691

...
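The output above suggests the shape of each converted record: the key is the number of elements in the sample and the value is the sample itself. The following is a minimal Python sketch of that mapping, not InputDriver's actual code. The function names `line_to_record` and `convert` are hypothetical, and the sketch assumes the input numbers are separated by whitespace.

```python
def line_to_record(line):
    """Turn one whitespace-separated line of numbers into a (key, values) pair.

    Assumption: the key is the element count as a string, mirroring the
    "60" key seen in the vectordump output for 60-attribute samples.
    """
    values = [float(token) for token in line.split()]
    key = str(len(values))
    return key, values


def convert(lines):
    """Convert every non-blank line into a record; blank lines are skipped."""
    return [line_to_record(line) for line in lines if line.strip()]


# Two shortened, illustrative samples (three attributes instead of 60).
sample = ["28.7812 34.4632 31.3381", "", "24.8923 25.741 27.5532"]
records = convert(sample)
for key, values in records:
    # Print in a layout similar to the dump above: key, then comma-separated values.
    print(key, ",".join(str(v) for v in values))
```

In a real job the value would be wrapped in a VectorWritable and the key in a Text; plain Python lists and strings stand in for them here.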

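This shows that in the converted SequenceFile the key is 60, the number of elements (attributes) in each sample, and the value is the data of that sample.

2 Converting text files to SequenceFile format

2.1 Introduction

To classify or cluster text with Mahout, the text files must first be converted to SequenceFiles. The seqdirectory command does this. It is implemented in the package mahout-core-0.6-job.jar, in the class org.apache.mahout.text.SequenceFilesFromDirectory.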

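Table 2-1 Parameters that control seqdirectory

| Parameter | Explanation | Allowed values | Default |
|---|---|---|---|
| --input (-i) input | Path of the text files on HDFS | | |
| --output (-o) output | Output path on HDFS, holding the converted SequenceFile | | |
| --overwrite (-ow) | If given, the output files are overwritten before the job runs | | |
| --chunkSize (-chunk) chunkSize | Size of each output file chunk | | |
| --fileFilterClass (-filter) fileFilterClass | Name of the class used to parse the files | | org.apache.mahout.text.PrefixAdditionFilter |
| --keyPrefix (-prefix) keyPrefix | Prefix added to each key | | |
| --charset (-c) charset | Character encoding | | UTF-8 |
| --help (-h) | Print help information | | |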

2.2 Usage

1) Copy the text file to HDFS

$HADOOP_HOME/bin/hadoop fs -mkdir DataTrans/Text

$HADOOP_HOME/bin/hadoop fs -put /home/zhongchao/workspace/data/DataTrans/Text/text.data DataTrans/Text

text.data contains the following:

package org.apache.mahout.text;
import java.lang.reflect.Constructor;
import java.nio.charset.Charset;
import java.util.Map;
import com.google.common.collect.Maps;
import com.google.common.io.Closeables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.commandline.DefaultOptionCreator;
import org.apache.mahout.utils.io.ChunkedWriter;

2) Run the conversion command

$MAHOUT0P6_HOME/bin/mahout seqdirectory -c UTF-8 -i DataTrans/Text/text.data -o DataTrans/Text/seq_text

3) Read the result

$MAHOUT0P6_HOME/bin/mahout seqdumper -s DataTrans/Text/seq_text/chunk-0 -o /home/zhongchao/workspace/data/DataTrans/Text/res_text

The result is as follows:

Input Path: DataTrans/Text/seq_text/chunk-0
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
Key: /seq_text/chunk-0: Value:
Key: /text.data: Value: package org.apache.mahout.text;
import java.lang.reflect.Constructor;
import java.nio.charset.Charset;
import java.util.Map;
import com.google.common.collect.Maps;
import com.google.common.io.Closeables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.commandline.DefaultOptionCreator;
import org.apache.mahout.utils.io.ChunkedWriter;
Count: 2
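The dump above pairs each file path with the full contents of that file. The following Python sketch imitates that pairing on a local directory. It is only an illustration under stated assumptions, not the logic of SequenceFilesFromDirectory: the function `directory_to_pairs` and its parameters `key_prefix` and `charset` are hypothetical names loosely modelled on the --keyPrefix and --charset options, and the key is assumed to be the prefix followed by "/" and the path relative to the input directory.

```python
import os
import tempfile


def directory_to_pairs(root, key_prefix="", charset="utf-8"):
    """Walk `root` and return sorted (key, value) pairs, one per file.

    Assumption: key = key_prefix + "/" + relative path; value = file text.
    """
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, encoding=charset) as handle:
                pairs.append((key_prefix + "/" + rel, handle.read()))
    return sorted(pairs)


# Build a throwaway directory holding one small text file, then convert it.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "text.data"), "w", encoding="utf-8") as handle:
        handle.write("package org.apache.mahout.text;\n")
    pairs = directory_to_pairs(root)

for key, value in pairs:
    print("Key: " + key + ": Value: " + value)
```

A real SequenceFile would store these pairs as Text keys and Text values in Hadoop's binary layout; the list of tuples here only shows what goes into each record.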

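This shows that the key is the file name and the value is the content of the file.

3 Reading numeric data from a SequenceFile

3.1 Introduction

The vectordump command is dedicated to reading numeric data from SequenceFiles. It is in the package mahout-integration-0.6.jar, under org.apache.mahout.utils.vectors.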

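Table 1.1.2-1 VectorDumper command-line parameters (VectorDumper class)

| Parameter | Explanation | Allowed values | Default |
|---|---|---|---|
| -s | Result file in SequenceFile format, on the HDFS file system | | None |
| -o | Converted result file in text format, on the local file system; if not set, the result is printed to the console | | None |
| -u | If the key is a vector, this parameter controls its output | | |
| -p | If -u is given, -p prints the key separated by a space | | |
| -d | | | |
| -dt | Dictionary file format | text / sequencefile | |
| -c | Output the vectors in CSV format | | |
| -ac | With CSV output, adds a comment line to each vector row (for example, a comment in a result file: #eigenVector0, eigenvalue = 5888.20818554016) | | |
| -sort | | | |
| -sz | | | |
| -n | | | |
| -vs | | | |
| -fi | | | |
| -h | | | |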

3.2 Usage

See step 3) in section 1.2 of this chapter.
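To make the dumped layout from section 1.2 concrete, here is a small Python sketch that renders a (key, vector) record as text, with the vector elements comma-separated. It is an assumption-laden illustration, not VectorDumper's code: `format_vector`, `print_key` and `comment` are hypothetical names loosely inspired by the -p, -c and -ac options, and the single space between key and vector is taken from the sample output shown earlier.

```python
def format_vector(key, values, print_key=True, comment=None):
    """Render one (key, vector) record as a list of text lines.

    Assumptions: elements are comma-separated, the key (when printed) is
    followed by one space, and an optional comment line starts with '#'.
    """
    lines = []
    if comment is not None:
        lines.append("#" + comment)           # optional description line
    row = ",".join(str(v) for v in values)    # comma-separated elements
    if print_key:
        row = str(key) + " " + row
    lines.append(row)
    return lines


# Illustrative record with three elements instead of 60.
for line in format_vector("3", [28.7812, 34.4632, 31.3381]):
    print(line)
```

Passing `comment="eigenVector0, eigenvalue = 5888.20818554016"` would put a line like the one quoted in the table in front of the vector row.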

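4 Reading text from a SequenceFile

4.1 Introduction

The seqdumper command reads text from SequenceFiles. It is implemented in mahout-examples-0.6-job.jar, in the class org.apache.mahout.utils.SequenceFileDumper.

Its command-line parameters are as follows.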

Table 4.1-1 Parameters of seqdumper

| Parameter | Explanation | Allowed values | Default |
|---|---|---|---|
| --seqFile (-s) seqFile | Input path, on HDFS | | |
| --output (-o) output | Output file path, on the local file system | | |
| --substring (-b) substring | The number of characters to print per value | | |
| --count (-c) | Report the count only | | |
| --numItems (-n) numItems | Output at most <n> key/value pairs | | |
| --facets (-fa) | Output the counts per key. Note: if there are many unique keys, this can take up a fair amount of memory | | |
| --help (-h) | Print help | | |

4.2 Usage

See step 3) in section 2.2 of this chapter.
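The options in Table 4.1-1 can be pictured with a short Python sketch that dumps a list of key/value pairs in the "Key: ...: Value: ..." layout seen in section 2.2. This is a hedged illustration rather than SequenceFileDumper itself: the function `dump` and its parameters are hypothetical stand-ins for --substring, --count, --numItems and --facets, the per-key facet line format is made up, and the final count is assumed to cover only the pairs actually processed.

```python
from collections import Counter


def dump(pairs, substring=None, count_only=False, num_items=None, facets=False):
    """Return the text lines a simplified dumper would print for `pairs`."""
    shown = pairs if num_items is None else pairs[:num_items]  # at most n pairs
    lines = []
    if not count_only:
        for key, value in shown:
            # Truncate each value to `substring` characters when requested.
            text = value if substring is None else value[:substring]
            lines.append("Key: " + key + ": Value: " + text)
    if facets:
        # One counter entry per unique key, so memory grows with unique keys.
        for key, n in sorted(Counter(k for k, _ in shown).items()):
            lines.append("Key: " + key + " Count: " + str(n))
    lines.append("Count: " + str(len(shown)))
    return lines


demo = [("/a", "hello world"), ("/b", "second"), ("/a", "x")]
for line in dump(demo, substring=5, num_items=2):
    print(line)
```

The facet branch also hints at why the table warns about memory: the counter holds one entry for every distinct key.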

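5 Converting text in SequenceFile format to vector files

5.1 Introduction

When classifying or clustering text, the first step is to convert the text files to SequenceFile format, as described above. This section covers the next step: converting those SequenceFiles into vectorized SequenceFiles.

The seq2sparse command does this. It is in the package mahout-examples-0.6-job.jar, in the class org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.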

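Table 5.1-1 Parameters of the seq2sparse vectorization process

| Parameter | Description | Allowed values | Default |
|---|---|---|---|
| --input (-i) input | Input path (text already converted to SequenceFile format) | | |
| --output (-o) output | Output path | | |
| --chunkSize (-chunk) chunkSize | Size of the data chunk processed at one time (MB) | | 100 |
| --analyzerName (-a) analyzerName | Analyzer (tokenizer) to use (org.apache.lucene.analysis.standard.StandardAnalyzer, org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer) | | org.apache.lucene.analysis.standard.StandardAnalyzer |
| --minSupport (-s) minSupport | Only terms whose frequency is greater than minSupport become feature terms | | |
| --minDF (-md) minDF | For terms whose DF is less than minDF, the tf-idf value is computed with minDF instead of the actual DF | | |
| --maxDFPercent (-x) maxDFPercent | Removes terms that appear in more than maxDFPercent percent of the documents | | |
| --weight (-wt) weight | Vectorization method (tf, tfidf) | | tfidf |
| --norm (-n) norm | Normalize using the specified norm | | |
| --minLLR (-ml) minLLR | Takes effect when maxNGramSize > 1; removes uncommon word combinations | | 1.0 |
| --numReducers (-nr) numReducers | Number of reducers | | |
| --maxNGramSize (-ng) ngramSize | Maximum n-gram size | | |
| --overwrite (-ow) | If given, overwrites the results of the previous run | | |
| --sequentialAccessVector (-seq) | If given, the output uses SequentialAccessSparseVectors, which are more efficient for sequential access; otherwise the default is used | | RandomAccessSparseVectors |
| --namedVector (-nv) | If given, the output vectors are of type NamedVector | | false |
| --logNormalize (-lnorm) | (Optional) Whether the output vectors should be log-normalized; true if set, otherwise false | | false |
| --maxDFSigma (-xs) maxDFSigma | What portion of the tf (tf-idf) vectors to use, expressed in multiples of the standard deviation (sigma) of the document frequencies of these vectors. Can be used to remove really high-frequency terms. Expressed as a double value; a good value to specify is 3.0. If the value is less than 0, no vectors are filtered out. Overrides maxDFPercent | | -1.0 |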
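How several of these parameters interact is easier to see in code. The Python sketch below applies the filtering and weighting rules as Table 5.1-1 describes them. It is a simplified illustration under stated assumptions, not seq2sparse's implementation: `vectorize` and its keyword arguments are hypothetical names, documents are assumed to be already tokenized, and the weighting is the textbook tf * log(N / df), which may differ from the formula Mahout actually uses.

```python
import math
from collections import Counter


def vectorize(docs, weight="tfidf", min_support=0, min_df=1,
              max_df_percent=100, norm=None):
    """docs: dict of name -> token list. Returns name -> {term: weight}."""
    n_docs = len(docs)
    term_freq = Counter(t for tokens in docs.values() for t in tokens)
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    # Keep terms that are frequent enough (minSupport) and not too common
    # across documents (maxDFPercent), following the table's wording.
    vocab = {
        t for t in term_freq
        if term_freq[t] > min_support
        and 100.0 * doc_freq[t] / n_docs <= max_df_percent
    }
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(t for t in tokens if t in vocab)
        vec = {}
        for term, tf in counts.items():
            if weight == "tf":
                vec[term] = float(tf)
            else:
                df = max(doc_freq[term], min_df)   # minDF acts as a floor on DF
                vec[term] = tf * math.log(n_docs / df)
        if norm is not None:
            # Divide by the p-norm of the vector (norm=2 gives unit length).
            length = sum(abs(w) ** norm for w in vec.values()) ** (1.0 / norm)
            if length > 0:
                vec = {t: w / length for t, w in vec.items()}
        vectors[name] = vec
    return vectors


docs = {"d1": ["a", "b", "a"], "d2": ["a", "c"]}
print(vectorize(docs, weight="tf"))
print(vectorize(docs, weight="tf", max_df_percent=50))
```

In the second call the term "a" disappears because it occurs in every document, which is the effect the table describes for maxDFPercent.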

5.2 Usage

See Part 8, section 2.1.1.
