Mahout 0.6: Data Format Conversion


Before running an algorithm in Mahout, text data (numeric or string) must be converted into SequenceFile format, which is what Mahout's algorithms take as input; the result files an algorithm produces are also in SequenceFile format. SequenceFile is a binary format specific to Hadoop, so the results must then be converted back into text that people can read. These conversion steps have already appeared in the preceding chapters.

This chapter summarizes the input and output format-conversion interfaces that exist in Mahout; some have already been used in earlier chapters, while others are introduced here for the first time.

Method name     Method description
InputDriver     Converts a numeric text file into SequenceFile format
seqdirectory    Converts text files into SequenceFile format
vectordump      Reads numeric (vector) data from a SequenceFile back into text
seqdumper       Reads text data from a SequenceFile back into text
seq2sparse      Converts SequenceFile-format text into vectorized SequenceFile format

1 Numeric file conversion to SequenceFile format

1.1 Introduction

A numeric file is a file containing numbers of integer or floating-point type, such as the synthetic_control.data data set, in which each row represents a sample and each sample contains 60 attributes.

As already mentioned, the file type Mahout recognizes is the SequenceFile format, so the numeric text must first be converted into it. SequenceFile is a binary data format specific to Hadoop that stores information in compressed form.

Mahout already encapsulates the conversion from numeric files to SequenceFile format: in the package mahout-integration-0.6.jar you can find the org.apache.mahout.clustering.conversion.InputDriver class, whose main function accepts command-line arguments to carry out the format conversion.

Table 1-1 Command-line arguments for converting a numeric file into a SequenceFile

Parameter name          Parameter description
--input (-i) input      Input path of the text data file
--output (-o) output    Output path for the SequenceFile
--vector (-v) vector    The Vector class to use for the output values

After conversion by InputDriver, the keys in the resulting SequenceFile are of type Text and the values of type VectorWritable.
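To make the conversion concrete, the following is a minimal, hypothetical Java sketch (not Mahout's actual InputDriver implementation) that parses one line of numeric text into a vector and writes it as a Text/VectorWritable pair; the class name and output path are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class NumericToSequenceFileSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("datatrans/numeric/seq_sketch/part-m-00000");
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, out, Text.class, VectorWritable.class);
    try {
      // One sample row, as in synthetic_control.data (60 attributes per row);
      // truncated here for brevity.
      String line = "28.7812 34.4632 31.3381";
      String[] tokens = line.split("[\\s,]+");
      Vector vector = new RandomAccessSparseVector(tokens.length);
      for (int i = 0; i < tokens.length; i++) {
        vector.set(i, Double.parseDouble(tokens[i]));
      }
      // Key: the number of elements, as seen in the vectordump output below;
      // value: the vector itself, wrapped in a VectorWritable.
      writer.append(new Text(String.valueOf(tokens.length)),
          new VectorWritable(vector));
    } finally {
      writer.close();
    }
  }
}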

Note: In the clustering examples that follow, the functionality of InputDriver is encapsulated again, so those clustering algorithms can be run directly on text data.

1.2 Usage

1) Copy synthetic_control.data from the local file system to HDFS

$HADOOP_HOME/bin/hadoop fs -mkdir datatrans/numeric

$HADOOP_HOME/bin/hadoop fs -put /home/zhongchao/workspace/data/datatrans/numeric/synthetic_control.data datatrans/numeric

2) Execute the conversion command

$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.conversion.InputDriver -i datatrans/numeric/synthetic_control.data -o datatrans/numeric/seq_synthetic_control.data

After the above command executes, the generated SequenceFile is stored under seq_synthetic_control.data, as shown in the figure below; the converted data is stored in part-m-00000.

Figure 1.2-1 The files under seq_synthetic_control.data

3) Read the converted file under seq_synthetic_control.data

$MAHOUT_HOME/bin/mahout vectordump -s datatrans/numeric/seq_synthetic_control.data/part-m-00000 -o /home/zhongchao/workspace/data/datatrans/numeric/res_text -p -c -ac

Reading the results with the vectordump command produces the following format:

60    28.7812,34.4632,31.3381,31.2834,28.9207,33.7596,25.3969,27.7849,35.2479,27.1159,32.8717,29.2171,36.0253,32.337,34.5249,32.8717,34.1173,26.5235,27.6623,26.3693,25.7744,29.27,30.7326,29.5054,33.0292,25.04,28.9167,24.3437,26.1203,34.9424,25.0293,26.6311,35.6541,28.4353,29.1495,28.1584,26.1927,33.3182,30.9772,27.0443,35.5344,26.2353,28.9964,32.0036,31.0558,34.2553,28.0721,28.9402,35.4973,29.747,31.4333,24.5556,33.7431,25.0466,34.9318,34.9879,32.4721,33.3759,25.4652,25.8717

60    24.8923,25.741,27.5532,32.8217,27.8789,31.5926,31.4861,35.5469,27.9516,31.6595,27.5415,31.1887,27.4867,31.391,27.811,24.488,27.5918,35.6273,35.4102,31.4167,30.7447,24.1311,35.1422,30.4719,31.9874,33.6615,25.5511,30.4686,33.6472,25.0701,34.0765,32.5981,28.3038,26.1471,26.9414,31.5203,33.1089,24.1491,28.5157,25.7906,35.9519,26.5301,24.8578,25.9562,32.8357,28.5322,26.3458,30.6213,28.9861,29.4047,32.5577,31.0205,26.6418,28.4331,33.6564,26.4244,28.4661,34.2484,32.1005,26.691

You can see that the key in the converted SequenceFile is 60, the number of elements (attributes) in each sample, and the value is the data of that sample.

2 Text file conversion to SequenceFile format

2.1 Introduction

To classify or cluster text with Mahout, the text files must first be processed and converted into SequenceFile format. This is done with the seqdirectory command, implemented by org.apache.mahout.text.SequenceFilesFromDirectory in the package mahout-core-0.6-job.jar. A sketch of the equivalent logic is given after Table 2-1.

Table 2-1 Parameters controlling seqdirectory

Parameter name                               Parameter explanation
--input (-i) input                           Path of the text files on HDFS
--output (-o) output                         Output path on HDFS for the converted SequenceFile
--overwrite (-ow)                            If given, overwrite the output path before running the job
--chunkSize (-chunk) chunkSize               Output file chunk size
--fileFilterClass (-filter) fileFilterClass  Name of the class used to parse the files (default: org.apache.mahout.text.PrefixAdditionFilter)
--keyPrefix (-prefix) keyPrefix              Prefix appended to the key
--charset (-c) charset                       Encoding of the input files (default: UTF-8)
--help (-h)                                  Print help information
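To illustrate what seqdirectory produces, here is a minimal, hypothetical sketch (not the actual SequenceFilesFromDirectory implementation) that writes each file of a local directory into a SequenceFile as a (file name, file content) pair of Text values; the class name and directory path are assumptions:

import java.io.File;
import java.nio.charset.Charset;

import com.google.common.io.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DirToSequenceFileSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("datatrans/text/seq_sketch/chunk-0");
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, out, Text.class, Text.class);
    try {
      // Hypothetical local input directory; adjust to your environment.
      File dir = new File("/home/zhongchao/workspace/data/datatrans/text");
      for (File f : dir.listFiles()) {
        if (!f.isFile()) {
          continue;
        }
        String content = Files.toString(f, Charset.forName("UTF-8"));
        // Key: the file name prefixed with "/", mirroring the seqdumper
        // output shown in section 2.2; value: the full file content.
        writer.append(new Text("/" + f.getName()), new Text(content));
      }
    } finally {
      writer.close();
    }
  }
}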

2.2 Usage

1) Copy the text file to HDFS

$HADOOP_HOME/bin/hadoop fs -mkdir datatrans/text

$HADOOP_HOME/bin/hadoop fs -put /home/zhongchao/workspace/data/datatrans/text/text.data datatrans/text

The content stored in text.data is as follows:

package org.apache.mahout.text;

import java.lang.reflect.Constructor;
import java.nio.charset.Charset;
import java.util.Map;

import com.google.common.collect.Maps;
import com.google.common.io.Closeables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.commandline.DefaultOptionCreator;
import org.apache.mahout.utils.io.ChunkedWriter;

2) Execute the conversion command

$MAHOUT_HOME/bin/mahout seqdirectory -c UTF-8 -i datatrans/text/text.data -o datatrans/text/seq_text

3) Read the results

$MAHOUT_HOME/bin/mahout seqdumper -s datatrans/text/seq_text/chunk-0 -o /home/zhongchao/workspace/data/datatrans/text/res_text

The results are as follows:

Input Path: datatrans/text/seq_text/chunk-0
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
Key: /seq_text/chunk-0: Value:
Key: /text.data: Value: package org.apache.mahout.text;

import java.lang.reflect.Constructor;
import java.nio.charset.Charset;
import java.util.Map;

import com.google.common.collect.Maps;
import com.google.common.io.Closeables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.commandline.DefaultOptionCreator;
import org.apache.mahout.utils.io.ChunkedWriter;

Count: 2

From this you can see that the key is the file name and the value is the content of that file.

3 Reading numeric information from a SequenceFile

3.1 Introduction

vectordump is dedicated to reading numeric information from a SequenceFile. It is implemented by the org.apache.mahout.utils.vectors.VectorDumper class in the package mahout-integration-0.6.jar. A sketch of its core read loop is given after Table 3.1-1.

Table 3.1-1 vectordump command-line parameters (VectorDumper class)

Parameter name    Parameter explanation
-s                SequenceFile-format result file to read, on HDFS (required)
-o                Converted result file in text format, local; if not set, the result is printed to the console (required)
-u                If the key is a vector, this parameter controls its output
-p                If -u is given, -p controls the key output; the key is separated from the value by a space
-d                Dictionary file
-dt               Dictionary file format (text/sequencefile)
-c                Output the vectors in CSV format
-ac               When outputting in CSV format, add a descriptive comment to each vector line, such as: #eigenVector0, eigenvalue = 5888.20818554016
-h                Print help information
-sort, -sz, -n, -vs, -fi    Additional options, not described here
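As a sketch of what vectordump does at its core, the following hypothetical reader (not the actual VectorDumper implementation) prints each Text/VectorWritable pair as a key followed by comma-separated values, similar to the output shown in section 1.2; the class name and input path are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class VectorDumpSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path("datatrans/numeric/seq_synthetic_control.data/part-m-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    try {
      Text key = new Text();
      VectorWritable value = new VectorWritable();
      while (reader.next(key, value)) {
        Vector vector = value.get();
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < vector.size(); i++) {
          if (i > 0) {
            csv.append(',');
          }
          csv.append(vector.get(i));
        }
        // Mirrors the "<key>  <csv values>" layout shown in section 1.2.
        System.out.println(key + "\t" + csv);
      }
    } finally {
      reader.close();
    }
  }
}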

3.2 Usage

See step 3) of section 1.2 in this chapter.

4 Reading text information from a SequenceFile

4.1 Introduction

The seqdumper command reads text information from SequenceFile files. It is implemented by org.apache.mahout.utils.SequenceFileDumper in the package mahout-examples-0.6-job.jar. A sketch of its core logic is given after Table 4.1-1.

Its command-line arguments are as follows.

Table 4.1-1 seqdumper execution parameters

Parameter name              Parameter explanation
--seqFile (-s) seqFile      Input path, on HDFS
--output (-o) output        Output file path, local
--substring (-b) substring  The number of chars to print per value
--count (-c)                Report the count only
--numItems (-n) numItems    Output at most <n> key-value pairs
--facets (-fa)              Output the counts per key; note that with many unique keys this can take a fair amount of memory
--help (-h)                 Print help information
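The essence of seqdumper can be sketched as follows: open the SequenceFile, report the key and value classes, print every pair, and finish with a count, mirroring the output shown in section 2.2. This is a hypothetical illustration, not the actual SequenceFileDumper implementation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqDumpSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path("datatrans/text/seq_text/chunk-0");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    try {
      System.out.println("Input Path: " + in);
      System.out.println("Key class: " + reader.getKeyClass()
          + " Value Class: " + reader.getValueClass());
      // Instantiate key/value objects of whatever Writable types the file declares.
      Writable key = (Writable)
          ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable)
          ReflectionUtils.newInstance(reader.getValueClass(), conf);
      int count = 0;
      while (reader.next(key, value)) {
        System.out.println("Key: " + key + ": Value: " + value);
        count++;
      }
      System.out.println("Count: " + count);
    } finally {
      reader.close();
    }
  }
}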

4.2 Usage

See step 3) of section 2.2 in this chapter.

5 Converting SequenceFile-format text into vector files

5.1 Introduction

In text classification and clustering, the first step is to convert the text files into SequenceFile format, as described above. This section describes converting those SequenceFile-format text files into vectorized SequenceFile format.

The seq2sparse command performs this function. It is implemented by org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in the package mahout-examples-0.6-job.jar. An invocation sketch is given after Table 5.1-1.

Table 5.1-1 seq2sparse vectorization parameters

Parameter name                      Description
--input (-i) input                  Input path (text already converted to SequenceFile format)
--output (-o) output                Output path
--chunkSize (-chunk) chunkSize      Size in MB of the data block processed at a time (default: 100)
--analyzerName (-a) analyzerName    Analyzer (tokenizer) class to use, e.g. org.apache.lucene.analysis.standard.StandardAnalyzer or org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer (default: org.apache.lucene.analysis.standard.StandardAnalyzer)
--minSupport (-s) minSupport        Words with frequency greater than minSupport become feature words (default: 2)
--minDF (-md) minDF                 When a word's DF is below minDF, minDF is used in place of the actual DF when computing the TF-IDF value (default: 1)
--maxDFPercent (-x) maxDFPercent    Remove words that appear in more than maxDFPercent percent of the documents (default: 99)
--weight (-wt) weight               Vectorization weighting method, TF or TFIDF (default: tfidf)
--norm (-n) norm                    Normalize by the specified norm (default: 0)
--minLLR (-ml) minLLR               Effective when maxNGramSize > 1; removes the less common word combinations (default: 1.0)
--numReducers (-nr) numReducers     Number of reducers (default: 1)
--maxNGramSize (-ng) ngramSize      Maximum n-gram size (default: 1)
--overwrite (-ow)                   If specified, overwrite the result of the last execution
--sequentialAccessVector (-seq)     If specified, output SequentialAccessSparseVectors, which are efficient for sequential access; otherwise use the default RandomAccessSparseVectors
--namedVector (-nv)                 If specified, the output vector type is NamedVector (default: false)
--logNormalize (-lnorm)             (Optional) Whether output vectors should be log-normalized (default: false)
--maxDFSigma (-xs) maxDFSigma       What portion of the TF (TF-IDF) vectors to use, expressed as a multiple of the standard deviation (sigma) of the document frequencies of these vectors; can be used to remove very high-frequency terms. Expressed as a double; 3.0 is a good value to specify. If the value is less than 0, no vectors are filtered out. Overrides maxDFPercent (default: -1.0)
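As a usage illustration, here is a hedged sketch of driving the vectorizer from Java by calling the class's main method with the same flags as on the command line; the input and output paths are hypothetical, reusing the seqdirectory output from section 2.2:

import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class Seq2SparseSketch {
  public static void main(String[] args) throws Exception {
    String[] seq2sparseArgs = {
        "-i", "datatrans/text/seq_text",   // SequenceFile-format text input
        "-o", "datatrans/text/vectors",    // vectorized output path
        "-wt", "tfidf",                    // weighting method: tf or tfidf
        "-ng", "1",                        // maximum n-gram size
        "-ow"                              // overwrite any previous output
    };
    // Equivalent to: $MAHOUT_HOME/bin/mahout seq2sparse <same arguments>
    SparseVectorsFromSequenceFiles.main(seq2sparseArgs);
  }
}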

5.2 Usage

See Part 8, Section 2.1.1.
