Text Mining Example

Source: Internet
Author: User

I. Development environment

1. OS: Windows 7

2. IDE: Eclipse

3. Java: JDK 1.6


II. Required jar packages

1. lucene-core-3.1.0.jar

2. paoding-analysis.jar

3. The data dictionary (dic)


III. Cluster environment

1. Nodes: 1 master, 2 slaves

2. OS: Red Hat 6.2

3. JDK: 1.6

4. Hadoop: 1.1.2

5. Mahout: 0.6

6. Pig: 0.11


IV. Data preparation

1. Model (training) data: 18.7 MB, 8,000+ small files

2. Test data: 19.2 MB, 9,000+ small files


V. Development steps

(i) Prepare the data and build the CBayes model

1. The model data consists of more than 8,000 small files. If they were read with MapReduce's default FileInputFormat, at least 8,000+ map tasks would be generated and efficiency would be very low. To deal with the small-file problem, you need a custom input format that extends CombineFileInputFormat, which packs multiple small files into a single split.

The custom CombineFileInputFormat and RecordReader code is as follows:

1) Custom CombineFileInputFormat

package fileinputformat;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class MyFileInputFormat extends CombineFileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        CombineFileRecordReader<Text, Text> recordReader =
                new CombineFileRecordReader<Text, Text>((CombineFileSplit) split, context, MyFileRecordReader.class);
        // Return the custom RecordReader
        return recordReader;
    }

    // A file must stay whole within one split, and one split can contain multiple files
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
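
The driver shown later caps the size of a combined split through the mapreduce.input.fileinputformat.split.maxsize property. As an alternative sketch (this constructor is not part of the original article; it only relies on the standard protected setMaxSplitSize helper of CombineFileInputFormat), the same cap could be set inside the custom format itself:

// Sketch only: a constructor that could be added to MyFileInputFormat above,
// setting the maximum combined split size in code instead of (or in addition to)
// the driver-side configuration property.
public MyFileInputFormat() {
    super();
    setMaxSplitSize(4000000); // combine small files into splits of roughly 4 MB at most, matching the driver value
}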

2) Custom RecordReader

package fileinputformat;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class MyFileRecordReader extends RecordReader<Text, Text> {

    private Text currentKey = new Text();      // current key (the parent directory name, i.e. the category)
    private Text currentValue = new Text();    // current value (the whole file content)
    private Configuration conf;                // task configuration
    private boolean processed;                 // whether the current file has already been read
    private CombineFileSplit split;            // the combined split being processed
    private Path[] paths;                      // the file paths contained in the split
    private int totalLength;                   // number of files contained in the split
    private int index;                         // index of the current file within the split
    private float currentProgress = 0;         // current processing progress

    public MyFileRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) throws IOException {
        super();
        this.split = split;
        this.index = index;                    // index of the small file (block) inside the CombineFileSplit
        this.conf = context.getConfiguration();
        this.totalLength = split.getPaths().length;
        this.processed = false;
    }

    @Override
    public void close() throws IOException {
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return currentKey;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return currentValue;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        if (index >= 0 && index < totalLength) {
            currentProgress = (float) index / totalLength;
        }
        return currentProgress;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        this.split = (CombineFileSplit) split;
        this.paths = this.split.getPaths();
        this.totalLength = paths.length;
        // Record the name of the small file currently being processed
        context.getConfiguration().set("map.input.file.name", this.split.getPath(index).getName());
    }

    // Reads the entire content of one file at a time and emits it as a single record
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {                      // if the file has not been processed yet, read it and set key/value
            // Set the key: the parent directory name, which serves as the document's category label
            Path file = split.getPath(index);
            currentKey.set(file.getParent().getName());
            // Set the value: the full file content
            FSDataInputStream in = null;
            byte[] contents = new byte[(int) split.getLength(index)];
            try {
                FileSystem fs = file.getFileSystem(conf);
                in = fs.open(file);
                in.readFully(contents);
                currentValue.set(contents);
            } catch (Exception e) {
                // ignore and emit whatever was read
            } finally {
                if (in != null) {
                    in.close();
                }
            }
            processed = true;
            return true;
        }
        return false;                          // must return false once the file has been processed
    }
}


3) Custom MapReduce job and driver code

package mr;

import java.io.IOException;
import java.io.StringReader;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import fileinputformat.MyFileInputFormat;

public class CountApp extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new CountApp(), args);
    }

    static class MyMapper extends Mapper<Text, Text, Text, Text> {
        Text outValue = new Text();
        PaodingAnalyzer analyzer = new PaodingAnalyzer();

        @Override
        protected void map(Text key, Text value, Context ctx) throws IOException, InterruptedException {
            System.out.println(key.toString());
            // Segment the document text with the Paoding analyzer and join the tokens with spaces
            String line = value.toString();
            StringReader sr = new StringReader(line);
            TokenStream ts = analyzer.tokenStream("", sr);
            StringBuilder sb = new StringBuilder();
            while (ts.incrementToken()) {
                CharTermAttribute ta = ts.getAttribute(CharTermAttribute.class);
                sb.append(ta.toString());
                sb.append(" ");
            }
            outValue.set(sb.toString().trim());
            ctx.write(key, outValue);  // key = category (directory name), value = segmented text
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap the combined split size at about 4 MB so the 8,000+ small files collapse into a handful of map tasks
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 4000000);
        conf.setInt("mapred.min.split.size", 1);
        conf.setInt("mapred.reduce.tasks", 5);

        Job job = new Job(conf, CountApp.class.getSimpleName());
        job.setJarByClass(CountApp.class);

        // Use the custom FileInputFormat
        job.setInputFormatClass(MyFileInputFormat.class);
        job.setMapperClass(MyMapper.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        Path outPath = new Path(args[1]);
        Path inPath = new Path(args[0]);

        // Add every category subdirectory under the input path
        FileSystem ifs = inPath.getFileSystem(conf);
        FileStatus[] inPaths = ifs.listStatus(inPath);
        for (FileStatus fls : inPaths) {
            FileInputFormat.addInputPath(job, fls.getPath());
        }

        FileSystem fs = outPath.getFileSystem(conf);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        FileOutputFormat.setOutputPath(job, outPath);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.waitForCompletion(true);
        return 0;
    }
}


4) Build the jar package and run it on Hadoop

hadoop jar wordcount.jar /yu/news /yu/out/news

Raw data:

Http://mp3.zol.com.cn/300/3003150.html
Multi-point recommendation Taiwan Electric Flat-panel purchase introduction
Multi-point recommendation Taiwan Electric Flat-panel purchase introduction


The most affordable, thinnest texture, the most comprehensive performance, is to let more people enjoy digital life. All the time, Taiwan Electric Design Factory has been able to get a good market response, and even once to meet the demand for hot selling situation.

......


The generated data

MP3 HTTP MP 3 mp3 Zol com CN 3.zol.com.cn 289 2897244 HTML 2897244.html 10 hours play A13 pu Momo 9 momo9 Enhanced enhanced Range Aerial Survey test 10 Hours play A13 Momo 9 momo9 Strengthen the enhanced version of the Aerial survey test April lover Tip launch new 7 inch A13 Master Game tablet Momo 9 Momo9 enhanced version 22 generation abbreviation Momo 9 mom O9 strengthens the edition to enjoy the everlasting sincere love Pepnice Momo 9 momo9 Strengthen the built-in 7-inch

.......
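
For reference, the token loop used in the mapper above can also be run standalone, which is handy for checking the dictionary setup before submitting the job. The sketch below is not from the original article: the class name SegmentDemo and the sample sentence are only for illustration, and it assumes the Paoding dictionary directory is configured outside the code (for example through the paoding.dic.home system property).

import java.io.StringReader;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SegmentDemo {
    public static void main(String[] args) throws Exception {
        // Same Paoding + Lucene 3.x token loop as in the MapReduce mapper above
        PaodingAnalyzer analyzer = new PaodingAnalyzer();
        // Sample Chinese sentence for segmentation ("a piece of Chinese text used to test segmentation")
        TokenStream ts = analyzer.tokenStream("", new StringReader("这是一段用于测试分词的中文文本"));
        StringBuilder sb = new StringBuilder();
        while (ts.incrementToken()) {
            CharTermAttribute ta = ts.getAttribute(CharTermAttribute.class);
            sb.append(ta.toString()).append(" ");
        }
        System.out.println(sb.toString().trim()); // space-separated tokens, like the job output shown above
    }
}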


5) Generate the CBayes model from the generated (segmented) model data

Start Pig and run:

processed = load '/yu/news' as (category:chararray, doc:chararray);

test = sample processed 0.2;

jnt = join processed by (category, doc) left outer, test by (category, doc);

filt_test = filter jnt by test::category is null;

train = foreach filt_test generate processed::category as category, processed::doc as doc;

store test into '/yu/model/test';

store train into '/yu/model/train';

Here sample 0.2 takes roughly 20% of the records as the test set, and the left outer join followed by the is null filter keeps exactly the records that were not sampled, which become the training set. The load uses the default tab delimiter, matching the key/value output of the segmentation job.


Execute the mahout command to generate the training model

mahout trainclassifier \

  -i /yu/out/model/test \

  -o /yu/out/train/model/cbayes \

  -type cbayes \

  -ng 1 \

  -source hdfs


(ii) Use the generated CBayes model to classify the test data and find the category each user is most interested in

1. Preprocess (segment) the test data

Execute the command:

hadoop jar wordcount.jar /yu/user_sport /yu/out/user_sport

The resulting data is as follows:

12213800 interview Sun Fengfeng Wu Women's basketball talents face the fault of the Olympic Games to the top 8 not easy Sohu Sports Pei Zhang Zhangliang Month Beijing Gazette report the first Olympic Games will be opened today China women's basketball basketball airport departure to France will be in France Three warm-up days to go to the British Leeds China Representative delegation before the camp before the Chinese women's basketball manager Sun Fengfeng was accepted by the Sohu Sports exclusive special interview that the Olympic Games will be the largest Chinese women's basketball big The task is to successfully complete the succession a group of young athletes under the age of mobilization not often have the potential of this Olympic Games will be for them to say is very good exercise opportunities the next Olympic Games will be their positive value


2. Use the Mahout API in a custom MapReduce job to load the CBayes model

1) Write the driver class with the main entry point; the relevant code is as follows:

package mahout;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.classifier.bayes.BayesParameters;

public class BayesDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new BayesDriver(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        // args[0] = input dir, args[1] = output dir, args[2] = model path, args[3] = classifier type (cbayes)
        BayesParameters params = new BayesParameters();
        params.set("classifierType", args[3]);
        params.set("alpha_i", "1.0");
        params.set("defaultCat", "unknown");
        params.setGramSize(1);
        params.setBasePath(args[2]);

        Configuration conf = new Configuration();
        // Pass the Bayes parameters (including the model path) to the mappers
        conf.set("bayes.parameters", params.toString());
        Job job = new Job(conf, BayesDriver.class.getSimpleName());
        job.setJarByClass(BayesDriver.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        Path inPath = new Path(args[0]);
        Path outPath = new Path(args[1]);
        FileInputFormat.addInputPath(job, inPath);
        FileOutputFormat.setOutputPath(job, outPath);

        FileSystem fs = outPath.getFileSystem(conf);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        job.waitForCompletion(true);
        return 0;
    }
}

2) Write the custom Mapper class

package mahout;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.Algorithm;
import org.apache.mahout.classifier.bayes.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.BayesParameters;
import org.apache.mahout.classifier.bayes.CBayesAlgorithm;
import org.apache.mahout.classifier.bayes.ClassifierContext;
import org.apache.mahout.classifier.bayes.Datastore;
import org.apache.mahout.classifier.bayes.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.InvalidDatastoreException;
import org.apache.mahout.common.nlp.NGrams;

public class MyMapper extends Mapper<Text, Text, Text, LongWritable> {

    private ClassifierContext classifier;
    private String defaultCategory;
    private int gramSize = 1;
    Text outKey = new Text();
    LongWritable one = new LongWritable(1);

    @Override
    protected void setup(Context ctx) throws IOException, InterruptedException {
        // Rebuild the Bayes parameters passed in by the driver and load the model once per mapper
        Configuration conf = ctx.getConfiguration();
        BayesParameters params = new BayesParameters(conf.get("bayes.parameters", ""));
        Algorithm algorithm;
        Datastore datastore;
        if ("bayes".equalsIgnoreCase(params.get("classifierType"))) {
            algorithm = new BayesAlgorithm();
            datastore = new InMemoryBayesDatastore(params);
        } else if ("cbayes".equalsIgnoreCase(params.get("classifierType"))) {
            algorithm = new CBayesAlgorithm();
            datastore = new InMemoryBayesDatastore(params);
        } else {
            throw new IllegalArgumentException("Unrecognized classifier type: " + params.get("classifierType"));
        }

        classifier = new ClassifierContext(algorithm, datastore);
        try {
            classifier.initialize();
        } catch (InvalidDatastoreException e) {
            e.printStackTrace();
        }
        defaultCategory = params.get("defaultCat");
        gramSize = params.getGramSize();
    }

    @Override
    protected void map(Text key, Text value, Context ctx) throws IOException, InterruptedException {
        // key = user id, value = segmented document text
        String docLabel = "";
        String userId = key.toString();
        List<String> ngrams = new NGrams(value.toString(), gramSize).generateNGramsWithoutLabel();
        ClassifierResult result;
        try {
            result = classifier.classifyDocument(ngrams.toArray(new String[ngrams.size()]), defaultCategory);
            docLabel = result.getLabel();
        } catch (InvalidDatastoreException e) {
            e.printStackTrace();
        }
        // Emit "userId|category" with a count of 1
        outKey.set(userId + "|" + docLabel);
        ctx.write(outKey, one);
    }
}


3) Write the custom Reducer class

package mahout;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, LongWritable, NullWritable, Text> {

    private Text outValue = new Text();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
            throws IOException, InterruptedException {
        // Sum the visit counts for each "userId|category" key
        long sum = 0;
        for (LongWritable lw : vals) {
            sum += lw.get();
        }
        outValue.set(key.toString() + "|" + sum);
        ctx.write(NullWritable.get(), outValue);
    }
}


3. Package the project into a jar and execute the command:

hadoop jar cbayessort.jar \

  /yu/out/user_sport \

  /yu/out/user_info \

  /yu/model/cbayes \

  cbayes

The four arguments map to the driver's args: the segmented test data directory, the output directory, the storage path of the generated CBayes model, and the classifier type.

The resulting data set is as follows:

10511838|camera|7
10511838|household|2
10511838|mobile|53
10564290|camera|4
10564290|household|4
10564290|mobile|80
107879|camera|8
107879|household|1
107879|mobile|83
11516148|camera|12
11516148|household|1

......

4. Start Pig and execute the following commands:

u_ct = load '/yu/out/user_info' using PigStorage('|') as (userId:chararray, category:chararray, visitNums:int);

u_stat = foreach (group u_ct by userId) {

    sorted = order u_ct by visitNums desc;

    top = limit sorted 1;

    generate flatten(top), SUM(u_ct.visitNums);

};

store u_stat into '/yu/out/user_info_stort';

The resulting data is as follows (user ID, most-visited category, visits to that category, total visits):

10511838 Mobile 53 62
10564290 Mobile 80 88
107879 Mobile 83 92
11516148 Mobile 80 93
11837625 Mobile 91 100
11845829 Mobile 161 183
11884229 Mobile
