grunt> cat /opt/dataset/input.txt
keyword1 keyword2
keyword2 keyword4
keyword3 keyword1
keyword4 keyword4

-- Load each line of the file as a single chararray field.
A = LOAD '/opt/dataset/input.txt' USING PigStorage('\n') AS (line:chararray);
-- Split every line into a bag of words.
B = FOREACH A GENERATE TOKENIZE((chararray)$0);
-- Flatten the bags so that each word becomes its own row.
C = FOREACH B GENERATE FLATTEN($0) AS word;
-- Group identical words together and count each group.
D = GROUP C BY word;
E = FOREACH D GENERATE COUNT(C), group;

dump B;
({(keyword1),(keyword2)})
({(keyword2),(keyword4)})
({(keyword3),(keyword1)})
({(keyword4),(keyword4)})

dump C;
(keyword1)
(keyword2)
(keyword2)
(keyword4)
(keyword3)
(keyword1)
(keyword4)
(keyword4)

dump D;
(keyword1,{(keyword1),(keyword1)})
(keyword2,{(keyword2),(keyword2)})
(keyword3,{(keyword3)})
(keyword4,{(keyword4),(keyword4),(keyword4)})

dump E;
(2,keyword1)
(2,keyword2)
(1,keyword3)
(3,keyword4)

store E into './wordcount';
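
For readers without a Pig installation at hand, the data flow of the session above can be mirrored in a few lines of Python. This is only a rough sketch, not how Pig executes the job: the names `lines`, `bags`, `words` and `counts` are made up for illustration, Python lists stand in for Pig bags, and `str.split()` is assumed to be a good enough substitute for TOKENIZE on this particular input.

```python
from collections import Counter
from itertools import chain

# The four lines of /opt/dataset/input.txt from the session above.
lines = [
    "keyword1 keyword2",
    "keyword2 keyword4",
    "keyword3 keyword1",
    "keyword4 keyword4",
]

# B: one bag (here a list) of words per input line.
# str.split() splits on whitespace only, which suffices for this input.
bags = [line.split() for line in lines]

# C: flattening the bags yields one word per row.
words = list(chain.from_iterable(bags))

# D and E: grouping by word and counting each group.
counts = Counter(words)

# Print in the same (count,word) layout as "dump E".
for word in sorted(counts):
    print(f"({counts[word]},{word})")
```

The intermediate values `bags` and `words` correspond to what `dump B` and `dump C` show, and the printed lines correspond to `dump E`.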
<pre code_snippet_id= "327646" snippet_file_name= "blog_20140505_2_6349649" name= "code" class= "Java" >TOKENIZE
Splits a string and outputs a bag of words.

Syntax
TOKENIZE(expression)

Terms
expression    An expression with data type chararray.

Usage
Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple).
The following characters are considered to be word separators: space, double quote ("), comma (,), parentheses (()), star (*).

Example
In this example the strings in each row are split.
A = LOAD 'data' AS (f1:chararray);

DUMP A;
(Here is the first string.)
(Here is the second string.)
(Here is the third string.)

X = FOREACH A GENERATE TOKENIZE(f1);

DUMP X;
({(Here),(is),(the),(first),(string.)})
({(Here),(is),(the),(second),(string.)})
({(Here),(is),(the),(third),(string.)})
</pre>
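
The separator rule in the TOKENIZE description can be tried out with a small Python stand-in. This is a sketch under stated assumptions rather than Pig's actual implementation: the helper `tokenize` and the constant `SEPARATORS` are hypothetical names, the sample input is invented, and runs of adjacent separators are assumed to produce no empty words.

```python
import re

# Separators named in the description: space, double quote, comma,
# parentheses and star. Assumption: a run of separators counts as one.
SEPARATORS = re.compile(r'[ ",()*]+')


def tokenize(text):
    """Return the words of text as a list, standing in for a Pig bag."""
    return [word for word in SEPARATORS.split(text) if word]


# Hypothetical input that exercises every separator in the list.
print(tokenize('keyword1,keyword2 "keyword3" (keyword4)*'))
```

Characters outside the separator list, such as a full stop, stay attached to the word, which is why a word like `string.` keeps its trailing period.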