These notes record three problems I ran into over the last two days while working with Pig 0.12.0 and Solr 4.10.2:
(1) Question one: How do you use ASCII control characters and hexadecimal separators in Pig to load and split data?
Note that there are two scenarios in Pig:
First: when Pig loads (load) data.
Second: when Pig splits or extracts data.
First of all, why use a hexadecimal field separator instead of the usual spaces, commas, colons, semicolons, # signs, and so on? Those characters work too, but if the data itself contains one of them, parsing will produce unexpected bugs. To be safe, a non-printable character that is invisible to the naked eye is a good choice. Of course, whether it fits depends on the scenario.
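The collision described above can be reproduced in a few lines. This is only a minimal sketch: the class name and the three field values are made up for illustration, and the second value deliberately contains a comma.

```java
import java.util.regex.Pattern;

class DelimiterCollisionDemo {
    // Hypothetical field values; the second one legitimately contains a comma.
    static final String[] FIELDS = {"20150401", "shoes, red", "42"};

    // Join the fields with the delimiter, then split the line again.
    static String[] roundTrip(String delimiter) {
        String line = String.join(delimiter, FIELDS);
        // Pattern.quote treats the delimiter literally; -1 keeps trailing empty fields.
        return line.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        // The comma inside the data adds a spurious field: 4 instead of 3.
        System.out.println(roundTrip(",").length);
        // ASCII 1 does not occur in the data, so the 3 fields survive.
        System.out.println(roundTrip("\u0001").length);
    }
}
```

With the comma, three fields go in and four come out; with ASCII 1 the count stays at three.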
For details on ASCII and on hexadecimal, binary, octal, and decimal notation, please refer to Wikipedia.
Back to the point. In this case, our data is stored in the following format:
- One record per line, UTF-8 encoding;
- Each record includes the field names and field contents;
- The fields are separated by ASCII code 1;
- Each field name and its content are separated by ASCII code 2;
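The format above can be parsed outside Pig as well. The following is a minimal sketch, not the code used in the original job: the class name `RecordParser` and the second field (`kw` with the value `shoes`) are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RecordParser {
    static final String FIELD_SEP = "\u0001";      // ASCII 1 between fields
    static final String NAME_VALUE_SEP = "\u0002"; // ASCII 2 between field name and content

    // Parse one line into an ordered map of field name to field content.
    static Map<String, String> parse(String line) {
        Map<String, String> record = new LinkedHashMap<>();
        // The control character has no special meaning in a regex, so it can be used as is.
        for (String field : line.split(FIELD_SEP, -1)) {
            int pos = field.indexOf(NAME_VALUE_SEP);
            if (pos < 0) {
                continue; // skip malformed fields that lack the name/content separator
            }
            record.put(field.substring(0, pos), field.substring(pos + 1));
        }
        return record;
    }

    public static void main(String[] args) {
        // Hypothetical two-field record.
        String line = "prod_cate_disp_id\u0002019\u0001kw\u0002shoes";
        System.out.println(parse(line)); // {prod_cate_disp_id=019, kw=shoes}
    }
}
```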
A small example in Eclipse is as follows:
Java code
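- public static void main(String[] args) {
-     // Note: \1 and \2 are rendered differently in an IDE, in Notepad++, and in a Linux terminal;
-     // you can learn more about these control characters on Wikipedia.
-     // Data sample: field name and content separated by ASCII 2.
-     // The separator is written as a Unicode escape here, because "\2019" would be read as the octal escape \201 followed by 9.
-     String s = "prod_cate_disp_id\u0002019";
-     // Split rule: "\2" is the octal escape for ASCII 2
-     String[] ss = s.split("\2");
-     for (String st : ss) {
-         System.out.println(st);
-     }
- }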
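Because these separators are invisible or rendered inconsistently, it can help to print them in escaped form while debugging. The helper below is a sketch with a made-up name (`ControlCharDebug.visible`); it is not part of the original program.

```java
class ControlCharDebug {
    // Return a copy of s in which every ISO control character is shown as a backslash-u escape.
    static String visible(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isISOControl(c)) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The ASCII 2 between the field name and 019 becomes readable in the output.
        System.out.println(visible("prod_cate_disp_id\u0002019"));
    }
}
```

Printing a raw record through such a helper shows at a glance whether the expected ASCII 1 and ASCII 2 characters are really present.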
For the types of delimiter that the load function supports, refer to the documentation on the official website.
Here's a look at the code in the Pig script:
Java code
- -- Hadoop technology exchange group: 415886155
- /* Delimiters supported by Pig include:
-    1. an arbitrary string
-    2. any escape character
-    3. dec characters such as \\u001 or \\u002
-    4. hexadecimal characters such as \\x0A or \\x0B
- */
- -- Note the load delimiter: it represents ASCII 1, written in the dec form that Pig parses directly
- a = load '/tmp/dongliang/20150401/20150301/tmp_search_keywords_cate_stat/' using PigStorage('\\u001');
- /**
-    Note the delimiter ^B below: it is the caret notation for ASCII 2 and is only displayed
-    this way in a terminal. It is the literal control character, not the two characters ^ and B.
- */
- a = foreach a generate REGEX_EXTRACT($0, '(.*)^B(.*)', 2) as time,
-                        REGEX_EXTRACT($1, '(.*)^B(.*)', 2) as kw,
-                        REGEX_EXTRACT($2, '(.*)^B(.*)', 2) as ic,
-                        REGEX_EXTRACT($3, '(.*)^B(.*)', 2) as cid,
-                        REGEX_EXTRACT($4, '(.*)^B(.*)', 2) as cname,
-                        REGEX_EXTRACT($5, '(.*)^B(.*)', 2) as pname,
-                        REGEX_EXTRACT($6, '(.*)^B(.*)', 2) as snt,
-                        REGEX_EXTRACT($7, '(.*)^B(.*)', 2) as cnt,
-                        REGEX_EXTRACT($8, '(.*)^B(.*)', 2) as fnt,
-                        REGEX_EXTRACT($9, '(.*)^B(.*)', 2) as ant,
-                        REGEX_EXTRACT($10, '(.*)^B(.*)', 2) as pnt;
- -- Get the string length of cid
- a = foreach a generate SIZE(cid) as len;
- -- Group by length
- b = group a by len;
- -- Count the records under each length
- c = foreach b generate group, COUNT($1);
- -- Print the output
- dump c;
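To check the extraction logic without a cluster, the same regular expression and the length count can be tried in plain Java. This is a sketch only: the class name, the sample values, and the choice to skip fields without a delimiter are assumptions, and the real job runs in Pig as shown above.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LengthHistogram {
    // Same shape as the Pig pattern: name, ASCII 2, content; group 2 is the content.
    static final Pattern NAME_VALUE = Pattern.compile("(.*)\u0002(.*)");

    // Return the field content, or null when the delimiter is missing.
    static String extractValue(String field) {
        Matcher m = NAME_VALUE.matcher(field);
        return m.matches() ? m.group(2) : null;
    }

    // Count how many contents have each length (the group-by-len step).
    static Map<Integer, Long> countByLength(Iterable<String> cidFields) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (String field : cidFields) {
            String value = extractValue(field);
            if (value == null) {
                continue; // assumption of this sketch: skip fields the regex cannot parse
            }
            counts.merge(value.length(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical cid fields: two of length 3, one of length 6.
        System.out.println(countByLength(Arrays.asList(
                "cid\u0002019", "cid\u0002020", "cid\u0002123456"))); // {3=2, 6=1}
    }
}
```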
(2) Question two: In Apache Solr, how do you find out how many records have a given length in a non-tokenized field?
Solr does not directly provide a function like Java's length() or Pig's SIZE, so how can we run such a query?
Although Solr does not support it directly, we can get the same result indirectly through regular-expression queries, as follows:
(1) Fixed length: cid:/.{6}/ matches only records whose cid has length 6
(2) Length range: cid:/.{6,9}/ matches only records whose cid has length 6 to 9
(3) Minimum length: cid:/.{6}.*/ matches records whose cid has length at least 6
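The three query shapes above can be generated and sanity-checked in Java. The class and method names below are hypothetical. The check uses `String.matches`, which tests the whole string; as far as I understand, a Lucene regexp query likewise has to match the whole indexed term, which is why `.{6}` selects exactly six characters. Lucene's regex dialect is not identical to Java's, but these three patterns use only constructs common to both.

```java
class SolrLengthQueries {
    // field:/.{n}/ selects terms of exactly n characters.
    static String exact(String field, int n) {
        return field + ":/.{" + n + "}/";
    }

    // field:/.{min,max}/ selects terms of min to max characters.
    static String range(String field, int min, int max) {
        return field + ":/.{" + min + "," + max + "}/";
    }

    // field:/.{min}.*/ selects terms of at least min characters.
    static String atLeast(String field, int min) {
        return field + ":/.{" + min + "}.*/";
    }

    // Strip the leading "field:/" and the trailing "/" to get the bare regex.
    static String regexOf(String query) {
        return query.substring(query.indexOf(":/") + 2, query.length() - 1);
    }

    public static void main(String[] args) {
        String q = exact("cid", 6);
        System.out.println(q);                                   // cid:/.{6}/
        System.out.println("123456".matches(regexOf(q)));        // true
        System.out.println("1234567".matches(regexOf(q)));       // false
    }
}
```

To get only the count, the query can be sent with rows=0 and the total read from numFound in the response.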
(3) Question three: When batch indexing into Solr with Pig + MapReduce, no error or exception is reported, yet the index contains no data. Why?
This was a rather bizarre problem. At first I thought it must be a bug in the program, but then I found that the same code added data to another collection without trouble. Looking at Solr's log, I found the following messages:
Java code
- INFO - 2015-04-01 21:08:36.097; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
- INFO - 2015-04-01 21:08:36.098; org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes. Skipping IW.commit.
- INFO - 2015-04-01 21:08:36.101; org.apache.solr.core.SolrCore; SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
- INFO - 2015-04-01 21:08:36.102; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
These messages say, roughly, that indexing had finished but Solr found no uncommitted data, so it skipped the commit. That was very strange, because the HDFS data source held at least 1.1 million records, so how could there be no data? A Google search showed that other people had hit the same odd situation: no exceptions, the index rebuilt successfully, yet no data in the index. Most puzzling of all, none of those cases online came with a solution.
With no other option, I went back through the program. This time I printed the intermediate data that was about to be indexed, and every row came out empty. The delimiter used in the regular expression that extracts the data was invalid, so nothing was extracted. That pinned down the problem: the Solr index was empty because no data had actually been submitted, which is what produced the strange log. After I fixed the bug and rebuilt the index, it succeeded, and the data could be queried in Solr as normal. If you run into a similar situation, first make sure you are actually getting the data, whether you read it from a remote source or parse it out of Word, Excel, or TXT files. Only once you have confirmed that the data parses correctly, and the index still fails to build, should you move on to Solr's log or the exception it throws.
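One way to catch this kind of silent failure early is to check the parsed records before anything is sent to Solr. The guard below is a sketch under assumptions: the class `IndexGuard`, its method names, and the representation of a record as a map of field name to content are all invented for illustration, and the actual indexing call is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class IndexGuard {
    // A record is usable if at least one field has non-blank content.
    static boolean hasContent(Map<String, String> doc) {
        for (String value : doc.values()) {
            if (value != null && !value.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    // Return only the usable records; fail loudly if every record is empty.
    static List<Map<String, String>> requireUsable(List<Map<String, String>> parsed) {
        List<Map<String, String>> usable = new ArrayList<>();
        for (Map<String, String> doc : parsed) {
            if (hasContent(doc)) {
                usable.add(doc);
            }
        }
        if (usable.isEmpty()) {
            throw new IllegalStateException("parsed " + parsed.size()
                    + " records but none has any field content; check the delimiter");
        }
        return usable;
    }

    public static void main(String[] args) {
        // Simulates the bug: the regex extracted nothing, so every field is empty.
        List<Map<String, String>> parsed = new ArrayList<>();
        parsed.add(Collections.singletonMap("cid", ""));
        try {
            requireUsable(parsed);
        } catch (IllegalStateException e) {
            System.out.println("refusing to index: " + e.getMessage());
        }
    }
}
```

Failing before the commit turns an empty index with a clean log into an explicit error that points at the parsing step.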
Apache Pig and Solr question notes (i)