Apache Pig and Solr question notes (i)

Source: Internet
Author: User

These notes record three problems I ran into over the last two days while working with Pig 0.12.0 and Solr 4.10.2:

(1) Question one: How do you use ASCII and hexadecimal (hex) separators in Pig to load and split data?

Note that there are two scenarios in Pig where a separator comes into play:
First: when Pig loads (load) the data.
Second: when Pig splits the data or extracts part of it.

First of all, why use a hexadecimal field separator instead of the usual space, comma, colon, semicolon, or # sign? Those characters work, but if the data itself contains one of them, parsing produces unexpected bugs. To be safe, it is a good choice to use a control character that is not visible to the naked eye as the separator. Of course, this depends on the scenario and should be decided case by case.
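To illustrate the conflict described above, here is a minimal Java sketch. The record contents and the class and method names (`DelimiterConflictDemo`, `splitFields`) are made up for illustration and are not part of the original job.

```java
import java.util.regex.Pattern;

final class DelimiterConflictDemo {
    // Split a raw line on a single-character delimiter, keeping empty fields.
    static String[] splitFields(String line, char delimiter) {
        return line.split(Pattern.quote(String.valueOf(delimiter)), -1);
    }

    public static void main(String[] args) {
        // Hypothetical record with three fields: keyword, category, count.
        // The keyword itself contains a comma.
        String commaLine = "tv, 55 inch,electronics,42";
        String ctrlLine = "tv, 55 inch\u0001electronics\u000142";

        // The comma inside the keyword cuts it in two: 4 fields instead of 3.
        System.out.println(splitFields(commaLine, ',').length);
        // ASCII 1 does not occur in the data, so we get the intended 3 fields.
        System.out.println(splitFields(ctrlLine, '\u0001').length);
    }
}
```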

For details on ASCII and on hexadecimal, binary, octal, and decimal notation, please refer to Wikipedia.

Back to the point. In this case, our data is stored in the following format:

    1. One record per line, UTF-8 encoded;
    2. Each record consists of field names and field contents;
    3. Fields are separated by ASCII code 1;
    4. A field name and its content are separated by ASCII code 2.
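The format above can be parsed along the following lines. This is only a sketch under assumptions: the class name `RecordParser` and the sample values are hypothetical, and a real record would carry many more fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RecordParser {
    static final String FIELD_SEP = "\u0001"; // ASCII 1 between fields
    static final String NAME_SEP = "\u0002";  // ASCII 2 between field name and content

    // Turn one line into an ordered map of field name -> field content.
    static Map<String, String> parse(String line) {
        Map<String, String> record = new LinkedHashMap<>();
        for (String field : line.split(FIELD_SEP, -1)) {
            int pos = field.indexOf(NAME_SEP);
            if (pos < 0) {
                continue; // fragment without a name/content separator: skip it
            }
            record.put(field.substring(0, pos), field.substring(pos + 1));
        }
        return record;
    }

    public static void main(String[] args) {
        // Hypothetical two-field record
        String line = "prod_cate_disp_id" + NAME_SEP + "019" + FIELD_SEP + "kw" + NAME_SEP + "phone";
        System.out.println(parse(line)); // {prod_cate_disp_id=019, kw=phone}
    }
}
```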



A small example in Eclipse is as follows:

    public class SplitDemo {
        public static void main(String[] args) {
            // Note: \1 and \2 are rendered differently by an IDE, by Notepad++,
            // and by a Linux terminal; see Wikipedia for details.
            // Sample data: field name and content separated by ASCII 2
            // (the invisible separator is written here as \u0002)
            String s = "prod_cate_disp_id\u0002" + "019";
            // Split rule: "\2" is the octal escape for ASCII 2
            String[] ss = s.split("\2");
            for (String st : ss) {
                System.out.println(st);
            }
        }
    }




For the delimiter types that the load function supports, refer to the documentation on the official website.
Here is the code in the Pig script:

    -- Hadoop technology exchange group: 415886155
    /* Delimiters supported by Pig include:
       1. any string;
       2. any escape character;
       3. decimal-style characters such as \\u001 or \\u002;
       4. hexadecimal characters such as \\x0A or \\x0B.
    */
    -- Note the load delimiter: it stands for ASCII 1 and is parsed directly by Pig in its decimal form
    a = load '/tmp/dongliang/20150401/20150301/tmp_search_keywords_cate_stat/' using PigStorage('\\u001');
    /*
       Note the delimiter ^B below: this is caret notation, which only appears
       when the character is displayed on a terminal. It stands for ASCII 2.
    */
    a = foreach a generate REGEX_EXTRACT($0, '(.*)^B(.*)', 2) as time,
        REGEX_EXTRACT($1, '(.*)^B(.*)', 2) as kw,
        REGEX_EXTRACT($2, '(.*)^B(.*)', 2) as ic,
        REGEX_EXTRACT($3, '(.*)^B(.*)', 2) as cid,
        REGEX_EXTRACT($4, '(.*)^B(.*)', 2) as cname,
        REGEX_EXTRACT($5, '(.*)^B(.*)', 2) as pname,
        REGEX_EXTRACT($6, '(.*)^B(.*)', 2) as snt,
        REGEX_EXTRACT($7, '(.*)^B(.*)', 2) as cnt,
        REGEX_EXTRACT($8, '(.*)^B(.*)', 2) as fnt,
        REGEX_EXTRACT($9, '(.*)^B(.*)', 2) as ant,
        REGEX_EXTRACT($10, '(.*)^B(.*)', 2) as pnt;
    -- Get the string length
    a = foreach a generate SIZE(cid) as len;
    -- Group by length
    b = group a by len;
    -- Count the records for each length
    c = foreach b generate group, COUNT($1);
    -- Print the output
    dump c;
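To make clear what the last statements of the script compute (length of cid, group by length, count per group), here is a rough Java equivalent. It is an illustration only, not the job itself: the class name `LengthHistogram` and the sample cid values are hypothetical, and the input is assumed to contain no null values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class LengthHistogram {
    // Count how many values there are for each string length,
    // mirroring "group by len" followed by COUNT in the script.
    static Map<Integer, Long> countByLength(List<String> values) {
        return values.stream()
                .collect(Collectors.groupingBy(String::length, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical cid values
        List<String> cids = Arrays.asList("019", "123456", "654321", "1234567");
        System.out.println(countByLength(cids)); // {3=1, 6=2, 7=1}
    }
}
```

A skewed or unexpected distribution here (for example, every length being 0) is a quick hint that the extraction step did not produce the values you expected.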




(2) Question two: In Apache Solr, how do you query how many records have a given length in a non-tokenized field?

Solr does not directly provide a function like Java's length() or Pig's SIZE, so how do we run such a query?

Although Solr does not support such queries directly, we can achieve the same effect indirectly with regular-expression queries, as follows:
(1) Fixed length: cid:/.{6}/ matches only records whose length is 6
(2) Length range: cid:/.{6,9}/ matches only records whose length is 6 to 9
(3) Minimum length: cid:/.{6}.*/ matches records whose length is at least 6
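The three query forms above can be generated with a small helper. This is a sketch, not part of Solr or SolrJ: the class `LengthQuery` and its methods are hypothetical names, and `matchesLocally` only approximates the behaviour with `java.util.regex`, relying on the fact that a Solr regular-expression query has to match the whole term, just as `Pattern.matches` has to match the whole string.

```java
import java.util.regex.Pattern;

final class LengthQuery {
    // field:/.{n}/ matches values of exactly n characters
    static String exact(String field, int length) {
        return field + ":/.{" + length + "}/";
    }

    // field:/.{min,max}/ matches values of min to max characters
    static String range(String field, int min, int max) {
        return field + ":/.{" + min + "," + max + "}/";
    }

    // field:/.{min}.*/ matches values of at least min characters
    static String atLeast(String field, int min) {
        return field + ":/.{" + min + "}.*/";
    }

    // Local approximation: pull the regex out of the query and test one value.
    static boolean matchesLocally(String query, String value) {
        String regex = query.substring(query.indexOf(":/") + 2, query.length() - 1);
        return Pattern.matches(regex, value);
    }

    public static void main(String[] args) {
        System.out.println(exact("cid", 6));    // cid:/.{6}/
        System.out.println(range("cid", 6, 9)); // cid:/.{6,9}/
        System.out.println(atLeast("cid", 6));  // cid:/.{6}.*/
        System.out.println(matchesLocally(range("cid", 6, 9), "1234567")); // true
    }
}
```

When such a query is sent over HTTP, the braces and slashes need to be URL-encoded. The number of matching records is then the result count that Solr reports for the query.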



(3) Question three: When batch-indexing into Solr with Pig + MapReduce, no error or exception is reported, yet there is no data in the index. Why?

This was a rather bizarre problem. At first I thought it must be a bug in the program, but then I found that the same code added data to another collection without any trouble. Looking at Solr's log, I found the following messages:

    INFO - 2015-04-01 21:08:36.097; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
    INFO - 2015-04-01 21:08:36.098; org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes. Skipping IW.commit.
    INFO - 2015-04-01 21:08:36.101; org.apache.solr.core.SolrCore; SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
    INFO - 2015-04-01 21:08:36.102; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush




These messages roughly say that the indexing run finished but no uncommitted data was found, so the commit was skipped. That was very strange, because the data source on HDFS held at least 1.1 million records. How could there be no data? Searching Google, I found that other people had hit a similar situation: no exceptions, the index rebuild reported success, yet no data appeared in the index. Most puzzling of all, none of those cases online came with a solution.

With no other option, I went back through the program and printed the intermediate data that was about to be indexed. Every row printed was empty. It turned out that the delimiter used in the regular expression that extracts the data was invalid, so nothing was extracted. That pinned down the problem: the Solr index had no data because no data was ever submitted, which produced the strange log above. After fixing the bug and rebuilding the index, the run succeeded and the data could be queried in Solr as normal.

If you run into a similar situation, first make sure that you are actually getting the data, whether it is read from a remote source or parsed from Word, Excel, or TXT files. Only once you have confirmed that the data parses correctly, and the index still fails to build, should you turn to Solr's log or the exception messages to fix it.
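A check of this kind can be sketched in Java as follows. This is a hypothetical illustration rather than the code of the actual job: the class `ExtractionCheck`, its methods, and the sample fields are made up, and I assume that a pattern which fails to match yields no value, which is consistent with the empty rows described above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtractionCheck {
    // Same shape as the pattern in the script: name, ASCII 2, content.
    static final Pattern NAME_VALUE = Pattern.compile("(.*)\u0002(.*)");

    // Returns the content part, or null when the separator is not found.
    static String extractValue(String field) {
        Matcher m = NAME_VALUE.matcher(field);
        return m.matches() ? m.group(2) : null;
    }

    // Counts the fields whose extracted content is missing or empty.
    static long countUnparsed(List<String> fields) {
        return fields.stream()
                .map(ExtractionCheck::extractValue)
                .filter(v -> v == null || v.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList(
                "cid\u0002019", // real ASCII 2 separator: parses
                "cid^B019",     // literal caret + B, not ASCII 2: does not parse
                "kw\u0002");    // separator present but content empty
        long bad = countUnparsed(fields);
        System.out.println(bad + " of " + fields.size() + " fields are empty after extraction");
        // If most fields are empty, stop here and fix the parsing before indexing.
    }
}
```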

