Apache Pig and Solr problem notes (1)

Source: Internet
Author: User
Tags: solr, solr query

These notes record a few problems I ran into over the last two days while working with Pig 0.12.0 and Solr 4.10.2. There are three in total, as follows.

Question 1: How do you load and split data in Pig using ASCII and hexadecimal separators?

Note that this problem shows up in two places in Pig:
The first is when Pig loads data with load.
The second is when Pig splits or extracts fields from the data.

First, a word on why we use hexadecimal field separators instead of the common ones such as whitespace, comma, colon, semicolon, or #. Those characters can also be used, but if the data itself contains one of them, parsing produces unexpected bugs. To be safe, it is a good idea to choose a non-printable character that the data cannot contain as the separator. Of course, this also depends on the scenario.
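To make the conflict concrete, here is a small illustrative Java sketch (the field values and record layout are invented for the example): a comma inside a field value corrupts a comma-separated record, while the invisible ASCII 1 separator is unaffected.

```java
public class SeparatorConflictDemo {
    public static void main(String[] args) {
        // A field value that happens to contain a comma
        String city = "Beijing, China";

        // Using a comma as the field separator corrupts the record:
        // the comma inside the value is indistinguishable from a separator
        String commaRecord = "1," + city + ",2015";
        System.out.println(commaRecord.split(",").length); // prints 4, not 3

        // Using the invisible ASCII 1 character avoids the conflict,
        // because it cannot appear in normal text data
        String safeRecord = "1\u0001" + city + "\u0001" + "2015";
        System.out.println(safeRecord.split("\u0001").length); // prints 3
    }
}
```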

For a detailed description of ASCII and of binary, octal, decimal, and hexadecimal notation, please refer to Wikipedia.

Back to the subject. In this example, the data is stored in the following format:


    1. One record per line, utf-8 encoding

    2. Each record includes field names and field contents

    3. Fields are separated by ASCII code 1

    4. The field name and content are separated by ASCII code 2




A small example in Eclipse is as follows


    public static void main(String[] args) {
        // Note: \1 and \2 display differently in a Linux terminal, in an IDE,
        // and in Notepad++; you can learn more about these control characters on Wikipedia.
        // Data sample: the invisible ASCII 2 character sits between the field
        // name and the field content (written explicitly here as \u0002)
        String s = "prod_cate_disp_id" + "\u0002" + "019";
        // Split rule
        String ss[] = s.split("\2");
        for (String st : ss) {
            System.out.println(st);
        }
    }





For the delimiter types supported by the load function, refer to the official documentation.
Now look at the Pig script below:


    --hadoop Technology Exchange Group 415886155
    /* Delimiters supported by Pig include:
       1. arbitrary strings
       2. arbitrary escaped characters
       3. DEC characters, e.g. \\u001 or \\u002
       4. HEX characters, e.g. \\x0A or \\x0B
    */
    --Note: this load delimiter, \\u001, is the DEC form of ASCII 1, which Pig parses directly
    a = load '/tmp/dongliang/20150401/20150301/tmp_search_keywords_cate_stat/' using PigStorage('\\u001');
    /*
    Note the delimiter ^B below: the caret notation is how a terminal
    displays the invisible character, and it represents ASCII 2
    */
    a = foreach a generate REGEX_EXTRACT($0, '(.*)^B(.*)', 2) as time,
                           REGEX_EXTRACT($1, '(.*)^B(.*)', 2) as kw,
                           REGEX_EXTRACT($2, '(.*)^B(.*)', 2) as ic,
                           REGEX_EXTRACT($3, '(.*)^B(.*)', 2) as cid,
                           REGEX_EXTRACT($4, '(.*)^B(.*)', 2) as cname,
                           REGEX_EXTRACT($5, '(.*)^B(.*)', 2) as pname,
                           REGEX_EXTRACT($6, '(.*)^B(.*)', 2) as snt,
                           REGEX_EXTRACT($7, '(.*)^B(.*)', 2) as cnt,
                           REGEX_EXTRACT($8, '(.*)^B(.*)', 2) as fnt,
                           REGEX_EXTRACT($9, '(.*)^B(.*)', 2) as ant,
                           REGEX_EXTRACT($10, '(.*)^B(.*)', 2) as pnt;
    --Get the string length
    a = foreach a generate SIZE(cid) as len;
    --Group by length
    b = group a by len;
    --Count the records under each length
    c = foreach b generate group, COUNT($1);
    --Print the output
    dump c;





Question 2: In Apache Solr, how do you count the records whose value in a non-tokenized field has a given length?

Solr does not directly provide such a function, like Java's length() or Pig's SIZE, so how should we query this?

Solr does not support such queries directly, but we can achieve the same effect in a roundabout way using regex queries:
1. Query a fixed length: cid:/.{6}/ filters only records whose cid is exactly 6 characters long
2. Query a length range: cid:/.{6,9}/ filters only records whose cid is 6 to 9 characters long
3. Query a minimum length: cid:/.{6}.*/ filters records whose cid is at least 6 characters long
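As a side note, Lucene/Solr regexp queries must match the entire term, which is the same semantics as Java's Pattern.matches(). A quick local sketch (the sample cid values are invented for illustration) previews what each of the three patterns will match:

```java
import java.util.regex.Pattern;

public class LengthQueryDemo {
    // Pattern.matches() anchors the regex to the whole input string,
    // mirroring how a Solr regexp query is matched against the whole term
    static boolean matchesTerm(String regex, String term) {
        return Pattern.matches(regex, term);
    }

    public static void main(String[] args) {
        System.out.println(matchesTerm(".{6}", "123456"));       // exactly 6 chars -> true
        System.out.println(matchesTerm(".{6}", "12345"));        // 5 chars -> false
        System.out.println(matchesTerm(".{6,9}", "1234567"));    // 7 chars, in range -> true
        System.out.println(matchesTerm(".{6}.*", "1234567890")); // at least 6 chars -> true
    }
}
```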



Question 3: When using Pig + MapReduce to build Solr indexes in bulk, no error or exception is reported, but there is no data in the index?

This was a rather bizarre problem. At first it seemed something was wrong with the program, but the same code was then found to add data successfully to another collection. Checking Solr's log revealed messages such as:



    INFO  - 2015-04-01 21:08:36.097; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
    INFO  - 2015-04-01 21:08:36.098; org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes. Skipping IW.commit.
    INFO  - 2015-04-01 21:08:36.101; org.apache.solr.core.SolrCore; SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
    INFO  - 2015-04-01 21:08:36.102; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush





The log above says that the indexing run finished, but no uncommitted data was found, so the commit was skipped. This was very strange, because the program had just run over at least 1.1 million records in HDFS, yet nothing was indexed. A Google search turned up several people reporting a similarly odd situation: the index rebuild succeeds without any exception, yet no data appears in the index; most puzzling of all, none of those reports had a solution.

With nothing else to go on, I went back over the program and printed out the intermediate data that was about to be indexed. The output was row after row of empty data: the delimiter used in the regex extraction step was wrong, so the extraction produced nothing. That located the problem. Solr's index contained no data because no data was ever submitted, which is exactly what the strange log messages described. After fixing the bug and rebuilding the index, the build succeeded, and the data could be queried normally in Solr. If you run into a similar situation, first make sure the data can be obtained and parsed correctly, whether it is read remotely or parsed from Word, Excel, or TXT files; once the data is confirmed correct, any remaining build failures can be diagnosed from Solr's log or from the exceptions thrown.
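The root cause can be reproduced in a few lines of Java. This is only an illustrative sketch (the record value is made up, and the extract helper merely mimics Pig's REGEX_EXTRACT): extracting with the wrong delimiter matches nothing, so every field comes back empty, which is exactly why Solr had nothing to commit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterBugDemo {
    // Extract the field content (group 2), mimicking Pig's REGEX_EXTRACT,
    // which returns null when the pattern does not match
    static String extract(String record, Pattern p) {
        Matcher m = p.matcher(record);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // A record using the invisible ASCII 2 between field name and content
        String record = "prod_cate_disp_id" + "\u0002" + "019";

        // Correct delimiter: extraction works
        Pattern good = Pattern.compile("(.*)\u0002(.*)");
        System.out.println(extract(record, good)); // prints 019

        // Wrong delimiter (here ASCII 1): no match, every field comes back
        // null, so the documents sent to Solr are empty and nothing commits
        Pattern bad = Pattern.compile("(.*)\u0001(.*)");
        System.out.println(extract(record, bad)); // prints null
    }
}
```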



This article is from the "7936494" blog, please be sure to keep this source http://7946494.blog.51cto.com/7936494/1627654

