Apache Pig and Solr problem notes (1)

Source: Internet
Author: User
Tags: solr, solr query

These notes record a few problems I ran into over the last two days while working with Pig 0.12.0 and Solr 4.10.2. There are three in total, as follows.

Question 1: How do you load and split data in Pig using ASCII and hexadecimal separators?

Note that this problem shows up in two places in Pig:
The first is when Pig loads data with load.
The second is when Pig splits or extracts fields from the data.

First, a word on why we use hexadecimal field separators instead of the common ones such as whitespace, comma, colon, semicolon, or #. Those characters can also be used, but if the data itself contains one of them, parsing produces unexpected bugs. To be safe, it is a good idea to choose a non-printable character that the data cannot contain as the separator. Of course, this also depends on the scenario.
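To make the conflict concrete, here is a small illustrative Java sketch (the field values and record layout are invented for the example): a comma inside a field value corrupts a comma-separated record, while the invisible ASCII 1 separator is unaffected.

```java
public class SeparatorConflictDemo {
    public static void main(String[] args) {
        // A field value that happens to contain a comma
        String city = "Beijing, China";

        // Using a comma as the field separator corrupts the record:
        // the comma inside the value is indistinguishable from a separator
        String commaRecord = "1," + city + ",2015";
        System.out.println(commaRecord.split(",").length); // prints 4, not 3

        // Using the invisible ASCII 1 character avoids the conflict,
        // because it cannot appear in normal text data
        String safeRecord = "1\u0001" + city + "\u0001" + "2015";
        System.out.println(safeRecord.split("\u0001").length); // prints 3
    }
}
```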

For a detailed description of ASCII and of binary, octal, decimal, and hexadecimal notation, please refer to Wikipedia.

Back to the subject. In this example, the data is stored in the following format:


    1. One record per line, utf-8 encoding

    2. Each record includes field names and field contents

    3. Fields are separated by ASCII code 1

    4. The field name and content are separated by ASCII code 2




A small example in Eclipse is as follows


    public static void main(String[] args) {
        // Note: \1 and \2 display differently in a Linux terminal, in an IDE,
        // and in Notepad++; you can learn more about these control characters on Wikipedia.
        // Data sample: the invisible ASCII 2 character sits between the field
        // name and the field content (written explicitly here as \u0002)
        String s = "prod_cate_disp_id" + "\u0002" + "019";
        // Split rule
        String ss[] = s.split("\2");
        for (String st : ss) {
            System.out.println(st);
        }
    }





For the delimiter types supported by the load function, refer to the official documentation.
Now look at the Pig script below:


    --hadoop Technology Exchange Group 415886155
    /* Delimiters supported by Pig include:
       1. arbitrary strings
       2. arbitrary escaped characters
       3. DEC characters, e.g. \\u001 or \\u002
       4. HEX characters, e.g. \\x0A or \\x0B
    */
    --Note: this load delimiter, \\u001, is the DEC form of ASCII 1, which Pig parses directly
    a = load '/tmp/dongliang/20150401/20150301/tmp_search_keywords_cate_stat/' using PigStorage('\\u001');
    /*
    Note the delimiter ^B below: the caret notation is how a terminal
    displays the invisible character, and it represents ASCII 2
    */
    a = foreach a generate REGEX_EXTRACT($0, '(.*)^B(.*)', 2) as time,
                           REGEX_EXTRACT($1, '(.*)^B(.*)', 2) as kw,
                           REGEX_EXTRACT($2, '(.*)^B(.*)', 2) as ic,
                           REGEX_EXTRACT($3, '(.*)^B(.*)', 2) as cid,
                           REGEX_EXTRACT($4, '(.*)^B(.*)', 2) as cname,
                           REGEX_EXTRACT($5, '(.*)^B(.*)', 2) as pname,
                           REGEX_EXTRACT($6, '(.*)^B(.*)', 2) as snt,
                           REGEX_EXTRACT($7, '(.*)^B(.*)', 2) as cnt,
                           REGEX_EXTRACT($8, '(.*)^B(.*)', 2) as fnt,
                           REGEX_EXTRACT($9, '(.*)^B(.*)', 2) as ant,
                           REGEX_EXTRACT($10, '(.*)^B(.*)', 2) as pnt;
    --Get the string length
    a = foreach a generate SIZE(cid) as len;
    --Group by length
    b = group a by len;
    --Count the records under each length
    c = foreach b generate group, COUNT($1);
    --Print the output
    dump c;





Question 2: In Apache Solr, how do you count the records whose value in a non-tokenized field has a given length?

Solr does not directly provide such a function, like Java's length() or Pig's SIZE, so how should we query this?

Solr does not support such queries directly, but we can achieve the same effect in a roundabout way using regex queries:
1. Query a fixed length: cid:/.{6}/ filters only records whose cid is exactly 6 characters long
2. Query a length range: cid:/.{6,9}/ filters only records whose cid is 6 to 9 characters long
3. Query a minimum length: cid:/.{6}.*/ filters records whose cid is at least 6 characters long
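As a side note, Lucene/Solr regexp queries must match the entire term, which is the same semantics as Java's Pattern.matches(). A quick local sketch (the sample cid values are invented for illustration) previews what each of the three patterns will match:

```java
import java.util.regex.Pattern;

public class LengthQueryDemo {
    // Pattern.matches() anchors the regex to the whole input string,
    // mirroring how a Solr regexp query is matched against the whole term
    static boolean matchesTerm(String regex, String term) {
        return Pattern.matches(regex, term);
    }

    public static void main(String[] args) {
        System.out.println(matchesTerm(".{6}", "123456"));       // exactly 6 chars -> true
        System.out.println(matchesTerm(".{6}", "12345"));        // 5 chars -> false
        System.out.println(matchesTerm(".{6,9}", "1234567"));    // 7 chars, in range -> true
        System.out.println(matchesTerm(".{6}.*", "1234567890")); // at least 6 chars -> true
    }
}
```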



Question 3: When using Pig + MapReduce to build Solr indexes in bulk, no error or exception is reported, but there is no data in the index?

This was a rather bizarre problem. At first it seemed something was wrong with the program, but the same code was then found to add data successfully to another collection. Checking Solr's log revealed messages such as:



    INFO  - 2015-04-01 21:08:36.097; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
    INFO  - 2015-04-01 21:08:36.098; org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes. Skipping IW.commit.
    INFO  - 2015-04-01 21:08:36.101; org.apache.solr.core.SolrCore; SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
    INFO  - 2015-04-01 21:08:36.102; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush





The log above says that the indexing run finished, but no uncommitted data was found, so the commit was skipped. This was very strange, because the program had just run over at least 1.1 million records in HDFS, yet nothing was indexed. A Google search turned up several people reporting a similarly odd situation: the index rebuild succeeds without any exception, yet no data appears in the index; most puzzling of all, none of those reports had a solution.

With nothing else to go on, I went back over the program and printed out the intermediate data that was about to be indexed. The output was row after row of empty data: the delimiter used in the regex extraction step was wrong, so the extraction produced nothing. That located the problem. Solr's index contained no data because no data was ever submitted, which is exactly what the strange log messages described. After fixing the bug and rebuilding the index, the build succeeded, and the data could be queried normally in Solr. If you run into a similar situation, first make sure the data can be obtained and parsed correctly, whether it is read remotely or parsed from Word, Excel, or TXT files; once the data is confirmed correct, any remaining build failures can be diagnosed from Solr's log or from the exceptions thrown.
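The root cause can be reproduced in a few lines of Java. This is only an illustrative sketch (the record value is made up, and the extract helper merely mimics Pig's REGEX_EXTRACT): extracting with the wrong delimiter matches nothing, so every field comes back empty, which is exactly why Solr had nothing to commit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterBugDemo {
    // Extract the field content (group 2), mimicking Pig's REGEX_EXTRACT,
    // which returns null when the pattern does not match
    static String extract(String record, Pattern p) {
        Matcher m = p.matcher(record);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // A record using the invisible ASCII 2 between field name and content
        String record = "prod_cate_disp_id" + "\u0002" + "019";

        // Correct delimiter: extraction works
        Pattern good = Pattern.compile("(.*)\u0002(.*)");
        System.out.println(extract(record, good)); // prints 019

        // Wrong delimiter (here ASCII 1): no match, every field comes back
        // null, so the documents sent to Solr are empty and nothing commits
        Pattern bad = Pattern.compile("(.*)\u0001(.*)");
        System.out.println(extract(record, bad)); // prints null
    }
}
```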



This article is from the "7936494" blog, please be sure to keep this source http://7946494.blog.51cto.com/7936494/1627654

