Import CSV files into Solr

Source: Internet
Author: User
Tags: solr

Today I wanted to use DIH (the DataImportHandler) to import CSV files, so I first roughed out a solution using a FileDataSource plus a custom transformer.
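In DIH terms, that wiring would look roughly like the data-config.xml below. This is only a sketch: the entity name is made up, the file path is the one used in the import URL later in this post, and I am assuming a LineEntityProcessor setup, which hands each raw line to transformers as the rawLine field.

```xml
<dataConfig>
  <!-- FileDataSource reads the CSV file from the local filesystem -->
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- LineEntityProcessor emits one row per line as "rawLine";
         the custom transformer splits it into id and name -->
    <entity name="csvline"
            processor="LineEntityProcessor"
            url="D:/dpimport/test_data2.csv"
            rootEntity="true"
            transformer="com.besttone.transformer.CSVTransformer">
      <field column="id" />
      <field column="name" />
    </entity>
  </document>
</dataConfig>
```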

```java
package com.besttone.transformer;

import java.util.Map;

public class CSVTransformer {
  // See http://wiki.apache.org/solr/DIHCustomTransformer
  public Object transformRow(Map<String, Object> row) {
    Object rawLine = row.get("rawLine");
    if (rawLine != null) {
      String[] props = rawLine.toString().split(",");
      row.put("id", props[0]);
      row.put("name", props[1]);
    }
    return row;
  }
}
```
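The weakness of this transformer is the bare split(","), which knows nothing about CSV quoting. A tiny sketch with made-up data shows a quoted field being torn apart:

```java
public class NaiveSplitDemo {
    // The same naive comma split the transformer above relies on.
    static String[] naiveSplit(String rawLine) {
        return rawLine.split(",");
    }

    public static void main(String[] args) {
        // A quoted field that itself contains a comma is torn
        // into three pieces instead of the expected two.
        String[] props = naiveSplit("1,\"Smith, John\"");
        System.out.println(props.length); // prints 3
    }
}
```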

Many problems surfaced, such as commas inside field values, which this rough transformer cannot handle. So I went back to the documentation and found that Solr already ships a CSVRequestHandler. However, it is not configured in solrconfig.xml by default, so you must configure one first:

```xml
<!-- CSV update handler, loaded on demand -->
<requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy">
</requestHandler>
```

Then enter this URL in the browser:

http://localhost:8088/solr-src/csv-core/update/csv?stream.file=D:/dpimport/test_data2.csv&stream.contentType=text/plain;charset=UTF-8&fieldnames=id,name&commit=true
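When scripting this instead of typing the URL into a browser, the parameter values should be URL-encoded (the semicolon and slash in stream.contentType in particular). A small sketch that assembles the request URL, reusing the host, core name, and file path from the example above (the `build` helper is mine; the Charset overload of URLEncoder.encode needs Java 10+):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CsvUpdateUrl {
    // Assemble the /update/csv request URL with encoded parameter values.
    static String build(String coreUrl, String file, String fieldnames) {
        return coreUrl + "/update/csv"
                + "?stream.file=" + URLEncoder.encode(file, StandardCharsets.UTF_8)
                + "&stream.contentType=" + URLEncoder.encode("text/plain;charset=UTF-8", StandardCharsets.UTF_8)
                + "&fieldnames=" + fieldnames
                + "&commit=true";
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8088/solr-src/csv-core",
                "D:/dpimport/test_data2.csv", "id,name"));
    }
}
```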

and the CSV file is imported. My CSV file has two fields, an id and a name, with some test data, for example:

1,AAA
2,BBB
...

Importing contiguous rows is of course no problem. But when a row in the middle is cleared out, the CSV file saved by Office becomes:

1,AAA
,
2,BBB

That is, the "blank" line still contains a comma. Since the id field is defined in the schema as unique and must not be empty, importing such a row throws an exception while building the index. So I extended the CSVRequestHandler source: I added an emptyLine parameter and a piece of logic in the load method:

```java
// whether empty data rows are supported
if (emptyLine) {
  int totalLength = 0;
  for (int i = 0; i < vals.length; i++) {
    totalLength += vals[i].length();
  }
  if (totalLength == 0) {
    continue;
  }
}
```
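Standalone, the check behaves like this (a sketch; `isBlankRow` is just the loop above extracted into a method): a lone comma parses into two empty strings, so the total length is 0 and the row is skipped.

```java
public class BlankRowCheck {
    // The emptyLine logic from load(), extracted: a row is blank
    // when every parsed value has zero length.
    static boolean isBlankRow(String[] vals) {
        int totalLength = 0;
        for (String v : vals) {
            totalLength += v.length();
        }
        return totalLength == 0;
    }

    public static void main(String[] args) {
        System.out.println(isBlankRow(new String[]{"", ""}));     // prints true
        System.out.println(isBlankRow(new String[]{"1", "AAA"})); // prints false
    }
}
```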

The modified CSVRequestHandler looks like this:

```java
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.solr.handler;

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.params.UpdateParams;
import org.apache.solr.common.util.StrUtils;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.update.*;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.internal.csv.CSVStrategy;
import org.apache.solr.internal.csv.CSVParser;
import org.apache.commons.io.IOUtils;

import java.util.regex.Pattern;
import java.util.List;
import java.io.*;

/**
 * @version $Id: CSVRequestHandler.java 1298169 2012-03-07 22:27:54Z uschindler $
 */
public class CSVRequestHandler extends ContentStreamHandlerBase {

  @Override
  protected ContentStreamLoader newLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
    return new SingleThreadedCSVLoader(req, processor);
  }

  //////////////////////// SolrInfoMBean methods //////////////////////

  @Override
  public String getDescription() {
    return "Add/Update multiple documents with CSV formatted rows";
  }

  @Override
  public String getVersion() {
    return "$Revision: 1298169 $";
  }

  @Override
  public String getSourceId() {
    return "$Id: CSVRequestHandler.java 1298169 2012-03-07 22:27:54Z uschindler $";
  }

  @Override
  public String getSource() {
    return "$URL: https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/solr/core/src/java/org/apache/solr/handler/CSVRequestHandler.java $";
  }
}

abstract class CSVLoader extends ContentStreamLoader {
  public static final String SEPARATOR = "separator";
  public static final String FIELDNAMES = "fieldnames";
  public static final String HEADER = "header";
  public static final String SKIP = "skip";
  public static final String SKIPLINES = "skipLines";
  public static final String MAP = "map";
  public static final String TRIM = "trim";
  public static final String EMPTY = "keepEmpty";
  public static final String SPLIT = "split";
  public static final String ENCAPSULATOR = "encapsulator";
  public static final String ESCAPE = "escape";
  public static final String OVERWRITE = "overwrite";
  public static final String EMPTYLINE = "emptyLine"; // added: whether empty data rows are supported

  private static Pattern colonSplit = Pattern.compile(":");
  private static Pattern commaSplit = Pattern.compile(",");

  final IndexSchema schema;
  final SolrParams params;
  final CSVStrategy strategy;
  final UpdateRequestProcessor processor;

  String[] fieldnames;
  SchemaField[] fields;
  CSVLoader.FieldAdder[] adders;

  int skipLines;     // number of lines to skip at start of file
  boolean emptyLine; // added: whether to support empty data rows

  final AddUpdateCommand templateAdd;

  /** Add a field to a document unless it's zero length.
   * The FieldAdder hierarchy handles all the complexity of
   * further transforming or splitting field values to keep the
   * main logic loop clean.  All implementations of add() must be
   * MT-safe!
   */
  private class FieldAdder {
    void add(SolrInputDocument doc, int line, int column, String val) {
      if (val.length() > 0) {
        doc.addField(fields[column].getName(), val, 1.0f);
      }
    }
  }

  /** add zero length fields */
  private class FieldAdderEmpty extends CSVLoader.FieldAdder {
    @Override
    void add(SolrInputDocument doc, int line, int column, String val) {
      doc.addField(fields[column].getName(), val, 1.0f);
    }
  }

  /** trim fields */
  private class FieldTrimmer extends CSVLoader.FieldAdder {
    private final CSVLoader.FieldAdder base;
    FieldTrimmer(CSVLoader.FieldAdder base) { this.base = base; }
    @Override
    void add(SolrInputDocument doc, int line, int column, String val) {
      base.add(doc, line, column, val.trim());
    }
  }

  /** map a single value.
   * for just a couple of mappings, this is probably faster than
   * using a HashMap.
   */
  private class FieldMapperSingle extends CSVLoader.FieldAdder {
    private final String from;
    private final String to;
    private final CSVLoader.FieldAdder base;
    FieldMapperSingle(String from, String to, CSVLoader.FieldAdder base) {
      this.from = from;
      this.to = to;
      this.base = base;
    }
    @Override
    void add(SolrInputDocument doc, int line, int column, String val) {
      if (from.equals(val)) val = to;
      base.add(doc, line, column, val);
    }
  }

  /** Split a single value into multiple values based on
   * a CSVStrategy.
   */
  private class FieldSplitter extends CSVLoader.FieldAdder {
    private final CSVStrategy strategy;
    private final CSVLoader.FieldAdder base;
    FieldSplitter(CSVStrategy strategy, CSVLoader.FieldAdder base) {
      this.strategy = strategy;
      this.base = base;
    }
    @Override
    void add(SolrInputDocument doc, int line, int column, String val) {
      CSVParser parser = new CSVParser(new StringReader(val), strategy);
      try {
        String[] vals = parser.getLine();
        if (vals != null) {
          for (String v : vals) base.add(doc, line, column, v);
        } else {
          base.add(doc, line, column, val);
        }
      } catch (IOException e) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, e);
      }
    }
  }

  String errHeader = "CSVLoader:";

  CSVLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
    this.processor = processor;
    this.params = req.getParams();
    schema = req.getSchema();

    templateAdd = new AddUpdateCommand();
    templateAdd.allowDups = false;
    templateAdd.overwriteCommitted = true;
    templateAdd.overwritePending = true;

    if (params.getBool(OVERWRITE, true)) {
      templateAdd.allowDups = false;
      templateAdd.overwriteCommitted = true;
      templateAdd.overwritePending = true;
    } else {
      templateAdd.allowDups = true;
      templateAdd.overwriteCommitted = false;
      templateAdd.overwritePending = false;
    }
    templateAdd.commitWithin = params.getInt(UpdateParams.COMMIT_WITHIN, -1);

    strategy = new CSVStrategy(',', '"', CSVStrategy.COMMENTS_DISABLED, CSVStrategy.ESCAPE_DISABLED,
        false, false, false, true);
    String sep = params.get(SEPARATOR);
    if (sep != null) {
      if (sep.length() != 1) throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "Invalid separator:'" + sep + "'");
      strategy.setDelimiter(sep.charAt(0));
    }

    String encapsulator = params.get(ENCAPSULATOR);
    if (encapsulator != null) {
      if (encapsulator.length() != 1) throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "Invalid encapsulator:'" + encapsulator + "'");
    }

    String escape = params.get(ESCAPE);
    if (escape != null) {
      if (escape.length() != 1) throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "Invalid escape:'" + escape + "'");
    }

    // if only encapsulator or escape is set, disable the other escaping mechanism
    if (encapsulator == null && escape != null) {
      strategy.setEncapsulator(CSVStrategy.ENCAPSULATOR_DISABLED);
      strategy.setEscape(escape.charAt(0));
    } else {
      if (encapsulator != null) {
        strategy.setEncapsulator(encapsulator.charAt(0));
      }
      if (escape != null) {
        char ch = escape.charAt(0);
        strategy.setEscape(ch);
        if (ch == '\\') {
          // If the escape is the standard backslash, then also enable
          // unicode escapes (it's harmless since 'u' would not otherwise
          // be escaped.
          strategy.setUnicodeEscapeInterpretation(true);
        }
      }
    }

    String fn = params.get(FIELDNAMES);
    fieldnames = fn != null ? commaSplit.split(fn, -1) : null;

    Boolean hasHeader = params.getBool(HEADER);

    skipLines = params.getInt(SKIPLINES, 0);
    emptyLine = params.getBool(EMPTYLINE, false); // added

    if (fieldnames == null) {
      if (null == hasHeader) {
        // assume the file has the headers if they aren't supplied in the args
        hasHeader = true;
      } else if (!hasHeader) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "CSVLoader: must specify fieldnames=<fields>* or header=true");
      }
    } else {
      // if the fieldnames were supplied and the file has a header, we need to
      // skip over that header.
      if (hasHeader != null && hasHeader) skipLines++;

      prepareFields();
    }
  }

  /** create the FieldAdders that control how each field is indexed */
  void prepareFields() {
    // Possible future optimization: for really rapid incremental indexing
    // from a POST, one could cache all of this setup info based on the params.
    // The link from FieldAdder to this would need to be severed for that to happen.
    fields = new SchemaField[fieldnames.length];
    adders = new CSVLoader.FieldAdder[fieldnames.length];
    String skipStr = params.get(SKIP);
    List<String> skipFields = skipStr == null ? null : StrUtils.splitSmart(skipStr, ',');

    CSVLoader.FieldAdder adder = new CSVLoader.FieldAdder();
    CSVLoader.FieldAdder adderKeepEmpty = new CSVLoader.FieldAdderEmpty();

    for (int i = 0; i < fields.length; i++) {
      String fname = fieldnames[i];
      // to skip a field, leave the entries in fields and addrs null
      if (fname.length() == 0 || (skipFields != null && skipFields.contains(fname))) continue;

      fields[i] = schema.getField(fname);
      boolean keepEmpty = params.getFieldBool(fname, EMPTY, false);
      adders[i] = keepEmpty ? adderKeepEmpty : adder;

      // Order that operations are applied: split -> trim -> map -> add
      // so create in reverse order.
      // Creation of FieldAdders could be optimized and shared among fields
      String[] fmap = params.getFieldParams(fname, MAP);
      if (fmap != null) {
        for (String mapRule : fmap) {
          String[] mapArgs = colonSplit.split(mapRule, -1);
          if (mapArgs.length != 2)
            throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                "Map rules must be of the form 'from:to' ,got '" + mapRule + "'");
          adders[i] = new CSVLoader.FieldMapperSingle(mapArgs[0], mapArgs[1], adders[i]);
        }
      }

      if (params.getFieldBool(fname, TRIM, false)) {
        adders[i] = new CSVLoader.FieldTrimmer(adders[i]);
      }

      if (params.getFieldBool(fname, SPLIT, false)) {
        String sepStr = params.getFieldParam(fname, SEPARATOR);
        char fsep = sepStr == null || sepStr.length() == 0 ? ',' : sepStr.charAt(0);
        String encStr = params.getFieldParam(fname, ENCAPSULATOR);
        char fenc = encStr == null || encStr.length() == 0 ? (char) -2 : encStr.charAt(0);
        String escStr = params.getFieldParam(fname, ESCAPE);
        char fesc = escStr == null || escStr.length() == 0 ? CSVStrategy.ESCAPE_DISABLED : escStr.charAt(0);

        CSVStrategy fstrat = new CSVStrategy(fsep, fenc, CSVStrategy.COMMENTS_DISABLED, fesc,
            false, false, false, false);
        adders[i] = new CSVLoader.FieldSplitter(fstrat, adders[i]);
      }
    }
  }

  private void input_err(String msg, String[] line, int lineno) {
    StringBuilder sb = new StringBuilder();
    sb.append(errHeader).append(", line=").append(lineno).append(",").append(msg).append("\n\tvalues={");
    for (String val : line) {
      sb.append("'").append(val).append("',");
    }
    sb.append('}');
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, sb.toString());
  }

  private void input_err(String msg, String[] lines, int lineNo, Throwable e) {
    StringBuilder sb = new StringBuilder();
    sb.append(errHeader).append(", line=").append(lineNo).append(",").append(msg).append("\n\tvalues={");
    if (lines != null) {
      for (String val : lines) {
        sb.append("'").append(val).append("',");
      }
    } else {
      sb.append("NO LINES AVAILABLE");
    }
    sb.append('}');
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, sb.toString(), e);
  }

  /** load the CSV input */
  @Override
  public void load(SolrQueryRequest req, SolrQueryResponse rsp, ContentStream stream) throws IOException {
    errHeader = "CSVLoader: input=" + stream.getSourceInfo();
    Reader reader = null;
    try {
      reader = stream.getReader();
      if (skipLines > 0) {
        if (!(reader instanceof BufferedReader)) {
          reader = new BufferedReader(reader);
        }
        BufferedReader r = (BufferedReader) reader;
        for (int i = 0; i < skipLines; i++) {
          r.readLine();
        }
      }

      CSVParser parser = new CSVParser(reader, strategy);

      // parse the fieldnames from the header of the file
      if (fieldnames == null) {
        fieldnames = parser.getLine();
        if (fieldnames == null) {
          throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Expected fieldnames in CSV input");
        }
        prepareFields();
      }

      // read the rest of the CSV file
      for (;;) {
        int line = parser.getLineNumber(); // for error reporting in MT mode
        String[] vals = null;
        try {
          vals = parser.getLine();
        } catch (IOException e) {
          // catch the exception and rethrow it with more line information
          input_err("can't read line: " + line, null, line, e);
        }
        if (vals == null) break;

        // added: whether empty data rows are supported
        if (emptyLine) {
          int totalLength = 0;
          for (int i = 0; i < vals.length; i++) {
            totalLength += vals[i].length();
          }
          if (totalLength == 0) {
            continue;
          }
        }

        if (vals.length != fields.length) {
          input_err("expected " + fields.length + " values but got " + vals.length, vals, line);
        }

        addDoc(line, vals);
      }
    } finally {
      if (reader != null) {
        IOUtils.closeQuietly(reader);
      }
    }
  }

  /** called for each line of values (document) */
  abstract void addDoc(int line, String[] vals) throws IOException;

  /** this must be MT safe... may be called concurrently from multiple threads. */
  void doAdd(int line, String[] vals, SolrInputDocument doc, AddUpdateCommand template) throws IOException {
    // the line number is passed simply for error reporting in MT mode.
    // first, create the lucene document
    for (int i = 0; i < vals.length; i++) {
      if (fields[i] == null) continue; // ignore this field
      String val = vals[i];
      adders[i].add(doc, line, i, val);
    }

    template.solrDoc = doc;
    processor.processAdd(template);
  }
}

class SingleThreadedCSVLoader extends CSVLoader {
  SingleThreadedCSVLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
    super(req, processor);
  }

  @Override
  void addDoc(int line, String[] vals) throws IOException {
    templateAdd.indexedId = null;
    SolrInputDocument doc = new SolrInputDocument();
    doAdd(line, vals, doc, templateAdd);
  }
}
```

With this change, adding &emptyLine=true to the request URL above avoids the exception on empty rows.

The above modification is against Solr 3.6; it will not necessarily work on other versions.
