Today I wanted to use DIH (DataImportHandler) to import CSV files, so I first put together a rough solution using FileDataSource plus a custom transformer.
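For context, the DIH wiring for this approach would look roughly like the data-config.xml below. This is a minimal sketch: the dataSource name and entity name are illustrative, while LineEntityProcessor (which emits each file line as a "rawLine" field) and the file path come from this post.

```xml
<dataConfig>
  <!-- FileDataSource reads the CSV file from local disk -->
  <dataSource name="fds" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- LineEntityProcessor hands each raw line to the transformer
         as a field named "rawLine" -->
    <entity name="csv"
            dataSource="fds"
            processor="LineEntityProcessor"
            url="D:/dpimport/test_data2.csv"
            transformer="com.besttone.transformer.CSVTransformer"/>
  </document>
</dataConfig>
```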
package com.besttone.transformer;

import java.util.Map;

// Reference: http://wiki.apache.org/solr/DIHCustomTransformer
public class CSVTransformer {
    public Object transformRow(Map<String, Object> row) {
        Object rawLine = row.get("rawLine");
        if (rawLine != null) {
            String[] props = rawLine.toString().split(",");
            row.put("id", props[0]);
            row.put("name", props[1]);
        }
        return row;
    }
}
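The weakness of this transformer is the naive split(","): a quoted field that itself contains a comma is broken apart. The small standalone demo below (class and method names are mine, for illustration only) shows the failure.

```java
public class NaiveSplitDemo {
    // Naive comma split, exactly like the transformer above.
    static String[] naiveSplit(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // A quoted field containing a comma is split into two pieces,
        // so we get three values instead of the expected two.
        String[] props = naiveSplit("3,\"Smith, John\"");
        System.out.println(props.length); // prints 3
    }
}
```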
This rough transformer soon ran into problems, such as commas inside field values, which it simply cannot handle. Reading the documentation further, I found that Solr ships a CSVRequestHandler; however, it is not configured in the default solrconfig.xml, so one must be added first:
<!-- CSV update handler, loaded on demand --> <requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy"> </requestHandler>
Enter the following URL in the browser: http://localhost:8088/solr-src/csv-core/update/csv?stream.file=D:/dpimport/test_data2.csv&stream.contentType=text/plain;charset=UTF-8&fieldnames=id,name&commit=true
and the CSV file is imported. My CSV file has two fields, an id and a name, with some test data, for example:
1, AAA
2, BBB
...
Importing consecutive rows is of course no problem, but when a blank row appears in the middle, the CSV file saved by Office becomes:
1, AAA
,
2, BBB
That is, the blank line still contains a comma. Since the schema defines the id field as the unique key and it must not be empty, such a row throws an exception while the index is being built. So I extended the CSVRequestHandler source: I added an emptyLine parameter and a bit of logic in the load method:
// whether empty data rows are supported
if (emptyLine) {
    int totalLength = 0;
    for (int i = 0; i < vals.length; i++) {
        totalLength += vals[i].length();
    }
    if (totalLength == 0) {
        continue;
    }
}
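The idea is simply: if the combined length of every parsed value in the row is zero, treat the row as empty and skip it. A blank line like "," parses to two empty strings, so it is skipped; a real row is kept. The same check, extracted into a standalone helper (the class and method names here are mine, not part of Solr), can be exercised on its own:

```java
public class EmptyRowCheck {
    // Returns true when every field in the parsed row is empty,
    // i.e. the row carries no data at all.
    static boolean isEmptyRow(String[] vals) {
        int totalLength = 0;
        for (String v : vals) {
            totalLength += v.length();
        }
        return totalLength == 0;
    }
}
```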
The modified CSVRequestHandler is as follows:
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.solr.handler;

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.params.UpdateParams;
import org.apache.solr.common.util.StrUtils;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.update.*;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.internal.csv.CSVStrategy;
import org.apache.solr.internal.csv.CSVParser;
import org.apache.commons.io.IOUtils;

import java.util.regex.Pattern;
import java.util.List;
import java.io.*;

/**
 * @version $Id: CSVRequestHandler.java 1298169 2012-03-07 22:27:54Z uschindler $
 */
public class CSVRequestHandler extends ContentStreamHandlerBase {

  @Override
  protected ContentStreamLoader newLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
    return new SingleThreadedCSVLoader(req, processor);
  }

  //////////////////////// SolrInfoMBeans methods //////////////////////

  @Override
  public String getDescription() {
    return "Add/Update multiple documents with CSV formatted rows";
  }

  @Override
  public String getVersion() {
    return "$Revision: 1298169 $";
  }

  @Override
  public String getSourceId() {
    return "$Id: CSVRequestHandler.java 1298169 2012-03-07 22:27:54Z uschindler $";
  }

  @Override
  public String getSource() {
    return "$URL: https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/solr/core/src/java/org/apache/solr/handler/CSVRequestHandler.java $";
  }
}

abstract class CSVLoader extends ContentStreamLoader {
  public static final String SEPARATOR = "separator";
  public static final String FIELDNAMES = "fieldnames";
  public static final String HEADER = "header";
  public static final String SKIP = "skip";
  public static final String SKIPLINES = "skipLines";
  public static final String MAP = "map";
  public static final String TRIM = "trim";
  public static final String EMPTY = "keepEmpty";
  public static final String SPLIT = "split";
  public static final String ENCAPSULATOR = "encapsulator";
  public static final String ESCAPE = "escape";
  public static final String OVERWRITE = "overwrite";
  public static final String EMPTYLINE = "emptyLine"; // extension: whether empty data rows are supported

  private static Pattern colonSplit = Pattern.compile(":");
  private static Pattern commaSplit = Pattern.compile(",");

  final IndexSchema schema;
  final SolrParams params;
  final CSVStrategy strategy;
  final UpdateRequestProcessor processor;

  String[] fieldnames;
  SchemaField[] fields;
  CSVLoader.FieldAdder[] adders;

  int skipLines;     // number of lines to skip at start of file
  boolean emptyLine; // extension: whether to skip empty data rows

  final AddUpdateCommand templateAdd;

  /** Add a field to a document unless it's zero length.
   * The FieldAdder hierarchy handles all the complexity of
   * further transforming or splitting field values to keep the
   * main logic loop clean.  All implementations of add() must be
   * MT-safe!
   */
  private class FieldAdder {
    void add(SolrInputDocument doc, int line, int column, String val) {
      if (val.length() > 0) {
        doc.addField(fields[column].getName(), val, 1.0f);
      }
    }
  }

  /** add zero length fields */
  private class FieldAdderEmpty extends CSVLoader.FieldAdder {
    @Override
    void add(SolrInputDocument doc, int line, int column, String val) {
      doc.addField(fields[column].getName(), val, 1.0f);
    }
  }

  /** trim fields */
  private class FieldTrimmer extends CSVLoader.FieldAdder {
    private final CSVLoader.FieldAdder base;
    FieldTrimmer(CSVLoader.FieldAdder base) { this.base = base; }
    @Override
    void add(SolrInputDocument doc, int line, int column, String val) {
      base.add(doc, line, column, val.trim());
    }
  }

  /** map a single value.
   * for just a couple of mappings, this is probably faster than
   * using a HashMap.
   */
  private class FieldMapperSingle extends CSVLoader.FieldAdder {
    private final String from;
    private final String to;
    private final CSVLoader.FieldAdder base;
    FieldMapperSingle(String from, String to, CSVLoader.FieldAdder base) {
      this.from = from;
      this.to = to;
      this.base = base;
    }
    @Override
    void add(SolrInputDocument doc, int line, int column, String val) {
      if (from.equals(val)) val = to;
      base.add(doc, line, column, val);
    }
  }

  /** Split a single value into multiple values based on
   * a CSVStrategy.
   */
  private class FieldSplitter extends CSVLoader.FieldAdder {
    private final CSVStrategy strategy;
    private final CSVLoader.FieldAdder base;
    FieldSplitter(CSVStrategy strategy, CSVLoader.FieldAdder base) {
      this.strategy = strategy;
      this.base = base;
    }

    @Override
    void add(SolrInputDocument doc, int line, int column, String val) {
      CSVParser parser = new CSVParser(new StringReader(val), strategy);
      try {
        String[] vals = parser.getLine();
        if (vals != null) {
          for (String v : vals) base.add(doc, line, column, v);
        } else {
          base.add(doc, line, column, val);
        }
      } catch (IOException e) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, e);
      }
    }
  }

  String errHeader = "CSVLoader:";

  CSVLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
    this.processor = processor;
    this.params = req.getParams();
    schema = req.getSchema();

    templateAdd = new AddUpdateCommand();
    templateAdd.allowDups = false;
    templateAdd.overwriteCommitted = true;
    templateAdd.overwritePending = true;

    if (params.getBool(OVERWRITE, true)) {
      templateAdd.allowDups = false;
      templateAdd.overwriteCommitted = true;
      templateAdd.overwritePending = true;
    } else {
      templateAdd.allowDups = true;
      templateAdd.overwriteCommitted = false;
      templateAdd.overwritePending = false;
    }
    templateAdd.commitWithin = params.getInt(UpdateParams.COMMIT_WITHIN, -1);

    strategy = new CSVStrategy(',', '"', CSVStrategy.COMMENTS_DISABLED, CSVStrategy.ESCAPE_DISABLED,
                               false, false, false, true);
    String sep = params.get(SEPARATOR);
    if (sep != null) {
      if (sep.length() != 1)
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Invalid separator:'" + sep + "'");
      strategy.setDelimiter(sep.charAt(0));
    }

    String encapsulator = params.get(ENCAPSULATOR);
    if (encapsulator != null) {
      if (encapsulator.length() != 1)
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Invalid encapsulator:'" + encapsulator + "'");
    }

    String escape = params.get(ESCAPE);
    if (escape != null) {
      if (escape.length() != 1)
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Invalid escape:'" + escape + "'");
    }

    // if only encapsulator or escape is set, disable the other escaping mechanism
    if (encapsulator == null && escape != null) {
      strategy.setEncapsulator(CSVStrategy.ENCAPSULATOR_DISABLED);
      strategy.setEscape(escape.charAt(0));
    } else {
      if (encapsulator != null) {
        strategy.setEncapsulator(encapsulator.charAt(0));
      }
      if (escape != null) {
        char ch = escape.charAt(0);
        strategy.setEscape(ch);
        if (ch == '\\') {
          // If the escape is the standard backslash, then also enable
          // unicode escapes (it's harmless since 'u' would not otherwise
          // be escaped.
          strategy.setUnicodeEscapeInterpretation(true);
        }
      }
    }

    String fn = params.get(FIELDNAMES);
    fieldnames = fn != null ? commaSplit.split(fn, -1) : null;

    Boolean hasHeader = params.getBool(HEADER);

    skipLines = params.getInt(SKIPLINES, 0);
    emptyLine = params.getBool(EMPTYLINE, false); // extension

    if (fieldnames == null) {
      if (null == hasHeader) {
        // assume the file has the headers if they aren't supplied in the args
        hasHeader = true;
      } else if (!hasHeader) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "CSVLoader: must specify fieldnames=<fields>* or header=true");
      }
    } else {
      // if the fieldnames were supplied and the file has a header, we need to
      // skip over that header.
      if (hasHeader != null && hasHeader) skipLines++;

      prepareFields();
    }
  }

  /** create the FieldAdders that control how each field is indexed */
  void prepareFields() {
    // Possible future optimization: for really rapid incremental indexing
    // from a POST, one could cache all of this setup info based on the params.
    // The link from FieldAdder to this would need to be severed for that to happen.
    fields = new SchemaField[fieldnames.length];
    adders = new CSVLoader.FieldAdder[fieldnames.length];
    String skipStr = params.get(SKIP);
    List<String> skipFields = skipStr == null ? null : StrUtils.splitSmart(skipStr, ',', true);

    CSVLoader.FieldAdder adder = new CSVLoader.FieldAdder();
    CSVLoader.FieldAdder adderKeepEmpty = new CSVLoader.FieldAdderEmpty();

    for (int i = 0; i < fields.length; i++) {
      String fname = fieldnames[i];
      // to skip a field, leave the entries in fields and adders null
      if (fname.length() == 0 || (skipFields != null && skipFields.contains(fname))) continue;

      fields[i] = schema.getField(fname);
      boolean keepEmpty = params.getFieldBool(fname, EMPTY, false);
      adders[i] = keepEmpty ? adderKeepEmpty : adder;

      // Order that operations are applied: split -> trim -> map -> add
      // so create in reverse order.
      // Creation of FieldAdders could be optimized and shared among fields

      String[] fmap = params.getFieldParams(fname, MAP);
      if (fmap != null) {
        for (String mapRule : fmap) {
          String[] mapArgs = colonSplit.split(mapRule, -1);
          if (mapArgs.length != 2)
            throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                "Map rules must be of the form 'from:to', got '" + mapRule + "'");
          adders[i] = new CSVLoader.FieldMapperSingle(mapArgs[0], mapArgs[1], adders[i]);
        }
      }

      if (params.getFieldBool(fname, TRIM, false)) {
        adders[i] = new CSVLoader.FieldTrimmer(adders[i]);
      }

      if (params.getFieldBool(fname, SPLIT, false)) {
        String sepStr = params.getFieldParam(fname, SEPARATOR);
        char fsep = sepStr == null || sepStr.length() == 0 ? ',' : sepStr.charAt(0);
        String encStr = params.getFieldParam(fname, ENCAPSULATOR);
        char fenc = encStr == null || encStr.length() == 0 ? (char) -2 : encStr.charAt(0);
        String escStr = params.getFieldParam(fname, ESCAPE);
        char fesc = escStr == null || escStr.length() == 0 ? CSVStrategy.ESCAPE_DISABLED : escStr.charAt(0);

        CSVStrategy fstrat = new CSVStrategy(fsep, fenc, CSVStrategy.COMMENTS_DISABLED, fesc,
                                             false, false, false, false);
        adders[i] = new CSVLoader.FieldSplitter(fstrat, adders[i]);
      }
    }
  }

  private void input_err(String msg, String[] line, int lineno) {
    StringBuilder sb = new StringBuilder();
    sb.append(errHeader).append(", line=").append(lineno).append(",").append(msg).append("\n\tvalues={");
    for (String val : line) {
      sb.append("'").append(val).append("',");
    }
    sb.append('}');
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, sb.toString());
  }

  private void input_err(String msg, String[] lines, int lineNo, Throwable e) {
    StringBuilder sb = new StringBuilder();
    sb.append(errHeader).append(", line=").append(lineNo).append(",").append(msg).append("\n\tvalues={");
    if (lines != null) {
      for (String val : lines) {
        sb.append("'").append(val).append("',");
      }
    } else {
      sb.append("NO LINES AVAILABLE");
    }
    sb.append('}');
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, sb.toString(), e);
  }

  /** load the CSV input */
  @Override
  public void load(SolrQueryRequest req, SolrQueryResponse rsp, ContentStream stream) throws IOException {
    errHeader = "CSVLoader: input=" + stream.getSourceInfo();
    Reader reader = null;
    try {
      reader = stream.getReader();
      if (skipLines > 0) {
        if (!(reader instanceof BufferedReader)) {
          reader = new BufferedReader(reader);
        }
        BufferedReader r = (BufferedReader) reader;
        for (int i = 0; i < skipLines; i++) {
          r.readLine();
        }
      }

      CSVParser parser = new CSVParser(reader, strategy);

      // parse the fieldnames from the header of the file
      if (fieldnames == null) {
        fieldnames = parser.getLine();
        if (fieldnames == null) {
          throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Expected fieldnames in CSV input");
        }
        prepareFields();
      }

      // read the rest of the CSV file
      for (;;) {
        int line = parser.getLineNumber(); // for error reporting in MT mode
        String[] vals = null;
        try {
          vals = parser.getLine();
        } catch (IOException e) {
          // catch the exception and rethrow it with more line information
          input_err("can't read line: " + line, null, line, e);
        }
        if (vals == null) break;

        // extension: whether empty data rows are supported
        if (emptyLine) {
          int totalLength = 0;
          for (int i = 0; i < vals.length; i++) {
            totalLength += vals[i].length();
          }
          if (totalLength == 0) {
            continue;
          }
        }

        if (vals.length != fields.length) {
          input_err("expected " + fields.length + " values but got " + vals.length, vals, line);
        }

        addDoc(line, vals);
      }
    } finally {
      if (reader != null) {
        IOUtils.closeQuietly(reader);
      }
    }
  }

  /** called for each line of values (document) */
  abstract void addDoc(int line, String[] vals) throws IOException;

  /** this must be MT safe... may be called concurrently from multiple threads. */
  void doAdd(int line, String[] vals, SolrInputDocument doc, AddUpdateCommand template) throws IOException {
    // the line number is passed simply for error reporting in MT mode.
    // first, create the lucene document
    for (int i = 0; i < vals.length; i++) {
      if (fields[i] == null) continue; // ignore this field
      String val = vals[i];
      adders[i].add(doc, line, i, val);
    }

    template.solrDoc = doc;
    processor.processAdd(template);
  }
}

class SingleThreadedCSVLoader extends CSVLoader {
  SingleThreadedCSVLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
    super(req, processor);
  }

  @Override
  void addDoc(int line, String[] vals) throws IOException {
    templateAdd.indexedId = null;
    SolrInputDocument doc = new SolrInputDocument();
    doAdd(line, vals, doc, templateAdd);
  }
}
With this change, appending &emptyLine=true to the request URL above avoids the exception on empty rows.
The modification above is for Solr 3.6; it may not work as-is on other versions.