1.6.3 Uploading Data with Solr Cell using Apache Tika

Last Update:2015-03-03 Source: Internet

Author: User



1. Uploading Data with Solr Cell using Apache Tika

Solr uses code from the Apache Tika project to provide a framework that brings many different file-format parsers, such as Apache PDFBox and Apache POI, into Solr itself. Building on this framework, Solr's ExtractingRequestHandler can use Tika to accept uploaded binary files, extract their content, and index it.

If you want Solr to use your own ContentHandler, extend ExtractingRequestHandler and override its createFactory() method. The factory is responsible for constructing the SolrContentHandler that interacts with Tika, and it allows literals to override values parsed by Tika. This behavior is controlled by the literalsOverride parameter, which defaults to true; if it is set to false, Tika-parsed values are appended to the literal values instead.
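The effect of literalsOverride can be sketched with two request URLs. This is only an illustration: the base URL is the example server used later in this article, while the title field, its value, and the variable names are assumptions made for the example. The curl call is left as a comment because it needs a running Solr instance.

```shell
#!/bin/sh
# Sketch only: "title" and "MyTitle" are hypothetical; adjust to your schema.
base='http://localhost:8983/solr/update/extract'

# Default behaviour (literalsOverride=true): the literal replaces any
# title that Tika parses from the uploaded file.
override_url="${base}?literal.id=doc1&literal.title=MyTitle&commit=true"

# literalsOverride=false: the Tika-parsed title is appended to the literal
# value instead of being replaced by it.
append_url="${base}?literal.id=doc1&literal.title=MyTitle&literalsOverride=false&commit=true"

printf '%s\n' "$override_url" "$append_url"

# To send a file with either URL (requires a running Solr; "myfile" is an
# arbitrary form field name):
#   curl "$append_url" -F "myfile=@tutorial.html"
```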

For more information about Solr's Extracting Request Handler, see https://wiki.apache.org/solr/ExtractingRequestHandler.

1.1 Key Concepts

When using Solr Cell, it helps to understand the following:

Solr automatically tries to determine the type of the input document (Word, PDF, HTML, and so on) and extracts the content accordingly. If you prefer, you can use the stream.type parameter to give Tika an explicit MIME type.
Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common interface implemented by many different XML parsers. For more information, see http://www.saxproject.org/quickstart.html.
Solr then responds to Tika's SAX events and creates the fields to index.
Tika produces metadata such as Title, Subject, and Author. See the supported file types section of http://tika.apache.org/1.4/formats.html.
Tika adds all the extracted text to the content field. This field is defined as stored in schema.xml.
You can map Tika's metadata fields to Solr fields, and you can also boost these fields.
You can pass in literals for field values. Literals override Tika-parsed values, including fields in the Tika metadata object, the Tika content field, and any captured content fields.
You can apply an XPath expression to the Tika XHTML to restrict the content that is produced.
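Two of the concepts above, an explicit MIME type and an XPath restriction, can be sketched as request URLs. Treat this as a hedged illustration: stream.type and xpath are the handler's parameter names, but the XPath expression, the document id, and the variable names are assumptions chosen for the example.

```shell
#!/bin/sh
# Sketch only: base URL of the Solr example server; values are illustrative.
base='http://localhost:8983/solr/update/extract'

# Give Tika an explicit MIME type instead of relying on auto-detection.
typed_url="${base}?literal.id=doc1&stream.type=text/html&commit=true"

# Restrict the extracted content to nodes below <div> elements in Tika's
# XHTML output (hypothetical expression; adapt it to your documents).
xpath='/xhtml:html/xhtml:body/xhtml:div/descendant:node()'
restricted_url="${base}?literal.id=doc1&xpath=${xpath}&commit=true"

printf '%s\n' "$typed_url" "$restricted_url"

# To send a file (requires a running Solr; "myfile" is an arbitrary
# form field name):
#   curl "$restricted_url" -F "myfile=@tutorial.html"
```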

Tip: Although Apache Tika is very powerful, it is not perfect and fails on some files. PDF files are particularly problematic, mainly because of the PDF format itself. If processing a file fails, ExtractingRequestHandler has no fallback mechanism for extracting text from it; it throws an exception and the request fails.
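Because there is no fallback, a client may want to detect failed uploads itself. The following shell sketch classifies the HTTP status of an extract request. It assumes that any non-2xx status means the extraction or upload failed; the function name and the file name in the comment are hypothetical.

```shell
#!/bin/sh
# Classify the HTTP status returned by /update/extract.
# Assumption: 2xx means the document was accepted; anything else
# (including curl's "000" when no response was received) is a failure.
extract_status() {
    case "$1" in
        2??) echo "ok" ;;
        *)   echo "failed" ;;
    esac
}

extract_status 200
extract_status 500

# With a running Solr, the status could be captured like this:
#   code=$(curl -s -o /dev/null -w '%{http_code}' "$url" -F "myfile=@broken.pdf")
#   [ "$(extract_status "$code")" = "ok" ] || echo "extraction failed" >&2
```

Failed files could then be logged or routed to a different extraction tool, since Solr will not retry them on its own.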

1.2 Trying out Tika with the Solr Example Directory

Start the Solr example server:

cd example
java -jar start.jar

In a separate command-line window, go to the docs/ directory and send a file to Solr via HTTP POST:

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html"

The URL above calls the Extraction Request Handler, uploads the file tutorial.html, and assigns it the unique ID doc1. The -F flag tells curl to POST the data with Content-Type multipart/form-data, which supports uploading binary files, and the @ symbol tells curl to upload the named file as an attachment. The argument myfile=@tutorial.html must point to a valid path, which can be absolute or relative (for example, myfile=@../../site/tutorial.html if you are still in the exampledocs directory).

You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. The mapping comes from the default map rule of the /update/extract handler and can be changed or overridden. For example, to store and see all metadata and content:

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true' -F "myfile=@tutorial.html"

In this command, the uprefix=attr_ parameter causes every generated field that is not defined in Solr's schema.xml to be prefixed with attr_, which schema.xml defines as a stored dynamic field. The fmap.content=attr_content parameter overrides the default fmap.content=text, so the content is added to the attr_content field instead.
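To check that the content is now stored, the document can be fetched back with a query. The sketch below only builds the query URL. It assumes the example server's default /select handler at the address shown and that the document was indexed with literal.id=doc1; the variable names are illustrative.

```shell
#!/bin/sh
# Sketch only: assumes the default /select handler of the example server.
select_base='http://localhost:8983/solr/select'

# Ask for the id and the stored attr_content field of document doc1,
# returned as indented JSON.
query_url="${select_base}?q=id:doc1&fl=id,attr_content&wt=json&indent=true"

printf '%s\n' "$query_url"

# To run the query (requires a running Solr):
#   curl "$query_url"
```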

1.3 Input Parameters

The Extraction Request Handler accepts the following parameters:

Parameter	Description
boost.<fieldname>	Boosts the specified field.
capture	Captures XHTML elements with the specified name and adds them to the Solr document. This parameter is useful for copying a part of the XHTML into a separate field. For example, it can grab paragraphs (<p>) and index them into their own field. Note: the content is still captured into the overall "content" field as well.
captureAttr	Indexes the attributes of the Tika XHTML elements into separate fields. If set to true, for example, when extracting content from HTML, Tika can return the href attributes of <a> elements as fields named "a".
commitWithin	Commits the document to the index within the specified number of milliseconds.
date.formats	Defines the date formats to recognize in the documents.
defaultField	The field to use when the uprefix parameter is not specified and a field cannot be determined.
extractOnly	Defaults to false. If true, returns the content extracted by Tika without indexing the document. The response then contains the extracted XHTML verbatim as a string. When viewing the result manually, a response format other than XML may be more useful, because it makes the embedded XHTML tags easier to read. See http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.
extractFormat	Defaults to "xml"; the other option is "text". The xml format matches the output of Tika's -x option, and the text format matches the output of its -t option. This parameter is valid only if extractOnly is true.
fmap.<source_field>	Maps one field name to another. source_field must be a field in the incoming document, and the value is the Solr field to map it to. For example, fmap.content=text moves the content of the Tika-generated content field to Solr's text field.
literal.<fieldname>	Populates the named Solr field with the specified value. The data can be multivalued if the field is multivalued.
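A few of these parameters can be combined as in the following shell sketch, which assembles request URLs from name=value pairs. The helper function extract_url, the field name foo_t, the document id doc2, and the commitWithin value are assumptions made for illustration, and the values are assumed not to need URL encoding.

```shell
#!/bin/sh
# Sketch only: base URL of the Solr example server.
base='http://localhost:8983/solr/update/extract'

# Join "name=value" arguments with "&" and append them to the base URL.
extract_url() {
    query=''
    for pair in "$@"; do
        if [ -z "$query" ]; then
            query="$pair"
        else
            query="${query}&${pair}"
        fi
    done
    printf '%s?%s\n' "$base" "$query"
}

# Return Tika's output as plain text without indexing the document.
preview_url=$(extract_url 'extractOnly=true' 'extractFormat=text')

# Index the file, capture <div> elements into the hypothetical field foo_t,
# index element attributes, and commit within five seconds.
capture_url=$(extract_url 'literal.id=doc2' 'capture=div' 'fmap.div=foo_t' \
    'captureAttr=true' 'defaultField=text' 'commitWithin=5000')

printf '%s\n' "$preview_url" "$capture_url"

# To send a file (requires a running Solr; "myfile" is an arbitrary
# form field name):
#   curl "$capture_url" -F "myfile=@tutorial.html"
```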

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.