1.6.3 Uploading Data with Solr Cell Using Apache Tika

Source: Internet
Author: User
Tags: http, post, solr

1. Uploading Data with Solr Cell Using Apache Tika

Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers, such as Apache PDFBox and Apache POI, into Solr itself. Through this framework, Solr's ExtractingRequestHandler accepts uploaded binary files and uses Tika to extract their content for indexing.

If you want Solr to use your own ContentHandler, extend ExtractingRequestHandler and override the createFactory() method. This method is responsible for constructing the SolrContentHandler that interacts with Tika, and it allows literals to override values parsed by Tika. This behavior is controlled by the literalsOverride parameter, which defaults to true. If it is set to false, literal values are appended after the values parsed by Tika.
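As a rough illustration of the literalsOverride behavior described above, the following sketch assembles a request that supplies a literal title and asks Solr to keep the Tika-parsed title as well. The endpoint, the document ID doc1, the title value ManualTitle, the form field name myfile, and the file name tutorial.html are all assumptions for illustration, and the target field would most likely need to be multivalued for both values to be kept. The script only prints the curl command, so it can be run without a live Solr server.

```shell
# Assumed endpoint and input file; adjust to your installation.
SOLR_EXTRACT_URL='http://localhost:8983/solr/update/extract'
INPUT_FILE='tutorial.html'

# literal.title supplies a literal value; literalsOverride=false asks Solr
# to append it to the Tika-parsed title instead of replacing that value.
QUERY='literal.id=doc1&literal.title=ManualTitle&literalsOverride=false&commit=true'
APPEND_URL="${SOLR_EXTRACT_URL}?${QUERY}"

# Dry run: print the command rather than executing it.
printf "curl '%s' -F 'myfile=@%s'\n" "$APPEND_URL" "$INPUT_FILE"
```

To send the request for real, paste the printed command into a shell while Solr is running.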

For more information about Solr's ExtractingRequestHandler, see https://wiki.apache.org/solr/ExtractingRequestHandler

1.1 Key Concepts

When using Solr Cell, it helps to keep the following in mind:

    • Solr automatically attempts to determine the document type (Word, PDF, HTML, and so on) and extract the appropriate content. If you prefer, you can use the stream.type parameter to specify an explicit MIME type for Tika.
    • Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common interface implemented by many different XML parsers. For more information, see http://www.saxproject.org/quickstart.html
    • Solr then responds to Tika's SAX events and creates the fields to index.
    • Tika produces metadata such as title, subject, and author. See the file type support section of http://tika.apache.org/1.4/formats.html
    • All extracted text is added to the content field. This field is defined as stored in schema.xml.
    • You can map Tika's metadata fields to Solr fields, and you can boost these fields.
    • You can pass in literal values for fields. A literal overrides the value parsed by Tika, including fields in the Tika metadata object, the Tika content field, and any captured content fields.
    • You can apply an XPath expression to the Tika XHTML to restrict the content that is produced.
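The first point in the list above, overriding type detection with stream.type, could look roughly like the following sketch. The endpoint, the document ID doc2, the form field name myfile, and the file name tutorial.html are assumptions for illustration. The script prints the curl command instead of running it, so no live Solr server is needed to try it.

```shell
# Assumed endpoint and input file; adjust to your installation.
SOLR_EXTRACT_URL='http://localhost:8983/solr/update/extract'
INPUT_FILE='tutorial.html'

# stream.type states the MIME type explicitly instead of letting Tika detect it.
MIME_TYPE='text/html'
TYPED_URL="${SOLR_EXTRACT_URL}?literal.id=doc2&stream.type=${MIME_TYPE}&commit=true"

# Dry run: print the command rather than executing it.
printf "curl '%s' -F 'myfile=@%s'\n" "$TYPED_URL" "$INPUT_FILE"
```

Stating the type explicitly is mainly useful when automatic detection picks the wrong parser for a file.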

Tip: Although Apache Tika is very powerful, it can fail on some files. PDF files are particularly problematic, mainly because of the PDF format itself. If a failure occurs while processing any file, ExtractingRequestHandler has no secondary mechanism for extracting text from the file; it throws an exception and fails.

1.2 Trying out Tika with the Solr Example Directory

Start the Solr example server:

cd example
java -jar start.jar

In a separate command-line window, change to the docs/ directory and send a file to Solr via HTTP POST:

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html"

The URL above calls the ExtractingRequestHandler, uploads the file tutorial.html, and assigns it the unique ID doc1. The -F flag tells curl to POST the data with Content-Type multipart/form-data, which supports uploading binary files. The @ symbol tells curl to upload the named file as an attachment. The argument myfile=@tutorial.html needs a valid path, which can be absolute or relative (for example, myfile=@../../site/tutorial.html if you are still in the exampledocs directory).

You may notice that although you can search on the text of the sample document, you cannot see that text when the document is retrieved. This is because the "content" field generated by Tika is mapped to the Solr field named "text", and that field is not stored. This mapping comes from the default rule of the /update/extract handler, and it can be changed. For example, to store and see all metadata and content:

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true' -F "myfile=@tutorial.html"

In this command, the uprefix=attr_ parameter causes every generated field that is not defined in the Solr schema.xml to be prefixed with attr_, which is defined in schema.xml as a stored dynamic field. The fmap.content=attr_content parameter overrides the default fmap.content=text, so the content is added to the attr_content field instead.
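One way to check that the content is now stored is to fetch the document and ask for the attr_content field. The sketch below assumes the default /solr/select search endpoint on the same host and the document ID doc1 from the command above; the fl, wt, and indent values are illustrative choices. It prints the curl command rather than running it, so it works without a live server.

```shell
# Assumed search endpoint; adjust to your installation.
SOLR_SELECT_URL='http://localhost:8983/solr/select'

# q selects the uploaded document by ID; fl limits the response to the
# ID and the stored content field created through fmap.content.
SELECT_QUERY='q=id:doc1&fl=id,attr_content&wt=json&indent=true'
CHECK_URL="${SOLR_SELECT_URL}?${SELECT_QUERY}"

# Dry run: print the command rather than executing it.
printf "curl '%s'\n" "$CHECK_URL"
```

If the mapping took effect, the response should contain an attr_content value holding the extracted text.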

1.3 Input Parameters

The ExtractingRequestHandler accepts the following parameters:

Parameter    Description
boost.<fieldname>    Boosts the specified field.
capture    Captures the XHTML elements with the specified name and adds them to the Solr document. This parameter is useful for copying chunks of the XHTML into a separate field. For example, it can be used to grab paragraphs (<p>) and index them into a separate field. Note that the content is still also captured into the overall "content" field.
captureAttr    Indexes attributes of the Tika XHTML elements into separate fields. If set to true, for example, when extracting content from HTML, Tika can return the href attributes of <a> tags as fields named "a". Refer to the example below.
commitWithin    Commits the index changes to disk within the specified number of milliseconds.
date.formats    Defines the date formats to recognize in documents.
defaultField    The default field to use when the uprefix parameter is not specified and a field cannot be determined.
extractOnly    Default is false. If true, returns the content extracted by Tika without indexing the document. The response then includes the extracted XHTML verbatim as a string. When viewing it manually, a response format other than XML may be more useful, because it avoids extra escaping of the embedded XHTML tags. See http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput
extractFormat    Default is "xml". The other format is "text". The xml format is the XHTML that the Tika command-line application produces with its -x option; the text format corresponds to Tika's -t option. This parameter is valid only if extractOnly is true.
fmap.<source_field>    Maps one field name to another. The source_field must be a field in the incoming document, and the value is the Solr field to map it to. For example, fmap.content=text moves the content field generated by Tika to the text field in Solr.
literal.<fieldname>    Populates the Solr field with the specified value. The data can be multivalued if the field is multivalued.
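To make a few of these parameters concrete, the sketch below assembles two requests: one that combines capture, fmap, and captureAttr, and one that uses extractOnly with extractFormat. The endpoint, the document ID doc3, the form field name myfile, the file name tutorial.html, and the target field foo_t are assumptions for illustration (foo_t would have to exist in your schema, for example as a dynamic field), and print_curl is a hypothetical helper, not part of Solr. Both commands are printed rather than executed, so the script runs without a live server.

```shell
# Assumed endpoint and input file; adjust to your installation.
SOLR_EXTRACT_URL='http://localhost:8983/solr/update/extract'
INPUT_FILE='tutorial.html'

# Dry-run helper (hypothetical): prints the curl command for a query string.
print_curl() {
  printf "curl '%s?%s' -F 'myfile=@%s'\n" "$SOLR_EXTRACT_URL" "$1" "$INPUT_FILE"
}

# 1. Capture <div> elements, map them to a separate field, and index
#    element attributes. foo_t is an assumed field name.
CAPTURE_QUERY='literal.id=doc3&capture=div&fmap.div=foo_t&captureAttr=true&commit=true'
print_curl "$CAPTURE_QUERY"

# 2. Return the extracted content as plain text without indexing anything.
EXTRACT_ONLY_QUERY='extractOnly=true&extractFormat=text'
print_curl "$EXTRACT_ONLY_QUERY"
```

The second request is a convenient way to inspect what Tika extracts from a file before deciding how to map its fields.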
