(4) Solr index data import: PDF format


A requirement came up recently to index PDF (non-scanned) documents. Solr handles this through the ExtractingRequestHandler (Solr Cell), which uses Apache Tika to extract text and metadata.

To configure schema.xml:

 

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="content" type="text_general" indexed="true" stored="true" required="true"/>
  <field name="size" type="slong" indexed="true" stored="true" required="true"/>
  <dynamicField name="ignored_*" type="ignored" multiValued="true" indexed="false" stored="false"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>
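The field types "slong" and "ignored" referenced above must also be defined in schema.xml. They are not shown in the original article; the definitions below are a sketch based on the stock example schema shipped with older Solr releases, so check the example schema of your own version:

<fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>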

To configure solrconfig.xml:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">content</str>
    <str name="fmap.stream_size">size</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <!-- <str name="fmap.a">links</str> -->
    <!-- <str name="fmap.div">ignored_div</str> -->
  </lst>
</requestHandler>
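The ExtractingRequestHandler lives in the Solr Cell contrib module, so solrconfig.xml also needs <lib> directives pointing at the extraction jars. The directory layout below follows the stock example solrconfig.xml and differs between Solr releases, so adjust the paths to your installation:

<lib dir="../../contrib/extraction/lib" regex=".*\.jar"/>
<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar"/>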

 

Parameter descriptions:

fmap.source=target: a field-mapping rule that maps a field (source) extracted from the PDF to a field (target) defined in the Solr schema. For example, fmap.content=content above maps Tika's extracted content into the "content" field.

uprefix: if specified, every extracted field that is not defined in the schema gets this value prepended to its name. Combined with the ignored_* dynamic field above, such fields are silently discarded.

defaultField: if uprefix is not set and an extracted field is not found in the schema, its content goes into the field named by defaultField.

captureAttr: (true|false) whether to index the attributes of the Tika XHTML elements, e.g. the href attributes of <a> tags.

literal.<fieldname>: supplies a fixed value for a field defined in the schema, i.e. custom metadata such as the document id.
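These options can also be passed per request, overriding the defaults in solrconfig.xml. A minimal sketch (doc1 and t1.pdf are placeholders):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=content&uprefix=ignored_&commit=true" -F "file=@t1.pdf"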

Submit documents for indexing:

 

Curl "http: /localhost: 8983/solr/update/extract? Literal. id = doc2 & captureAttr = true & defaultField = ignored_undefined "-F" commit = true "-F" filepath @t2.pdf"

 

 

Reference documentation:

 

http://wiki.apache.org/solr/ExtractingRequestHandler

 

Note: Word documents are handled the same way as PDF documents.
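For example, a .doc file can be posted to the same handler (the id and file name here are placeholders):

curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=ignored_undefined" -F "commit=true" -F "file=@t3.doc"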
