A requirement came up unexpectedly to index PDF (non-scanned) documents in Solr. The configuration is as follows.
schema.xml:
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="content" type="text_general" indexed="true" stored="true" required="true"/>
  <field name="size" type="slong" indexed="true" stored="true" required="true"/>
  <dynamicField name="ignored_*" type="ignored" multiValued="true" indexed="false" stored="false"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>
solrconfig.xml:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">content</str>
    <str name="fmap.stream_size">size</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- Capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <!-- <str name="fmap.a">links</str> -->
    <!-- <str name="fmap.div">ignored_div</str> -->
  </lst>
</requestHandler>
Parameter description:
fmap.source=target: mapping rule that maps a field extracted from the PDF file (source) to a field in Solr (target).
uprefix: if this parameter is specified, every extracted field that is not defined in the schema gets this value prepended to its field name.
defaultField: if uprefix is not specified and an extracted field is not defined in the schema, the field named by defaultField is used instead.
captureAttr: (true|false) whether to capture the attributes of the Tika XHTML elements and index them.
literal.fieldname=value: custom metadata, that is, an explicit value for a field defined in the schema file.
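
The way these parameters interact can be sketched in Python. This is only a simplified model of the rules described above, not Solr's actual code: the helper names `in_schema` and `resolve_field` are hypothetical, the schema list is copied from the schema.xml in this post, and the order of the steps (lowernames, then fmap, then uprefix or defaultField) is an assumption based on the documented behavior.

```python
from fnmatch import fnmatchcase

# Field names taken from the schema.xml in this post; "ignored_*" is the dynamic field.
SCHEMA_FIELDS = ["id", "content", "size", "ignored_*"]


def in_schema(name, schema=SCHEMA_FIELDS):
    """Return True if name matches an explicit or dynamic (glob) schema field."""
    return any(fnmatchcase(name, pattern) for pattern in schema)


def resolve_field(extracted, fmap=None, lowernames=False, uprefix="", default_field=""):
    """Simplified model: turn an extracted Tika field name into a Solr field name."""
    name = extracted
    if lowernames:
        # lowernames: lowercase letters, replace non-alphanumeric characters with "_"
        name = "".join(ch.lower() if ch.isalnum() else "_" for ch in name)
    # fmap.<source>=<target>: rename the field when a mapping exists
    name = (fmap or {}).get(name, name)
    if in_schema(name):
        return name
    if uprefix:
        # Field unknown to the schema: prepend the prefix
        return uprefix + name
    if default_field:
        # No prefix configured: fall back to the default field
        return default_field
    return name


# Mappings from the solrconfig.xml in this post
FMAP = {"content": "content", "stream_size": "size"}

print(resolve_field("stream_size", FMAP, lowernames=True, uprefix="ignored_"))
print(resolve_field("Content-Type", FMAP, lowernames=True, uprefix="ignored_"))
print(resolve_field("Author", FMAP, default_field="ignored_undefined"))
```

With the configuration from this post, stream_size ends up in size, while an unmapped field such as Content-Type becomes ignored_content_type, which matches the ignored_* dynamic field and is therefore neither indexed nor stored.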
Submit documents for indexing:
curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=ignored_undefined" -F "commit=true" -F "filepath=@t2.pdf"
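
The same submission could also be made from a script. The following is a minimal sketch using only the Python standard library; `build_extract_request` is a hypothetical helper, the URL and parameters are copied from the curl command, and the sketch assumes the handler accepts the PDF as a raw request body with a Content-Type header rather than as a multipart form upload as curl sends it. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed Solr endpoint, same as in the curl command
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_request(pdf_bytes, doc_id, base_url=SOLR_EXTRACT_URL):
    """Build (but do not send) a POST request that submits one PDF for extraction."""
    params = {
        "literal.id": doc_id,                 # value for the unique key field
        "captureAttr": "true",
        "defaultField": "ignored_undefined",
        "commit": "true",                     # commit right after indexing
    }
    url = base_url + "?" + urlencode(params)
    return Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


# Placeholder bytes stand in for the real file contents
req = build_extract_request(b"%PDF-1.4 placeholder", "doc2")
print(req.full_url)
```

To actually index a file, read it in binary mode, pass the bytes to `build_extract_request`, and hand the result to `urllib.request.urlopen` while Solr is running.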
Reference documentation:
http://wiki.apache.org/solr/ExtractingRequestHandler
Note: Word documents are processed in the same way as PDF documents.