An error occurred while Tika extracted the PDF information


The following exception was thrown while Apache Tika was extracting text from a PDF:

org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:305)
    at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:398)
    at org.apache.pdfbox.util.PDFTextStripper.writeString(PDFTextStripper.java:866)
    at org.apache.pdfbox.util.PDFTextStripper.writeLine(PDFTextStripper.java:1896)
    at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:744)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:461)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)

This error is reported when Apache Tika is used to extract PDF information. According to the message, the extracted text exceeded the write limit of 100,000 characters; note that the text up to the limit is still available.
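Incidentally, Tika 1.x lets you tell this truncation apart from a genuine parse failure: if you construct the WriteOutContentHandler yourself, its isWriteLimitReached(Throwable) method identifies the exception, and its toString() returns the text captured so far. A minimal sketch, assuming the same parser, stream, and metadata objects as the code below:

// Sketch for Tika 1.x: construct the WriteOutContentHandler explicitly so
// that hitting the write limit can be told apart from a real parse failure.
WriteOutContentHandler wrapped = new WriteOutContentHandler(100000);
BodyContentHandler handler = new BodyContentHandler(wrapped);
try {
    parser.parse(stream, handler, metadata, new ParseContext());
} catch (SAXException e) {
    if (wrapped.isWriteLimitReached(e)) {
        // Truncation, not failure: the text up to the limit was captured.
        System.out.println(wrapped.toString());
    } else {
        throw e;
    }
}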

My code is as follows:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

Parser parser = new PDFParser();
// No write limit is passed, so the default of 100,000 characters applies.
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
InputStream stream = null;
try {
    stream = new FileInputStream(new File("1.pdf"));
    parser.parse(stream, handler, metadata, new ParseContext());
    for (String name : metadata.names()) {
        System.out.println(name + ":\t" + metadata.get(name));
    }
} catch (IOException | SAXException | TikaException e) {
    e.printStackTrace();
} finally {
    if (stream != null) {
        try {
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The maximum number of characters to read is not passed to the constructor, so the default limit of 100,000 characters is used. Checking the code above against the Tika API, I noticed the following

BodyContentHandler constructor:
org.apache.tika.sax.BodyContentHandler.BodyContentHandler(int writeLimit)

This writeLimit parameter is exactly what the error message refers to. Passing a larger number to the constructor, for example 10*1024*1024, fixes the error (pick a value based on the size of the PDF document).
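For example (a minimal sketch; 10*1024*1024 is an arbitrary ceiling, and per the Tika javadoc a writeLimit of -1 disables the limit entirely, at the cost of buffering the whole document text in memory):

// Raise the write limit well above the document's text size.
BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);

// Alternatively, disable the limit altogether (unbounded memory use):
// BodyContentHandler handler = new BodyContentHandler(-1);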

Debugging the program again yields the following metadata:

  

dc:subject:
meta:save-date:	2014-07-22T21:02:38Z
subject:	PostgreSQL 9.3 Documentation
Author:	The PostgreSQL Global Development Group
dcterms:created:	2014-07-22T20:55:33Z
date:	2014-07-22T21:02:38Z
creator:	The PostgreSQL Global Development Group
Creation-Date:	2014-07-22T20:55:33Z
title:	PostgreSQL 9.3 Documentation
trapped:	False
meta:author:	The PostgreSQL Global Development Group
created:	Wed Jul 23 04:55:33 CST 2014
meta:keyword:
cp:subject:	PostgreSQL 9.3 Documentation
dc:format:	application/pdf; version=1.4
PTEX.Fullbanner:	This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012/Debian) kpathsea version 6.1.0
xmp:CreatorTool:	LaTeX with hyperref package
Keywords:
dc:title:	PostgreSQL 9.3 Documentation
Last-Save-Date:	2014-07-22T21:02:38Z
meta:creation-date:	2014-07-22T20:55:33Z
dcterms:modified:	2014-07-22T21:02:38Z
dc:creator:	The PostgreSQL Global Development Group
pdf:PDFVersion:	1.4
Last-Modified:	2014-07-22T21:02:38Z
modified:	2014-07-22T21:02:38Z
xmpTPg:NPages:	2861
pdf:encrypted:	false
producer:	pdfTeX-1.40.13; modified using iText® 5.1.3 ©2000-2011 1T3XT BVBA
Content-Type:	application/pdf

  


How does Lucene model the search content?

A document is a container that holds one or more fields. A field value can be indexed or left unindexed; if you need to search on a field, it must be indexed. Field values in binary format can only be stored, not indexed. When a field is indexed, an analyzer converts the field value into tokens.

The indexing process. Lucene's indexing process is divided into three main steps: converting the original document into text, analyzing the text, and saving the analyzed text to the index. During an indexing operation, text is first extracted from the raw data and used to create a corresponding Document object. This object contains multiple Field instances, each holding a piece of the original data. The subsequent analysis step processes the text into a stream of tokens, which are finally added to the index's segment structure.

Extracting text and creating documents: Lucene can easily index plain-text sources such as txt files. However, to index a manual in PDF format, you must first extract the text from the document and use it to create the Document and its Fields. Java has no built-in way to process PDF files, and the same is true of Microsoft Word files and other non-plain-text formats. Lucene can therefore be combined with the Tika framework, as above, to extract the relevant text.

Analyzing the document: during an indexing operation, Lucene first analyzes the text, splitting it into tokens and then optionally applying further operations to them. For example, tokens are usually converted to lowercase before indexing so that searches are case-insensitive, and frequent words with no real search value (stop words such as "ah", "haha", ...) may be removed. Document analysis covers a considerable amount of ground that is not described in detail here; I will explain it in detail later. A minimal indexing sketch follows below.
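As a sketch of these steps, assuming a recent Lucene version (the index path and field names are illustrative, and extractedText stands for the text obtained from the PDF via Tika above):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexPdfText {
    public static void main(String[] args) throws Exception {
        String extractedText = "..."; // text extracted from the PDF via Tika

        Directory dir = FSDirectory.open(Paths.get("index"));
        // StandardAnalyzer tokenizes the text and lowercases the tokens,
        // which is what makes the search case-insensitive.
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            Document doc = new Document();
            // Stored but not tokenized: useful for displaying the hit source.
            doc.add(new StringField("path", "1.pdf", Field.Store.YES));
            // Tokenized and indexed so that the content is searchable.
            doc.add(new TextField("contents", extractedText, Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}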
