An error occurred while Tika extracted the PDF information


The following exception was thrown while Apache Tika was extracting text from a PDF:

org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:305)
    at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:398)
    at org.apache.pdfbox.util.PDFTextStripper.writeString(PDFTextStripper.java:866)
    at org.apache.pdfbox.util.PDFTextStripper.writeLine(PDFTextStripper.java:1896)
    at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:744)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:461)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)

This error is reported when Apache Tika is used to extract PDF information. According to the message, the extracted text exceeded the write limit of 100,000 characters; note that the text up to the limit is still available.
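Incidentally, Tika 1.x lets you tell this truncation apart from a genuine parse failure: if you construct the WriteOutContentHandler yourself, its isWriteLimitReached(Throwable) method identifies the exception, and its toString() returns the text captured so far. A minimal sketch, assuming the same parser, stream, and metadata objects as the code below:

// Sketch for Tika 1.x: construct the WriteOutContentHandler explicitly so
// that hitting the write limit can be told apart from a real parse failure.
WriteOutContentHandler wrapped = new WriteOutContentHandler(100000);
BodyContentHandler handler = new BodyContentHandler(wrapped);
try {
    parser.parse(stream, handler, metadata, new ParseContext());
} catch (SAXException e) {
    if (wrapped.isWriteLimitReached(e)) {
        // Truncation, not failure: the text up to the limit was captured.
        System.out.println(wrapped.toString());
    } else {
        throw e;
    }
}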

My code is as follows:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

Parser parser = new PDFParser();
// No write limit is passed, so the default of 100,000 characters applies.
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
InputStream stream = null;
try {
    stream = new FileInputStream(new File("1.pdf"));
    parser.parse(stream, handler, metadata, new ParseContext());
    for (String name : metadata.names()) {
        System.out.println(name + ":\t" + metadata.get(name));
    }
} catch (IOException | SAXException | TikaException e) {
    e.printStackTrace();
} finally {
    if (stream != null) {
        try {
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The maximum number of characters to read is not passed to the constructor, so the default limit of 100,000 characters is used. Checking the code above against the Tika API, I noticed the following

BodyContentHandler constructor:
org.apache.tika.sax.BodyContentHandler.BodyContentHandler(int writeLimit)

This writeLimit parameter is exactly what the error message refers to. Passing a larger number to the constructor, for example 10*1024*1024, fixes the error (pick a value based on the size of the PDF document).
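For example (a minimal sketch; 10*1024*1024 is an arbitrary ceiling, and per the Tika javadoc a writeLimit of -1 disables the limit entirely, at the cost of buffering the whole document text in memory):

// Raise the write limit well above the document's text size.
BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);

// Alternatively, disable the limit altogether (unbounded memory use):
// BodyContentHandler handler = new BodyContentHandler(-1);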

Debugging the program again yields the following metadata:

  

dc:subject:
meta:save-date:	2014-07-22T21:02:38Z
subject:	PostgreSQL 9.3 Documentation
Author:	The PostgreSQL Global Development Group
dcterms:created:	2014-07-22T20:55:33Z
date:	2014-07-22T21:02:38Z
creator:	The PostgreSQL Global Development Group
Creation-Date:	2014-07-22T20:55:33Z
title:	PostgreSQL 9.3 Documentation
trapped:	False
meta:author:	The PostgreSQL Global Development Group
created:	Wed Jul 23 04:55:33 CST 2014
meta:keyword:
cp:subject:	PostgreSQL 9.3 Documentation
dc:format:	application/pdf; version=1.4
PTEX.Fullbanner:	This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012/Debian) kpathsea version 6.1.0
xmp:CreatorTool:	LaTeX with hyperref package
Keywords:
dc:title:	PostgreSQL 9.3 Documentation
Last-Save-Date:	2014-07-22T21:02:38Z
meta:creation-date:	2014-07-22T20:55:33Z
dcterms:modified:	2014-07-22T21:02:38Z
dc:creator:	The PostgreSQL Global Development Group
pdf:PDFVersion:	1.4
Last-Modified:	2014-07-22T21:02:38Z
modified:	2014-07-22T21:02:38Z
xmpTPg:NPages:	2861
pdf:encrypted:	false
producer:	pdfTeX-1.40.13; modified using iText® 5.1.3 ©2000-2011 1T3XT BVBA
Content-Type:	application/pdf

  


How does Lucene model the search content?

A document is a container that holds one or more fields. A field value can be indexed or left unindexed; if you need to search on a field, it must be indexed. Field values in binary format can only be stored, not indexed. When a field is indexed, an analyzer converts the field value into tokens.

The indexing process. Lucene's indexing process is divided into three main steps: converting the original document into text, analyzing the text, and saving the analyzed text to the index. During an indexing operation, text is first extracted from the raw data and used to create a corresponding Document object. This object contains multiple Field instances, each holding a piece of the original data. The subsequent analysis step processes the text into a stream of tokens, which are finally added to the index's segment structure.

Extracting text and creating documents: Lucene can easily index plain-text sources such as txt files. However, to index a manual in PDF format, you must first extract the text from the document and use it to create the Document and its Fields. Java has no built-in way to process PDF files, and the same is true of Microsoft Word files and other non-plain-text formats. Lucene can therefore be combined with the Tika framework, as above, to extract the relevant text.

Analyzing the document: during an indexing operation, Lucene first analyzes the text, splitting it into tokens and then optionally applying further operations to them. For example, tokens are usually converted to lowercase before indexing so that searches are case-insensitive, and frequent words with no real search value (stop words such as "ah", "haha", ...) may be removed. Document analysis covers a considerable amount of ground that is not described in detail here; I will explain it in detail later. A minimal indexing sketch follows below.
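As a sketch of these steps, assuming a recent Lucene version (the index path and field names are illustrative, and extractedText stands for the text obtained from the PDF via Tika above):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexPdfText {
    public static void main(String[] args) throws Exception {
        String extractedText = "..."; // text extracted from the PDF via Tika

        Directory dir = FSDirectory.open(Paths.get("index"));
        // StandardAnalyzer tokenizes the text and lowercases the tokens,
        // which is what makes the search case-insensitive.
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            Document doc = new Document();
            // Stored but not tokenized: useful for displaying the hit source.
            doc.add(new StringField("path", "1.pdf", Field.Store.YES));
            // Tokenized and indexed so that the content is searchable.
            doc.add(new TextField("contents", extractedText, Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}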
