"NLP" Tika text preprocessing: Extracting content from various format files


Tika: extracting content from common file formats for preprocessing

Author: Bai Ningsu

March 30, 2016 18:57:08

Abstract: This article focuses on an important foundational step of the natural language processing (NLP) pipeline: extracting text content during preprocessing. First, we need to recognize how important preprocessing is. In the era of big data there is more and more unstructured and semi-structured text, so extracting the valuable knowledge we need from massive amounts of text becomes especially important. In addition, text comes in many different formats. For common file types such as PDF, Word, Excel, XML, PPT, and TXT you may already have some way to cope, even if a frustrating one; but when you run into databases, HTML, mail, RTF, images, audio, and other documents, you may have no strategy at all. For this reason, this article introduces the Apache Tika content extraction toolkit, whose strength is that it can handle all of these file types and leave you more time for the important work. Section 1 of this article explains the core concepts, Section 2 expands on and supplements that knowledge, Section 3 walks through a typical demo with source code, and Section 4 lists references and shares the core files and the Tika tool's jar package. (Original work by the author; compiling it took effort, so please cite the source when reproducing: Tika: extracting content from common file formats for preprocessing.)

1 Tika Introduction

Tika Concept

Tika is a content analysis toolkit. It comes with a comprehensive set of parser classes that can parse essentially all common file formats, extract the files' metadata and content, and return the information in a structured form. It can therefore be used as a general-purpose parsing tool, and it is particularly valuable for the data capture and processing steps of a search engine. Tika began as a sub-project of the Apache Lucene project, where it makes it easy to extract content from large batches of documents in Lucene applications. The Apache Tika toolkit automatically detects document types (such as Word, PPT, XML, CSV, etc.) and extracts each document's metadata and textual content. Tika integrates existing document parsing libraries and provides a unified interface, which makes parsing different types of documents much simpler. Tika is useful for search engine indexing, content analysis, format conversion, and more.

Tika Architecture

Application programmers can easily integrate Tika into their applications. Tika also provides a command-line interface and a graphical user interface, which makes it even more approachable. In this chapter, we discuss the four important modules that make up the Tika architecture:

    • Language detection mechanism.
    • MIME detection mechanism.
    • Parser interface.
    • Tika facade class.

Language detection mechanism

Whenever a text file is passed to Tika, it can detect the language of the text. For files that carry no language annotation, Tika detects the language and adds that information to the file's metadata. To support language identification, Tika has a class called LanguageIdentifier in the package org.apache.tika.language, together with a language identification repository that contains the algorithms for detecting the language of a given text. Internally, Tika uses an N-gram algorithm for language detection.

MIME detection mechanism

Tika can detect document types according to the MIME standards. By default, Tika's MIME type detection uses org.apache.tika.mime.MimeTypes, and it uses the org.apache.tika.detect.Detector interface for most content-type detection. Internally, Tika applies a variety of techniques such as file-name glob matching, content-type hints, magic bytes, character-encoding detection, and several others.
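As a minimal sketch of type detection (the file path below is only a placeholder), the Tika facade can be used to drive the default MimeTypes/Detector configuration:

import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;

public class DetectMimeType {
    public static void main(String[] args) throws IOException {
        Tika tika = new Tika();                        // uses the default MimeTypes and Detector configuration
        File file = new File("./myfile/example.docx"); // placeholder path
        String mimeType = tika.detect(file);           // combines name glob matching with magic-byte checks
        System.out.println("Detected type: " + mimeType);
    }
}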

Parser interface

The org.apache.tika.parser.Parser interface is the primary interface Tika uses to parse documents. It extracts the text and the metadata of a document while hiding the parsing details from external users who want to write parser plugins. By using a different concrete parser class for each document type, Tika supports a large number of file formats. These format-specific classes support the different file formats either by implementing the parsing logic directly or by delegating to an external parser library.

Tika Facade Class

The Tika facade class is the simplest and most direct way to invoke Tika from Java, and it follows the facade design pattern. The facade class, Tika, can be found in the org.apache.tika package of the Tika API. By implementing the basic use cases, Tika acts as a broker for the underlying machinery: it abstracts away the complexity of the Tika library, such as the MIME detection mechanism, the Parser interface, and the language detection mechanism, and gives the user a simple interface to work with.
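As a minimal sketch of the facade in use (the file path reuses the placeholder from the demo in section 3), Tika.parseToString detects the file type, picks a parser, and returns the extracted plain text:

import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class FacadeExample {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();                               // the facade hides detection and parser selection
        File file = new File("./myfile/active learning.pdf"); // placeholder path
        String text = tika.parseToString(file);               // detect, parse, and return the plain text content
        System.out.println(text);
    }
}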

Features of Tika

    • Unified parser interface: Tika encapsulates third-party parser libraries behind a single Parser interface. Thanks to this, users are freed from the burden of choosing the right parser library, and learning how to use it, for each type of file they encounter.

    • Low memory footprint: Tika consumes few memory resources, so it is easy to embed in Java applications. It can even run in resource-constrained environments such as mobile PDA applications.

    • Fast processing: reliably fast content detection and extraction can be expected from applications that use it.

    • Flexible metadata: Tika understands all of the metadata models commonly used to describe files.

    • Parser integration: Tika can use multiple parser libraries, one for each file type, within a single application.

    • MIME type detection: Tika can detect and extract content from all media types included in the MIME standard.

    • Language detection: Tika includes a language identification feature, so it can determine the language of documents on a multilingual website.

Functions of Tika

Tika supports the following functions:

    • Document type detection
    • Content extraction
    • Metadata extraction
    • Language detection

File type detection

Tika uses different detection techniques to determine the type of the files passed to it.

Content Extraction

Tika has a parser library that can parse and extract the content of various document formats. After detecting the type of a document, it selects the appropriate parser from the parser library and passes the document to it. Tika uses different parser classes and methods to parse different file formats.

Meta Data extraction

Along with the content, Tika extracts the metadata of the file in the same operation. For some file types, Tika provides interface classes dedicated to extracting metadata.
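A minimal sketch of metadata extraction (the file path is a placeholder): after parsing, the Metadata object can be enumerated to see which properties the chosen parser filled in:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class ListMetadata {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream input = new FileInputStream("./myfile/active learning.pdf")) { // placeholder path
            parser.parse(input, new BodyContentHandler(), metadata, new ParseContext());
        }
        for (String name : metadata.names()) {            // every metadata property the parser has set
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}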

Language detection

Internally, Tika uses an n-gram algorithm to detect the language of the content in a given document. For language identification, Tika relies on classes such as LanguageIdentifier and the language profiles it uses.
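A minimal sketch of language identification, assuming the legacy org.apache.tika.language.LanguageIdentifier class from Tika 1.x (newer releases move this functionality into the tika-langdetect modules):

import org.apache.tika.language.LanguageIdentifier;

public class IdentifyLanguage {
    public static void main(String[] args) {
        String text = "La vie est belle et pleine de surprises.";     // sample text to identify
        LanguageIdentifier identifier = new LanguageIdentifier(text); // n-gram profile comparison
        System.out.println("Language: " + identifier.getLanguage());  // ISO 639 code, e.g. "fr"
        System.out.println("Reasonably certain: " + identifier.isReasonablyCertain());
    }
}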

2 Core Knowledge expansion

Parser interface

The org.apache.tika.parser.Parser interface is the key component of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All of this is done with a single method:

void parse(InputStream stream, ContentHandler handler, Metadata metadata)
    throws IOException, SAXException, TikaException;

The parse method accepts the document to be parsed and the associated metadata as input, and produces XHTML SAX events as well as additional metadata as output. The main conditions that led to this design are shown in Table 1.

Table 1. Conditions behind Tika's parsing design

Streamed parsing: The interface should require neither the client application nor the parser implementation to keep the full document content in memory or on disk. This allows even large documents to be parsed without excessive resource requirements.
Structured content: A parser implementation should be able to include structural information (headings, links, and so on) within the extracted content. A client application can use this information, for example, to better judge the relevance of different parts of the parsed document.
Input metadata: A client application should be able to pass in metadata such as the file name or the declared content type of the document to be parsed. The parser implementation can use this information to better guide the parsing process.
Output metadata: A parser implementation should be able to return document metadata in addition to the document content. Many document formats contain metadata, such as the author's name, that is useful to client applications.

These conditions are reflected in the parameters of the parse method.

Document InputStream

The first parameter is the InputStream used to read the document to be parsed.

If the document stream cannot be read, parsing stops and the IOException is passed on to the client application. If the stream can be read but not parsed (for example, because the document is corrupted), the parser throws a TikaException. The parser implementation will consume the stream but will not close it; closing the stream is the responsibility of the client application that opened it in the first place. Listing 1 shows the recommended pattern for using a stream with the parse method.

Listing 1. Recommended pattern for using a stream with the parse method
InputStream stream = ...;      // open the stream
try {
    parser.parse(stream, ...); // parse the stream
} finally {
    stream.close();            // close the stream
}

XHTML SAX Events

The parsed content of the document stream is returned to the client application as a sequence of XHTML SAX events. XHTML is used to express the structured content of the document, and SAX events enable streamed processing. Note that the XHTML format is used only to express structural information; it is not meant to render the document for browsing. The XHTML SAX events produced by the parser implementation are sent to the ContentHandler instance given to the parse method. If the content handler fails to process an event, parsing stops and the SAXException is passed on to the client application. Listing 2 shows the overall structure of the generated event stream (indentation added for clarity).

Listing 2. The overall structure of the generated event stream

Parser implementations typically use the XHTMLContentHandler utility class to generate their XHTML output. Dealing with raw SAX events can be complicated, so Apache Tika (since V0.2) ships several utility classes that process the event stream and convert it to other representations.

For example, the BodyContentHandler class can be used to extract just the body part of the XHTML output and feed it, as SAX events, to another content handler or, as characters, to an output stream, a writer, or a string buffer. The following code snippet parses a document from the standard input stream and writes the extracted document content to standard output:

ContentHandler handler = new BodyContentHandler(System.out);
parser.parse(System.in, handler, ...);

Another useful class is ParsingReader, which uses a background thread to parse the document and returns the extracted text content as a character stream.

Listing 3. Example of using ParsingReader
InputStream stream = ...; // the document to be parsed
Reader reader = new ParsingReader(parser, stream, ...);
try {
    ...;                  // read the document text using the reader
} finally {
    reader.close();       // the document stream is closed automatically
}
Document Meta Data

The last parameter of the parse method is used to pass document metadata both into and out of the parser. Document metadata is expressed as a Metadata object. Table 2 lists some of the more interesting metadata properties.

Table 2. Metadata properties

Metadata.RESOURCE_NAME_KEY: Contains the file or resource name of the document. A client application can set this property so that the parser can infer the format of the document from the file name. If the file format itself carries a canonical file name (the GZIP format, for example, has a slot for the file name), the parser implementation can set this property.
Metadata.CONTENT_TYPE: The declared content type of the document. A client application can set this property based on, for example, an HTTP Content-Type header; the declared content type helps the parser interpret the document correctly. The parser implementation sets this property to the appropriate content type of the document actually being parsed.
Metadata.TITLE: The title of the document. If the document format contains an explicit title field, the parser implementation sets this property.
Metadata.AUTHOR: The author name of the document. If the document format contains an explicit author field, the parser implementation sets this property.

Note that metadata handling is still being discussed within the Apache Tika development team, so there may be some (possibly incompatible) changes to metadata handling in versions prior to Tika V1.0.
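A minimal sketch of passing input metadata and reading output metadata, assuming the Tika 1.x property names listed in Table 2 (the file name is a placeholder):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class MetadataInOut {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, "report.pdf");  // input metadata: let the parser use the file name
        metadata.set(Metadata.CONTENT_TYPE, "application/pdf");  // input metadata: the declared content type
        try (InputStream stream = new FileInputStream("report.pdf")) { // placeholder file
            parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
        }
        System.out.println("Title:  " + metadata.get(Metadata.TITLE));  // output metadata set by the parser
        System.out.println("Author: " + metadata.get(Metadata.AUTHOR)); // output metadata set by the parser
    }
}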

Parser implementation

Apache Tika comes with some parser classes to parse various document formats, as shown in table 3.

Table 3. Tika parser classes

Microsoft Excel (application/vnd.ms-excel): Support for Excel spreadsheets is available in all Tika versions, based on POI's HSSF library.
Microsoft Word (application/msword): Support for Word documents is available in all Tika versions, based on POI's HWPF library.
Microsoft PowerPoint (application/vnd.ms-powerpoint): Support for PowerPoint presentations is available in all Tika versions, based on POI's HSLF library.
Microsoft Visio (application/vnd.visio): Support for Visio diagrams was added in Tika V0.2, based on POI's HDGF library.
Microsoft Outlook (application/vnd.ms-outlook): Support for Outlook messages was added in Tika V0.2, based on POI's HSMF library.
GZIP compression (application/x-gzip): Support for GZIP was added in Tika V0.2, based on the GZIPInputStream class in the Java 5 class library.
bzip2 compression (application/x-bzip): Support for bzip2 was added in Tika V0.2, based on the bzip2 parsing code from Apache Ant, which was originally based on Keiron's work at Aftex Software.
MP3 audio (audio/mpeg): Parsing of ID3v1 tags in MP3 files was added in Tika V0.2. If found, the following metadata is extracted and set:
  • TITLE: the title
  • SUBJECT: the subject
MIDI audio (audio/midi): Tika uses the MIDI support in javax.sound.midi to parse MIDI sequence files. Many karaoke file formats are based on MIDI and contain lyrics as embedded text tracks, which Tika knows how to extract.
Wave audio (audio/basic): Tika supports sampled wave audio (.wav files, etc.) through the javax.sound.sampled package. Only the audio metadata is extracted.
Extensible Markup Language (XML) (application/xml): Tika uses the javax.xml classes to parse XML files.
Hypertext Markup Language (HTML) (text/html): Tika uses the CyberNeko library to parse HTML files.
Images (image/*): Tika uses the javax.imageio classes to extract metadata from image files.
Java class files: Parsing of Java class files is based on the ASM library and Dave Brosius's work in JCR-1522.
Java archive files: Parsing of JAR files is done using a combination of the ZIP and Java class file parsers.
OpenDocument (application/vnd.oasis.opendocument.*): Tika uses the ZIP and XML features built into the Java language to parse the OpenDocument document types used by OpenOffice V2.0 and later. The older OpenOffice V1.0 formats are also supported, but they are currently not detected automatically as well as the newer formats.
Plain text (text/plain): Tika uses the International Components for Unicode Java library (ICU4J) to parse plain text.
Portable Document Format (PDF) (application/pdf): Tika uses the PDFBox library to parse PDF documents.
Rich Text Format (RTF) (application/rtf): Tika uses Java's built-in Swing library to parse RTF documents.
TAR (application/x-tar): Tika uses an adapted version of the TAR parsing code from Apache Ant to parse TAR files. The TAR code is based on Timothy Gerard Endres's work.
ZIP (application/zip): Tika uses Java's built-in ZIP classes to parse ZIP files.

You can extend Apache Tika with your own parsers, and any contribution you make to Tika is welcome. The goal of Tika is to reuse existing parser libraries (such as Apache PDFBox or Apache POI) as much as possible, so most of the parser classes within Tika are adapters around these external libraries. Apache Tika also contains some general-purpose parser implementations that are not targeted at any particular document format. The most notable of these is the AutoDetectParser class, which wraps all Tika functionality into a single parser that can handle any type of document: it automatically detects the type of the incoming document and then parses it accordingly. Now we can do some practical things. The following classes are what we will develop throughout the tutorial:

      1. BudgetScramble: shows how to use Apache Tika metadata to determine which document was changed most recently, and when.
      2. TikaMetadata: shows how to get all of a document's Apache Tika metadata, even when there is no data (only the metadata types are displayed).
      3. TikaMimeType: shows how to use Apache Tika's MIME types to detect the MIME type of a particular document.
      4. TikaExtractText: shows Apache Tika's text extraction function and saves the extracted text to an appropriate file.
      5. LanguageDetector: introduces the Nutch language identification feature to identify the language of given content.
      6. Summary: summarizes Tika features such as MIME type detection, content charset detection, and metadata. In addition, it introduces the cpdetector facility for determining the charset encoding of a file, and finally shows Nutch language identification in actual use.
3 Tika Text Extraction Example Analysis

Tika data extraction is generally done in the following steps:

1 InputStream input = new FileInputStream(new File("./myfile/active learning.pdf")); // build an InputStream to read the data; the path can point to PDF, Word, HTML, and other files

2 BodyContentHandler textHandler = new BodyContentHandler(); // receives the extracted body content

3 Metadata metadata = new Metadata(); // the Metadata object holds metadata such as author and title

4 ParseContext context = new ParseContext(); // lets the parser use different parsers for different files

5 Parser parser = new AutoDetectParser(); // when parse is called, AutoDetectParser automatically guesses the document's MIME type; the input here is a PDF file, so a PDF parser will be used

6 parser.parse(input, textHandler, metadata, context); // perform the parsing

Source:

/**
 * Uses Tika's AutoDetectParser class to identify a file and extract its content.
 * @throws IOException
 * @throws SAXException
 * @throws TikaException
 */
public static void getTextFromPdf() throws IOException, SAXException, TikaException {
    // build an InputStream to read the data; the path can point to PDF, Word, HTML, and other files
    InputStream input = new FileInputStream(new File("./myfile/active learning.pdf"));
    BodyContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();       // the Metadata object holds metadata such as author and title
    // when parse is called, AutoDetectParser automatically guesses the document's MIME type;
    // the input here is a PDF file, so a PDF parser will be used
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    parser.parse(input, textHandler, metadata, context); // perform the parsing
    input.close();
    System.out.println("Title: " + metadata.get(Metadata.TITLE));
    System.out.println("Type: " + metadata.get(Metadata.TYPE));
    System.out.println("Body: " + textHandler.toString()); // print the body extracted by textHandler
}

Operation Result:

4 References and JAR package sharing

1 Understanding information content with Apache Tika

2 Tika Tutorials

3 Apache Tika: A common content analysis tool

4 Download Apache Tika

5 Server-1.12.jar package (access password: 32CD)

"NLP" Tika text preprocessing: Extracting content from various format files
