Java example of how to read text from Word, Excel, PDF, TXT, RTF, and HTML files

Source: Internet
Author: User

The following Java code reads the text content of several file types. Office documents (Word, Excel) are read with the Apache POI library, and PDF files with the PDFBox library.


WORD

    package textReader;

    import java.io.*;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    public class WordReader {
        public WordReader() {
        }

        /**
         * @param filePath file path
         * @return the text content of the Word document
         */
        public String getTextFromWord(String filePath) {
            String result = null;
            File file = new File(filePath);
            // try-with-resources closes the stream even if extraction fails
            try (FileInputStream fis = new FileInputStream(file)) {
                WordExtractor wordExtractor = new WordExtractor(fis);
                result = wordExtractor.getText();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return result;
        }
    }

EXCEL

    package textReader;

    import org.apache.poi.hssf.usermodel.HSSFWorkbook;
    import org.apache.poi.hssf.usermodel.HSSFSheet;
    import org.apache.poi.hssf.usermodel.HSSFRow;
    import org.apache.poi.hssf.usermodel.HSSFCell;

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;

    public class ExcelReader {

        /**
         * @param filePath file path
         * @return the text content of the Excel workbook
         */
        @SuppressWarnings("deprecation")
        public String getTextFromExcel(String filePath) {
            StringBuffer buff = new StringBuffer();
            try (FileInputStream fis = new FileInputStream(filePath)) {
                // Open the Excel workbook file
                HSSFWorkbook wb = new HSSFWorkbook(fis);
                // Walk through every worksheet
                for (int numSheets = 0; numSheets < wb.getNumberOfSheets(); numSheets++) {
                    if (null != wb.getSheetAt(numSheets)) {
                        HSSFSheet aSheet = wb.getSheetAt(numSheets); // get a sheet
                        for (int rowNumOfSheet = 0; rowNumOfSheet <= aSheet.getLastRowNum(); rowNumOfSheet++) {
                            if (null != aSheet.getRow(rowNumOfSheet)) {
                                HSSFRow aRow = aSheet.getRow(rowNumOfSheet); // get a row
                                for (int cellNumOfRow = 0; cellNumOfRow <= aRow.getLastCellNum(); cellNumOfRow++) {
                                    if (null != aRow.getCell(cellNumOfRow)) {
                                        HSSFCell aCell = aRow.getCell(cellNumOfRow); // get the cell in this column
                                        switch (aCell.getCellType()) {
                                        case HSSFCell.CELL_TYPE_FORMULA:
                                            break; // formula cells are skipped
                                        case HSSFCell.CELL_TYPE_NUMERIC:
                                            buff.append(aCell.getNumericCellValue()).append('\t');
                                            break;
                                        case HSSFCell.CELL_TYPE_STRING:
                                            buff.append(aCell.getStringCellValue()).append('\t');
                                            break;
                                        }
                                    }
                                }
                                buff.append('\n'); // one output line per row
                            }
                        }
                    }
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return buff.toString();
        }
    }

PDF

    package textReader;

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.pdfbox.pdfparser.PDFParser;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class PdfReader {
        public PdfReader() {
        }

        /**
         * @param filePath file path
         * @return the text content of the PDF file
         */
        public String getTextFromPdf(String filePath) {
            String result = null;
            FileInputStream is = null;
            PDDocument document = null;
            try {
                is = new FileInputStream(filePath);
                PDFParser parser = new PDFParser(is);
                parser.parse();
                document = parser.getPDDocument();
                PDFTextStripper stripper = new PDFTextStripper();
                result = stripper.getText(document);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                if (is != null) {
                    try { is.close(); } catch (IOException e) { e.printStackTrace(); }
                }
                if (document != null) {
                    try { document.close(); } catch (IOException e) { e.printStackTrace(); }
                }
            }
            return result;
        }
    }

TXT

    package textReader;

    import java.io.*;

    public class TxtReader {
        public TxtReader() {
        }

        /**
         * @param filePath file path
         * @return the text content of the txt file
         */
        public String getTextFromTxt(String filePath) throws Exception {
            FileReader fr = new FileReader(filePath);
            BufferedReader br = new BufferedReader(fr);
            StringBuffer buff = new StringBuffer();
            String temp = null;
            while ((temp = br.readLine()) != null) {
                buff.append(temp + "\r\n");
            }
            br.close();
            return buff.toString();
        }
    }

RTF

    package textReader;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import javax.swing.text.BadLocationException;
    import javax.swing.text.DefaultStyledDocument;
    import javax.swing.text.rtf.RTFEditorKit;

    public class RtfReader {
        public RtfReader() {
        }

        /**
         * @param filePath file path
         * @return the text content of the rtf file
         */
        public String getTextFromRtf(String filePath) {
            String result = null;
            File file = new File(filePath);
            try (InputStream is = new FileInputStream(file)) {
                DefaultStyledDocument styledDoc = new DefaultStyledDocument();
                new RTFEditorKit().read(is, styledDoc, 0);
                // Extract the text. To read Chinese, use ISO8859_1 encoding; otherwise garbled characters may occur.
                result = new String(styledDoc.getText(0, styledDoc.getLength()).getBytes("ISO8859_1"));
            } catch (IOException e) {
                e.printStackTrace();
            } catch (BadLocationException e) {
                e.printStackTrace();
            }
            return result;
        }
    }

HTML

    package textReader;

    import java.io.*;

    public class HtmlReader {
        public HtmlReader() {
        }

        /**
         * @param filePath file path
         * @return the full html content, including tags
         */
        public String readHtml(String filePath) {
            StringBuffer sb = new StringBuffer();
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(filePath), "GB2312"))) {
                String temp = null;
                while ((temp = br.readLine()) != null) {
                    sb.append(temp);
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return sb.toString();
        }

        /**
         * @param filePath file path
         * @return the text of the html file with tags removed
         */
        public String getTextFromHtml(String filePath) {
            // Read the whole file, then keep only the text between tags
            String str = readHtml(filePath);
            StringBuffer buff = new StringBuffer();
            int maxindex = str.length() - 1;
            int begin = 0;
            int end;
            // Take the text between each '>' and the next '<'
            while ((begin = str.indexOf('>', begin)) >= 0 && begin < maxindex) {
                end = str.indexOf('<', begin);
                if (end < 0) {
                    break; // no further tag; stop instead of restarting the search from index 0
                }
                if (end - begin > 1) {
                    buff.append(str.substring(++begin, end));
                }
                begin = end + 1;
            }
            return buff.toString();
        }
    }
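The HTML reader above keeps only the text that lies between a ">" and the following "<". As a minimal sketch of the same idea, the class below does the tag skipping with a single pass and a flag instead of index arithmetic. The class name HtmlTextSketch and the method stripTags are hypothetical names chosen for illustration; they are not part of the original code. Unlike the original, this version also keeps text that appears before the first tag. It does not decode entities such as &amp;amp; and does not skip the contents of script or style elements.

```java
final class HtmlTextSketch {

    /**
     * Returns the characters of html that lie outside of <...> tags.
     * A stray '>' outside a tag is dropped as well.
     */
    static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;       // a tag starts: stop collecting
            } else if (c == '>') {
                insideTag = false;      // the tag ends: resume collecting
            } else if (!insideTag) {
                text.append(c);         // ordinary text between tags
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p> <b>world</b></body></html>";
        System.out.println(stripTags(html));
    }
}
```

Because the loop always advances by one character, it terminates on any input, including text that ends without a closing tag.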

Note: Documents that were edited with WPS can trigger errors when they are read with the code above, so avoid editing the files in WPS.

The error messages are as follows:

WORD

Your document seemed to be mostly unicode, but the section definition was in bytes! Trying anyway, but things may well go wrong!

EXCEL

java.lang.RuntimeException: Expected an EXTERNSHEET record but got (org.apache.poi.hssf.record.SSTRecord)
    at org.apache.poi.hssf.model.LinkTable.readExtSheetRecord(LinkTable.java:187)
    at org.apache.poi.hssf.model.LinkTable.<init>(LinkTable.java:163)
    at org.apache.poi.hssf.model.Workbook.createWorkbook(Workbook.java:199)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:273)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:196)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:312)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:293)
    at textReader.ExcelReader.getTextFromExcel(ExcelReader.java:23)
    at DocumentInfo.getContent(DocumentInfo.java:86)
    at MainFunction.main(MainFunction.java:19)

RTF

java.io.IOException: Too many close-groups in RTF text
    at javax.swing.text.rtf.RTFParser.write(Unknown Source)
    at javax.swing.text.rtf.RTFParser.writeSpecial(Unknown Source)
    at javax.swing.text.rtf.AbstractFilter.write(Unknown Source)
    at javax.swing.text.rtf.AbstractFilter.readFromStream(Unknown Source)
    at javax.swing.text.rtf.RTFEditorKit.read(Unknown Source)
    at textReader.RtfReader.getTextFromRtf(RtfReader.java:25)
    at DocumentInfo.getContent(DocumentInfo.java:74)
    at MainFunction.main(MainFunction.java:19)

As for why the error is raised in write: the parser keeps a level counter that is incremented and decremented as "{" and "}" are read, and it reports this error when the braces do not match. An RTF file edited in WPS is malformed because its "{" and "}" do not match. If you instead create a new file in Word or WordPad, edit it, and save it as an RTF file (do not use WPS for this), then opening it in Notepad shows that many format descriptions have been added but the "{" and "}" match, so no error is reported.
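The brace-matching explanation above suggests a quick pre-check that can be run on an RTF file before it is handed to RTFEditorKit. The sketch below is only an illustration of that idea, not the parser's actual code: the class name RtfBraceCheckSketch and the method bracesBalanced are hypothetical. It assumes the file content is already available as a string, treats a backslash as escaping the next character (so \{ and \} are not counted), and ignores embedded binary data, which could contain unescaped braces.

```java
final class RtfBraceCheckSketch {

    /**
     * Returns true if every '{' has a matching '}' and no '}' appears
     * before its '{'. Characters escaped with a backslash are ignored.
     */
    static boolean bracesBalanced(String rtf) {
        int level = 0;
        for (int i = 0; i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '\\') {
                i++;                    // skip the escaped or control-word character
            } else if (c == '{') {
                level++;                // a group opens
            } else if (c == '}') {
                level--;                // a group closes
                if (level < 0) {
                    return false;       // more closing than opening braces so far
                }
            }
        }
        return level == 0;              // leftover open groups also count as a mismatch
    }

    public static void main(String[] args) {
        System.out.println(bracesBalanced("{\\rtf1 Hello}"));
        System.out.println(bracesBalanced("{\\rtf1 Hello}}"));
    }
}
```

A file that fails this check can be reported or skipped up front instead of surfacing as an IOException from inside the RTF parser.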
