Using Node.js on Linux to extract text from Word (doc/docx) and PDF files

Source: Internet
Author: User

Preface

To build a full-text search engine, you need to extract the text from documents such as Word and PDF files. For PDF there are open-source solutions such as xpdf.

However, Word documents are more complex.

Extract PDF text

Xpdf is a free, open-source program for displaying PDF files and converting them to text or images. It also supports Windows. Installing it on Debian Linux is very simple:

apt-get install xpdf

Here we only use the pdftotext tool. Run it without arguments to print its help information:

root@raspberrypi:/var/www# pdftotext
pdftotext version 0.26.5
Copyright 2005-2014 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -r <fp>           : resolution, in DPI (default is 72)
  -x <int>          : x-coordinate of the crop area top left corner
  -y <int>          : y-coordinate of the crop area top left corner
  -W <int>          : width of crop area in pixels (default is 0)
  -H <int>          : height of crop area in pixels (default is 0)
  -layout           : maintain original physical layout
  -fixed <fp>       : assume fixed-pitch (or tabular) text
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -listenc          : list available encodings
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -bbox             : output bounding box for each word and page size to html. Sets -htmlmeta
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information

Test:

root@raspberrypi:/var/www# pdftotext onceai1_onceai.txt
root@raspberrypi:/var/www# cat onceai.txt
product introduction: anchstone intelligent technology (Shanghai) Co., Ltd ....

Then use child_process in Node.js to call this command directly. pdftotext writes the content to a text file, so a further step may be needed to read it back. The code is omitted.
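
Calling pdftotext from Node.js can be sketched with child_process.execFile. This is a minimal sketch, not the author's omitted code: the helper names buildPdftotextArgs and extractPdfText are hypothetical, and it assumes pdftotext is on the PATH. Instead of writing to a text file, it passes "-" as the output file, which asks pdftotext to write the text to stdout, so no temporary file has to be read back.

```javascript
const { execFile } = require("node:child_process");

// Hypothetical helper: turn a small options object into pdftotext arguments.
// Only flags listed in the help output above are used.
function buildPdftotextArgs(pdfPath, options = {}) {
  const args = [];
  if (options.firstPage) args.push("-f", String(options.firstPage));
  if (options.lastPage) args.push("-l", String(options.lastPage));
  if (options.layout) args.push("-layout");
  args.push("-enc", "UTF-8");
  // "-" as the text file sends the extracted text to stdout.
  args.push(pdfPath, "-");
  return args;
}

// Hypothetical helper: run pdftotext and resolve with the extracted text.
function extractPdfText(pdfPath, options = {}) {
  return new Promise((resolve, reject) => {
    execFile(
      "pdftotext",
      buildPdftotextArgs(pdfPath, options),
      { maxBuffer: 64 * 1024 * 1024 }, // allow large documents
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}

// Usage (the file name is a placeholder):
// extractPdfText("example.pdf", { firstPage: 1, lastPage: 5 })
//   .then((text) => console.log(text))
//   .catch((err) => console.error("pdftotext failed:", err.message));
```

execFile is used rather than exec so the file name is passed as an argument and never interpreted by a shell.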

Use antiword to extract .doc content

Here we use the open-source tool antiword to extract the content of .doc files from Word 2003 and earlier versions. The installation is also very simple:

apt-get install antiword

View help:

root@raspberrypi:/var/www# antiword
    Name: antiword
    Purpose: Display MS-Word files
    Author: (C) 1998-2005 Adri van Os
    Version: 0.37 (21 Oct 2005)
    Status: GNU General Public License
    Usage: antiword [switches] wordfile1 [wordfile2 ...]
    Switches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]
        -f formatted text output
        -t text output (default)
        -a <paper size name> Adobe PDF output
        -p <paper size name> PostScript output
           paper size like: a4, letter or legal
        -x <dtd> XML output
           like: db (DocBook)
        -m <mapping> character mapping file
        -w <width> in characters of text output
        -i <level> image level (PostScript only)
        -L use landscape mode (PostScript only)
        -r Show removed text
        -s Show hidden (by Word) text

Antiword directly outputs the word content to the console:

root@raspberrypi:/var/www# antiword spec.doc
SYNC Mobile – Ford APA
Project Number: DFYST
Requirements Specification

You can also use child_process to call this command in Node.js.
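
Because antiword prints to the console, its stdout can be captured directly. The following is a minimal sketch under stated assumptions: buildAntiwordArgs and extractDocText are hypothetical names, antiword is assumed to be on the PATH, and only switches shown in the help output above (-t, -w, -m) are used.

```javascript
const { execFile } = require("node:child_process");

// Hypothetical helper: build the antiword argument list.
function buildAntiwordArgs(docPath, options = {}) {
  const args = ["-t"]; // plain text output (also the default)
  if (options.width !== undefined) {
    args.push("-w", String(options.width)); // line width in characters
  }
  if (options.mapping) {
    args.push("-m", options.mapping); // character mapping file chosen by the caller
  }
  args.push(docPath);
  return args;
}

// Hypothetical helper: run antiword and resolve with the text it prints.
function extractDocText(docPath, options = {}) {
  return new Promise((resolve, reject) => {
    execFile(
      "antiword",
      buildAntiwordArgs(docPath, options),
      { maxBuffer: 64 * 1024 * 1024 }, // allow large documents
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}

// Usage, with the file from the example above:
// extractDocText("spec.doc")
//   .then((text) => console.log(text))
//   .catch((err) => console.error("antiword failed:", err.message));
```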

Extract .docx content

A docx document is essentially a zip archive, so you only need to unzip it in Node.js and then parse the word/document.xml file inside it.
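
The unzip-then-parse approach can be sketched as follows. This is a rough illustration, not a full WordprocessingML parser: the function names are hypothetical, the unzip command-line tool is assumed to be installed (unzip -p prints one archive member to stdout), and the text is pulled out of the w:t elements with regular expressions. Nested structures such as text boxes, and numeric character references, are not handled.

```javascript
const { execFile } = require("node:child_process");

// Decode the five predefined XML entities; "&amp;" goes last so that
// "&amp;lt;" correctly becomes the literal text "&lt;".
function decodeXmlEntities(s) {
  return s
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&");
}

// Hypothetical helper: convert the contents of word/document.xml to plain text,
// one line per <w:p> paragraph, by concatenating its <w:t> text runs.
function documentXmlToText(xml) {
  const paragraphs = xml.match(/<w:p[ >][\s\S]*?<\/w:p>/g) || [];
  return paragraphs
    .map((p) => {
      const runs = [...p.matchAll(/<w:t(?: [^>]*)?>([\s\S]*?)<\/w:t>/g)];
      return decodeXmlEntities(runs.map((m) => m[1]).join(""));
    })
    .join("\n");
}

// Hypothetical helper: read word/document.xml out of the archive and convert it.
function extractDocxText(docxPath) {
  return new Promise((resolve, reject) => {
    execFile(
      "unzip",
      ["-p", docxPath, "word/document.xml"],
      { maxBuffer: 64 * 1024 * 1024 }, // allow large documents
      (err, stdout) => (err ? reject(err) : resolve(documentXmlToText(stdout)))
    );
  });
}

// Usage (the file name is a placeholder):
// extractDocxText("example.docx").then((text) => console.log(text));
```

For anything beyond plain paragraphs, a real XML parser or a dedicated library is likely the safer choice.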

Some libraries on GitHub parse docx into HTML, for example:

https://github.com/mwilliamson/mammoth.js

https://github.com/lalalic/docx2html

Summary

That is all for this article. I hope it helps you in your study or work. If you have any questions, feel free to leave a message. Thank you for your support.
