Extracting text from Word (doc/docx) and PDF files with Node.js on Linux
Preface
To build a full-text search engine, you first need to extract the text from documents such as Word and PDF files. For PDF there are open-source tools such as xpdf.
Word documents, however, are more complex.
Extract PDF text
XPDF is free, open-source software for viewing PDF files and converting them to text or images. It also supports Windows. Installing it on Debian Linux is very simple:
apt-get install xpdf
Here we only use the pdftotext tool. Running it without arguments prints its usage information (on this system the binary identifies itself as the Poppler build):
root@raspberrypi:/var/www# pdftotext
pdftotext version 0.26.5
Copyright 2005-2014 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -r <fp>           : resolution, in DPI (default is 72)
  -x <int>          : x-coordinate of the crop area top left corner
  -y <int>          : y-coordinate of the crop area top left corner
  -W <int>          : width of crop area in pixels (default is 0)
  -H <int>          : height of crop area in pixels (default is 0)
  -layout           : maintain original physical layout
  -fixed <fp>       : assume fixed-pitch (or tabular) text
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -listenc          : list available encodings
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -bbox             : output bounding box for each word and page size to html. Sets -htmlmeta
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
Test:
root@raspberrypi:/var/www# pdftotext onceai1_onceai.txt
root@raspberrypi:/var/www# cat onceai.txt
product introduction: anchstone intelligent technology (Shanghai) Co., Ltd ....
In Node.js, you can then call this command directly through child_process. Because pdftotext writes its output to a text file, an extra step is needed to read that file back. The code is omitted here.
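For readers who want a starting point, the child_process call might be sketched as follows. This is only an illustration, not the author's code: the function names `buildPdftotextArgs` and `extractPdfText` are hypothetical, and it assumes that passing `-` as the output file makes pdftotext write to stdout, which avoids the intermediate text file.

```javascript
const { execFile } = require('child_process');

// Build the argument list for pdftotext (hypothetical helper).
// The flags -enc, -layout, -f and -l come from the usage text above.
// Assumption: "-" as the <text-file> sends the text to stdout.
function buildPdftotextArgs(pdfPath, options = {}) {
  const args = ['-enc', 'UTF-8'];
  if (options.layout) args.push('-layout');
  if (options.firstPage) args.push('-f', String(options.firstPage));
  if (options.lastPage) args.push('-l', String(options.lastPage));
  args.push(pdfPath, '-');
  return args;
}

// Run pdftotext and resolve with the extracted text (hypothetical helper).
function extractPdfText(pdfPath, options = {}) {
  return new Promise((resolve, reject) => {
    execFile(
      'pdftotext',
      buildPdftotextArgs(pdfPath, options),
      { maxBuffer: 64 * 1024 * 1024 }, // allow large documents
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

A call such as `extractPdfText('some.pdf', { layout: true })` returns a promise for the text. `execFile` is used instead of `exec` so that the file name is never interpreted by a shell.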
Use antiword to extract .doc content
Here we use the open-source tool antiword to extract the content of documents from Word 2003 and earlier versions. Installation is also very simple:
apt-get install antiword
View help:
root@raspberrypi:/var/www# antiword
    Name: antiword
    Purpose: Display MS-Word files
    Author: (C) 1998-2005 Adri van Os
    Version: 0.37 (21 Oct 2005)
    Status: GNU General Public License
    Usage: antiword [switches] wordfile1 [wordfile2 ...]
    Switches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]
        -f formatted text output
        -t text output (default)
        -a <paper size name> Adobe PDF output
        -p <paper size name> PostScript output
           paper size like: a4, letter or legal
        -x <dtd> XML output
           like: db (DocBook)
        -m <mapping> character mapping file
        -w <width> in characters of text output
        -i <level> image level (PostScript only)
        -L use landscape mode (PostScript only)
        -r Show removed text
        -s Show hidden (by Word) text
antiword writes the text of the Word document directly to the console:
root@raspberrypi:/var/www# antiword spec.doc
SYNC Mobile – Ford APA
Project Number: DFYST
Requirements Specification
You can also call this command from Node.js with child_process.
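Since antiword prints to the console, its stdout can be captured directly. The following is a minimal sketch under that assumption. The helper names `buildAntiwordArgs` and `extractDocText` are hypothetical, and the only switches used are `-t`, `-m` and `-w` from the help text above.

```javascript
const { execFile } = require('child_process');

// Build the argument list for antiword (hypothetical helper).
function buildAntiwordArgs(docPath, options = {}) {
  const args = ['-t']; // plain text output (already the default)
  if (options.mapping) args.push('-m', options.mapping); // character mapping file
  if (options.width !== undefined) args.push('-w', String(options.width)); // line width
  args.push(docPath);
  return args;
}

// Run antiword and resolve with the text it prints (hypothetical helper).
function extractDocText(docPath, options = {}) {
  return new Promise((resolve, reject) => {
    execFile(
      'antiword',
      buildAntiwordArgs(docPath, options),
      { maxBuffer: 64 * 1024 * 1024 }, // allow large documents
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

For example, `extractDocText('spec.doc')` resolves with the same text shown in the console session above. Which mapping file to pass to `-m`, if any, depends on what your antiword installation provides.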
Extract .docx content
A docx document is essentially a zip archive, so in Node.js you only need to unzip it and then parse the word/document.xml file inside it (for example, text.docx/word/document.xml).
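The unzip-and-parse idea can be sketched roughly as below. Several things here are assumptions rather than part of the original article: the `unzip` command-line tool is assumed to be installed, the helper names are hypothetical, and the text is pulled out of `<w:t>` runs with a regular expression instead of a real XML parser, so unusual documents may not come out cleanly.

```javascript
const { execFile } = require('child_process');

const NAMED_ENTITIES = { lt: '<', gt: '>', amp: '&', quot: '"', apos: "'" };

// Decode the predefined XML entities and numeric character references.
function decodeXmlEntities(s) {
  return s.replace(/&(#x[0-9a-fA-F]+|#\d+|lt|gt|amp|quot|apos);/g, (match, e) => {
    if (e[0] !== '#') return NAMED_ENTITIES[e];
    const code = e[1] === 'x' ? parseInt(e.slice(2), 16) : parseInt(e.slice(1), 10);
    return String.fromCodePoint(code);
  });
}

// Collect the text of <w:t> runs; emit a tab for <w:tab/> and a newline
// for <w:br/> and for the end of each paragraph (</w:p>).
function documentXmlToText(xml) {
  const tokenRe = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>|<w:tab\s*\/>|<w:br\s*\/>|<\/w:p>/g;
  let text = '';
  let m;
  while ((m = tokenRe.exec(xml)) !== null) {
    if (m[1] !== undefined) text += decodeXmlEntities(m[1]);
    else if (m[0].startsWith('<w:tab')) text += '\t';
    else text += '\n';
  }
  return text;
}

// Assumption: the "unzip" CLI is available; -p writes the entry to stdout.
function extractDocxText(docxPath) {
  return new Promise((resolve, reject) => {
    execFile(
      'unzip',
      ['-p', docxPath, 'word/document.xml'],
      { maxBuffer: 64 * 1024 * 1024 },
      (err, stdout) => (err ? reject(err) : resolve(documentXmlToText(stdout)))
    );
  });
}
```

`extractDocxText('text.docx')` would then resolve with the document's plain text, one paragraph per line. For anything beyond plain text, such as headers, footnotes, or tables, a dedicated library is likely a better choice than this sketch.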
Some libraries on GitHub parse docx into HTML, for example:
https://github.com/mwilliamson/mammoth.js
https://github.com/lalalic/docx2html
Summary
That is all for this article. I hope it helps with your study or work. If you have any questions, feel free to leave a comment. Thank you for your support.