Reading DOC, DOCX, XLS, PDF, and TXT content in PHP


One of my customers has this requirement: users upload files in doc, docx, xls, pdf, or txt format, and PHP must read the contents of these files and then count the number of words in each file.

1. Reading DOC files in PHP

PHP has no built-in class or library for reading Word files, so here we use the Antiword package (http://www.winfield.demon.nl/) to read doc files.

Let's start with how to use it on Windows:

1. Open http://www.winfield.demon.nl/ (the Antiword download page), find the Windows section (http://www.winfield.demon.nl/#Windows), and download the Windows version of Antiword (antiword-0_37-windows.zip);

2. Extract the downloaded archive to the root directory of drive C;

One more thing to note: http://www.informatik.uni-frankfurt.de/~markus/antiword/00README.WIN describes the installation on Windows.

You need to set an environment variable: My Computer (right-click) -> Advanced -> Environment Variables, then create a new entry in the user variables at the top:

Variable name: HOME

Variable value: C:\home (this directory must exist; if it does not, create a home folder on drive C).

Then, in the system variables, edit Path and add %HOME%\antiword at the front of its value, separated from the existing entries by a semicolon.

3. Start -> Run -> cmd, then change to the Antiword directory;

Enter antiword -h to check that it works.

4. Next, use the antiword -t command to read the contents of a doc file. First copy a doc file to the C:\antiword directory, then run

>antiword -t filename.doc

The contents of the Word file are printed on the screen.

You may ask what this has to do with reading Word files in PHP. Don't worry; here is how to use this command from PHP.

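<?php

// Single quotes keep the backslashes literal (in double quotes "\x" starts a hex escape).
$file = 'D:\xampp\htdocs\word_count\uploads\doc-english.doc';

// escapeshellarg() quotes the path so spaces or special characters cannot break the command.
$content = shell_exec('c:\antiword\antiword -f ' . escapeshellarg($file));

?>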

This reads the contents of the Word file into $content.

To read doc files under Linux, download the Linux version of the package; it contains a Readme.txt file with the installation instructions. Once it is installed, the call becomes:

$content = shell_exec('/usr/local/bin/antiword -f ' . escapeshellarg($file));
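
The Windows and Linux calls differ only in the location of the binary, so they can be wrapped in one helper. The following is a minimal sketch, not the author's code: buildAntiwordCommand and readDocFile are hypothetical names, and the binary locations are simply the ones assumed in this article.

```php
<?php
// Hypothetical helper: builds the antiword command line for the given OS.
// The binary paths are the ones used in this article and may differ on your server.
function buildAntiwordCommand(string $file, bool $isWindows): string
{
    $binary = $isWindows ? 'c:\antiword\antiword' : '/usr/local/bin/antiword';
    // escapeshellarg() quotes the path so that spaces cannot split the argument.
    return $binary . ' -f ' . escapeshellarg($file);
}

// Hypothetical wrapper: returns the extracted text, or '' when nothing could be read.
function readDocFile(string $file): string
{
    if (!is_file($file)) {
        return '';
    }
    $isWindows = strtoupper(substr(PHP_OS, 0, 3)) === 'WIN';
    $output = shell_exec(buildAntiwordCommand($file, $isWindows));
    // shell_exec() does not return a string when the command fails or prints nothing.
    return is_string($output) ? $output : '';
}
```

Returning an empty string on failure keeps the word count at zero for unreadable files instead of raising an error later.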

2. Reading PDF file content in PHP

PHP also has no class library dedicated to reading PDF content, so here we use a third-party package (xpdf). Again starting on Windows: download it and extract it to the root directory of drive C.

Start -> Run -> cmd -> cd /d c:\xpdf

<?php

// Single quotes keep the backslashes in the Windows path literal.
$file = 'D:\xampp\htdocs\word_count\uploads\pdf-english.pdf';

// The trailing "-" makes pdftotext write the text to standard output.
$content = shell_exec('C:\xpdf\pdftotext ' . escapeshellarg($file) . ' -');

?>

This allows you to read the contents of the PDF file into the PHP variable.
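
A failed conversion is easier to detect when the exit code is checked. The sketch below is an illustration under assumptions, not the article's code: readPdfFile is a hypothetical name, the path of the pdftotext binary is passed in by the caller, and exec() is used instead of shell_exec() so that the exit status is available.

```php
<?php
// Hypothetical helper: runs pdftotext and returns the text, or null on failure.
// $binary is the path of the pdftotext executable on your server (trusted configuration).
function readPdfFile(string $file, string $binary): ?string
{
    if (!is_file($file)) {
        return null;
    }
    // The trailing "-" asks pdftotext to write the text to standard output.
    $command = $binary . ' ' . escapeshellarg($file) . ' -';
    $lines = [];
    $exitCode = 1;
    // exec() fills $lines with the output lines and $exitCode with the exit status.
    exec($command, $lines, $exitCode);
    // A non-zero exit code means the conversion failed.
    return $exitCode === 0 ? implode("\n", $lines) : null;
}
```

A caller could then write readPdfFile($file, 'C:\xpdf\pdftotext') and treat null as "could not be read".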

Installation under Linux is also very simple and is not covered here. The call looks like this:

<?php

$content = shell_exec('/usr/bin/pdftotext ' . escapeshellarg($file) . ' -');

?>

3. Reading the contents of a zip file in PHP

First use PHP's ZipArchive to extract the zip file, then read the extracted files: use Antiword for Word files and xpdf for PDF files.
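
Which archive entries get read comes down to an extension check. The sketch below is illustrative only: isSupportedEntry and partitionEntries are hypothetical names, not part of the article's code. It puts all extensions in a single group so that the "$" anchor applies to every alternative, which rejects names such as notes.doc.bak.

```php
<?php
// Hypothetical helper: true when a zip entry name ends in a supported extension.
function isSupportedEntry(string $entry): bool
{
    // One group holds every extension, so "$" anchors all of the alternatives.
    return preg_match('#\.(txt|doc|docx|pdf)$#i', $entry) === 1;
}

// Hypothetical helper: splits entry names into readable and rejected lists.
function partitionEntries(array $entries): array
{
    $valid = [];
    $invalid = [];
    foreach ($entries as $entry) {
        if (isSupportedEntry($entry)) {
            $valid[] = $entry;
        } else {
            $invalid[] = $entry;
        }
    }
    return [$valid, $invalid];
}

list($valid, $invalid) = partitionEntries(['a.TXT', 'notes.doc.bak', 'img/logo.png', 'docs/report.docx']);
// $valid   is ['a.TXT', 'docs/report.docx']
// $invalid is ['notes.doc.bak', 'img/logo.png']
```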

<?php

/**
 * Read the supported files inside a zip archive
 *
 * @param string $file file path
 * @return string total valid content
 */
function readZipFile($file = '') {
    $content = "";
    $inValidFileName = array();
    // Files are extracted into a directory named after the archive, next to it.
    $extractDir = pathinfo($file, PATHINFO_DIRNAME) . "/" . pathinfo($file, PATHINFO_FILENAME);
    $zip = new ZipArchive();
    if ($zip->open($file) === true) {
        for ($i = 0; $i < $zip->numFiles; $i++) {
            $entry = $zip->getNameIndex($i);
            // Only entries ending in a supported extension are extracted and read.
            if (preg_match('#\.(txt|doc|docx|pdf)$#i', $entry)) {
                $zip->extractTo($extractDir, array($entry));
                $content .= checkSystemOS($extractDir . "/" . $entry);
            } else {
                $inValidFileName[$i] = $entry;
            }
        }
        $zip->close();
        // rrmdir() is a recursive directory-removal helper that is not shown in this article.
        rrmdir($extractDir);
        /*if (file_exists($file)) {
            unlink($file);
        }*/
        return $content;
    } else {
        return "";
    }
}

?>

4. Reading DOCX file contents in PHP

A docx file is actually a zip archive made up of many XML files; the content is stored in word/document.xml.

Take any docx file and open it with a zip tool (or rename the .docx extension to .zip and unzip it).

The word directory contains document.xml.

The contents of the docx file are in document.xml, so that is the file we read.

<?php

/**
 * Read a docx file
 *
 * @param string $file file path
 * @return string file content
 */
function parseWord($file) {
    $content = "";
    // document.xml is extracted into a directory named after the docx file.
    $extractDir = pathinfo($file, PATHINFO_DIRNAME) . "/" . pathinfo($file, PATHINFO_FILENAME);
    $zip = new ZipArchive();
    if ($zip->open($file) === true) {
        for ($i = 0; $i < $zip->numFiles; $i++) {
            $entry = $zip->getNameIndex($i);
            if (pathinfo($entry, PATHINFO_BASENAME) == "document.xml") {
                $zip->extractTo($extractDir, array($entry));
                $filepath = $extractDir . "/" . $entry;
                // strip_tags() drops the XML markup and keeps only the text.
                $content = strip_tags(file_get_contents($filepath));
                break;
            }
        }
        $zip->close();
        rrmdir($extractDir);
        return $content;
    } else {
        return "";
    }
}

?>

If you want to create docx files with PHP, or convert docx files to XHTML or PDF, you can use phpdocx (http://www.phpdocx.com/).
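
The docx content can also be read without unpacking anything to disk, because ZipArchive::getFromName() returns an entry's bytes directly. The following is a minimal sketch of that alternative, not the article's own function; docxXmlToText and readDocxFile are hypothetical names. It also inserts a newline at the end of each paragraph, since strip_tags() alone would join the last word of one paragraph to the first word of the next and skew the word count.

```php
<?php
// Hypothetical helper: converts the XML of word/document.xml to plain text.
function docxXmlToText(string $xml): string
{
    // A paragraph ends with </w:p>; add a newline there so that words from
    // neighbouring paragraphs are not glued together once the tags are removed.
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);
    // Decode XML entities such as &amp; after the markup has been stripped.
    return trim(html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8'));
}

// Hypothetical helper: reads a docx file without extracting it to disk.
function readDocxFile(string $file): string
{
    $zip = new ZipArchive();
    if ($zip->open($file) !== true) {
        return '';
    }
    // getFromName() returns the entry's contents, or false when it is missing.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? '' : docxXmlToText($xml);
}
```

Because nothing is written to disk, this variant needs no cleanup step after reading.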

5. Reading TXT files in PHP

Use PHP's file_get_contents function directly.

<?php

// Single quotes keep the backslashes in the Windows path literal.
$file = 'D:\xampp\htdocs\word_count\uploads\eng.txt';

$content = file_get_contents($file);

?>

6. Reading Excel files in PHP

PHPExcel (http://phpexcel.codeplex.com/) can be used for this.

Now that the contents of the files can be read, how do we count the words?

PHP has a built-in function, str_word_count, that counts words, but using it on the text Antiword extracts from a doc file gives a very inaccurate result.
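
How str_word_count treats a token depends on its characters: it only counts runs of letters (plus ' and -), so a token made only of digits is skipped. The sketch below compares it with a simple whitespace split to make that visible; countByWhitespace is a hypothetical helper for illustration, not the counting function used in this article.

```php
<?php
// Hypothetical helper: counts whitespace-separated tokens.
function countByWhitespace(string $text): int
{
    // PREG_SPLIT_NO_EMPTY drops empty pieces, so blank input yields zero tokens.
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return count($tokens);
}

$sample = "Order 66 shipped";
echo str_word_count($sample), "\n";    // prints 2: "66" contains no letters
echo countByWhitespace($sample), "\n"; // prints 3: every token counts
```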

Here we use the following function specifically to count the words:

<?php

/**
 * Count the words in a text
 *
 * @param string $text text content of the file
 * @return int word count of the content
 */
function statisticWordsCount($text = '') {
    $text = trim(preg_replace('/\d+/', ' ', $text)); // replace runs of digits with a space
    $text = str_replace(str_split('|'), '', $text); // remove these chars (can specify more)
    $text = str_replace(str_split('-'), '', $text); // remove these chars (can specify more)
    $text = trim(preg_replace('/\s+/', ' ', $text)); // collapse extra whitespace
    $text = preg_replace('/-{2,}/', '', $text); // remove 2 or more dashes in a row
    $len = strlen($text);
    if (0 === $len) {
        return 0;
    }
    // After normalization the words are separated by single spaces, so count the spaces.
    $words = 1;
    while ($len--) {
        if (' ' === $text[$len]) {
            ++$words;
        }
    }
    return $words;
}

?>

The detailed code is as follows:

<?php
/**
 * Check whether the server runs Windows or Linux and read the file accordingly
 *
 * @param string $file file path including the file name
 * @return string file content
 */
function checkSystemOS($file = '') {
    $content = "";
    $type = pathinfo($file, PATHINFO_EXTENSION);
    global $UNIX_antiword_path, $UNIX_xpdf_path;
    // escapeshellarg() quotes the path before it is passed to the shell.
    $arg = escapeshellarg($file);
    if (strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') { // this is a server using Windows
        switch (strtolower($type)) {
            case 'doc':
                $content = shell_exec('c:\antiword\antiword -f ' . $arg);
                break;
            case 'docx':
                $content = parseWord($file);
                break;
            case 'pdf':
                $content = shell_exec('C:\xpdf\pdftotext ' . $arg . ' -');
                break;
            case 'zip':
                $content = readZipFile($file);
                break;
            case 'txt':
                $content = file_get_contents($file);
                break;
        }
    } else { // this is a server not using Windows
        switch (strtolower($type)) {
            case 'doc':
                $content = shell_exec('/usr/local/bin/antiword -f ' . $arg);
                break;
            case 'docx':
                $content = parseWord($file);
                break;
            case 'pdf':
                $content = shell_exec('/usr/bin/pdftotext ' . $arg . ' -');
                break;
            case 'zip':
                $content = readZipFile($file);
                break;
            case 'txt':
                $content = file_get_contents($file);
                break;
        }
    }
    /*if (file_exists($file)) {
