Reading a Word file in PHP

PHPWord: reading MS Word documents with PHP

From the author: not long ago we published a lesson on this site about creating MS Word documents in PHP with the PHPWord library. In the comments to that video, someone asked how to read existing documents with the same library, which prompted me to record this lesson. In it we will learn to read previously created MS Word documents using PHPWord.


In this lesson we continue exploring PHPWord, specifically its tools for reading existing MS Word documents. Note that we will work with an already installed library: this is the second lesson on the topic, so we will not dwell on the basics. I therefore recommend reading the first part, "PHPWord: creating MS Word documents with PHP", before this one.

The test script consists of a single file, index.php, whose code already includes the installed library.

require 'vendor/autoload.php';

First, create a variable that holds the path to the MS Word document we will work with.

$source = __DIR__ . '/docs/text.docx';

Recall that work with the library normally starts by creating an object of the main PhpWord class, but that applies when a new document is being created. When reading an existing MS Word file, the PhpWord object has to be created for that document, and to do that the document must first be read.

For reading existing documents PHPWord provides a group of classes, each responsible for a particular format. So the first step is to create an object of the appropriate reader class.

$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');

Then use this object to read the MS Word document.

$phpWord = $objReader->load($source);
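Since each format has its own reader, the reader name can be derived from the file extension. The helper below is a minimal sketch: readerNameForFile() is a hypothetical function, not part of PHPWord, and the names in the map are the reader names I believe IOFactory::createReader() accepts, so check them against the library version you have installed.

```php
<?php
// Hypothetical helper (not part of PHPWord): picks a reader name by file extension.
function readerNameForFile(string $path): ?string
{
    // Assumed reader names for IOFactory::createReader(); verify against your PHPWord version.
    $map = [
        'docx' => 'Word2007',
        'doc'  => 'MsDoc',
        'odt'  => 'ODText',
        'rtf'  => 'RTF',
        'html' => 'HTML',
        'htm'  => 'HTML',
    ];
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    // null means "no known reader for this extension".
    return $map[$ext] ?? null;
}
```

The result would then be passed to createReader() in place of the hard-coded 'Word2007', after checking that it is not null.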

At this point the task of the lesson is essentially done: the document has been read and its data is stored in the newly created $phpWord object. But let's look at how to get the data out of that object.

According to the official documentation, PHPWord stores all content of an MS Word document in separate sections. Each section contains a set of elements: text, table, image, link and so on. Elements, in turn, can be complex and contain nested elements of their own, as tables do.

Calling getSections() gives access to the document's sections. The result is an array, so we can walk it with foreach.

foreach ($phpWord->getSections() as $section) {
    $arrays = $section->getElements();
}

Inside the loop, for each section, we get the array of its elements by calling getElements(). Since the return value is an array, we can use the same kind of loop to reach each of its items.

foreach($arrays as $e) {

}

On each iteration, $e holds one element object from the section. It might seem we can read the text right away, but first we need to check what $e contains.

if (get_class($e) === 'PhpOffice\PhpWord\Element\TextRun') {

If the variable holds an object of class PhpOffice\PhpWord\Element\TextRun, we are dealing with a complex text area that contains several simpler elements. So we call getElements() again and walk the result with foreach.

<?php

require 'vendor/autoload.php';

$source = __DIR__ . '/docs/text.docx';

// Create a reader for the .docx format and load the document.
$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
$phpWord = $objReader->load($source);

$body = '';

foreach ($phpWord->getSections() as $section) {
    $arrays = $section->getElements();

    foreach ($arrays as $e) {
        if (get_class($e) === 'PhpOffice\PhpWord\Element\TextRun') {
            // Complex text area: output each nested text element with its formatting.
            foreach ($e->getElements() as $text) {
                $font = $text->getFontStyle();

                $size = $font->getSize() / 10;
                $bold = $font->isBold() ? 'font-weight:700;' : '';
                $color = $font->getColor();
                $fontFamily = $font->getName();

                $body .= '<span style="font-size:' . $size . 'em;font-family:' . $fontFamily . '; ' . $bold . 'color:#' . $color . '">';
                $body .= $text->getText() . '</span>';
            }
        }
    }
}

include 'templ.php';

So, for the current document, $text receives a Text element object, that is, a plain text element whose content is returned by getText(). To get the formatting of the current element, call getFontStyle(), which returns an object holding that information in private properties. Their values are read through dedicated methods:

getSize(): the font size;

isBold(): returns true if the font is bold;

getColor(): the text color;

getName(): the font name.
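The values returned by these methods can be missing (for example, when a run has no explicit color), so it may be worth building the style string defensively. The function below is a sketch under my own assumptions: fontToCss() is a hypothetical helper, and it treats the size as points, which differs from the lesson's approach of dividing by 10 and using em. As far as I know, getFontStyle() is also not guaranteed to return a font object for every element, so the caller should check that before reading these properties.

```php
<?php
// Hypothetical helper: turns font properties into an inline CSS declaration list.
// Null or empty values are skipped, so unset properties inherit from the page.
function fontToCss($size, bool $bold, ?string $color, ?string $name): string
{
    $css = [];
    if ($size !== null) {
        // Assumption: the size is expressed in points.
        $css[] = 'font-size:' . (float) $size . 'pt';
    }
    if ($name !== null && $name !== '') {
        // Keep only letters, digits, spaces and hyphens so the value is safe in an attribute.
        $css[] = 'font-family:' . preg_replace('/[^A-Za-z0-9 \-]/', '', $name);
    }
    if ($bold) {
        $css[] = 'font-weight:700';
    }
    // Accept only 6-digit hex colors such as 1B2232.
    if ($color !== null && preg_match('/^[0-9A-Fa-f]{6}$/', $color)) {
        $css[] = 'color:#' . $color;
    }

    return implode(';', $css);
}
```

The returned string would go into the style attribute of the span, and the text itself should be passed through htmlspecialchars() before being appended to $body.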

The whole content of the document is accumulated in $body, whose value is then displayed through a template. Empty lines in the document are TextBreak elements, which can be handled as follows:

else if (get_class($e) === 'PhpOffice\PhpWord\Element\TextBreak') {
    $body .= '<br />';
}

Handling tables takes considerably more code, because a table is a complex Table element made up of rows, which in turn are made up of cells. Moreover, each cell can contain nested elements; for example, a cell can hold another table. The full code, including table handling, is given below.

<?php

require 'vendor/autoload.php';

$source = __DIR__ . '/docs/text.docx';

$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
$phpWord = $objReader->load($source);

$body = '';

foreach ($phpWord->getSections() as $section) {
    $arrays = $section->getElements();

    foreach ($arrays as $e) {
        if (get_class($e) === 'PhpOffice\PhpWord\Element\TextRun') {
            // Complex text area: output each nested text element with its formatting.
            foreach ($e->getElements() as $text) {
                $font = $text->getFontStyle();

                $size = $font->getSize() / 10;
                $bold = $font->isBold() ? 'font-weight:700;' : '';
                $color = $font->getColor();
                $fontFamily = $font->getName();

                $body .= '<span style="font-size:' . $size . 'em;font-family:' . $fontFamily . '; ' . $bold . 'color:#' . $color . '">';
                $body .= $text->getText() . '</span>';
            }
        } elseif (get_class($e) === 'PhpOffice\PhpWord\Element\TextBreak') {
            // Empty line in the document.
            $body .= '<br />';
        } elseif (get_class($e) === 'PhpOffice\PhpWord\Element\Table') {
            $body .= '<table border="2px">';

            // Table -> rows -> cells -> cell elements.
            $rows = $e->getRows();
            foreach ($rows as $row) {
                $body .= '<tr>';

                $cells = $row->getCells();
                foreach ($cells as $cell) {
                    $body .= '<td style="width:' . $cell->getWidth() . '">';

                    $celements = $cell->getElements();
                    foreach ($celements as $celem) {
                        if (get_class($celem) === 'PhpOffice\PhpWord\Element\Text') {
                            $body .= $celem->getText();
                        } elseif (get_class($celem) === 'PhpOffice\PhpWord\Element\TextRun') {
                            foreach ($celem->getElements() as $text) {
                                $body .= $text->getText();
                            }
                        }
                    }

                    $body .= '</td>';
                }

                $body .= '</tr>';
            }

            $body .= '</table>';
        } elseif (method_exists($e, 'getText')) {
            // Any other element that exposes plain text.
            $body .= $e->getText();
        }
    }

    // Only the first section is processed.
    break;
}

include 'templ.php';

To get the rows, call getRows(); it returns an array of objects, one per row (Row elements). Walk this array with foreach and, for each row, get its cells with getCells(). That again returns an array, which we walk with a loop. Then, for each cell, call getElements() to get its elements, and proceed in the same way as described above.
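Because cells can themselves contain tables and text runs, the same walk can be expressed as one recursive function instead of nested loops. The sketch below is not the lesson's code: collectText() is a hypothetical helper that relies only on the method names used above (getRows(), getCells(), getElements(), getText()) and checks for them with method_exists(), so it returns plain text without formatting.

```php
<?php
// Hypothetical helper: collects plain text from an element tree of any depth.
function collectText($element): string
{
    $out = '';
    if (method_exists($element, 'getRows')) {
        // Table: one output line per row, cells separated by tabs.
        foreach ($element->getRows() as $row) {
            $cells = [];
            foreach ($row->getCells() as $cell) {
                $cells[] = trim(collectText($cell));
            }
            $out .= implode("\t", $cells) . "\n";
        }
    } elseif (method_exists($element, 'getElements')) {
        // Container (section, text run, cell): recurse into the children.
        foreach ($element->getElements() as $child) {
            $out .= collectText($child);
        }
    } elseif (method_exists($element, 'getText')) {
        // Leaf text element; getText() may itself return a nested element.
        $text = $element->getText();
        $out .= is_object($text) ? collectText($text) : (string) $text;
    }

    return $out;
}

// Possible usage with the objects from the lesson:
// foreach ($phpWord->getSections() as $section) {
//     echo collectText($section);
// }
```

Elements that expose none of these methods, such as text breaks, contribute nothing, so paragraph separators would still have to be added by the caller.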

All that remains is to output the value of $body in whatever way suits you.

That concludes this lesson. As you can see, PHPWord provides fairly powerful tools for working with MS Word documents, although getting data out of its objects takes some effort.

All the best, and happy coding!

PHPWord

Join the chat at https://gitter.im/PHPOffice/PHPWord

PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), Rich Text Format (RTF), HTML, and PDF.

PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord aims to be a high-quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading the Developers’ Documentation.

If you have any questions, please ask on Stack Overflow.

Read more about PHPWord:

  • Features
  • Requirements
  • Installation
  • Getting started
  • Contributing
  • Developers’ Documentation

Features

With PHPWord, you can create OOXML, ODF, or RTF documents dynamically using your PHP scripts. Below are some of the things that you can do with PHPWord library:

  • Set document properties, e.g. title, subject, and creator.
  • Create document sections with different settings, e.g. portrait/landscape, page size, and page numbering
  • Create headers and footers for each section
  • Set default font type, font size, and paragraph style
  • Use UTF-8 and East Asia fonts/characters
  • Define custom font styles (e.g. bold, italic, color) and paragraph styles (e.g. centered, multicolumns, spacing) either as named style or inline in text
  • Insert paragraphs, either as a simple text or complex one (a text run) that contains other elements
  • Insert titles (headers) and table of contents
  • Insert text breaks and page breaks
  • Insert and format images, either local, remote, or as page watermarks
  • Insert binary OLE Objects such as Excel or Visio
  • Insert and format tables with customized properties for each row (e.g. repeat as header row) and cell (e.g. background color, rowspan, colspan)
  • Insert list items as bulleted, numbered, or multilevel
  • Insert hyperlinks
  • Insert footnotes and endnotes
  • Insert drawing shapes (arc, curve, line, polyline, rect, oval)
  • Insert charts (pie, doughnut, bar, line, area, scatter, radar)
  • Insert form fields (textinput, checkbox, and dropdown)
  • Create document from templates
  • Use XSL 1.0 style sheets to transform headers, main document part, and footers of an OOXML template
  • … and many more features in progress

Requirements

PHPWord requires the following:

  • PHP 7.1+
  • XML Parser extension
  • Laminas Escaper component
  • Zip extension (optional, used to write OOXML and ODF)
  • GD extension (optional, used to add images)
  • XMLWriter extension (optional, used to write OOXML and ODF)
  • XSL extension (optional, used to apply XSL style sheets to templates)
  • dompdf library (optional, used to write PDF)

Installation

PHPWord is installed via Composer.
To add a dependency to PHPWord in your project, either

Run the following to use the latest stable version

composer require phpoffice/phpword

or if you want the latest unreleased version

composer require phpoffice/phpword:dev-master

Getting started

The following is a basic usage example of the PHPWord library.

<?php
require_once 'bootstrap.php';

// Creating the new document...
$phpWord = new \PhpOffice\PhpWord\PhpWord();

/* Note: any element you append to a document must reside inside of a Section. */

// Adding an empty Section to the document...
$section = $phpWord->addSection();
// Adding Text element to the Section having font styled by default...
$section->addText(
    '"Learn from yesterday, live for today, hope for tomorrow. '
        . 'The important thing is not to stop questioning." '
        . '(Albert Einstein)'
);

/*
 * Note: it's possible to customize font style of the Text element you add in three ways:
 * - inline;
 * - using named font style (new font style object will be implicitly created);
 * - using explicitly created font style object.
 */

// Adding Text element with font customized inline...
$section->addText(
    '"Great achievement is usually born of great sacrifice, '
        . 'and is never the result of selfishness." '
        . '(Napoleon Hill)',
    array('name' => 'Tahoma', 'size' => 10)
);

// Adding Text element with font customized using named font style...
$fontStyleName = 'oneUserDefinedStyle';
$phpWord->addFontStyle(
    $fontStyleName,
    array('name' => 'Tahoma', 'size' => 10, 'color' => '1B2232', 'bold' => true)
);
$section->addText(
    '"The greatest accomplishment is not in never falling, '
        . 'but in rising again after you fall." '
        . '(Vince Lombardi)',
    $fontStyleName
);

// Adding Text element with font customized using explicitly created font style object...
$fontStyle = new \PhpOffice\PhpWord\Style\Font();
$fontStyle->setBold(true);
$fontStyle->setName('Tahoma');
$fontStyle->setSize(13);
$myTextElement = $section->addText('"Believe you can and you\'re halfway there." (Theodor Roosevelt)');
$myTextElement->setFontStyle($fontStyle);

// Saving the document as OOXML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$objWriter->save('helloWorld.docx');

// Saving the document as ODF file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'ODText');
$objWriter->save('helloWorld.odt');

// Saving the document as HTML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
$objWriter->save('helloWorld.html');

/* Note: we skip RTF, because it's not XML-based and requires a different example. */
/* Note: we skip PDF, because "HTML-to-PDF" approach is used to create PDF documents. */

More examples are provided in the samples folder. For an easy access to those samples launch php -S localhost:8000 in the samples directory then browse to http://localhost:8000 to view the samples.
You can also read the Developers’ Documentation for more detail.

Contributing

We welcome everyone to contribute to PHPWord. Below are some of the things that you can do to contribute.

  • Read our contributing guide.
  • Fork us and request a pull to the master branch.
  • Submit bug reports or feature requests to GitHub.
  • Follow @PHPWord and @PHPOffice on Twitter.

Is it possible to read and write Word (2003 and 2007) files in PHP without using a COM object?
I know that I can:

$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose($file);

but Word will read it as an HTML file not a native .doc file.

asked Oct 9, 2008 at 18:09 by UnkwnTech

Reading binary Word documents would involve creating a parser according to the published file format specifications for the DOC format. I don't think this is a really feasible solution.

You could use the Microsoft Office XML formats for reading and writing Word files — this is compatible with the 2003 and 2007 version of Word. For reading you have to ensure that the Word documents are saved in the correct format (it’s called Word 2003 XML-Document in Word 2007). For writing you just have to follow the openly available XML schema. I’ve never used this format for writing out Office documents from PHP, but I’m using it for reading in an Excel worksheet (naturally saved as XML-Spreadsheet 2003) and displaying its data on a web page. As the files are plainly XML data it’s no problem to navigate within and figure out how to extract the data you need.

The other option — a Word 2007 only option (if the OpenXML file formats are not installed in your Word 2003) — would be to resort to OpenXML. As databyss pointed out here the DOCX file format is just a ZIP archive with XML files included. There are a lot of resources on MSDN regarding the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated I think — it just depends on how much time you’ll invest.

Perhaps you can have a look at PHPExcel which is a library able to write to Excel 2007 files and read from Excel 2007 files using the OpenXML standard. You could get an idea of the work involved when trying to read and write OpenXML Word documents.

answered Nov 5, 2008 at 13:04 by Stefan Gehrig

This works with versions earlier than Office 2007 and it's pure PHP, no COM needed; still trying to figure out 2007.

<?php



/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc) 
{
    $fileHandle = fopen($userDoc, "r");
    $line = @fread($fileHandle, filesize($userDoc));   
    $lines = explode(chr(0x0D),$line);
    $outtext = "";
    foreach($lines as $thisline)
      {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE)||(strlen($thisline)==0))
          {
          } else {
            $outtext .= $thisline." ";
          }
      }
     $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
    return $outtext;
} 

$userDoc = "cv.doc";

$text = parseWord($userDoc);
echo $text;


?>

answered Nov 5, 2008 at 12:35

You can use Antiword; it is a free MS Word reader for Linux and most other popular operating systems.

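$document_file = 'c:\file.doc';
// escapeshellarg() keeps spaces and special characters in the path from breaking the command.
$text_from_doc = shell_exec('/usr/local/bin/antiword ' . escapeshellarg($document_file));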

answered May 23, 2009 at 0:57 by Mantichora

I don’t know about reading native Word documents in PHP, but if you want to write a Word document in PHP, WordprocessingML (aka WordML) might be a good solution. All you have to do is create an XML document in the correct format. I believe Word 2003 and 2007 both support WordML.
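As a rough illustration of how little is needed for a basic WordML file, here is a minimal sketch that builds a Word 2003 XML document with one paragraph per string. buildWordMl() is a hypothetical helper name; the namespace and the mso-application processing instruction are the ones used by the Word 2003 XML format as far as I know, and anything beyond plain paragraphs (styles, tables, images) is out of scope here.

```php
<?php
// Hypothetical helper: builds a Word 2003 XML (WordprocessingML) document
// containing one paragraph per input string.
function buildWordMl(array $paragraphs): string
{
    $ns = 'http://schemas.microsoft.com/office/word/2003/wordml';
    $doc = new DOMDocument('1.0', 'UTF-8');

    // Processing instruction that associates the XML file with Word.
    $doc->appendChild($doc->createProcessingInstruction('mso-application', 'progid="Word.Document"'));

    $root = $doc->appendChild($doc->createElementNS($ns, 'w:wordDocument'));
    $body = $root->appendChild($doc->createElementNS($ns, 'w:body'));

    foreach ($paragraphs as $text) {
        // Paragraph -> run -> text, the smallest structure that carries text.
        $p = $body->appendChild($doc->createElementNS($ns, 'w:p'));
        $r = $p->appendChild($doc->createElementNS($ns, 'w:r'));
        $t = $r->appendChild($doc->createElementNS($ns, 'w:t'));
        // A text node lets DOM escape &, < and > automatically.
        $t->appendChild($doc->createTextNode((string) $text));
    }

    return $doc->saveXML();
}
```

The returned string can be written to disk with file_put_contents(); whether a given Word version opens it directly is something to test on the target machines.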

answered Oct 10, 2008 at 0:23 by Joe Lencioni

Just updating the code

<?php

/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc) 
{
    $fileHandle = fopen($userDoc, "r");
    $word_text = @fread($fileHandle, filesize($userDoc));
    $line = "";
    $tam = filesize($userDoc);
    $nulos = 0;
    $caracteres = 0;
    for($i=1536; $i<$tam; $i++)
    {
        $line .= $word_text[$i];

        if( $word_text[$i] === chr(0) )
        {
            $nulos++;
        }
        else
        {
            $nulos=0;
            $caracteres++;
        }

        if( $nulos>1996)
        {   
            break;  
        }
    }

    //echo $caracteres;

    $lines = explode(chr(0x0D),$line);
    //$outtext = "<pre>";

    $outtext = "";
    foreach($lines as $thisline)
    {
        $tam = strlen($thisline);
        if( !$tam )
        {
            continue;
        }

        $new_line = ""; 
        for($i=0; $i<$tam; $i++)
        {
            $onechar = $thisline[$i];
            if( $onechar > chr(240) )
            {
                continue;
            }

            if( $onechar >= chr(0x20) )
            {
                $caracteres++;
                $new_line .= $onechar;
            }

            if( $onechar == chr(0x14) )
            {
                $new_line .= "</a>";
            }

            if( $onechar == chr(0x07) )
            {
                $new_line .= "\t";
                if( isset($thisline[$i+1]) )
                {
                    if( $thisline[$i+1] == chr(0x07) )
                    {
                        $new_line .= "\n";
                    }
                }
            }
        }
        //troca por hiperlink
        $new_line = str_replace("HYPERLINK" ,"<a href=",$new_line); 
        $new_line = str_replace("\\o" ,">",$new_line); 
        $new_line .= "\n";

        //link de imagens
        $new_line = str_replace("INCLUDEPICTURE" ,"<br><img src=",$new_line); 
        $new_line = str_replace("\\*" ,"><br>",$new_line); 
        $new_line = str_replace("MERGEFORMATINET" ,"",$new_line); 


        $outtext .= nl2br($new_line);
    }

 return $outtext;
} 

$userDoc = "custo.doc";
$userDoc = "Cultura.doc";
$text = parseWord($userDoc);

echo $text;


?>

answered Apr 4, 2011 at 2:43 by WIlson

Most probably you won’t be able to read Word documents without COM.

Writing was covered in this topic

answered Oct 10, 2008 at 2:17 by Sergey Kornilov

2007 might be a bit complicated as well.

The .docx format is a zip file that contains a few folders with other files in them for formatting and other stuff.

Rename a .docx file to .zip and you’ll see what I mean.

So if you can work within zip files in PHP, you should be on the right path.
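To make that concrete, here is a minimal sketch that opens a .docx as a ZIP archive and pulls the text out of word/document.xml. docxToText() is a hypothetical helper name; it uses the standard ZipArchive and DOMDocument classes and the WordprocessingML main namespace, and it deliberately ignores formatting, headers, footers and footnotes.

```php
<?php
// Hypothetical helper: extracts paragraph text from a .docx file.
// Returns null when the file is not a readable .docx.
function docxToText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;                       // not a readable ZIP archive
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null;                       // main document part is missing
    }

    $doc = new DOMDocument();
    if (!@$doc->loadXML($xml)) {
        return null;                       // malformed XML
    }

    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $lines = [];
    // Each w:p is a paragraph; its text lives in w:t elements inside runs.
    foreach ($doc->getElementsByTagNameNS($ns, 'p') as $p) {
        $line = '';
        foreach ($p->getElementsByTagNameNS($ns, 't') as $t) {
            $line .= $t->textContent;
        }
        $lines[] = $line;
    }

    return implode("\n", $lines);
}
```

Paragraphs inside table cells come out as ordinary lines, so table structure is lost; keeping it would mean walking w:tbl, w:tr and w:tc explicitly.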


www.phplivedocx.org is a SOAP-based service, which means you always need to be online to test your files, and it does not have enough usage examples. Strangely, I only found out after two days of downloading (it also requires Zend Framework) that it is a SOAP-based program. I think that without COM it's just not possible on a Linux server, and the only option is to convert the .doc file into another format that PHP can parse.

answered Sep 13, 2009 at 17:45

Source gotten from

Use the following class directly to read a Word document:

class DocxConversion{
    private $filename;

    public function __construct($filePath) {
        $this->filename = $filePath;
    }

    private function read_doc() {
        $fileHandle = fopen($this->filename, "r");
        $line = @fread($fileHandle, filesize($this->filename));   
        $lines = explode(chr(0x0D),$line);
        $outtext = "";
        foreach($lines as $thisline)
          {
            $pos = strpos($thisline, chr(0x00));
            if (($pos !== FALSE)||(strlen($thisline)==0))
              {
              } else {
                $outtext .= $thisline." ";
              }
          }
         $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
        return $outtext;
    }

    private function read_docx(){

        $striped_content = '';
        $content = '';

        $zip = zip_open($this->filename);

        if (!$zip || is_numeric($zip)) return false;

        while ($zip_entry = zip_read($zip)) {

            if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

            if (zip_entry_name($zip_entry) != "word/document.xml") continue;

            $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

            zip_entry_close($zip_entry);
        }// end while

        zip_close($zip);

        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        $striped_content = strip_tags($content);

        return $striped_content;
    }

 /************************excel sheet************************************/

function xlsx_to_text($input_file){
    $xml_filename = "xl/sharedStrings.xml"; //content file name
    $zip_handle = new ZipArchive;
    $output_text = "";
    if(true === $zip_handle->open($input_file)){
        if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index);
            // loadXML() is an instance method; calling it statically fails on PHP 8.
            $xml_handle = new DOMDocument();
            $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $output_text = strip_tags($xml_handle->saveXML());
        }else{
            $output_text .="";
        }
        $zip_handle->close();
    }else{
    $output_text .="";
    }
    return $output_text;
}

/*************************power point files*****************************/
function pptx_to_text($input_file){
    $zip_handle = new ZipArchive;
    $output_text = "";
    if(true === $zip_handle->open($input_file)){
        $slide_number = 1; //loop through slide files
        while(($xml_index = $zip_handle->locateName("ppt/slides/slide".$slide_number.".xml")) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index);
            // loadXML() is an instance method; calling it statically fails on PHP 8.
            $xml_handle = new DOMDocument();
            $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $output_text .= strip_tags($xml_handle->saveXML());
            $slide_number++;
        }
        if($slide_number == 1){
            $output_text .="";
        }
        $zip_handle->close();
    }else{
    $output_text .="";
    }
    return $output_text;
}


    public function convertToText() {

        if(isset($this->filename) && !file_exists($this->filename)) {
            return "File Not exists";
        }

        $fileArray = pathinfo($this->filename);
        $file_ext  = $fileArray['extension'];
        if($file_ext == "doc" || $file_ext == "docx" || $file_ext == "xlsx" || $file_ext == "pptx")
        {
            if($file_ext == "doc") {
                return $this->read_doc();
            } elseif($file_ext == "docx") {
                return $this->read_docx();
            } elseif($file_ext == "xlsx") {
                return $this->xlsx_to_text($this->filename);
            }elseif($file_ext == "pptx") {
                return $this->pptx_to_text($this->filename);
            }
        } else {
            return "Invalid File Type";
        }
    }

}

$docObj = new DocxConversion("test.docx"); //replace your document name with correct extension doc or docx 
echo $docText= $docObj->convertToText();

answered Jul 3, 2019 at 10:25 by Mohamed Faalil

Office 2007 .docx should be possible since it’s an XML standard. Word 2003 most likely requires COM to read, even with the standards now published by MS, since those standards are huge. I haven’t seen many libraries written to match them yet.

answered Oct 10, 2008 at 2:45 by acrosman

I don’t know what you are going to use it for, but I needed .doc support for search indexing. What I did was use a little command-line tool called "catdoc". This transfers the contents of the Word document to plain text so it can be indexed. If you need to keep formatting and stuff this is not your tool.

answered Oct 10, 2008 at 15:25 by fijter

phpLiveDocx is a Zend Framework component and can read and write DOC and DOCX files in PHP on Linux, Windows and Mac.


answered May 14, 2009 at 7:03


One way to manipulate Word files with PHP that you may find interesting is PHPDocX.
You can see how it works by looking at its online tutorial.
You can insert or extract content, or even merge multiple Word files into a single one.

answered Sep 28, 2012 at 16:44 by Eduardo

Would the .rtf format work for your purposes? .rtf can easily be converted to and from .doc format, but it is written in plaintext (with control commands embedded). This is how I plan to integrate my application with Word documents.
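To show what "plaintext with control commands" looks like in practice, here is a minimal sketch that wraps a plain string in an RTF document. textToRtf() is a hypothetical helper; it assumes ASCII input only (other characters would need RTF's Unicode escapes, which are not handled) and hard-codes Arial at 12 pt purely as an example.

```php
<?php
// Hypothetical helper: wraps plain ASCII text in a minimal RTF document.
function textToRtf(string $text): string
{
    // Backslash and braces are RTF control characters and must be escaped.
    $escaped = str_replace(['\\', '{', '}'], ['\\\\', '\\{', '\\}'], $text);

    // Each line break becomes a \par (end of paragraph) control word.
    $escaped = str_replace(["\r\n", "\n"], '\\par ', $escaped);

    // \fs24 is the font size in half-points, i.e. 12 pt.
    return '{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0\\fs24 ' . $escaped . '}';
}
```

Saving the result with a .rtf extension should give a file that Word opens as a document rather than as plain text.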

answered Jan 24, 2009 at 5:09 by Josh Smeaton

I'm also working on the same kind of project (an online word processor), but I've chosen C#.NET and ASP.NET. From the research I did, I learned that

by using the Open XML SDK and VSTO (Visual Studio Tools for Office)

we can easily work with Word files, manipulate them and even convert them internally into several formats such as .odt, .pdf, .docx, etc.

So go to msdn.microsoft.com and read the Office development tab thoroughly. It's the easiest way to do this, as all the functions we need are already available in .NET!

But since you want to do your project in PHP, you can do it in Visual Studio and .NET, as PHP is also one of the .NET-compliant languages!

answered Sep 5, 2010 at 14:17 by Noddy Cha

I have the same case.
I guess I am going to use cheap 50 MB Windows-based hosting with a free domain to convert my files for the PHP server. Linking them is easy:
all you need is an ASP.NET page that receives the .doc file via POST and returns the result over HTTP,
so a simple cURL call would do it.

answered Oct 11, 2010 at 19:12 by Omer

// For DOCX: if you want to preserve whitespace and also handle table rows (tr) and cells (tc), use the code below. Modify it to your taste, because it downloads the file from a remote or local source.

//=========DOCX===========
function extractDocxText($url,$file_name){
        $docx = get_url($url);
        file_put_contents("tempf.docx",$docx);
        $xml_filename = "word/document.xml"; //content file name
        $zip_handle = new ZipArchive;
        $output_text = "";
        if(true === $zip_handle->open("tempf.docx")){
            if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                //file_put_contents($input_file.".xml",$xml_datas);
                $replace_newlines = preg_replace('/<w:p w[0-9-Za-z]+:[a-zA-Z0-9]+="[a-zA-z"0-9 :="]+">/',"\n\r",$xml_datas);
                $replace_tableRows = preg_replace('/<w:tr>/',"\n\r",$replace_newlines);
                $replace_tab = preg_replace('/<w:tab\/>/',"\t",$replace_tableRows);
                $replace_paragraphs = preg_replace('/<\/w:p>/',"\n\r",$replace_tab);
                $replace_other_Tags = strip_tags($replace_paragraphs);          
                $output_text = $replace_other_Tags;
            }else{
                $output_text .="";
            }
            $zip_handle->close();
        }else{
        $output_text .=" ";
        }
        chmod("tempf.docx", 0777);  unlink(realpath("tempf.docx"));
        //save to file or echo content
        file_put_contents($file_name,$output_text);
        echo $output_text;
    }

//========PDF===========
//Requires installation in your Linux server
//sudo su
//apt-get install xpdf
function extractPdfText($url,$PDF_fullpath_or_Filename){
    $pdf = get_url($url);
    file_put_contents ("temppdf.txt", $pdf);
    $content = pdf2text("temppdf.txt");
    chmod("temppdf.txt", 0777); unlink(realpath("temppdf.txt"));
    echo $content;
    file_put_contents($PDF_fullpath_or_Filename,$content);
    }



//========DOC==========
function extractDocText($url,$file_name){
    $doc = get_url($url);
    file_put_contents ("tempf.txt", $doc);

    $fileHandle = fopen("tempf.txt", "r");
    $line = @fread($fileHandle, filesize("tempf.txt"));
    $lines = explode(chr(0x0D),$line);
    $outtext = "";
    foreach($lines as $thisline){
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE)||(strlen($thisline)==0))
        {} else {$outtext .= $thisline."\r\n";}
        }
    // Replace every character that is not in the allowed set
    $content = preg_replace('/[^a-zA-Z0-9\s,.\-\n\r\t@\/_()]/','  ',$outtext);

    //chmod("tempf.txt", 0777); unlink(realpath("tempf.txt"));
    echo $content;
    file_put_contents($file_name,$content);
    }


//========XLSX==========
function extractXlsxText($url,$file_name){
    $xlsx = get_url($url);
    file_put_contents ("tempf.txt", $xlsx);
    $content = "";
    $dir = 'tempforxlsx';
    // Unzip
    $zip = new ZipArchive();
    $zip->open("tempf.txt");
    $zip->extractTo($dir);
    // Open up shared strings & the first worksheet
    $strings = simplexml_load_file($dir . '/xl/sharedStrings.xml');
    $sheet   = simplexml_load_file($dir . '/xl/worksheets/sheet1.xml');
    // Parse the rows
    $xlrows = $sheet->sheetData->row;
    foreach ($xlrows as $xlrow) {
        $arr = array();

        // In each row, grab it's value
        foreach ($xlrow->c as $cell) {
            $v = (string) $cell->v;

            // If it has a "t" (type?) of "s" (string?), use the value to look up string value
            if (isset($cell['t']) && $cell['t'] == 's') {
                $s  = array();
                $si = $strings->si[(int) $v];

                // Register & alias the default namespace or you'll get empty results in the xpath query
                $si->registerXPathNamespace('n', 'http://schemas.openxmlformats.org/spreadsheetml/2006/main');
                // Cat together all of the 't' (text?) node values
                foreach($si->xpath('.//n:t') as $t) {
                    $content .= $t."  ";}   }
            }
        }
    echo $content;
    file_put_contents($file_name,$content);
    }
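
The shared-strings lookup is the part of the XLSX format that is easiest to get wrong, so here is a small self-contained sketch of it that works on XML strings rather than on an extracted archive. The function name xlsxRowsToArray() and the sample XML are made up for illustration; real workbooks also contain inline strings, dates and formulas, which this sketch ignores.

```php
<?php
// Hypothetical helper: returns the cell values of one worksheet, row by row,
// resolving cells of type "s" through the shared-strings table.
function xlsxRowsToArray(string $sheetXml, string $sharedStringsXml): array
{
    $ns = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main';
    $strings = simplexml_load_string($sharedStringsXml);
    $sheet = simplexml_load_string($sheetXml);
    if ($strings === false || $sheet === false) {
        return [];
    }

    // Build the lookup table: one entry per <si>, joining all of its <t> nodes.
    $shared = [];
    foreach ($strings->si as $si) {
        $si->registerXPathNamespace('n', $ns);
        $shared[] = implode('', array_map('strval', $si->xpath('.//n:t')));
    }

    $rows = [];
    foreach ($sheet->sheetData->row as $row) {
        $cells = [];
        foreach ($row->c as $cell) {
            $value = (string) $cell->v;
            if ((string) $cell['t'] === 's') {
                // The stored value is an index into the shared-strings table.
                $value = $shared[(int) $value] ?? '';
            }
            $cells[] = $value;
        }
        $rows[] = $cells;
    }

    return $rows;
}

$ns = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main';
$sharedXml = '<sst xmlns="' . $ns . '"><si><t>Name</t></si>'
    . '<si><r><t>Al</t></r><r><t>ice</t></r></si></sst>';
$sheetXml = '<worksheet xmlns="' . $ns . '"><sheetData>'
    . '<row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1"><v>42</v></c></row>'
    . '<row r="2"><c r="A2" t="s"><v>1</v></c></row>'
    . '</sheetData></worksheet>';

print_r(xlsxRowsToArray($sheetXml, $sharedXml));
```

Unlike extractXlsxText(), this variant keeps numeric cells and returns rows as arrays instead of one concatenated string.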


//========PPT========== 
function extractPptText($url,$file_name){
    $ppt = file_get_contents($url);
    file_put_contents ("tempf.ppt", $ppt);
    $fileHandle = fopen("tempf.ppt", "r");
    $line = @fread($fileHandle, filesize("tempf.ppt"));
    $lines = explode(chr(0x0f),$line);
    $outtext = '';

    foreach($lines as $thisline) {
        if (strpos($thisline, chr(0x00).chr(0x00).chr(0x00)) == 1) {
            $text_line = substr($thisline, 4);
            $end_pos   = strpos($text_line, chr(0x00));
            $text_line = substr($text_line, 0, $end_pos);
            $text_line = preg_replace('/[^a-zA-Z0-9\s,.\-\n\r\t@\/_()]/',"  ",$text_line);
            $outtext = substr($text_line, 0, $end_pos)."\n".$outtext;
        }
    }
    //echo $outtext;
    file_put_contents($file_name,$outtext);
    }

//========PPTX==========
function extractPptxText($url,$file_name){
    $xls = get_url($url);
    file_put_contents ("tempf.txt", $xls);
    $zip_handle = new ZipArchive;
    $output_text = ' ';
    if(true === $zip_handle->open("tempf.txt")){
        $slide_number = 1; //loop through slide files
        while(($xml_index = $zip_handle->locateName("ppt/slides/slide".$slide_number.".xml")) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index); // these four lines of codes
                                                                // below were
            $xml_handle = new DOMDocument ();                   // added by me in order
            $xml_handle->preserveWhiteSpace = true;             // to preserve space between
            $xml_handle->formatOutput = true;                   // each read data
            $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $output_text .= $xml_handle->saveXML();
            $slide_number++;
            }
        if($slide_number == 1){
            $output_text .= "";
        }
        $zip_handle->close();
    }else{
    $output_text .= "";
    }
    echo $output_text;
    file_put_contents($file_name,$output_text);
    }
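
Note that extractPptxText() returns the slide XML itself rather than the visible text. If only the text is wanted, one possible approach is to collect the <a:t> runs of each <a:p> paragraph. The sketch below assumes the slide XML has already been read from the archive; the function name pptxSlideXmlToText() is hypothetical, and the regular expressions are a simplification that does not cover every DrawingML construct.

```php
<?php
// Hypothetical helper: extracts the visible text of one slide, one line per paragraph.
function pptxSlideXmlToText(string $slideXml): string
{
    $lines = [];
    // Match <a:p> ... </a:p> but not <a:pPr> or other tags that start with "a:p".
    preg_match_all('/<a:p(?:\s[^>]*)?>(.*?)<\/a:p>/s', $slideXml, $paragraphs);
    foreach ($paragraphs[1] as $paragraph) {
        // Text runs live in <a:t> elements inside the paragraph.
        preg_match_all('/<a:t(?:\s[^>]*)?>(.*?)<\/a:t>/s', $paragraph, $runs);
        $line = html_entity_decode(implode('', $runs[1]), ENT_QUOTES | ENT_XML1, 'UTF-8');
        if ($line !== '') {
            $lines[] = $line;
        }
    }

    return implode("\n", $lines);
}

$slide = '<p:sld><p:txBody>'
    . '<a:p><a:pPr/><a:r><a:t>Quarterly </a:t></a:r><a:r><a:t>report</a:t></a:r></a:p>'
    . '<a:p><a:r><a:t>Q&amp;A</a:t></a:r></a:p>'
    . '</p:txBody></p:sld>';

echo pptxSlideXmlToText($slide);
```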

    /*

==========================================================================
=========================================================================
And below is the get_url() function: better than file_get_contents();
*/

function get_url( $url,$timeout = 5 )
    {
        $url = str_replace( "&amp;", "&", urldecode(trim($url)) );
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
        curl_setopt( $ch, CURLOPT_URL, $url );
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
        curl_setopt( $ch, CURLOPT_ENCODING, "" );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
        curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );    # disables TLS certificate verification; insecure, set to true for untrusted hosts
        curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
        curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
        curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
        $content = curl_exec( $ch );
        //$response = curl_getinfo( $ch ); 
        curl_close ( $ch );
        return $content;
    }

How to read and view docx files using PHP. Processing Word documents is becoming more common these days. You can even create a new Word document and work with it: my previous article describes how to create a Word document using PHP.

Today we are going to discuss reading docx files, converting them to text, and viewing the result online. Let’s begin with the steps and code.

<?php
function kv_read_word($input_file){	
	 $kv_strip_texts = ''; 
         $kv_texts = ''; 	
	if(!$input_file || !file_exists($input_file)) return false;
		
	$zip = zip_open($input_file);
		
	if (!$zip || is_numeric($zip)) return false;
	
	
	while ($zip_entry = zip_read($zip)) {
			
		if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
			
		if (zip_entry_name($zip_entry) != "word/document.xml") continue;

		$kv_texts .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
			
		zip_entry_close($zip_entry);
	}
	
	zip_close($zip);
		

	$kv_texts = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $kv_texts);
	$kv_texts = str_replace('</w:r></w:p>', "\r\n", $kv_texts);
	$kv_strip_texts = nl2br(strip_tags($kv_texts, ''));

	return $kv_strip_texts;
}
?>

The function above parses the text in a Word document and returns it.

Now we pass the path of the input file to the function and print the result.

<?php
$kv_texts = kv_read_word('path/to/the/file/kvcodes.docx');
if($kv_texts !== false) {		
	echo $kv_texts;	// line breaks were already converted by nl2br() inside kv_read_word()
}
else {
	echo "Can't read that file.";
}
?>

That’s all it takes to read a docx file and print it as text.

I have another article for WordPress users, who can try this to process docx files using PHP and WordPress:
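
The zip_open() family of functions used above is deprecated as of PHP 8.0, so on current PHP versions the same idea is usually written with the ZipArchive class. The following is a rough sketch, not a drop-in replacement: the function name readDocxText() is made up, it returns null instead of false on failure, and it only turns paragraph ends into line breaks.

```php
<?php
// Hypothetical ZipArchive-based variant of the reader shown above.
function readDocxText(string $path): ?string
{
    if (!is_file($path)) {
        return null;
    }

    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;
    }
    // The main document body of a docx archive is stored under this name.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null;
    }

    // One line per paragraph, then drop all remaining tags and decode entities.
    $xml = str_replace('</w:p>', "\n", $xml);

    return trim(html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8'));
}

$text = readDocxText('path/to/the/file/kvcodes.docx');
// Escape the text before printing it into an HTML page.
echo $text === null ? "Can't read that file." : nl2br(htmlspecialchars($text));
```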

How to Read and get Texts from Docx Files in WordPress

Contents

  • Introduction
    • Features
    • File formats
  • Installing/configuring
    • Requirements
    • Installation
    • Using samples
  • General usage
    • Basic example
    • Settings
    • Default font
    • Document properties
    • Measurement units
  • Containers
    • Sections
    • Headers
    • Footers
    • Other containers
  • Elements
    • Texts
    • Breaks
    • Lists
    • Tables
    • Images
    • Objects
    • Table of contents
    • Footnotes & endnotes
    • Checkboxes
    • Textboxes
    • Fields
    • Lines
    • Shapes
    • Charts
    • FormFields
  • Styles
    • Section
    • Font
    • Paragraph
    • Table
  • Templates processing
  • Writers & readers
    • OOXML
    • OpenDocument
    • RTF
    • HTML
    • PDF
  • Recipes
  • Frequently asked questions
  • References

Introduction

PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), and Rich Text Format (RTF).

PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord aims to be a high-quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading this Developers’ Documentation and the API Documentation.

Features

  • Set document properties, e.g. title, subject, and creator.
  • Create document sections with different settings, e.g. portrait/landscape, page size, and page numbering
  • Create header and footer for each section
  • Set default font type, font size, and paragraph style
  • Use UTF-8 and East Asia fonts/characters
  • Define custom font styles (e.g. bold, italic, color) and paragraph styles (e.g. centered, multicolumns, spacing) either as named style or inline in text
  • Insert paragraphs, either as a simple text or complex one (a text run) that contains other elements
  • Insert titles (headers) and table of contents
  • Insert text breaks and page breaks
  • Insert and format images, either local, remote, or as page watermarks
  • Insert binary OLE Objects such as Excel or Visio
  • Insert and format tables with customized properties for each row (e.g. repeat as header row) and cell (e.g. background color, rowspan, colspan)
  • Insert list items as bulleted, numbered, or multilevel
  • Insert hyperlinks
  • Insert footnotes and endnotes
  • Insert drawing shapes (arc, curve, line, polyline, rect, oval)
  • Insert charts (pie, doughnut, bar, line, area, scatter, radar)
  • Insert form fields (textinput, checkbox, and dropdown)
  • Create document from templates
  • Use XSL 1.0 style sheets to transform main document part of OOXML template
  • … and many more features in progress

File formats

Below are the supported features for each file format.

Writers

Feature matrix columns: DOCX, ODT, RTF, HTML, PDF. Rows:

  • Document Properties: Standard, Custom
  • Element Type: Text, Text Run, Title, Link, Preserve Text, Text Break, Page Break, List, Table, Image, Object, Watermark, Table of Contents, Header, Footer, Footnote, Endnote
  • Graphs: 2D basic graphs, 2D advanced graphs, 3D graphs
  • Math: OMML support, MathML support
  • Bonus: Encryption, Protection

Readers

Feature matrix columns: DOCX, ODT, RTF, HTML. Rows:

  • Document Properties: Standard, Custom
  • Element Type: Text, Text Run, Title, Link, Preserve Text, Text Break, Page Break, List, Table, Image, Object, Watermark, Table of Contents, Header, Footer, Footnote, Endnote
  • Graphs: 2D basic graphs, 2D advanced graphs, 3D graphs
  • Math: OMML support, MathML support
  • Bonus: Encryption, Protection

Contributing

We welcome everyone to contribute to PHPWord. Below are some of the things that you can do to contribute:

  • Read our contributing guide
  • Fork us and request a pull to the develop branch
  • Submit bug reports or feature requests to GitHub
  • Follow @PHPWord and @PHPOffice on Twitter

Installing/configuring

Requirements

Mandatory:

  • PHP 5.3+
  • PHP Zip extension
  • PHP XML Parser extension

Optional PHP extensions:

  • GD
  • XMLWriter
  • XSL

Installation

There are two ways to install PHPWord, i.e. via Composer or manually by downloading the library.

Using Composer

To install via Composer, add the following lines to your composer.json:

{
    "require": {
       "phpoffice/phpword": "dev-master"
    }
}

Manual install

To install manually, download the PHPWord package from GitHub. Extract the package and put its contents on your machine. To use the library, include src/PhpWord/Autoloader.php in your script and invoke Autoloader::register.

require_once '/path/to/src/PhpWord/Autoloader.php';
\PhpOffice\PhpWord\Autoloader::register();

Using samples

After installation, you can browse and use the samples that we’ve provided, either by command line or using browser. If you can access your PHPWord library folder using browser, point your browser to the samples folder, e.g. http://localhost/PhpWord/samples/.

General usage

Basic example

The following is a basic example of the PHPWord library. More examples are provided in the samples folder.

<?php
require_once 'src/PhpWord/Autoloader.php';
\PhpOffice\PhpWord\Autoloader::register();

// Creating the new document...
$phpWord = new \PhpOffice\PhpWord\PhpWord();

/* Note: any element you append to a document must reside inside of a Section. */

// Adding an empty Section to the document...
$section = $phpWord->addSection();
// Adding Text element to the Section having font styled by default...
$section->addText(
    htmlspecialchars(
        '"Learn from yesterday, live for today, hope for tomorrow. '
            . 'The important thing is not to stop questioning." '
            . '(Albert Einstein)'
    )
);

/*
 * Note: it's possible to customize font style of the Text element you add in three ways:
 * - inline;
 * - using named font style (new font style object will be implicitly created);
 * - using explicitly created font style object.
 */

// Adding Text element with font customized inline...
$section->addText(
    htmlspecialchars(
        '"Great achievement is usually born of great sacrifice, '
            . 'and is never the result of selfishness." '
            . '(Napoleon Hill)'
    ),
    array('name' => 'Tahoma', 'size' => 10)
);

// Adding Text element with font customized using named font style...
$fontStyleName = 'oneUserDefinedStyle';
$phpWord->addFontStyle(
    $fontStyleName,
    array('name' => 'Tahoma', 'size' => 10, 'color' => '1B2232', 'bold' => true)
);
$section->addText(
    htmlspecialchars(
        '"The greatest accomplishment is not in never falling, '
            . 'but in rising again after you fall." '
            . '(Vince Lombardi)'
    ),
    $fontStyleName
);

// Adding Text element with font customized using explicitly created font style object...
$fontStyle = new \PhpOffice\PhpWord\Style\Font();
$fontStyle->setBold(true);
$fontStyle->setName('Tahoma');
$fontStyle->setSize(13);
$myTextElement = $section->addText(
    htmlspecialchars('"Believe you can and you\'re halfway there." (Theodor Roosevelt)')
);
$myTextElement->setFontStyle($fontStyle);

// Saving the document as OOXML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$objWriter->save('helloWorld.docx');

// Saving the document as ODF file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'ODText');
$objWriter->save('helloWorld.odt');

// Saving the document as HTML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
$objWriter->save('helloWorld.html');

/* Note: we skip RTF, because it's not XML-based and requires a different example. */
/* Note: we skip PDF, because "HTML-to-PDF" approach is used to create PDF documents. */

Settings

The \PhpOffice\PhpWord\Settings class provides some options that affect the behavior of PHPWord. Below are the options.

XML Writer compatibility

This option sets XMLWriter::setIndent and XMLWriter::setIndentString. The default value of this option is true (compatible), which is required for OpenOffice to render OOXML document correctly. You can set this option to false during development to make the resulting XML file easier to read.

\PhpOffice\PhpWord\Settings::setCompatibility(false);

Zip class

By default, PHPWord uses PHP ZipArchive to read or write ZIP compressed archives and the files inside them. If you can’t have ZipArchive installed on your server, you can use the pure PHP alternative, PCLZip, which is included with PHPWord.

\PhpOffice\PhpWord\Settings::setZipClass(\PhpOffice\PhpWord\Settings::PCLZIP);

Default font

By default, every text appears in Arial 10 point. You can alter the default font by using the following two functions:

$phpWord->setDefaultFontName('Times New Roman');
$phpWord->setDefaultFontSize(12);

Document information

You can set the document information such as title, creator, and company name. Use the following functions:

$properties = $phpWord->getDocInfo();
$properties->setCreator('My name');
$properties->setCompany('My factory');
$properties->setTitle('My title');
$properties->setDescription('My description');
$properties->setCategory('My category');
$properties->setLastModifiedBy('My name');
$properties->setCreated(mktime(0, 0, 0, 3, 12, 2014));
$properties->setModified(mktime(0, 0, 0, 3, 14, 2014));
$properties->setSubject('My subject');
$properties->setKeywords('my, key, word');

Measurement units

The base length unit in Office Open XML is the twip. Twip means «TWentieth of an Inch Point», i.e. 1 twip = 1/1440 inch.

You can use PHPWord helper functions to convert inches, centimeters, or points to twips.

// Paragraph with 6 points space after
$phpWord->addParagraphStyle('My Style', array(
    'spaceAfter' => \PhpOffice\PhpWord\Shared\Converter::pointToTwip(6))
);

$section = $phpWord->addSection();
$sectionStyle = $section->getStyle();
// half inch left margin
$sectionStyle->setMarginLeft(\PhpOffice\PhpWord\Shared\Converter::inchToTwip(.5));
// 2 cm right margin
$sectionStyle->setMarginRight(\PhpOffice\PhpWord\Shared\Converter::cmToTwip(2));
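
To make the arithmetic behind these conversions visible, here is a stand-alone sketch that does not use the library at all. It relies only on the definition given above (1 point = 20 twips, 1 inch = 1440 twips) plus the usual 2.54 cm per inch; the constant and function names are made up for this example and are not part of PHPWord.

```php
<?php
// Stand-alone arithmetic, independent of PHPWord's own converter class.
const TWIPS_PER_POINT = 20;
const TWIPS_PER_INCH = 1440;
const CM_PER_INCH = 2.54;

function pointsToTwips(float $points): float
{
    return $points * TWIPS_PER_POINT;
}

function inchesToTwips(float $inches): float
{
    return $inches * TWIPS_PER_INCH;
}

function centimetresToTwips(float $cm): float
{
    // Convert centimetres to inches first, then inches to twips.
    return $cm / CM_PER_INCH * TWIPS_PER_INCH;
}

printf("6 pt    = %.0f twips\n", pointsToTwips(6));
printf("0.5 in  = %.0f twips\n", inchesToTwips(0.5));
printf("2 cm    = %.2f twips\n", centimetresToTwips(2));
```

So the three values used in the example above come out as 120, 720 and roughly 1133.86 twips respectively.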

Containers

Containers are objects where you can put elements (texts, lists, tables, etc). There are 3 main containers, i.e. sections, headers, and footers. There are 3 elements that can also act as containers, i.e. textruns, table cells, and footnotes.

Sections

Every visible element in Word is placed inside of a section. To create a section, use the following code:

$section = $phpWord->addSection($sectionStyle);

The $sectionStyle is an optional associative array that sets the section style. Example:

$sectionStyle = array(
    'orientation' => 'landscape',
    'marginTop' => 600,
    'colsNum' => 2,
);

Page number

You can change a section page number by using the pageNumberingStart style of the section.

// Method 1
$section = $phpWord->addSection(array('pageNumberingStart' => 1));

// Method 2
$section = $phpWord->addSection();
$section->getStyle()->setPageNumberingStart(1);

Multicolumn

You can change a section layout to multicolumn (like in a newspaper) by using the breakType and colsNum style of the section.

// Method 1
$section = $phpWord->addSection(array('breakType' => 'continuous', 'colsNum' => 2));

// Method 2
$section = $phpWord->addSection();
$section->getStyle()->setBreakType('continuous');
$section->getStyle()->setColsNum(2);

Line numbering

You can apply line numbering to a section by using the lineNumbering style of the section.

// Method 1
$section = $phpWord->addSection(array('lineNumbering' => array()));

// Method 2
$section = $phpWord->addSection();
$section->getStyle()->setLineNumbering(array());

Below are the properties of the line numbering style.

  • start Line numbering starting value
  • increment Line number increments
  • distance Distance between text and line numbering in twip
  • restart Line numbering restart setting continuous|newPage|newSection

Headers

Each section can have its own header reference. To create a header use the addHeader method:

$header = $section->addHeader();

Be sure to save the result in a local object. You can use all elements that are available for the footer. See the «Footer» section for details. Additionally, watermarks or background pictures can be added only inside the header reference. See the «Watermarks» section.

Footers

Each section can have its own footer reference. To create a footer, use the addFooter method:

$footer = $section->addFooter();

Be sure to save the result in a local object to add elements to a footer. You can add the following elements to footers:

  • Texts addText and createTextrun
  • Text breaks
  • Images
  • Tables
  • Preserve text

See the «Elements» section for the details of each element.

Other containers

Textruns, table cells, and footnotes are elements that can also act as containers. See the corresponding «Elements» section for the details of each element.
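
Because containers can hold other containers, reading the text of a loaded document usually means walking the element tree recursively. The sketch below shows that walk under stated assumptions: it expects container-like objects to expose getElements() and text-like objects to expose getText(), which is how PHPWord's container and text classes are generally used, but it is demonstrated here with stand-in classes (StubText, StubContainer) so that it runs without the library. Tables, which are organised into rows and cells, are not covered.

```php
<?php
// Hypothetical helper: collects text from a nested element tree.
function collectText(iterable $elements): array
{
    $lines = [];
    foreach ($elements as $element) {
        if (!is_object($element)) {
            continue;
        }
        if (method_exists($element, 'getElements')) {
            // Container-like element (text run, cell, footnote): descend into it.
            $lines = array_merge($lines, collectText($element->getElements()));
        } elseif (method_exists($element, 'getText') && is_string($element->getText())) {
            $lines[] = $element->getText();
        }
    }

    return $lines;
}

// Stand-in classes so the sketch runs without the library installed.
final class StubText
{
    private $text;

    public function __construct(string $text)
    {
        $this->text = $text;
    }

    public function getText(): string
    {
        return $this->text;
    }
}

final class StubContainer
{
    private $elements;

    public function __construct(array $elements)
    {
        $this->elements = $elements;
    }

    public function getElements(): array
    {
        return $this->elements;
    }
}

$tree = [
    new StubText('Heading'),
    new StubContainer([
        new StubText('bold part'),
        new StubContainer([new StubText('footnote text')]),
    ]),
];

print_r(collectText($tree));
```

With a real document, the same function could be called on the elements of each section; elements that offer neither method are simply skipped.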

Elements

Below is the matrix of element availability in each container. The columns show the containers, while the rows list the elements.

Num  Element        Section  Header  Footer  Cell  Text Run  Footnote
1    Text           v        v       v       v     v         v
2    Text Run       v        v       v       v     -         -
3    Link           v        v       v       v     v         v
4    Title          v        ?       ?       ?     ?         ?
5    Preserve Text  ?        v       v       v*    -         -
6    Text Break     v        v       v       v     v         v
7    Page Break     v        -       -       -     -         -
8    List           v        v       v       v     -         -
9    Table          v        v       v       v     -         -
10   Image          v        v       v       v     v         v
11   Watermark      -        v       -       -     -         -
12   Object         v        v       v       v     v         v
13   TOC            v        -       -       -     -         -
14   Footnote       v        -       -       v**   v**       -
15   Endnote        v        -       -       v**   v**       -
16   CheckBox       v        v       v       v     -         -
17   TextBox        v        v       v       v     -         -
18   Field          v        v       v       v     v         v
19   Line           v        v       v       v     v         v
20   Shape          v        v       v       v     v         v
21   Chart          v        -       -       -     -         -
22   Form Fields    v        v       v       v     v         v

Legend:

  • v Available
  • v* Available only when inside header/footer
  • v** Available only when inside section
  • - Not available
  • ? Should be available

Texts

Text can be added by using the addText and addTextRun methods. addText is used for creating simple paragraphs that only contain text with the same style. addTextRun is used for creating complex paragraphs that contain text with different styles (some bold, other italic, etc.) or other elements, e.g. images or links. The syntax is as follows:

$section->addText($text, [$fontStyle], [$paragraphStyle]);
$textrun = $section->addTextRun([$paragraphStyle]);

You can use the $fontStyle and $paragraphStyle variable to define text formatting. There are 2 options to style the inserted text elements, i.e. inline style by using array or defined style by adding style definition.

Inline style examples:

$fontStyle = array('name' => 'Times New Roman', 'size' => 9);
$paragraphStyle = array('align' => 'both');
$section->addText('I am simple paragraph', $fontStyle, $paragraphStyle);

$textrun = $section->addTextRun();
$textrun->addText('I am bold', array('bold' => true));
$textrun->addText('I am italic', array('italic' => true));
$textrun->addText('I am colored', array('color' => 'AACC00'));

Defined style examples:

$fontStyle = array('color' => '006699', 'size' => 18, 'bold' => true);
$phpWord->addFontStyle('fStyle', $fontStyle);
$text = $section->addText('Hello world!', 'fStyle');

$paragraphStyle = array('align' => 'center');
$phpWord->addParagraphStyle('pStyle', $paragraphStyle);
$text = $section->addText('Hello world!', null, 'pStyle');

Titles

If you want to structure your document or build a table of contents, you need titles or headings. To add a title to the document, use the addTitleStyle and addTitle methods.

$phpWord->addTitleStyle($depth, [$fontStyle], [$paragraphStyle]);
$section->addTitle($text, [$depth]);

It’s necessary to add a title style to your document because otherwise the title won’t be detected as a real title.

Links

You can add Hyperlinks to the document by using the function addLink:

$section->addLink($linkSrc, [$linkName], [$fontStyle], [$paragraphStyle]);
  • $linkSrc The URL of the link.
  • $linkName Placeholder of the URL that appears in the document.
  • $fontStyle See «Font style» section.
  • $paragraphStyle See «Paragraph style» section.

Preserve texts

The addPreserveText method is used to add a page number or page count to headers or footers.

$footer->addPreserveText('Page {PAGE} of {NUMPAGES}.');

Breaks

Text breaks

Text breaks are empty new lines. To add text breaks, use the following syntax. All parameters are optional.

$section->addTextBreak([$breakCount], [$fontStyle], [$paragraphStyle]);
  • $breakCount How many lines
  • $fontStyle See «Font style» section.
  • $paragraphStyle See «Paragraph style» section.

Page breaks

There are two ways to insert a page break: using the addPageBreak method or using the pageBreakBefore style of a paragraph.

$section->addPageBreak();

Lists

To add a list item use the function addListItem.

Basic usage:

$section->addListItem($text, [$depth], [$fontStyle], [$listStyle], [$paragraphStyle]);

Parameters:

  • $text Text that appears in the document.
  • $depth Depth of list item.
  • $fontStyle See «Font style» section.
  • $listStyle List style of the current element TYPE_NUMBER, TYPE_ALPHANUM, TYPE_BULLET_FILLED, etc. See list of constants in PHPWord_Style_ListItem.
  • $paragraphStyle See «Paragraph style» section.

Advanced usage:

You can also create your own numbering style by changing the $listStyle parameter with the name of your numbering style.

$phpWord->addNumberingStyle(
    'multilevel',
    array('type' => 'multilevel', 'levels' => array(
        array('format' => 'decimal', 'text' => '%1.', 'left' => 360, 'hanging' => 360, 'tabPos' => 360),
        array('format' => 'upperLetter', 'text' => '%2.', 'left' => 720, 'hanging' => 360, 'tabPos' => 720),
        )
     )
);
$section->addListItem('List Item I', 0, null, 'multilevel');
$section->addListItem('List Item I.a', 1, null, 'multilevel');
$section->addListItem('List Item I.b', 1, null, 'multilevel');
$section->addListItem('List Item II', 0, null, 'multilevel');

Tables

To add tables, rows, and cells, use the addTable, addRow, and addCell methods:

$table = $section->addTable([$tableStyle]);
$table->addRow([$height], [$rowStyle]);
$cell = $table->addCell($width, [$cellStyle]);

Table style can be defined with addTableStyle:

$tableStyle = array(
    'borderColor' => '006699',
    'borderSize' => 6,
    'cellMargin' => 50
);
$firstRowStyle = array('bgColor' => '66BBFF');
$phpWord->addTableStyle('myTable', $tableStyle, $firstRowStyle);
$table = $section->addTable('myTable');

Cell span

You can span a cell on multiple columns by using gridSpan or multiple rows by using vMerge.

$cell = $table->addCell(200);
$cell->getStyle()->setGridSpan(5);

See Sample_09_Tables.php for more code samples.

Images

To add an image, use the addImage method to sections, headers, footers, textruns, or table cells.

$section->addImage($src, [$style]);
  • $src String path to a local image or URL of a remote image
  • $style Array of styles for the image. See below.

Examples:

$section = $phpWord->addSection();
$section->addImage(
    'mars.jpg',
    array(
        'width' => 100,
        'height' => 100,
        'marginTop' => -1,
        'marginLeft' => -1,
        'wrappingStyle' => 'behind'
    )
);
$footer = $section->addFooter();
$footer->addImage('http://example.com/image.php');
$textrun = $section->addTextRun();
$textrun->addImage('http://php.net/logo.jpg');

Watermarks

To add a watermark (or page background image), your section needs a header reference. After creating a header, you can use the addWatermark method to add a watermark.

$section = $phpWord->addSection();
$header = $section->addHeader();
$header->addWatermark('resources/_earth.jpg', array('marginTop' => 200, 'marginLeft' => 55));

Objects

You can add OLE embeddings, such as Excel spreadsheets or PowerPoint presentations, to the document by using the addObject method.

$section->addObject($src, [$style]);

Table of contents

To add a table of contents (TOC), you can use the addTOC method. Your TOC can only be generated if you have added at least one title (See «Titles»).

$section->addTOC([$fontStyle], [$tocStyle], [$minDepth], [$maxDepth]);
  • $fontStyle: See font style section
  • $tocStyle: See available options below
  • $minDepth: Minimum depth of header to be shown. Default 1
  • $maxDepth: Maximum depth of header to be shown. Default 9

Options for $tocStyle:

  • tabLeader Fill type between the title text and the page number. Use the defined constants in PHPWord_Style_TOC.
  • tabPos The position of the tab where the page number appears in twips.
  • indent The indent factor of the titles in twips.

Footnotes & endnotes

You can create footnotes with addFootnote and endnotes with addEndnote in texts or textruns, but it’s recommended to use textrun to have better layout. You can use addText, addLink, addTextBreak, addImage, addObject on footnotes and endnotes.

On textrun:

$textrun = $section->addTextRun();
$textrun->addText('Lead text.');
$footnote = $textrun->addFootnote();
$footnote->addText('Footnote text can have ');
$footnote->addLink('http://test.com', 'links');
$footnote->addText('.');
$footnote->addTextBreak();
$footnote->addText('And text break.');
$textrun->addText('Trailing text.');
$endnote = $textrun->addEndnote();
$endnote->addText('Endnote put at the end');

On text:

$section->addText('Lead text.');
$footnote = $section->addFootnote();
$footnote->addText('Footnote text.');

The footnote reference number is displayed as a decimal number starting from 1. This number uses the FooterReference style, which you can redefine with the addFontStyle method. The default value for this style is array('superScript' => true).

Checkboxes

Checkbox elements can be added to sections or table cells by using addCheckBox.

$section->addCheckBox($name, $text, [$fontStyle], [$paragraphStyle])
  • $name Name of the check box.
  • $text Text following the check box
  • $fontStyle See «Font style» section.
  • $paragraphStyle See «Paragraph style» section.

Textboxes

To be completed.

Fields

To be completed.

Lines

To be completed.

Shapes

To be completed.

Charts

To be completed.

Form fields

To be completed.

Styles

Section

Below are the available styles for section:

  • orientation Page orientation, i.e. ‘portrait’ (default) or ‘landscape’
  • marginTop Page margin top in twips
  • marginLeft Page margin left in twips
  • marginRight Page margin right in twips
  • marginBottom Page margin bottom in twips
  • borderTopSize Border top size in twips
  • borderTopColor Border top color
  • borderLeftSize Border left size in twips
  • borderLeftColor Border left color
  • borderRightSize Border right size in twips
  • borderRightColor Border right color
  • borderBottomSize Border bottom size in twips
  • borderBottomColor Border bottom color
  • headerHeight Spacing to top of header
  • footerHeight Spacing to bottom of footer
  • gutter Page gutter spacing
  • colsNum Number of columns
  • colsSpace Spacing between columns
  • breakType Section break type (nextPage, nextColumn, continuous, evenPage, oddPage)

The following two styles are automatically set by the use of the orientation style. You can alter them but that’s not recommended.

  • pageSizeW Page width in twips
  • pageSizeH Page height in twips

Font

Available font styles:

  • name Font name, e.g. Arial
  • size Font size, e.g. 20, 22
  • hint Font content type, default, eastAsia, or cs
  • bold Bold, true or false
  • italic Italic, true or false
  • superScript Superscript, true or false
  • subScript Subscript, true or false
  • underline Underline, dash, dotted, etc.
  • strikethrough Strikethrough, true or false
  • doubleStrikethrough Double strikethrough, true or false
  • color Font color, e.g. FF0000
  • fgColor Font highlight color, e.g. yellow, green, blue
  • bgColor Font background color, e.g. FF0000
  • smallCaps Small caps, true or false
  • allCaps All caps, true or false

Paragraph

Available paragraph styles:

  • align Paragraph alignment, left, right or center
  • spaceBefore Space before paragraph
  • spaceAfter Space after paragraph
  • indent Indent by how much
  • hanging Hanging by how much
  • basedOn Parent style
  • next Style for next paragraph
  • widowControl Allow first/last line to display on a separate page, true or false
  • keepNext Keep paragraph with next paragraph, true or false
  • keepLines Keep all lines on one page, true or false
  • pageBreakBefore Start paragraph on next page, true or false
  • lineHeight Text line height, e.g. 1.0, 1.5, etc.
  • tabs Set of custom tab stops

Table

Table styles:

  • width Table width in percent
  • bgColor Background color, e.g. ‘9966CC’
  • border(Top|Right|Bottom|Left)Size Border size in twips
  • border(Top|Right|Bottom|Left)Color Border color, e.g. ‘9966CC’
  • cellMargin(Top|Right|Bottom|Left) Cell margin in twips

Row styles:

  • tblHeader Repeat table row on every new page, true or false
  • cantSplit Table row cannot break across pages, true or false
  • exactHeight Row height is exact or at least

Cell styles:

  • width Cell width in twips
  • valign Vertical alignment, top, center, both, bottom
  • textDirection Direction of text
  • bgColor Background color, e.g. '9966CC'
  • border(Top|Right|Bottom|Left)Size Border size in twips
  • border(Top|Right|Bottom|Left)Color Border color, e.g. '9966CC'
  • gridSpan Number of columns spanned
  • vMerge restart or continue
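
Several of the table and cell options above are measured in twips. As a quick worked example (not part of the original reference): a twip is one twentieth of a point, so there are 1440 twips per inch and roughly 567 per centimetre. The helper names below are hypothetical and exist only for this sketch.

```php
<?php
// Hypothetical helpers for illustration; they are not part of PHPWord.
const TWIPS_PER_POINT = 20;    // 1 twip = 1/20 of a point
const TWIPS_PER_INCH = 1440;   // 72 points per inch * 20
const CM_PER_INCH = 2.54;

function pointsToTwips(float $points): int
{
    return (int) round($points * TWIPS_PER_POINT);
}

function inchesToTwips(float $inches): int
{
    return (int) round($inches * TWIPS_PER_INCH);
}

function cmToTwips(float $cm): int
{
    return (int) round($cm / CM_PER_INCH * TWIPS_PER_INCH);
}

// A 5 cm wide cell and a 4 pt top cell margin, expressed in twips.
$cellStyle = array('width' => cmToTwips(5));
$tableStyle = array('cellMarginTop' => pointsToTwips(4));
```

The resulting arrays use only the option names listed above, so they could be passed wherever a cell or table style array is expected.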

Image

Available image styles:

  • width Width in pixels
  • height Height in pixels
  • align Image alignment, left, right, or center
  • marginTop Top margin in inches, can be negative
  • marginLeft Left margin in inches, can be negative
  • wrappingStyle Wrapping style, inline, square, tight, behind, or infront

Numbering level

  • start Starting value
  • format Numbering format bullet|decimal|upperRoman|lowerRoman|upperLetter|lowerLetter
  • restart Restart numbering level symbol
  • suffix Content between numbering symbol and paragraph text tab|space|nothing
  • text Numbering level text e.g. %1 for nonbullet or bullet character
  • align Numbering symbol align left|center|right|both
  • left See paragraph style
  • hanging See paragraph style
  • tabPos See paragraph style
  • font Font name
  • hint See font style

Templates processing

You can create a .docx document template with included search-patterns which can be replaced by any value you wish. Only single-line values can be replaced.

To work with a template file, create an instance with new TemplateProcessor. When the instance is created, the document template is copied into the temporary directory. You can then use the TemplateProcessor::setValue method to replace a search pattern with a value. Search patterns have the form ${search-pattern}.

Example:

$templateProcessor = new TemplateProcessor('Template.docx');
$templateProcessor->setValue('Name', 'Somebody someone');
$templateProcessor->setValue('Street', 'Coming-Undone-Street 32');
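
To make the ${search-pattern} convention concrete, here is a simplified illustration on a plain string. It is only a sketch of the idea: fillPlaceholders is a hypothetical helper, and the real TemplateProcessor works on the XML inside the .docx package rather than on plain text.

```php
<?php
// Hypothetical helper: replaces ${key} markers in a plain string.
function fillPlaceholders(string $template, array $values): string
{
    $map = array();
    foreach ($values as $key => $value) {
        // Single quotes keep '${' literal (no variable interpolation).
        $map['${' . $key . '}'] = (string) $value;
    }
    // strtr() replaces all markers in one pass, so inserted values are not re-scanned.
    return strtr($template, $map);
}

$text = fillPlaceholders(
    'Dear ${Name}, your address is ${Street}.',
    array('Name' => 'Somebody someone', 'Street' => 'Coming-Undone-Street 32')
);
echo $text, PHP_EOL;
```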

It is not possible to directly add new OOXML elements to the template file being processed, but it is possible to transform main document part of the template using XSLT (see TemplateProcessor::applyXslStyleSheet).

See Sample_07_TemplateCloneRow.php for example on how to create multirow from a single row in a template by using TemplateProcessor::cloneRow.

See Sample_23_TemplateBlock.php for example on how to clone a block of text using TemplateProcessor::cloneBlock and delete a block of text using TemplateProcessor::deleteBlock.

Writers & readers

OOXML

The package of OOXML document consists of the following files.

  • _rels/
    • .rels
  • docProps/
    • app.xml
    • core.xml
    • custom.xml
  • word/
    • _rels/
      • document.xml.rels
    • media/
    • theme/
      • theme1.xml
    • document.xml
    • fontTable.xml
    • numbering.xml
    • settings.xml
    • styles.xml
    • webSettings.xml
  • [Content_Types].xml
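
Because the package is an ordinary ZIP archive, its parts can be listed with PHP's ZipArchive class. The following is a minimal sketch, assuming the zip extension is available; listPackageParts is a hypothetical helper name and the file name in the usage comment is a placeholder.

```php
<?php
// Hypothetical helper: returns the sorted names of all parts in an OOXML (or any ZIP) package.
function listPackageParts(string $path): array
{
    $zip = new ZipArchive();
    // open() returns true on success and an integer error code on failure.
    if ($zip->open($path) !== true) {
        return array();
    }
    $names = array();
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();
    sort($names);
    return $names;
}

// Usage (placeholder file name):
// print_r(listPackageParts('HelloWorld.docx'));
```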

OpenDocument

Package

The package of OpenDocument document consists of the following files.

  • META-INF/
    • manifest.xml
  • Pictures/
  • content.xml
  • meta.xml
  • styles.xml

content.xml

The structure of content.xml is described below.

  • office:document-content
    • office:font-facedecls
    • office:automatic-styles
    • office:body
      • office:text
        • draw:*
        • office:forms
        • table:table
        • text:list
        • text:numbered-paragraph
        • text:p
        • text:table-of-contents
        • text:section
      • office:chart
      • office:image
      • office:drawing
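
As an illustration of how this structure can be walked, the sketch below collects the text of every text:p element under office:body/office:text using PHP's DOM extension. odtParagraphs is a hypothetical helper, and the sample XML is a hand-written fragment rather than a complete content.xml.

```php
<?php
// Hypothetical helper: returns the text of each text:p paragraph in an ODF content.xml string.
function odtParagraphs(string $contentXml): array
{
    if ($contentXml === '') {
        return array();
    }
    $dom = new DOMDocument();
    if (!@$dom->loadXML($contentXml)) {
        return array();
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('office', 'urn:oasis:names:tc:opendocument:xmlns:office:1.0');
    $xpath->registerNamespace('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

    $paragraphs = array();
    foreach ($xpath->query('//office:body/office:text//text:p') as $p) {
        // textContent also includes text from nested elements such as text:span.
        $paragraphs[] = $p->textContent;
    }
    return $paragraphs;
}

$sample = '<office:document-content'
    . ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    . ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    . '<office:body><office:text>'
    . '<text:p>First paragraph</text:p>'
    . '<text:p>Second <text:span>paragraph</text:span></text:p>'
    . '</office:text></office:body></office:document-content>';

print_r(odtParagraphs($sample));
```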

styles.xml

The structure of styles.xml is described below.

  • office:document-styles
    • office:styles
    • office:automatic-styles
    • office:master-styles
      • office:master-page

RTF

To be completed.

HTML

To be completed.

PDF

To be completed.

Recipes

Create float left image

Use absolute positioning relative to margin horizontally and to line vertically.

$imageStyle = array(
    'width' => 40,
    'height' => 40,
    'wrappingStyle' => 'square',
    'positioning' => 'absolute',
    'posHorizontalRel' => 'margin',
    'posVerticalRel' => 'line',
);
$textrun->addImage('resources/_earth.jpg', $imageStyle);
$textrun->addText($lipsumText);

Download the produced file automatically

Use php://output as the filename.

$phpWord = new \PhpOffice\PhpWord\PhpWord();
$section = $phpWord->addSection();
$section->addText('Hello World!');
$file = 'HelloWorld.docx';
header("Content-Description: File Transfer");
header('Content-Disposition: attachment; filename="' . $file . '"');
header('Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document');
header('Content-Transfer-Encoding: binary');
header('Cache-Control: must-revalidate, post-check=0, pre-check=0');
header('Expires: 0');
$xmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$xmlWriter->save("php://output");

Create numbered headings

Define a numbering style and title styles, and match the two styles (with pStyle and numStyle) like below.

$phpWord->addNumberingStyle(
    'hNum',
    array('type' => 'multilevel', 'levels' => array(
        array('pStyle' => 'Heading1', 'format' => 'decimal', 'text' => '%1'),
        array('pStyle' => 'Heading2', 'format' => 'decimal', 'text' => '%1.%2'),
        array('pStyle' => 'Heading3', 'format' => 'decimal', 'text' => '%1.%2.%3'),
        )
    )
);
$phpWord->addTitleStyle(1, array('size' => 16), array('numStyle' => 'hNum', 'numLevel' => 0));
$phpWord->addTitleStyle(2, array('size' => 14), array('numStyle' => 'hNum', 'numLevel' => 1));
$phpWord->addTitleStyle(3, array('size' => 12), array('numStyle' => 'hNum', 'numLevel' => 2));

$section->addTitle('Heading 1', 1);
$section->addTitle('Heading 2', 2);
$section->addTitle('Heading 3', 3);

Add a link within a title

Apply the 'HeadingN' paragraph style to a TextRun or Link. Sample code:

$phpWord = new \PhpOffice\PhpWord\PhpWord();
$phpWord->addTitleStyle(1, array('size' => 16, 'bold' => true));
$phpWord->addTitleStyle(2, array('size' => 14, 'bold' => true));
$phpWord->addFontStyle('Link', array('color' => '0000FF', 'underline' => 'single'));

$section = $phpWord->addSection();

// Textrun
$textrun = $section->addTextRun('Heading1');
$textrun->addText('The ');
$textrun->addLink('https://github.com/PHPOffice/PHPWord', 'PHPWord', 'Link');

// Link
$section->addLink('https://github.com/', 'GitHub', 'Link', 'Heading2');

Remove [Compatibility Mode] text in the MS Word title bar

Use the Metadata\Compatibility::setOoxmlVersion(n) method, where n is the version of Office (14 = Office 2010, 15 = Office 2013).

$phpWord->getCompatibility()->setOoxmlVersion(15);

Frequently asked questions

Is this the same as the PHPWord that I found on CodePlex?

No. This one is much better, with tons of new features that you can’t find in PHPWord 0.6.3. Development on CodePlex has been halted and moved to GitHub to allow more participation from the crowd. The more the merrier, right?

I’ve been running PHPWord from CodePlex flawlessly, but I can’t use the latest PHPWord from GitHub. Why?

PHPWord has required PHP 5.3+ since 0.8, while PHPWord 0.6.3 from CodePlex can run on PHP 5.2. There are a lot of new features that we get from PHP 5.3, and it’s been around since 2009! You should upgrade your PHP version to use PHPWord 0.8+.

References

ISO/IEC 29500, Third edition, 2012-09-01

  • Part 1: Fundamentals and Markup Language Reference
  • Part 2: Open Packaging Conventions
  • Part 3: Markup Compatibility and Extensibility
  • Part 4: Transitional Migration Features

Formal specifications

  • Oasis OpenDocument Standard Version 1.2
  • Rich Text Format (RTF) Specification, version 1.9.1

Other resources

  • DocumentFormat.OpenXml.Wordprocessing Namespace on MSDN

Is it possible to read and write Word files (2003 and 2007) in PHP without using a COM object? I know that I can do this:

$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose($file);

but Word will read it as an HTML file, not as a native .doc file.

Answer 1

Reading binary Word documents would require writing a parser based on the published DOC file format specifications. I don't think that is a realistic solution.

You can use the Microsoft Office XML formats to read and write Word files; they are compatible with both Word 2003 and Word 2007. For reading, you have to make sure the Word documents are saved in the right format (in Word 2007 it is called Word 2003 XML Document). For writing, you only need to follow the publicly available XML schema. I have never used this format to write Office documents from PHP, but I do use it to read an Excel worksheet (saved as XML Spreadsheet 2003, naturally) and display its data on a web page. Since the files are plain XML, it is not hard to find your way around them and work out how to extract the data you need.

The other option, for Word 2007 only (unless the OpenXML file formats are installed in your Word 2003), is to resort to OpenXML. A DOCX file is just a ZIP archive containing XML files. MSDN has plenty of resources on the OpenXML file format, so you should be able to work out how to read the data you need. Writing will be considerably harder, I think; it all depends on how much time you are willing to spend. You might also look at PHPExcel, a library that can write to and read from Excel 2007 files using the OpenXML standard. It will give you an idea of the work involved in reading and writing OpenXML Word documents.
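
For the Word 2003 XML route described in this answer, a minimal reading sketch could look like the following. It assumes the document was saved as "Word 2003 XML Document" and that its text sits in w:t elements inside w:p paragraphs in the 2003 WordprocessingML namespace; word2003XmlToText is a hypothetical helper, and the sample is a hand-written fragment, not a real export.

```php
<?php
// Hypothetical helper: extracts paragraph text from a "Word 2003 XML Document" string.
function word2003XmlToText(string $xml): string
{
    if ($xml === '') {
        return '';
    }
    $dom = new DOMDocument();
    if (!@$dom->loadXML($xml)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', 'http://schemas.microsoft.com/office/word/2003/wordml');

    $lines = array();
    foreach ($xpath->query('//w:body//w:p') as $paragraph) {
        $text = '';
        // A paragraph may be split into several runs; join their w:t text.
        foreach ($xpath->query('.//w:t', $paragraph) as $run) {
            $text .= $run->textContent;
        }
        $lines[] = $text;
    }
    return implode("\n", $lines);
}

$sample = '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
    . '<w:body>'
    . '<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>Second line</w:t></w:r></w:p>'
    . '</w:body></w:wordDocument>';

echo word2003XmlToText($sample), PHP_EOL;
```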

Answer 2

This solution works with versions earlier than Office 2007, and it is pure PHP with no COM at all:

<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end of line (chr(13))
to decide where the text is:
- split the file contents into chunks on chr(13)
- discard every chunk that contains a NUL
- stitch the remaining chunks back together
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc) {
    $fileHandle = fopen($userDoc, "r");
    $line = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        // Keep only non-empty chunks without NUL bytes
        if ($pos === false && strlen($thisline) != 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Drop everything except basic Latin text, digits, whitespace and a few punctuation marks
    $outtext = preg_replace('/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/', "", $outtext);
    return $outtext;
}

$userDoc = "cv.doc";
$text = parseWord($userDoc);
echo $text;
?>

Answer 3

This just updates the code from the previous answer:

<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end of line (chr(13))
to decide where the text is:
- split the file contents into chunks on chr(13)
- discard every chunk that contains a NUL
- stitch the remaining chunks back together
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc) {
    $fileHandle = fopen($userDoc, "r");
    $word_text = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $line = "";
    $tam = filesize($userDoc);
    $nulos = 0;
    $caracteres = 0;
    // Start at byte offset 1536 and stop after a long run of consecutive NUL bytes
    for ($i = 1536; $i < $tam; $i++) {
        $line .= $word_text[$i];
        if ($word_text[$i] === chr(0)) {
            $nulos++;
        } else {
            $nulos = 0;
            $caracteres++;
        }
        if ($nulos > 1996) {
            break;
        }
    }
    //echo $caracteres;
    $lines = explode(chr(0x0D), $line);
    //$outtext = "<pre>";
    $outtext = "";
    foreach ($lines as $thisline) {
        $tam = strlen($thisline);
        if (!$tam) {
            continue;
        }
        $new_line = "";
        for ($i = 0; $i < $tam; $i++) {
            $onechar = $thisline[$i];
            if ($onechar > chr(240)) {
                continue;
            }
            if ($onechar >= chr(0x20)) {
                $caracteres++;
                $new_line .= $onechar;
            }
            if ($onechar == chr(0x14)) {
                $new_line .= "</a>";
            }
            if ($onechar == chr(0x07)) {
                $new_line .= "\t";
                if (isset($thisline[$i + 1])) {
                    if ($thisline[$i + 1] == chr(0x07)) {
                        $new_line .= "\n";
                    }
                }
            }
        }
        // replace field codes with a hyperlink
        $new_line = str_replace("HYPERLINK", "<a href=", $new_line);
        $new_line = str_replace('\o', ">", $new_line);
        $new_line .= "\n";
        // image links
        $new_line = str_replace("INCLUDEPICTURE", "<br><img src=", $new_line);
        $new_line = str_replace('\*', "><br>", $new_line);
        $new_line = str_replace("MERGEFORMATINET", "", $new_line);
        $outtext .= nl2br($new_line);
    }
    return $outtext;
}

$userDoc = "Cultura.doc";
$text = parseWord($userDoc);
echo $text;
?>

Answer 4

www.phplivedocx.org is a SOAP-based service that works with files online, and it comes with a fair number of usage examples. I think that without COM this is simply not possible on a Linux server; the only idea is to convert the .doc file into another kind of file that PHP can parse…

Answer 5

Using the Open XML SDK and VSTO (Visual Studio Tools for Office), we can easily work with Word files, manipulate them, and even convert them internally to various formats such as .odt, .pdf, .docx, and so on. So go to msdn.microsoft.com and study the Office development tab carefully. This is the easiest way to do it, because all the functions we need to implement are already available in .NET! And since you want to build your project in PHP, you can do that in Visual Studio and .NET, because PHP is also one of the .NET-compliant languages!

Answer 6

Use the following class directly to read a Word document:

class DocxConversion {
    private $filename;

    public function __construct($filePath) {
        $this->filename = $filePath;
    }

    // Legacy binary .doc: keep CR-separated chunks that contain no NUL bytes
    private function read_doc() {
        $fileHandle = fopen($this->filename, "r");
        $line = @fread($fileHandle, filesize($this->filename));
        fclose($fileHandle);
        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        foreach ($lines as $thisline) {
            $pos = strpos($thisline, chr(0x00));
            if ($pos === false && strlen($thisline) != 0) {
                $outtext .= $thisline . " ";
            }
        }
        $outtext = preg_replace('/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/', "", $outtext);
        return $outtext;
    }

    // .docx: read word/document.xml from the ZIP package and strip the tags
    // (ZipArchive replaces the zip_open() family, which is deprecated as of PHP 8.0)
    private function read_docx() {
        $zip = new ZipArchive;
        if ($zip->open($this->filename) !== true) return false;
        $content = $zip->getFromName("word/document.xml");
        $zip->close();
        if ($content === false) return false;
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        return strip_tags($content);
    }

    /************************ excel sheet ************************************/
    public function xlsx_to_text($input_file) {
        $xml_filename = "xl/sharedStrings.xml"; // content file name
        $zip_handle = new ZipArchive;
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            if (($xml_index = $zip_handle->locateName($xml_filename)) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                // loadXML() must be called on an instance; the static call fails on PHP 8
                $xml_handle = new DOMDocument();
                $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                $output_text = strip_tags($xml_handle->saveXML());
            }
            $zip_handle->close();
        }
        return $output_text;
    }

    /************************* power point files *****************************/
    public function pptx_to_text($input_file) {
        $zip_handle = new ZipArchive;
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            $slide_number = 1; // loop through slide files
            while (($xml_index = $zip_handle->locateName("ppt/slides/slide" . $slide_number . ".xml")) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                $xml_handle = new DOMDocument();
                $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                $output_text .= strip_tags($xml_handle->saveXML());
                $slide_number++;
            }
            $zip_handle->close();
        }
        return $output_text;
    }

    public function convertToText() {
        if (isset($this->filename) && !file_exists($this->filename)) {
            return "File Not exists";
        }
        $fileArray = pathinfo($this->filename);
        $file_ext = isset($fileArray['extension']) ? $fileArray['extension'] : '';
        if ($file_ext == "doc") {
            return $this->read_doc();
        } elseif ($file_ext == "docx") {
            return $this->read_docx();
        } elseif ($file_ext == "xlsx") {
            // The original called these without the file name argument
            return $this->xlsx_to_text($this->filename);
        } elseif ($file_ext == "pptx") {
            return $this->pptx_to_text($this->filename);
        }
        return "Invalid File Type";
    }
}

$docObj = new DocxConversion("test.docx"); // replace with your document name, with the correct doc or docx extension
echo $docText = $docObj->convertToText();


I recently had to extract plain text from various document formats, be it Microsoft Word documents or PDF. The task was completed, and with a somewhat wider list of possible inputs. So with this article I am opening a series of posts on reading text from the following file types: DOC, DOCX, RTF, ODT, and PDF, using PHP and no third-party utilities.

First, let me answer a perfectly reasonable question: "Why is this needed at all?" True, the plain text extracted from, say, a Word document is a rather jumbled mess. But that "mess" is quite enough to build, for example, a search index over a large store of office documents.

Another perfectly reasonable question: "Why not use third-party utilities such as antiword or xpdf, or, as a last resort, OLE on Windows?" Those were the requirements I was given, and besides, OLE is painfully slow even when the task can be solved with it.

Today, as a warm-up, I will cover the formats that are fairly simple for this task: Office Open XML, better known as DOCX from Microsoft, and OpenDocument Format, also known as ODT, from the ODF Alliance.

First, let's look inside a couple of files, where we see the following (docx in the back, odt in front):

The most important thing here is the first two characters, PK, at the start of the data. This means that both files are zip archives renamed to .docx/.odt. Open one, for example with Ctrl+PageDown in Total Commander, and we see a perfectly reasonable structure (odt on the left, docx on the right):
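
The same check can be done in code. The sketch below reads the first four bytes of a file and compares them with the usual ZIP local file header signature, PK followed by the bytes 0x03 0x04; looksLikeZip is a hypothetical helper name and the file names in the usage comment are placeholders.

```php
<?php
// Hypothetical helper: true when the file starts with the ZIP signature "PK\x03\x04".
// (An empty archive starts with a different PK record, but .docx/.odt files are never empty.)
function looksLikeZip(string $path): bool
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $signature = fread($handle, 4);
    fclose($handle);
    return $signature === "PK\x03\x04";
}

// Usage (placeholder file names):
// var_dump(looksLikeZip('report.docx'), looksLikeZip('report.odt'));
```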

So the data files we need are content.xml in ODT and word/document.xml in DOCX. To read the text from them, we write some simple code:

function odt2text($filename) {
    return getTextFromZippedXML($filename, "content.xml");
}
function docx2text($filename) {
    return getTextFromZippedXML($filename, "word/document.xml");
}
function getTextFromZippedXML($archiveFile, $contentFile) {
    // Create a ZipArchive instance...
    $zip = new ZipArchive;
    // ...and try to open the given zip file (open() returns true only on success,
    // otherwise an error code, so the comparison has to be strict)
    if ($zip->open($archiveFile) === true) {
        // On success, look for the data file in the archive
        if (($index = $zip->locateName($contentFile)) !== false) {
            // If found, read it into a string
            $content = $zip->getFromIndex($index);
            // Close the zip archive, we no longer need it
            $zip->close();

            // Then load the XML, expanding entities and, where possible, includes of other files,
            // while swallowing errors and warnings
            $xml = new DOMDocument();
            $xml->loadXML($content, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);

            // Finally return the data without the XML formatting tags
            return strip_tags($xml->saveXML());
        }
        $zip->close();
    }
    // If something went wrong, return an empty string
    return "";
}

Just 30 or so lines, and we get the text out of the file. The code works on PHP 5.2+ and requires php_zip.dll on Windows or the --enable-zip switch on Linux. If ZipArchive cannot be used (an old PHP version or missing libraries), the PclZip library, which reads zip files without relying on system support, will do fine.

Note that this code is only a starting point for reading text. After a run of articles under the banner "Text at any cost", I will try to describe the principles and implementation of reading formatted text.
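
One possible refinement of that starting point, offered here only as a sketch: strip_tags glues neighbouring paragraphs together with no separator, whereas walking the w:p and w:t elements with XPath keeps paragraph boundaries. docxParagraphs is a hypothetical helper; it assumes the zip and DOM extensions and the standard WordprocessingML namespace.

```php
<?php
// Hypothetical helper: reads word/document.xml from a .docx and returns one string per paragraph.
function docxParagraphs(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return array();
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false || $xml === '') {
        return array();
    }

    $dom = new DOMDocument();
    if (!@$dom->loadXML($xml)) {
        return array();
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $paragraphs = array();
    foreach ($xpath->query('//w:body//w:p') as $p) {
        $text = '';
        // Join the text of all runs that belong to this paragraph.
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}

// Usage (placeholder file name):
// echo implode("\n", docxParagraphs('text.docx'));
```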

Related:

  • msdn.microsoft.com/en-us/library/aa338205.aspx
  • www.i-rs.ru/Produkty/ODF-ISO-IEC-26300-2006/Dokumentaciya/Format-Open-Document-dlya-ofisnyh-prilozhenij-OpenDocument-v1.0.odt
  • Text at any cost: PDF
  • Text at any cost: RTF
  • Text at any cost: WCBFF and DOC

Next time I will talk about reading text from PDF without the help of xpdf, a harder task but one well within PHP's reach.
