From the author: not long ago our site published a lesson on creating MS Word documents with PHP and the PHPWord library. In the comments to that video, a reader asked how to read existing documents with the same library. That question prompted this lesson, in which we will learn to read previously created MS Word documents with PHPWord.
In this lesson we continue exploring PHPWord, specifically its tools for reading existing MS Word documents. Note that we will be working with an already installed library: this is the second lesson on the topic, so we will not dwell on the basics. I recommend going through the first part, "PHPWord: creating MS Word documents with PHP", before this one.
The starting point for the test script is a single file, index.php, which already loads the library.
require 'vendor/autoload.php';
First, let's create a variable that holds the path to the MS Word document we will work with.
$source = __DIR__ . '/docs/text.docx';
Next, recall that when you start working with the library you create an object of the main PhpWord class, but only when a new document is being created. When an existing MS Word file is read, the PhpWord object is instead produced for that document by reading it.
For reading existing documents, PHPWord provides a group of classes, each responsible for a particular file format. So the first step is to create an object of the appropriate reader class.
$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
Then, using this object, we read the MS Word document.
$phpWord = $objReader->load($source);
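Since each format has its own reader, the reader name can be derived from the file extension. The sketch below is only an illustration: readerNameFor() is a hypothetical helper, not part of PHPWord, and the list of reader names reflects my understanding of what IOFactory::createReader() accepts, so check it against the version you have installed.

```php
<?php
// Hypothetical helper (not part of PHPWord): map a file extension to the
// reader name that is passed to IOFactory::createReader().
function readerNameFor(string $path): string
{
    // Assumed reader names; verify them against your PHPWord version.
    $map = [
        'docx' => 'Word2007',
        'doc'  => 'MsDoc',
        'odt'  => 'ODText',
        'rtf'  => 'RTF',
        'html' => 'HTML',
    ];
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (!isset($map[$ext])) {
        throw new InvalidArgumentException("No reader known for .$ext files");
    }
    return $map[$ext];
}

// Intended use, with PHPWord installed:
// $reader  = \PhpOffice\PhpWord\IOFactory::createReader(readerNameFor($source));
// $phpWord = $reader->load($source);
```

Failing early on an unknown extension keeps the error close to its cause instead of surfacing later inside the reader.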
At this point the task of the lesson is essentially done: the document has been read, and its data sits in the structure of the newly created $phpWord object. But let's talk about how to get at the data stored in that object.
According to the official documentation, PHPWord places all the content of an MS Word document into separate sections. Each section contains a set of elements: text, table, image, link and so on. Elements, in turn, can be complex and contain nested elements of their own, as tables do.
Calling getSections() therefore gives us access to the sections of the document. It returns an array, so we can walk it with foreach.
foreach ($phpWord->getSections() as $section) {
    $arrays = $section->getElements();
}
Inside the loop, for each section, we get the array of its elements by calling getElements(). Since the return value is an array, we can walk it with the same kind of loop to reach every item.
foreach ($arrays as $e) {
}
On each iteration, $e holds the object for one element of the section. It might seem that we could get the text right away, but first we need to check what $e actually contains.
if (get_class($e) === 'PhpOffice\PhpWord\Element\TextRun') {
If the variable holds an object of class PhpOffice\PhpWord\Element\TextRun, we are dealing with a complex text area that contains several simpler elements. So we call getElements() again and walk the result with foreach.
<?php
require 'vendor/autoload.php';

$source = __DIR__ . '/docs/text.docx';

// Create a reader for the .docx format and load the document.
$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
$phpWord = $objReader->load($source);

$body = '';
foreach ($phpWord->getSections() as $section) {
    $arrays = $section->getElements();
    foreach ($arrays as $e) {
        // A TextRun is a container of simpler text elements.
        if (get_class($e) === 'PhpOffice\PhpWord\Element\TextRun') {
            foreach ($e->getElements() as $text) {
                $font = $text->getFontStyle();
                $size = $font->getSize() / 10;
                $bold = $font->isBold() ? 'font-weight:700;' : '';
                $color = $font->getColor();
                $fontFamily = $font->getName();
                $body .= '<span style="font-size:' . $size . 'em;font-family:' . $fontFamily . '; ' . $bold . '; color:#' . $color . '">';
                $body .= $text->getText() . '</span>';
            }
        }
    }
}

include 'templ.php';
Thus, for the current document, $text receives a Text element object, that is, the simplest text element, whose content is returned by getText(). To get the formatting of the current element, call getFontStyle(), which returns an object that holds this information in private properties. Their values are read through dedicated methods:
getSize() - font size;
isBold() - returns true if the font is bold;
getColor() - text color;
getName() - font name.
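The four getters above can be folded into one small function that builds an inline CSS string. This is a minimal sketch, not the lesson's code: fontStyleToCss() is a hypothetical name, it assumes the size is given in points (the lesson's listing divides it by 10 and uses em instead), and it skips any getter that returns null. It also returns an empty string when the style is not an object, since an element may have no explicit font style.

```php
<?php
// Hypothetical helper: build an inline CSS string from any object exposing
// getSize(), isBold(), getColor() and getName(), the getters listed above.
function fontStyleToCss($font): string
{
    if (!is_object($font)) {
        return ''; // no style object available (null or a style name)
    }
    $css = [];
    if ($font->getSize() !== null) {
        $css[] = 'font-size:' . $font->getSize() . 'pt'; // assumes points
    }
    if ($font->getName() !== null) {
        $css[] = 'font-family:' . $font->getName();
    }
    if ($font->isBold()) {
        $css[] = 'font-weight:700';
    }
    if ($font->getColor() !== null) {
        $css[] = 'color:#' . $font->getColor();
    }
    return implode(';', $css);
}
```

When the result is placed into an HTML attribute, it is worth passing it through htmlspecialchars() first, because font names come from the document and are not under your control.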
All of the document's content is appended to $body, whose value is then rendered on screen through a template. Empty lines in the document are TextBreak element objects, which can be handled like this:
else if (get_class($e) === 'PhpOffice\PhpWord\Element\TextBreak') {
    $body .= '<br />';
}
Handling tables takes quite a few more lines of code, because a table is a complex Table element made up of rows, which in turn are made up of cells. Moreover, each cell can contain nested elements; a cell can even hold another table. The full code, including table handling, follows.
<?php
require 'vendor/autoload.php';

$source = __DIR__ . '/docs/text.docx';

// Create a reader for the .docx format and load the document.
$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
$phpWord = $objReader->load($source);

$body = '';
foreach ($phpWord->getSections() as $section) {
    $arrays = $section->getElements();
    foreach ($arrays as $e) {
        if (get_class($e) === 'PhpOffice\PhpWord\Element\TextRun') {
            // Complex text area: walk its nested text elements.
            foreach ($e->getElements() as $text) {
                $font = $text->getFontStyle();
                $size = $font->getSize() / 10;
                $bold = $font->isBold() ? 'font-weight:700;' : '';
                $color = $font->getColor();
                $fontFamily = $font->getName();
                $body .= '<span style="font-size:' . $size . 'em;font-family:' . $fontFamily . '; ' . $bold . '; color:#' . $color . '">';
                $body .= $text->getText() . '</span>';
            }
        } else if (get_class($e) === 'PhpOffice\PhpWord\Element\TextBreak') {
            // Empty line.
            $body .= '<br />';
        } else if (get_class($e) === 'PhpOffice\PhpWord\Element\Table') {
            // Table: rows, then cells, then the elements inside each cell.
            $body .= '<table border="2px">';
            $rows = $e->getRows();
            foreach ($rows as $row) {
                $body .= '<tr>';
                $cells = $row->getCells();
                foreach ($cells as $cell) {
                    $body .= '<td style="width:' . $cell->getWidth() . '">';
                    $celements = $cell->getElements();
                    foreach ($celements as $celem) {
                        if (get_class($celem) === 'PhpOffice\PhpWord\Element\Text') {
                            $body .= $celem->getText();
                        } else if (get_class($celem) === 'PhpOffice\PhpWord\Element\TextRun') {
                            foreach ($celem->getElements() as $text) {
                                $body .= $text->getText();
                            }
                        }
                    }
                    $body .= '</td>';
                }
                $body .= '</tr>';
            }
            $body .= '</table>';
        } else {
            $body .= $e->getText();
        }
    }
    // Only the first section is processed.
    break;
}

include 'templ.php';
To get the rows, call getRows(); it returns an array of objects, one per row (Row elements). We walk this array with foreach and, for each row, get its cells with getCells(). That again returns an array, which we walk with another loop. Then, for each cell, we call getElements() to get its elements, and so on, following the principle described above.
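Because the same "container of containers" pattern repeats at every level, the traversal can also be written recursively. The sketch below is an assumption-laden illustration rather than the lesson's code: collectText() is a hypothetical helper that relies only on the method names used in this lesson (getRows(), getCells(), getElements(), getText()) and ignores formatting altogether.

```php
<?php
// Hypothetical helper: collect plain text from an element tree by duck typing.
// It relies only on the method names used in the lesson.
function collectText($node): string
{
    $out = '';
    if (method_exists($node, 'getRows')) {
        // Table: rows, then cells; each cell is itself a container.
        foreach ($node->getRows() as $row) {
            foreach ($row->getCells() as $cell) {
                $out .= collectText($cell) . ' ';
            }
            $out .= "\n";
        }
    } elseif (method_exists($node, 'getElements')) {
        // Section, cell or text run: recurse into the children.
        foreach ($node->getElements() as $child) {
            $out .= collectText($child);
        }
    } elseif (method_exists($node, 'getText')) {
        // Simple text element; ignore anything that is not a plain string.
        $text = $node->getText();
        $out .= is_string($text) ? $text : '';
    }
    return $out;
}

// Intended use, with a document loaded by PHPWord:
// foreach ($phpWord->getSections() as $section) { echo collectText($section); }
```

Elements that expose none of these methods, such as text breaks, simply contribute nothing, so nested tables and unknown element types do not need special cases.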
All that remains is to output the value of $body in whatever way suits you.
That concludes this lesson. As you can see, PHPWord provides fairly powerful tools for working with MS Word documents, though getting data out of its objects takes some effort.
All the best, and happy coding!
PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), Rich Text Format (RTF), HTML, and PDF.
PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord is aimed to be a high quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading the Developers’ Documentation.
If you have any questions, please ask on Stack Overflow.
Read more about PHPWord:
- Features
- Requirements
- Installation
- Getting started
- Contributing
- Developers’ Documentation
Features
With PHPWord, you can create OOXML, ODF, or RTF documents dynamically using your PHP scripts. Below are some of the things that you can do with PHPWord library:
- Set document properties, e.g. title, subject, and creator.
- Create document sections with different settings, e.g. portrait/landscape, page size, and page numbering
- Create header and footer for each section
- Set default font type, font size, and paragraph style
- Use UTF-8 and East Asia fonts/characters
- Define custom font styles (e.g. bold, italic, color) and paragraph styles (e.g. centered, multicolumns, spacing) either as named style or inline in text
- Insert paragraphs, either as a simple text or complex one (a text run) that contains other elements
- Insert titles (headers) and table of contents
- Insert text breaks and page breaks
- Insert and format images, either local, remote, or as page watermarks
- Insert binary OLE Objects such as Excel or Visio
- Insert and format tables with customized properties for each row (e.g. repeat as header row) and cell (e.g. background color, rowspan, colspan)
- Insert list items as bulleted, numbered, or multilevel
- Insert hyperlinks
- Insert footnotes and endnotes
- Insert drawing shapes (arc, curve, line, polyline, rect, oval)
- Insert charts (pie, doughnut, bar, line, area, scatter, radar)
- Insert form fields (textinput, checkbox, and dropdown)
- Create document from templates
- Use XSL 1.0 style sheets to transform headers, main document part, and footers of an OOXML template
- … and many more features in progress
Requirements
PHPWord requires the following:
- PHP 7.1+
- XML Parser extension
- Laminas Escaper component
- Zip extension (optional, used to write OOXML and ODF)
- GD extension (optional, used to add images)
- XMLWriter extension (optional, used to write OOXML and ODF)
- XSL extension (optional, used to apply XSL style sheet to template)
- dompdf library (optional, used to write PDF)
Installation
PHPWord is installed via Composer.
To add a dependency to PHPWord in your project, either
Run the following to use the latest stable version
composer require phpoffice/phpword
or if you want the latest unreleased version
composer require phpoffice/phpword:dev-master
Getting started
The following is a basic usage example of the PHPWord library.
<?php
require_once 'bootstrap.php';

// Creating the new document...
$phpWord = new \PhpOffice\PhpWord\PhpWord();

/* Note: any element you append to a document must reside inside of a Section. */

// Adding an empty Section to the document...
$section = $phpWord->addSection();
// Adding Text element to the Section having font styled by default...
$section->addText(
    '"Learn from yesterday, live for today, hope for tomorrow. '
        . 'The important thing is not to stop questioning." '
        . '(Albert Einstein)'
);

/*
 * Note: it's possible to customize font style of the Text element you add in three ways:
 * - inline;
 * - using named font style (new font style object will be implicitly created);
 * - using explicitly created font style object.
 */

// Adding Text element with font customized inline...
$section->addText(
    '"Great achievement is usually born of great sacrifice, '
        . 'and is never the result of selfishness." '
        . '(Napoleon Hill)',
    array('name' => 'Tahoma', 'size' => 10)
);

// Adding Text element with font customized using named font style...
$fontStyleName = 'oneUserDefinedStyle';
$phpWord->addFontStyle(
    $fontStyleName,
    array('name' => 'Tahoma', 'size' => 10, 'color' => '1B2232', 'bold' => true)
);
$section->addText(
    '"The greatest accomplishment is not in never falling, '
        . 'but in rising again after you fall." '
        . '(Vince Lombardi)',
    $fontStyleName
);

// Adding Text element with font customized using explicitly created font style object...
$fontStyle = new \PhpOffice\PhpWord\Style\Font();
$fontStyle->setBold(true);
$fontStyle->setName('Tahoma');
$fontStyle->setSize(13);
$myTextElement = $section->addText('"Believe you can and you\'re halfway there." (Theodor Roosevelt)');
$myTextElement->setFontStyle($fontStyle);

// Saving the document as OOXML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$objWriter->save('helloWorld.docx');

// Saving the document as ODF file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'ODText');
$objWriter->save('helloWorld.odt');

// Saving the document as HTML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
$objWriter->save('helloWorld.html');

/* Note: we skip RTF, because it's not XML-based and requires a different example. */
/* Note: we skip PDF, because "HTML-to-PDF" approach is used to create PDF documents. */
More examples are provided in the samples folder. For easy access to those samples, launch php -S localhost:8000 in the samples directory, then browse to http://localhost:8000 to view the samples.
You can also read the Developers’ Documentation for more detail.
Contributing
We welcome everyone to contribute to PHPWord. Below are some of the things that you can do to contribute.
- Read our contributing guide.
- Fork us and request a pull to the master branch.
- Submit bug reports or feature requests to GitHub.
- Follow @PHPWord and @PHPOffice on Twitter.
Is it possible to read and write Word (2003 and 2007) files in PHP without using a COM object?
I know that I can:
$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose($file);
but Word will read it as an HTML file not a native .doc file.
asked Oct 9, 2008 at 18:09
UnkwnTech
Reading binary Word documents would involve writing a parser according to the published file format specifications for the DOC format. I don't think this is a really feasible solution.
You could use the Microsoft Office XML formats for reading and writing Word files — this is compatible with the 2003 and 2007 version of Word. For reading you have to ensure that the Word documents are saved in the correct format (it’s called Word 2003 XML-Document in Word 2007). For writing you just have to follow the openly available XML schema. I’ve never used this format for writing out Office documents from PHP, but I’m using it for reading in an Excel worksheet (naturally saved as XML-Spreadsheet 2003) and displaying its data on a web page. As the files are plainly XML data it’s no problem to navigate within and figure out how to extract the data you need.
The other option, a Word 2007 only option (unless the OpenXML file formats are installed in your Word 2003), would be to resort to OpenXML. As databyss pointed out here, the DOCX file format is just a ZIP archive with XML files included. There are a lot of resources on MSDN regarding the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated, I think; it just depends on how much time you'll invest.
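To illustrate the "ZIP archive with XML files" point, here is a minimal sketch that pulls the main document part out of a .docx file with PHP's ZipArchive class. readDocxXml() is a hypothetical helper name, and the sketch assumes the main part lives at word/document.xml, which is the usual location but is strictly speaking defined by the package relationships.

```php
<?php
// Hypothetical helper: return the raw XML of the main document part of a
// .docx file, or null if the file is not a readable ZIP package.
function readDocxXml(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // open() returns an error code on failure
    }
    // Assumed location of the main part inside the package.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? null : $xml;
}
```

The returned string is ordinary XML, so it can be handed to DOMDocument or SimpleXML for further processing. The zip extension has to be enabled for ZipArchive to be available.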
Perhaps you can have a look at PHPExcel which is a library able to write to Excel 2007 files and read from Excel 2007 files using the OpenXML standard. You could get an idea of the work involved when trying to read and write OpenXML Word documents.
answered Nov 5, 2008 at 13:04
Stefan Gehrig
This works with versions of Office before 2007 and is pure PHP, no COM crap. I'm still trying to figure out 2007.
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc)
{
    $fileHandle = fopen($userDoc, "r");
    $line = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        // Keep only non-empty slices that contain no NUL byte.
        if ($pos === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Drop every character outside a small whitelist.
    $outtext = preg_replace('/[^a-zA-Z0-9\s,.\-\n\r\t@\/_()]/', "", $outtext);
    return $outtext;
}

$userDoc = "cv.doc";
$text = parseWord($userDoc);
echo $text;
?>
UnkwnTech
answered Nov 5, 2008 at 12:35
You can use Antiword; it is a free MS Word reader for Linux and most popular operating systems.
$document_file = 'c:\file.doc';
$text_from_doc = shell_exec('/usr/local/bin/antiword ' . escapeshellarg($document_file));
answered May 23, 2009 at 0:57
Mantichora
I don’t know about reading native Word documents in PHP, but if you want to write a Word document in PHP, WordprocessingML (aka WordML) might be a good solution. All you have to do is create an XML document in the correct format. I believe Word 2003 and 2007 both support WordML.
answered Oct 10, 2008 at 0:23
Joe Lencioni
Just updating the code
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc)
{
$fileHandle = fopen($userDoc, "r");
$word_text = @fread($fileHandle, filesize($userDoc));
$line = "";
$tam = filesize($userDoc);
$nulos = 0;
$caracteres = 0;
for($i=1536; $i<$tam; $i++)
{
$line .= $word_text[$i];
if( $word_text[$i] === chr(0) )
{
$nulos++;
}
else
{
$nulos=0;
$caracteres++;
}
if( $nulos>1996)
{
break;
}
}
//echo $caracteres;
$lines = explode(chr(0x0D),$line);
//$outtext = "<pre>";
$outtext = "";
foreach($lines as $thisline)
{
$tam = strlen($thisline);
if( !$tam )
{
continue;
}
$new_line = "";
for($i=0; $i<$tam; $i++)
{
$onechar = $thisline[$i];
if( $onechar > chr(240) )
{
continue;
}
if( $onechar >= chr(0x20) )
{
$caracteres++;
$new_line .= $onechar;
}
if( $onechar == chr(0x14) )
{
$new_line .= "</a>";
}
if( $onechar == chr(0x07) )
{
$new_line .= "\t";
if( isset($thisline[$i+1]) )
{
if( $thisline[$i+1] == chr(0x07) )
{
$new_line .= "\n";
}
}
}
}
// replace with hyperlink
$new_line = str_replace("HYPERLINK" ,"<a href=",$new_line);
$new_line = str_replace('\o' ,">",$new_line);
$new_line .= "\n";
// image links
$new_line = str_replace("INCLUDEPICTURE" ,"<br><img src=",$new_line);
$new_line = str_replace('\*' ,"><br>",$new_line);
$new_line = str_replace("MERGEFORMATINET" ,"",$new_line);
$outtext .= nl2br($new_line);
}
return $outtext;
}
$userDoc = "custo.doc";
$userDoc = "Cultura.doc";
$text = parseWord($userDoc);
echo $text;
?>
answered Apr 4, 2011 at 2:43
WIlson
Most probably you won’t be able to read Word documents without COM.
Writing was covered in this topic
answered Oct 10, 2008 at 2:17
Sergey Kornilov
2007 might be a bit complicated as well.
The .docx format is a zip file that contains a few folders with other files in them for formatting and other stuff.
Rename a .docx file to .zip and you’ll see what I mean.
So if you can work within zip files in PHP, you should be on the right path.
www.phplivedocx.org is a SOAP-based service, which means you always need to be online to test your files, and it does not have enough usage examples. Strangely, I found out only after two days of downloading (it also requires Zend Framework) that it is a SOAP-based program (cursed me!). I think that without COM it is just not possible on a Linux server, and the only option is to convert the .doc file into another format that PHP can parse.
answered Sep 13, 2009 at 17:45
Source gotten from
Use the following class directly to read a Word document:
class DocxConversion{
private $filename;
public function __construct($filePath) {
$this->filename = $filePath;
}
private function read_doc() {
$fileHandle = fopen($this->filename, "r");
$line = @fread($fileHandle, filesize($this->filename));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace('/[^a-zA-Z0-9\s,.\-\n\r\t@\/_()]/',"",$outtext);
return $outtext;
}
private function read_docx(){
$striped_content = '';
$content = '';
$zip = zip_open($this->filename);
if (!$zip || is_numeric($zip)) return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
if (zip_entry_name($zip_entry) != "word/document.xml") continue;
$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}// end while
zip_close($zip);
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);
return $striped_content;
}
/************************excel sheet************************************/
function xlsx_to_text($input_file){
$xml_filename = "xl/sharedStrings.xml"; //content file name
$zip_handle = new ZipArchive;
$output_text = "";
if(true === $zip_handle->open($input_file)){
if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index);
$xml_handle = new DOMDocument(); // loadXML() must be called on an instance
$xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$output_text = strip_tags($xml_handle->saveXML());
}else{
$output_text .="";
}
$zip_handle->close();
}else{
$output_text .="";
}
return $output_text;
}
/*************************power point files*****************************/
function pptx_to_text($input_file){
$zip_handle = new ZipArchive;
$output_text = "";
if(true === $zip_handle->open($input_file)){
$slide_number = 1; //loop through slide files
while(($xml_index = $zip_handle->locateName("ppt/slides/slide".$slide_number.".xml")) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index);
$xml_handle = new DOMDocument(); // loadXML() must be called on an instance
$xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$output_text .= strip_tags($xml_handle->saveXML());
$slide_number++;
}
if($slide_number == 1){
$output_text .="";
}
$zip_handle->close();
}else{
$output_text .="";
}
return $output_text;
}
public function convertToText() {
if(isset($this->filename) && !file_exists($this->filename)) {
return "File Not exists";
}
$fileArray = pathinfo($this->filename);
$file_ext = $fileArray['extension'];
if($file_ext == "doc" || $file_ext == "docx" || $file_ext == "xlsx" || $file_ext == "pptx")
{
if($file_ext == "doc") {
return $this->read_doc();
} elseif($file_ext == "docx") {
return $this->read_docx();
} elseif($file_ext == "xlsx") {
return $this->xlsx_to_text($this->filename);
}elseif($file_ext == "pptx") {
return $this->pptx_to_text($this->filename);
}
} else {
return "Invalid File Type";
}
}
}
$docObj = new DocxConversion("test.docx"); //replace your document name with correct extension doc or docx
echo $docText= $docObj->convertToText();
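As an alternative to string replacement plus strip_tags(), the text can be read through the XML structure itself. The following is a rough sketch under stated assumptions, not a drop-in replacement for the class above: wordXmlToText() is a hypothetical helper that takes the content of word/document.xml as a string and emits one line per w:p paragraph. It ignores tabs, line breaks inside runs, and text boxes.

```php
<?php
// Hypothetical helper: turn WordprocessingML (the content of
// word/document.xml) into plain text, one line per paragraph.
function wordXmlToText(string $xml): string
{
    $ns  = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $dom = new DOMDocument();
    // LIBXML_NONET blocks network access while parsing; errors are silenced.
    if ($xml === '' || !@$dom->loadXML($xml, LIBXML_NONET)) {
        return '';
    }
    $lines = [];
    foreach ($dom->getElementsByTagNameNS($ns, 'p') as $paragraph) {
        $line = '';
        // Concatenate the text of every w:t node inside this paragraph.
        foreach ($paragraph->getElementsByTagNameNS($ns, 't') as $run) {
            $line .= $run->textContent;
        }
        $lines[] = $line;
    }
    return implode("\n", $lines);
}
```

Working on the parsed tree avoids depending on the exact order of closing tags, which the str_replace() patterns above rely on. Paragraphs inside table cells are plain w:p elements as well, so they come out as separate lines.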
answered Jul 3, 2019 at 10:25
Office 2007 .docx should be possible since it’s an XML standard. Word 2003 most likely requires COM to read, even with the standards now published by MS, since those standards are huge. I haven’t seen many libraries written to match them yet.
answered Oct 10, 2008 at 2:45
acrosman
I don’t know what you are going to use it for, but I needed .doc support for search indexing; What I did was use a little commandline tool called «catdoc»; This transfers the contents of the Word document to plain text so it can be indexed. If you need to keep formatting and stuff this is not your tool.
answered Oct 10, 2008 at 15:25
fijter
phpLiveDocx is a Zend Framework component and can read and write DOC and DOCX files in PHP on Linux, Windows and Mac.
See the project web site at:
answered May 14, 2009 at 7:03
One way to manipulate Word files with PHP that you may find interesting is with the help of PHPDocX.
You may see how it works having a look at its online tutorial.
You can insert or extract contents or even merge multiple Word files into a single one.
answered Sep 28, 2012 at 16:44
Would the .rtf format work for your purposes? .rtf can easily be converted to and from .doc format, but it is written in plaintext (with control commands embedded). This is how I plan to integrate my application with Word documents.
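To show what "plaintext with control commands embedded" looks like in practice, here is a minimal sketch that wraps text in an RTF document. textToRtf() is a hypothetical helper; it assumes ASCII input, a single Arial font and 12 pt text (\fs24 is given in half-points), and it does nothing beyond escaping and paragraph breaks.

```php
<?php
// Hypothetical helper: wrap ASCII text in a minimal RTF document.
// Backslashes and braces are escaped; newlines become paragraph breaks.
function textToRtf(string $text): string
{
    // strtr() makes a single pass, so replacements are never re-replaced.
    $escaped = strtr($text, [
        '\\'   => '\\\\',
        '{'    => '\\{',
        '}'    => '\\}',
        "\r\n" => "\\par\n",
        "\n"   => "\\par\n",
    ]);
    return '{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}\f0\fs24 ' . $escaped . '}';
}

// Example: file_put_contents('hello.rtf', textToRtf("Hello\nWorld"));
```

Non-ASCII characters would need RTF's Unicode escapes, which this sketch leaves out on purpose.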
answered Jan 24, 2009 at 5:09
Josh Smeaton
I'm working on the same kind of project (an online word processor), but I've chosen C#.NET and ASP.NET. From the survey I did, I learned that with the Open XML SDK and VSTO (Visual Studio Tools for Office) we can easily work with a Word file, manipulate it, and even convert it internally into several formats such as .odt, .pdf and .docx.
So go to msdn.microsoft.com and study the Office development tab thoroughly. It's the easiest way to do this, as all the functions we need to implement are already available in .NET!
But as you want to do your project in PHP, you can do it in Visual Studio and .NET, as PHP is also one of the .NET-compliant languages!
answered Sep 5, 2010 at 14:17
Noddy Cha
I have the same case.
I guess I am going to use a cheap 50 MB Windows-based hosting plan with a free domain to convert my files on, for my PHP server. Linking them is easy.
All you need is an ASP.NET page that receives the .doc file via POST and replies with the result over HTTP, so a simple cURL call would do it.
answered Oct 11, 2010 at 19:12
// For DOCX: if you want to preserve whitespace and also take care of table rows (tr) and cells (tc), use the code below and modify it to your taste. Note that it downloads the file from a remote or local URL first.
//=========DOCX===========
function extractDocxText($url,$file_name){
$docx = get_url($url);
file_put_contents("tempf.docx",$docx);
$xml_filename = "word/document.xml"; //content file name
$zip_handle = new ZipArchive;
$output_text = "";
if(true === $zip_handle->open("tempf.docx")){
if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index);
//file_put_contents($input_file.".xml",$xml_datas);
$replace_newlines = preg_replace('/<w:p w[0-9-Za-z]+:[a-zA-Z0-9]+="[a-zA-z"0-9 :="]+">/',"\n\r",$xml_datas);
$replace_tableRows = preg_replace('/<w:tr>/',"\n\r",$replace_newlines);
$replace_tab = preg_replace('/<w:tab\/>/',"\t",$replace_tableRows);
$replace_paragraphs = preg_replace('/<\/w:p>/',"\n\r",$replace_tab);
$replace_other_Tags = strip_tags($replace_paragraphs);
$output_text = $replace_other_Tags;
}else{
$output_text .="";
}
$zip_handle->close();
}else{
$output_text .=" ";
}
chmod("tempf.docx", 0777); unlink(realpath("tempf.docx"));
//save to file or echo content
file_put_contents($file_name,$output_text);
echo $output_text;
}
//========PDF===========
//Requires installation in your Linux server
//sudo su
//apt-get install xpdf
function extractPdfText($url,$PDF_fullpath_or_Filename){
$pdf = get_url($url);
file_put_contents ("temppdf.txt", $pdf);
$content = pdf2text("temppdf.txt");
chmod("temppdf.txt", 0777); unlink(realpath("temppdf.txt"));
echo $content;
file_put_contents($PDF_fullpath_or_Filename,$content);
}
//========DOC==========
function extractDocText($url,$file_name){
$doc = get_url($url);
file_put_contents ("tempf.txt", $doc);
$fileHandle = fopen("tempf.txt", "r");
$line = @fread($fileHandle, filesize("tempf.txt"));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline){
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{} else {$outtext .= $thisline."\n\r";}
}
$content = preg_replace('/[^a-zA-Z0-9\s,.\-\n\r\t@\/_()]/',' ',$outtext);
//chmod("tempf.txt", 0777); unlink(realpath("tempf.txt"));
echo $content;
file_put_contents($file_name,$content);
}
//========XLSX==========
function extractXlsxText($url,$file_name){
$xlsx = get_url($url);
file_put_contents ("tempf.txt", $xlsx);
$content = "";
$dir = 'tempforxlsx';
// Unzip
$zip = new ZipArchive();
$zip->open("tempf.txt");
$zip->extractTo($dir);
// Open up shared strings & the first worksheet
$strings = simplexml_load_file($dir . '/xl/sharedStrings.xml');
$sheet = simplexml_load_file($dir . '/xl/worksheets/sheet1.xml');
// Parse the rows
$xlrows = $sheet->sheetData->row;
foreach ($xlrows as $xlrow) {
$arr = array();
// In each row, grab it's value
foreach ($xlrow->c as $cell) {
$v = (string) $cell->v;
// If it has a "t" (type?) of "s" (string?), use the value to look up string value
if (isset($cell['t']) && $cell['t'] == 's') {
$s = array();
$si = $strings->si[(int) $v];
// Register & alias the default namespace or you'll get empty results in the xpath query
$si->registerXPathNamespace('n', 'http://schemas.openxmlformats.org/spreadsheetml/2006/main');
// Cat together all of the 't' (text?) node values
foreach($si->xpath('.//n:t') as $t) {
$content .= $t." ";} }
}
}
echo $content;
file_put_contents($file_name,$content);
}
//========PPT==========
function extractPptText($url,$file_name){
$ppt = file_get_contents($url);
file_put_contents ("tempf.ppt", $ppt);
$fileHandle = fopen("tempf.ppt", "r");
$line = @fread($fileHandle, filesize("tempf.ppt"));
$lines = explode(chr(0x0f),$line);
$outtext = '';
foreach($lines as $thisline) {
if (strpos($thisline, chr(0x00).chr(0x00).chr(0x00)) == 1) {
$text_line = substr($thisline, 4);
$end_pos = strpos($text_line, chr(0x00));
$text_line = substr($text_line, 0, $end_pos);
$text_line = preg_replace('/[^a-zA-Z0-9\s,.\-\n\r\t@\/_()]/'," ",$text_line);
$outtext = substr($text_line, 0, $end_pos)."\n".$outtext;
}
}
//echo $outtext;
file_put_contents($file_name,$outtext);
}
//========PPTX==========
function extractPptxText($url,$file_name){
$xls = get_url($url);
file_put_contents ("tempf.txt", $xls);
$zip_handle = new ZipArchive;
$output_text = ' ';
if(true === $zip_handle->open("tempf.txt")){
$slide_number = 1; //loop through slide files
while(($xml_index = $zip_handle->locateName("ppt/slides/slide".$slide_number.".xml")) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index); // these four lines of codes
// below were
$xml_handle = new DOMDocument (); // added by me in order
$xml_handle->preserveWhiteSpace = true; // to preserve space between
$xml_handle->formatOutput = true; // each read data
$xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$output_text .= $xml_handle->saveXML();
$slide_number++;
}
if($slide_number == 1){
$output_text .= "";
}
$zip_handle->close();
}else{
$output_text .= "";
}
echo $output_text;
file_put_contents($file_name,$output_text);
}
/*
==========================================================================
=========================================================================
And below is the get_url() function: better than file_get_contents();
*/
function get_url( $url,$timeout = 5 )
{
$url = str_replace( "&amp;", "&", urldecode(trim($url)) );
$ch = curl_init();
curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $ch, CURLOPT_ENCODING, "" );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false ); # disables certificate verification for https urls; insecure, avoid in production
curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
$content = curl_exec( $ch );
//$response = curl_getinfo( $ch );
curl_close ( $ch );
return $content;
}
How to read and view docx files using PHP. Processing Word documents from PHP is becoming more common, and you can also create a new Word document and work with it. My previous article describes how to create a Word document using PHP.
Today we will read a docx file, convert it to text, and view it online. Let's begin with the steps and the code.
<?php
// Note: the procedural zip_* functions are deprecated as of PHP 8.0.
function kv_read_word($input_file){
    $kv_strip_texts = '';
    $kv_texts = '';
    if(!$input_file || !file_exists($input_file)) return false;
    $zip = zip_open($input_file);
    if (!$zip || is_numeric($zip)) return false;
    // Find the main document part inside the archive
    while ($zip_entry = zip_read($zip)) {
        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;
        $kv_texts .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
        zip_entry_close($zip_entry);
    }
    zip_close($zip);
    // Table cell boundaries become spaces, paragraph ends become line breaks
    $kv_texts = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $kv_texts);
    $kv_texts = str_replace('</w:r></w:p>', "\r\n", $kv_texts);
    $kv_strip_texts = nl2br(strip_tags($kv_texts, ''));
    return $kv_strip_texts;
}
?>
The above function parses the text in a Word document and returns it.
Now we pass the path of the input file to the function and print the result.
<?php
$kv_texts = kv_read_word('path/to/the/file/kvcodes.docx');
if($kv_texts !== false) {
    // kv_read_word() has already applied nl2br()
    echo $kv_texts;
} else {
    echo "Can't read that file.";
}
?>
That's all it takes to read a docx file and print it as text.
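Because the zip_* functions are deprecated in current PHP versions, the same idea can be sketched with the ZipArchive class instead. This is a minimal sketch, not the article's original method; `readDocxText` is a hypothetical name, and it returns plain text with newlines rather than HTML.

```php
<?php
/**
 * Hypothetical helper: extracts plain text from a .docx file.
 * Returns null when the file cannot be opened or has no main document part.
 */
function readDocxText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // open() returns an error code on failure
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null;
    }
    // Put a newline after every paragraph, then drop all remaining markup.
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

The caller decides how to present the result, for example `echo nl2br(htmlspecialchars($text));` for HTML output.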
Contents
- Introduction
- Features
- File formats
- Installing/configuring
- Requirements
- Installation
- Using samples
- General usage
- Basic example
- Settings
- Default font
- Document properties
- Measurement units
- Containers
- Sections
- Headers
- Footers
- Other containers
- Elements
- Texts
- Breaks
- Lists
- Tables
- Images
- Objects
- Table of contents
- Footnotes & endnotes
- Checkboxes
- Textboxes
- Fields
- Lines
- Shapes
- Charts
- FormFields
- Styles
- Section
- Font
- Paragraph
- Table
- Templates processing
- Writers & readers
- OOXML
- OpenDocument
- RTF
- HTML
- Recipes
- Frequently asked questions
- References
Introduction
PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), and Rich Text Format (RTF).
PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord is aimed to be a high quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading this Developers’ Documentation and the API Documentation.
Features
- Set document properties, e.g. title, subject, and creator.
- Create document sections with different settings, e.g. portrait/landscape, page size, and page numbering
- Create header and footer for each section
- Set default font type, font size, and paragraph style
- Use UTF-8 and East Asia fonts/characters
- Define custom font styles (e.g. bold, italic, color) and paragraph styles (e.g. centered, multicolumns, spacing) either as named style or inline in text
- Insert paragraphs, either as a simple text or complex one (a text run) that contains other elements
- Insert titles (headers) and table of contents
- Insert text breaks and page breaks
- Insert and format images, either local, remote, or as page watermarks
- Insert binary OLE Objects such as Excel or Visio
- Insert and format table with customized properties for each rows (e.g. repeat as header row) and cells (e.g. background color, rowspan, colspan)
- Insert list items as bulleted, numbered, or multilevel
- Insert hyperlinks
- Insert footnotes and endnotes
- Insert drawing shapes (arc, curve, line, polyline, rect, oval)
- Insert charts (pie, doughnut, bar, line, area, scatter, radar)
- Insert form fields (textinput, checkbox, and dropdown)
- Create document from templates
- Use XSL 1.0 style sheets to transform main document part of OOXML template
- … and many more features in progress
File formats
Below are the supported features for each file format.
Writers
Features | DOCX | ODT | RTF | HTML | ||
---|---|---|---|---|---|---|
Document Properties | Standard | ✓ | ✓ | ✓ | ✓ | |
Custom | ✓ | ✓ | ||||
Element Type | Text | ✓ | ✓ | ✓ | ✓ | ✓ |
Text Run | ✓ | ✓ | ✓ | ✓ | ✓ | |
Title | ✓ | ✓ | ✓ | ✓ | ||
Link | ✓ | ✓ | ✓ | ✓ | ✓ | |
Preserve Text | ✓ | |||||
Text Break | ✓ | ✓ | ✓ | ✓ | ✓ | |
Page Break | ✓ | ✓ | ||||
List | ✓ | |||||
Table | ✓ | ✓ | ✓ | ✓ | ✓ | |
Image | ✓ | ✓ | ✓ | ✓ | ||
Object | ✓ | |||||
Watermark | ✓ | |||||
Table of Contents | ✓ | |||||
Header | ✓ | |||||
Footer | ✓ | |||||
Footnote | ✓ | ✓ | ||||
Endnote | ✓ | ✓ | ||||
Graphs | 2D basic graphs | ✓ | ||||
2D advanced graphs | ||||||
3D graphs | ✓ | |||||
Math | OMML support | |||||
MathML support | ||||||
Bonus | Encryption | |||||
Protection |
Readers
Features | DOCX | ODT | RTF | HTML | |
---|---|---|---|---|---|
Document Properties | Standard | ✓ | |||
Custom | ✓ | ||||
Element Type | Text | ✓ | ✓ | ✓ | ✓ |
Text Run | ✓ | ||||
Title | ✓ | ✓ | |||
Link | ✓ | ||||
Preserve Text | ✓ | ||||
Text Break | ✓ | ||||
Page Break | ✓ | ||||
List | ✓ | ✓ | ✓ | ||
Table | ✓ | ✓ | |||
Image | ✓ | ||||
Object | |||||
Watermark | |||||
Table of Contents | |||||
Header | ✓ | ||||
Footer | ✓ | ||||
Footnote | ✓ | ||||
Endnote | ✓ | ||||
Graphs | 2D basic graphs | ||||
2D advanced graphs | |||||
3D graphs | |||||
Math | OMML support | ||||
MathML support | |||||
Bonus | Encryption | ||||
Protection |
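Reading goes through the same object model as writing: a loaded document exposes sections, and each section exposes its elements. The following is a minimal sketch, assuming PHPWord is installed via Composer; `collectText` is a hypothetical helper that relies only on the getText and getElements methods of the elements, and it skips anything else (tables, images).

```php
<?php
/**
 * Hypothetical helper: returns one string per top-level element,
 * joining the text of nested elements such as text runs.
 */
function collectText(iterable $elements): array
{
    $lines = [];
    foreach ($elements as $element) {
        if (!is_object($element)) {
            continue;
        }
        if (method_exists($element, 'getElements')) {
            // Container (e.g. a text run): concatenate the text of its children.
            $lines[] = implode('', collectText($element->getElements()));
        } elseif (method_exists($element, 'getText')) {
            $text = $element->getText();
            // getText() may return a string or another element object.
            $lines[] = is_string($text) ? $text : implode('', collectText([$text]));
        }
    }
    return $lines;
}

// Usage, only when PHPWord is available and a file name is given on the command line.
if (class_exists(\PhpOffice\PhpWord\IOFactory::class) && isset($argv[1])) {
    $phpWord = \PhpOffice\PhpWord\IOFactory::load($argv[1]); // Word2007 reader by default
    foreach ($phpWord->getSections() as $section) {
        echo implode(PHP_EOL, collectText($section->getElements())), PHP_EOL;
    }
}
```

Which element types actually survive reading depends on the reader; see the Readers table above.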
Contributing
We welcome everyone to contribute to PHPWord. Below are some of the things that you can do to contribute:
- Read our contributing guide
- Fork us and request a pull to the develop branch
- Submit bug reports or feature requests to GitHub
- Follow @PHPWord and @PHPOffice on Twitter
Installing/configuring
Requirements
Mandatory:
- PHP 5.3+
- PHP Zip extension
- PHP XML Parser extension
Optional PHP extensions:
- GD
- XMLWriter
- XSL
Installation
There are two ways to install PHPWord, i.e. via Composer or manually by downloading the library.
Using Composer
To install via Composer, add the following lines to your composer.json:
{ "require": { "phpoffice/phpword": "dev-master" } }
Manual install
To install manually, download the PHPWord package from GitHub. Extract the package and put the contents on your machine. To use the library, include src/PhpWord/Autoloader.php in your script and invoke Autoloader::register.
require_once '/path/to/src/PhpWord/Autoloader.php'; \PhpOffice\PhpWord\Autoloader::register();
Using samples
After installation, you can browse and use the samples that we've provided, either from the command line or in a browser. If you can access your PHPWord library folder in a browser, point your browser to the samples folder, e.g. http://localhost/PhpWord/samples/.
General usage
Basic example
The following is a basic example of the PHPWord library. More examples are provided in the samples folder.
<?php
require_once 'src/PhpWord/Autoloader.php';
\PhpOffice\PhpWord\Autoloader::register();

// Creating the new document...
$phpWord = new \PhpOffice\PhpWord\PhpWord();

/* Note: any element you append to a document must reside inside of a Section. */

// Adding an empty Section to the document...
$section = $phpWord->addSection();

// Adding Text element to the Section having font styled by default...
$section->addText(
    htmlspecialchars(
        '"Learn from yesterday, live for today, hope for tomorrow. '
        . 'The important thing is not to stop questioning." '
        . '(Albert Einstein)'
    )
);

/*
 * Note: it's possible to customize font style of the Text element you add in three ways:
 * - inline;
 * - using named font style (new font style object will be implicitly created);
 * - using explicitly created font style object.
 */

// Adding Text element with font customized inline...
$section->addText(
    htmlspecialchars(
        '"Great achievement is usually born of great sacrifice, '
        . 'and is never the result of selfishness." '
        . '(Napoleon Hill)'
    ),
    array('name' => 'Tahoma', 'size' => 10)
);

// Adding Text element with font customized using named font style...
$fontStyleName = 'oneUserDefinedStyle';
$phpWord->addFontStyle(
    $fontStyleName,
    array('name' => 'Tahoma', 'size' => 10, 'color' => '1B2232', 'bold' => true)
);
$section->addText(
    htmlspecialchars(
        '"The greatest accomplishment is not in never falling, '
        . 'but in rising again after you fall." '
        . '(Vince Lombardi)'
    ),
    $fontStyleName
);

// Adding Text element with font customized using explicitly created font style object...
$fontStyle = new \PhpOffice\PhpWord\Style\Font();
$fontStyle->setBold(true);
$fontStyle->setName('Tahoma');
$fontStyle->setSize(13);
$myTextElement = $section->addText(
    htmlspecialchars('"Believe you can and you\'re halfway there." (Theodore Roosevelt)')
);
$myTextElement->setFontStyle($fontStyle);

// Saving the document as OOXML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$objWriter->save('helloWorld.docx');

// Saving the document as ODF file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'ODText');
$objWriter->save('helloWorld.odt');

// Saving the document as HTML file...
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
$objWriter->save('helloWorld.html');

/* Note: we skip RTF, because it's not XML-based and requires a different example. */
/* Note: we skip PDF, because "HTML-to-PDF" approach is used to create PDF documents. */
Settings
The \PhpOffice\PhpWord\Settings class provides some options that affect the behavior of PHPWord. Below are the options.
XML Writer compatibility
This option sets XMLWriter::setIndent and XMLWriter::setIndentString. The default value of this option is true (compatible), which is required for OpenOffice to render OOXML documents correctly. You can set this option to false during development to make the resulting XML file easier to read.
\PhpOffice\PhpWord\Settings::setCompatibility(false);
Zip class
By default, PHPWord uses PHP ZipArchive to read or write ZIP compressed archives and the files inside them. If you can't have ZipArchive installed on your server, you can use the pure PHP alternative, PCLZip, which is included with PHPWord.
\PhpOffice\PhpWord\Settings::setZipClass(\PhpOffice\PhpWord\Settings::PCLZIP);
Default font
By default, every text appears in Arial 10 point. You can alter the default font by using the following two functions:
$phpWord->setDefaultFontName('Times New Roman'); $phpWord->setDefaultFontSize(12);
Document information
You can set the document information such as title, creator, and company name. Use the following functions:
$properties = $phpWord->getDocInfo(); $properties->setCreator('My name'); $properties->setCompany('My factory'); $properties->setTitle('My title'); $properties->setDescription('My description'); $properties->setCategory('My category'); $properties->setLastModifiedBy('My name'); $properties->setCreated(mktime(0, 0, 0, 3, 12, 2014)); $properties->setModified(mktime(0, 0, 0, 3, 14, 2014)); $properties->setSubject('My subject'); $properties->setKeywords('my, key, word');
Measurement units
The base length unit in Office Open XML is the twip. Twip means «TWentieth of an Inch Point», i.e. 1 twip = 1/1440 inch.
You can use PHPWord helper functions to convert inches, centimeters, or points to twips.
// Paragraph with 6 points space after
$phpWord->addParagraphStyle('My Style', array(
    'spaceAfter' => \PhpOffice\PhpWord\Shared\Converter::pointToTwip(6))
);
$section = $phpWord->addSection();
$sectionStyle = $section->getStyle();
// half inch left margin
$sectionStyle->setMarginLeft(\PhpOffice\PhpWord\Shared\Converter::inchToTwip(.5));
// 2 cm right margin
$sectionStyle->setMarginRight(\PhpOffice\PhpWord\Shared\Converter::cmToTwip(2));
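The arithmetic behind these conversions follows directly from 1 twip = 1/1440 inch, together with 1 point = 1/72 inch and 1 inch = 2.54 cm. The sketch below spells it out with plain functions; they are hypothetical stand-ins written for illustration, not the library's Converter class.

```php
<?php
const TWIPS_PER_INCH = 1440;  // from the definition of the twip
const TWIPS_PER_POINT = 20;   // 1 point = 1/72 inch, and 1440 / 72 = 20
const CM_PER_INCH = 2.54;

// Hypothetical helpers mirroring the unit arithmetic.
function inchToTwip(float $inches): float
{
    return $inches * TWIPS_PER_INCH;
}

function pointToTwip(float $points): float
{
    return $points * TWIPS_PER_POINT;
}

function cmToTwip(float $centimeters): float
{
    return $centimeters / CM_PER_INCH * TWIPS_PER_INCH;
}
```

With these definitions, the half-inch margin above is 720 twips and the 6-point spacing is 120 twips.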
Containers
Containers are objects where you can put elements (texts, lists, tables, etc). There are 3 main containers, i.e. sections, headers, and footers. There are 3 elements that can also act as containers, i.e. textruns, table cells, and footnotes.
Sections
Every visible element in Word is placed inside of a section. To create a section, use the following code:
$section = $phpWord->addSection($sectionStyle);
The $sectionStyle is an optional associative array that sets the section style. Example:
$sectionStyle = array( 'orientation' => 'landscape', 'marginTop' => 600, 'colsNum' => 2, );
Page number
You can change a section page number by using the pageNumberingStart style of the section.
// Method 1
$section = $phpWord->addSection(array('pageNumberingStart' => 1));
// Method 2
$section = $phpWord->addSection();
$section->getStyle()->setPageNumberingStart(1);
Multicolumn
You can change a section layout to multicolumn (like in a newspaper) by using the breakType and colsNum styles of the section.
// Method 1
$section = $phpWord->addSection(array('breakType' => 'continuous', 'colsNum' => 2));
// Method 2
$section = $phpWord->addSection();
$section->getStyle()->setBreakType('continuous');
$section->getStyle()->setColsNum(2);
Line numbering
You can apply line numbering to a section by using the lineNumbering style of the section.
// Method 1
$section = $phpWord->addSection(array('lineNumbering' => array()));
// Method 2
$section = $phpWord->addSection();
$section->getStyle()->setLineNumbering(array());
Below are the properties of the line numbering style.
- start: Line numbering starting value
- increment: Line number increments
- distance: Distance between text and line numbering in twips
- restart: Line numbering restart setting continuous|newPage|newSection
Headers
Each section can have its own header reference. To create a header, use the addHeader method:
$header = $section->addHeader();
Be sure to save the result in a local object. You can use all elements that are available for the footer. See «Footer» section for detail. Additionally, only inside of the header reference you can add watermarks or background pictures. See «Watermarks» section.
Footers
Each section can have its own footer reference. To create a footer, use the addFooter method:
$footer = $section->addFooter();
Be sure to save the result in a local object to add elements to a footer. You can add the following elements to footers:
- Texts (addText and addTextRun)
- Text breaks
- Images
- Tables
- Preserve text
See the «Elements» section for the details of each element.
Other containers
Textruns, table cells, and footnotes are elements that can also act as containers. See the corresponding «Elements» section for the details of each element.
Elements
Below is the matrix of element availability in each container. The columns show the containers while the rows list the elements.
Num | Element | Section | Header | Footer | Cell | Text Run | Footnote |
---|---|---|---|---|---|---|---|
1 | Text | v | v | v | v | v | v |
2 | Text Run | v | v | v | v | — | — |
3 | Link | v | v | v | v | v | v |
4 | Title | v | ? | ? | ? | ? | ? |
5 | Preserve Text | ? | v | v | v* | — | — |
6 | Text Break | v | v | v | v | v | v |
7 | Page Break | v | — | — | — | — | — |
8 | List | v | v | v | v | — | — |
9 | Table | v | v | v | v | — | — |
10 | Image | v | v | v | v | v | v |
11 | Watermark | — | v | — | — | — | — |
12 | Object | v | v | v | v | v | v |
13 | TOC | v | — | — | — | — | — |
14 | Footnote | v | — | — | v** | v** | — |
15 | Endnote | v | — | — | v** | v** | — |
16 | CheckBox | v | v | v | v | — | — |
17 | TextBox | v | v | v | v | — | — |
18 | Field | v | v | v | v | v | v |
19 | Line | v | v | v | v | v | v |
20 | Shape | v | v | v | v | v | v |
21 | Chart | v | — | — | — | — | — |
22 | Form Fields | v | v | v | v | v | v |
Legend:
v = Available
v* = Available only when inside header/footer
v** = Available only when inside section
- = Not available
? = Should be available
Texts
Text can be added by using the addText and addTextRun methods. addText is used for creating simple paragraphs that only contain text with the same style. addTextRun is used for creating complex paragraphs that contain text with different styles (some bold, other italics, etc) or other elements, e.g. images or links. The syntax is as follows:
$section->addText($text, [$fontStyle], [$paragraphStyle]); $textrun = $section->addTextRun([$paragraphStyle]);
You can use the $fontStyle and $paragraphStyle variables to define text formatting. There are 2 options to style the inserted text elements, i.e. inline style by using an array or defined style by adding a style definition.
Inline style examples:
$fontStyle = array('name' => 'Times New Roman', 'size' => 9); $paragraphStyle = array('align' => 'both'); $section->addText('I am simple paragraph', $fontStyle, $paragraphStyle); $textrun = $section->addTextRun(); $textrun->addText('I am bold', array('bold' => true)); $textrun->addText('I am italic', array('italic' => true)); $textrun->addText('I am colored', array('color' => 'AACC00'));
Defined style examples:
$fontStyle = array('color' => '006699', 'size' => 18, 'bold' => true); $phpWord->addFontStyle('fStyle', $fontStyle); $text = $section->addText('Hello world!', 'fStyle'); $paragraphStyle = array('align' => 'center'); $phpWord->addParagraphStyle('pStyle', $paragraphStyle); $text = $section->addText('Hello world!', 'pStyle');
Titles
If you want to structure your document or build a table of contents, you need titles or headings. To add a title to the document, use the addTitleStyle and addTitle methods.
$phpWord->addTitleStyle($depth, [$fontStyle], [$paragraphStyle]); $section->addTitle($text, [$depth]);
It's necessary to add a title style to your document because otherwise the title won't be detected as a real title.
Links
You can add Hyperlinks to the document by using the function addLink:
$section->addLink($linkSrc, [$linkName], [$fontStyle], [$paragraphStyle]);
- $linkSrc: The URL of the link.
- $linkName: Placeholder of the URL that appears in the document.
- $fontStyle: See «Font style» section.
- $paragraphStyle: See «Paragraph style» section.
Preserve texts
The addPreserveText method is used to add a page number or page count to headers or footers.
$footer->addPreserveText('Page {PAGE} of {NUMPAGES}.');
Breaks
Text breaks
Text breaks are empty new lines. To add text breaks, use the following syntax. All parameters are optional.
$section->addTextBreak([$breakCount], [$fontStyle], [$paragraphStyle]);
- $breakCount: How many lines
- $fontStyle: See «Font style» section.
- $paragraphStyle: See «Paragraph style» section.
Page breaks
There are two ways to insert a page break: using the addPageBreak method or using the pageBreakBefore style of a paragraph.
$section->addPageBreak();
Lists
To add a list item, use the function addListItem.
Basic usage:
$section->addListItem($text, [$depth], [$fontStyle], [$listStyle], [$paragraphStyle]);
Parameters:
- $text: Text that appears in the document.
- $depth: Depth of list item.
- $fontStyle: See «Font style» section.
- $listStyle: List style of the current element TYPE_NUMBER, TYPE_ALPHANUM, TYPE_BULLET_FILLED, etc. See list of constants in PHPWord_Style_ListItem.
- $paragraphStyle: See «Paragraph style» section.
Advanced usage:
You can also create your own numbering style by replacing the $listStyle parameter with the name of your numbering style.
$phpWord->addNumberingStyle( 'multilevel', array('type' => 'multilevel', 'levels' => array( array('format' => 'decimal', 'text' => '%1.', 'left' => 360, 'hanging' => 360, 'tabPos' => 360), array('format' => 'upperLetter', 'text' => '%2.', 'left' => 720, 'hanging' => 360, 'tabPos' => 720), ) ) ); $section->addListItem('List Item I', 0, null, 'multilevel'); $section->addListItem('List Item I.a', 1, null, 'multilevel'); $section->addListItem('List Item I.b', 1, null, 'multilevel'); $section->addListItem('List Item II', 0, null, 'multilevel');
Tables
To add tables, rows, and cells, use the addTable, addRow, and addCell methods:
$table = $section->addTable([$tableStyle]); $table->addRow([$height], [$rowStyle]); $cell = $table->addCell($width, [$cellStyle]);
Table style can be defined with addTableStyle:
$tableStyle = array( 'borderColor' => '006699', 'borderSize' => 6, 'cellMargin' => 50 ); $firstRowStyle = array('bgColor' => '66BBFF'); $phpWord->addTableStyle('myTable', $tableStyle, $firstRowStyle); $table = $section->addTable('myTable');
Cell span
You can span a cell over multiple columns by using gridSpan or over multiple rows by using vMerge.
$cell = $table->addCell(200); $cell->getStyle()->setGridSpan(5);
See Sample_09_Tables.php for more code samples.
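Vertical merging works per row: the first cell of the merged range gets the vMerge value restart, and the cell in the same column of each following row gets continue (the two values listed under the cell styles). The sketch below illustrates that pattern; `addMergedLabelColumn` is a hypothetical helper, and it assumes that addCell returns the new cell and that cells accept addText, as in the examples above.

```php
<?php
/**
 * Hypothetical helper: adds one row per value, with a first column
 * that is merged vertically across all of the rows.
 */
function addMergedLabelColumn($table, array $values, string $label, int $width = 2000): void
{
    foreach (array_values($values) as $i => $value) {
        $table->addRow();
        // The first row starts the merge, later rows continue it.
        $labelCell = $table->addCell($width, ['vMerge' => $i === 0 ? 'restart' : 'continue']);
        if ($i === 0) {
            $labelCell->addText($label);
        }
        $table->addCell($width)->addText($value);
    }
}

// Usage, only when PHPWord is available.
if (class_exists(\PhpOffice\PhpWord\PhpWord::class)) {
    $phpWord = new \PhpOffice\PhpWord\PhpWord();
    $table = $phpWord->addSection()->addTable();
    addMergedLabelColumn($table, ['Alpha', 'Beta', 'Gamma'], 'Group');
}
```

The continued cells are left empty on purpose, since Word shows only the content of the cell that starts the merge.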
Images
To add an image, use the addImage method on sections, headers, footers, textruns, or table cells.
$section->addImage($src, [$style]);
- source: String path to a local image or URL of a remote image
- styles: Array of styles for the image. See below.
Examples:
$section = $phpWord->addSection(); $section->addImage( 'mars.jpg', array( 'width' => 100, 'height' => 100, 'marginTop' => -1, 'marginLeft' => -1, 'wrappingStyle' => 'behind' ) ); $footer = $section->addFooter(); $footer->addImage('http://example.com/image.php'); $textrun = $section->addTextRun(); $textrun->addImage('http://php.net/logo.jpg');
Watermarks
To add a watermark (or page background image), your section needs a header reference. After creating a header, you can use the addWatermark method to add a watermark.
$section = $phpWord->addSection(); $header = $section->addHeader(); $header->addWatermark('resources/_earth.jpg', array('marginTop' => 200, 'marginLeft' => 55));
Objects
You can add OLE embeddings, such as Excel spreadsheets or PowerPoint presentations, to the document by using the addObject method.
$section->addObject($src, [$style]);
Table of contents
To add a table of contents (TOC), you can use the addTOC method. Your TOC can only be generated if you have added at least one title (see «Titles»).
$section->addTOC([$fontStyle], [$tocStyle], [$minDepth], [$maxDepth]);
- $fontStyle: See font style section
- $tocStyle: See available options below
- $minDepth: Minimum depth of header to be shown. Default 1
- $maxDepth: Maximum depth of header to be shown. Default 9
Options for $tocStyle:
- tabLeader: Fill type between the title text and the page number. Use the defined constants in PHPWord_Style_TOC.
- tabPos: The position of the tab where the page number appears in twips.
- indent: The indent factor of the titles in twips.
Footnotes & endnotes
You can create footnotes with addFootnote and endnotes with addEndnote in texts or textruns, but it's recommended to use a textrun for better layout. You can use addText, addLink, addTextBreak, addImage, and addObject on footnotes and endnotes.
On textrun:
$textrun = $section->addTextRun(); $textrun->addText('Lead text.'); $footnote = $textrun->addFootnote(); $footnote->addText('Footnote text can have '); $footnote->addLink('http://test.com', 'links'); $footnote->addText('.'); $footnote->addTextBreak(); $footnote->addText('And text break.'); $textrun->addText('Trailing text.'); $endnote = $textrun->addEndnote(); $endnote->addText('Endnote put at the end');
On text:
$section->addText('Lead text.'); $footnote = $section->addFootnote(); $footnote->addText('Footnote text.');
The footnote reference number is displayed as a decimal number starting from 1. This number uses the FooterReference style, which you can redefine with the addFontStyle method. The default value for this style is array('superScript' => true).
Checkboxes
Checkbox elements can be added to sections or table cells by using addCheckBox.
$section->addCheckBox($name, $text, [$fontStyle], [$paragraphStyle]);
- $name: Name of the check box.
- $text: Text following the check box
- $fontStyle: See «Font style» section.
- $paragraphStyle: See «Paragraph style» section.
Textboxes
To be completed.
Fields
To be completed.
Lines
To be completed.
Shapes
To be completed.
Charts
To be completed.
Form fields
To be completed.
Styles
Section
Below are the available styles for section:
- orientation: Page orientation, i.e. ‘portrait’ (default) or ‘landscape’
- marginTop: Page margin top in twips
- marginLeft: Page margin left in twips
- marginRight: Page margin right in twips
- marginBottom: Page margin bottom in twips
- borderTopSize: Border top size in twips
- borderTopColor: Border top color
- borderLeftSize: Border left size in twips
- borderLeftColor: Border left color
- borderRightSize: Border right size in twips
- borderRightColor: Border right color
- borderBottomSize: Border bottom size in twips
- borderBottomColor: Border bottom color
- headerHeight: Spacing to top of header
- footerHeight: Spacing to bottom of footer
- gutter: Page gutter spacing
- colsNum: Number of columns
- colsSpace: Spacing between columns
- breakType: Section break type (nextPage, nextColumn, continuous, evenPage, oddPage)
The following two styles are automatically set by the use of the orientation style. You can alter them but that's not recommended.
- pageSizeW: Page width in twips
- pageSizeH: Page height in twips
Font
Available font styles:
- name: Font name, e.g. Arial
- size: Font size, e.g. 20, 22
- hint: Font content type, default, eastAsia, or cs
- bold: Bold, true or false
- italic: Italic, true or false
- superScript: Superscript, true or false
- subScript: Subscript, true or false
- underline: Underline, dash, dotted, etc.
- strikethrough: Strikethrough, true or false
- doubleStrikethrough: Double strikethrough, true or false
- color: Font color, e.g. FF0000
- fgColor: Font highlight color, e.g. yellow, green, blue
- bgColor: Font background color, e.g. FF0000
- smallCaps: Small caps, true or false
- allCaps: All caps, true or false
Paragraph
Available paragraph styles:
- align: Paragraph alignment, left, right or center
- spaceBefore: Space before paragraph
- spaceAfter: Space after paragraph
- indent: Indent by how much
- hanging: Hanging by how much
- basedOn: Parent style
- next: Style for next paragraph
- widowControl: Allow first/last line to display on a separate page, true or false
- keepNext: Keep paragraph with next paragraph, true or false
- keepLines: Keep all lines on one page, true or false
- pageBreakBefore: Start paragraph on next page, true or false
- lineHeight: Text line height, e.g. 1.0, 1.5, etc.
- tabs: Set of custom tab stops
Table
Table styles:
- width: Table width in percent
- bgColor: Background color, e.g. ‘9966CC’
- border(Top|Right|Bottom|Left)Size: Border size in twips
- border(Top|Right|Bottom|Left)Color: Border color, e.g. ‘9966CC’
- cellMargin(Top|Right|Bottom|Left): Cell margin in twips
Row styles:
- tblHeader: Repeat table row on every new page, true or false
- cantSplit: Table row cannot break across pages, true or false
- exactHeight: Row height is exact or at least
Cell styles:
- width: Cell width in twips
- valign: Vertical alignment, top, center, both, bottom
- textDirection: Direction of text
- bgColor: Background color, e.g. ‘9966CC’
- border(Top|Right|Bottom|Left)Size: Border size in twips
- border(Top|Right|Bottom|Left)Color: Border color, e.g. ‘9966CC’
- gridSpan: Number of columns spanned
- vMerge: restart or continue
Image
Available image styles:
- width: Width in pixels
- height: Height in pixels
- align: Image alignment, left, right, or center
- marginTop: Top margin in inches, can be negative
- marginLeft: Left margin in inches, can be negative
- wrappingStyle: Wrapping style, inline, square, tight, behind, or infront
Numbering level
- start: Starting value
- format: Numbering format bullet|decimal|upperRoman|lowerRoman|upperLetter|lowerLetter
- restart: Restart numbering level symbol
- suffix: Content between numbering symbol and paragraph text tab|space|nothing
- text: Numbering level text e.g. %1 for nonbullet or bullet character
- align: Numbering symbol align left|center|right|both
- left: See paragraph style
- hanging: See paragraph style
- tabPos: See paragraph style
- font: Font name
- hint: See font style
Templates processing
You can create a .docx document template with included search-patterns which can be replaced by any value you wish. Only single-line values can be replaced.
To deal with a template file, use the new TemplateProcessor statement. After the TemplateProcessor instance is created, the document template is copied into the temporary directory. You can then use the TemplateProcessor::setValue method to change the value of a search pattern. The search-pattern model is: ${search-pattern}.
Example:
$templateProcessor = new TemplateProcessor('Template.docx'); $templateProcessor->setValue('Name', 'Somebody someone'); $templateProcessor->setValue('Street', 'Coming-Undone-Street 32');
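To make the ${search-pattern} model concrete, the sketch below performs the same kind of substitution on a plain string of document XML. It is only an illustration of the pattern model under simplifying assumptions, not the TemplateProcessor implementation; `fillPlaceholders` is a hypothetical helper.

```php
<?php
/**
 * Hypothetical helper: replaces every ${name} marker in $xml with the
 * matching value, escaping the value so the XML stays well-formed.
 */
function fillPlaceholders(string $xml, array $values): string
{
    $pairs = [];
    foreach ($values as $name => $value) {
        // Single quotes keep '${' literal instead of triggering interpolation.
        $pairs['${' . $name . '}'] = htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    }
    return strtr($xml, $pairs);
}
```

A marker is only found when it appears as one uninterrupted string in the XML. With the library itself, the processed template is then written out with TemplateProcessor::saveAs.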
It is not possible to directly add new OOXML elements to the template file being processed, but it is possible to transform the main document part of the template using XSLT (see TemplateProcessor::applyXslStyleSheet).
See Sample_07_TemplateCloneRow.php for an example of how to create multiple rows from a single row in a template by using TemplateProcessor::cloneRow.
See Sample_23_TemplateBlock.php for an example of how to clone a block of text using TemplateProcessor::cloneBlock and delete a block of text using TemplateProcessor::deleteBlock.
Writers & readers
OOXML
The package of OOXML document consists of the following files.
- _rels/
  - .rels
- docProps/
  - app.xml
  - core.xml
  - custom.xml
- word/
  - _rels/
    - document.xml.rels
  - media/
  - theme/
    - theme1.xml
  - document.xml
  - fontTable.xml
  - numbering.xml
  - settings.xml
  - styles.xml
  - webSettings.xml
- [Content_Types].xml
OpenDocument
Package
The package of OpenDocument document consists of the following files.
- META-INF/
  - manifest.xml
- Pictures/
- content.xml
- meta.xml
- styles.xml
content.xml
The structure of content.xml is described below.
- office:document-content
  - office:font-facedecls
  - office:automatic-styles
  - office:body
    - office:text
      - draw:*
      - office:forms
      - table:table
      - text:list
      - text:numbered-paragraph
      - text:p
      - text:table-of-contents
      - text:section
    - office:chart
    - office:image
    - office:drawing
styles.xml
The structure of styles.xml is described below.
- office:document-styles
  - office:styles
  - office:automatic-styles
  - office:master-styles
    - office:master-page
RTF
To be completed.
HTML
To be completed.
Recipes
Create float left image
Use absolute positioning relative to margin horizontally and to line vertically.
$imageStyle = array( 'width' => 40, 'height' => 40, 'wrappingStyle' => 'square', 'positioning' => 'absolute', 'posHorizontalRel' => 'margin', 'posVerticalRel' => 'line', ); $textrun->addImage('resources/_earth.jpg', $imageStyle); $textrun->addText($lipsumText);
Download the produced file automatically
Use php://output as the filename.
$phpWord = new \PhpOffice\PhpWord\PhpWord();
$section = $phpWord->addSection();
$section->addText('Hello World!');
$file = 'HelloWorld.docx';
header("Content-Description: File Transfer");
header('Content-Disposition: attachment; filename="' . $file . '"');
header('Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document');
header('Content-Transfer-Encoding: binary');
header('Cache-Control: must-revalidate, post-check=0, pre-check=0');
header('Expires: 0');
$xmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$xmlWriter->save("php://output");
Create numbered headings
Define a numbering style and title styles, and match the two styles (with pStyle and numStyle) as shown below.
$phpWord->addNumberingStyle(
    'hNum',
    array('type' => 'multilevel', 'levels' => array(
        array('pStyle' => 'Heading1', 'format' => 'decimal', 'text' => '%1'),
        array('pStyle' => 'Heading2', 'format' => 'decimal', 'text' => '%1.%2'),
        array('pStyle' => 'Heading3', 'format' => 'decimal', 'text' => '%1.%2.%3'),
        )
    )
);
$phpWord->addTitleStyle(1, array('size' => 16), array('numStyle' => 'hNum', 'numLevel' => 0));
$phpWord->addTitleStyle(2, array('size' => 14), array('numStyle' => 'hNum', 'numLevel' => 1));
$phpWord->addTitleStyle(3, array('size' => 12), array('numStyle' => 'hNum', 'numLevel' => 2));
$section->addTitle('Heading 1', 1);
$section->addTitle('Heading 2', 2);
$section->addTitle('Heading 3', 3);
Add a link within a title
Apply the 'HeadingN' paragraph style to a TextRun or Link. Sample code:
$phpWord = new \PhpOffice\PhpWord\PhpWord();
$phpWord->addTitleStyle(1, array('size' => 16, 'bold' => true));
$phpWord->addTitleStyle(2, array('size' => 14, 'bold' => true));
$phpWord->addFontStyle('Link', array('color' => '0000FF', 'underline' => 'single'));
$section = $phpWord->addSection();
// Textrun
$textrun = $section->addTextRun('Heading1');
$textrun->addText('The ');
$textrun->addLink('https://github.com/PHPOffice/PHPWord', 'PHPWord', 'Link');
// Link
$section->addLink('https://github.com/', 'GitHub', 'Link', 'Heading2');
Remove [Compatibility Mode] text in the MS Word title bar
Use the Metadata\Compatibility\setOoxmlVersion(n) method, where n is the version of Office (14 = Office 2010, 15 = Office 2013).
$phpWord->getCompatibility()->setOoxmlVersion(15);
Frequently asked questions
Is this the same as the PHPWord that I found on CodePlex?
No. This one is much better, with tons of new features that you can't find in PHPWord 0.6.3. Development on CodePlex has been halted and moved to GitHub to allow more participation from the crowd. The more the merrier, right?
I've been running PHPWord from CodePlex flawlessly, but I can't use the latest PHPWord from GitHub. Why?
PHPWord has required PHP 5.3+ since 0.8, while PHPWord 0.6.3 from CodePlex can run with PHP 5.2. There are a lot of new features that we get from PHP 5.3, and it's been around since 2009! You should upgrade your PHP version to use PHPWord 0.8+.
References
ISO/IEC 29500, Third edition, 2012-09-01
- Part 1: Fundamentals and Markup Language Reference
- Part 2: Open Packaging Conventions
- Part 3: Markup Compatibility and Extensibility
- Part 4: Transitional Migration Features
Formal specifications
- Oasis OpenDocument Standard Version 1.2
- Rich Text Format (RTF) Specification, version 1.9.1
Other resources
- DocumentFormat.OpenXml.Wordprocessing Namespace on MSDN
Is it possible to read and write Word files (2003 and 2007) in PHP without using a COM object? I know that I can do this:
$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose($file);
but Word will read it as an HTML file, not as a native .doc file.
Answer 1
Reading binary Word documents would require writing a parser that follows the published DOC file format specifications. I don't think that is a realistically feasible solution. You could use the Microsoft Office XML formats to read and write Word files; they are compatible with Word 2003 and 2007. For reading, you have to make sure the Word documents are saved in the correct format (it is called Word 2003 XML Document in Word 2007). For writing, you just have to follow the openly available XML schema. I have never used this format to write Office documents from PHP, but I do use it to read an Excel worksheet (naturally, saved as XML Spreadsheet 2003) and display its data on a web page. Since the files are plain XML data, it is not hard to find your way around them and work out how to extract the data you need.
The other option, for Word 2007 only (unless support for the OpenXML file formats is installed in your Word 2003), is to resort to OpenXML. The DOCX file format is just a ZIP archive with XML files inside. MSDN has plenty of resources on the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much harder, I think; it all depends on how much time you are willing to spend. Perhaps you can take a look at PHPExcel, a library that can write to and read from Excel 2007 files using the OpenXML standard. It will give you an idea of the work involved in reading and writing OpenXML Word documents.
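To make the "ZIP archive with XML files inside" point concrete, here is a minimal sketch of reading paragraph text from a DOCX with ZipArchive and DOMXPath. It is an illustration under stated assumptions, not the answerer's code: the helper name docxParagraphs is hypothetical, only w:t text runs are collected (tabs, line breaks, and fields are ignored), and the tiny archive built at the bottom stands in for a real document.

```php
<?php
// Hypothetical helper: returns the text of word/document.xml, one array item per paragraph.
function docxParagraphs(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return [];
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false || $xml === '') {
        return [];
    }

    $dom = new DOMDocument();
    if (!$dom->loadXML($xml, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return [];
    }
    $xpath = new DOMXPath($dom);
    // WordprocessingML main namespace
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $p) {
        $text = '';
        // Concatenate every text run that belongs to this paragraph
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}

// Stand-in document (assumption: a real .docx contains many more parts).
$documentXml = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>'
    . '<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t xml:space="preserve"> World</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    . '</w:body></w:document>';

$path = tempnam(sys_get_temp_dir(), 'docx');
$zip = new ZipArchive();
$zip->open($path, ZipArchive::CREATE | ZipArchive::OVERWRITE);
$zip->addFromString('word/document.xml', $documentXml);
$zip->close();

$paragraphs = docxParagraphs($path);
print_r($paragraphs);
unlink($path);
```

Walking paragraphs with XPath instead of stripping tags keeps paragraph boundaries, which is usually what matters when the text is displayed or indexed.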
Answer 2
This solution works with versions of Office earlier than 2007, and it is pure PHP without any COM:
<?php
/*****************************************************************
This approach uses detection of NUL (chr(0x00)) and end of line (chr(0x0D))
to decide where the text is:
- split the file contents into chunks on chr(0x0D)
- discard every chunk that contains a NUL
- stitch the remaining chunks together
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc) {
    // Read the whole file in binary mode
    $fileHandle = fopen($userDoc, "rb");
    $line = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        // Keep only non-empty chunks that contain no NUL byte
        if ($pos === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Drop every character outside a small whitelist
    $outtext = preg_replace('/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/', "", $outtext);
    return $outtext;
}
$userDoc = "cv.doc";
$text = parseWord($userDoc);
echo $text;
?>
Answer 3
Just updating the code from the previous answer:
<?php
/*****************************************************************
This approach uses detection of NUL (chr(0x00)) and end of line (chr(0x0D))
to decide where the text is:
- split the file contents into chunks on chr(0x0D)
- discard every chunk that contains a NUL
- stitch the remaining chunks together
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc) {
    $fileHandle = fopen($userDoc, "rb");
    $word_text = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $line = "";
    $tam = filesize($userDoc);
    $nulos = 0;      // consecutive NUL bytes seen so far
    $caracteres = 0; // number of non-NUL characters
    // Start at byte 1536 and copy until a long run of NUL bytes is found
    for ($i = 1536; $i < $tam; $i++) {
        $line .= $word_text[$i];
        if ($word_text[$i] === chr(0x00)) {
            $nulos++;
        } else {
            $nulos = 0;
            $caracteres++;
        }
        if ($nulos > 1996) {
            break;
        }
    }
    //echo $caracteres;
    $lines = explode(chr(0x0D), $line);
    //$outtext = "<pre>";
    $outtext = "";
    foreach ($lines as $thisline) {
        $tam = strlen($thisline);
        if (!$tam) {
            continue;
        }
        $new_line = "";
        for ($i = 0; $i < $tam; $i++) {
            $onechar = $thisline[$i];
            // Skip bytes above 240
            if ($onechar > chr(240)) {
                continue;
            }
            // Keep printable characters
            if ($onechar >= chr(0x20)) {
                $caracteres++;
                $new_line .= $onechar;
            }
            // 0x14 closes the anchor opened for a HYPERLINK field
            if ($onechar == chr(0x14)) {
                $new_line .= "</a>";
            }
            // 0x07 becomes a tab; two in a row also add a line break
            if ($onechar == chr(0x07)) {
                $new_line .= "\t";
                if (isset($thisline[$i + 1])) {
                    if ($thisline[$i + 1] == chr(0x07)) {
                        $new_line .= "\n";
                    }
                }
            }
        }
        // replace with a hyperlink
        $new_line = str_replace("HYPERLINK", "<a href=", $new_line);
        $new_line = str_replace('\o', ">", $new_line);
        $new_line .= "\n";
        // image links
        $new_line = str_replace("INCLUDEPICTURE", "<br><img src=", $new_line);
        $new_line = str_replace('\*', "><br>", $new_line);
        $new_line = str_replace("MERGEFORMATINET", "", $new_line);
        $outtext .= nl2br($new_line);
    }
    return $outtext;
}
$userDoc = "Cultura.doc";
$text = parseWord($userDoc);
echo $text;
?>
Answer 4
www.phplivedocx.org is a SOAP-based service that processes files online. It also comes with a fair number of usage examples. I think that without COM this is simply not possible on a Linux server, and the only idea is to convert the doc file into another file that PHP can parse...
Answer 5
Using the Open XML SDK and VSTO [Visual Studio Tools for Office], we can easily work with Word files, manipulate them, and even convert them internally into various formats such as .odt, .pdf, .docx, and so on. So go to msdn.microsoft.com and look carefully through the Office development tab. This is the easiest way to do it, because all the functions we need to implement are already available in .NET. And since you want to do your project in PHP, you can do it in Visual Studio and .NET, because PHP is also one of the .NET-compliant languages.
Answer 6
Use the following class directly to read a Word document:
class DocxConversion {
    private $filename;

    public function __construct($filePath) {
        $this->filename = $filePath;
    }

    // Binary .doc: keep the chunks between chr(0x0D) that contain no NUL byte
    private function read_doc() {
        $fileHandle = fopen($this->filename, "rb");
        $line = @fread($fileHandle, filesize($this->filename));
        fclose($fileHandle);
        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        foreach ($lines as $thisline) {
            $pos = strpos($thisline, chr(0x00));
            if ($pos === false && strlen($thisline) > 0) {
                $outtext .= $thisline . " ";
            }
        }
        $outtext = preg_replace('/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/', "", $outtext);
        return $outtext;
    }

    // .docx: read word/document.xml from the archive and strip the tags
    private function read_docx() {
        $zip = new ZipArchive;
        if ($zip->open($this->filename) !== true) {
            return false;
        }
        $content = $zip->getFromName("word/document.xml");
        $zip->close();
        if ($content === false) {
            return '';
        }
        // Table cell boundaries become spaces, paragraph ends become line breaks
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        $striped_content = strip_tags($content);
        return $striped_content;
    }

    /************************excel sheet************************************/
    function xlsx_to_text($input_file) {
        $xml_filename = "xl/sharedStrings.xml"; // content file name
        $zip_handle = new ZipArchive;
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            if (($xml_index = $zip_handle->locateName($xml_filename)) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                // loadXML() must be called on an instance in current PHP versions
                $xml_handle = new DOMDocument();
                $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                $output_text = strip_tags($xml_handle->saveXML());
            }
            $zip_handle->close();
        }
        return $output_text;
    }

    /*************************power point files*****************************/
    function pptx_to_text($input_file) {
        $zip_handle = new ZipArchive;
        $output_text = "";
        if (true === $zip_handle->open($input_file)) {
            $slide_number = 1; // loop through slide files
            while (($xml_index = $zip_handle->locateName("ppt/slides/slide" . $slide_number . ".xml")) !== false) {
                $xml_datas = $zip_handle->getFromIndex($xml_index);
                $xml_handle = new DOMDocument();
                $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                $output_text .= strip_tags($xml_handle->saveXML());
                $slide_number++;
            }
            $zip_handle->close();
        }
        return $output_text;
    }

    public function convertToText() {
        if (isset($this->filename) && !file_exists($this->filename)) {
            return "File Not exists";
        }
        $fileArray = pathinfo($this->filename);
        $file_ext = $fileArray['extension'];
        if ($file_ext == "doc" || $file_ext == "docx" || $file_ext == "xlsx" || $file_ext == "pptx") {
            if ($file_ext == "doc") {
                return $this->read_doc();
            } elseif ($file_ext == "docx") {
                return $this->read_docx();
            } elseif ($file_ext == "xlsx") {
                return $this->xlsx_to_text($this->filename);
            } elseif ($file_ext == "pptx") {
                return $this->pptx_to_text($this->filename);
            }
        } else {
            return "Invalid File Type";
        }
    }
}
$docObj = new DocxConversion("test.docx"); // replace with the name of your document, with the proper doc or docx extension
echo $docText = $docObj->convertToText();
Recently I was faced with the task of extracting plain text from various document formats, be it Microsoft Word documents or PDF. The task was completed, even with a somewhat broader list of possible inputs. So with this article I am opening a series of posts on reading text from the following file types: DOC, DOCX, RTF, ODT, and PDF, using PHP without third-party utilities.
First, let me answer a perfectly reasonable question: "Why is this needed at all?" True, plain text extracted from, say, a Word document is a fairly jumbled mess. But that mess is quite enough to build, for example, a search index over a large repository of office documents.
Another perfectly reasonable question: "Why not use third-party utilities such as antiword or xpdf, or, as a last resort, OLE on Windows?" Those were the constraints I was given, and besides, OLE is painfully slow even where the task can be solved with it.
Today, as a warm-up, I will cover the formats that are fairly simple for this task: Office Open XML, better known as DOCX from Microsoft, and OpenDocument Format, also known as ODT, from the ODF Alliance.
To start, let's look inside a couple of files and see literally the following (docx in the back, odt in front):
The most important thing we see here is the first two characters, PK, at the start of the data. This means that both files are zip archives renamed to .docx/.odt. Open one, for example with Ctrl+PageDown in Total Commander, and behold a perfectly reasonable structure (odt on the left, docx on the right):
So the data files we need are content.xml in ODT and word/document.xml in DOCX. To read the text data from them, let's write some simple code:
function odt2text($filename) {
    return getTextFromZippedXML($filename, "content.xml");
}
function docx2text($filename) {
    return getTextFromZippedXML($filename, "word/document.xml");
}
function getTextFromZippedXML($archiveFile, $contentFile) {
    // Create an instance of the zip archive...
    $zip = new ZipArchive;
    // ...and try to open the given zip file (open() returns an error code on failure)
    if ($zip->open($archiveFile) === true) {
        // On success, look for the data file in the archive
        if (($index = $zip->locateName($contentFile)) !== false) {
            // If it is found, read it into a string
            $content = $zip->getFromIndex($index);
            // Close the zip archive, we no longer need it
            $zip->close();
            // Then load the XML, substituting entities and, where possible, includes of other files,
            // swallowing errors and warnings
            $xml = new DOMDocument();
            $xml->loadXML($content, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Finally, return the data without the XML formatting tags
            return strip_tags($xml->saveXML());
        }
        $zip->close();
    }
    // If something went wrong, return an empty string
    return "";
}
Just some 30 lines, and we get the text data from the file. The code works on PHP 5.2+ and requires php_zip.dll on Windows or the --enable-zip switch on Linux. If ZipArchive cannot be used (an old PHP version or missing libraries), the PclZip library, which reads zip files without the corresponding system facilities, may well do.
Note that this code is only a starting point for the task of reading text. After the series of articles under the slogan "Text at any cost", I will try to describe the principles and implementation of reading formatted text.
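The observation that both formats begin with the characters PK can be turned into a cheap check before a file is handed to ZipArchive. The sketch below is an illustration rather than part of the original code: the function name looksLikeZipPackage is hypothetical, and it tests only for the "PK\x03\x04" signature that opens a non-empty ZIP archive. The two temporary files are stand-ins for a zipped package and an old binary .doc.

```php
<?php
// Hypothetical helper: true if the file starts with the ZIP signature "PK\x03\x04".
function looksLikeZipPackage(string $path): bool
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    // Read only the first four bytes; shorter files simply fail the comparison
    $signature = fread($handle, 4);
    fclose($handle);
    return $signature === "PK\x03\x04";
}

// Stand-in files (assumption: only the leading bytes matter for this check).
$zipLike = tempnam(sys_get_temp_dir(), 'sig');
file_put_contents($zipLike, "PK\x03\x04" . str_repeat("\0", 26));
$oleLike = tempnam(sys_get_temp_dir(), 'sig');
file_put_contents($oleLike, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1");

$zipResult = looksLikeZipPackage($zipLike);
$oleResult = looksLikeZipPackage($oleLike);
var_dump($zipResult, $oleResult);

unlink($zipLike);
unlink($oleLike);
```

A check like this lets a caller route DOCX and ODT files to the zip-based reader and everything else to a different parser, without relying on the file extension.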
Related:
- msdn.microsoft.com/en-us/library/aa338205.aspx
- www.i-rs.ru/Produkty/ODF-ISO-IEC-26300-2006/Dokumentaciya/Format-Open-Document-dlya-ofisnyh-prilozhenij-OpenDocument-v1.0.odt
- Text at any cost: PDF
- Text at any cost: RTF
- Text at any cost: WCBFF and DOC
Next time I will talk about reading text from PDF without the help of xpdf, a more complex but entirely manageable task for PHP.