I’d like to search a Word 2007 file (.docx) for a text string, e.g., «some special phrase» that could/would be found from a search within Word.
Is there a way from Python to see the text? I have no interest in formatting — I just want to classify documents as having or not having «some special phrase».
edi9999
19.3k13 gold badges88 silver badges127 bronze badges
asked Sep 22, 2008 at 17:08
After reading your post above, I made a 100% native Python docx module to solve this specific problem.
# Import the module
from docx import document, opendocx
# Open the .docx file
document = opendocx('A document.docx')
# Search returns true if found
search(document,'your search string')
The docx module is at https://python-docx.readthedocs.org/en/latest/
answered Dec 30, 2009 at 12:08
mikemaccanamikemaccana
106k96 gold badges376 silver badges477 bronze badges
13
More exactly, a .docx document is a Zip archive in OpenXML format: you have first to uncompress it.
I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.
answered Sep 22, 2008 at 17:22
PhiLhoPhiLho
40.3k6 gold badges96 silver badges132 bronze badges
1
In this example, «Course Outline.docx» is a Word 2007 document, which does contain the word «Windows», and does not contain the phrase «random other string».
>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> "Windows" in z.read("word/document.xml")
True
>>> "random other string" in z.read("word/document.xml")
False
>>> z.close()
Basically, you just open the docx file (which is a zip archive) using zipfile, and find the content in the ‘document.xml’ file in the ‘word’ folder. If you wanted to be more sophisticated, you could then parse the XML, but if you’re just looking for a phrase (which you know won’t be a tag), then you can just look in the XML for the string.
Efren
3,7614 gold badges32 silver badges73 bronze badges
answered Sep 22, 2008 at 23:12
Tony MeyerTony Meyer
9,9596 gold badges44 silver badges47 bronze badges
1
A problem with searching inside a Word document XML file is that the text can be split into elements at any character. It will certainly be split if formatting is different, for example as in Hello World. But it can be split at any point and that is valid in OOXML. So you will end up dealing with XML like this even if formatting does not change in the middle of the phrase!
<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">
<w:r w:rsidRPr="003F6D7A">
<w:rPr>
<w:b />
</w:rPr>
<w:t>Hello</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve">World.</w:t>
</w:r>
</w:p>
You can of course load it into an XML DOM tree (not sure what this will be in Python) and ask to get text only as a string, but you could end up with many other «dead ends» just because the OOXML spec is around 6000 pages long and MS Word can write lots of «stuff» you don’t expect. So you could end up writing your own document processing library.
Or you can try using Aspose.Words.
It is available as .NET and Java products. Both can be used from Python. One via COM Interop another via JPype. See Aspose.Words Programmers Guide, Utilize Aspose.Words in Other Programming Languages (sorry I can’t post a second link, stackoverflow does not let me yet).
answered Nov 15, 2009 at 11:01
romeokromeok
6611 gold badge9 silver badges13 bronze badges
1
You can use docx2txt
to get the text inside the docx, than search in that txt
npm install -g docx2txt
docx2txt input.docx # This will print the text to stdout
answered Dec 9, 2014 at 10:51
edi9999edi9999
19.3k13 gold badges88 silver badges127 bronze badges
A docx is just a zip archive with lots of files inside. Maybe you can look at some of the contents of those files? Other than that you probably have to find a lib that understands the word format so that you can filter out things you’re not interested in.
A second choice would be to interop with word and do the search through it.
answered Sep 22, 2008 at 17:16
kokoskokos
42.9k5 gold badges35 silver badges32 bronze badges
a docx file is essentially a zip file with an xml inside it.
the xml contains the formatting but it also contains the text.
answered Sep 22, 2008 at 17:17
shooshshoosh
76.1k54 gold badges208 silver badges323 bronze badges
OLE Automation would probably be the easiest. You have to consider formatting, because the text could look like this in the XML:
<b>Looking <i>for</i> this <u>phrase</u>
There’s no easy way to find that using a simple text scan.
answered Sep 22, 2008 at 23:03
ilitiritilitirit
15.9k18 gold badges72 silver badges109 bronze badges
You should be able to use the MSWord ActiveX interface to extract the text to search (or, possibly, do the search). I have no idea how you access ActiveX from Python though.
answered Sep 22, 2008 at 17:17
Andy BriceAndy Brice
2,3071 gold badge21 silver badges27 bronze badges
You may also consider using the library from OpenXMLDeveloper.org
answered Oct 18, 2008 at 19:34
billbbillb
3,5981 gold badge31 silver badges36 bronze badges
I’ve been doing a lot of searching for a method to find and replace text in a docx file with little luck. I’ve tried the docx module and could not get that to work. Eventually I worked out the method described below using the zipfile module and replacing the document.xml file in the docx archive. For this to work you need a template document (docx) with the text you want to replace as unique strings that could not possibly match any other existing or future text in the document (eg. «The meeting with XXXCLIENTNAMEXXX on XXXMEETDATEXXX went very well.»).
import zipfile
replaceText = {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}
templateDocx = zipfile.ZipFile("C:/Template.docx")
newDocx = zipfile.ZipFile("C:/NewDocument.docx", "a")
with open(templateDocx.extract("word/document.xml", "C:/")) as tempXmlFile:
tempXmlStr = tempXmlFile.read()
for key in replaceText.keys():
tempXmlStr = tempXmlStr.replace(str(key), str(replaceText.get(key)))
with open("C:/temp.xml", "w+") as tempXmlFile:
tempXmlFile.write(tempXmlStr)
for file in templateDocx.filelist:
if not file.filename == "word/document.xml":
newDocx.writestr(file.filename, templateDocx.read(file))
newDocx.write("C:/temp.xml", "word/document.xml")
templateDocx.close()
newDocx.close()
My question is what’s wrong with this method? I’m pretty new to this stuff, so I feel someone else should have figured this out already. Which leads me to believe there is something very wrong with this approach. But it works! What am I missing here?
.
Here is a walkthrough of my thought process for everyone else trying to learn this stuff:
Step 1) Prepare a Python dictionary of the text strings you want to replace as keys and the new text as items (eg. {«XXXCLIENTNAMEXXX» : «Joe Bob», «XXXMEETDATEXXX» : «May 31, 2013»}).
Step 2) Open the template docx file using the zipfile module.
Step 3) Open a new new docx file with the append access mode.
Step 4) Extract the document.xml (where all the text lives) from the template docx file and read the xml to a text string variable.
Step 5) Use a for loop to replace all of the text defined in your dictionary in the xml text string with your new text.
Step 6) Write the xml text string to a new temporary xml file.
Step 7) Use a for loop and the zipfile module to copy all of the files in the template docx archive to a new docx archive EXCEPT the word/document.xml file.
Step Write the temporary xml file with the replaced text to the new docx archive as a new word/document.xml file.
Step 9) Close your template and new docx archives.
Step 10) Open your new docx document and enjoy your replaced text!
—Edit— Missing closing parentheses ‘)’ on lines 7 and 11
I’d like to search a Word 2007 file (.docx) for a text string, e.g., «some special phrase» that could/would be found from a search within Word.
Is there a way from Python to see the text? I have no interest in formatting — I just want to classify documents as having or not having «some special phrase».
edi9999
19.3k13 gold badges88 silver badges127 bronze badges
asked Sep 22, 2008 at 17:08
After reading your post above, I made a 100% native Python docx module to solve this specific problem.
# Import the module
from docx import document, opendocx
# Open the .docx file
document = opendocx('A document.docx')
# Search returns true if found
search(document,'your search string')
The docx module is at https://python-docx.readthedocs.org/en/latest/
answered Dec 30, 2009 at 12:08
mikemaccanamikemaccana
106k96 gold badges376 silver badges477 bronze badges
13
More exactly, a .docx document is a Zip archive in OpenXML format: you have first to uncompress it.
I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.
answered Sep 22, 2008 at 17:22
PhiLhoPhiLho
40.3k6 gold badges96 silver badges132 bronze badges
1
In this example, «Course Outline.docx» is a Word 2007 document, which does contain the word «Windows», and does not contain the phrase «random other string».
>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> "Windows" in z.read("word/document.xml")
True
>>> "random other string" in z.read("word/document.xml")
False
>>> z.close()
Basically, you just open the docx file (which is a zip archive) using zipfile, and find the content in the ‘document.xml’ file in the ‘word’ folder. If you wanted to be more sophisticated, you could then parse the XML, but if you’re just looking for a phrase (which you know won’t be a tag), then you can just look in the XML for the string.
Efren
3,7614 gold badges32 silver badges73 bronze badges
answered Sep 22, 2008 at 23:12
Tony MeyerTony Meyer
9,9596 gold badges44 silver badges47 bronze badges
1
A problem with searching inside a Word document XML file is that the text can be split into elements at any character. It will certainly be split if formatting is different, for example as in Hello World. But it can be split at any point and that is valid in OOXML. So you will end up dealing with XML like this even if formatting does not change in the middle of the phrase!
<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">
<w:r w:rsidRPr="003F6D7A">
<w:rPr>
<w:b />
</w:rPr>
<w:t>Hello</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve">World.</w:t>
</w:r>
</w:p>
You can of course load it into an XML DOM tree (not sure what this will be in Python) and ask to get text only as a string, but you could end up with many other «dead ends» just because the OOXML spec is around 6000 pages long and MS Word can write lots of «stuff» you don’t expect. So you could end up writing your own document processing library.
Or you can try using Aspose.Words.
It is available as .NET and Java products. Both can be used from Python. One via COM Interop another via JPype. See Aspose.Words Programmers Guide, Utilize Aspose.Words in Other Programming Languages (sorry I can’t post a second link, stackoverflow does not let me yet).
answered Nov 15, 2009 at 11:01
romeokromeok
6611 gold badge9 silver badges13 bronze badges
1
You can use docx2txt
to get the text inside the docx, than search in that txt
npm install -g docx2txt
docx2txt input.docx # This will print the text to stdout
answered Dec 9, 2014 at 10:51
edi9999edi9999
19.3k13 gold badges88 silver badges127 bronze badges
A docx is just a zip archive with lots of files inside. Maybe you can look at some of the contents of those files? Other than that you probably have to find a lib that understands the word format so that you can filter out things you’re not interested in.
A second choice would be to interop with word and do the search through it.
answered Sep 22, 2008 at 17:16
kokoskokos
42.9k5 gold badges35 silver badges32 bronze badges
a docx file is essentially a zip file with an xml inside it.
the xml contains the formatting but it also contains the text.
answered Sep 22, 2008 at 17:17
shooshshoosh
76.1k54 gold badges208 silver badges323 bronze badges
OLE Automation would probably be the easiest. You have to consider formatting, because the text could look like this in the XML:
<b>Looking <i>for</i> this <u>phrase</u>
There’s no easy way to find that using a simple text scan.
answered Sep 22, 2008 at 23:03
ilitiritilitirit
15.9k18 gold badges72 silver badges109 bronze badges
You should be able to use the MSWord ActiveX interface to extract the text to search (or, possibly, do the search). I have no idea how you access ActiveX from Python though.
answered Sep 22, 2008 at 17:17
Andy BriceAndy Brice
2,3071 gold badge21 silver badges27 bronze badges
You may also consider using the library from OpenXMLDeveloper.org
answered Oct 18, 2008 at 19:34
billbbillb
3,5981 gold badge31 silver badges36 bronze badges
Finding and Replacing Text in a Word Document Using python-docx
Recently I wrote a program using Python to automate the generation of dividend slips for my company.
My requirements were simple:
- I needed to read from a CSV file the amounts, dates and addresses of shareholders
- Check that a dividend slip file didn’t already exist
- Open the word template which contained the dividend slip
- Replace certain ‘trigger words’ with the amount, date and addresses
- Save the file, not overwriting the original template
In order to accomplish this, I turned to my favorite search-engine and looked for Python packages which would allow me to interface with Word documents.
Straight away, I found python-docx.
python-docx doesn’t have any find/replace features as standard, but I figured out I could just iterate through the paragraphs and then replace text using python’s str() functions.
Here’s my script:
import docx
import csv
import pathlib
if __name__ == "__main__":
with open("Dividends.csv") as f:
reader = csv.DictReader(f)
for line in reader:
print("Checking: " + line["Date"] + ".docx")
if pathlib.Path(line["Date"] + ".docx").exists() == False:
doc = docx.Document("dividend-template.docx")
if line["Amount"] != "":
print("Generating new Dividend slip for: " + line["Date"])
for paragraph in doc.paragraphs:
if "DTE" in paragraph.text:
orig_text = paragraph.text
new_text = str.replace(orig_text, "DTE", line["Date"])
paragraph.text = new_text
if "AMNT" in paragraph.text:
orig_text = paragraph.text
new_text = str.replace(orig_text, "AMNT", line["Amount"])
paragraph.text = new_text
if "ADDR" in paragraph.text:
orig_text = paragraph.text
new_text = str.replace(orig_text, "ADDR", line["Address"])
paragraph.text = new_text
print("Saving: " + line["Date"] + ".docx")
doc.save(line["Date"] + ".docx")
Построение отчётов на основании большого количества файлов – рядовая задача. Но всё становится сложнее, если вместо одного формата исходных файлов, мы получаем кучу файлов разного расширения.
Ранее я говорил о том, как скачивали файлы из базы и распознавать их, здесь я расскажу о том, как вытаскивать информацию (для анализа данных для спринта), по ключевым словам, или фразам из документов разных форматов (.rtf,. doc,. docx,. xls,. xlsx,. pdf). Вообще эту тему можно отнести к text mining, data mining. Text Mining — это если простыми словами, то добыча информации из текстов. Data Mining – это примерно то же самое что и ™ только не в тексте, а большом наборе данных для последующего анализа.
В этой задаче столкнулся с такой проблемой как отсутствие нормальных библиотек для Python для парсинга информации с файлов!
Так как файлы были разных форматов (.rtf,. doc,. docx,. xls,. xlsx,. pdf) и что бы открыть и прочитать информацию из них помощью Python нужно было найти подходящие библиотеки. В. pdf были сканы, и мы вопрос с ними уже решили (об этом я рассказывал в предыдущей статье). Для работы с форматами. xls,. xlsx есть отличная библиотека pandas, которая на ура справляется с поставленной целью и не только. Pandas это высокоуровневая Python библиотека для анализа данных. В экосистеме Python, pandas является лучшей и быстроразвивающейся библиотекой для обработки и анализа данных. Мне приходится пользоваться ею практически каждый день!
Осталось решить вопрос с. rtf,. doc,. docx.
.rtf
Rich Text Format, RTF — это формат текста придуманный группой программистов из Microsoft и Adobe в 82 году.
Для работы с этим форматом использовал разные библиотеки в том числе pyth.
#RTF
import pandas as pd
from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter
import re
import glob
#from pyth.plugins.xhtml.writer import XHTMLWriter
.doc
.doc — то же является текстовым форматом, но более лучшим чем предыдущий.
Для работы с ним использовал разные библиотеки, но также, как и с форматом. rtf проблема заключалась в том, что в документах были таблицы и нормально прочитать их не получалось никакой библиотекой (ни. rtf, ни. doc), просто текст читает без проблем если там нет таблиц (.rtf и. doc)!
В связи с этим. rtf и. doc просто приходилось конвертировать в формат. docx (про него ниже) и также сделал просто. exe-шник с помощью Python который конвертирует эти форматы в. docx.
import os
import win32com.client
for all_doc in files_doc:
print(dir_name + ‘/’ + all_doc + ‘x’)
word = win32com.client.Dispatch(‘Word.Application’)
wb = word.Documents.Open(dir_name + ‘/’ + all_doc)
wb.SaveAs2(dir_name + ‘/’ + all_doc + ‘x’, FileFormat=16)
wb.Close()
.docx
Начиная с 2007 появился новый формат на основе XML — docx.
И так, с форматами. xls,. docx, проблем никаких не возникло. С помощью необходимых библиотек (docx, pandas, tkinter) работа с файлами, вытаскивание информации, по ключевым словам, или фразам была реализована! Сделан графический интерфейс (графический пользовательский интерфейс (ГПИ) (англ. graphical user interface, GUI)) также скомпилирована в. exe и добавлена инструкция.
import tkinter as tk
from docx import Document
from os import listdir
from pandas import ExcelFile
from pandas import DataFrame
my_columns = df1.columns
for col in my_columns:
my_val_ind = df1[df1[col].astype(str).apply(lambda x: any(s in x for s in spis_to_find))].index
# ищем вхождение ключевой фразы из списка spis_to_find (либо функц any) и берем индекс
# print(my_val_ind)
if len(my_val_ind ) > 0: # если индекс какой то был найден то тянем значения
for ival in df1[df1.index == my_val_ind[0]].values:
Data_xlsx.append(xlsx_name)
key_conteins.append(ival)
Что получилось: