I’d like to search a Word 2007 file (.docx) for a text string, e.g., «some special phrase» that could/would be found from a search within Word.
Is there a way from Python to see the text? I have no interest in formatting — I just want to classify documents as having or not having «some special phrase».
edi9999
19.3k13 gold badges88 silver badges127 bronze badges
asked Sep 22, 2008 at 17:08
After reading your post above, I made a 100% native Python docx module to solve this specific problem.
# Import the module
from docx import document, opendocx
# Open the .docx file
document = opendocx('A document.docx')
# Search returns true if found
search(document,'your search string')
The docx module is at https://python-docx.readthedocs.org/en/latest/
answered Dec 30, 2009 at 12:08
mikemaccanamikemaccana
106k96 gold badges376 silver badges477 bronze badges
13
More exactly, a .docx document is a Zip archive in OpenXML format: you have first to uncompress it.
I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.
answered Sep 22, 2008 at 17:22
PhiLhoPhiLho
40.3k6 gold badges96 silver badges132 bronze badges
1
In this example, «Course Outline.docx» is a Word 2007 document, which does contain the word «Windows», and does not contain the phrase «random other string».
>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> "Windows" in z.read("word/document.xml")
True
>>> "random other string" in z.read("word/document.xml")
False
>>> z.close()
Basically, you just open the docx file (which is a zip archive) using zipfile, and find the content in the ‘document.xml’ file in the ‘word’ folder. If you wanted to be more sophisticated, you could then parse the XML, but if you’re just looking for a phrase (which you know won’t be a tag), then you can just look in the XML for the string.
Efren
3,7614 gold badges32 silver badges73 bronze badges
answered Sep 22, 2008 at 23:12
Tony MeyerTony Meyer
9,9596 gold badges44 silver badges47 bronze badges
1
A problem with searching inside a Word document XML file is that the text can be split into elements at any character. It will certainly be split if formatting is different, for example as in Hello World. But it can be split at any point and that is valid in OOXML. So you will end up dealing with XML like this even if formatting does not change in the middle of the phrase!
<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">
<w:r w:rsidRPr="003F6D7A">
<w:rPr>
<w:b />
</w:rPr>
<w:t>Hello</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve">World.</w:t>
</w:r>
</w:p>
You can of course load it into an XML DOM tree (not sure what this will be in Python) and ask to get text only as a string, but you could end up with many other «dead ends» just because the OOXML spec is around 6000 pages long and MS Word can write lots of «stuff» you don’t expect. So you could end up writing your own document processing library.
Or you can try using Aspose.Words.
It is available as .NET and Java products. Both can be used from Python. One via COM Interop another via JPype. See Aspose.Words Programmers Guide, Utilize Aspose.Words in Other Programming Languages (sorry I can’t post a second link, stackoverflow does not let me yet).
answered Nov 15, 2009 at 11:01
romeokromeok
6611 gold badge9 silver badges13 bronze badges
1
You can use docx2txt
to get the text inside the docx, than search in that txt
npm install -g docx2txt
docx2txt input.docx # This will print the text to stdout
answered Dec 9, 2014 at 10:51
edi9999edi9999
19.3k13 gold badges88 silver badges127 bronze badges
A docx is just a zip archive with lots of files inside. Maybe you can look at some of the contents of those files? Other than that you probably have to find a lib that understands the word format so that you can filter out things you’re not interested in.
A second choice would be to interop with word and do the search through it.
answered Sep 22, 2008 at 17:16
kokoskokos
42.9k5 gold badges35 silver badges32 bronze badges
a docx file is essentially a zip file with an xml inside it.
the xml contains the formatting but it also contains the text.
answered Sep 22, 2008 at 17:17
shooshshoosh
76.1k54 gold badges208 silver badges323 bronze badges
OLE Automation would probably be the easiest. You have to consider formatting, because the text could look like this in the XML:
<b>Looking <i>for</i> this <u>phrase</u>
There’s no easy way to find that using a simple text scan.
answered Sep 22, 2008 at 23:03
ilitiritilitirit
15.9k18 gold badges72 silver badges109 bronze badges
You should be able to use the MSWord ActiveX interface to extract the text to search (or, possibly, do the search). I have no idea how you access ActiveX from Python though.
answered Sep 22, 2008 at 17:17
Andy BriceAndy Brice
2,3071 gold badge21 silver badges27 bronze badges
You may also consider using the library from OpenXMLDeveloper.org
answered Oct 18, 2008 at 19:34
billbbillb
3,5981 gold badge31 silver badges36 bronze badges
I’d like to search a Word 2007 file (.docx) for a text string, e.g., «some special phrase» that could/would be found from a search within Word.
Is there a way from Python to see the text? I have no interest in formatting — I just want to classify documents as having or not having «some special phrase».
edi9999
19.3k13 gold badges88 silver badges127 bronze badges
asked Sep 22, 2008 at 17:08
After reading your post above, I made a 100% native Python docx module to solve this specific problem.
# Import the module
from docx import document, opendocx
# Open the .docx file
document = opendocx('A document.docx')
# Search returns true if found
search(document,'your search string')
The docx module is at https://python-docx.readthedocs.org/en/latest/
answered Dec 30, 2009 at 12:08
mikemaccanamikemaccana
106k96 gold badges376 silver badges477 bronze badges
13
More exactly, a .docx document is a Zip archive in OpenXML format: you have first to uncompress it.
I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.
answered Sep 22, 2008 at 17:22
PhiLhoPhiLho
40.3k6 gold badges96 silver badges132 bronze badges
1
In this example, «Course Outline.docx» is a Word 2007 document, which does contain the word «Windows», and does not contain the phrase «random other string».
>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> "Windows" in z.read("word/document.xml")
True
>>> "random other string" in z.read("word/document.xml")
False
>>> z.close()
Basically, you just open the docx file (which is a zip archive) using zipfile, and find the content in the ‘document.xml’ file in the ‘word’ folder. If you wanted to be more sophisticated, you could then parse the XML, but if you’re just looking for a phrase (which you know won’t be a tag), then you can just look in the XML for the string.
Efren
3,7614 gold badges32 silver badges73 bronze badges
answered Sep 22, 2008 at 23:12
Tony MeyerTony Meyer
9,9596 gold badges44 silver badges47 bronze badges
1
A problem with searching inside a Word document XML file is that the text can be split into elements at any character. It will certainly be split if formatting is different, for example as in Hello World. But it can be split at any point and that is valid in OOXML. So you will end up dealing with XML like this even if formatting does not change in the middle of the phrase!
<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">
<w:r w:rsidRPr="003F6D7A">
<w:rPr>
<w:b />
</w:rPr>
<w:t>Hello</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve">World.</w:t>
</w:r>
</w:p>
You can of course load it into an XML DOM tree (not sure what this will be in Python) and ask to get text only as a string, but you could end up with many other «dead ends» just because the OOXML spec is around 6000 pages long and MS Word can write lots of «stuff» you don’t expect. So you could end up writing your own document processing library.
Or you can try using Aspose.Words.
It is available as .NET and Java products. Both can be used from Python. One via COM Interop another via JPype. See Aspose.Words Programmers Guide, Utilize Aspose.Words in Other Programming Languages (sorry I can’t post a second link, stackoverflow does not let me yet).
answered Nov 15, 2009 at 11:01
romeokromeok
6611 gold badge9 silver badges13 bronze badges
1
You can use docx2txt
to get the text inside the docx, than search in that txt
npm install -g docx2txt
docx2txt input.docx # This will print the text to stdout
answered Dec 9, 2014 at 10:51
edi9999edi9999
19.3k13 gold badges88 silver badges127 bronze badges
A docx is just a zip archive with lots of files inside. Maybe you can look at some of the contents of those files? Other than that you probably have to find a lib that understands the word format so that you can filter out things you’re not interested in.
A second choice would be to interop with word and do the search through it.
answered Sep 22, 2008 at 17:16
kokoskokos
42.9k5 gold badges35 silver badges32 bronze badges
a docx file is essentially a zip file with an xml inside it.
the xml contains the formatting but it also contains the text.
answered Sep 22, 2008 at 17:17
shooshshoosh
76.1k54 gold badges208 silver badges323 bronze badges
OLE Automation would probably be the easiest. You have to consider formatting, because the text could look like this in the XML:
<b>Looking <i>for</i> this <u>phrase</u>
There’s no easy way to find that using a simple text scan.
answered Sep 22, 2008 at 23:03
ilitiritilitirit
15.9k18 gold badges72 silver badges109 bronze badges
You should be able to use the MSWord ActiveX interface to extract the text to search (or, possibly, do the search). I have no idea how you access ActiveX from Python though.
answered Sep 22, 2008 at 17:17
Andy BriceAndy Brice
2,3071 gold badge21 silver badges27 bronze badges
You may also consider using the library from OpenXMLDeveloper.org
answered Oct 18, 2008 at 19:34
billbbillb
3,5981 gold badge31 silver badges36 bronze badges
I’ve been doing a lot of searching for a method to find and replace text in a docx file with little luck. I’ve tried the docx module and could not get that to work. Eventually I worked out the method described below using the zipfile module and replacing the document.xml file in the docx archive. For this to work you need a template document (docx) with the text you want to replace as unique strings that could not possibly match any other existing or future text in the document (eg. «The meeting with XXXCLIENTNAMEXXX on XXXMEETDATEXXX went very well.»).
import zipfile
replaceText = {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}
templateDocx = zipfile.ZipFile("C:/Template.docx")
newDocx = zipfile.ZipFile("C:/NewDocument.docx", "a")
with open(templateDocx.extract("word/document.xml", "C:/")) as tempXmlFile:
tempXmlStr = tempXmlFile.read()
for key in replaceText.keys():
tempXmlStr = tempXmlStr.replace(str(key), str(replaceText.get(key)))
with open("C:/temp.xml", "w+") as tempXmlFile:
tempXmlFile.write(tempXmlStr)
for file in templateDocx.filelist:
if not file.filename == "word/document.xml":
newDocx.writestr(file.filename, templateDocx.read(file))
newDocx.write("C:/temp.xml", "word/document.xml")
templateDocx.close()
newDocx.close()
My question is what’s wrong with this method? I’m pretty new to this stuff, so I feel someone else should have figured this out already. Which leads me to believe there is something very wrong with this approach. But it works! What am I missing here?
.
Here is a walkthrough of my thought process for everyone else trying to learn this stuff:
Step 1) Prepare a Python dictionary of the text strings you want to replace as keys and the new text as items (eg. {«XXXCLIENTNAMEXXX» : «Joe Bob», «XXXMEETDATEXXX» : «May 31, 2013»}).
Step 2) Open the template docx file using the zipfile module.
Step 3) Open a new new docx file with the append access mode.
Step 4) Extract the document.xml (where all the text lives) from the template docx file and read the xml to a text string variable.
Step 5) Use a for loop to replace all of the text defined in your dictionary in the xml text string with your new text.
Step 6) Write the xml text string to a new temporary xml file.
Step 7) Use a for loop and the zipfile module to copy all of the files in the template docx archive to a new docx archive EXCEPT the word/document.xml file.
Step Write the temporary xml file with the replaced text to the new docx archive as a new word/document.xml file.
Step 9) Close your template and new docx archives.
Step 10) Open your new docx document and enjoy your replaced text!
—Edit— Missing closing parentheses ‘)’ on lines 7 and 11
2 / 2 / 0 Регистрация: 05.05.2020 Сообщений: 15 |
|
1 |
|
17.07.2020, 13:47. Показов 24030. Ответов 2
Привет. Как с помощью python-docx выполнить поиск определенного слова с последующей заменой? Документацию прочел, но так и не нашел соответствующей функции. Может есть возможность сделать это с помощью другой библиотеки? Буду рад любой помощи)
0 |
Fudthhh Модератор 2745 / 1538 / 505 Регистрация: 21.02.2017 Сообщений: 4,126 Записей в блоге: 1 |
||||
17.07.2020, 14:25 |
2 |
|||
Сообщение было отмечено Piwpaf как решение РешениеPiwpaf, ???
1 |
2 / 2 / 0 Регистрация: 05.05.2020 Сообщений: 15 |
|
17.07.2020, 14:30 [ТС] |
3 |
Блин, вы заставили меня покраснеть за мою невнимательность. Спасибо большое
0 |
Project description
docx_search
A python library built to search for keywords in .docx files.
Installation
Easy:
pip install docx_search
Advanced:
Requires git.
Git clone https://github.com/WaldemarBjornstrom/docx_search
Then place the downloaded files in your project.
Usage
import docx_search docx_search.run('Keyword', 'Path\to\file') # Returns a list with files that contain the keyword.
NOTE! The Keyword is case sensitive. And remember tu use double backslashes in the file path as python sees a single backslash as an escape marker.
License
This project is licensed under the GNU GPLv3 License. See LICENSE
Download files
Download the file for your platform. If you’re not sure which to choose, learn more about installing packages.