Поиск слов в word python

I’d like to search a Word 2007 file (.docx) for a text string, e.g., «some special phrase» that could/would be found from a search within Word.

Is there a way from Python to see the text? I have no interest in formatting — I just want to classify documents as having or not having «some special phrase».

edi9999's user avatar

edi9999

19.3k13 gold badges88 silver badges127 bronze badges

asked Sep 22, 2008 at 17:08

Gerry's user avatar

After reading your post above, I made a 100% native Python docx module to solve this specific problem.

# Import the module
from docx import document, opendocx

# Open the .docx file
document = opendocx('A document.docx')

# Search returns true if found    
search(document,'your search string')

The docx module is at https://python-docx.readthedocs.org/en/latest/

answered Dec 30, 2009 at 12:08

mikemaccana's user avatar

mikemaccanamikemaccana

106k96 gold badges376 silver badges477 bronze badges

13

More exactly, a .docx document is a Zip archive in OpenXML format: you have first to uncompress it.
I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.

answered Sep 22, 2008 at 17:22

PhiLho's user avatar

PhiLhoPhiLho

40.3k6 gold badges96 silver badges132 bronze badges

1

In this example, «Course Outline.docx» is a Word 2007 document, which does contain the word «Windows», and does not contain the phrase «random other string».

>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> "Windows" in z.read("word/document.xml")
True
>>> "random other string" in z.read("word/document.xml")
False
>>> z.close()

Basically, you just open the docx file (which is a zip archive) using zipfile, and find the content in the ‘document.xml’ file in the ‘word’ folder. If you wanted to be more sophisticated, you could then parse the XML, but if you’re just looking for a phrase (which you know won’t be a tag), then you can just look in the XML for the string.

Efren's user avatar

Efren

3,7614 gold badges32 silver badges73 bronze badges

answered Sep 22, 2008 at 23:12

Tony Meyer's user avatar

Tony MeyerTony Meyer

9,9596 gold badges44 silver badges47 bronze badges

1

A problem with searching inside a Word document XML file is that the text can be split into elements at any character. It will certainly be split if formatting is different, for example as in Hello World. But it can be split at any point and that is valid in OOXML. So you will end up dealing with XML like this even if formatting does not change in the middle of the phrase!

<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">

<w:r w:rsidRPr="003F6D7A">

<w:rPr>

<w:b /> 

</w:rPr>

<w:t>Hello</w:t> 

</w:r>

<w:r>

<w:t xml:space="preserve">World.</w:t> 

</w:r>

</w:p>

You can of course load it into an XML DOM tree (not sure what this will be in Python) and ask to get text only as a string, but you could end up with many other «dead ends» just because the OOXML spec is around 6000 pages long and MS Word can write lots of «stuff» you don’t expect. So you could end up writing your own document processing library.

Or you can try using Aspose.Words.

It is available as .NET and Java products. Both can be used from Python. One via COM Interop another via JPype. See Aspose.Words Programmers Guide, Utilize Aspose.Words in Other Programming Languages (sorry I can’t post a second link, stackoverflow does not let me yet).

answered Nov 15, 2009 at 11:01

romeok's user avatar

romeokromeok

6611 gold badge9 silver badges13 bronze badges

1

You can use docx2txt to get the text inside the docx, than search in that txt

npm install -g docx2txt
docx2txt input.docx # This will  print the text to stdout

answered Dec 9, 2014 at 10:51

edi9999's user avatar

edi9999edi9999

19.3k13 gold badges88 silver badges127 bronze badges

A docx is just a zip archive with lots of files inside. Maybe you can look at some of the contents of those files? Other than that you probably have to find a lib that understands the word format so that you can filter out things you’re not interested in.

A second choice would be to interop with word and do the search through it.

answered Sep 22, 2008 at 17:16

kokos's user avatar

kokoskokos

42.9k5 gold badges35 silver badges32 bronze badges

a docx file is essentially a zip file with an xml inside it.
the xml contains the formatting but it also contains the text.

answered Sep 22, 2008 at 17:17

shoosh's user avatar

shooshshoosh

76.1k54 gold badges208 silver badges323 bronze badges

OLE Automation would probably be the easiest. You have to consider formatting, because the text could look like this in the XML:

<b>Looking <i>for</i> this <u>phrase</u>

There’s no easy way to find that using a simple text scan.

Berkay Turancı's user avatar

answered Sep 22, 2008 at 23:03

ilitirit's user avatar

ilitiritilitirit

15.9k18 gold badges72 silver badges109 bronze badges

You should be able to use the MSWord ActiveX interface to extract the text to search (or, possibly, do the search). I have no idea how you access ActiveX from Python though.

answered Sep 22, 2008 at 17:17

Andy Brice's user avatar

Andy BriceAndy Brice

2,3071 gold badge21 silver badges27 bronze badges

You may also consider using the library from OpenXMLDeveloper.org

answered Oct 18, 2008 at 19:34

billb's user avatar

billbbillb

3,5981 gold badge31 silver badges36 bronze badges

I’d like to search a Word 2007 file (.docx) for a text string, e.g., «some special phrase» that could/would be found from a search within Word.

Is there a way from Python to see the text? I have no interest in formatting — I just want to classify documents as having or not having «some special phrase».

edi9999's user avatar

edi9999

19.3k13 gold badges88 silver badges127 bronze badges

asked Sep 22, 2008 at 17:08

Gerry's user avatar

After reading your post above, I made a 100% native Python docx module to solve this specific problem.

# Import the module
from docx import document, opendocx

# Open the .docx file
document = opendocx('A document.docx')

# Search returns true if found    
search(document,'your search string')

The docx module is at https://python-docx.readthedocs.org/en/latest/

answered Dec 30, 2009 at 12:08

mikemaccana's user avatar

mikemaccanamikemaccana

106k96 gold badges376 silver badges477 bronze badges

13

More exactly, a .docx document is a Zip archive in OpenXML format: you have first to uncompress it.
I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.

answered Sep 22, 2008 at 17:22

PhiLho's user avatar

PhiLhoPhiLho

40.3k6 gold badges96 silver badges132 bronze badges

1

In this example, «Course Outline.docx» is a Word 2007 document, which does contain the word «Windows», and does not contain the phrase «random other string».

>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> "Windows" in z.read("word/document.xml")
True
>>> "random other string" in z.read("word/document.xml")
False
>>> z.close()

Basically, you just open the docx file (which is a zip archive) using zipfile, and find the content in the ‘document.xml’ file in the ‘word’ folder. If you wanted to be more sophisticated, you could then parse the XML, but if you’re just looking for a phrase (which you know won’t be a tag), then you can just look in the XML for the string.

Efren's user avatar

Efren

3,7614 gold badges32 silver badges73 bronze badges

answered Sep 22, 2008 at 23:12

Tony Meyer's user avatar

Tony MeyerTony Meyer

9,9596 gold badges44 silver badges47 bronze badges

1

A problem with searching inside a Word document XML file is that the text can be split into elements at any character. It will certainly be split if formatting is different, for example as in Hello World. But it can be split at any point and that is valid in OOXML. So you will end up dealing with XML like this even if formatting does not change in the middle of the phrase!

<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">

<w:r w:rsidRPr="003F6D7A">

<w:rPr>

<w:b /> 

</w:rPr>

<w:t>Hello</w:t> 

</w:r>

<w:r>

<w:t xml:space="preserve">World.</w:t> 

</w:r>

</w:p>

You can of course load it into an XML DOM tree (not sure what this will be in Python) and ask to get text only as a string, but you could end up with many other «dead ends» just because the OOXML spec is around 6000 pages long and MS Word can write lots of «stuff» you don’t expect. So you could end up writing your own document processing library.

Or you can try using Aspose.Words.

It is available as .NET and Java products. Both can be used from Python. One via COM Interop another via JPype. See Aspose.Words Programmers Guide, Utilize Aspose.Words in Other Programming Languages (sorry I can’t post a second link, stackoverflow does not let me yet).

answered Nov 15, 2009 at 11:01

romeok's user avatar

romeokromeok

6611 gold badge9 silver badges13 bronze badges

1

You can use docx2txt to get the text inside the docx, than search in that txt

npm install -g docx2txt
docx2txt input.docx # This will  print the text to stdout

answered Dec 9, 2014 at 10:51

edi9999's user avatar

edi9999edi9999

19.3k13 gold badges88 silver badges127 bronze badges

A docx is just a zip archive with lots of files inside. Maybe you can look at some of the contents of those files? Other than that you probably have to find a lib that understands the word format so that you can filter out things you’re not interested in.

A second choice would be to interop with word and do the search through it.

answered Sep 22, 2008 at 17:16

kokos's user avatar

kokoskokos

42.9k5 gold badges35 silver badges32 bronze badges

a docx file is essentially a zip file with an xml inside it.
the xml contains the formatting but it also contains the text.

answered Sep 22, 2008 at 17:17

shoosh's user avatar

shooshshoosh

76.1k54 gold badges208 silver badges323 bronze badges

OLE Automation would probably be the easiest. You have to consider formatting, because the text could look like this in the XML:

<b>Looking <i>for</i> this <u>phrase</u>

There’s no easy way to find that using a simple text scan.

Berkay Turancı's user avatar

answered Sep 22, 2008 at 23:03

ilitirit's user avatar

ilitiritilitirit

15.9k18 gold badges72 silver badges109 bronze badges

You should be able to use the MSWord ActiveX interface to extract the text to search (or, possibly, do the search). I have no idea how you access ActiveX from Python though.

answered Sep 22, 2008 at 17:17

Andy Brice's user avatar

Andy BriceAndy Brice

2,3071 gold badge21 silver badges27 bronze badges

You may also consider using the library from OpenXMLDeveloper.org

answered Oct 18, 2008 at 19:34

billb's user avatar

billbbillb

3,5981 gold badge31 silver badges36 bronze badges

I’ve been doing a lot of searching for a method to find and replace text in a docx file with little luck. I’ve tried the docx module and could not get that to work. Eventually I worked out the method described below using the zipfile module and replacing the document.xml file in the docx archive. For this to work you need a template document (docx) with the text you want to replace as unique strings that could not possibly match any other existing or future text in the document (eg. «The meeting with XXXCLIENTNAMEXXX on XXXMEETDATEXXX went very well.»).

import zipfile

replaceText = {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}
templateDocx = zipfile.ZipFile("C:/Template.docx")
newDocx = zipfile.ZipFile("C:/NewDocument.docx", "a")

with open(templateDocx.extract("word/document.xml", "C:/")) as tempXmlFile:
    tempXmlStr = tempXmlFile.read()

for key in replaceText.keys():
    tempXmlStr = tempXmlStr.replace(str(key), str(replaceText.get(key)))

with open("C:/temp.xml", "w+") as tempXmlFile:
    tempXmlFile.write(tempXmlStr)

for file in templateDocx.filelist:
    if not file.filename == "word/document.xml":
        newDocx.writestr(file.filename, templateDocx.read(file))

newDocx.write("C:/temp.xml", "word/document.xml")

templateDocx.close()
newDocx.close()

My question is what’s wrong with this method? I’m pretty new to this stuff, so I feel someone else should have figured this out already. Which leads me to believe there is something very wrong with this approach. But it works! What am I missing here?

.

Here is a walkthrough of my thought process for everyone else trying to learn this stuff:

Step 1) Prepare a Python dictionary of the text strings you want to replace as keys and the new text as items (eg. {«XXXCLIENTNAMEXXX» : «Joe Bob», «XXXMEETDATEXXX» : «May 31, 2013»}).

Step 2) Open the template docx file using the zipfile module.

Step 3) Open a new new docx file with the append access mode.

Step 4) Extract the document.xml (where all the text lives) from the template docx file and read the xml to a text string variable.

Step 5) Use a for loop to replace all of the text defined in your dictionary in the xml text string with your new text.

Step 6) Write the xml text string to a new temporary xml file.

Step 7) Use a for loop and the zipfile module to copy all of the files in the template docx archive to a new docx archive EXCEPT the word/document.xml file.

Step 8) Write the temporary xml file with the replaced text to the new docx archive as a new word/document.xml file.

Step 9) Close your template and new docx archives.

Step 10) Open your new docx document and enjoy your replaced text!

—Edit— Missing closing parentheses ‘)’ on lines 7 and 11

2 / 2 / 0

Регистрация: 05.05.2020

Сообщений: 15

1

17.07.2020, 13:47. Показов 24030. Ответов 2


Студворк — интернет-сервис помощи студентам

Привет. Как с помощью python-docx выполнить поиск определенного слова с последующей заменой? Документацию прочел, но так и не нашел соответствующей функции. Может есть возможность сделать это с помощью другой библиотеки? Буду рад любой помощи)



0



Fudthhh

Модератор

Эксперт Python

2745 / 1538 / 505

Регистрация: 21.02.2017

Сообщений: 4,126

Записей в блоге: 1

17.07.2020, 14:25

2

Лучший ответ Сообщение было отмечено Piwpaf как решение

Решение

Piwpaf, ???

Python
1
2
3
4
5
6
7
8
import docx
 
document = docx.Document("example.docx")
 
for paragraph in document.paragraphs:
    paragraph.text = paragraph.text.replace("old", "new")
 
document.save("example.docx")



1



2 / 2 / 0

Регистрация: 05.05.2020

Сообщений: 15

17.07.2020, 14:30

 [ТС]

3

Блин, вы заставили меня покраснеть за мою невнимательность. Спасибо большое



0



Project description

docx_search

A python library built to search for keywords in .docx files.

Installation

Easy:

pip install docx_search

Advanced:

Requires git.

Git clone https://github.com/WaldemarBjornstrom/docx_search

Then place the downloaded files in your project.

Usage

import docx_search

docx_search.run('Keyword', 'Path\to\file') # Returns a list with files that contain the keyword.

NOTE! The Keyword is case sensitive. And remember tu use double backslashes in the file path as python sees a single backslash as an escape marker.

License

This project is licensed under the GNU GPLv3 License. See LICENSE

Download files

Download the file for your platform. If you’re not sure which to choose, learn more about installing packages.

Source Distributions

Built Distribution

Понравилась статья? Поделить с друзьями:
  • Поиск слов word search
  • Поиск слов word of wonders
  • Поиск слагаемых под заданную сумму excel
  • Поиск скачать бесплатно word
  • Поиск символов с конца строки excel