Extract data from Word

Here’s some code that uses late binding (the Word objects are declared As Object rather than as Word.Application etc.). From Excel 2003, it

  1. opens a WORD document
  2. searches for string «minimum stock»
  3. moves the cursor some lines/words further
  4. expands/selects the WORD cursor
  5. pastes this WORD selection into EXCEL

steps 2-5 are repeated for «Period of report:» (note that the «:» is a word boundary, so we need to jump 8 words to the right to arrive at the date)

For WORD I copied the text from your Q just as is (no table, just plain text). If you use tables instead, you may need to play with the units of the various Move statements (e.g. for cells unit:=12); the strategy remains the same: find a constant text, move cursor to final destination, expand selection, create a word range and transfer.

Both items are placed into the current cell in Excel and its right neighbor.

Sub GrabUsage()
Dim FName As String, FD As FileDialog
Dim WApp As Object, WDoc As Object, WDR As Object
Dim ExR As Range

    Set ExR = Selection ' current location in Excel Sheet

    'let's select the WORD doc
    Set FD = Application.FileDialog(msoFileDialogOpen)
    FD.Show
    If FD.SelectedItems.Count <> 0 Then
        FName = FD.SelectedItems(1)
    Else
        Exit Sub
    End If

    ' open Word application and load doc
    Set WApp = CreateObject("Word.Application")
    ' WApp.Visible = True
    Set WDoc = WApp.Documents.Open(FName)

    ' go home and search
    WApp.Selection.HomeKey Unit:=6
    WApp.Selection.Find.ClearFormatting
    WApp.Selection.Find.Execute "Minimum Stock"

    ' move cursor from find to final data item
    WApp.Selection.MoveDown Unit:=5, Count:=1
    WApp.Selection.MoveRight Unit:=2, Count:=2

    ' the miracle happens here
    WApp.Selection.MoveRight Unit:=2, Count:=1, Extend:=1

    ' grab and put into excel        
    Set WDR = WApp.Selection
    ExR(1, 1) = WDR ' place at Excel cursor

    'repeat
    WApp.Selection.HomeKey Unit:=6
    WApp.Selection.Find.ClearFormatting
    WApp.Selection.Find.Execute "Period of Report:"
    WApp.Selection.MoveRight Unit:=2, Count:=8
    WApp.Selection.MoveRight Unit:=2, Count:=3, Extend:=1

    Set WDR = WApp.Selection
    ExR(1, 2) = WDR ' place in cell right of Excel cursor

    WDoc.Close
    WApp.Quit

End Sub

You can create a button and call that sub from there, or link GrabUsage() to a function key.

I commented out the WApp.Visible = True because in production you don’t want WORD even to show up, but you will need it for debugging and playing with the cursor movements.

The disadvantage of late binding (and not using references to the Word library) is the hardcoding of units (6=story, 5=line, 2=word) instead of using Word enumerations, but I sometimes get OS crashes with early binding …. not very sexy but it seems to work.

The FileDialog object needs a reference to the Microsoft Office Object Library. AFAIK this is standard in Excel 2003, but better to check than to crash.

And I didn’t include code to check if the items are really found; I leave this to your creativity.
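To make the strategy and the missing "was it really found?" check concrete, here is a minimal sketch in Python rather than VBA. It works on a plain string instead of a live Word document, and the tokenizer only roughly imitates how Word counts punctuation as separate words. The names tokenize and grab_after and the sample text are hypothetical, not part of the macro above.

```python
import re

def tokenize(text):
    # Rough imitation of Word's idea of a "word": runs of word characters,
    # with every punctuation mark counted as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def grab_after(text, anchor, skip, take):
    """Return `take` tokens located `skip` tokens after the start of `anchor`.

    Returns None when the anchor is missing or the selection would run
    past the end of the text, so the caller can tell "not found" apart
    from a real value.
    """
    tokens = tokenize(text)
    anchor_tokens = tokenize(anchor)
    n = len(anchor_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == anchor_tokens:
            start = i + skip
            chunk = tokens[start:start + take]
            return " ".join(chunk) if len(chunk) == take else None
    return None

# Hypothetical sample text, not taken from the original question.
sample = "Period of Report: 01 March 2010 Minimum Stock 250 units"
print(grab_after(sample, "Minimum Stock", skip=2, take=1))
print(grab_after(sample, "Safety Stock", skip=2, take=1))
```

In the VBA version the equivalent check would be to test the Boolean result of Find.Execute before moving the cursor.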

Hope that helps.

Word and Excel don’t usually get along too well, so it’s no surprise that Power Query isn’t directly compatible with its estranged cousin Word either. If you need to import data from Word into Power Query, you’ll be pleased to hear it is possible; however, it requires a couple of manual steps to make it work.

The manual steps could fairly easily be completed by a batch file which would automate the process further.

Here is the Excel data pasted ‘as values’ in a Word file, which I’ll use for the first example.

Here is the Excel data pasted with ‘keep source formatting’, which I’ll reference a couple of times in the article.

Although the steps covered below aren’t complex, the process has some unknowns, so your result may vary from mine. The Word file I’ve used contains a range of Excel cells that I deliberately pasted as values into Word to create a test file for this example. When I repeated the process with another file, pasted using the default option (keep source formatting), the example below didn’t work because the data sat in a different place in the XML structure of the document.xml file. After a bit of exploratory work I located the data, which allowed the method to work. To summarise: if your Word file is formatted differently from this example, you may need to open the XML file in a browser to locate the values from the table, and then you’ll be able to navigate to them in Power Query.

This is how I found the data in the XML structure for the two examples I mentioned earlier:

Word file containing a table with data pasted as values

body, p, r  (to reveal column ‘t’ which contains the values from the table)

Word file containing table with data pasted with ‘keep source formatting’

body, tbl, tr, tc, p, r (to reveal column ‘t’ with the values from the table)
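To see the two mappings side by side outside Power Query, here is a small Python sketch using the standard library’s ElementTree. The two XML snippets are hand-written, heavily simplified stand-ins for a real document.xml (a real file carries many more elements and attributes), and the helper name texts is hypothetical. The namespace URL is the WordprocessingML one that also shows up in the M code at the end of the article.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Path to the 't' elements when the data was pasted as values.
PLAIN_PATH = "w:body/w:p/w:r/w:t"
# Path to the 't' elements when the data was pasted as a formatted table.
TABLE_PATH = "w:body/w:tbl/w:tr/w:tc/w:p/w:r/w:t"

def texts(xml_string, path):
    """Return the text of every element reached by `path` from the document root."""
    root = ET.fromstring(xml_string)
    return [t.text for t in root.findall(path, NS)]

# Simplified, hypothetical samples of the two layouts.
plain = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    "<w:r><w:t>Region</w:t></w:r>"
    "<w:r><w:tab/><w:t>Sales</w:t></w:r>"
    "</w:p></w:body></w:document>"
)
table = (
    f'<w:document xmlns:w="{W}"><w:body><w:tbl><w:tr>'
    "<w:tc><w:p><w:r><w:t>Region</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>Sales</w:t></w:r></w:p></w:tc>"
    "</w:tr></w:tbl></w:body></w:document>"
)

print(texts(plain, PLAIN_PATH))
print(texts(table, TABLE_PATH))
```

Running the wrong path against a layout simply returns an empty list, which matches the experience described above: the values are still in the file, just at a different place in the tree.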

Even with these caveats, this process is one that demonstrates some of the potential of Power Query when combined with some creative manipulation of source data.

Warning! This method will make the file unopenable in Word, so I’d suggest making a copy of the file and working with the copy.

I must give a nod to Matt Allington who documented a similar process with PDF files https://exceleratorbi.com.au/import-tabular-data-pdf-using-power-query/ which possibly had some influence in Microsoft providing a PDF connector into Power Query in 2019. I would assume that the demand for a Word connector would be significant so I expect this article to have a limited lifespan.

The M code from these queries is at the bottom of this article.

The manual steps

  • In Windows Explorer, change the file extension from .docx to .zip and then extract (unzip) the file to a new folder. This will create a folder with the following data within it

  • In the ‘word’ folder there will be additional items; if you look at it you’ll notice the ‘document.xml’ file is larger than the other files. This is the file we’re interested in
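The two manual steps can also be scripted. The following is a minimal Python sketch, offered as one possible alternative to a batch file rather than the method used in this article. It relies on the fact that a .docx file is already a ZIP archive, so it can be read without renaming it and the original stays openable in Word. The function name extract_document_xml and the example paths are hypothetical.

```python
import zipfile
from pathlib import Path

def extract_document_xml(docx_path, out_dir):
    """Copy word/document.xml out of a .docx file, leaving the original untouched."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(docx_path) as archive:
        # The main body of the document lives under this member name.
        archive.extract("word/document.xml", path=out_dir)
    # ZipFile.extract keeps the folder structure, so the file ends up here.
    return out_dir / "word" / "document.xml"

# Example call (hypothetical paths):
# xml_path = extract_document_xml("TableInWordDoc.docx", "extracted")
```

The returned path is the file to point the Power Query XML connector at in the next section.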

Connecting Power Query to the document.xml file

In Power Query, use the ‘new source, more’ option to display the available connectors

Select the XML option, navigate to the ‘document.xml’ file and select it. Then you’ll be presented with the navigator window which can be used to load only portions of the file. See the comment earlier about the XML location in the file. In this example, the data in the Word file has been pasted as values so I’ll need the following selection.

Once loaded, column ‘r’ needs to be expanded which can be done by clicking the ‘expand’ option in the header row

This will then display the ‘t’ column which contains the values from the table in the word file

Then we’ll need to remove all the other columns apart from the ‘t’ column using the ‘Remove Other Columns’ step

For some reason there are some ‘null’ values in the first few rows, which I know are not values from the table, so we’ll remove them using the filter options

This will leave the following data in the ‘t’ column

This post could end here, as the column displayed above is (technically) the contents of the Word file. However, the structure isn’t the same as the table in the Word file. To make it the same we need to do some transformations.

Transforming the data to match the word file

I suspect each application of this approach may vary, so this method is tailored to the file I’m using. It is possible your attempt creates something identical; if that’s the case, the following should work for you too

To transform the data we need to do the following

  • Create a column that groups each data point into a row
  • Create a column that groups each data point into a column
  • Pivot the data so it is displayed in columns and rows
  • Tidy up any headers/helper columns so we’re left with just the data we need

Column to group each data point into rows

The first 6 items in the above list are the items in the first row with the next 6 items in the next row. We need to have something in an adjacent column that can be used to group these items accordingly

First we’ll add a new index column from 0 using the following ribbon option

Next we’ll select the ‘Index’ column and then divide it by 6 using the ‘Standard -> divide’ options in the ‘Transform’ ribbon. Note these options are also in the ‘new column’ ribbon but we want to transform an existing column (not add a new column) so we use the ones in the ‘transform’ ribbon.

Next we’ll use the ’round down’ option in the same ribbon under ’rounding’

Now we have a column that identifies the row number of each data point, matching the table in the Word file. The last step is to rename the column to ‘RowID’

Column to group each data point into columns

Now that we have a column to identify the row, we need another to identify the column.

Add another index column from zero

Next we’ll add another column that multiplies the RowID column by 6

Next we’ll need to select both the new index and the Multiplication columns and then in the ‘add column’ ribbon, select ‘Standard, Subtract’ which will create a new column which will contain the column number

Rename the ‘Subtraction’ column to ‘ColumnID’

Lastly, delete the ‘Index’ and ‘Multiplication’ columns
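The arithmetic behind the two helper columns is easier to check away from the ribbon. Here is a short Python sketch of the same calculation, assuming 6 columns as in this example; the function name add_ids and the sample values are hypothetical.

```python
def add_ids(values, columns=6):
    """Attach a RowID and ColumnID to each item of a flat list of cell values."""
    result = []
    for index, value in enumerate(values):
        # Index divided by 6 and rounded down gives the row number.
        row_id = index // columns
        # Index minus (RowID * 6) gives the position within that row.
        column_id = index - row_id * columns
        result.append((value, row_id, column_id))
    return result

# Hypothetical sample: a header row followed by one data row.
sample = ["Region", "Q1", "Q2", "Q3", "Q4", "Total",
          "North", "10", "12", "9", "14", "45"]
for row in add_ids(sample):
    print(row)
```

With 6 columns the ColumnID runs from 0 to 5 and restarts on each new row, which is what makes it usable as the pivot column in the next step.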

Pivot the data so it’s displayed in columns and rows

Now that we have 3 columns (one with the data to display in each cell, another with the RowID and the third with the ColumnID), we can pivot the data into a table.

Rename column ‘t’ to ‘Values’. This isn’t strictly part of the process but it’ll make sense after the next step.

Select ‘ColumnID’ and then click ‘Pivot Column’ which is located in the ‘Transform’ ribbon

Note: when pivoting data, it’s useful to remember that the selected column is the one that will become the column headers.

This will present the following selection

As you can see, there is a single drop-down titled ‘Values Column’. Here we want the column that holds the information to be displayed in each cell, in this instance the ‘Values’ column (see what I did there!)

By default, the pivot function assumes that the data in the values column needs to be counted. As we just want to display the single value in the ‘Values’ column, we need to change the aggregation option. To do this, click on the ‘Advanced options’ toggle to display the advanced options. Once they are displayed, change the ‘Aggregate Value Function’ drop-down to ‘Don’t Aggregate’ and click ‘OK’.

This will return the following table. Notice how the values in the ColumnID column (0 to 5) are now displayed, once, in the header row.

All that’s left is to remove the ‘RowID’ column and then promote the first row as headers and the table is identical to the one we saw at the start of the article in Word.
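As a cross-check on what the pivot and the header promotion produce, here is a minimal Python sketch that turns the same kind of flat list into rows keyed by the first row’s values. It assumes 6 columns and no missing cells; the function name to_table and the sample values are hypothetical.

```python
def to_table(values, columns=6):
    """Reshape a flat list of cell values into a list of row dictionaries."""
    # Group the flat list into rows of `columns` cells (the RowID grouping).
    rows = [values[i:i + columns] for i in range(0, len(values), columns)]
    # Promote the first row to headers, as in the final Power Query step.
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]

# Hypothetical sample: a header row followed by two data rows.
sample = ["Region", "Q1", "Q2", "Q3", "Q4", "Total",
          "North", "10", "12", "9", "14", "45",
          "South", "7", "8", "11", "6", "32"]
for record in to_table(sample):
    print(record)
```

If a cell were missing in the source, every later value would shift by one position, so it is worth confirming that the number of values is a multiple of the column count before reshaping.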

M Code used to create these queries

Excel data pasted in Word as values

let
    Source = Xml.Tables(File.Contents("C:Users…DocumentsDataSampleDataTableInWordDocworddocument.xml")),
    Table0 = Source{0}[Table],
    Table1 = Table0{0}[Table],
    #"Expanded r" = Table.ExpandTableColumn(Table1, "r", {"t", "tab"}, {"t", "tab"}),
    #"Removed Other Columns" = Table.SelectColumns(#"Expanded r",{"t"}),
    #"Filtered Rows" = Table.SelectRows(#"Removed Other Columns", each ([t] <> null))
in
    #"Filtered Rows"

Excel data pasted in Word as Values with transformation steps

let
    Source = Xml.Tables(File.Contents("C:Users…DocumentsDataSampleDataTableInWordDocworddocument.xml")),
    Table0 = Source{0}[Table],
    Table1 = Table0{0}[Table],
    #"Expanded r" = Table.ExpandTableColumn(Table1, "r", {"t", "tab"}, {"t", "tab"}),
    #"Removed Other Columns" = Table.SelectColumns(#"Expanded r",{"t"}),
    #"Filtered Rows1" = Table.SelectRows(#"Removed Other Columns", each ([t] <> null)),
    #"Added Index" = Table.AddIndexColumn(#"Filtered Rows1", "Index", 0, 1, Int64.Type),
    #"Divided Column" = Table.TransformColumns(#"Added Index", {{"Index", each _ / 6, type number}}),
    #"Rounded Down" = Table.TransformColumns(#"Divided Column",{{"Index", Number.RoundDown, Int64.Type}}),
    #"Renamed Columns" = Table.RenameColumns(#"Rounded Down",{{"Index", "RowID"}}),
    #"Added Index1" = Table.AddIndexColumn(#"Renamed Columns", "Index", 0, 1, Int64.Type),
    #"Inserted Multiplication" = Table.AddColumn(#"Added Index1", "Multiplication", each [RowID] * 6, type number),
    #"Inserted Subtraction" = Table.AddColumn(#"Inserted Multiplication", "Subtraction", each [Index] - [Multiplication], type number),
    #"Renamed Columns1" = Table.RenameColumns(#"Inserted Subtraction",{{"Subtraction", "ColumnID"}}),
    #"Removed Columns" = Table.RemoveColumns(#"Renamed Columns1",{"Index", "Multiplication"}),
    #"Renamed Columns2" = Table.RenameColumns(#"Removed Columns",{{"t", "Values"}}),
    #"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU"), List.Distinct(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU")[ColumnID]), "ColumnID", "Values"),
    #"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column",{"RowID"}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Columns1", [PromoteAllScalars=true])
in
    #"Promoted Headers"

Excel data pasted in Word as ‘keep source formatting‘ with transformation steps

let
    Source = Xml.Tables(File.Contents("C:Users…DocumentsDataSampleDataTableInWord2worddocument.xml")),
    Table = Source{0}[Table],
    Table1 = Table{0}[Table],
    Table2 = Table1{2}[Table],
    #"Expanded tc" = Table.ExpandTableColumn(Table2, "tc", {"tcPr", "p"}, {"tcPr", "p"}),
    #"Expanded p" = Table.ExpandTableColumn(#"Expanded tc", "p", {"pPr", "proofErr", "r", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}, {"pPr", "proofErr", "r", "http://schemas.openxmlformats.org/wordprocessingml/2006/main.1"}),
    #"Expanded r" = Table.ExpandTableColumn(#"Expanded p", "r", {"rPr", "t", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}, {"rPr", "t", "http://schemas.openxmlformats.org/wordprocessingml/2006/main.2"}),
    #"Removed Other Columns" = Table.SelectColumns(#"Expanded r",{"t"}),
    #"Filtered Rows1" = Table.SelectRows(#"Removed Other Columns", each ([t] <> null)),
    #"Added Index" = Table.AddIndexColumn(#"Filtered Rows1", "Index", 0, 1, Int64.Type),
    #"Divided Column" = Table.TransformColumns(#"Added Index", {{"Index", each _ / 6, type number}}),
    #"Rounded Down" = Table.TransformColumns(#"Divided Column",{{"Index", Number.RoundDown, Int64.Type}}),
    #"Renamed Columns" = Table.RenameColumns(#"Rounded Down",{{"Index", "RowID"}}),
    #"Added Index1" = Table.AddIndexColumn(#"Renamed Columns", "Index", 0, 1, Int64.Type),
    #"Inserted Multiplication" = Table.AddColumn(#"Added Index1", "Multiplication", each [RowID] * 6, type number),
    #"Inserted Subtraction" = Table.AddColumn(#"Inserted Multiplication", "Subtraction", each [Index] - [Multiplication], type number),
    #"Renamed Columns1" = Table.RenameColumns(#"Inserted Subtraction",{{"Subtraction", "ColumnID"}}),
    #"Removed Columns" = Table.RemoveColumns(#"Renamed Columns1",{"Index", "Multiplication"}),
    #"Renamed Columns2" = Table.RenameColumns(#"Removed Columns",{{"t", "Values"}}),
    #"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU"), List.Distinct(Table.TransformColumnTypes(#"Renamed Columns2", {{"ColumnID", type text}}, "en-AU")[ColumnID]), "ColumnID", "Values"),
    #"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column",{"RowID"}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Columns1", [PromoteAllScalars=true])
in
    #"Promoted Headers"



This blog explains in detail how to extract information from Word documents locally. Since many companies and roles are inseparable from the Microsoft Office Suite, this is a useful blog for anyone faced with data passed around in .doc or .docx formats.

As a prerequisite, you will need Python installed on your computer. Those of you doing this at work most likely do not have admin rights. This blog explains how to install Anaconda on a Windows machine without admin rights.

You can find the notebook supporting this blog here.

Image created with Microsoft Word and Google searches for "Microsoft Word Logo" and "Python Logo"

We will take advantage of the XML make-up of every Word document. From there, we will use the regular expression library to find each URL in the document text and then add the URLs to a list, which is perfect for running for-loops.

# specific to extracting information from word documents
import os
import zipfile
# other tools useful in extracting the information from our document
import re
# to pretty print our xml:
import xml.dom.minidom

  • os lets you navigate and find relevant files on your operating system
  • zipfile lets you extract the XML from the file
  • xml.dom.minidom parses the XML code

First we need to direct our code to open the files where they are stored. To see this from our notebook (instead of opening a file explorer), we can use os. Although knowing the path to the file of interest removes the need for os in this simple example, the library can later be used to build a list of the documents stored in the target folder. Having such a list is handy if you want to write a for-loop that extracts information from all Word documents stored in a folder.

To see a list of files in your current directory, use a single period in the os file path:

os.listdir('.')

To see a list of files in the directory above your current location, use a double period:

os.listdir('..')
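Building on os.listdir, here is a minimal sketch of how the list of documents for a later for-loop might be built. The function name list_word_documents is hypothetical, and the filter only looks at the file extension.

```python
import os

def list_word_documents(folder="."):
    """Return the .docx file names found directly inside `folder`, sorted by name."""
    return sorted(
        name for name in os.listdir(folder)
        # Compare in lower case so that 'Report.DOCX' is picked up too.
        if name.lower().endswith(".docx")
    )

# Example: loop over every Word document in the current directory.
for name in list_word_documents("."):
    print(name)
```

Older .doc files are deliberately left out, because they are not ZIP archives and cannot be opened with zipfile.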

Once you have found where your Word documents are stored, you can turn the file path into a zipfile.ZipFile object, which can be read for our purposes.

The ZIP file format is a common archive and compression standard.

https://docs.python.org/3/library/zipfile.html

document = zipfile.ZipFile('../docs/TESU CBE 29 Employee Job Description Evaluation - Final Approved.docx')  # document will be the filetype zipfile.ZipFile

Now, .read() on a zipfile.ZipFile object requires a name argument, which is different from the file name or file path.

ZipFile.read(name, pwd=None)

To see examples of the available names, we can use the .namelist() method:

document.namelist()

In the Jupyter Notebook for this blog I explore some of these names to show what they are. The name that holds the body text of the Word document is 'word/document.xml'.

I found the pretty-printing technique in StackOverflow user Nate Bolton's answer to the question: Pretty printing XML in Python.

We will only use pretty printing to help us identify patterns in the XML for extracting our data. Personally, I don't know XML very well, so I will rely on syntactic patterns to find each URL in the text of our Word document. If you already know the syntactic pattern for extracting your data, you may not need to pretty print it.

In our example, we find that the characters >http and < surround each hyperlink contained in the document text.

To accomplish our task, I need to collect all the text between those characters. To work out how to do this with regular expressions, I used the following StackOverflow question, which contains what I'm looking for in its first query: Regular Expression to find a string included between two characters while EXCLUDING the delimiters.

While I want to keep the http, I don't want to keep the < or >. I will make these changes to my list items using string slicing and a list comprehension.

link_list = re.findall('http.*?<',xml_str)[1:]
link_list = [x[:-1] for x in link_list]
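Putting the pieces together, here is a self-contained sketch of the whole path from file to list of links. It is a variation on the two lines above rather than the notebook's exact code: it uses a capturing group so that the surrounding > and < are dropped by the regular expression itself, which removes the need for the slicing step. The function name extract_urls is hypothetical.

```python
import re
import zipfile

def extract_urls(docx_path):
    """Return every text run in word/document.xml that starts with 'http'."""
    with zipfile.ZipFile(docx_path) as document:
        # The body text of a Word document is stored under this name.
        xml_str = document.read("word/document.xml").decode("utf-8")
    # Capture text that sits directly between '>' and '<' and begins with http.
    # Namespace declarations are not matched because they follow a quote
    # character, not a '>'.
    return re.findall(r">(http[^<]*)<", xml_str)

# Example call (hypothetical file name):
# link_list = extract_urls("example.docx")
```

This only finds URLs that appear as visible text in the document. Characters such as & are stored in escaped form (&amp;) in the XML, so links containing them may need unescaping afterwards.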

To see the full Jupyter Notebook behind this blog, click here!

If you are interested in creating and writing MS Word documents using Python, check out the python-docx library.

There are other methods of extracting text and information from Word documents, such as the docx2txt and docx libraries featured in the answers to the following Python Forum post.


This is a sister blog to my article about Thomas Edison State University's (TESU) open source materials accessibility initiative. Medium post / GitHub repository

The original blog I wrote for that project was very specific and bordered on unwieldy, so this blog is the first in a series that I will write to dive deeper into individual aspects of the TESU project and make the material more accessible.

Say someone sent you a Word document with a lot of images, and you want to save those images on your hard drive. You can extract images from a Microsoft Office document with a simple trick.

If you have a Word (.docx), Excel (.xlsx), or PowerPoint (.pptx) file with images or other files embedded, you can extract them (as well as the document’s text), without having to save each one separately. And best of all, you don’t need any extra software. The Office XML based file formats–docx, xlsx, and pptx–are actually compressed archives that you can open like any normal .zip file with Windows. From there, you can extract images, text, and other embedded files. You can use Windows’ built-in .zip support, or an app like 7-Zip if you prefer.

If you need to extract files from an older office document–like a .doc, .xls, or .ppt file–you can do so with a small piece of free software. We’ll detail that process at the end of this guide.

How to Extract the Contents of a Newer Office File (.docx, .xlsx, or .pptx)

To access the inner contents of an XML based Office document, open File Explorer (or Windows Explorer in Windows 7), navigate to the file from which you want to extract the content, and select the file.

Press “F2” to rename the file and change the extension (.docx, .xlsx, or .pptx) to “.zip”. Leave the main part of the filename alone. Press “Enter” when you’re done.

The following dialog box displays warning you about changing the file name extension. Click “Yes”.

Windows automatically recognizes the file as a zipped file. To extract the contents of the file, right-click on the file and select “Extract All” from the popup menu.

On the “Select a Destination and Extract Files” dialog box, the path where the content of the .zip file will be extracted displays in the “Files will be extracted to this folder” edit box. By default, a folder with the same name as the name of the file (without the file extension) is created in the same folder as the .zip file. To extract the files to a different folder, click “Browse”.

Navigate to where you want the content of the .zip file extracted, clicking “New folder” to create a new folder, if necessary. Click “Select Folder”.

To open a File Explorer (or Windows Explorer) window with the folder containing the extracted files showing once they are extracted, select the “Show extracted files when complete” check box so there is a check mark in the box. Click “Extract”.

How to Access the Extracted Images

Included in the extracted contents is a folder named “word”, if your original file is a Word document (or “xl” for an Excel document or “ppt” for a PowerPoint document). Double-click on the “word” folder to open it.

Double-click the “media” folder.

All the images from the original file are in the “media” folder. The extracted files are the original images used by the document. Inside the document, there may be resizing or other properties set, but the extracted files are the raw images without these properties applied.
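If you would rather not rename and unzip by hand, the same "media" folder can be read with a few lines of Python. This is a minimal sketch under the assumption that the images sit directly in a folder named "media" one level below the archive root (word/media, xl/media or ppt/media); the function name extract_media is hypothetical.

```python
import zipfile
from pathlib import Path

def extract_media(office_path, out_dir):
    """Save every file from the archive's 'media' folder into `out_dir`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(office_path) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            # Matches word/media/<file>, xl/media/<file> and ppt/media/<file>.
            if len(parts) == 3 and parts[1] == "media" and parts[2]:
                target = out_dir / parts[2]
                target.write_bytes(archive.read(name))
                saved.append(target)
    return saved

# Example call (hypothetical paths):
# extract_media("Sample.docx", "Sample images")
```

Because the file is opened read-only as a ZIP archive, there is no need to change its extension first.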

How to Access the Extracted Text

If you don’t have Office installed on your PC, and you need to extract text out of a Word (or Excel or PowerPoint) file, you can access the extracted text in the “document.xml” file in the “word” folder.

You can open this file in a text editor, such as Notepad or WordPad, but it’s easier to read in a special XML editor, such as the free program, XML Notepad. All the text from the file is available in chunks of plain text regardless of the style and/or formatting applied in the document itself. Of course, if you’re going to download free software to view this text, you might as well download LibreOffice, which can read Microsoft Office documents.

How to Extract Embedded OLE Objects or Attached Files

To access embedded files in a Word document when you don’t have access to Word, first open the Word file in WordPad (which comes built into Windows). You might notice that some of the embedded file icons do not display, but they’re still there. Some of the embedded files might have partial filenames. WordPad does not support all of Word’s features, so some content might be displayed improperly. But you should be able to access the files.

If we right-click on one of the embedded files in our sample Word file, one of the options is “Open PDF Object”. This opens the PDF file in the default PDF reader program on your PC. From there, you can save the PDF file to your hard drive.

If WordPad doesn’t have an option for opening your file, make note of its file type here. For example, our second file in this document is a .mp3 file.

Then, go back to your “Files from [Document]” folder and double-click the “embeddings” folder inside the “word” folder.

Unfortunately, the file types are not preserved in the filenames. They all have a “.bin” file extension instead. If you know what types of files are embedded in the file, you can probably deduce which file is which by the size of the file. In our example, we had a PDF file and an MP3 file embedded in our document. Because the MP3 file is most likely larger than the PDF file, we can figure out which file is which by looking at the sizes of the files and then rename them using the correct extensions. Below, we’re renaming the MP3 file.

Note that not all files will necessarily open using this process–for example, our PDF file opened correctly from WordPad, but we couldn’t get it to open by renaming its .bin file.
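For documents with many embedded files, comparing sizes by eye gets tedious. Here is a small Python sketch that lists the contents of the "embeddings" folder of a Word file, largest first, so the size-based guess described above can be made from a single listing. The function name embedded_files_by_size is hypothetical, and the sketch only reports sizes; it does not try to identify or convert the files.

```python
import zipfile

def embedded_files_by_size(docx_path):
    """Return (size_in_bytes, name) pairs for word/embeddings, largest first."""
    with zipfile.ZipFile(docx_path) as archive:
        items = [
            (info.file_size, info.filename)  # file_size is the uncompressed size
            for info in archive.infolist()
            if info.filename.startswith("word/embeddings/") and not info.is_dir()
        ]
    return sorted(items, reverse=True)

# Example call (hypothetical file name):
# for size, name in embedded_files_by_size("Sample.docx"):
#     print(f"{size:>12,}  {name}")
```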

Once you’ve extracted the content of the zipped file, you can revert the extension of the original file back to .docx, .xlsx, or .pptx. The file will remain intact and can be opened normally in the corresponding program.

How to Extract Images from Older Office Documents (.doc, .xls, or .ppt)

If you need to extract images from an Office 2003 (or earlier) document, there’s a free tool called Office Image Extraction Wizard that makes this task easy. This program also allows you to extract images from multiple documents (of the same or different types) at once. Download the program and install it (there’s also a portable version available if you’d rather not install it).

Run the program, and the Welcome screen displays. Click “Next”.

First, we need to select the file from which you want to extract the images. On the Input & Output screen, click the “Browse” (folder icon) button to the right of the Document edit box.

Navigate to the folder containing the document you want, select it, and click “Open”.

The folder that contains the selected file automatically becomes the Output folder. To create a subfolder within that folder named the same as the selected file, click the “Create a folder here” check box so there is a check mark in the box. Then, click “Next”.

On the Ready to Start screen, click “Start” to begin extracting the images.

The following screen displays while the extraction processes.

On the Finished screen, click the “Click here to open destination folder” to view the resulting image files.

Because we chose to create a subfolder, we get a folder containing the image files extracted from the file.

You will see all the images as numbered files.

You can also extract images from multiple files at once. To do this, on the Input & Output screen, click the “Batch Mode” check box so there is a check mark in the box.

The Batch Input & Output screen displays. Click “Add Files”.

On the Open dialog box, navigate to the folder containing any of the files from which you want to extract images, select the files using the “Shift” or “Ctrl” key to select multiple files, and click “Open”.

You can add files from another folder by clicking “Add Files” again, navigating to the folder on the Open dialog box, selecting the desired files, and clicking “Open”.

Once you’ve added all the files from which you want to extract images, you can choose to create a separate folder for each document within the same folder as each document into which the image files will be saved by clicking the “Create a folder for each document” check box so there is a check mark in the box.

You can also specify the Output folder to be the “Same as each file’s input folder” or enter or select a custom folder using the edit box and “Browse” button below that option. Click “Next” once you have selected the options you want.

Click “Start” on the Ready to Start screen.

The following screen displays showing the extraction progress.

The number of images extracted displays on the Finished screen. Click “Close” to close the Office Image Extraction Wizard.

If you chose to create a separate folder for each document, you will see folders with the same names as the files containing the images, whichever output folder(s) you specified.

Again, we get all the images as numbered files for each document.

Now you can rename the images, move them, and use them in your own documents. Just make sure you have the rights to use them legally.


Extract acronyms, bookmarks, tracked changes and comments from Word

Introduction to DocTools ExtractData

DocTools ExtractData is an add-in for Microsoft Word.

The add-in works with Word 2007, Word 2010, Word 2013, Word 2016, Word 2019, Word 2021, Word for Microsoft 365 on PC / Windows.
The add-in works with both 32-bit and 64-bit versions of Word.

The add-in lets you easily export / extract:

  • acronyms
  • bookmarks
  • tracked changes, or
  • comments

from the active document to a new document. The extracted data, incl. additional metadata, will be listed in a table for easy overview.

DocTools ExtractData adds a set of tools that can be accessed from the group ExtractData in the DocTools tab in the Ribbon. The DocTools tab may also contain tools from other add-ins provided by DocTools.

You will find more details about the add-in below.


The ExtractData group in the DocTools tab.


The Extract Data menu in the ExtractData group.
Find details about these commands in the More details section.


The Help menu in the ExtractData group.

Product version details – DocTools ExtractData

Version number: 1.5 (see Changelog for info about versions and changes)
Release date: October 18, 2021
Supported Word versions: Word 2007, Word 2010, Word 2013, Word 2016, Word 2019, Word 2021, Word for Office 365 on PC / Windows

See also the DocTools CommentManager add-in – an advanced add-in for managing comments in Word. DocTools CommentManager lets you extract comments with extra data compared to DocTools ExtractData. In addition, DocTools CommentManager includes a lot of tools that can help you manage comments quickly and easily.

The story behind DocTools ExtractData

On my website thedoctools.com I have provided free macros for exporting / extracting acronyms, tracked changes and comments for years. The macros have been downloaded by thousands of Word users. DocTools ExtractData combines improved versions of those macros in an add-in. In addition, the add-in lets you export / extract bookmarks and related data from any Word document – this is new functionality I have developed for the DocTools ExtractData add-in. Unlike the free macros, the add-in includes a user interface in the Ribbon for easy access.

As illustrated in the Introduction section, a menu named Extract Data will be found on the DocTools tab when you have installed the DocTools ExtractData add-in. Below you will find details about the individual commands in the Extract Data menu.

More details about the DocTools ExtractData features

About header and footer info in extract documents

Each of the commands in the Extract Data menu creates a new document with the following data in the header:

  • Full name of the source document from which the data was extracted
  • Name of the document creator
  • Creation date of the extract document

The footer will include information about which version of DocTools ExtractData was used to create the extract – NEW in version 1.4.

Extract Acronyms – about the extracted data

Some documents may contain many acronyms (i.e. words formed from the initial letters of multi-word names, e.g. VBA for «Visual Basic for Applications»). It is helpful to include the definition/full name the first time you mention an acronym. Alternatively, you may want to create a list of all the acronyms and include the definitions in the list.

The Extract Acronyms command lets you create such a list, ready for adding the definitions. Any word consisting of 3 or more uppercase letters will be interpreted as an acronym. The acronyms will be filled into a 3-column table. For each acronym, the table will show:

  1. The acronym
  2. An empty cell to be used for definition of the acronym
  3. Page number of first occurrence of the acronym
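
The acronym rule described above can be illustrated outside Word. The following is only a rough sketch in Python, not the add-in's actual code: it assumes the document is already available as a list of page texts, that "uppercase letters" means A-Z only, and that rows are sorted alphabetically. The names `ACRONYM_PATTERN` and `extract_acronyms` are made up for the illustration.

```python
import re

# Assumption: an acronym is a standalone run of 3 or more uppercase A-Z letters.
ACRONYM_PATTERN = re.compile(r"\b[A-Z]{3,}\b")


def extract_acronyms(pages):
    """Return one row per acronym: (acronym, empty definition, first page).

    `pages` is a list of strings, one per page; page numbers start at 1.
    """
    first_page = {}
    for page_number, text in enumerate(pages, start=1):
        for match in ACRONYM_PATTERN.finditer(text):
            # setdefault keeps the page of the first occurrence only
            first_page.setdefault(match.group(), page_number)
    # Assumption: rows are listed alphabetically by acronym
    return [(acronym, "", page) for acronym, page in sorted(first_page.items())]


sample_pages = [
    "VBA is used in Word. The API is simple.",
    "The VBA editor and the HTML export. An ID is not counted.",
]
acronym_rows = extract_acronyms(sample_pages)
```

Two-letter words such as «ID» are skipped because the rule requires at least 3 uppercase letters, and the empty second field corresponds to the definition cell that is left for you to fill in.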

Example of extracted acronyms. The «Definitions» column can be used to add definitions of the acronyms.

Extract Bookmarks – about the extracted data

The bookmarks and metadata will be filled into a 5-column table. For each bookmark, the table will show:

  1. An index number
  2. Page number where the bookmark starts
  3. The name of the bookmark
  4. The part of the document in which the bookmark is found, e.g. «Main text», «First page footer», or «Primary header»
  5. The bookmarked text – if the bookmark marks a position only, the text «(Empty bookmark)» will be shown

When you select the Extract Bookmarks command, a dialog box lets you select whether you want to include hidden bookmarks in the extract. The names of hidden bookmarks start with an underscore. Hidden bookmarks are added by Word e.g. in relation to tables of contents and cross-references. More details about this will be shown in the dialog box. See the illustrations below for examples of extracts excl. and incl. hidden bookmarks.
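
As a rough illustration of the rules above (hidden bookmarks start with an underscore, and empty bookmarks get a placeholder text), here is a minimal Python sketch. It is not the add-in's code: the `Bookmark` record, its field names and the `bookmark_rows` function are hypothetical, and the bookmarks are assumed to be supplied already in the order they should be listed.

```python
from typing import NamedTuple


class Bookmark(NamedTuple):
    """Hypothetical record holding the data listed for each bookmark."""
    name: str
    page: int
    story: str  # part of the document, e.g. "Main text"
    text: str   # empty string if the bookmark marks a position only


def bookmark_rows(bookmarks, include_hidden=False):
    """Build the 5-column rows: index, page, name, document part, text."""
    rows = []
    for bookmark in bookmarks:
        # Names of hidden bookmarks start with an underscore
        if bookmark.name.startswith("_") and not include_hidden:
            continue
        shown_text = bookmark.text if bookmark.text else "(Empty bookmark)"
        rows.append(
            (len(rows) + 1, bookmark.page, bookmark.name, bookmark.story, shown_text)
        )
    return rows


sample_bookmarks = [
    Bookmark("Intro", 1, "Main text", "Introduction"),
    Bookmark("_Toc123", 1, "Main text", "Contents"),
    Bookmark("Signature", 3, "First page footer", ""),
]
visible_rows = bookmark_rows(sample_bookmarks)
all_rows = bookmark_rows(sample_bookmarks, include_hidden=True)
```

The index number is assigned after filtering, so the rows are numbered consecutively whether or not hidden bookmarks are included.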

Example of extracted bookmarks, excl. hidden bookmarks

Example of extracted bookmarks, incl. hidden bookmarks

Extract Changes – about the extracted data

See also the DocTools ExtractChanges Pro add-in – an advanced add-in for extracting insertions, deletions and comments, in full context and including headings and subheadings.
Overview of differences – DocTools ExtractChanges Pro and DocTools ExtractData

The tracked changes and metadata will be filled into a 6-column table. Only insertions and deletions will be extracted. Any other type of change will be ignored. Also note that the command will only include insertions and deletions in the main body of the document (i.e. changes in headers, footers, footnotes and endnotes will not be included). For each insertion or deletion, the table will show:

  1. Page number
  2. Line number
  3. The type of change
  4. The text that was inserted or deleted
  5. Name of the author who made the change
  6. The date the change was made
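
The filtering described above (insertions and deletions only, main body only) can be sketched as follows. This is a simplified Python illustration, not the add-in's code: it assumes each tracked change is given as a dictionary, and the keys `type` and `story` as well as the labels «Inserted», «Deleted» and «Main text» are made-up names for the example.

```python
# Assumption: only these two change types are extracted
EXTRACTED_TYPES = ("Inserted", "Deleted")


def change_rows(revisions):
    """Build the 6-column rows: page, line, type, text, author, date."""
    rows = []
    for revision in revisions:
        if revision["type"] not in EXTRACTED_TYPES:
            continue  # e.g. formatting changes are ignored
        if revision["story"] != "Main text":
            continue  # headers, footers, footnotes and endnotes are skipped
        rows.append(
            (
                revision["page"],
                revision["line"],
                revision["type"],
                revision["text"],
                revision["author"],
                revision["date"],
            )
        )
    return rows


sample_revisions = [
    {"page": 1, "line": 4, "type": "Inserted", "text": "new clause",
     "author": "Ann", "date": "2021-10-18", "story": "Main text"},
    {"page": 1, "line": 9, "type": "Formatted", "text": "bold",
     "author": "Ann", "date": "2021-10-18", "story": "Main text"},
    {"page": 2, "line": 1, "type": "Deleted", "text": "old note",
     "author": "Bob", "date": "2021-10-19", "story": "Footnotes"},
]
extracted_changes = change_rows(sample_revisions)
```

In this sample only the first change ends up in the table: the second is not an insertion or deletion, and the third is located in a footnote.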

Example of extracted changes.

Extract Comments – about the extracted data

See also the DocTools CommentManager add-in – an advanced add-in for managing comments in Word. DocTools CommentManager lets you extract comments with the following extra data compared to DocTools ExtractData: comment number, type of comment (level 1 or reply), line number, and author initials – a total of 9 types of data instead of 5. In addition, DocTools CommentManager includes a lot of tools that can help you manage comments quickly and easily.

The comments and metadata will be filled into a 6-column table. The command will extract comments that have been inserted via Review tab > New Comment. For each comment the table will show:

  1. Page number
  2. Line number – NEW in version 1.4
  3. The text that was commented (i.e. the scope)
  4. The comment itself
  5. Name of the author who inserted the comment
  6. Date when the comment was added, date format dd-MMM-yyyy
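
To make the date format dd-MMM-yyyy concrete, here is a small Python sketch. It assumes English month abbreviations, which are spelled out explicitly so the result does not depend on the system locale. The function name `format_comment_date` is made up for the illustration and is not part of the add-in.

```python
from datetime import date

# Assumption: English three-letter month abbreviations for "MMM"
MONTH_ABBREVIATIONS = (
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
)


def format_comment_date(value):
    """Format a date as dd-MMM-yyyy: 2-digit day, month abbreviation, 4-digit year."""
    month = MONTH_ABBREVIATIONS[value.month - 1]
    return f"{value.day:02d}-{month}-{value.year:04d}"


formatted = format_comment_date(date(2021, 10, 18))
```

For the release date of version 1.5, October 18, 2021, this gives «18-Oct-2021».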

Example of extracted comments
