Extract data from word - Word и Excel - помощь в работе с программами

here’s some code making use of late binding (declare objects rather than word.application etc). From Excel 2003, it

opens a WORD document
searches for string «minimum stock»
moves the cursor some lines/words further
expands/selects the WORD cursor
pastes this WORD selection into EXCEL

steps 2-5 are repeated for «Period of report:» (note that the «:» is a word boundary, so we need to jump 8 words to the right to arrive at the date)

For WORD I copied the text from your Q just as is (no table, just plain text). If you use tables instead, you may need to play with the units of the various Move statements (e.g. for cells unit:=12); the strategy remains the same: find a constant text, move cursor to final destination, expand selection, create a word range and transfer.

Both items are placed into the current cell in Excel and its right neighbor.

Sub GrabUsage()
Dim FName As String, FD As FileDialog
Dim WApp As Object, WDoc As Object, WDR As Object
Dim ExR As Range

    Set ExR = Selection ' current location in Excel Sheet

    'let's select the WORD doc
    Set FD = Application.FileDialog(msoFileDialogOpen)
    FD.Show
    If FD.SelectedItems.Count <> 0 Then
        FName = FD.SelectedItems(1)
    Else
        Exit Sub
    End If

    ' open Word application and load doc
    Set WApp = CreateObject("Word.Application")
    ' WApp.Visible = True
    Set WDoc = WApp.Documents.Open(FName)

    ' go home and search
    WApp.Selection.HomeKey Unit:=6
    WApp.Selection.Find.ClearFormatting
    WApp.Selection.Find.Execute "Minimum Stock"

    ' move cursor from find to final data item
    WApp.Selection.MoveDown Unit:=5, Count:=1
    WApp.Selection.MoveRight Unit:=2, Count:=2

    ' the miracle happens here
    WApp.Selection.MoveRight Unit:=2, Count:=1, Extend:=1

    ' grab and put into excel        
    Set WDR = WApp.Selection
    ExR(1, 1) = WDR ' place at Excel cursor

    'repeat
    WApp.Selection.HomeKey Unit:=6
    WApp.Selection.Find.ClearFormatting
    WApp.Selection.Find.Execute "Period of Report:"
    WApp.Selection.MoveRight Unit:=2, Count:=8
    WApp.Selection.MoveRight Unit:=2, Count:=3, Extend:=1

    Set WDR = WApp.Selection
    ExR(1, 2) = WDR ' place in cell right of Excel cursor

    WDoc.Close
    WApp.Quit

End Sub

You can create a button and call that sub from there, or link GrabUsage() to a function key.

I commented out the WApp.Visible = True because in production you don’t want WORD even to show up, but you will need it for debugging and playing with the cursor movements.

The disadvantage of late binding (and not using references to the Word library) is the hardcoding of units (6=story, 5=line, 2=word) instead of using Word enumerations, but I sometimes get OS crashes with early binding …. not very sexy but it seems to work.

The FileDialog object needs a reference to the MS Office Office Library. AFAIK this is standard in Excel 2003, but better to check than to crash.

And I didn’t include code to check if the items are really found; I leave this to your creativity.

Hope that helps.

Источник

Word and Excel don’t usually get along too well so it’s no surprise that Power Query isn’t directly compatible with its estranged cousin Word either. If you are presented with the need to import data from Word into Power Query you’ll be please to hear it is possible however it requires a couple of manual steps to make it work.

The manual steps could fairly easily be completed by a batch file which would automate the process further.

Here is the Excel data pasted ‘as values’ in a Word file which i’ll use for the first example

Here is the Excel data pasted with ‘keep source formatting’ which i’ll reference a couple of times in the article.

Although the steps I’ve covered below aren’t complex, this whole process has some unknowns around it so you may find the result in your instance varies from mine. The Word file I’ve used contains the contents of a range of excel cells that I deliberately pasted as values into Word to create a test file for this example. I’ve just repeated the process with another file where I pasted using the default paste option (keep source formatting) and the below example didn’t work because the data was located in a different place in the XML structure of the document.xml file. After a bit of exploratory work I managed to locate the data which would allow this method to work. To summarise, if your word file is formatted differently to this example, you may need to open the xml file in a browser to locate the values from the table and then you’ll be able to navigate to it in Power Query.

Here is the XML Mapping from the two examples i’ve mentioned

This is how I found the data in the XML structure for the 2 examples I mentioned earlier

Word file containing a table with data pasted as values

body, p, r (to reveal column ‘t’ which contains the values from the table)

Word file containing table with data pasted with ‘keep source formatting’

body, tbl, tr, tc, p, r (to reveal column ‘t’ with the values from the table)

Even with these caveats, this process is one that demonstrates some of the potential of Power Query when combined with some creative manipulation of source data.

Warning! This method will make the file unopenable in Word, I’d suggest making a copy of the file and working with the copy.

I must give a nod to Matt Allington who documented a similar process with PDF files https://exceleratorbi.com.au/import-tabular-data-pdf-using-power-query/ which possibly had some influence in Microsoft providing a PDF connector into Power Query in 2019. I would assume that the demand for a Word connector would be significant so I expect this article to have a limited lifespan.

The M code from these queries is at the bottom of this article.

The manual steps

In windows Explorer, change the file extension from .docx to .zip and then extract (unzip) the file to a new folder. This will create a folder with the following data within it

In the ‘word’ folder there will be additional items, if you look at it you’ll notice the ‘document.xml’ file is larger than the other files. This is the file we’re interested in

Connecting Power Query to the document.xml file

In Power Query, use the ‘new source, more’ option to display the available connectors

Select the XML option, navigate to the ‘document.xml’ file and select it. Then you’ll be presented with the navigator window which can be used to load only portions of the file. See the comment earlier about the XML location in the file. In this example, the data in the Word file has been pasted as values so I’ll need the following selection.

Once loaded, column ‘r’ needs to be expanded which can be done by clicking the ‘expand’ option in the header row

This will then display the ‘t’ column which contains the values from the table in the word file

Then we’ll need to remove all the other columns apart from the ‘t’ column using the ‘Remove Other Columns’ step

For some reason there are some ‘null’ values in the first few rows which I know are not columns so we’ll remove them using the filter options

This will leave the following data in the ‘t’ column

This post could end here as the column displayed above is (technically) the contents of the word file. However, the structure isnt the same as the table in the word file. To make it the same we need to do some transformations.

Transforming the data to match the word file

I suspect each application of this approach may vary so this method is tailored for the file I’m using. It is possible your attempt creates something identical, if that’s the case then the following should work for you too

To transform the data we need to do the following

Create a column that groups each data point into a row
Create a column that groups each data point into a column
Pivot the data so it is displayed in columns and rows
Tidy up any headers/helper columns so we’re left with just the data we need

Column to group each data point into rows

The first 6 items in the above list are the items in the first row with the next 6 items in the next row. We need to have something in an adjacent column that can be used to group these items accordingly

First we’ll add a new index column from 0 using the following ribbon option

Next we’ll select the ‘Index’ column and then divide it by 6 using the ‘Standard -> divide’ options in the ‘Transform’ ribbon. Note these options are also in the ‘new column’ ribbon but we want to transform an existing column (not add a new column) so we use the ones in the ‘transform’ ribbon.

Next we’ll use the ’round down’ option in the same ribbon under ’rounding’

Now we have a column, that identifies the row number of the datapoint which matches the table in the word file. The last step is to rename the column to ‘RowID’

Column to group each data point into columns

Now we have a column to identify the

Add another index column from zero

Next we’ll add another column that multiplies the RowID column by 6

Next we’ll need to select both the new index and the Multiplication columns and then in the ‘add column’ ribbon, select ‘Standard, Subtract’ which will create a new column which will contain the column number

Rename the ‘Subtraction’ column to ‘ColumnID’

Lastly delete the ‘index’ and ‘multiplication columns

Pivot the data so its displayed in columns and rows

Now we have 3 columns, one with the data to display in each cell, another with the RowID and the third with the ColumnID we can pivot the data into a table.

Rename column ‘t’, ‘Values’. This isn’t strictly part of the process but it’ll make sense after the next step.

Select ‘ColumnID’ and then click ‘Pivot Column’ which is located in the ‘Transform’ ribbon

Note, when pivoting data, Its useful to remember the selected column is the one that will become the column headers.

This will present the following selection

As you can see there is a single drop down option which is titled ‘Values Column’, we want to have the column where the information to be displayed in each cell is located, in this instance the ‘Values’ column (see what I did there!)

By default, the pivot function assumes that the data in the values column needs to be counted. As we just want to display the single value in the ‘values’ column we need to change the aggregation option. To do this, click on the ‘Advanced options’ toggle to display the advanced option… Once it is displayed, change the ‘Aggregate Value Function’ drop down to ‘Don’t Aggregate’ and click ‘ok’.

This will return the following table, notice how the values in the ColumnID column (0 to 6) are now displayed, once, in the header row.

All that’s left is to remove the ‘RowID’ column and then promote the first row as headers and the table is identical to the one we saw at the start of the article in Word.

M Code used to create these queries

Excel data pasted in Word as values

let

Source = Xml.Tables(File.Contents(«C:Users…DocumentsDataSampleDataTableInWordDocworddocument.xml»)),

Table0 = Source{0}[Table],

Table1 = Table0{0}[Table],

#»Expanded r» = Table.ExpandTableColumn(Table1, «r», {«t», «tab»}, {«t», «tab»}),

#»Removed Other Columns» = Table.SelectColumns(#»Expanded r»,{«t»}),

#»Filtered Rows» = Table.SelectRows(#»Removed Other Columns», each ([t] <> null))

#»Filtered Rows»

Excel data pasted in Word as Values with transformation steps

Let

Source = Xml.Tables(File.Contents(«C:Users…DocumentsDataSampleDataTableInWordDocworddocument.xml»)),

Table0 = Source{0}[Table],

Table1 = Table0{0}[Table],

#»Expanded r» = Table.ExpandTableColumn(Table1, «r», {«t», «tab»}, {«t», «tab»}),

#»Removed Other Columns» = Table.SelectColumns(#»Expanded r»,{«t»}),

#»Filtered Rows1″ = Table.SelectRows(#»Removed Other Columns», each ([t] <> null)),

#»Added Index» = Table.AddIndexColumn(#»Filtered Rows1″, «Index», 0, 1, Int64.Type),

#»Divided Column» = Table.TransformColumns(#»Added Index», {{«Index», each _ / 6, type number}}),

#»Rounded Down» = Table.TransformColumns(#»Divided Column»,{{«Index», Number.RoundDown, Int64.Type}}),

#»Renamed Columns» = Table.RenameColumns(#»Rounded Down»,{{«Index», «RowID»}}),

#»Added Index1″ = Table.AddIndexColumn(#»Renamed Columns», «Index», 0, 1, Int64.Type),

#»Inserted Multiplication» = Table.AddColumn(#»Added Index1″, «Multiplication», each [RowID] * 6, type number),

#»Inserted Subtraction» = Table.AddColumn(#»Inserted Multiplication», «Subtraction», each [Index] — [Multiplication], type number),

#»Renamed Columns1″ = Table.RenameColumns(#»Inserted Subtraction»,{{«Subtraction», «ColumnID»}}),

#»Removed Columns» = Table.RemoveColumns(#»Renamed Columns1″,{«Index», «Multiplication»}),

#»Renamed Columns2″ = Table.RenameColumns(#»Removed Columns»,{{«t», «Values»}}),

#»Pivoted Column» = Table.Pivot(Table.TransformColumnTypes(#»Renamed Columns2″, {{«ColumnID», type text}}, «en-AU»), List.Distinct(Table.TransformColumnTypes(#»Renamed Columns2″, {{«ColumnID», type text}}, «en-AU»)[ColumnID]), «ColumnID», «Values»),

#»Removed Columns1″ = Table.RemoveColumns(#»Pivoted Column»,{«RowID»}),

#»Promoted Headers» = Table.PromoteHeaders(#»Removed Columns1″, [PromoteAllScalars=true])

#»Promoted Headers»

Excel data pasted in Word as ‘keep source formatting‘ with transformation steps

let

Source = Xml.Tables(File.Contents(«C:Users…DocumentsDataSampleDataTableInWord2worddocument.xml»)),

Table = Source{0}[Table],

Table1 = Table{0}[Table],

Table2 = Table1{2}[Table],

#»Expanded tc» = Table.ExpandTableColumn(Table2, «tc», {«tcPr», «p»}, {«tcPr», «p»}),

#»Expanded p» = Table.ExpandTableColumn(#»Expanded tc», «p», {«pPr», «proofErr», «r», «http://schemas.openxmlformats.org/wordprocessingml/2006/main»}, {«pPr», «proofErr», «r», «http://schemas.openxmlformats.org/wordprocessingml/2006/main.1»}),

#»Expanded r» = Table.ExpandTableColumn(#»Expanded p», «r», {«rPr», «t», «http://schemas.openxmlformats.org/wordprocessingml/2006/main»}, {«rPr», «t», «http://schemas.openxmlformats.org/wordprocessingml/2006/main.2»}),