Reading a Word File in Python

MS Word, part of the Microsoft Office suite, is one of the most commonly used tools for writing text documents, both simple and complex. People can easily read and write MS Word documents if they have the Office software installed, but you often need to read text from Word documents within another application.

For instance, if you are developing a natural language processing application in Python that takes MS Word files as input, you will need to read MS Word files in Python before you can process the text. Similarly, often times you need to write text to MS Word documents as output, which could be a dynamically generated report to download, for example.

In this article, you will see how to read and write MS Word files in Python.

Installing Python-Docx Library

Several libraries exist that can be used to read and write MS Word files in Python. However, we will be using the python-docx module owing to its ease-of-use. Execute the following pip command in your terminal to download the python-docx module as shown below:

$ pip install python-docx

Reading MS Word Files with Python-Docx Module

In this section, you will see how to read text from MS Word files via the python-docx module.

Create a new MS Word file and name it "my_word_file.docx". I saved the file in the root of my E: drive, although you can save the file anywhere you want. The my_word_file.docx file should have the following content:

To read the above file, first import the docx module and then create an object of the Document class from the docx module. Pass the path of the my_word_file.docx to the constructor of the Document class, as shown in the following script:

import docx

doc = docx.Document("E:/my_word_file.docx")

The Document class object doc can now be used to read the content of the my_word_file.docx.

Reading Paragraphs

Once you create an object of the Document class using the file path, you can access all the paragraphs in the document via the paragraphs attribute. An empty line is also read as a paragraph by the Document. Let’s fetch all the paragraphs from the my_word_file.docx and then display the total number of paragraphs in the document:

all_paras = doc.paragraphs
len(all_paras)

Output:

Now we’ll iteratively print all the paragraphs in the my_word_file.docx file:

for para in all_paras:
    print(para.text)
    print("-------")

Output:

-------
Introduction
-------

-------
Welcome to stackabuse.com
-------
The best site for learning Python and Other Programming Languages
-------
Learn to program and write code in the most efficient manner
-------

-------
Details
-------

-------
This website contains useful programming articles for Java, Python, Spring etc.
-------

The output shows all of the paragraphs in the Word file.

We can even access a specific paragraph by indexing the paragraphs property like an array. Let’s print the 5th paragraph in the file:

single_para = doc.paragraphs[4]
print(single_para.text)

Output:

The best site for learning Python and Other Programming Languages

Reading Runs

A run in a Word document is a continuous sequence of characters having the same properties, such as font size, font shape, and font style. For example, the second line of my_word_file.docx contains the text "Welcome to stackabuse.com". Here the text "Welcome to" is in plain font, while the text "stackabuse.com" is in bold face. Hence, the text "Welcome to" is considered as one run, while the bold faced text "stackabuse.com" is considered as another run.

Similarly, "Learn to program and write code in the" and "most efficient manner" are treated as two different runs in the paragraph "Learn to program and write code in the most efficient manner".

To get all the runs in a paragraph, you can use the runs property of a paragraph object, which you obtain from the paragraphs attribute of the doc object.

Let’s read all the runs from paragraph number 5 (index 4) in our text:

single_para = doc.paragraphs[4]
for run in single_para.runs:
    print(run.text)

Output:

The best site for
learning Python
 and Other
Programming Languages

In the same way, the following script prints all the runs from the 6th paragraph of the my_word_file.docx file:

second_para = doc.paragraphs[5]
for run in second_para.runs:
    print(run.text)

Output:

Learn to program and write code in the
most efficient manner

Writing MS Word Files with Python-Docx Module

In the previous section, you saw how to read MS Word files in Python using the python-docx module. In this section, you will see how to write MS Word files via the python-docx module.

To write MS Word files, create an object of the Document class by calling its constructor without a file name.

mydoc = docx.Document()

Writing Paragraphs

To write paragraphs, you can use the add_paragraph() method of the Document class object. Once you have added a paragraph, you will need to call the save() method on the Document class object. The path of the file to which you want to write is passed as a parameter to the save() method. If the file doesn’t already exist, a new file will be created; otherwise the existing file is overwritten with the current content of the Document object.

The following script writes a simple paragraph to a newly created MS Word file named "my_written_file.docx".

mydoc.add_paragraph("This is first paragraph of a MS Word file.")
mydoc.save("E:/my_written_file.docx")

Once you execute the above script, you should see a new file "my_written_file.docx" in the directory that you specified in the save() method. Inside the file, you should see one paragraph which reads "This is first paragraph of a MS Word file."

Let’s add another paragraph to the my_written_file.docx:

mydoc.add_paragraph("This is the second paragraph of a MS Word file.")
mydoc.save("E:/my_written_file.docx")

Because mydoc still holds the first paragraph, the saved my_written_file.docx now contains both paragraphs, with the second one after the first.

Writing Runs

You can also write runs using the python-docx module. To write runs, you first have to create a handle for the paragraph to which you want to add your run. Take a look at the following example to see how it’s done:

third_para = mydoc.add_paragraph("This is the third paragraph.")
third_para.add_run(" this is a section at the end of third paragraph")
mydoc.save("E:/my_written_file.docx")

In the script above we write a paragraph using the add_paragraph() method of the Document class object mydoc. The add_paragraph() method returns a handle for the newly added paragraph. To add a run to the new paragraph, you need to call the add_run() method on the paragraph handle. The text for the run is passed in the form of a string to the add_run() method. Finally, you need to call the save() method to create the actual file.

Writing Headers

You can also add headers (headings) to MS Word files. To do so, you need to call the add_heading() method. The first parameter to the add_heading() method is the text string for the header, and the second parameter is the header level. The levels start from 0: level 0 produces the document title style, and levels 1 and up produce progressively lower-level headings.

The following script adds three headers of level 0, 1, and 2 to the file my_written_file.docx:

mydoc.add_heading("This is level 1 heading", 0)
mydoc.add_heading("This is level 2 heading", 1)
mydoc.add_heading("This is level 3 heading", 2)
mydoc.save("E:/my_written_file.docx")

Adding Images

To add images to MS Word files, you can use the add_picture() method. The path to the image is passed as a parameter to the add_picture() method. You can also specify the width and height of the image using the Inches class from docx.shared. The following script adds an image from the local file system to the my_written_file.docx Word file. The width and height of the image will be 5 and 7 inches, respectively:

from docx.shared import Inches

mydoc.add_picture("E:/eiffel-tower.jpg", width=Inches(5), height=Inches(7))
mydoc.save("E:/my_written_file.docx")

After executing all the scripts in the Writing MS Word Files with Python-Docx Module section of this article, your final my_written_file.docx file should look like this:

In the output, you can see the three paragraphs that you added to the MS Word file, along with the three headers and one image.

Conclusion

The article gave a brief overview of how to read and write MS Word files using the python-docx module. The article covers how to read paragraphs and runs from within an MS Word file. Finally, the process of writing MS Word files, including adding paragraphs, runs, headers, and images, has been explained in this article.

This post will talk about how to read Word Documents with Python. We’re going to cover three different packages – docx2txt, docx, and my personal favorite: docx2python.

The docx2txt package

Let’s talk about docx2txt first. This is a Python package that allows you to scrape text and images from Word Documents. The example below reads in a Word Document containing the Zen of Python. As you can see, once we’ve imported docx2txt, all we need is one line of code to read in the text from the Word Document. We can read in the document using a function in the package called process, which takes the name of the file as input. Regular text, listed items, hyperlink text, and table text will all be returned in a single string.

import docx2txt

# read in word file
result = docx2txt.process("zen_of_python.docx")

What if the file has images? In that case we just need a minor tweak to our code. When we run the process method, we can pass an extra parameter that specifies the name of an output directory. Running docx2txt.process will extract any images in the Word Document and save them into this specified folder. The text from the file will still also be extracted and stored in the result variable.

import docx2txt

result = docx2txt.process("zen_of_python_with_image.docx", "C:/path/to/store/files")

docx2txt will also scrape any text from tables. Again, this will be returned in a single string with any other text found in the document, which means this text can be more difficult to parse. Later in this post we’ll talk about docx2python, which allows you to scrape tables in a more structured format.

The docx package

The source code behind docx2txt is derived from code in the docx package, which can also be used to scrape Word Documents. docx is a powerful library for manipulating and creating Word Documents, but can also (with some restrictions) read in text from Word files.

In the example below, we open our sample Word file using the docx.Document function. Here we just input the name of the file we want to open. Then, we can scrape the text from each paragraph in the file using a list comprehension in conjunction with doc.paragraphs. This will include scraping separate lines defined in the Word Document for listed items. Unlike docx2txt, docx cannot scrape images from Word Documents. Also, this approach will not scrape out hyperlinks and text in tables defined in the Word Document.

import docx

# open connection to Word Document
doc = docx.Document("zen_of_python.docx")

# read in each paragraph in file
result = [p.text for p in doc.paragraphs]

The docx2python package

docx2python is another package we can use to scrape Word Documents. It has some additional features beyond docx2txt and docx. For example, it is able to return the text scraped from a document in a more structured format. Let’s test out our Word Document with docx2python. We’re going to add a simple table in the document so that we can extract that as well (see below).

docx2python contains a function with the same name. If we call this function with the document’s name as input, we get back an object with several attributes.

from docx2python import docx2python

# extract docx content
doc_result = docx2python('zen_of_python.docx')

Each attribute provides either text or information from the file. For example, consider that our file has three main components – the text containing the Zen of Python, a table, and an image. If we call doc_result.body, each of these components will be returned as separate items in a list.

# get separate components of the document
doc_result.body

# get the text from Zen of Python
doc_result.body[0]

# get the image
doc_result.body[1]

# get the table text
doc_result.body[2]

Scraping a word document table with docx2python

The table text result is returned as a nested list, as you can see below. Each row (including the header) gets returned as a separate sub-list. The 0th element of the list refers to the header – or 0th row of the table. The next element refers to the next row in the table and so on. In turn, each value in a row is returned as an individual sub-list within that row’s corresponding list.

We can convert this result into a tabular format using pandas. The data frame is still a little messy – each cell in the data frame is a list containing a single value. This value also has quite a few “\t” characters (which represent tabs).

import pandas as pd

pd.DataFrame(doc_result.body[1][1:])

Here, we use the applymap method to apply the lambda function below to every cell in the data frame. This function gets the individual value within the list in each cell and strips the leading and trailing “\t” characters. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)

import pandas as pd

df = pd.DataFrame(doc_result.body[1][1:]).applymap(lambda val: val[0].strip("\t"))

Next, let’s change the column headers to what we see in the Word file (which was also returned to us in doc_result.body).


df.columns = [val[0].strip("\t") for val in doc_result.body[1][0]]
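If pandas is not available, the same cleanup can be sketched in plain Python. The nested list below is a hypothetical stand-in for the table returned in doc_result.body (rows, then cells, then a single string per cell); its values and the helper name clean_cell are invented for illustration.

```python
# Hypothetical stand-in for the table in doc_result.body:
# a list of rows, each row a list of cells, each cell a one-item list.
raw_table = [
    [["\tName"], ["\tYear"]],
    [["\tPython"], ["\t1991"]],
    [["\tRuby"], ["\t1995"]],
]


def clean_cell(cell):
    """Take the single string in a cell and strip surrounding tabs."""
    return cell[0].strip("\t")


# The first row holds the column headers; the rest are data rows.
header = [clean_cell(cell) for cell in raw_table[0]]
rows = [[clean_cell(cell) for cell in row] for row in raw_table[1:]]

# Pair each data row with the header to get one dict per table row.
records = [dict(zip(header, row)) for row in rows]
print(records)
```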

Extracting images

We can extract the Word file’s images using the images attribute of our doc_result object. doc_result.images consists of a dictionary where the keys are the names of the image files (not automatically written to disk) and the corresponding values are the image files in binary format.

type(doc_result.images) # dict

doc_result.images.keys() # dict_keys(['image1.png'])

We can write the binary-formatted image out to a physical file like this:


for key, val in doc_result.images.items():
    with open(key, "wb") as f:
        f.write(val)

Above we’re just looping through the keys (image file names) and values (binary images) in the dictionary and writing each out to file. In this case, we only have one image in the document, so we just get one written out.
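To keep the extracted files out of the working directory, the same loop can write into a dedicated folder with pathlib. This is only a sketch: the images dictionary below is a hypothetical stand-in for doc_result.images, and the folder name extracted_images is an assumption.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for doc_result.images: file name -> raw bytes.
images = {"image1.png": b"\x89PNG\r\n\x1a\n"}

# A temporary folder keeps the example from cluttering the working directory.
out_dir = Path(tempfile.mkdtemp()) / "extracted_images"
out_dir.mkdir(parents=True, exist_ok=True)

written = []
for name, data in images.items():
    target = out_dir / name
    target.write_bytes(data)  # opens, writes, and closes the file in one call
    written.append(target)

print(written)
```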

Other attributes

The docx2python result has several other attributes we can use to extract text or information from the file. For example, if we want to just get all of the file’s text in a single string (similar to docx2txt) we can run doc_result.text.

# get all text in a single string
doc_result.text

In addition to text, we can also get metadata about the file using the properties attribute. This returns information such as the creator of the document, the created / last modified dates, and number of revisions.

doc_result.properties

If the document you’re scraping has headers and footers, you can also scrape those out like this (note the singular version of “header” and “footer”):

# get the headers
doc_result.header

# get the footers
doc_result.footer

Footnotes can also be extracted like this:

doc_result.footnotes

Getting HTML returned with docx2python

We can also ask docx2python to return HTML-style markup that supports a few types of tags, including font (size and color), italics, bold, and underline. We just need to specify the parameter html=True. In the example below we see The Zen of Python in bold and underlined print, and the HTML version of this appears in the second snapshot below. The HTML feature does not currently support table-related tags, so I would recommend using the method we went through above if you’re looking to scrape tables from Word documents.

doc_html_result = docx2python('zen_of_python.docx', html=True)


Summary: in this tutorial, you’ll learn various ways to read text files in Python.

TL;DR

The following shows how to read all lines of the readme.txt file into a list of strings:

with open('readme.txt') as f:
    lines = f.readlines()

Steps for reading a text file in Python

To read a text file in Python, you follow these steps:

First, open a text file for reading by using the open() function.
Second, read text from the text file using the file read(), readline(), or readlines() method of the file object.
Third, close the file using the file close() method.
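The three steps above can be sketched as follows. The example first writes a small sample file to a temporary folder so that it is self-contained; the file name and its contents are placeholders.

```python
import os
import tempfile

# Create a small sample file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "readme.txt")
with open(path, "w") as out:
    out.write("first line\nsecond line\n")

f = open(path, "r")        # step 1: open the file for reading
try:
    contents = f.read()    # step 2: read the text
finally:
    f.close()              # step 3: close the file, even if reading fails

print(contents)
print(f.closed)
```

The try/finally block makes sure the file is closed even when the read raises an exception.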

1) open() function

The open() function has many parameters but you’ll be focusing on the first two:

open(path_to_file, mode)

The path_to_file parameter specifies the path to the text file.

If the program and file are in the same folder, you need to specify only the filename of the file. Otherwise, you need to include the path to the file as well as the filename.

To specify the path to the file, you use the forward-slash ('/') even if you’re working on Windows.

For example, if the file readme.txt is stored in the c:/sample folder, you need to specify the path to the file as c:/sample/readme.txt

The mode is an optional parameter. It’s a string that specifies the mode in which you want to open the file. The following table shows available modes for opening a text file:

Mode	Description
`'r'`	Open a text file for reading text
`'w'`	Open a text file for writing text
`'a'`	Open a text file for appending text
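As a quick illustration of the three modes, the sketch below writes, appends to, and then reads a throwaway file in a temporary folder; the file name notes.txt is a placeholder.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")

with open(path, "w") as f:   # 'w' creates the file or truncates an existing one
    f.write("one\n")

with open(path, "a") as f:   # 'a' keeps the existing text and writes at the end
    f.write("two\n")

with open(path, "r") as f:   # 'r' is also the default when the mode is omitted
    text = f.read()

print(text)
```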

For example, to open a file whose name is the-zen-of-python.txt stored in the same folder as the program, you use the following code:

f = open('the-zen-of-python.txt','r')

The open() function returns a file object which you will use to read text from a text file.

2) Reading text methods

The file object provides you with three methods for reading text from a text file:

read(size) – read some contents of a file based on the optional size and return the contents as a string. If you omit the size, the read() method reads from where it left off till the end of the file. If the end of a file has been reached, the read() method returns an empty string.
readline() – read a single line from a text file and return the line as a string. If the end of a file has been reached, the readline() returns an empty string.
readlines() – read all the lines of the text file into a list of strings. This method is useful if you have a small file and you want to manipulate the whole text of that file.
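The snippet below shows the three methods working on the same open file, each continuing from where the previous call stopped. It is a sketch that writes its own two-line sample file to a temporary folder first.

```python
import os
import tempfile

# Write a two-line sample file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("Beautiful is better than ugly.\nExplicit is better than implicit.\n")

with open(path) as f:
    first_chars = f.read(9)      # read(size): at most 9 characters
    rest_of_line = f.readline()  # readline(): continues up to the newline
    remaining = f.readlines()    # readlines(): list of the lines that are left
    at_end = f.read()            # at the end of the file, read() returns ''

print(repr(first_chars))
print(repr(rest_of_line))
print(remaining)
print(repr(at_end))
```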

3) close() method

The file that you open will remain open until you close it using the close() method.

It’s important to close the file that is no longer in use for the following reasons:

First, while a file is open in your script, some operating systems may lock it so that other programs or scripts cannot use it until you close it.
Second, the operating system allows only a limited number of open file descriptors. Although this number might be high, it’s possible to open a lot of files and exhaust that limit.
Third, leaving many files open may lead to race conditions, which occur when multiple processes attempt to modify one file at the same time and can cause all kinds of unexpected behaviors.

The following shows how to call the close() method to close the file:

f.close()

To close the file automatically without calling the close() method, you use the with statement like this:

with open(path_to_file) as f:
    contents = f.readlines()

In practice, you’ll use the with statement to close the file automatically.

Reading a text file examples

We’ll use the-zen-of-python.txt file for the demonstration.

The following example illustrates how to use the read() method to read all the contents of the the-zen-of-python.txt file into a string:

with open('the-zen-of-python.txt') as f:
    contents = f.read()
    print(contents)

Output:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
...

The following example uses the readlines() method to read the text file into a list of strings and prints each one:

with open('the-zen-of-python.txt') as f:
    for line in f.readlines():
        print(line)

Output:

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

...

The reason you see a blank line after each line from the file is that each line in the text file ends with a newline character (\n), and print() adds another one. To remove the blank line, you can use the strip() method. For example:

with open('the-zen-of-python.txt') as f:
    for line in f.readlines():
        print(line.strip())
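Keep in mind that strip() removes all leading and trailing whitespace, not only the newline. If indentation matters, rstrip("\n") is a narrower alternative. A small sketch with two made-up lines:

```python
# Two sample lines as they would come back from readlines().
lines = ["    indented line\n", "plain line\n"]

stripped = [line.strip() for line in lines]       # drops the indentation too
trimmed = [line.rstrip("\n") for line in lines]   # drops only the newline

print(stripped)
print(trimmed)
```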

The following example shows how to use the readline() to read the text file line by line:

with open('the-zen-of-python.txt') as f:
    while True:
        line = f.readline()
        if not line:
            break
        print(line.strip())

Output:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
...

A more concise way to read a text file line by line

The open() function returns a file object which is an iterable object. Therefore, you can use a for loop to iterate over the lines of a text file as follows:

with open('the-zen-of-python.txt') as f:
    for line in f:
        print(line.strip())

This is a more concise way to read a text file line by line.

Read UTF-8 text files

The code in the previous examples works fine with ASCII text files. However, if you’re dealing with other languages such as Japanese, Chinese, and Korean, the text file is not a simple ASCII text file. It’s likely a UTF-8 file that uses more than just the standard ASCII characters.

To open a UTF-8 text file, you need to pass the encoding='utf-8' to the open() function to instruct it to expect UTF-8 characters from the file.
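The round trip below is a minimal sketch of this: it writes one line of Japanese text to a temporary file as UTF-8 and reads it back with the same encoding. The quote is sample text chosen for the sketch, and the temporary path is a placeholder rather than the quotes.txt file used in this tutorial.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "quotes.txt")
quote = "七転び八起き"  # sample text chosen for this sketch

# Write the text as UTF-8 so the bytes on disk are well defined.
with open(path, "w", encoding="utf-8") as f:
    f.write(quote + "\n")

# Read it back with the same encoding.
with open(path, encoding="utf-8") as f:
    lines = [line.strip() for line in f]

print(lines == [quote])
```

Passing the encoding explicitly avoids depending on the platform's default encoding, which can differ between systems.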

For the demonstration, you’ll use the following quotes.txt file that contains some quotes in Japanese.

The following shows how to loop through the quotes.txt file:

with open('quotes.txt', encoding='utf8') as f:
    for line in f:
        print(line.strip())

Output:

Summary

Use the open() function with the 'r' mode to open a text file for reading.
Use the read(), readline(), or readlines() method to read a text file.
Always close a file after you finish reading it, using either the close() method or the with statement.
Use the encoding='utf-8' to read the UTF-8 text file.


The python-docx module lets you create and modify MS Word documents with the .docx extension. To install the module, run the command

> pip install python-docx

When installing the module, type python-docx rather than docx (that is a different module). When importing python-docx, however, use import docx, not import python-docx.

Reading MS Word Documents

Files with the .docx extension have a rich internal structure. The python-docx module represents this structure with three different data types. At the top level, a Document object represents the whole document. The Document object contains a list of Paragraph objects, which represent the paragraphs of the document. Each paragraph contains a list of one or more Run objects, which represent fragments of text with different formatting styles.

import docx

doc = docx.Document('example.docx')

# number of paragraphs in the document
print(len(doc.paragraphs))

# text of the first paragraph in the document
print(doc.paragraphs[0].text)

# text of the second paragraph in the document
print(doc.paragraphs[1].text)

# text of the first Run of the second paragraph
print(doc.paragraphs[1].runs[0].text)

6
Document title
A plain paragraph with bold and italic text
A plain paragraph with

Get all of the text from the document:

text = []
for paragraph in doc.paragraphs:
    text.append(paragraph.text)
print('\n'.join(text))

Document title
A plain paragraph with bold and italic text
Heading, level 1
Highlighted quote
First item of a bulleted list
First item of a numbered list

Style Formatting

MS Word documents use two kinds of styles: paragraph styles, which can be applied to Paragraph objects, and character styles, which can be applied to Run objects. Both Paragraph and Run objects can be given a style by assigning a string to their style attribute. The string must be the name of a style. If style is set to None, the Paragraph or Run object will have no style directly associated with it.

Paragraph styles

Normal
Body Text
Body Text 2
Body Text 3
Caption
Heading 1
Heading 2
Heading 3
Heading 4
Heading 5
Heading 6
Heading 7
Heading 8
Heading 9
Intense Quote
List
List 2
List 3
List Bullet
List Bullet 2
List Bullet 3
List Continue
List Continue 2
List Continue 3
List Number
List Number 2
List Number 3
List Paragraph
Macro Text
No Spacing
Quote
Subtitle
TOCHeading
Title

Character styles

Emphasis
Strong
Book Title
Default Paragraph Font
Intense Emphasis
Subtle Emphasis
Intense Reference
Subtle Reference

paragraph.style = 'Quote'
run.style = 'Book Title'

Run object attributes

Individual text fragments, represented by Run objects, can be given additional formatting through attributes. Each of these attributes takes one of three values: True (the attribute is on), False (the attribute is off) and None (the setting is inherited from the style applied to the Run).

bold: bold text
underline: underlined text
italic: italic text
strike: strikethrough text (in recent python-docx versions, set through run.font.strike)

Let's change the style of every paragraph in our document:

import docx

doc = docx.Document('example.docx')

# change the style of every paragraph
for paragraph in doc.paragraphs:
    paragraph.style = 'Normal'

doc.save('restyled.docx')

Now let's restore everything as it was:

import os

import docx

os.chdir(r'C:\example')

doc1 = docx.Document('example.docx')
doc2 = docx.Document('restyled.docx')

# collect the style of every paragraph in the first document
styles = []
for paragraph in doc1.paragraphs:
    styles.append(paragraph.style)

# apply those styles to the paragraphs of the second document
for i in range(len(doc2.paragraphs)):
    doc2.paragraphs[i].style = styles[i]

doc2.save('restored.docx')

Let's change the formatting of the Run objects in the second paragraph:

import docx

doc = docx.Document('example.docx')

# apply a character style to runs[0]
doc.paragraphs[1].runs[0].style = 'Intense Emphasis'
# underline runs[4]
doc.paragraphs[1].runs[4].underline = True

doc.save('restyled2.docx')

Writing MS Word documents

Paragraphs are added by calling the add_paragraph() method of the Document object. To append text to the end of an existing paragraph, call the add_run() method of the Paragraph object:

import docx

doc = docx.Document()

# add the first paragraph
doc.add_paragraph('Hello, world!')

# add two more paragraphs
par1 = doc.add_paragraph('This is the second paragraph.')
par2 = doc.add_paragraph('This is the third paragraph.')

# append text to the second paragraph
par1.add_run(' This text was appended to the second paragraph.')

# append bold text to the third paragraph
par2.add_run(' Appending text to the third paragraph.').bold = True

doc.save('helloworld.docx')

Both add_paragraph() and add_run() accept an optional second argument, a string naming the style, for example:

doc.add_paragraph('Hello, world!', 'Title')

Adding headings

Calling add_heading() adds a paragraph formatted with one of the heading styles:

doc.add_heading('Heading 0', 0)
doc.add_heading('Heading 1', 1)
doc.add_heading('Heading 2', 2)
doc.add_heading('Heading 3', 3)
doc.add_heading('Heading 4', 4)

The arguments of add_heading() are a string of text and an integer level from 0 to 9. Level 0 corresponds to the Title style.

Adding line and page breaks

To add a line break (rather than starting a new paragraph), call the add_break() method of a Run object. To add a page break instead, pass docx.enum.text.WD_BREAK.PAGE to add_break() as its only argument:

import docx

doc = docx.Document()

doc.add_paragraph('This is the first page')
doc.paragraphs[0].runs[0].add_break(docx.enum.text.WD_BREAK.PAGE)
doc.add_paragraph('This is the second page')

doc.save('pages.docx')

Adding images

The add_picture() method of the Document object adds an image at the end of the document. For example, let's add the image kitten.jpg, 10 centimeters wide, to the end of a document:

import docx

doc = docx.Document()

doc.add_paragraph('This is the first paragraph')
doc.add_picture('kitten.jpg', width = docx.shared.Cm(10))

doc.save('picture.docx')

The keyword arguments width and height set the width and height of the image. If they are omitted, the values are taken from the dimensions of the image itself.

Adding a table

import docx

doc = docx.Document()

# add a 3x3 table
table = doc.add_table(rows = 3, cols = 3)
# apply a style to the table
table.style = 'Table Grid'

# fill the table with data
for row in range(3):
    for col in range(3):
        # get the table cell
        cell = table.cell(row, col)
        # write data to the cell
        cell.text = str(row + 1) + str(col + 1)

doc.save('table.docx')

import docx

doc = docx.Document('table.docx')

# get the first table in the document
table = doc.tables[0]

# read the data from the table
for row in table.rows:
    string = ''
    for cell in row.cells:
        string = string + cell.text + ' '
    print(string)

11 12 13 
21 22 23 
31 32 33

Further reading

python-docx documentation


In this tutorial, we look at how to use the python-docx module to read and write Word .docx files in a Python program.

Word documents

A Word .docx file has more structure than plain text. The python-docx module represents it with three data types:
– a Document object for the entire document.
– Paragraph objects for the paragraphs inside the Document object.
– Run objects: each Paragraph object contains a list of them.

Install python-docx module

Open a command prompt, then run:
pip install python-docx

Once the installation succeeds, a docx folder appears under Python\Python[version]\Lib\site-packages.
(This tutorial uses python-docx 0.8.10.)

Now we can import the module with import docx.

Read docx file

Open file

We call the docx.Document() function with a filename to open a .docx file as a Document object.


>>> import docx
>>> gkzDoc = docx.Document('ozenero.docx')


Get paragraphs
A Document object has a paragraphs attribute, which is a list of Paragraph objects.

>>> gkzDoc = docx.Document('ozenero.docx')

>>> len(gkzDoc.paragraphs)
4
>>> gkzDoc.paragraphs[0].text
'JavaSampleApproach.com was the predecessor website to ozenero.com.'
>>> gkzDoc.paragraphs[1].text
'In this brandnew site, we don’t only focus on Java & Javascript Technology but also approach to other technologies & frameworks, other fields of computer science such as Machine Learning and Testing. All of them will come to you in simple, feasible, practical and integrative ways. Then you will feel the connection of everything.'
>>> gkzDoc.paragraphs[2].text
'What does ozenero mean?'
>>> gkzDoc.paragraphs[3].text
'Well, ozenero is derived from the words grok and konez.'

Get full-text
To get the full text of the document, we:

- open the Word document

- loop over all Paragraph objects and append their text to a list

>>> import docx
>>> gkzDoc = docx.Document('ozenero.docx')

>>> fullText = []
>>> for paragraph in gkzDoc.paragraphs:
...     fullText.append(paragraph.text)
...

>>> fullText
[
'JavaSampleApproach.com was the predecessor website to ozenero.com.',
'In this brandnew site, we don’t only focus on Java & Javascript Technology but also approach to other technologies & frameworks, other fields of computer science such as Machine Learning and Testing. All of them will come to you in simple, feasible, practical and integrative ways. Then you will feel the connection of everything.',
'What does ozenero mean?',
'Well, ozenero is derived from the words grok and konez.'
]

# join the paragraphs, separated by a blank line
>>> '\n\n'.join(fullText)


Write docx file
Create Word document
Calling docx.Document() with no arguments returns a new, blank Document object.

>>> import docx
>>> gkzDoc = docx.Document()
>>> gkzDoc



Save Word Document
When everything is done, we call the Document object's save() method with a filename to write the document to a file.

>>> gkzDoc.save('ozenero.docx')


Add Paragraphs
We use the Document object's add_paragraph() method to add a new paragraph; it returns a reference to the new Paragraph object.

>>> gkzDoc.add_paragraph('ozenero.com for developers!')



We can add text to the end of an existing Paragraph object using Paragraph’s add_run(text) method.

>>> para1 = gkzDoc.add_paragraph('1- Python Tutorials')
>>> para2 = gkzDoc.add_paragraph('2- Tensorflow Tutorials')
>>> para1.add_run(' - Basics')



To style text, we add runs and set their text attributes.

>>> para2.add_run(' - ')

>>> para2.add_run('Machine Learning').bold = True
>>> para2.add_run(' Tutorials').italic = True

# both bold and italic
>>> para3 = gkzDoc.add_paragraph()
>>> runner = para3.add_run('3- Big Data Tutorials')
>>> runner.bold = True
>>> runner.italic = True

>>> gkzDoc.save('ozenero.docx')


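These are some of the available text attributes. bold, italic and underline can be set directly on a Run; in recent python-docx versions the others are set through the Run's font property (for example, runner.font.strike = True):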

- bold

- italic

- underline

- strike (strikethrough)

- double_strike (double strikethrough)

- all_caps (capital letters)

- shadow

- outline

- rtl (right-to-left)

Add Headings
We use the Document object's add_heading(heading, i) method to add a paragraph with a heading style; the i argument, from 0 to 9, sets the heading level.

>>> gkzDoc.add_heading('ozenero 1', 1)

>>> gkzDoc.add_heading('ozenero 2', 2)

>>> gkzDoc.add_heading('ozenero 3', 3)

>>> gkzDoc.add_heading('ozenero 4', 4)

>>> gkzDoc.add_heading('ozenero 5', 5)

>>> gkzDoc.add_heading('ozenero 6', 6)

>>> gkzDoc.add_heading('ozenero 7', 7)

>>> gkzDoc.add_heading('ozenero 8', 8)

>>> gkzDoc.add_heading('ozenero 9', 9)

Add Line Breaks, Page Breaks
Instead of starting a new paragraph, we can add a line break by calling add_break() on the Run object after which the break should appear.

>>> import docx
>>> gkzDoc = docx.Document()
>>> para = gkzDoc.add_paragraph('ozenero Tutorials')
>>> para.runs[0].add_break()
>>> para.add_run('Python Basics')

>>> gkzDoc.save('ozenero.docx')
>>> para.text
'ozenero Tutorials\nPython Basics'


We can also add a page break by passing docx.enum.text.WD_BREAK.PAGE to add_break().

>>> gkzDoc = docx.Document()
>>> para = gkzDoc.add_paragraph('ozenero Tutorials')
>>> para.runs[0].add_break(docx.enum.text.WD_BREAK.PAGE)
>>> gkzDoc.add_paragraph('Python Basics')

>>> gkzDoc.save('ozenero.docx')


Add Pictures
We can use the Document object's add_picture() method to add an image to the end of the document.

>>> gkzDoc = docx.Document()
>>> gkzDoc.add_paragraph('ozenero Tutorials')

>>> gkzDoc.add_picture('gkn-logo-sm.png')

>>> gkzDoc.save('ozenero.docx')


The add_picture() method has optional width and height arguments.

If we omit them, the image is inserted at its native size.

>>> gkzDoc.add_picture('gkn-logo-sm.png', width=docx.shared.Inches(1))

>>> gkzDoc.save('ozenero.docx')



>>> gkzDoc.add_picture('gkn-logo-sm.png', width=docx.shared.Inches(1), height=docx.shared.Cm(3))

>>> gkzDoc.save('ozenero.docx')



Python provides built-in functions for creating, writing, and reading files. Python handles two types of files: text files and binary files (whose contents are read and written as raw bytes).

Text files: each line of text ends with a special end-of-line (EOL) character, which by default in Python is the newline character ('\n').
Binary files: there is no line terminator, and the data is stored as raw bytes rather than as text.

In this article, we will be focusing on opening, closing, reading, and writing data in a text file.

File Access Modes

Access modes determine which operations are possible on an opened file, that is, how the file will be used once it is opened. They also determine the initial position of the file handle. The file handle works like a cursor that marks where data is read from or written to in the file. Six access modes are covered here.

Read Only ('r'): Opens a text file for reading. The handle is positioned at the beginning of the file. If the file does not exist, an I/O error (FileNotFoundError) is raised. This is also the default mode.
Read and Write ('r+'): Opens the file for reading and writing. The handle is positioned at the beginning of the file. Raises an I/O error if the file does not exist.
Write Only ('w'): Opens the file for writing. If the file exists, its contents are truncated and overwritten. The handle is positioned at the beginning of the file. The file is created if it does not exist.
Write and Read ('w+'): Opens the file for reading and writing. If the file exists, its contents are truncated and overwritten. The handle is positioned at the beginning of the file.
Append Only ('a'): Opens the file for writing. The file is created if it does not exist. The handle is positioned at the end of the file. Data is written at the end, after the existing data.
Append and Read ('a+'): Opens the file for reading and writing. The file is created if it does not exist. The handle is positioned at the end of the file. Data is written at the end, after the existing data.
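
The differences between the modes are easiest to see by running them against the same file. The following sketch uses a temporary directory so that it does not touch any existing files; the file name modes.txt is arbitrary:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as f:      # 'w' creates the file (or truncates an existing one)
    f.write('first\n')

with open(path, 'a') as f:      # 'a' keeps the existing data and writes at the end
    f.write('second\n')

with open(path, 'r+') as f:     # 'r+' starts at the beginning and does not truncate
    f.write('FIRST')            # overwrites the first five characters in place

with open(path) as f:           # no mode given, so the default 'r' is used
    after_update = f.read()

with open(path, 'w+') as f:     # 'w+' truncates the file as soon as it is opened
    after_truncate = f.read()

print(repr(after_update))
print(repr(after_truncate))
```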

How Files are Loaded into Primary Memory

A computer has two kinds of memory: primary and secondary. Every saved file lives in secondary memory, because anything in primary memory is lost when the computer is powered off. To read or change a text file in Python, the file's data therefore has to be loaded into primary memory. Python works with an opened file through a file handle: when you open a file, the operating system looks it up and, if it finds it, returns a handle through which Python can read and write the file.

Opening a File

Files are opened with the open() function. No module needs to be imported to use it.

File_object = open(r"File_Name","Access_Mode")

The file should be in the same directory as the Python program; otherwise, give the full path of the file in place of the filename. Note: the r before the filename prevents characters in the string from being treated as escape sequences. For example, if the path contains \temp, then \t is treated as a tab character and opening the file fails with an invalid-path error. The r makes the string raw, meaning backslashes are taken literally. The r can be omitted if the file is in the same directory and no path is given.
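
The effect of the r prefix can be checked without opening any file at all. In this sketch the path D:\temp\notes.txt is a made-up example, and pathlib is shown only as one alternative way to sidestep the escaping question:

```python
from pathlib import PureWindowsPath

plain = "D:\temp\notes.txt"       # \t and \n are interpreted as tab and newline
raw = r"D:\temp\notes.txt"        # raw string: backslashes are kept as written
escaped = "D:\\temp\\notes.txt"   # doubling the backslashes works as well

print(repr(plain))
print(raw == escaped)

# pathlib builds the path from its parts, so no backslashes need to be typed.
built = PureWindowsPath("D:/", "temp", "notes.txt")
print(built)
```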

file1 = open("MyFile1.txt","a")
file2 = open(r"D:\Text\MyFile2.txt","w+")

Here, file1 is the file object for MyFile1.txt and file2 is the file object for MyFile2.txt.

Closing a file

The close() function closes the file and frees the resources held by it. Call it when the file is no longer needed or before reopening the file in a different mode.

File_object.close()

file1 = open("MyFile.txt","a")
file1.close()

Writing to a file

There are two ways to write to a file.

write(): Writes the string str1 to the text file. No newline is added automatically.

File_object.write(str1)

writelines(): Writes each string in a list to the text file, one after another. Use it to write several strings in a single call.

File_object.writelines(L) for L = [str1, str2, str3]
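
Despite its name, writelines() does not add line endings; the strings are written exactly as given. A small sketch that shows the difference, using a temporary file and an arbitrary list of city names:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cities.txt')
cities = ['Delhi', 'Paris', 'London']

# Strings are written back to back, with no separators in between.
with open(path, 'w') as f:
    f.writelines(cities)
with open(path) as f:
    joined = f.read()

# To get one city per line, the newlines have to be supplied explicitly.
with open(path, 'w') as f:
    f.writelines(city + '\n' for city in cities)
with open(path) as f:
    separated = f.read()

print(repr(joined))
print(repr(separated))
```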

Reading from a file

There are three ways to read data from a text file.

read(): Returns the data read as a string. Reads at most n characters; if n is not specified, reads the entire file.

File_object.read([n])

readline(): Reads one line of the file and returns it as a string. If n is specified, reads at most n characters. It never reads more than one line, even if n exceeds the length of the line.

File_object.readline([n])

readlines(): Reads all the lines and returns them as a list in which each line is a string element.

  File_object.readlines()

Note: '\n' is a single newline character, even though it is written with two characters in source code.

file1 = open("myfile.txt","w")
L = ["This is Delhi \n","This is Paris \n","This is London \n"]

# \n marks the end of each line
file1.write("Hello \n")
file1.writelines(L)
file1.close()

file1 = open("myfile.txt","r+")

print("Output of Read function is ")
print(file1.read())
print()

# seek(0) moves the handle back to the beginning of the file
file1.seek(0)

print( "Output of Readline function is ")
print(file1.readline())
print()

file1.seek(0)

print("Output of Read(9) function is ")
print(file1.read(9))
print()

file1.seek(0)

print("Output of Readline(9) function is ")
print(file1.readline(9))

file1.seek(0)

print("Output of Readlines function is ")
print(file1.readlines())
print()
file1.close()

Output:

Output of Read function is 
Hello 
This is Delhi 
This is Paris 
This is London 


Output of Readline function is 
Hello 


Output of Read(9) function is 
Hello 
Th

Output of Readline(9) function is 
Hello 

Output of Readlines function is 
['Hello \n', 'This is Delhi \n', 'This is Paris \n', 'This is London \n']

Appending to a file

file1 = open("myfile.txt","w")
L = ["This is Delhi \n","This is Paris \n","This is London \n"]
file1.writelines(L)
file1.close()

# Append: adds at the end of the existing data
file1 = open("myfile.txt","a")
file1.write("Today \n")
file1.close()

file1 = open("myfile.txt","r")
print("Output of Readlines after appending")
print(file1.readlines())
print()
file1.close()

# Write: overwrites the existing data
file1 = open("myfile.txt","w")
file1.write("Tomorrow \n")
file1.close()

file1 = open("myfile.txt","r")
print("Output of Readlines after writing")
print(file1.readlines())
print()
file1.close()

Output:

Output of Readlines after appending
['This is Delhi \n', 'This is Paris \n', 'This is London \n', 'Today \n']

Output of Readlines after writing
['Tomorrow \n']

This article is contributed by Harshit Agrawal.

In this tutorial, you’ll learn how to read a text file in Python with the open function. Learning how to safely open, read, and close text files is an important skill to learn as you begin working with different types of files. In this tutorial, you’ll learn how to use context managers to safely and efficiently handle opening files.

By the end of this tutorial, you’ll have learned:

How to open and read a text file using Python
How to read files in different ways and in different formats
How to read a file all at once or line-by-line
How to read a file to a list or to a dictionary

Python provides a number of easy ways to create, read, and write files. Since we’re focusing on how to read a text file, let’s take a look at the Python open() function. This function, well, facilitates opening a file.

Let’s take a look at this Python open function:

open(
    file,           # The pathname
    mode='r',       # The mode to open the file in
    buffering=-1,   # The buffering policy
    encoding=None,  # The encoding used for the file
    errors=None,    # How encoding/decoding errors are handled
    newline=None,   # How to identify new lines
    closefd=True,   # Whether to close the underlying file descriptor when the file is closed
    opener=None     # Using a custom opener
    )

In this tutorial, we’ll focus on just three of the most important parameters: file=, mode=, and encoding=.

When opening a file, we have a number of different options in terms of how to open the file. This is controlled by the mode parameter. Let’s take a look at the various arguments this parameter takes:

Character	Meaning
`'r'`	Open for reading (default)
`'w'`	Open for writing (first truncates the file)
`'x'`	Open for exclusive creation (fails if file already exists)
`'a'`	Open for writing (appends to file if it exists)
`'b'`	Open in binary mode
`'t'`	Text mode
`'+'`	Open for updating (reading and writing)

The different arguments for the mode= parameter in the Python open function

Ok, let’s see how we can open a file in Python. Feel free to download this text file, if you want to follow along line by line. Save the file and note its path into the file_path variable below:

# Opening a text file in Python
file_path = '/Users/datagy/Desktop/sample_text.txt'

file = open(file_path)
print(file)

# Returns: <_io.TextIOWrapper name='/Users/datagy/Desktop/sample_text.txt' mode='r' encoding='UTF-8'>

When we run this, we’re opening the text file. At this point, I’ll take a quick detour and discuss the importance of closing the file as well. Python must be explicitly told to release the external resources we open. By default, Python will try to retain the resource for as long as possible, even when we’re done using it.

Because of this, we should close the file by using the .close() method:

# Closing a file with .close()
file_path = '/Users/datagy/Desktop/sample_text.txt'

file = open(file_path)
file.close()

Handling Files with Context Managers in Python

A better alternative to this approach is to use a context manager. The context manager handles opening the file, performing actions on it, and then safely closing the file! This means that your resources can be safer and your code can be cleaner.

Let’s take a look at how we can use a context manager to open a text file in Python:

# Using a context manager to open a file
file_path = '/Users/datagy/Desktop/sample_text.txt'

with open(file_path) as file:
    ...

We can see here that by using the with keyword, we were able to open the file. The context manager then implicitly handles closing the file once all the nested actions are complete!
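
The main benefit of the context manager is that the file is closed even when something goes wrong inside the block. Here is a small sketch of that behaviour. It writes its own sample file to a temporary directory, and the RuntimeError merely stands in for any failure during processing:

```python
import os
import tempfile

# Create a small sample file to read from.
path = os.path.join(tempfile.mkdtemp(), 'sample_text.txt')
with open(path, 'w') as f:
    f.write('Hi there!\n')

try:
    with open(path) as file:
        first = file.readline()
        raise RuntimeError('something went wrong while processing')
except RuntimeError:
    pass

# The with block closed the file on the way out, despite the exception.
print(file.closed)
```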

Ok, now that you have an understanding of how to open a file in Python, let’s see how we can actually read the file!

How To Read a Text File in Python

Let’s start by reading the entire text file. This can be helpful when you don’t have a lot of content in your file and want to see the entirety of the file’s content. To do this, we use the aptly-named .read() method.

Let’s see how we can use a context manager and the .read() method to read an entire text file in Python:

# Reading an entire text file in Python
file_path = '/Users/datagy/Desktop/sample_text.txt'

with open(file_path) as file:
    print(file.read())

# Returns:
# Hi there!
# Welcome to datagy!
# Today we’re learning how to read text files.
# See you later!

We can see how easy that was! The .read() method returns a string, meaning that we could assign it to a variable as well.

The .read() method also takes an optional parameter to limit how many characters to read. Let’s see how we can modify this a bit:

# Reading only some characters
file_path = '/Users/datagy/Desktop/sample_text.txt'

with open(file_path) as file:
    print(file.read(28))

# Returns:
# Hi there!
# Welcome to datagy!

If the argument is negative or omitted, the entire file is read.
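
The size argument also makes it possible to work through a file in fixed-size pieces, which keeps memory use bounded. This is a sketch of one way to do it; the helper name read_in_chunks and the chunk size of 16 are arbitrary choices, and io.StringIO stands in for a real file:

```python
import io
from functools import partial


def read_in_chunks(file, size=16):
    """Yield successive chunks of at most `size` characters."""
    # iter(callable, sentinel) keeps calling file.read(size) until it returns ''.
    yield from iter(partial(file.read, size), '')


text = 'Hi there!\nWelcome to datagy!\nSee you later!'
chunks = list(read_in_chunks(io.StringIO(text), size=16))
print(chunks)
```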

How to Read a Text File in Python Line by Line

In some cases, your files will be too large to conveniently read all at once. This is where being able to read your file line by line becomes important.

The Python .readline() method returns only a single line at a time. This can be very helpful when you’re parsing the file for a specific line, or simply want to print the lines slightly modified.

Let’s see how we can use this method to print out the file line by line:

# Reading a single line in Python
file_path = '/Users/datagy/Desktop/sample_text.txt'

with open(file_path) as file:
    print(file.readline())

# Returns:
# Hi there!

In the example above, only the first line was returned. We can call the method multiple times in order to print more than one line:

# Reading multiple lines in Python
file_path = '/Users/datagy/Desktop/sample_text.txt'

with open(file_path) as file:
    print(file.readline())
    print(file.readline())

# Returns:
# Hi there!

# Welcome to datagy!

This process can feel a bit redundant, especially because it requires you to know how many lines there are. Thankfully, the file object we created is an iterable, and we can simply iterate over its lines:

# Printing all lines, line by line
file_path = '/Users/datagy/Desktop/sample_text.txt'

with open(file_path) as file:
    for line in file:
        print(line)

# Returns:
# Hi there!

# Welcome to datagy!

# Today we’re learning how to read text files.

# See you later!
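
The blank lines in the output appear because each line already ends with a newline and print() adds another. One way around this, sketched here with an in-memory file and line numbers added for illustration, is to strip the line ending before printing:

```python
import io

# io.StringIO stands in for an opened text file in this sketch.
file = io.StringIO('Hi there!\nWelcome to datagy!\nSee you later!')

numbered = []
for number, line in enumerate(file, start=1):
    # rstrip('\n') removes only the trailing newline, not other whitespace.
    numbered.append(f"{number}: {line.rstrip(chr(10))}")

print('\n'.join(numbered))
```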

How to Read a Text File in Python to a List

Sometimes you’ll want to store the data that you read in a collection object, such as a Python list. We can accomplish this using the .readlines() method, which reads all lines at once into a list.

Let’s see how we can use this method:

# Reading a text file to a list
file_path = '/Users/datagy/Desktop/sample_text.txt'

with open(file_path) as file:
    line_list = file.readlines()

print(line_list)

# Returns:
# ['Hi there!\n', 'Welcome to datagy!\n', 'Today we’re learning how to read text files.\n', 'See you later!']

We can remove the new line characters by using the .rstrip() method, which removes trailing whitespace:

# Removing trailing new lines from our list
file_path = '/Users/datagy/Desktop/sample_text.txt'

with open(file_path) as file:
    line_list = file.readlines()
    line_list = [item.rstrip() for item in line_list]

print(line_list)

# Returns:
# ['Hi there!', 'Welcome to datagy!', 'Today we’re learning how to read text files.', 'See you later!']
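
An alternative worth knowing is to read the whole file and split it with str.splitlines(), which drops the line endings in one step. A brief sketch, again using an in-memory file in place of sample_text.txt:

```python
import io

# io.StringIO stands in for an opened text file in this sketch.
file = io.StringIO('Hi there!\nWelcome to datagy!\nSee you later!')

# splitlines() splits on line boundaries and leaves the newlines out.
line_list = file.read().splitlines()
print(line_list)
```

Unlike .rstrip(), this keeps any trailing spaces that are part of a line.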

How to Read a Text File in Python to a Dictionary

Now, let’s see how we can read a file to a dictionary in Python. For this section, download the file linked here. The file contains a line by line shopping list of supplies that we need for a project:

Lumber, 4
Screws, 16
Nails, 12
Paint, 1
Hammer, 1

We want to be able to read this text file into a dictionary, so that we can easily reference the number of supplies we need per item.

This time, we can iterate over the file line by line. We then parse each line and split the values to build a dictionary, either in a for loop or with a dictionary comprehension:

# Reading a file into a dictionary
file_path = '/Users/datagy/Desktop/text_resources.txt'

with open(file_path) as file:
    resources = {}
    for line in file:
        key, value = line.rstrip().split(',', 1)
        resources[key] = value

    # Or as a dictionary comprehension (rewind first: the loop above consumed the file):
    file.seek(0)
    resources2 = {key: value for line in file for key, value in [line.rstrip().split(',', 1)]}

print(resources)

# Returns: {'Lumber': ' 4', 'Screws': ' 16', 'Nails': ' 12', 'Paint': ' 1', 'Hammer': ' 1'}

In this case, the for loop is significantly easier to read. Let’s see what we’re doing here:

We instantiate an empty dictionary
We loop over each line in the file
For each line, we remove the trailing whitespace and split the line at the first comma
We then use the part before the comma as the dictionary key and the remainder of the line as its value
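
Note that the values in the resulting dictionary are strings with a leading space. If numeric quantities are wanted, the parsing can be tightened a little. The following is a sketch under the assumption that every non-blank line has the form "name, integer"; the data is supplied through io.StringIO instead of the downloaded file:

```python
import io

# Stand-in for the shopping list file, using the same contents as above.
sample = io.StringIO('Lumber, 4\nScrews, 16\nNails, 12\nPaint, 1\nHammer, 1\n')

resources = {}
for line in sample:
    if not line.strip():
        continue  # skip blank lines
    name, quantity = line.split(',', 1)
    # int() ignores surrounding whitespace, including the trailing newline.
    resources[name.strip()] = int(quantity)

print(resources)
```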

How to Read a Text File in Python with Specific Encoding

In some cases, you’ll be working with files that aren’t encoded in a way that Python can immediately handle. When this happens, you can specify the type of encoding to use. For example, we can read the file using the 'utf-8' encoding by writing the code below:

# Specifying the encoding
file_path = '/Users/datagy/Desktop/sample_text.txt'

with open(file_path, encoding='utf-8') as file:
    line_list = file.readlines()
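
What happens when the encoding does not match can be shown with a file written in a different encoding. This sketch uses cp1251, a common legacy encoding for Cyrillic text, purely as an example; the errors='replace' option is one way to read such a file anyway, at the cost of losing the undecodable characters:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'greeting.txt')
with open(path, 'w', encoding='cp1251') as f:
    f.write('Привет')

# Reading with the wrong encoding fails, because the bytes are not valid UTF-8.
try:
    with open(path, encoding='utf-8') as f:
        f.read()
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Reading with the encoding the file was written in works.
with open(path, encoding='cp1251') as f:
    text = f.read()

# errors='replace' substitutes U+FFFD for bytes that cannot be decoded.
with open(path, encoding='utf-8', errors='replace') as f:
    lossy = f.read()

print(decoded_as_utf8, text, repr(lossy))
```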

Conclusion

In this post, you learned how to use Python to read a text file. You learned how to safely handle opening and closing the file using the with context manager. You then learned how to read a file, first all at once, then line by line. You also learned how to convert a text file into a Python list and how to parse a text file into a dictionary using Python.
