Read excel pandas пример

В Python данные из файла Excel считываются в объект DataFrame. Для этого используется функция read_excel() модуля pandas.

Лист Excel — это двухмерная таблица. Объект DataFrame также представляет собой двухмерную табличную структуру данных.

  • Пример использования Pandas read_excel()
  • Список заголовков столбцов листа Excel
  • Вывод данных столбца
  • Пример использования Pandas to Excel: read_excel()
  • Чтение файла Excel без строки заголовка
  • Лист Excel в Dict, CSV и JSON
  • Ресурсы

Пример использования Pandas read_excel()

Предположим, что у нас есть документ Excel, состоящий из двух листов: «Employees» и «Cars». Верхняя строка содержит заголовок таблицы.

Пример использования Pandas read_excel() - 2

Ниже приведен код, который считывает данные листа «Employees» и выводит их.

import pandas

excel_data_df = pandas.read_excel('records.xlsx', sheet_name='Employees')

# print whole sheet data
print(excel_data_df)

Вывод:

   EmpID    EmpName EmpRole
0      1     Pankaj     CEO
1      2  David Lee  Editor
2      3   Lisa Ray  Author

Первый параметр, который принимает функция read_excel ()— это имя файла Excel. Второй параметр (sheet_name) определяет лист для считывания данных.

При выводе содержимого объекта DataFrame мы получаем двухмерные таблицы, схожие по своей структуре со структурой документа Excel.

Чтобы получить список заголовков столбцов таблицы, используется свойство columns объекта Dataframe. Пример реализации:

print(excel_data_df.columns.ravel())

Вывод:

['Pankaj', 'David Lee', 'Lisa Ray']

Мы можем получить данные из столбца и преобразовать их в список значений. Пример:

print(excel_data_df['EmpName'].tolist())

Вывод:

['Pankaj', 'David Lee', 'Lisa Ray']

Можно указать имена столбцов для чтения из файла Excel. Это потребуется, если нужно вывести данные из определенных столбцов таблицы.

import pandas

excel_data_df = pandas.read_excel('records.xlsx', sheet_name='Cars', usecols=['Car Name', 'Car Price'])
print(excel_data_df)

Вывод:

         Car Name      Car Price
0      Honda City     20,000 USD
1  Bugatti Chiron  3 Million USD
2     Ferrari 458   2,30,000 USD

Если в листе Excel нет строки заголовка, нужно передать его значение как None.

excel_data_df = pandas.read_excel('records.xlsx', sheet_name='Numbers', header=None)

Если вы передадите значение заголовка как целое число (например, 3), тогда третья строка станет им. При этом считывание данных начнется со следующей строки. Данные, расположенные перед строкой заголовка, будут отброшены.

Объект DataFrame предоставляет различные методы для преобразования табличных данных в формат Dict , CSV или JSON.

excel_data_df = pandas.read_excel('records.xlsx', sheet_name='Cars', usecols=['Car Name', 'Car Price'])

print('Excel Sheet to Dict:', excel_data_df.to_dict(orient='record'))
print('Excel Sheet to JSON:', excel_data_df.to_json(orient='records'))
print('Excel Sheet to CSV:n', excel_data_df.to_csv(index=False))

Вывод:

Excel Sheet to Dict: [{'Car Name': 'Honda City', 'Car Price': '20,000 USD'}, {'Car Name': 'Bugatti Chiron', 'Car Price': '3 Million USD'}, {'Car Name': 'Ferrari 458', 'Car Price': '2,30,000 USD'}]
Excel Sheet to JSON: [{"Car Name":"Honda City","Car Price":"20,000 USD"},{"Car Name":"Bugatti Chiron","Car Price":"3 Million USD"},{"Car Name":"Ferrari 458","Car Price":"2,30,000 USD"}]
Excel Sheet to CSV:
 Car Name,Car Price
Honda City,"20,000 USD"
Bugatti Chiron,3 Million USD
Ferrari 458,"2,30,000 USD"
  • Документы API pandas read_excel()

Дайте знать, что вы думаете по этой теме материала в комментариях. Мы крайне благодарны вам за ваши комментарии, дизлайки, подписки, лайки, отклики!

Время на прочтение
5 мин

Количество просмотров 63K

Excel — это чрезвычайно распространённый инструмент для анализа данных. С ним легко научиться работать, есть он практически на каждом компьютере, а тот, кто его освоил, может с его помощью решать довольно сложные задачи. Python часто считают инструментом, возможности которого практически безграничны, но который освоить сложнее, чем Excel. Автор материала, перевод которого мы сегодня публикуем, хочет рассказать о решении с помощью Python трёх задач, которые обычно решают в Excel. Эта статья представляет собой нечто вроде введения в Python для тех, кто хорошо знает Excel.

Загрузка данных

Начнём с импорта Python-библиотеки pandas и с загрузки в датафреймы данных, которые хранятся на листах sales и states книги Excel. Такие же имена мы дадим и соответствующим датафреймам.

import pandas as pd
sales = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/pythonexcel.xlsx', sheet_name = 'sales')
states = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/pythonexcel.xlsx', sheet_name = 'states')

Теперь воспользуемся методом .head() датафрейма sales для того чтобы вывести элементы, находящиеся в начале датафрейма:

print(sales.head())

Сравним то, что будет выведено, с тем, что можно видеть в Excel.

Сравнение внешнего вида данных, выводимых в Excel, с внешним видом данных, выводимых из датафрейма pandas

Тут можно видеть, что результаты визуализации данных из датафрейма очень похожи на то, что можно видеть в Excel. Но тут имеются и некоторые очень важные различия:

  • Нумерация строк в Excel начинается с 1, а в pandas номер (индекс) первой строки равняется 0.
  • В Excel столбцы имеют буквенные обозначения, начинающиеся с буквы A, а в pandas названия столбцов соответствуют именам соответствующих переменных.

Продолжим исследование возможностей pandas, позволяющих решать задачи, которые обычно решают в Excel.

Реализация возможностей Excel-функции IF в Python

В Excel существует очень удобная функция IF, которая позволяет, например, записать что-либо в ячейку, основываясь на проверке того, что находится в другой ячейке. Предположим, нужно создать в Excel новый столбец, ячейки которого будут сообщать нам о том, превышают ли 500 значения, записанные в соответствующие ячейки столбца B. В Excel такому столбцу (в нашем случае это столбец E) можно назначить заголовок MoreThan500, записав соответствующий текст в ячейку E1. После этого, в ячейке E2, можно ввести следующее:

=IF([@Sales]>500, "Yes", "No")

Использование функции IF в Excel

Для того чтобы сделать то же самое с использованием pandas, можно воспользоваться списковым включением (list comprehension):

sales['MoreThan500'] = ['Yes' if x > 500 else 'No' for x in sales['Sales']]

Списковые включения в Python: если текущее значение больше 500 — в список попадает Yes, в противном случае — No

Списковые включения — это отличное средство для решения подобных задач, позволяющее упростить код за счёт уменьшения потребности в сложных конструкциях вида if/else. Ту же задачу можно решить и с помощью if/else, но предложенный подход экономит время и делает код немного чище. Подробности о списковых включениях можно найти здесь.

Реализация возможностей Excel-функции VLOOKUP в Python

В нашем наборе данных, на одном из листов Excel, есть названия городов, а на другом — названия штатов и провинций. Как узнать о том, где именно находится каждый город? Для этого подходит Excel-функция VLOOKUP, с помощью которой можно связать данные двух таблиц. Эта функция работает по принципу левого соединения, когда сохраняется каждая запись из набора данных, находящегося в левой части выражения. Применяя функцию VLOOKUP, мы предлагаем системе выполнить поиск определённого значения в заданном столбце указанного листа, а затем — вернуть значение, которое находится на заданное число столбцов правее найденного значения. Вот как это выглядит:

=VLOOKUP([@City],states,2,false)

Зададим на листе sales заголовок столбца F как State и воспользуемся функцией VLOOKUP для того чтобы заполнить ячейки этого столбца названиями штатов и провинций, в которых расположены города.

Использование функции VLOOKUP в Excel

В Python сделать то же самое можно, воспользовавшись методом merge из pandas. Он принимает два датафрейма и объединяет их. Для решения этой задачи нам понадобится следующий код:

sales = pd.merge(sales, states, how='left', on='City')

Разберём его:

  1. Первый аргумент метода merge — это исходный датафрейм.
  2. Второй аргумент — это датафрейм, в котором мы ищем значения.
  3. Аргумент how указывает на то, как именно мы хотим соединить данные.
  4. Аргумент on указывает на переменную, по которой нужно выполнить соединение (тут ещё можно использовать аргументы left_on и right_on, нужные в том случае, если интересующие нас данные в разных датафреймах названы по-разному).

Сводные таблицы

Сводные таблицы (Pivot Tables) — это одна из самых мощных возможностей Excel. Такие таблицы позволяют очень быстро извлекать ценные сведения из больших наборов данных. Создадим в Excel сводную таблицу, выводящую сведения о суммарных продажах по каждому городу.

Создание сводной таблицы в Excel

Как видите, для создания подобной таблицы достаточно перетащить поле City в раздел Rows, а поле Sales — в раздел Values. После этого Excel автоматически выведет суммарные продажи для каждого города.

Для того чтобы создать такую же сводную таблицу в pandas, нужно будет написать следующий код:

sales.pivot_table(index = 'City', values = 'Sales', aggfunc = 'sum')

Разберём его:

  1. Здесь мы используем метод sales.pivot_table, сообщая pandas о том, что мы хотим создать сводную таблицу, основанную на датафрейме sales.
  2. Аргумент index указывает на столбец, по которому мы хотим агрегировать данные.
  3. Аргумент values указывает на то, какие значения мы собираемся агрегировать.
  4. Аргумент aggfunc задаёт функцию, которую мы хотим использовать при обработке значений (тут ещё можно воспользоваться функциями mean, max, min и так далее).

Итоги

Из этого материала вы узнали о том, как импортировать Excel-данные в pandas, о том, как реализовать средствами Python и pandas возможности Excel-функций IF и VLOOKUP, а также о том, как воспроизвести средствами pandas функционал сводных таблиц Excel. Возможно, сейчас вы задаётесь вопросом о том, зачем вам пользоваться pandas, если то же самое можно сделать и в Excel. На этот вопрос нет однозначного ответа. Python позволяет создавать код, который поддаётся тонкой настройке и глубокому исследованию. Такой код можно использовать многократно. Средствами Python можно описывать очень сложные схемы анализа данных. А возможностей Excel, вероятно, достаточно лишь для менее масштабных исследований данных. Если вы до этого момента пользовались только Excel — рекомендую испытать Python и pandas, и узнать о том, что у вас из этого получится.

А какие инструменты вы используете для анализа данных?

Напоминаем, что у нас продолжается конкурс прогнозов, в котором можно выиграть новенький iPhone. Еще есть время ворваться в него, и сделать максимально точный прогноз по злободневным величинам.

pandas.read_excel() function is used to read excel sheet with extension xlsx into pandas DataFrame. By reading a single sheet it returns a pandas DataFrame object, but reading two sheets it returns a Dict of DataFrame.

pandas Read Excel Key Points

  • This supports to read files with extension xls, xlsx, xlsm, xlsb, odf, ods and odt 
  • Can load excel files stored in a local filesystem or from an URL.
  • For URL, it supports http, ftp, s3, and file.
  • Also supports reading from a single sheet or a list of sheets.
  • When reading a two sheets, it returns a Dict of DataFrame.

Table of contents –

  • Read Excel Sheet into DataFrame
  • Read by Ignoring Column Names
  • Set Column from Excel as Index
  • Read Excel by Sheet Name
  • Read Two Sheets
  • Skip Columns From Excel
  • Skip Rows From Excel
  • Other Important Params

I have an excel file with two sheets named Technologies and Schedule, I will be using this to demonstrate how to read into pandas DataFrame.

pandas read excelpandas read excel sheet

Notice that on our excel file the top row contains the header of the table which can be used as column names on DataFrame.

1. pandas Read Excel Sheet

Use pandas.read_excel() function to read excel sheet into pandas DataFrame, by default it loads the first sheet from the excel file and parses the first row as a DataFrame column name. Excel file has an extension .xlsx. This function also supports several extensions xls, xlsx, xlsm, xlsb, odf, ods and odt .

Following are some of the features supported by read_excel() with optional param.

  • Reading excel file from URL, S3, and from local file ad supports several extensions.
  • Ignoreing the column names and provides an option to set column names.
  • Setting column as Index
  • Considering multiple values as NaN
  • Decimal points to use for numbers
  • Data types for each column
  • Skipping rows and columns

I will cover how to use some of these optional params with examples, first let’s see how to read an excel sheet & create a DataFrame without any params.


import pandas as pd
# Read Excel file
df = pd.read_excel('c:/apps/courses_schedule.xlsx')
print(df)

# Outputs
#   Courses    Fee  Duration  Discount
#0    Spark  25000   50 Days      2000
#1   Pandas  20000   35 Days      1000
#2     Java  15000       NaN       800
#3   Python  15000   30 Days       500
#4      PHP  18000   30 Days       800

Related: pandas Write to Excel Sheet

By default, it considers the first row from excel as a header and used it as DataFrame column names. In case you wanted to consider the first row from excel as a data record use header=None param and use names param to specify the column names. Not specifying names result in column names with numerical numbers.


# Read excel by considering first row as data
columns = ['courses','course_fee','course_duration','course_discount']
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   header=None, names = columns)
print(df2)

# Outputs        
#0  courses  course_fee  Duration  Discount
#1    Spark  25000   50 Days      2000
#2   Pandas  20000   35 Days      1000
#3     Java  15000       NaN       800
#4   Python  15000   30 Days       500
#5      PHP  18000   30 Days       800

3. Set Column from Excel as Index

If you notice, the DataFrame was created with the default index, if you wanted to set the column name as index use index_col param. This param takes values {int, list of int, default None}. If a list is passed with header positions, it creates a MultiIndex.

By default, it is set to None meaning not column is set as an index.


# Read excel by setting column as index
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   index_col=0)
print(df2)

# Outputs
#           Fee Duration  Discount
#Courses                          
#Spark    25000  50 Days      2000
#Pandas   20000  35 Days      1000
#Java     15000      NaN       800
#Python   15000  30 Days       500
#PHP      18000  30 Days       800

4. Read Excel by Sheet Name

As I said in the above section by default pandas read the first sheet from the excel file and provide a sheet_name param to read a specific sheet by name. This param takes {str, int, list, or None} as values. This is also used to load a sheet by position.

By default, it is set to 0 meaning load the first sheet.


# Read specific excel sheet
df = pd.read_excel('records.xlsx', sheet_name='Sheet1')
print(df)

5. Read Two Sheets

sheet_name param also takes a list of sheet names as values that can be used to read two sheets into pandas DataFrame. Not that while reading two sheets it returns a Dict of DataFrame. The key in Dict is a sheet name and the value would be DataFrame.

Use None to load all sheets from excel and returns a Dict of Dictionary.


# Read Multiple sheets
dict_df = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   sheet_name=['Technologies','Schedule'])

# Get DataFrame from Dict
technologies_df = dict_df .get('Technologies')
schedule_df = dict_df.get('Schedule')

# Print DataFrame's
print(technologies_df)
print(schedule_df)

I will leave this to you to execute and validate the output.

6. Skip Columns From Excel Sheet

Sometimes while reading an excel sheet into pandas DataFrame you may need to skip columns, you can do this by using usecols param. This takes values {int, str, list-like, or callable default None}. To specify the list of column names or positions use a list of strings or a list of int.

By default it is set to None meaning load all columns


# Read excel by skipping columns
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   usecols=['Courses', 'Duration'])
print(df2)
# Outputs
#  Courses Duration
#0   Spark  50 Days
#1  Pandas  35 Days
#2    Java      NaN
#3  Python  30 Days
#4     PHP  30 Days

Alternatively, you can also write it by column position.


# Skip columns with list of values
df = pd.read_excel('records.xlsx', usecols=[0,2])
print(df)

Also supports a range of columns as value. For example, value ‘B:D’ means parsing B, C, and D columns.


# Skip columns by range
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   usecols='B:D')
print(df2)

     Fee Duration  Discount
0  25000  50 Days      2000
1  20000  35 Days      1000
2  15000      NaN       800
3  15000  30 Days       500
4  18000  30 Days       800

7. Skip Rows from Excel Sheet

Use skiprows param to skip rows from the excel file, this param takes values {list-like, int, or callable, optional}. With this, you can skip the first few rows, selected rows, and range of rows. The below example skips the first 3 rows and considers the 4th row from excel as the header.


# Read excel file by skipping rows
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   skiprows=2)
print(df2)

   Pandas  20000  35 Days  1000
0    Java  15000      NaN   800
1  Python  15000  30 Days   500
2     PHP  18000  30 Days   800

Use header=None to consider the 4th row as data. you can also use a list of rows to skip.


# Using skiprows to skip rows
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   skiprows=[1,3])
print(df2)

  Courses    Fee Duration  Discount
0  Pandas  20000  35 Days      1000
1  Python  15000  30 Days       500
2     PHP  18000  30 Days       800

By using a lambda expression.


# Using skiprows with lambda
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   skiprows=lambda x: x in [1,3])
print(df2)

8. Other Important Params

  • dtype – Dict with column name an type.
  • nrows – How many rows to parse.
  • na_values – Additional strings to recognize as NA/NaN. 
  • keep_default_na – Whether or not to include the default NaN values when parsing the data. 
  • na_filter – Filters missing values.
  • parse_dates – Specify the column index you wanted to parse as dates
  • thousands – Thousands separator for parsing string columns to numeric.
  • skipfooter – Specify how to rows you wanted to skip from the footer.
  • mangle_dupe_cols – Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, 

For complete params and description, refer to pandas documentation.

Conclusion

In this article, you have learned how to read an Excel sheet and covert it into DataFrame by ignoring header, skipping rows, skipping columns, specifying column names, and many more.

Happy Learning !!

Related Articles

  • pandas ExcelWriter Usage with Examples
  • pandas write CSV file
  • Pandas Read SQL Query or Table with Examples
  • Pandas Read TSV with Examples
  • Pandas Read Text with Examples
  • Pandas read_csv() with Examples
  • Pandas Read JSON File with Examples
  • How to Read CSV from String in Pandas
  • Pandas Write to Excel with Examples

References

  • https://docs.microsoft.com/en-us/deployoffice/compat/office-file-format-reference
  • https://en.wikipedia.org/wiki/List_of_Microsoft_Office_filename_extensions

Хотя многие Data Scientist’ы больше привыкли работать с CSV-файлами, на практике очень часто приходится сталкиваться с обычными Excel-таблицами. Поэтому сегодня мы расскажем, как читать Excel-файлы в Pandas, а также рассмотрим основные возможности Python-библиотеки OpenPyXL для чтения метаданных ячеек.

Дополнительные зависимости для возможности чтения Excel таблиц

Для чтения таблиц Excel в Pandas требуются дополнительные зависимости:

  • xlrd поддерживает старые и новые форматы MS Excel [1];
  • OpenPyXL поддерживает новые форматы MS Excel (.xlsx) [2];
  • ODFpy поддерживает свободные форматы OpenDocument (.odf, .ods и .odt) [3];
  • pyxlsb поддерживает бинарные MS Excel файлы (формат .xlsb) [4].

Мы рекомендуем установить только OpenPyXL, поскольку он нам пригодится в дальнейшем. Для этого в командной строке прописывается следующая операция:

pip install openpyxl

Затем в Pandas нужно указать путь к Excel-файлу и одну из установленных зависимостей. Python-код выглядит следующим образом:

import pandas as pd
pd.read_excel(io='temp1.xlsx', engine='openpyxl')
#
     Name  Age  Weight
0    Alex   35      87
1   Lesha   57      72
2  Nastya   21      64

Читаем несколько листов

Excel-файл может содержать несколько листов. В Pandas, чтобы прочитать конкретный лист, в аргументе нужно указать sheet_name. Можно указать список названий листов, тогда Pandas вернет словарь (dict) с объектами DataFrame:

dfs = pd.read_excel(io='temp1.xlsx',
                    engine='openpyxl',
                    sheet_name=['Sheet1', 'Sheet2'])
dfs
#
{'Sheet1':      Name  Age  Weight
 0    Alex   35      87
 1   Lesha   57      72
 2  Nastya   21      64,
 'Sheet2':     Name  Age  Weight
 0  Gosha   43      95
 1   Anna   24      65
 2   Lena   22      78}

Если таблицы в словаре имеют одинаковые атрибуты, то их можно объединить в один DataFrame. В Python это выглядит так:

pd.concat(dfs).reset_index(drop=True)
     Name  Age  Weight
0    Alex   35      87
1   Lesha   57      72
2  Nastya   21      64
3   Gosha   43      95
4    Anna   24      65
5    Lena   22      78

Указание диапазонов

Таблицы могут размещаться не в самом начале, а как, например, на рисунке ниже. Как видим, таблица располагается в диапазоне A:F.

Таблица Excel

Таблица с диапазоном

Чтобы прочитать такую таблицу, нужно указать диапазон в аргументе usecols. Также дополнительно можно добавить header — номер заголовка таблицы, а также nrows — количество строк, которые нужно прочитать. В аргументе header всегда передается номер строки на единицу меньше, чем в Excel-файле, поскольку в Python индексация начинается с 0 (на рисунке это номер 5, тогда указываем 4):

pd.read_excel(io='temp1.xlsx',
              engine='openpyxl',
              usecols='D:F',
              header=4, # в excel это №5
              nrows=3)
#
    Name  Age  Weight
0  Gosha   43      95
1   Anna   24      65
2   Lena   22      78

Читаем таблицы в OpenPyXL

Pandas прочитывает только содержимое таблицы, но игнорирует метаданные: цвет заливки ячеек, примечания, стили таблицы и т.д. В таком случае пригодится библиотека OpenPyXL. Загрузка файлов осуществляется через функцию load_workbook, а к листам обращаться можно через квадратные скобки:

from openpyxl import load_workbook
wb = load_workbook('temp2.xlsx')
ws = wb['Лист1']
type(ws)
# openpyxl.worksheet.worksheet.Worksheet

Excel-таблица OpenPyXL

Две таблицы на листе

Допустим, имеется Excel-файл с несколькими таблицами на листе (см. рисунок выше). Если бы мы использовали Pandas, то он бы выдал следующий результат:

pd.read_excel(io='temp2.xlsx',
              engine='openpyxl')
#
     Name  Age  Weight  Unnamed: 3 Name.1  Age.1  Weight.1
0    Alex   35      87         NaN  Tanya     25        66
1   Lesha   57      72         NaN  Gosha     43        77
2  Nastya   21      64         NaN  Tolya     32        54

Можно, конечно, заняться обработкой и привести таблицы в нормальный вид, а можно воспользоваться OpenPyXL, который хранит таблицу и его диапазон в словаре. Чтобы посмотреть этот словарь, нужно вызвать ws.tables.items. Вот так выглядит Python-код:

ws.tables.items()
wb = load_workbook('temp2.xlsx')
ws = wb['Лист1']
ws.tables.items()
#
[('Таблица1', 'A1:C4'), ('Таблица13', 'E1:G4')]

Обращаясь к каждому диапазону, можно проходить по каждой строке или столбцу, а внутри них – по каждой ячейке. Например, следующий код на Python таблицы объединяет строки в список, где первая строка уходит на заголовок, а затем преобразует их в DataFrame:

dfs = []
for table_name, value in ws.tables.items():
    table = ws[value]
    header, *body = [[cell.value for cell in row]
                      for row in table]
    df = pd.DataFrame(body, columns=header)
    dfs.append(df)

Если таблицы имеют одинаковые атрибуты, то их можно соединить в одну:

pd.concat(dfs)
#
     Name  Age  Weight
0    Alex   35      87
1   Lesha   57      72
2  Nastya   21      64
0   Tanya   25      66
1   Gosha   43      77
2   Tolya   32      54

Сохраняем метаданные таблицы

Как указано в коде выше, у ячейки OpenPyXL есть атрибут value, который хранит ее значение. Помимо value, можно получить тип ячейки (data_type), цвет заливки (fill), примечание (comment) и др.

Excel OpenPyXL

Таблица с цветными ячейками

Например, требуется сохранить данные о цвете ячеек. Для этого мы каждую ячейку с числами перезапишем в виде <значение,RGB>, где RGB — значение цвета в формате RGB (red, green, blue). Python-код выглядит следующим образом:

# _TYPES = {int:'n', float:'n', str:'s', bool:'b'}
data = []
for row in ws.rows:
    row_cells = []
    for cell in row:
        cell_value = cell.value
        if cell.data_type == 'n':
            cell_value = f"{cell_value},{cell.fill.fgColor.rgb}"
        row_cells.append(cell_value)
    data.append(row_cells)

Первым элементом списка является строка-заголовок, а все остальное уже значения таблицы:

pd.DataFrame(data[1:], columns=data[0])
#
     Name          Age       Weight
0    Alex  35,00000000  87,00000000
1   Lesha  57,00000000  72,FFFF0000
2  Nastya  21,FF00A933  64,00000000

Теперь представим атрибуты в виде индексов с помощью метода stack, а после разобьём все записи на значение и цвет методом str.split:

(pd.DataFrame(data[1:], columns=data[0])
 .set_index('Name')
 .stack()
 .str.split(',', expand=True)
)
#
                0         1
Name                       
Alex   Age     35  00000000
       Weight  87  00000000
Lesha  Age     57  00000000
       Weight  72  FFFF0000
Nastya Age     21  FF00A933
       Weight  64  0000000

Осталось только переименовать 0 и 1 на Value и Color, а также добавить атрибут Variable, который обозначит Вес и Возраст. Полный код на Python выглядит следующим образом:

(pd.DataFrame(data[1:], columns=data[0])
 .set_index('Name')
 .stack()
 .str.split(',', expand=True)
 .set_axis(['Value', 'Color'], axis=1)
 .rename_axis(index=['Name', 'Variable'])
 .reset_index()
)
#
     Name Variable Value     Color
0    Alex      Age    35  00000000
1    Alex   Weight    87  00000000
2   Lesha      Age    57  00000000
3   Lesha   Weight    72  FFFF0000
4  Nastya      Age    21  FF00A933
5  Nastya   Weight    64  00000000

Ещё больше подробностей о работе с таблицами в Pandas, а также их обработке на реальных примерах Data Science задач, вы узнаете на наших курсах по Python в лицензированном учебном центре обучения и повышения квалификации IT-специалистов в Москве.

Источники

  1. https://xlrd.readthedocs.io/en/latest/
  2. https://openpyxl.readthedocs.io/en/latest/
  3. https://github.com/eea/odfpy
  4. https://github.com/willtrnr/pyxlsb

Why learn to work with Excel with Python? Excel is one of the most popular and widely-used data tools; it’s hard to find an organization that doesn’t work with it in some way. From analysts, to sales VPs, to CEOs, various professionals use Excel for both quick stats and serious data crunching.

With Excel being so pervasive, data professionals must be familiar with it. Working with data in Python or R offers serious advantages over Excel’s UI, so finding a way to work with Excel using code is critical. Thankfully, there’s a great tool already out there for using Excel with Python called pandas.

Pandas has excellent methods for reading all kinds of data from Excel files. You can also export your results from pandas back to Excel, if that’s preferred by your intended audience. Pandas is great for other routine data analysis tasks, such as:

  • quick Exploratory Data Analysis (EDA)
  • drawing attractive plots
  • feeding data into machine learning tools like scikit-learn
  • building machine learning models on your data
  • taking cleaned and processed data to any number of data tools

Pandas is better at automating data processing tasks than Excel, including processing Excel files.

In this tutorial, we are going to show you how to work with Excel files in pandas. We will cover the following concepts.

  • setting up your computer with the necessary software
  • reading in data from Excel files into pandas
  • data exploration in pandas
  • visualizing data in pandas using the matplotlib visualization library
  • manipulating and reshaping data in pandas
  • moving data from pandas into Excel

Note that this tutorial does not provide a deep dive into pandas. To explore pandas more, check out our course.

System Prerequisites

We will use Python 3 and Jupyter Notebook to demonstrate the code in this tutorial.In addition to Python and Jupyter Notebook, you will need the following Python modules:

  • matplotlib — data visualization
  • NumPy — numerical data functionality
  • OpenPyXL — read/write Excel 2010 xlsx/xlsm files
  • pandas — data import, clean-up, exploration, and analysis
  • xlrd — read Excel data
  • xlwt — write to Excel
  • XlsxWriter — write to Excel (xlsx) files

There are multiple ways to get set up with all the modules. We cover three of the most common scenarios below.

  • If you have Python installed via Anaconda package manager, you can install the required modules using the command conda install. For example, to install pandas, you would execute the command — conda install pandas.
  • If you already have a regular, non-Anaconda Python installed on the computer, you can install the required modules using pip. Open your command line program and execute command pip install <module name> to install a module. You should replace <module name> with the actual name of the module you are trying to install. For example, to install pandas, you would execute command — pip install pandas.
  • If you don’t have Python already installed, you should get it through the Anaconda package manager. Anaconda provides installers for Windows, Mac, and Linux Computers. If you choose the full installer, you will get all the modules you need, along with Python and pandas within a single package. This is the easiest and fastest way to get started.

The Data Set

In this tutorial, we will use a multi-sheet Excel file we created from Kaggle’s IMDB Scores data. You can download the file here.

img-excel-1

Our Excel file has three sheets: ‘1900s,’ ‘2000s,’ and ‘2010s.’ Each sheet has data for movies from those years.

We will use this data set to find the ratings distribution for the movies, visualize movies with highest ratings and net earnings and calculate statistical information about the movies. We will be analyzing and exploring this data using Python and pandas, thus demonstrating pandas capabilities for working with Excel data in Python.

Read data from the Excel file

We need to first import the data from the Excel file into pandas. To do that, we start by importing the pandas module.

import pandas as pd

We then use the pandas’ read_excel method to read in data from the Excel file. The easiest way to call this method is to pass the file name. If no sheet name is specified then it will read the first sheet in the index (as shown below).

excel_file = 'movies.xls'
movies = pd.read_excel(excel_file)

Here, the read_excel method read the data from the Excel file into a pandas DataFrame object. Pandas defaults to storing data in DataFrames. We then stored this DataFrame into a variable called movies.

Pandas has a built-in DataFrame.head() method that we can use to easily display the first few rows of our DataFrame. If no argument is passed, it will display first five rows. If a number is passed, it will display the equal number of rows from the top.

movies.head()
Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Facebook Likes — Actor 1 Facebook Likes — Actor 2 Facebook Likes — Actor 3 Facebook Likes — cast Total Facebook likes — Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
0 Intolerance: Love’s Struggle Throughout the Ages 1916 Drama|History|War NaN USA Not Rated 123 1.33 385907.0 NaN 436 22 9.0 481 691 1 10718 88 69.0 8.0
1 Over the Hill to the Poorhouse 1920 Crime|Drama NaN USA NaN 110 1.33 100000.0 3000000.0 2 2 0.0 4 0 1 5 1 1.0 4.8
2 The Big Parade 1925 Drama|Romance|War NaN USA Not Rated 151 1.33 245000.0 NaN 81 12 6.0 108 226 0 4849 45 48.0 8.3
3 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 136 23 18.0 203 12000 1 111841 413 260.0 8.3
4 Pandora’s Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 426 20 3.0 455 926 1 7431 84 71.0 8.0

5 rows × 25 columns

Excel files quite often have multiple sheets and the ability to read a specific sheet or all of them is very important. To make this easy, the pandas read_excel method takes an argument called sheetname that tells pandas which sheet to read in the data from. For this, you can either use the sheet name or the sheet number. Sheet numbers start with zero. If the sheetname argument is not given, it defaults to zero and pandas will import the first sheet.

By default, pandas will automatically assign a numeric index or row label starting with zero. You may want to leave the default index as such if your data doesn’t have a column with unique values that can serve as a better index. In case there is a column that you feel would serve as a better index, you can override the default behavior by setting index_col property to a column. It takes a numeric value for setting a single column as index or a list of numeric values for creating a multi-index.

In the below code, we are choosing the first column, ‘Title’, as index (index=0) by passing zero to the index_col argument.

movies_sheet1 = pd.read_excel(excel_file, sheetname=0, index_col=0)
movies_sheet1.head()
Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Director Facebook Likes — Actor 1 Facebook Likes — Actor 2 Facebook Likes — Actor 3 Facebook Likes — cast Total Facebook likes — Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
Title
Intolerance: Love’s Struggle Throughout the Ages 1916 Drama|History|War NaN USA Not Rated 123 1.33 385907.0 NaN D.W. Griffith 436 22 9.0 481 691 1 10718 88 69.0 8.0
Over the Hill to the Poorhouse 1920 Crime|Drama NaN USA NaN 110 1.33 100000.0 3000000.0 Harry F. Millarde 2 2 0.0 4 0 1 5 1 1.0 4.8
The Big Parade 1925 Drama|Romance|War NaN USA Not Rated 151 1.33 245000.0 NaN King Vidor 81 12 6.0 108 226 0 4849 45 48.0 8.3
Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 Fritz Lang 136 23 18.0 203 12000 1 111841 413 260.0 8.3
Pandora’s Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 Georg Wilhelm Pabst 426 20 3.0 455 926 1 7431 84 71.0 8.0

5 rows × 24 columns

As you noticed above, our Excel data file has three sheets. We already read the first sheet in a DataFrame above. Now, using the same syntax, we will read in rest of the two sheets too.

movies_sheet2 = pd.read_excel(excel_file, sheetname=1, index_col=0)
movies_sheet2.head()
Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Director Facebook Likes — Actor 1 Facebook Likes — Actor 2 Facebook Likes — Actor 3 Facebook Likes — cast Total Facebook likes — Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
Title
102 Dalmatians 2000 Adventure|Comedy|Family English USA G 100.0 1.85 85000000.0 66941559.0 Kevin Lima 2000.0 795.0 439.0 4182 372 1 26413 77.0 84.0 4.8
28 Days 2000 Comedy|Drama English USA PG-13 103.0 1.37 43000000.0 37035515.0 Betty Thomas 12000.0 10000.0 664.0 23864 0 1 34597 194.0 116.0 6.0
3 Strikes 2000 Comedy English USA R 82.0 1.85 6000000.0 9821335.0 DJ Pooh 939.0 706.0 585.0 3354 118 1 1415 10.0 22.0 4.0
Aberdeen 2000 Drama English UK NaN 106.0 1.85 6500000.0 64148.0 Hans Petter Moland 844.0 2.0 0.0 846 260 0 2601 35.0 28.0 7.3
All the Pretty Horses 2000 Drama|Romance|Western English USA PG-13 220.0 2.35 57000000.0 15527125.0 Billy Bob Thornton 13000.0 861.0 820.0 15006 652 2 11388 183.0 85.0 5.8

5 rows × 24 columns

movies_sheet3 = pd.read_excel(excel_file, sheetname=2, index_col=0)
movies_sheet3.head()
Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Director Facebook Likes — Actor 1 Facebook Likes — Actor 2 Facebook Likes — Actor 3 Facebook Likes — cast Total Facebook likes — Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
Title
127 Hours 2010.0 Adventure|Biography|Drama|Thriller English USA R 94.0 1.85 18000000.0 18329466.0 Danny Boyle 11000.0 642.0 223.0 11984 63000 0.0 279179 440.0 450.0 7.6
3 Backyards 2010.0 Drama English USA R 88.0 NaN 300000.0 NaN Eric Mendelsohn 795.0 659.0 301.0 1884 92 0.0 554 23.0 20.0 5.2
3 2010.0 Comedy|Drama|Romance German Germany Unrated 119.0 2.35 NaN 59774.0 Tom Tykwer 24.0 20.0 9.0 69 2000 0.0 4212 18.0 76.0 6.8
8: The Mormon Proposition 2010.0 Documentary English USA R 80.0 1.78 2500000.0 99851.0 Reed Cowan 191.0 12.0 5.0 210 0 0.0 1138 30.0 28.0 7.1
A Turtle’s Tale: Sammy’s Adventures 2010.0 Adventure|Animation|Family English France PG 88.0 2.35 NaN NaN Ben Stassen 783.0 749.0 602.0 3874 0 2.0 5385 22.0 56.0 6.1

5 rows × 24 columns

Since all the three sheets have similar data but for different recordsmovies, we will create a single DataFrame from all the three DataFrames we created above. We will use the pandas concat method for this and pass in the names of the three DataFrames we just created and assign the results to a new DataFrame object, movies. By keeping the DataFrame name same as before, we are over-writing the previously created DataFrame.

movies = pd.concat([movies_sheet1, movies_sheet2, movies_sheet3])

We can check if this concatenation by checking the number of rows in the combined DataFrame by calling the method shape on it that will give us the number of rows and columns.

movies.shape
(5042, 24)

Using the ExcelFile class to read multiple sheets

We can also use the ExcelFile class to work with multiple sheets from the same Excel file. We first wrap the Excel file using ExcelFile and then pass it to read_excel method.

xlsx = pd.ExcelFile(excel_file)
movies_sheets = []
for sheet in xlsx.sheet_names:
   movies_sheets.append(xlsx.parse(sheet))
movies = pd.concat(movies_sheets)

If you are reading an Excel file with a lot of sheets and are creating a lot of DataFrames, ExcelFile is more convenient and efficient in comparison to read_excel. With ExcelFile, you only need to pass the Excel file once, and then you can use it to get the DataFrames. When using read_excel, you pass the Excel file every time and hence the file is loaded again for every sheet. This can be a huge performance drag if the Excel file has many sheets with a large number of rows.

Exploring the data

Now that we have read in the movies data set from our Excel file, we can start exploring it using pandas. A pandas DataFrame stores the data in a tabular format, just like the way Excel displays the data in a sheet. Pandas has a lot of built-in methods to explore the DataFrame we created from the Excel file we just read in.

We already introduced the method head in the previous section that displays few rows from the top from the DataFrame. Let’s look at few more methods that come in handy while exploring the data set.

We can use the shape method to find out the number of rows and columns for the DataFrame.

movies.shape
(5042, 25)

This tells us our Excel file has 5042 records and 25 columns or observations. This can be useful in reporting the number of records and columns and comparing that with the source data set.

We can use the tail method to view the bottom rows. If no parameter is passed, only the bottom five rows are returned.

movies.tail()
Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Facebook Likes — Actor 1 Facebook Likes — Actor 2 Facebook Likes — Actor 3 Facebook Likes — cast Total Facebook likes — Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
1599 War & Peace NaN Drama|History|Romance|War English UK TV-14 NaN 16.00 NaN NaN 1000.0 888.0 502.0 4528 11000 1.0 9277 44.0 10.0 8.2
1600 Wings NaN Comedy|Drama English USA NaN 30.0 1.33 NaN NaN 685.0 511.0 424.0 1884 1000 5.0 7646 56.0 19.0 7.3
1601 Wolf Creek NaN Drama|Horror|Thriller English Australia NaN NaN 2.00 NaN NaN 511.0 457.0 206.0 1617 954 0.0 726 6.0 2.0 7.1
1602 Wuthering Heights NaN Drama|Romance English UK NaN 142.0 NaN NaN NaN 27000.0 698.0 427.0 29196 0 2.0 6053 33.0 9.0 7.7
1603 Yu-Gi-Oh! Duel Monsters NaN Action|Adventure|Animation|Family|Fantasy Japanese Japan NaN 24.0 NaN NaN NaN 0.0 NaN NaN 0 124 0.0 12417 51.0 6.0 7.0

5 rows × 25 columns

In Excel, you’re able to sort a sheet based on the values in one or more columns. In pandas, you can do the same thing with the sort_values method. For example, let’s sort our movies DataFrame based on the Gross Earnings column.

sorted_by_gross = movies.sort_values(['Gross Earnings'], ascending=False)

Since we have the data sorted by values in a column, we can do few interesting things with it. For example, we can display the top 10 movies by Gross Earnings.

sorted_by_gross["Gross Earnings"].head(10)
1867 760505847.0
1027 658672302.0
1263 652177271.0
610 623279547.0
611 623279547.0
1774 533316061.0
1281 474544677.0
226 460935665.0
1183 458991599.0
618 448130642.0
Name: Gross Earnings, dtype: float64

We can also create a plot for the top 10 movies by Gross Earnings. Pandas makes it easy to visualize your data with plots and charts through matplotlib, a popular data visualization library. With a couple lines of code, you can start plotting. Moreover, matplotlib plots work well inside Jupyter Notebooks since you can displace the plots right under the code.

First, we import the matplotlib module and set matplotlib to display the plots right in the Jupyter Notebook.

import matplotlib.pyplot as plt%matplotlib inline

We will draw a bar plot where each bar will represent one of the top 10 movies. We can do this by calling the plot method and setting the argument kind to barh. This tells matplotlib to draw a horizontal bar plot.

sorted_by_gross['Gross Earnings'].head(10).plot(kind="barh")
plt.show()

python-pandas-and-excel_28_0

Let’s create a histogram of IMDB Scores to check the distribution of IMDB Scores across all movies. Histograms are a good way to visualize the distribution of a data set. We use the plot method on the IMDB Scores series from our movies DataFrame and pass it the argument.

movies['IMDB Score'].plot(kind="hist")
plt.show()

python-pandas-and-excel_30_0

This data visualization suggests that most of the IMDB Scores fall between six and eight.

Getting statistical information about the data

Pandas has some very handy methods to look at the statistical data about our data set. For example, we can use the describe method to get a statistical summary of the data set.

movies.describe()
Year Duration Aspect Ratio Budget Gross Earnings Facebook Likes — Director Facebook Likes — Actor 1 Facebook Likes — Actor 2 Facebook Likes — Actor 3 Facebook Likes — cast Total Facebook likes — Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
count 4935.000000 5028.000000 4714.000000 4.551000e+03 4.159000e+03 4938.000000 5035.000000 5029.000000 5020.000000 5042.000000 5042.000000 5029.000000 5.042000e+03 5022.000000 4993.000000 5042.000000
mean 2002.470517 107.201074 2.220403 3.975262e+07 4.846841e+07 686.621709 6561.323932 1652.080533 645.009761 9700.959143 7527.457160 1.371446 8.368475e+04 272.770808 140.194272 6.442007
std 12.474599 25.197441 1.385113 2.061149e+08 6.845299e+07 2813.602405 15021.977635 4042.774685 1665.041728 18165.101925 19322.070537 2.013683 1.384940e+05 377.982886 121.601675 1.125189
min 1916.000000 7.000000 1.180000 2.180000e+02 1.620000e+02 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000e+00 1.000000 1.000000 1.600000
25% 1999.000000 93.000000 1.850000 6.000000e+06 5.340988e+06 7.000000 614.500000 281.000000 133.000000 1411.250000 0.000000 0.000000 8.599250e+03 65.000000 50.000000 5.800000
50% 2005.000000 103.000000 2.350000 2.000000e+07 2.551750e+07 49.000000 988.000000 595.000000 371.500000 3091.000000 166.000000 1.000000 3.437100e+04 156.000000 110.000000 6.600000
75% 2011.000000 118.000000 2.350000 4.500000e+07 6.230944e+07 194.750000 11000.000000 918.000000 636.000000 13758.750000 3000.000000 2.000000 9.634700e+04 326.000000 195.000000 7.200000
max 2016.000000 511.000000 16.000000 1.221550e+10 7.605058e+08 23000.000000 640000.000000 137000.000000 23000.000000 656730.000000 349000.000000 43.000000 1.689764e+06 5060.000000 813.000000 9.500000

The describe method displays below information for each of the columns.

  • the count or number of values
  • mean
  • standard deviation
  • minimum, maximum
  • 25%, 50%, and 75% quantile

Please note that this information will be calculated only for the numeric values.

We can also use the corresponding method to access this information one at a time. For example, to get the mean of a particular column, you can use the mean method on that column.

movies["Gross Earnings"].mean()
48468407.526809327

Just like mean, there are methods available for each of the statistical information we want to access. You can read about these methods in our free pandas cheat sheet.

Reading files with no header and skipping records

Earlier in this tutorial, we saw some ways to read a particular kind of Excel file that had headers and no rows that needed skipping. Sometimes, the Excel sheet doesn’t have any header row. For such instances, you can tell pandas not to consider the first row as header or columns names. And If the Excel sheet’s first few rows contain data that should not be read in, you can ask the read_excel method to skip a certain number of rows, starting from the top.

For example, look at the top few rows of this Excel file.img-excel-no-header-1

This file obviously has no header and first four rows are not actual records and hence should not be read in. We can tell read_excel there is no header by setting argument header to None and we can skip first four rows by setting argument skiprows to four.

movies_skip_rows = pd.read_excel("movies-no-header-skip-rows.xls", header=None, skiprows=4)
movies_skip_rows.head(5)
0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24
0 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 136 23 18.0 203 12000 1 111841 413 260.0 8.3
1 Pandora’s Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 426 20 3.0 455 926 1 7431 84 71.0 8.0
2 The Broadway Melody 1929 Musical|Romance English USA Passed 100 1.37 379000.0 2808000.0 77 28 4.0 109 167 8 4546 71 36.0 6.3
3 Hell’s Angels 1930 Drama|War English USA Passed 96 1.20 3950000.0 NaN 431 12 4.0 457 279 1 3753 53 35.0 7.8
4 A Farewell to Arms 1932 Drama|Romance|War English USA Unrated 79 1.37 800000.0 NaN 998 164 99.0 1284 213 1 3519 46 42.0 6.6

5 rows × 25 columns

We skipped four rows from the sheet and used none of the rows as the header. Also, notice that one can combine different options in a single read statement. To skip rows at the bottom of the sheet, you can use option skip_footer, which works just like skiprows, the only difference being the rows are counted from the bottom upwards.

The column names in the previous DataFrame are numeric and were allotted as default by the pandas. We can rename the column names to descriptive ones by calling the method columns on the DataFrame and passing the column names as a list.

movies_skip_rows.columns = ['Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating', 'Duration', 'Aspect Ratio', 'Budget', 'Gross Earnings', 'Director', 'Actor 1', 'Actor 2', 'Actor 3', 'Facebook Likes - Director', 'Facebook Likes - Actor 1', 'Facebook Likes - Actor 2', 'Facebook Likes - Actor 3', 'Facebook Likes - cast Total', 'Facebook likes - Movie', 'Facenumber in posters', 'User Votes', 'Reviews by Users', 'Reviews by Crtiics', 'IMDB Score']
movies_skip_rows.head()
Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Facebook Likes — Actor 1 Facebook Likes — Actor 2 Facebook Likes — Actor 3 Facebook Likes — cast Total Facebook likes — Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
0 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 136 23 18.0 203 12000 1 111841 413 260.0 8.3
1 Pandora’s Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 426 20 3.0 455 926 1 7431 84 71.0 8.0
2 The Broadway Melody 1929 Musical|Romance English USA Passed 100 1.37 379000.0 2808000.0 77 28 4.0 109 167 8 4546 71 36.0 6.3
3 Hell’s Angels 1930 Drama|War English USA Passed 96 1.20 3950000.0 NaN 431 12 4.0 457 279 1 3753 53 35.0 7.8
4 A Farewell to Arms 1932 Drama|Romance|War English USA Unrated 79 1.37 800000.0 NaN 998 164 99.0 1284 213 1 3519 46 42.0 6.6

5 rows × 25 columns

Now that we have seen how to read a subset of rows from the Excel file, we can learn how to read a subset of columns.

Reading a subset of columns

Although read_excel defaults to reading and importing all columns, you can choose to import only certain columns. By passing parse_cols=6, we are telling the read_excel method to read only the first columns till index six or first seven columns (the first column being indexed zero).

movies_subset_columns = pd.read_excel(excel_file, parse_cols=6)
movies_subset_columns.head()
Title Year Genres Language Country Content Rating Duration
0 Intolerance: Love’s Struggle Throughout the Ages 1916 Drama|History|War NaN USA Not Rated 123
1 Over the Hill to the Poorhouse 1920 Crime|Drama NaN USA NaN 110
2 The Big Parade 1925 Drama|Romance|War NaN USA Not Rated 151
3 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145
4 Pandora’s Box 1929 Crime|Drama|Romance German Germany Not Rated 110

Alternatively, you can pass in a list of numbers, which will let you import columns at particular indexes.

Applying formulas on the columns

One of the much-used features of Excel is to apply formulas to create new columns from existing column values. In our Excel file, we have Gross Earnings and Budget columns. We can get Net earnings by subtracting Budget from Gross earnings. We could then apply this formula in the Excel file to all the rows. We can do this in pandas also as shown below.

movies["Net Earnings"] = movies["Gross Earnings"] - movies["Budget"]

Above, we used pandas to create a new column called Net Earnings, and populated it with the difference of Gross Earnings and Budget. It’s worth noting the difference here in how formulas are treated in Excel versus pandas. In Excel, a formula lives in the cell and updates when the data changes — with Python, the calculations happen and the values are stored — if Gross Earnings for one movie was manually changed, Net Earnings won’t be updated.

Let’s use the sort_values method to sort the data by the new column we created and visualize the top 10 movies by Net Earnings.

sorted_movies = movies[['Net Earnings']].sort_values(['Net Earnings'], ascending=[False])sorted_movies.head(10)['Net Earnings'].plot.barh()
plt.show()

python-pandas-and-excel_44_0

Pivot Table in pandas

Advanced Excel users also often use pivot tables. A pivot table summarizes the data of another table by grouping the data on an index and applying operations such as sorting, summing, or averaging. You can use this feature in pandas too.

We need to first identify the column or columns that will serve as the index, and the column(s) on which the summarizing formula will be applied. Let’s start small, by choosing Year as the index column and Gross Earnings as the summarization column and creating a separate DataFrame from this data.

movies_subset = movies[['Year', 'Gross Earnings']]
movies_subset.head()
Year Gross Earnings
0 1916.0 NaN
1 1920.0 3000000.0
2 1925.0 NaN
3 1927.0 26435.0
4 1929.0 9950.0

We now call pivot_table on this subset of data. The method pivot_table takes a parameter index. As mentioned, we want to use Year as the index.

earnings_by_year = movies_subset.pivot_table(index=['Year'])
earnings_by_year.head()
Gross Earnings
Year
1916.0 NaN
1920.0 3000000.0
1925.0 NaN
1927.0 26435.0
1929.0 1408975.0

This gave us a pivot table with grouping on Year and summarization on the sum of Gross Earnings. Notice, we didn’t need to specify Gross Earnings column explicitly as pandas automatically identified it the values on which summarization should be applied.

We can use this pivot table to create some data visualizations. We can call the plot method on the DataFrame to create a line plot and call the show method to display the plot in the notebook.

earnings_by_year.plot()
plt.show()

python-pandas-and-excel_50_0

We saw how to pivot with a single column as the index. Things will get more interesting if we can use multiple columns. Let’s create another DataFrame subset but this time we will choose the columns, Country, Language and Gross Earnings.

movies_subset = movies[['Country', 'Language', 'Gross Earnings']]
movies_subset.head()
Country Language Gross Earnings
0 USA NaN NaN
1 USA NaN 3000000.0
2 USA NaN NaN
3 Germany German 26435.0
4 Germany German 9950.0

We will use columns Country and Language as the index for the pivot table. We will use Gross Earnings as summarization table, however, we do not need to specify this explicitly as we saw earlier.

earnings_by_co_lang = movies_subset.pivot_table(index=['Country', 'Language'])
earnings_by_co_lang.head()
Gross Earnings
Country Language
Afghanistan Dari 1.127331e+06
Argentina Spanish 7.230936e+06
Aruba English 1.007614e+07
Australia Aboriginal 6.165429e+06
Dzongkha 5.052950e+05

Let’s visualize this pivot table with a bar plot. Since there are still few hundred records in this pivot table, we will plot just a few of them.

earnings_by_co_lang.head(20).plot(kind='bar', figsize=(20,8))
plt.show()

python-pandas-and-excel_56_0

Exporting the results to Excel

If you’re going to be working with colleagues who use Excel, saving Excel files out of pandas is important. You can export or write a pandas DataFrame to an Excel file using pandas to_excel method. Pandas uses the xlwt Python module internally for writing to Excel files. The to_excel method is called on the DataFrame we want to export.We also need to pass a filename to which this DataFrame will be written.

movies.to_excel('output.xlsx')

By default, the index is also saved to the output file. However, sometimes the index doesn’t provide any useful information. For example, the movies DataFrame has a numeric auto-increment index, that was not part of the original Excel data.

movies.head()
Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Facebook Likes — Actor 2 Facebook Likes — Actor 3 Facebook Likes — cast Total Facebook likes — Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score Net Earnings
0 Intolerance: Love’s Struggle Throughout the Ages 1916.0 Drama|History|War NaN USA Not Rated 123.0 1.33 385907.0 NaN 22.0 9.0 481 691 1.0 10718 88.0 69.0 8.0 NaN
1 Over the Hill to the Poorhouse 1920.0 Crime|Drama NaN USA NaN 110.0 1.33 100000.0 3000000.0 2.0 0.0 4 0 1.0 5 1.0 1.0 4.8 2900000.0
2 The Big Parade 1925.0 Drama|Romance|War NaN USA Not Rated 151.0 1.33 245000.0 NaN 12.0 6.0 108 226 0.0 4849 45.0 48.0 8.3 NaN
3 Metropolis 1927.0 Drama|Sci-Fi German Germany Not Rated 145.0 1.33 6000000.0 26435.0 23.0 18.0 203 12000 1.0 111841 413.0 260.0 8.3 -5973565.0
4 Pandora’s Box 1929.0 Crime|Drama|Romance German Germany Not Rated 110.0 1.33 NaN 9950.0 20.0 3.0 455 926 1.0 7431 84.0 71.0 8.0 NaN

5 rows × 26 columns

You can choose to skip the index by passing along index-False.

movies.to_excel('output.xlsx', index=False)

We need to be able to make our output files look nice before we can send it out to our co-workers. We can use pandas ExcelWriter class along with the XlsxWriter Python module to apply the formatting.

We can do use these advanced output options by creating a ExcelWriter object and use this object to write to the EXcel file.

writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
movies.to_excel(writer, index=False, sheet_name='report')
workbook = writer.bookworksheet = writer.sheets['report']

We can apply customizations by calling add_format on the workbook we are writing to. Here we are setting header format as bold.

header_fmt = workbook.add_format({'bold': True})
worksheet.set_row(0, None, header_fmt)

Finally, we save the output file by calling the method save on the writer object.

writer.save()

As an example, we saved the data with column headers set as bold. And the saved file looks like the image below.

img-excel-output-bold-1

Like this, one can use XlsxWriter to apply various formatting to the output Excel file.

Conclusion

Pandas is not a replacement for Excel. Both tools have their place in the data analysis workflow and can be very great companion tools. As we demonstrated, pandas can do a lot of complex data analysis and manipulations, which depending on your need and expertise, can go beyond what you can achieve if you are just using Excel. One of the major benefits of using Python and pandas over Excel is that it helps you automate Excel file processing by writing scripts and integrating with your automated data workflow. Pandas also has excellent methods for reading all kinds of data from Excel files. You can export your results from pandas back to Excel too if that’s preferred by your intended audience.

On the other hand, Excel is a such a widely used data tool, it’s not a wise to ignore it. Acquiring expertise in both pandas and Excel and making them work together gives you skills that can help you stand out in your organization.

If you’d like to learn more about this topic, check out Dataquest’s interactive Pandas and NumPy Fundamentals course, and our Data Analyst in Python, and Data Scientist in Python paths that will help you become job-ready in around 6 months.

In this tutorial, you’ll learn how to use Python and Pandas to read Excel files using the Pandas read_excel function. Excel files are everywhere – and while they may not be the ideal data type for many data scientists, knowing how to work with them is an essential skill.

By the end of this tutorial, you’ll have learned:

  • How to use the Pandas read_excel function to read an Excel file
  • How to read specify an Excel sheet name to read into Pandas
  • How to read multiple Excel sheets or files
  • How to certain columns from an Excel file in Pandas
  • How to skip rows when reading Excel files in Pandas
  • And more

Let’s get started!

The Quick Answer: Use Pandas read_excel to Read Excel Files

To read Excel files in Python’s Pandas, use the read_excel() function. You can specify the path to the file and a sheet name to read, as shown below:

# Reading an Excel File in Pandas
import pandas as pd

df = pd.read_excel('/Users/datagy/Desktop/Sales.xlsx')

# With a Sheet Name
df = pd.read_excel(
   io='/Users/datagy/Desktop/Sales.xlsx'
   sheet_name ='North'
)

In the following sections of this tutorial, you’ll learn more about the Pandas read_excel() function to better understand how to customize reading Excel files.

Understanding the Pandas read_excel Function

The Pandas read_excel() function has a ton of different parameters. In this tutorial, you’ll learn how to use the main parameters available to you that provide incredible flexibility in terms of how you read Excel files in Pandas.

Parameter Description Available Option
io= The string path to the workbook. URL to file, path to file, etc.
sheet_name= The name of the sheet to read. Will default to the first sheet in the workbook (position 0). Can read either strings (for the sheet name), integers (for position), or lists (for multiple sheets)
usecols= The columns to read, if not all columns are to be read Can be strings of columns, Excel-style columns (“A:C”), or integers representing positions columns
dtype= The datatypes to use for each column Dictionary with columns as keys and data types as values
skiprows= The number of rows to skip from the top Integer value representing the number of rows to skip
nrows= The number of rows to parse Integer value representing the number of rows to read
The important parameters of the Pandas .read_excel() function

The table above highlights some of the key parameters available in the Pandas .read_excel() function. The full list can be found in the official documentation. In the following sections, you’ll learn how to use the parameters shown above to read Excel files in different ways using Python and Pandas.

As shown above, the easiest way to read an Excel file using Pandas is by simply passing in the filepath to the Excel file. The io= parameter is the first parameter, so you can simply pass in the string to the file.

The parameter accepts both a path to a file, an HTTP path, an FTP path or more. Let’s see what happens when we read in an Excel file hosted on my Github page.

# Reading an Excel file in Pandas
import pandas as pd

df = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/Sales.xlsx')
print(df.head())

# Returns:
#         Date Customer  Sales
# 0 2022-04-01        A    191
# 1 2022-04-02        B    727
# 2 2022-04-03        A    782
# 3 2022-04-04        B    561
# 4 2022-04-05        A    969

If you’ve downloaded the file and taken a look at it, you’ll notice that the file has three sheets? So, how does Pandas know which sheet to load? By default, Pandas will use the first sheet (positionally), unless otherwise specified.

In the following section, you’ll learn how to specify which sheet you want to load into a DataFrame.

How to Specify Excel Sheet Names in Pandas read_excel

As shown in the previous section, you learned that when no sheet is specified, Pandas will load the first sheet in an Excel workbook. In the workbook provided, there are three sheets in the following structure:

Sales.xlsx
|---East
|---West
|---North

Because of this, we know that the data from the sheet “East” was loaded. If we wanted to load the data from the sheet “West”, we can use the sheet_name= parameter to specify which sheet we want to load.

The parameter accepts both a string as well as an integer. If we were to pass in a string, we can specify the sheet name that we want to load.

Let’s take a look at how we can specify the sheet name for 'West':

# Specifying an Excel Sheet to Load by Name
import pandas as pd

df = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    sheet_name='West')
print(df.head())

# Returns:
#         Date Customer  Sales
# 0 2022-04-01        A    504
# 1 2022-04-02        B    361
# 2 2022-04-03        A    694
# 3 2022-04-04        B    702
# 4 2022-04-05        A    255

Similarly, we can load a sheet name by its position. By default, Pandas will use the position of 0, which will load the first sheet. Say we wanted to repeat our earlier example and load the data from the sheet named 'West', we would need to know where the sheet is located.

Because we know the sheet is the second sheet, we can pass in the 1st index:

# Specifying an Excel Sheet to Load by Position
import pandas as pd

df = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    sheet_name=1)
print(df.head())

# Returns:
#         Date Customer  Sales
# 0 2022-04-01        A    504
# 1 2022-04-02        B    361
# 2 2022-04-03        A    694
# 3 2022-04-04        B    702
# 4 2022-04-05        A    255

We can see that both of these methods returned the same sheet’s data. In the following section, you’ll learn how to specify which columns to load when using the Pandas read_excel function.

How to Specify Columns Names in Pandas read_excel

There may be many times when you don’t want to load every column in an Excel file. This may be because the file has too many columns or has different columns for different worksheets.

In order to do this, we can use the usecols= parameter. It’s a very flexible parameter that lets you specify:

  • A list of column names,
  • A string of Excel column ranges,
  • A list of integers specifying the column indices to load

Most commonly, you’ll encounter people using a list of column names to read in. Each of these columns are comma separated strings, contained in a list.

Let’s load our DataFrame from the example above, only this time only loading the 'Customer' and 'Sales' columns:

# Specifying Columns to Load by Name
import pandas as pd

df = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    usecols=['Customer', 'Sales'])
print(df.head())

# Returns:
#   Customer  Sales
# 0        A    191
# 1        B    727
# 2        A    782
# 3        B    561
# 4        A    969

We can see that by passing in the list of strings representing the columns, we were able to parse those columns only.

If we wanted to use Excel changes, we could also specify columns 'B:C'. Let’s see what this looks like below:

# Specifying Columns to Load by Excel Range
import pandas as pd

df = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    usecols='B:C')
print(df.head())

# Returns:
#   Customer  Sales
# 0        A    191
# 1        B    727
# 2        A    782
# 3        B    561
# 4        A    969

Finally, we can also pass in a list of integers that represent the positions of the columns we wanted to load. Because the columns are the second and third columns, we would load a list of integers as shown below:

# Specifying Columns to Load by Their Position
import pandas as pd

df = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    usecols=[1,2])
print(df.head())

# Returns:
#   Customer  Sales
# 0        A    191
# 1        B    727
# 2        A    782
# 3        B    561
# 4        A    969

In the following section, you’ll learn how to specify data types when reading Excel files.

How to Specify Data Types in Pandas read_excel

Pandas makes it easy to specify the data type of different columns when reading an Excel file. This serves three main purposes:

  1. Preventing data from being read incorrectly
  2. Speeding up the read operation
  3. Saving memory

You can pass in a dictionary where the keys are the columns and the values are the data types. This ensures that data are ready correctly. Let’s see how we can specify the data types for our columns.

# Specifying Data Types for Columns When Reading Excel Files
import pandas as pd

df = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    dtype={'date':'datetime64', 'Customer': 'object', 'Sales':'int'})
print(df.head())

# Returns:
#   Customer  Sales
#         Date Customer  Sales
# 0 2022-04-01        A    191
# 1 2022-04-02        B    727
# 2 2022-04-03        A    782
# 3 2022-04-04        B    561
# 4 2022-04-05        A    969

It’s important to note that you don’t need to pass in all the columns for this to work. In the next section, you’ll learn how to skip rows when reading Excel files.

How to Skip Rows When Reading Excel Files in Pandas

In some cases, you’ll encounter files where there are formatted title rows in your Excel file, as shown below:

A poorly formatted Excel File
An Excel with unusual formatting

If we were to read the sheet 'North', we would get the following returned:

# Reading a poorly formatted Excel file
import pandas as pd

df = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    sheet_name='North')
print(df.head())

# Returns:
#            North Sales Unnamed: 1 Unnamed: 2
# 0     Totals Available        NaN        NaN
# 1                 Date   Customer      Sales
# 2  2022-04-01 00:00:00          A        164
# 3  2022-04-02 00:00:00          B        612
# 4  2022-04-03 00:00:00          A        260

Pandas makes it easy to skip a certain number of rows when reading an Excel file. This can be done using the skiprows= parameter. We can see that we need to skip two rows, so we can simply pass in the value 2, as shown below:

# Reading a Poorly Formatted File Correctly
import pandas as pd

df = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    sheet_name='North',
    skiprows=2)
print(df.head())

# Returns:
#         Date Customer  Sales
# 0 2022-04-01        A    164
# 1 2022-04-02        B    612
# 2 2022-04-03        A    260
# 3 2022-04-04        B    314
# 4 2022-04-05        A    215

This read the file much more accurately! It can be a lifesaver when working with poorly formatted files. In the next section, you’ll learn how to read multiple sheets in an Excel file in Pandas.

How to Read Multiple Sheets in an Excel File in Pandas

Pandas makes it very easy to read multiple sheets at the same time. This can be done using the sheet_name= parameter. In our earlier examples, we passed in only a single string to read a single sheet. However, you can also pass in a list of sheets to read multiple sheets at once.

Let’s see how we can read our first two sheets:

# Reading Multiple Excel Sheets at Once in Pandas
import pandas as pd

dfs = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    sheet_name=['East', 'West'])

print(type(dfs))

# Returns: <class 'dict'>

In the example above, we passed in a list of sheets to read. When we used the type() function to check the type of the returned value, we saw that a dictionary was returned.

Each of the sheets is a key of the dictionary with the DataFrame being the corresponding key’s value. Let’s see how we can access the 'West' DataFrame:

# Reading Multiple Excel Sheets in Pandas
import pandas as pd

dfs = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    sheet_name=['East', 'West'])

print(dfs.get('West').head())

# Returns: 
#         Date Customer  Sales
# 0 2022-04-01        A    504
# 1 2022-04-02        B    361
# 2 2022-04-03        A    694
# 3 2022-04-04        B    702
# 4 2022-04-05        A    255

You can also read all of the sheets at once by specifying None for the value of sheet_name=. Similarly, this returns a dictionary of all sheets:

# Reading Multiple Excel Sheets in Pandas
import pandas as pd

dfs = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    sheet_name=None)

In the next section, you’ll learn how to read multiple Excel files in Pandas.

How to Read Only n Lines When Reading Excel Files in Pandas

When working with very large Excel files, it can be helpful to only sample a small subset of the data first. This allows you to quickly load the file to better be able to explore the different columns and data types.

This can be done using the nrows= parameter, which accepts an integer value of the number of rows you want to read into your DataFrame. Let’s see how we can read the first five rows of the Excel sheet:

# Reading n Number of Rows of an Excel Sheet
import pandas as pd

df = pd.read_excel(
    io='https://github.com/datagy/mediumdata/raw/master/Sales.xlsx',
    nrows=5)
print(df)

# Returns:
#         Date Customer  Sales
# 0 2022-04-01        A    191
# 1 2022-04-02        B    727
# 2 2022-04-03        A    782
# 3 2022-04-04        B    561
# 4 2022-04-05        A    969

Conclusion

In this tutorial, you learned how to use Python and Pandas to read Excel files into a DataFrame using the .read_excel() function. You learned how to use the function to read an Excel, specify sheet names, read only particular columns, and specify data types. You then learned how skip rows, read only a set number of rows, and read multiple sheets.

Additional Resources

To learn more about related topics, check out the tutorials below:

  • Pandas Dataframe to CSV File – Export Using .to_csv()
  • Combine Data in Pandas with merge, join, and concat
  • Introduction to Pandas for Data Science
  • Summarizing and Analyzing a Pandas DataFrame

code python pandas import excel

Learn how to import an Excel file (having .xlsx extension) using python pandas.

Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data. Reading data from excel files or CSV files, and writing data to Excel files or CSV files using Python Pandas is a necessary skill for any analyst or data scientist.

Table of Contents

  1. Python Pandas read_excel() Syntax
  2. Import Excel file using Python Pandas (Example)
  3. read_excel Important Parameters Examples
    1. Import Specific Excel Sheet using sheet name
      • Import Multiple Excel Sheets Pandas
    2. Import only n Rows of Excel Sheet
    3. Import specific columns of Excel Sheet
  4. Common Errors and Troubleshooting

1. Pandas read_excel() Syntax

The syntax of DataFrame to_excel() function and some of the important parameters are:

pandas.read_excel(io='filepath', sheet_name=0, header=0, usecols=None, nrows=None)
Sr.No Parameters Description
1 io
the file path from where you want to read the data. This could be a URL path or, could be a local system file path. Valid URL schemes include http, ftp, s3, and file.
2 sheet_name: str, int, list, or None, default 0
Available cases:
~Default is 0: 1st sheet as a DataFrame
~Use 1: To read 2nd sheet as a DataFrame
~Use Specific Sheet Name: "Sheet1" to load sheet with name “Sheet1”
~Load Multiple Sheets using dict:[0, 2, "MySheet"] will load first, third and sheet named “MySheet” as a dictionary of DataFrame
~None: Load All sheets
3 header
default is 0. Pass Header = 1 to consider the second line of the dataset as a header. Use None if there is no header.
4 usecols
~Default is None, then parse all columns.
~If str, then provide a comma-separated list of Excel columns (“A, B, D, E”) or range of Excel columns (e.g. “A:F” or “A, B,E:F”). Ranges are inclusive of both sides.
~If list of int, indicates list of column numbers to be parsed e.g. [1,2,5].
~If list of string, provide list of column names to be parsed e.g. [“A, B, D, E”].
5 nrows:
Default is None
Number of rows to parse (provide int).
Python Pandas read_excel() Syntax

For complete list of read_excel parameters refer to official documentation.

2. Import Excel file using Python Pandas

Let’s review a full example:

  • Create a DataFrame from scratch and save it as Excel
  • Import (or load) the DataFrame from above saved Excel file
import pandas as pd

# Create a dataframe
raw_data = {'first_name': ['Sam','Ziva','Kia','Robin'], 
        'degree': ['PhD','MBA','','MS'],
        'age': [25, 29, 19, 21]}
df = pd.DataFrame(raw_data)

df

#Save the dataframe to the current directory
df.to_excel(r'Example1.xlsx')

We have the following data about students:

first_name degree age
0 Sam PhD 25
1 Ziva MBA 29
2 Kia 19
3 Robin MS 21
Excel file snapshot in to import in Pandas Dataframe Jupyter

Read Excel file into Pandas DataFrame (Explained)

Now, let’s see the steps to import the Excel file into a DataFrame.

Step 1: Enter the path and filename where the Excel file is stored. The could be a local system file path or URL path.

For example,

 pd.read_excel(r‘D:PythonTutorialExample1.csv‘)

Notice that path is highlighted with 3 different colors:

  • The blue part represents the path where the Excel file is saved.
  • The green part is the name of the file you want to import.
  • The purple part represents the file type or Excel file extension. Use ‘.xlsx’ in case of an Excel file.

Modify the Python above code to reflect the path where the Excel file is stored on your computer.

Note: You can save or read an Excel file without explicitly providing a file path (blue part) by placing the file in the current working directory. To find current directory path use below code:

# Current working directory
import os
print(os.getcwd())

# Display all files present in the current working directory
print(os.listdir(os.getcwd()))
D:PythonTutorial
Example1.xlsx

Find out how to read multiple files in a folder(directory) here.

Step 2: Enter the following code and make the necessary changes to your path to read the Excel file.

import pandas as pd

# Read the excel file
df = pd.read_excel(r'D:PythonTutorialExample1.xlsx')

df

Snapshot of Data Representation in Excel files

On the left side of the image Excel file is opened in Microsoft Excel. On the right side same Excel file is opened in Juptyter Notebook using pandas read_excel.

Import Excel file in Python Pandas Dataframe code

3. Pandas read_excel Important Parameters Examples

3.1 Import Specific Excel Sheet using Python Pandas

Import excel file multiple sheets to pandas
Example1.xlsx Sheet “Personal Info”
Import excel file specific sheet to pandas python
Example1.xlsx Sheet “Salary Info”

There may be Multiple Sheets in an Excel file. Pandas provide various methods to import one or multiple excel sheets in sheet_name parameter.

  • Default is 0: Read the 1st sheet in Excel as a DataFrame
  • Use 1: To read 2nd sheet as a DataFrame
  • Use Specific Sheet Name: "Sheet1" to load sheet with name “Sheet1”
  • Load Multiple Sheets using dict:[0, 2, "MySheet"] will load first, third and sheet named “MySheet” as a dictionary of DataFrame
  • None: Load All sheets

1. Import Excel Sheet using Integer

By default sheet_name = 0 imports the 1st sheet in Excel as a DataFrame. To import Second Excel Sheet i.e. “Salary Info” in our case as a Pandas DataFrame use sheet_name = 1

import pandas as pd

# Read "Salary Info" Sheet from Excel file (2nd Sheet)
df = pd.read_excel(r'D:PythonTutorialExample1.xlsx',sheet_name=1)

df
first_name  salary
0 Sam 120000
1 Ziva 80000
2 Kia 110000
3 Robin 150000

2. Import Specific Excel Sheet using Sheet Name

To import Specific Excel Sheet i.e. “Personal Info” as a Pandas DataFrame using sheet_name = "Personal Info"

import pandas as pd

# Read excel file sheet "Personal Info" using sheetname
df = pd.read_excel(r'D:PythonTutorialExample1.xlsx',sheet_name="Personal Info")

df
first_name degree age
0 Sam PhD 25
1 Ziva MBA 29
2 Kia NaN 19
3 Robin MS 21

3. Import Multiple Excel Sheet into Pandas DataFrame

Multiple Excel Sheets can be read into Pandas DataFrame by passing list in the sheet_name parameter e.g. [0, “Salary Info”] will load the first sheet and sheet named “Salary Info” as a dictionary of DataFrame.

import pandas as pd

# Read multiple excel file sheets as dictionary of DataFrame
df = pd.read_excel(r'D:PythonTutorialExample1.xlsx',sheet_name=[0, "Salary Info"])

df
Import excel multiple sheets pandas python dataframe

Now to store different sheets into different DataFrames use Dictionary Key Value.

import pandas as pd

# Read multiple excel file sheets as dictionary of DataFrame
df = pd.read_excel(r'D:PythonTutorialExample1.xlsx',sheet_name=[0, "Salary Info"])

# As seen in the output above Keys are 0 and "Salary_Info"
Personal_Info = df[0]
Salary_Info = df["Salary Info"]

print(Personal_Info)
print(Salary_Info)

3.2 Import only n Rows of Excel Sheet using Pandas

Sometimes Excel file is quite big or our system has memory constraints. In this case, we can import only the top n rows of Excel Sheet using Pandas read_excel nrows parameter. For example, to import only top 2 rows use nrows=2

import pandas as pd

# Load top 2 rows of Excel sheets as Pandas DataFrame
df = pd.read_excel(r'D:PythonTutorialExample1.xlsx',nrows=2)

df
first_name degree age
0 Sam PhD 25
1 Ziva MBA 29

3.3 Import specific columns of Excel Sheet

There may be hundreds of columns in excel sheet, but while importing we need only few columns. In this case, we can pass usecols parameter. Different ways to use usecols parameter are below:

  • Default is None, parse all columns.
  • If str, then provide a comma-separated list of Excel columns (“A, B, D, E”) or range of Excel columns (e.g. “A:F” or “A, B,E:F”). Ranges are inclusive of both sides.
  • If list of int, indicates list of column numbers to be parsed e.g. [0,2,5].
  • If list of string, provide list of column names to be parsed e.g. [“A, B, D, E”].
import pandas as pd

# Import 1st and 3rd columns of Execl sheet as Pandas DataFrame
df = pd.read_excel(r'D:PythonTutorialExample1.xlsx',usecols=[0,2])

df
first_name age
0 Sam 25
1 Ziva 29
2 Kia 19
3 Robin 21

4. Common Errors and Troubleshooting

Listing down the common error you can face while loading data from CSV files into Pandas dataframe will be:

  1. FileNotFoundError: File b'filename.csv' does not exist
    • Reason: File Not Found error typically occurs when there is an issue with the file path (or directory) or file name.
    • Fix: Check file path, file name, and file extension.
  2. SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated UXXXXXXXX escape
    • Reason: In U starts an eight-character Unicode escape, such as U00014321. In the code, the escape is followed by the character ‘s’, which is invalid.
    • Fix:
      • Use the prefix string with r (to produce a raw string) pd.read_excel(r'D:PythonTutorialfilename.xlsx') or,
      • You either need to duplicate all backslashes pd.read_excel(r'D:\Python\Tutorial\filename.xlsx')
  3. ImportError: Install xlrd >= 1.0.0 for Excel support.
    • Reason:  xlrd package is not available in the python environment
    • Fix:  Install xlrd package if you get the above error pip install xlrd

Conclusion

We have covered the steps needed to read an Excel file in python using pandas read_excel function.

Go to read data from csv files, and write data to CSV files using Python.

Понравилась статья? Поделить с друзьями:
  • Read excel pandas name columns
  • Read excel pandas as text
  • Read excel node js
  • Read excel in to xml
  • Read excel files with python