Introduction to pandas#
pandas is a data science package orignally developed by Wes McKinney. It builds on top of numpy to provide a higher level API for wrangling, analysing and visualising data. It is also closely coupled to matplotlib with a number of shorthand methods to create plots of data.
In practice, I often use both numpy and pandas for data wrangling: switching between them (which is straightforward) when needed as each offers different benefits.
For your Machine Learning and Data Science career you will find pandas very useful indeed. It is very easy to import data from a variety of sources e.g. CSV or a database. (It also works with Excel, but I’d encourage you to not use Excel formatted files - save it as a CSV instead).
Read the full docs here: https://pandas.pydata.org/
The higher level API offered by
pandascomes at the cost of efficency i.e. execution speed. This statement is relative tonumpywhich approaches the speed ofC. I emphasise thatpandasis still fast especially compared Microsoft Excel. However, in generalpandasis not recommended for implementing high performance computational algorithms over and abovenumpy.
Imports#
We usually use the alias pd when import pandas
import numpy as np
import pandas as pd
print(pd.__version__)
print(np.__version__)
1.2.3
1.19.2
The pd.Series and pd.DataFrame classes#
The central contribution of the pandas package to data science in python are the Series and DataFrame class. These provide the high level abstraction of data.
pd.Series#
If you are familiar with analysing tabular data in a software package like stata, spss or Excel then you can think of a series as a column of data.
The example below:
Create a variable called
column1of typepd.Seriescolumn1has 10 rows (a sequence of 1 to 10).The data in column 1 has a data type of 8-bit unsigned integer (0 - 255)
# create a series
column1 = pd.Series(np.arange(1, 10), name='unique_id',
dtype=np.uint8)
column1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
Name: unique_id, dtype: uint8
# data type
type(column1)
pandas.core.series.Series
# shape
column1.shape
(9,)
Another way to do this would have been to ignore the data type and name parameters. Notice now that pandas has defaulted to int64 for the data.
Depending on your application, this may or may not be a big deal. But note that it uses more memory.
int64is equivalent to Python’sintand C’slongtype:-9_223_372_036_854_775_808to9_223_372_036_854_775_807.
# create a series just passing data.
column2 = pd.Series(np.arange(1, 10))
column2
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
dtype: int64
In a data science application its likely you will have a very large column of data. You can take a look at the head or tail of the Series by using the .head() and .tail() methods
# 10,000 rows. uint16 = 0 - 65_535
column3 = pd.Series(np.arange(1, 10_000), name='longer',
dtype=np.uint16)
column3.head()
0 1
1 2
2 3
3 4
4 5
Name: longer, dtype: uint16
# To view more or less rows
column3.head(2)
0 1
1 2
Name: longer, dtype: uint16
column3.tail()
9994 9995
9995 9996
9996 9997
9997 9998
9998 9999
Name: longer, dtype: uint16
# side note: to drop to numpy
column3.to_numpy()
array([ 1, 2, 3, ..., 9997, 9998, 9999], dtype=uint16)
### pd.DataFrame
For most machine learning applications you will be working with a full DataFrame.
The data you will use is likely to be imported from an external data source, such as a Comma Seperated Value (CSV) file or large scale database such as PostgreSQL. But while you build familiarity with DataFrames we will look at building them manually.
Building
DataFramesmanually is quite handy for practice, as I often use dataframes to summarise the results of computational procedures and simulations.DataFramescan also generate LaTeX which is handy for quickly producing tables of results for a report/paper I am writing in LaTeX.
# create 5 x 4 matrix
raw_data = np.arange(20, dtype=np.uint8).reshape(5, -1)
raw_data.shape
(5, 4)
raw_data
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19]], dtype=uint8)
df = pd.DataFrame(raw_data)
df.columns = (['col_' + str(i) for i in range(df.shape[1])])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col_0 5 non-null uint8
1 col_1 5 non-null uint8
2 col_2 5 non-null uint8
3 col_3 5 non-null uint8
dtypes: uint8(4)
memory usage: 148.0 bytes
# its a small matrix so lets view it all
df
| col_0 | col_1 | col_2 | col_3 | |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |
| 3 | 12 | 13 | 14 | 15 |
| 4 | 16 | 17 | 18 | 19 |
#for bigger `DataFrames` use .head()/.tail()
df.head(2)
| col_0 | col_1 | col_2 | col_3 | |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
# access a named column
df['col_3']
0 3
1 7
2 11
3 15
4 19
Name: col_3, dtype: uint8
# alternative approach for accessing
df.col_3
0 3
1 7
2 11
3 15
4 19
Name: col_3, dtype: uint8
# side bar 1 - let's ignore the datatype.
# the size in memory is almost double.
raw_data = np.arange(20).reshape(5, -1)
df = pd.DataFrame(raw_data)
df.columns = (['col_' + str(i) for i in range(df.shape[1])])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col_0 5 non-null int64
1 col_1 5 non-null int64
2 col_2 5 non-null int64
3 col_3 5 non-null int64
dtypes: int64(4)
memory usage: 288.0 bytes
# side bar 2: drop to numpy
df.to_numpy()
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19]])
Creating a pd.DataFrame from python lists#
In the previous example we created a DataFrame from a numpy.ndarray. But a DataFrame or Seriescan be created from anything array-like. So for example, we could work with one or more python lists.
# creating an individual series
thrash_metal_bands = pd.Series(['pantera', 'metallica', 'megadeth',
'anthrax'])
thrash_metal_bands
0 pantera
1 metallica
2 megadeth
3 anthrax
dtype: object
# create a full data frame
# each column is defined as a seperate list
band_name = ['pantera', 'metallica', 'megadeth', 'anthrax']
n_albums = [9, 10, 15, 11]
formed = [1981, 1981, 1983, 1981]
still_active = [0, 1, 1, 1]
# empty dataframe
thrash_metal_bands = pd.DataFrame()
# create new columns from lists
thrash_metal_bands['band'] = band_name
thrash_metal_bands['n_albums'] = n_albums
thrash_metal_bands['yr_formed'] = formed
thrash_metal_bands['active'] = still_active
thrash_metal_bands
| band | n_albums | yr_formed | active | |
|---|---|---|---|---|
| 0 | pantera | 9 | 1981 | 0 |
| 1 | metallica | 10 | 1981 | 1 |
| 2 | megadeth | 15 | 1983 | 1 |
| 3 | anthrax | 11 | 1981 | 1 |
# take a look at the df summary information.
thrash_metal_bands.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 band 4 non-null object
1 n_albums 4 non-null int64
2 yr_formed 4 non-null int64
3 active 4 non-null int64
dtypes: int64(3), object(1)
memory usage: 256.0+ bytes
# could also be specific about datatype using pd.Series
thrash_metal_bands = pd.DataFrame()
thrash_metal_bands['band'] = pd.Series(band_name, dtype=str)
thrash_metal_bands['n_albums'] = pd.Series(n_albums, dtype=np.uint8)
thrash_metal_bands['yr_formed'] = pd.Series(formed, dtype=np.uint16)
thrash_metal_bands['active'] = pd.Series(still_active, dtype=bool)
thrash_metal_bands.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 band 4 non-null object
1 n_albums 4 non-null uint8
2 yr_formed 4 non-null uint16
3 active 4 non-null bool
dtypes: bool(1), object(1), uint16(1), uint8(1)
memory usage: 176.0+ bytes
Note that in practice its also useful to know that you can create a DataFrame from a dict. I often forget exactly what format my data need to be in, but luckily you can call the .to_dict() method of a DataFrame to see what is required.
This functionality has proved useful in practice as it is often useful to use a simple
dictto track results of an algorithm. At the end of the experiment the dict can quickly be cast to aDataFrameand easily viewed in a notebook.
bands_dict = thrash_metal_bands.to_dict()
bands_dict
{'band': {0: 'pantera', 1: 'metallica', 2: 'megadeth', 3: 'anthrax'},
'n_albums': {0: 9, 1: 10, 2: 15, 3: 11},
'yr_formed': {0: 1981, 1: 1981, 2: 1983, 3: 1981},
'active': {0: False, 1: True, 2: True, 3: True}}
# the code to use a dict looks like pandas code.
bands_dict['band']
{0: 'pantera', 1: 'metallica', 2: 'megadeth', 3: 'anthrax'}
# reverse engineer
new_df = pd.DataFrame(bands_dict)
new_df
| band | n_albums | yr_formed | active | |
|---|---|---|---|---|
| 0 | pantera | 9 | 1981 | False |
| 1 | metallica | 10 | 1981 | True |
| 2 | megadeth | 15 | 1983 | True |
| 3 | anthrax | 11 | 1981 | True |
DataFrame Indexes#
In each of the examples so far you will notice that the first (unlabelled) a column is a sequence of arbitrary numbers. This is the DataFrame index. When we create a DataFrame manually an additional step is often to set the index to a column we have created.
thrash_metal_bands.index
RangeIndex(start=0, stop=4, step=1)
# set the index to 'band' column
# note that this method returns a 'copy' unless we set 'inplace=True'
thrash_metal_bands.set_index('band', inplace=True)
thrash_metal_bands
| n_albums | yr_formed | active | |
|---|---|---|---|
| band | |||
| pantera | 9 | 1981 | False |
| metallica | 10 | 1981 | True |
| megadeth | 15 | 1983 | True |
| anthrax | 11 | 1981 | True |
Accessing elements in a DataFrame#
In nearly all data wrangling and analysis projects, you will want to explore subsets of your data. This might be, for example, to eyeball the data close up or to calculate summary statistics for particular populations. To do that in pandas we need to understand how we access individual and subsets or rows and columns in our DataFrame.
For simplicity, we will again use thrash_metal_bands.
Accessing Rows#
To access an individal row we can use the .loc and .iloc accessors. The former of these references by name while the latter references by numeric index.
# return the records for panteria
thrash_metal_bands.loc['pantera']
n_albums 9
yr_formed 1981
active False
Name: pantera, dtype: object
# return the records the row at index 2
thrash_metal_bands.iloc[2]
n_albums 15
yr_formed 1983
active True
Name: megadeth, dtype: object
Note that we can return multiple rows if we supply a list of indexes. For example to find records for both pantera and megadeth:
to_find = ['pantera', 'megadeth']
thrash_metal_bands.loc[to_find]
| n_albums | yr_formed | active | |
|---|---|---|---|
| band | |||
| pantera | 9 | 1981 | False |
| megadeth | 15 | 1983 | True |
Indexes can be sliced in a similar fashion to arrays
# All rows from index 2 onwards
thrash_metal_bands.iloc[2:]
| n_albums | yr_formed | active | |
|---|---|---|---|
| band | |||
| megadeth | 15 | 1983 | True |
| anthrax | 11 | 1981 | True |
# Rows 0 and 1
thrash_metal_bands.iloc[:2]
| n_albums | yr_formed | active | |
|---|---|---|---|
| band | |||
| pantera | 9 | 1981 | False |
| metallica | 10 | 1981 | True |
# slicing by name
thrash_metal_bands.loc['pantera': 'megadeth']
| n_albums | yr_formed | active | |
|---|---|---|---|
| band | |||
| pantera | 9 | 1981 | False |
| metallica | 10 | 1981 | True |
| megadeth | 15 | 1983 | True |
Accessing columns#
We have already seen that accessing a column is done as so
df['column_name']
We can also select multiple columns.
# select yr_formed and active columns only
mask = ['yr_formed', 'active']
thrash_metal_bands[mask]
| yr_formed | active | |
|---|---|---|
| band | ||
| pantera | 1981 | False |
| metallica | 1981 | True |
| megadeth | 1983 | True |
| anthrax | 1981 | True |
or restrict both columns and rows
columns = ['yr_formed', 'active']
rows = ['pantera', 'anthrax']
thrash_metal_bands.loc[rows, mask]
| yr_formed | active | |
|---|---|---|
| band | ||
| pantera | 1981 | False |
| anthrax | 1981 | True |
Selecting individual cells#
To access an individual cell within a DataFrame use the .at and .iat accessors. The former uses the index row and column names while the latter is an integer based lookup.
# lookup the year pantera was formed.
thrash_metal_bands.at['pantera', 'yr_formed']
1981
# look up the cell value in position (1, 1)
thrash_metal_bands.iat[1, 1]
1981
# to update the individual value
# lookup the year pantera was formed.
thrash_metal_bands.at['pantera', 'yr_formed'] = 9999
thrash_metal_bands
| n_albums | yr_formed | active | |
|---|---|---|---|
| band | |||
| pantera | 9 | 9999 | False |
| metallica | 10 | 1981 | True |
| megadeth | 15 | 1983 | True |
| anthrax | 11 | 1981 | True |