Basic Plotting with Pyplot

This tutorial shows how to use matplotlib.pyplot to create data visualizations. We will use the Palmer penguins dataset, which we import below.

Data Preparation

import pandas as pd

# Load the CSV file into a pandas DataFrame
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

penguins.head()
  | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments
0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes.
1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN
2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN
3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 11/16/07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled.
4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN

We want to see how culmen length and culmen depth relate to each other, and whether that relationship differs between species. We won’t be using all of this data, so let’s extract only the columns we need. While we’re at it, we can clean up the data by dropping rows that have missing values in those columns.

penguins = penguins[["Species", "Culmen Length (mm)", "Culmen Depth (mm)"]].dropna()
penguins.head()
  | Species | Culmen Length (mm) | Culmen Depth (mm)
0 | Adelie Penguin (Pygoscelis adeliae) | 39.1 | 18.7
1 | Adelie Penguin (Pygoscelis adeliae) | 39.5 | 17.4
2 | Adelie Penguin (Pygoscelis adeliae) | 40.3 | 18.0
4 | Adelie Penguin (Pygoscelis adeliae) | 36.7 | 19.3
5 | Adelie Penguin (Pygoscelis adeliae) | 39.3 | 20.6
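
The order of the two steps above matters: selecting columns first means that dropna() only looks at missing values in the columns we kept. The sketch below illustrates this on a small made-up DataFrame (the names toy, all_columns, and selected, and all of the values, are hypothetical and are not taken from the penguins file).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "Species": ["A", "A", "B", "B"],
    "Length": [39.1, np.nan, 46.5, 50.0],
    "Depth": [18.7, 17.4, np.nan, 15.2],
    "Comments": [np.nan, "note", np.nan, np.nan],
})

# dropna() on the full frame removes any row with a missing value in ANY column,
# including columns we do not care about, such as "Comments"
all_columns = toy.dropna()

# Selecting the columns first restricts the check to those columns
selected = toy[["Species", "Length", "Depth"]].dropna()

print(len(all_columns))         # number of rows that survive the full-frame dropna()
print(selected.index.tolist())  # original row labels that survive
```

Note that dropna() keeps the original row labels rather than renumbering them, which is why the index in the output above jumps from 2 to 4.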

Plotting

Now we’re ready to make our scatterplot.

from matplotlib import pyplot as plt

fig, ax = plt.subplots(1)  # create a figure containing a single Axes to plot on

We want to plot culmen length against culmen depth, but we also need some way to differentiate between species. There are multiple ways to do this; here we use the DataFrame’s groupby() method to split the dataset by species and plot each subset in turn. Iterating over the grouped data yields each species name together with its subset, and because every call to scatter() on the same axes picks the next color in Matplotlib’s default color cycle, each species is drawn in a different color.

for name, group in penguins.groupby("Species"):
    ax.scatter(
        data=group,
        x="Culmen Length (mm)",
        y="Culmen Depth (mm)",
        label=name.split()[0],  # first word of the species name, e.g. "Adelie"
    )

fig
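
To see what the loop is iterating over, it can help to run groupby() on a tiny DataFrame and print each pair. This is only a sketch: the toy frame and its culmen values are made up for illustration, although the species strings follow the format seen in the dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "Species": [
        "Gentoo penguin (Pygoscelis papua)",
        "Adelie Penguin (Pygoscelis adeliae)",
        "Gentoo penguin (Pygoscelis papua)",
    ],
    "Culmen Length (mm)": [46.1, 39.1, 50.0],
})

seen = []
# Each iteration yields the group key (the species string) and the matching rows
for name, group in toy.groupby("Species"):
    label = name.split()[0]  # keep only the first word of the species name
    seen.append((label, len(group)))
    print(label, group.shape)
```

By default, groupby() returns the groups in sorted key order, so the order of the colors and legend entries follows the alphabetical order of the species names rather than their order in the file.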

You’ll notice that we gave scatter() a label argument set to the first word of each species name. This is for the legend, which we add below along with axis labels and a title.

ax.set(
    xlabel="Culmen Length (mm)",
    ylabel="Culmen Depth (mm)",
    title="Culmen Length vs. Culmen Depth in Penguin Species",
)
ax.legend()

fig
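
The connection between label and the legend can be checked on its own. The following is a minimal sketch using made-up points; the Agg backend is chosen here only so that the example runs without a display, and names such as legend_labels are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
from matplotlib import pyplot as plt

fig, ax = plt.subplots(1)

# Made-up points; each scatter() call takes the next color in the default cycle
first = ax.scatter([1, 2], [3, 4], label="first")
second = ax.scatter([2, 3], [1, 2], label="second")
ax.scatter([5], [5])  # no label given, so this one is left out of the legend

ax.set(xlabel="x", ylabel="y", title="Legend sketch")
legend = ax.legend()

# Collect the text of each legend entry
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```

Only the artists that were given a label appear in the legend, which is why the plotting loop passes label for every species.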

From this visualization, we can see that the species form fairly distinct clusters: each species occupies its own characteristic range of culmen length and culmen depth!