Classifying Palmer Penguins

Classifying Palmer Penguins based on quantitative and qualitative data
Author

Jake Gilbert

Published

February 22, 2024

Classifying Palmer Penguins

Abstract

By using machine learning models, we can more easily classify the species of Palmer penguins given certain physical features—both quantitative and qualitative. Based on the exploration of data visualization, training and cross validation, it was concluded that a logistic regression model that was tested through cross validation using Culmen Depth (mm), Culmen Length (mm) and Island of origin (Biscoe, Dream or Torgersen) provided a 100% accurate classifier for future testing data.

To start the process, we need to implement the training data and our packages for data visualization.

import pandas as pd
import seaborn as sns
import numpy as np


train_url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/palmer-penguins/train.csv"
train = pd.read_csv(train_url)

train.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0809 31 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N63A1 Yes 11/24/08 40.9 16.6 187.0 3200.0 FEMALE 9.08458 -24.54903 NaN
1 PAL0809 41 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N74A1 Yes 11/24/08 49.0 19.5 210.0 3950.0 MALE 9.53262 -24.66867 NaN
2 PAL0708 4 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N32A2 Yes 11/27/07 50.0 15.2 218.0 5700.0 MALE 8.25540 -25.40075 NaN
3 PAL0708 15 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N38A1 Yes 12/3/07 45.8 14.6 210.0 4200.0 FEMALE 7.79958 -25.62618 NaN
4 PAL0809 34 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N65A2 Yes 11/24/08 51.0 18.8 203.0 4100.0 MALE 9.23196 -24.17282 NaN

Exploring Visualizations

In order to accurately predict the species of penguin, we need to determine the best subset of data and features to train on. To do this, we will look at some data visualization.

First, we need to prepare the data. Given the data, we will drop some columns that we will not be able to train on and focus on either quantitative data (like culmen length/depth) or one-hot encoded columns, like sex or island of origin.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train["Species"])

def prepare_data(df):
  df = df.drop(["studyName", "Sample Number", "Individual ID", "Date Egg", "Comments", "Region"], axis = 1)
  df = df[df["Sex"] != "."]
  df = df.dropna()
  y = le.transform(df["Species"])
  df = df.drop(["Species"], axis = 1)
  df = pd.get_dummies(df)
  return df, y

X_train, y_train = prepare_data(train)

X_train
Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Delta 15 N (o/oo) Delta 13 C (o/oo) Island_Biscoe Island_Dream Island_Torgersen Stage_Adult, 1 Egg Stage Clutch Completion_No Clutch Completion_Yes Sex_FEMALE Sex_MALE
0 40.9 16.6 187.0 3200.0 9.08458 -24.54903 False True False True False True True False
1 49.0 19.5 210.0 3950.0 9.53262 -24.66867 False True False True False True False True
2 50.0 15.2 218.0 5700.0 8.25540 -25.40075 True False False True False True False True
3 45.8 14.6 210.0 4200.0 7.79958 -25.62618 True False False True False True True False
4 51.0 18.8 203.0 4100.0 9.23196 -24.17282 False True False True False True False True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
270 51.1 16.5 225.0 5250.0 8.20660 -26.36863 True False False True False True False True
271 35.9 16.6 190.0 3050.0 8.47781 -26.07821 False False True True True False True False
272 39.5 17.8 188.0 3300.0 9.66523 -25.06020 False True False True False True True False
273 36.7 19.3 193.0 3450.0 8.76651 -25.32426 False False True True False True True False
274 42.4 17.3 181.0 3600.0 9.35138 -24.68790 False True False True False True True False

256 rows × 14 columns

To choose the best possible features for our model, it’s helpful to visualize the dat.

Firstly, I want to find the one hot encoded feature which doesn’t contain any numeric values. Below I’m using a box plot to visualize the the difference in the culmen depths between islands.

sns.boxplot(data=train, x="Island", y="Culmen Depth (mm)")

FIGURE 1: In the above figure, we can see that there is enough variance between the islands to ensure this feature could be viable. This is especially apparent on Biscoe island, which may be due to a very specific population of penguin on that island.

We can also see how this changes between the different species.

sns.boxplot(data=train, x="Island", y="Culmen Depth (mm)", hue="Species")

FIGURE 2: Here we can see the reason for the large difference in Biscoe island—when separating boxes by species. Biscoe island is the only one with the Gentoo penguin, which has a significantly smaller culmen depth. This visualization also tells us that island can be a good one-hot encoded feature and an indicator of the type of penguin.

We can also look at some other numerical features by comparing them to culmen depth.

import matplotlib.pyplot as plt

ax = plt.subplots(figsize=(12, 3))
plot = sns.boxplot(train, x = "Body Mass (g)", y = "Culmen Depth (mm)", width=0.8)
plot.tick_params(axis='x', rotation=90)

FIGURE 3: As body mass increases, culmen depth increases for a certain range, but there seems to be a threshold where it drops, perhaps due to a unique body proportion of a species.

plot = sns.boxplot(data=train, x="Species", y="Body Mass (g)")
plot.tick_params(labelsize=6)

FIGURE 4: The Gentoo penguin has a larger mass, but in figure 2, it has a notably smaller culmen depth. Therefore, we see no discernable trend between body mass and culmen depth. With how variant Culmen depth is, it may be a good numeric feature.

However, we also need to see if culmen depth itself could be used as a good numeric feature. Additionally, we need a second numerical feature to ensure the quality of the training. To do this we can remove the features with low variance.

We can select an 80% variance threshold and plug our cleaned up data frame

from sklearn.feature_selection import VarianceThreshold
variance = VarianceThreshold(threshold=(.8 * (1 - .8)))

variance.fit_transform(X_train)

Choosing Our Model

We can loop through all combinations of columns of possible features we want to use. As of now, I may want to use Island as my one-hot-encoded feature and Culmen Depth as a numerical feature, but we will iterate through all feature combinations using itertools.

I will test logistic regression, random forest classifier and a decision tree. For each model, I will keep track of the best score for the given feature combination.

However, before I determine the accuracy of the decision tree, I need to find the best max_depth parameter. To do this, I will use cross validation and an exhaustive search between two possible depth values. We will test depths 5-20 for this example with a 7 fold cross validation.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score


depths = []
max_score = 0
best_depth = 0
for i in range(5,20):
    classifier = DecisionTreeClassifier(max_depth=i)
    scores = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=7, n_jobs=4)
    if scores.mean() > max_score: 
        max_score = scores.mean()
        best_depth = i
    depths.append((i,scores.mean()))


print(depths)
print(max_score)
print(best_depth)
[(5, 0.960960960960961), (6, 0.9649292149292149), (7, 0.9649292149292149), (8, 0.961068211068211), (9, 0.9649292149292149), (10, 0.965036465036465), (11, 0.961068211068211), (12, 0.9649292149292149), (13, 0.965036465036465), (14, 0.9688974688974689), (15, 0.9649292149292149), (16, 0.960960960960961), (17, 0.961068211068211), (18, 0.961068211068211), (19, 0.9688974688974689)]
0.9688974688974689
14

From the search, it seems that a max depth of 7 gives the best average score when cross validating. Now we can search through each feature combination and train our three models:

from itertools import combinations

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings('ignore')


#keep track of the best score for each
logreg_score = 0
logreg_features = []

decision_score = 0
decision_features = []

forest_score = 0
forest_features = []


# these are not actually all the columns: you'll 
# need to add any of the other ones you want to search for
all_qual_cols = ["Island", "Clutch Completion"]
all_quant_cols = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Body Mass (g)']

for qual in all_qual_cols: 
  qual_cols = [col for col in X_train.columns if qual in col ]
  for pair in combinations(all_quant_cols, 2):
    cols = qual_cols + list(pair) 
    
    #logistic regression
    LR = LogisticRegression()
    LR.fit(X_train[cols], y_train)
    temp_score = LR.score(X_train[cols], y_train)
    if temp_score > logreg_score:
      logreg_score = temp_score
      logreg_features = cols

    #RANDOM FOREST
    RF = RandomForestClassifier()
    RF.fit(X_train[cols], y_train)
    temp_score = RF.score(X_train[cols], y_train)
    if temp_score > forest_score:
      forest_score = temp_score
      forest_features = cols

    #DECISION TREE
    DT = DecisionTreeClassifier(max_depth=7)
    DT.fit(X_train[cols], y_train)
    temp_score = DT.score(X_train[cols], y_train)
    if temp_score > decision_score:
      decision_score = temp_score
      decision_features = cols
    

print("Logical regression score:", logreg_score)
print("Logical regression best feautes", logreg_features)
print()
print("Random forest score: ", forest_score)
print("Random forest best features", forest_features)
print()
print("Decision tree score: ", decision_score)
print("Decision tree best features", decision_features)
Logical regression score: 0.99609375
Logical regression best feautes ['Island_Biscoe', 'Island_Dream', 'Island_Torgersen', 'Culmen Length (mm)', 'Culmen Depth (mm)']

Random forest score:  1.0
Random forest best features ['Island_Biscoe', 'Island_Dream', 'Island_Torgersen', 'Culmen Length (mm)', 'Culmen Depth (mm)']

Decision tree score:  1.0
Decision tree best features ['Island_Biscoe', 'Island_Dream', 'Island_Torgersen', 'Culmen Length (mm)', 'Culmen Depth (mm)']

From the initial classifications, the random forest and decision trees seem to give 100% testing accuracy and all three models agree that island, culmen length and culmen depth are the best options for features. However, we need to do some cross validation to prevent over-fitting.

cv_lr = cross_val_score(LR, X_train, y_train, cv=7).mean()
cv_rf = cross_val_score(RF, X_train, y_train, cv=7).mean()
cv_dt = cross_val_score(DT, X_train, y_train, cv=7).mean()

print("logistic regression: ", cv_lr)
print("Random Forest: ", cv_rf)
print("Decision Tree: ", cv_dt)
logistic regression:  1.0
Random Forest:  0.9882024882024882
Decision Tree:  0.9649292149292149

The result of our cross validation reveals that logistic regression has a 100% accuracy. We now know we can create a reliable logistic regression model with our three features: Culmen Depth, Culmen Length and Island.

Testing

We will now test our model with a new dataset, which produces a perfect prediction for the testing data.

test_url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/palmer-penguins/test.csv"
test = pd.read_csv(test_url)

columns = ['Culmen Depth (mm)', 'Culmen Length (mm)', 'Island_Biscoe', 'Island_Dream', 'Island_Torgersen']

X_test, y_test = prepare_data(test)
LR.fit(X_train[columns], y_train)
LR.score(X_test[columns], y_test)
1.0

Plotting Decision Regions

Now that we have our accurate model and test data, we want to plot our decision regions.

from matplotlib import pyplot as plt
from matplotlib.patches import Patch

def plot_regions(model, X, y):
    
    x0 = X[X.columns[0]]
    x1 = X[X.columns[1]]
    qual_features = X.columns[2:]
    
    fig, axarr = plt.subplots(1, len(qual_features), figsize = (7, 3))

    # create a grid
    grid_x = np.linspace(x0.min(),x0.max(),501)
    grid_y = np.linspace(x1.min(),x1.max(),501)
    xx, yy = np.meshgrid(grid_x, grid_y)
    
    XX = xx.ravel()
    YY = yy.ravel()

    for i in range(len(qual_features)):
      XY = pd.DataFrame({
          X.columns[0] : XX,
          X.columns[1] : YY
      })

      for j in qual_features:
        XY[j] = 0

      XY[qual_features[i]] = 1

      p = model.predict(XY)
      p = p.reshape(xx.shape)
      
      
      # use contour plot to visualize the predictions
      axarr[i].contourf(xx, yy, p, cmap = "jet", alpha = 0.2, vmin = 0, vmax = 2)
      
      ix = X[qual_features[i]] == 1
      # plot the data
      axarr[i].scatter(x0[ix], x1[ix], c = y[ix], cmap = "jet", vmin = 0, vmax = 2)
      
      axarr[i].set(xlabel = X.columns[0], 
            ylabel  = X.columns[1], 
            title = qual_features[i])
      
      patches = []
      for color, spec in zip(["red", "green", "blue"], ["Adelie", "Chinstrap", "Gentoo"]):
        patches.append(Patch(color = color, label = spec))

      plt.legend(title = "Species", handles = patches, loc = "best")
      
      plt.tight_layout()

plot_regions(LR, X_train[columns], y_train)

Plotting our decision regions provides a good visualization of our model and the reasoning behind its decisions. For each island, there is enough distinction between our quantitative features to provide accurate classification.

Confusion Matrix

Another way to get a good indication of our model and its accuracy is to create a confusion matrix:

from sklearn.metrics import confusion_matrix

prediction = LR.predict(X_test[columns])

confusion_matrix = confusion_matrix(y_test, prediction)

print(confusion_matrix)
[[31  0  0]
 [ 0 11  0]
 [ 0  0 26]]

The result of our confusion matrix shows only values on the diagonal axis, revealing the models accuracy.

Discussion

Based on my exploration of various models and features, I concluded that a logistic regression model trained with high variance features Culmen Depth, Culmen Length and one qualitative feature, Island, produces a 100% accurate model for classifying Palmer penguins. One thing that I learned from this exploration is the importance of steps like cross validation. Although on the surface, some models— for example our decision tree—we’re more accurate, cross validation reveals they were less so due to over-fitting and the nature of the dataset. I also learned the importance of data visualization in finding correct features and discerning trends in the data. Also revealing the decision regions of the model helps me understand why certain features work. There are many steps for determining the right classification model, and many of these steps can lead to interesting and surprising discoveries.