This page was generated from docs/source/examples/GroupedPipeline.ipynb.

`GroupedPipeline`: applying a transformer per category

[1]:

import numpy as np
import pandas as pd

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse, mean_absolute_error as mae

from timeserio.data.datasets import load_iris_df
from timeserio.pipeline import GroupedPipeline
from timeserio.preprocessing import PandasValueSelector

import seaborn as sns

Load the iris dataset

This dataset consists of four numeric columns and one categorical column.

[2]:

df = load_iris_df()
df.head(2)

[2]:

	sepal_length_cm	sepal_width_cm	petal_length_cm	petal_width_cm	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
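For readers without timeserio installed, a similar frame can be assembled from the iris data bundled with scikit-learn. This is only a sketch: `load_iris_as_frame` is a hypothetical helper, and the column renaming is an assumption chosen to match the column names shown above, not a description of how `load_iris_df` is implemented.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_as_frame():
    """Hypothetical helper: build an iris DataFrame with snake_case column names."""
    iris = load_iris()
    # Rename e.g. "sepal length (cm)" to "sepal_length_cm".
    columns = [
        name.replace(" (cm)", "_cm").replace(" ", "_")
        for name in iris.feature_names
    ]
    frame = pd.DataFrame(iris.data, columns=columns)
    # Map the integer class labels (0, 1, 2) to species names.
    frame["species"] = iris.target_names[iris.target]
    return frame


iris_df = load_iris_as_frame()
print(iris_df.head(2))
```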

Suppose we want to predict sepal_width_cm from sepal_length_cm. A natural first attempt is a single linear regression fitted to all rows at once, but the result is somewhat unsatisfying:

Fit a joint model

[3]:

sns.lmplot(x="sepal_length_cm", y="sepal_width_cm", data=df)

[3]:

<seaborn.axisgrid.FacetGrid at 0x7f600c6d2ba8>

[Figure: scatter plot of sepal_width_cm against sepal_length_cm with a single fitted regression line]

In sklearn, we can represent the model (in this case, a linear regression on selected features) as a pipeline:

[4]:

simple_model = Pipeline([
    ("features", PandasValueSelector("sepal_length_cm")),
    ("lr", LinearRegression()),
])
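The first pipeline step turns the DataFrame into the numeric input that the regression expects. A minimal sketch of such a selector is given below, assuming it simply returns the chosen columns as a 2-D array; `ColumnValueSelector` is a hypothetical stand-in and is not timeserio's `PandasValueSelector`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnValueSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: select DataFrame columns and return a 2-D array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        # Accept either a single column name or a list of names.
        if isinstance(self.columns, str):
            columns = [self.columns]
        else:
            columns = list(self.columns)
        return X[columns].to_numpy()


frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
values = ColumnValueSelector("a").fit_transform(frame)
print(values.shape)  # (3, 1)
```

Returning a 2-D array even for a single column matters because scikit-learn estimators expect inputs of shape (n_samples, n_features).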

[5]:

simple_model.fit(df, df["sepal_width_cm"])

[5]:

Pipeline(memory=None,
     steps=[('features', PandasValueSelector(columns='sepal_length_cm')), ('lr', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False))])

[6]:

y_pred, y = simple_model.predict(df), df["sepal_width_cm"]
print(f"MSE = {mse(y, y_pred)}, MAE = {mae(y, y_pred)}")

MSE = 0.18610437589381357, MAE = 0.33122352979559877
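Both metrics are simple averages over the residuals: MSE averages the squared residuals, MAE averages their absolute values. The sketch below reproduces them with NumPy on a few made-up numbers (the arrays are illustrative only, not iris data).

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 2.5, 4.0, 3.5])
y_hat = np.array([2.5, 3.0, 4.0, 2.5])

residuals = y_true - y_hat
mse_value = np.mean(residuals ** 2)      # mean squared error
mae_value = np.mean(np.abs(residuals))   # mean absolute error
print(f"MSE = {mse_value}, MAE = {mae_value}")  # MSE = 0.375, MAE = 0.5
```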

[7]:

lr = simple_model.named_steps['lr']
print(f"Intercept: {lr.intercept_:.2f}, slope: {lr.coef_[0]:.2f}")

Intercept: 3.42, slope: -0.06

Fit a model per category

We can improve the regression model significantly by adding the categorical feature species. We could try one-hot encoding, embeddings, etc. However, a common approach is to fit a separate regression model per category:

[8]:

sns.lmplot(x="sepal_length_cm", y="sepal_width_cm", hue="species", data=df)

[8]:

<seaborn.axisgrid.FacetGrid at 0x7f600bee2860>

[Figure: scatter plot of sepal_width_cm against sepal_length_cm, colored by species, with one fitted regression line per species]
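For comparison, the one-hot encoding alternative mentioned earlier can be sketched on toy data. With a one-hot encoded category and a single linear regression, each group gets its own intercept but all groups share one slope, which is a more restrictive model than a separate regression per category. The data and column names below are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: both groups share slope 2 but have different intercepts.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    "y": [1.0, 3.0, 5.0, 4.0, 6.0, 8.0],
})

# One-hot encode the category; drop_first avoids a redundant dummy column.
features = pd.get_dummies(
    toy[["x", "group"]], columns=["group"], drop_first=True, dtype=float
)
one_hot_model = LinearRegression().fit(features, toy["y"])

# The coefficient on x is shared by every group.
shared_slope = one_hot_model.coef_[list(features.columns).index("x")]
print(list(features.columns), round(shared_slope, 2))
```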

[9]:

grouped_model = GroupedPipeline(
    groupby="species",
    pipeline=simple_model
)

[10]:

grouped_model.fit(df, df["sepal_width_cm"])

# Specifying target by column name is also supported:
grouped_model.fit(df, "sepal_width_cm")

[10]:

GroupedPipeline(errors='raise', groupby='species',
        pipeline=Pipeline(memory=None,
     steps=[('features', PandasValueSelector(columns='sepal_length_cm')), ('lr', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False))]))
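Conceptually, fitting per category means splitting the rows by the grouping column, fitting an independent copy of the estimator on each subset, and routing each row to its group's model at prediction time. The sketch below illustrates that idea on toy data. It is not timeserio's implementation: `PerGroupRegressor` and its parameters are hypothetical names, and it assumes the DataFrame index is unique.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression


class PerGroupRegressor:
    """Hypothetical sketch: one independent copy of `estimator` per group value."""

    def __init__(self, groupby, features, estimator):
        self.groupby = groupby      # name of the grouping column
        self.features = features    # list of feature column names
        self.estimator = estimator  # unfitted template estimator

    def fit(self, df, target):
        self.models_ = {}
        for key, part in df.groupby(self.groupby):
            # clone() returns an unfitted copy, so groups never share parameters.
            model = clone(self.estimator)
            self.models_[key] = model.fit(part[self.features], part[target])
        return self

    def predict(self, df):
        # Assumes a unique index so predictions can be written back by label.
        y_pred = pd.Series(np.nan, index=df.index)
        for key, part in df.groupby(self.groupby):
            # Raises KeyError for a group that was not seen during fit.
            y_pred.loc[part.index] = self.models_[key].predict(part[self.features])
        return y_pred.to_numpy()


toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    "y": [1.0, 3.0, 5.0, 4.0, 3.0, 2.0],  # a: y = 1 + 2x, b: y = 4 - x
})
per_group = PerGroupRegressor("group", ["x"], LinearRegression()).fit(toy, "y")
print(per_group.predict(toy))
```

Because each group lies exactly on its own line in this toy data, the per-group predictions reproduce the targets, which a single pooled line could not do.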

[11]:

y_pred, y = grouped_model.predict(df), df["sepal_width_cm"]
print(f"MSE = {mse(y, y_pred)}, MAE = {mae(y, y_pred)}")

MSE = 0.07119979049915942, MAE = 0.20655162210804365

[12]:

for group, pipe in grouped_model.pipelines_.items():
    lr = pipe.named_steps['lr']
    print(f"{group}: intercept: {lr.intercept_:.2f}, slope: {lr.coef_[0]:.2f}")

setosa: intercept: -0.57, slope: 0.80
versicolor: intercept: 0.87, slope: 0.32
virginica: intercept: 1.45, slope: 0.23
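Note that every per-species slope is positive (0.80, 0.32, 0.23), while the joint model had a slightly negative slope (-0.06). This kind of sign reversal under aggregation is known as Simpson's paradox: differences between the groups' offsets can dominate the within-group trend. The sketch below reproduces the effect with NumPy on synthetic data (the numbers are made up, not iris measurements).

```python
import numpy as np

# Two synthetic groups, each with slope +1 but very different offsets.
x_a = np.array([1.0, 2.0, 3.0])
y_a = x_a + 5.0  # group A: y = x + 5
x_b = np.array([4.0, 5.0, 6.0])
y_b = x_b - 3.0  # group B: y = x - 3

# np.polyfit with deg=1 returns (slope, intercept).
slope_a, _ = np.polyfit(x_a, y_a, deg=1)
slope_b, _ = np.polyfit(x_b, y_b, deg=1)
slope_pooled, _ = np.polyfit(
    np.concatenate([x_a, x_b]), np.concatenate([y_a, y_b]), deg=1
)

print(f"group A: {slope_a:.2f}, group B: {slope_b:.2f}, pooled: {slope_pooled:.2f}")
```

Both groups trend upwards, yet the pooled fit trends downwards because group B sits at larger x and smaller y than group A.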