{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Modelling Solar generation across Multiple Sites - Part 2\n", "\n", "In [Part 1](SolarGenerationTimeSeries_Part1.ipynb), we explored the SETIS PV generation dataset and built a powerful and performant model using\n", "`timeserio`'s `MultiModel` and datetime feature generation pipelines. In this part, we instead train an auto-regressive model using more advanced batch generator features.\n", "\n", "Recall the metrics our previous model achieved on the train/test split (without any parameter tuning):\n", "\n", "| | train | test |\n", "|---|---|---|\n", "| MSE | 0.0063 | 0.0068 |\n", "| MAE | 0.0401 | 0.0424 |\n", "\n", "In this notebook, we will build a simple model that makes short-range predictions (1 to 2 hours ahead) from recent history (say, the last 6 hours)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data from parquet" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.09 s, sys: 608 ms, total: 1.7 s\n", "Wall time: 449 ms\n" ] } ], "source": [ "%%time\n", "df = pd.read_parquet(\"~/tmp/datasets/EMHIRESPV_TSh_CF_Country_19862015_tall.parquet\")" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "plot_countries = ['ES', 'UK', 'FI', ]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Split into train-test sets" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2761080, 6442800)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dev = df.iloc[:100]\n", "df_train, df_test = df[df['Year'] < 1995], df[df['Year'] >= 1995]\n",
"len(df_train), len(df_test)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Auto-regressive model\n", "\n", "In an auto-regressive model, we treat past values of the timeseries as input features to the forecasting model.\n", "While the functional form of the model is important, deep learning frameworks give us an easy way to try different approaches including CNNs, RNNs, etc.\n", "\n", "One key requirement remains, however: we must be able to supply abundant training examples, each consisting of a window of consecutive values, the target, and (optionally) the time between the end of the window and the target (the \"forecast horizon\"). A long timeseries can yield many examples simply by sampling windows at random from the original timeseries. In fact, for a realistic timeseries, pre-generating all training examples in memory is prohibitively expensive. `timeserio` provides a way to generate sequence training examples on-demand from data held in memory, or even from datasets partitioned into multiple files."
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "from timeserio.batches.chunked.pandas import SequenceForecastBatchGenerator\n", "\n", "batchgen_train = SequenceForecastBatchGenerator(\n", " df=df_train, batch_size=2**15,\n", " sequence_length=6,\n", " sequence_columns=[\"generation\", \"Time_step\"],\n", " last_step_columns=[\"Time_step\"],\n", " forecast_steps_min=1,\n", " forecast_steps_max=2,\n", " batch_offset=True,\n", " id_column=\"country\",\n", " batch_aggregator=1\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "35" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(batchgen_train)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 116 ms, sys: 4.71 ms, total: 121 ms\n", "Wall time: 119 ms\n" ] } ], "source": [ "%%time\n", "batch = batchgen_train[0]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>country</th><th>generation</th><th>Time_step</th><th colspan=\"6\">seq_generation</th><th colspan=\"6\">seq_Time_step</th><th>end_of_Time_step</th></tr>\n",
"<tr><th></th><th></th><th></th><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>HU</td><td>0.070537</td><td>9</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.030528</td><td>3</td><td>4</td><td>5</td><td>6</td><td>7</td><td>8</td><td>8</td></tr>\n",
"<tr><th>1</th><td>HU</td><td>0.084160</td><td>16</td><td>0.070537</td><td>0.078465</td><td>0.072669</td><td>0.082293</td><td>0.069379</td><td>0.062412</td><td>9</td><td>10</td><td>11</td><td>12</td><td>13</td><td>14</td><td>14</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " country generation Time_step seq_generation \\\n", " 0 1 2 3 \n", "0 HU 0.070537 9 0.000000 0.000000 0.000000 0.000000 \n", "1 HU 0.084160 16 0.070537 0.078465 0.072669 0.082293 \n", "\n", " seq_Time_step end_of_Time_step \n", " 4 5 0 1 2 3 4 5 \n", "0 0.000000 0.030528 3 4 5 6 7 8 8 \n", "1 0.069379 0.062412 9 10 11 12 13 14 14 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "batch.head(2)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 158 ms, sys: 3.54 ms, total: 162 ms\n", "Wall time: 161 ms\n" ] } ], "source": [ "%%time\n", "batch = batchgen_train[-1]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>country</th><th>generation</th><th>Time_step</th><th colspan=\"6\">seq_generation</th><th colspan=\"6\">seq_Time_step</th><th>end_of_Time_step</th></tr>\n",
"<tr><th></th><th></th><th></th><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>RS</td><td>0.066365</td><td>9</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.00000</td><td>0.000000</td><td>0.035935</td><td>3</td><td>4</td><td>5</td><td>6</td><td>7</td><td>8</td><td>8</td></tr>\n",
"<tr><th>1</th><td>RS</td><td>0.000000</td><td>16</td><td>0.066365</td><td>0.088595</td><td>0.083035</td><td>0.09524</td><td>0.090335</td><td>0.071625</td><td>9</td><td>10</td><td>11</td><td>12</td><td>13</td><td>14</td><td>14</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " country generation Time_step seq_generation \\\n", " 0 1 2 3 \n", "0 RS 0.066365 9 0.000000 0.000000 0.000000 0.00000 \n", "1 RS 0.000000 16 0.066365 0.088595 0.083035 0.09524 \n", "\n", " seq_Time_step end_of_Time_step \n", " 4 5 0 1 2 3 4 5 \n", "0 0.000000 0.035935 3 4 5 6 7 8 8 \n", "1 0.090335 0.071625 9 10 11 12 13 14 14 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "batch.head(2)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Sequence and Forecast horizon features" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [
"from timeserio.pipeline import Pipeline\n",
"from timeserio.preprocessing import PandasColumnSelector, PandasValueSelector\n",
"\n",
"class ColumnDifferenceValues:\n",
"    \"\"\"Compute difference feature of two columns\"\"\"\n",
"    def __init__(self, *, col_plus, col_minus):\n",
"        self.col_plus = col_plus\n",
"        self.col_minus = col_minus\n",
"\n",
"    def fit(self, *args, **kwargs):\n",
"        return self\n",
"\n",
"    def fit_transform(self, df, *args, **kwargs):\n",
"        return self.transform(df, *args, **kwargs)\n",
"\n",
"    def transform(self, df, *args, **kwargs):\n",
"        return (df[self.col_plus] - df[self.col_minus]).values.reshape(-1, 1)\n",
"\n",
"\n",
"seq_pipeline = PandasValueSelector(\"seq_generation\")\n",
"fc_horizon_pipeline = ColumnDifferenceValues(col_plus=\"Time_step\", col_minus=\"end_of_Time_step\")\n",
"target_pipeline = PandasValueSelector(\"generation\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Define the Neural Network Architecture\n", "\n", "We define a regression network with two inputs: the sequence of previous readings and the forecast horizon." ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from timeserio.keras.multinetwork import MultiNetworkBase\n", "\n", "from keras.layers import Input, Dense, Flatten, Concatenate, Reshape, 
Permute, Conv1D, BatchNormalization, MaxPool1D, Activation\n",
"from keras.models import Model\n",
"from keras.optimizers import Adam\n",
"from keras.callbacks import EarlyStopping, ReduceLROnPlateau\n",
"\n",
"class ARForecastingNetwork(MultiNetworkBase):\n",
"    def _model(\n",
"        self,\n",
"        *,\n",
"        seq_length=6,  # length of the input window (number of past readings)\n",
"        filters=(1, ),\n",
"        kernel_sizes=(1, ),\n",
"        strides=(1, ),\n",
"        pools=(1, ),\n",
"        hidden_units=(8, 8),\n",
"        lr=0.01\n",
"    ):\n",
"        horizon_input = Input(shape=(1,), name='horizon')\n",
"        seq_input = Input(shape=(seq_length,), name='sequence')\n",
"        encoding = Reshape(\n",
"            target_shape=(-1, 1)\n",
"        )(seq_input)\n",
"\n",
"        for idx, (_filters, _kernel_size, _strides, _pool) in enumerate(zip(filters, kernel_sizes, strides, pools)):\n",
"            encoding = Conv1D(filters=_filters, kernel_size=_kernel_size, strides=_strides, padding=\"same\", name=f\"conv_{idx}\")(encoding)\n",
"            encoding = BatchNormalization()(encoding)\n",
"            encoding = Activation(activation='relu')(encoding)\n",
"            encoding = MaxPool1D(pool_size=_pool)(encoding)\n",
"        encoding = Flatten()(encoding)\n",
"\n",
"        output = Concatenate(name='concatenate')([encoding, horizon_input])\n",
"        for idx, _hidden_units in enumerate(hidden_units):\n",
"            output = Dense(_hidden_units, activation='relu', name=f'dense_{idx}')(output)\n",
"        output = Dense(1, name='generation', activation='relu')(output)\n",
"\n",
"        encoding_model = Model(seq_input, encoding)\n",
"        forecasting_model = Model([seq_input, horizon_input], output)\n",
"\n",
"        optimizer = Adam(lr=lr)\n",
"        forecasting_model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])\n",
"\n",
"        return {'encoder': encoding_model, 'forecast': forecasting_model}\n",
"\n",
"\n",
"multinetwork = ARForecastingNetwork(seq_length=6, lr=0.001)" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from keras.utils.vis_utils import model_to_dot\n", "from IPython.display import 
SVG\n",
"\n",
"def vis_model(model, show_shapes=False, show_layer_names=True, rankdir='TB'):\n",
"    \"\"\"Visualize model in a notebook.\"\"\"\n",
"    return SVG(\n",
"        model_to_dot(\n",
"            model, show_shapes=show_shapes, show_layer_names=show_layer_names, rankdir=rankdir\n",
"        ).create(prog='dot', format='svg')\n",
"    )" ] },
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "[Figure: graph of the forecast model with output shapes. sequence: InputLayer (None, 6) -> reshape_3: Reshape (None, 6, 1) -> conv_0: Conv1D (None, 6, 1) -> batch_normalization_3: BatchNormalization (None, 6, 1) -> activation_3: Activation (None, 6, 1) -> max_pooling1d_3: MaxPooling1D (None, 6, 1) -> flatten_3: Flatten (None, 6). flatten_3 and horizon: InputLayer (None, 1) -> concatenate: Concatenate (None, 7) -> dense_0: Dense (None, 8) -> dense_1: Dense (None, 8) -> generation: Dense (None, 1)]" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vis_model(multinetwork.model[\"forecast\"], show_shapes=True, rankdir=\"LR\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Connect feature pipelines to the neural network" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "from timeserio.pipeline import MultiPipeline" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "multipipeline = MultiPipeline({\n", " \"sequence\": seq_pipeline,\n", " \"horizon\": fc_horizon_pipeline,\n", " \"target\": target_pipeline\n", "})" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from timeserio.multimodel import MultiModel\n", "\n", "manifold = {\n", " # keras_model_name: (input_pipes, output_pipes)\n", " \"encoder\": 
(\"sequence\", None),\n", " \"forecast\": ([\"sequence\", \"horizon\"], \"target\")\n", "}\n", "\n", "multimodel = MultiModel(\n", " multinetwork=multinetwork,\n", " multipipeline=multipipeline,\n", " manifold=manifold\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fit model from the batch generator\n", "\n", "`multimodel.fit_generator()` will apply pipelines correctly to the training batch generator, and, if `validation_data` is provided in the form of another (pandas) batch generator,\n", "evaluate the relevant metrics. In addition, feature extraction for each batch will benefit from the `workers` parallelism." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from kerashistoryplot.callbacks import PlotHistory\n", "plot_callback = PlotHistory(figsize=(15, 3), n_cols=3, batches=False)\n", "\n", "multimodel.fit_generator(\n", " batchgen_train, model=\"forecast\", verbose=1, epochs=50,\n", " reset_weights=True,\n", " workers=4,\n", " callbacks=[plot_callback]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "persist the model:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from timeserio.utils.pickle import loadf, dumpf\n", "dumpf(multimodel, \"/tmp/PV_model_2.pickle\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate performance on test data\n", "We can evaluate the model on the validation data generator, which can also be out-of-memory:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "batchgen_test = SequenceForecastBatchGenerator(\n", " df=df_test, batch_size=2**15,\n", " sequence_length=6,\n", " sequence_columns=[\"generation\", \"Time_step\"],\n", " last_step_columns=[\"Time_step\"],\n", " forecast_steps_min=1,\n", " forecast_steps_max=2,\n", " batch_offset=False,\n", " id_column=\"country\",\n", " batch_aggregator=1\n", ")" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "35/35 [==============================] - 9s 254ms/step\n" ] }, { "data": { "text/plain": [ "[0.008467954644999866, 0.051904945554477826]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "multimodel.evaluate_generator(batchgen_test, model=\"forecast\", verbose=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While the model takes longer to train (and longer 
still with practical encoder architectures), it can be tuned to achieve higher performance, especially if encodings are combined with datetime features." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }