Basic examples¶
To illustrate the use of the package, we use a simple dataset taken from the article [1] and also used in the presentation of the Deep Ensemble model [2].
1. Toy dataset¶
This dataset consists of 20 training examples randomly generated according to the relationship
\(y = x^3 + \epsilon\), with \(\epsilon \sim \mathcal{N}(0, 3^2)\),
with \(x\) drawn uniformly in the range \([-4, 4]\). We use the following code to generate the dataset.
import numpy as np
import matplotlib.pyplot as plt
# Generate the dataset
np.random.seed(12)
N = 20
X = np.random.rand(N, 1) * 8 - 4
X_graph = np.linspace(-6, 6, 500).reshape(-1, 1)
y = X**3 + np.random.randn(N, 1) * 3
y_graph = X_graph**3
# Plot the dataset
fig = plt.figure(figsize=(12,12))
plt.plot(X_graph, y_graph, 'b', label='y = x^3')
plt.scatter(X, y, s=60, c="r", label='Data')
plt.legend(loc="lower right")
plt.show()
2. DeepEnsemble 1¶
At present, there are two ways to use the Deep Ensemble model: each model can predict both the mean and the variance, or only the mean. Here we present the first example, in which each model predicts only the mean.
Firstly, it's necessary to construct the structure of a single model. In this first case, this is easy because the model is just a regular deterministic network. We use the following code to create the model.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.initializers import RandomNormal
# Specify input and output shape
input_shape = (1,)
output_shape = 1
# Specify the initialisation of the weights
ki = RandomNormal(mean=0.0, stddev=0.4, seed=12)
# Create one model
inputs = Input(shape=input_shape, name="input")
x = Dense(100, activation="relu", name="hidden_layer", kernel_initializer=ki)(inputs)
outputs = Dense(output_shape, name="output", kernel_initializer=ki)(x)
model = Model(inputs=inputs, outputs=outputs)
Tip
The kernel initializer is used to give the weights a larger variance and thus better diversity across the models.
Now, thanks to the Purestochastic package, it's possible to convert the model directly into a Deep Ensemble model:
>>> from purestochastic.model.deep_ensemble import toDeepEnsemble
>>>
>>> nb_models = 5
>>> deep_ensemble = toDeepEnsemble(model, nb_models=nb_models)
Your network is compatible with a fully parallelizable implementation.
>>> deep_ensemble.summary()
Model: "deep_ensemble_model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input (InputLayer) [(None, 1)] 0
ensemble_hidden_layer (Dense2Dto3D) (None, 5, 100) 1000
ensemble_output (Dense3Dto3D) (None, 5, 1) 505
=================================================================
Total params: 1,505
Trainable params: 1,505
Non-trainable params: 0
_________________________________________________________________
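The parameter counts in the summary above can be checked by hand: each of the 5 ensemble members owns its own copy of the weights. A quick sanity check with plain arithmetic (independent of the package):

```python
# Each ensemble member owns a full copy of the weights.
nb_models = 5
input_dim, hidden_units, output_dim = 1, 100, 1

# ensemble_hidden_layer: 5 x (1*100 weights + 100 biases) = 1000
hidden_params = nb_models * (input_dim * hidden_units + hidden_units)
# ensemble_output: 5 x (100*1 weights + 1 bias) = 505
output_params = nb_models * (hidden_units * output_dim + output_dim)

print(hidden_params, output_params, hidden_params + output_params)  # 1000 505 1505
```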
Then, you have to compile the model and launch the training as usual with the Keras API. You can also wrap the model in a GaussianRegression object, which handles all the pre-processing and post-processing of the data.
from tensorflow.keras.optimizers import Adam
from purestochastic.model.base_uncertainty_models import GaussianRegression
# Compile the model
deep_ensemble.compile(loss="mse", optimizer=Adam(learning_rate=0.02))
# Create a GaussianRegression task
reg = GaussianRegression(deep_ensemble)
# Train the model for 40 epochs
reg.fit(X, y, epochs=40)
The predict method is called as usual, except that the output is a tuple of two values: the mean and the variance.
>>> y_mean, y_var = reg.predict(X_graph)
>>> y_mean.shape
(500, 1)
>>> y_var.shape
(500, 1)
>>> fig = plt.figure(figsize=(12,12))
>>> plt.plot(X_graph, y_mean, 'g', label='Predicted')
>>> plt.plot(X_graph, y_graph, 'b', label='Truth')
>>> plt.fill_between(X_graph.reshape(-1), y_mean.reshape(-1) - 1.96*np.sqrt(y_var.reshape(-1)), y_mean.reshape(-1) + 1.96*np.sqrt(y_var.reshape(-1)), alpha=0.2, color='g')
>>> plt.scatter(X, y, s=60, c="r", label='Data')
>>> plt.legend(loc="lower right")
>>> plt.tight_layout()
>>> plt.show()
All the code to reproduce the example is available in deep_ensemble1.py.
3. DeepEnsemble 2¶
In the previous section, we presented the first way of using the Deep Ensemble model, with a regular network. However, that model assumes the data is sampled from a homoscedastic Gaussian distribution, which is a strong assumption. To overcome this issue, the second way is to use a model that predicts both the mean and the variance of a Gaussian random variable for each data point. To build this model, we use the following code.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.initializers import RandomNormal
from purestochastic.model.layers import Dense2Dto3D
from purestochastic.model.activations import MeanVarianceActivation
# Specify input and output shape
input_shape = (1,)
output_shape = 1
# Specify the initialisation of the weights
ki = RandomNormal(mean=0.0, stddev=0.3, seed=8888)
# Create one model
inputs = Input(shape=input_shape, name="input")
x = Dense(100, activation="relu", name="hidden_layer", kernel_initializer=ki)(inputs)
outputs = Dense2Dto3D(output_shape, 2, activation=MeanVarianceActivation, name="output", kernel_initializer=ki)(x)
model = Model(inputs=inputs, outputs=outputs)
Note
We use the Dense2Dto3D layer to predict the mean and the variance of the Gaussian random variable. It's useful for structuring the output.
Tip
To enforce the positivity of the variance, we use the activation function MeanVarianceActivation() on the output layer. The resulting model is equivalent to an MVEM model.
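The exact definition of MeanVarianceActivation() is internal to the package, but a common way to obtain a strictly positive variance output, sketched here as an assumption rather than the package's actual implementation, is to pass the raw variance unit through a softplus:

```python
import numpy as np

def softplus(x):
    # numerically stable softplus: log(1 + exp(x)), always > 0
    return np.logaddexp(0.0, x)

raw = np.array([-5.0, 0.0, 5.0])  # unconstrained network outputs
variance = softplus(raw)          # strictly positive variances
```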
Then, the procedure is nearly the same as before, except that we can't use the MSE loss function. When the model outputs both the mean and the variance, a good choice is to optimize the Gaussian negative log-likelihood, defined by the package as gaussian_negative_log_likelihood().
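For reference, the per-sample Gaussian negative log-likelihood has the form sketched below; the package's gaussian_negative_log_likelihood() may differ in details such as additive constants, so treat this as an illustration of the idea rather than its exact implementation:

```python
import numpy as np

def gaussian_nll(y, mu, var):
    # -log N(y; mu, var) = 0.5*log(2*pi*var) + (y - mu)**2 / (2*var)
    return 0.5 * np.log(2 * np.pi * var) + (y - mu) ** 2 / (2 * var)

# For a fixed variance, the NLL is minimized when mu == y.
vals = gaussian_nll(1.0, np.array([0.5, 1.0, 1.5]), 1.0)
```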
from purestochastic.common.losses import gaussian_negative_log_likelihood
from purestochastic.model.activations import MeanVarianceActivation
nb_models = 5
deep_ensemble = toDeepEnsemble(model, nb_models=nb_models)
# Compile the model
deep_ensemble.compile(loss=gaussian_negative_log_likelihood, optimizer=Adam(learning_rate=0.02))
# Create a GaussianRegression task
reg = GaussianRegression(deep_ensemble)
# Train the model for 40 epochs
reg.fit(X, y, epochs=40)
Finally, when we use the predict() function, we now get a tuple of three values. Indeed, with the MeanVarianceActivation() activation function and the DeepEnsemble model, the variance can be separated into an epistemic and an aleatoric component. The prediction intervals represented in the figure below are the one computed with the aleatoric component alone and the one computed with the total variance.
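The usual ensemble decomposition behind this split takes the aleatoric part as the average of the per-model predicted variances and the epistemic part as the variance of the per-model predicted means. A minimal sketch with made-up numbers:

```python
import numpy as np

# Predictions of an ensemble of 5 models for one input:
mus = np.array([0.9, 1.1, 1.0, 0.8, 1.2])    # per-model predicted means
vars_ = np.array([0.5, 0.4, 0.6, 0.5, 0.5])  # per-model predicted variances

aleatoric = vars_.mean()       # noise in the data itself
epistemic = mus.var()          # disagreement between the models
total = aleatoric + epistemic  # total predictive variance
```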
All the code to reproduce the example is available in deep_ensemble2.py.
4. SWAG¶
For the other models, the usage is similar to DeepEnsemble. The difference for each model lies in the kwargs parameters that can be specified for the fit and predict methods. For instance, the start_averaging parameter of the fit method of the SWAGModel is important and needs to be specified to get good results:
# Train the model for 55 epochs
reg.fit(X, y, epochs=55, learning_rate=0.008, start_averaging=40)
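Conceptually, start_averaging sets the epoch at which SWAG begins collecting weight snapshots; the first and second moments of those snapshots then define a Gaussian posterior over the weights. An illustrative sketch with scalar weights (not the package's actual implementation):

```python
import numpy as np

snapshots = np.array([2.0, 1.0, 1.2, 0.8, 1.0])  # weight value at the end of each epoch
start_averaging = 1                              # snapshots before this epoch are warm-up

collected = snapshots[start_averaging:]
swa_mean = collected.mean()                          # averaged (SWA) weight
swa_var = (collected ** 2).mean() - swa_mean ** 2    # diagonal SWAG variance estimate
```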
All the code to reproduce the example is available in swag.py.
5. MultiSWAG¶
The StochasticModel class makes it possible to use metrics during the training of the model. For example, to monitor the evolution of the PICP (Prediction Interval Coverage Probability) during training, it is sufficient to add it in the call to the compile method.
>>> from purestochastic.common.metrics import PredictionIntervalCoverageProbability
>>> multi_swag.compile(loss=gaussian_negative_log_likelihood, optimizer=Adam(learning_rate=0.02), stochastic_metrics=[PredictionIntervalCoverageProbability(0.8)])
>>>
>>> reg = GaussianRegression(multi_swag)
>>> reg.fit(X, y, epochs=55, learning_rate=0.008, start_averaging=40)
Epoch 1/40
1/1 [==============================] - 3s 3s/step - loss: 2.6212 - picp: 0.9000
Epoch 2/40
1/1 [==============================] - 0s 6ms/step - loss: 1.3922 - picp: 1.0000
Epoch 3/40
1/1 [==============================] - 0s 10ms/step - loss: 1.0888 - picp: 1.0000
Epoch 4/40
1/1 [==============================] - 0s 9ms/step - loss: 1.0117 - picp: 1.0000
Epoch 5/40
1/1 [==============================] - 0s 7ms/step - loss: 1.0116 - picp: 1.0000
Epoch 6/40
1/1 [==============================] - 0s 9ms/step - loss: 1.0726 - picp: 1.0000
Epoch 7/40
1/1 [==============================] - 0s 8ms/step - loss: 1.0793 - picp: 1.0000
Epoch 8/40
1/1 [==============================] - 0s 5ms/step - loss: 1.0857 - picp: 1.0000
Epoch 9/40
1/1 [==============================] - 0s 6ms/step - loss: 1.0608 - picp: 1.0000
Epoch 10/40
1/1 [==============================] - 0s 9ms/step - loss: 1.0468 - picp: 1.0000
Epoch 11/40
1/1 [==============================] - 0s 13ms/step - loss: 1.0081 - picp: 1.0000
Epoch 12/40
1/1 [==============================] - 0s 9ms/step - loss: 0.9568 - picp: 0.9500
Epoch 13/40
1/1 [==============================] - 0s 8ms/step - loss: 0.9007 - picp: 0.9500
Epoch 14/40
1/1 [==============================] - 0s 7ms/step - loss: 0.8316 - picp: 0.9500
Epoch 15/40
1/1 [==============================] - 0s 5ms/step - loss: 0.7517 - picp: 0.9500
...
With this functionality, it is also possible to save the training history of the model, including the stochastic metrics. The values of the metrics at the end of training can be obtained with the evaluate method.
>>> reg.evaluate(X_graph, y_graph)
{'loss': -0.6353859901428223, 'picp': 0.6538169980049133}
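The PICP reported above is simply the fraction of targets that fall inside their predicted interval; for a target level of 0.8, a well-calibrated model should give a PICP close to 0.8. A hand-rolled sketch of the computation:

```python
import numpy as np

def picp(y, lower, upper):
    # fraction of targets covered by their prediction interval
    return np.mean((y >= lower) & (y <= upper))

y = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
lower = np.array([-1.0, 0.5, 1.5, 2.5, 3.0])
upper = np.array([1.0, 1.5, 2.5, 3.5, 4.0])
coverage = picp(y, lower, upper)  # 4 of the 5 targets are covered -> 0.8
```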
All the code to reproduce the example is available in multi_swag.py.
6. Orthonormal Certificates¶
The Orthonormal Certificates model consists of adding a linear layer on top of the featurizer, trained to map all the training samples to zero. It can be a good alternative because it is a single-model estimate; in other words, both training and inference time are low.
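Conceptually, the certificates form a matrix C of k linear maps trained so that C^T φ(x) is close to zero on training features; at test time, the squared norm of C^T φ(x) then acts as an uncertainty score that grows for inputs far from the training data. A minimal sketch of the scoring step, with a made-up certificate matrix and feature vector (not the package's API):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4                 # feature dimension, number of certificates
C = rng.normal(size=(d, k))  # certificate matrix (learned in practice)

def oc_score(features):
    # uncertainty score: squared norm of the certificate outputs
    return float(np.sum((C.T @ features) ** 2))

in_dist = np.zeros(d)     # stand-in for a feature the certificates map to zero
score = oc_score(in_dist)
```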
As there is one loss function for the orthonormal certificates and another for the initial model, the training display is different. When the initial model is trained, the real loss is output_initial_model_loss; loss is modified by the regularizer added to the certificates:
>>> reg.fit(X, y, epochs=40, epochs_oc=40)
Epoch 1/40
1/1 [==============================] - 2s 2s/step - loss: 1.1486 - output_initial_model_loss: 1.1389
Epoch 2/40
1/1 [==============================] - 0s 11ms/step - loss: 0.8712 - output_initial_model_loss: 0.8616
Epoch 3/40
1/1 [==============================] - 0s 6ms/step - loss: 0.2924 - output_initial_model_loss: 0.2828
Epoch 4/40
1/1 [==============================] - 0s 4ms/step - loss: 0.2771 - output_initial_model_loss: 0.2675
Epoch 5/40
1/1 [==============================] - 0s 5ms/step - loss: 0.3879 - output_initial_model_loss: 0.3783
Epoch 6/40
1/1 [==============================] - 0s 4ms/step - loss: 0.3566 - output_initial_model_loss: 0.3469
Epoch 7/40
1/1 [==============================] - 0s 3ms/step - loss: 0.2578 - output_initial_model_loss: 0.2481
Epoch 8/40
1/1 [==============================] - 0s 8ms/step - loss: 0.1755 - output_initial_model_loss: 0.1659
Epoch 9/40
1/1 [==============================] - 0s 6ms/step - loss: 0.1925 - output_initial_model_loss: 0.1829
...
When the OC model is trained, output_oc_loss represents the loss without the regularizer added to the certificates, and loss is the total loss:
>>> reg.fit(X, y, epochs=40, epochs_oc=40)
...
Epoch 40/40
1/1 [==============================] - 0s 8ms/step - loss: 0.0248 - output_initial_model_loss: 0.0152
Epoch 1/40
1/1 [==============================] - 1s 573ms/step - loss: 0.0098 - output_oc_loss: 1.3532e-04
Epoch 2/40
1/1 [==============================] - 0s 4ms/step - loss: 0.0097 - output_oc_loss: 1.0131e-04
Epoch 3/40
1/1 [==============================] - 0s 8ms/step - loss: 0.0096 - output_oc_loss: 6.5591e-05
Epoch 4/40
1/1 [==============================] - 0s 4ms/step - loss: 0.0096 - output_oc_loss: 3.8611e-05
Epoch 5/40
1/1 [==============================] - 0s 5ms/step - loss: 0.0095 - output_oc_loss: 2.3374e-05
Epoch 6/40
1/1 [==============================] - 0s 4ms/step - loss: 0.0095 - output_oc_loss: 1.8150e-05
Epoch 7/40
1/1 [==============================] - 0s 6ms/step - loss: 0.0094 - output_oc_loss: 1.9198e-05
Epoch 8/40
1/1 [==============================] - 0s 5ms/step - loss: 0.0094 - output_oc_loss: 2.2717e-05
Epoch 9/40
1/1 [==============================] - 0s 4ms/step - loss: 0.0093 - output_oc_loss: 2.5952e-05
...
All the code to reproduce the example is available in orthonormal_certificates.py.
References
- 1
José Miguel Hernández-Lobato and Ryan P. Adams. "Probabilistic backpropagation for scalable learning of Bayesian neural networks". In: 32nd International Conference on Machine Learning, ICML 2015 3 (2015), pp. 1861-1869. arXiv: 1502.05336.
- 2
Balaji Lakshminarayanan, Alexander Pritzel and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles". In: Advances in Neural Information Processing Systems 2017 (2017), pp. 6403-6414. ISSN: 10495258. arXiv: 1612.01474.