Stochastic Gradient Langevin Dynamics

7 minute read

As students who didn't have much prior experience with ML, we began our journey towards implementing SGLD by first familiarising ourselves with the basics. As a first step, we implemented Linear and Logistic Regression models in Python from scratch, which gave us a good understanding of the Stochastic Gradient Descent (SGD) algorithm. We then parallelised and distributed SGD on the server provided to us. After doing so successfully, we moved on to Bayesian machine learning, exploring and implementing the Naive Bayes classifier and Bayesian Linear Regression. Finally, we turned to Stochastic Gradient Langevin Dynamics (SGLD): we implemented it in Python with a linear regression model, ran various tests on custom datasets, and as a final step used an SGLD optimizer with a Bayesian neural network.

What is Gradient descent?

Gradient descent is one of the most popular optimization algorithms. We first came across it while implementing Linear and Logistic Regression, where it was used to minimize the cost function of the respective model. It updates the model's parameters in the direction opposite to the gradient of the objective function with respect to those parameters, with each step's size controlled by a learning rate \(\eta\):

\(\theta = \theta - \eta \nabla_\theta J(\theta)\)

Here, \(\theta\) is the vector of parameters, and \(J(\theta)\) is the cost function to be minimized.
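
As a minimal sketch (not our full implementation), a single gradient-descent step for linear regression with a mean-squared-error cost can be written as:

import numpy as np

# Minimal sketch of one gradient-descent step for linear regression with an
# MSE cost; the function and variable names are illustrative, not our exact code.
def gradient_descent_step(theta, X, y, eta=0.01):
    n = len(y)
    predictions = X @ theta                          # current model outputs
    gradient = (2.0 / n) * X.T @ (predictions - y)   # gradient of the MSE cost w.r.t. theta
    return theta - eta * gradient                    # move against the gradient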

Source Code for our implementation

We implemented both Batch Gradient Descent (BGD) and Stochastic Gradient Descent (SGD) from scratch and obtained the following accuracies, nearly matching scikit-learn's implementation:

Accuracy of sklearns Logistic Regression = 0.7662337662337663
Accuracy of self Logistic Regression BGD = 0.7467532467532467
Accuracy of self Logistic Regression SGD = 0.7597402597402597


We also performed SGD for varying mini-batch sizes and achieved the following Loss vs Mini-Batch size plot.


For further understanding, we played around with the hyperparameters and also implemented the following variants:


  • Performing SGD for varying learning rates $\eta$ and plotting Loss vs $\eta$ to find the optimal value.
  • Performing SGD for logistic regression with a decaying (dampened) learning rate.
  • Performing SGD for logistic regression with learning-rate dampening and RMSprop (see the sketch after this list).
  • Performing SGD for logistic regression with different sampling techniques (with and without replacement).
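
A minimal sketch of the learning-rate-decay and RMSprop variants for logistic regression (the function and hyperparameter names here are illustrative, not our exact code):

import numpy as np

# Illustrative sketch: SGD for logistic regression with a decaying learning rate
# and an RMSprop-style scaling of the gradient; not our exact implementation.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_decay_rmsprop(X, y, eta0=0.1, decay=0.01, beta=0.9, eps=1e-8, epochs=10):
    n, d = X.shape
    w = np.zeros(d)
    cache = np.zeros(d)                       # running average of squared gradients
    t = 0
    for epoch in range(epochs):
        for i in np.random.permutation(n):    # sampling without replacement
            grad = (sigmoid(X[i] @ w) - y[i]) * X[i]
            cache = beta * cache + (1 - beta) * grad**2     # RMSprop accumulator
            eta = eta0 / (1.0 + decay * t)                  # dampened learning rate
            w -= eta * grad / (np.sqrt(cache) + eps)
            t += 1
    return w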

We then managed to successfully distribute our implementation across multiple GPUs. The code for this can be found here
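
As an illustration only (not the code linked above), one common way to spread SGD over the GPUs of a single server in PyTorch is nn.DataParallel, which splits each mini-batch across the available devices:

import torch
import torch.nn as nn

# Illustrative sketch: a stand-in linear model wrapped in DataParallel so that
# each mini-batch is split across all visible GPUs; our actual distributed code
# (linked above) may differ.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 1)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.MSELoss()

x = torch.randn(64, 10, device=device)   # dummy mini-batch
y = torch.randn(64, 1, device=device)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()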

Bayesian Linear Regression


From here, we moved on to Bayesian Linear Regression. We created a dummy dataset with the following specifications:

import numpy as np

def compute_function_labels(slope, intercept, noise_std_dev, data) -> np.ndarray:
    n_samples = len(data)
    if noise_std_dev == 0: # Real function
        return slope * data + intercept
    else: # Noise corrupted
        return slope * data + intercept + np.random.normal(0, noise_std_dev, n_samples)

n_datapoints = 1000
intercept = -0.7
slope = 0.9
noise_std_dev = 0.5
noise_var = noise_std_dev**2
lower_bound = -1.5
upper_bound = 1.5

features = np.random.uniform(lower_bound, upper_bound, n_datapoints)
labels = compute_function_labels(slope, intercept, 0., features)
noise_corrupted_labels = compute_function_labels(slope, intercept, noise_std_dev, features)

We then define a multivariate Gaussian prior over the parameters and plot the prior parameter distribution.
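
A minimal sketch of such a prior (the zero mean and identity covariance here are illustrative assumptions):

import numpy as np
from scipy.stats import multivariate_normal

# Zero-mean, identity-covariance Gaussian prior over (intercept, slope);
# the particular mean and covariance are illustrative assumptions.
prior_mean = np.zeros(2)
prior_cov = np.eye(2)
param_prior = multivariate_normal(prior_mean, prior_cov)

# Evaluate the prior density on a grid of parameter values for plotting
grid_x, grid_y = np.mgrid[-2:2:0.05, -2:2:0.05]
grid = np.dstack((grid_x, grid_y))
prior_density = param_prior.pdf(grid)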


We update the posterior mean and covariance in each iteration using the standard conjugate-Gaussian update formulas, which the code below implements:
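
\(\Sigma_{\text{post}} = \left(\Sigma_0^{-1} + \frac{1}{\sigma^2}\Phi^\top\Phi\right)^{-1}\)

\(\mu_{\text{post}} = \Sigma_{\text{post}}\left(\Sigma_0^{-1}\mu_0 + \frac{1}{\sigma^2}\Phi^\top y\right)\)

Here \(\Phi\) is the design matrix with a column of ones prepended to the features, \(y\) is the vector of targets, \(\sigma^2\) is the noise variance, and \((\mu_0, \Sigma_0)\) are the prior mean and covariance.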


The code for the above formulas is as follows:

import numpy as np
from scipy.stats import multivariate_normal

def update_posterior(features: np.ndarray, targets: np.ndarray, noise_var: float,
                     prior_cov: np.ndarray, prior_mean: np.ndarray):
    targets = targets[:, np.newaxis]
    prior_mean = prior_mean.reshape(-1, 1)   # ensure a column vector, shape (2, 1)

    # adding a column of ones to the feature matrix
    n_samples = len(features)
    phi_0 = np.ones(n_samples)
    features = np.stack((phi_0, features), axis=1)

    features_dot_product = features.T.dot(features)
    inv_prior_cov = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(inv_prior_cov + 1/noise_var * features_dot_product)
    # Update the mean, shape (2, 1)
    post_mean = post_cov.dot(inv_prior_cov.dot(prior_mean) + (1/noise_var) * features.T.dot(targets))
    post_mean = post_mean.flatten()          # shape (2,)

    # Update the posterior distribution
    param_posterior = multivariate_normal(post_mean, post_cov)
    return param_posterior, post_cov, post_mean

On plotting the updated parameter distributions after a few iterations, we can see that the distribution converges to the true parameter values.




We then plotted the posterior predictive distribution; the code for this can be found here.
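
A minimal sketch of how such predictive samples can be drawn (the variable names below are ours and not necessarily those of the linked code):

import numpy as np

# Illustrative sketch: draw parameter samples from the fitted posterior and use
# them to form predictive curves over the input range; not the linked code itself.
x_plot = np.linspace(lower_bound, upper_bound, 100)
phi_plot = np.stack((np.ones_like(x_plot), x_plot), axis=1)   # design matrix for the plot grid

param_samples = param_posterior.rvs(size=200)                 # 200 draws of (intercept, slope)
predictive_curves = phi_plot @ param_samples.T                # shape (100, 200)

# Predictive uncertainty combines parameter uncertainty with the observation noise
predictive_mean = predictive_curves.mean(axis=1)
predictive_std = np.sqrt(predictive_curves.var(axis=1) + noise_var)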

After understanding Bayesian Linear Regression, we implemented a Convolutional Neural Network with SGLD, and finally a Bayesian Neural Network with SGLD and MC Dropout.

Bayesian Neural Network

Creating Dummy Dataset

Assume we have collected $N$ data points $(x_n, y_n), n \le N$, where $x_n \in R$ is a one-dimensional input and $y_n \in R$ is a one-dimensional response.

We generate the data using: \(y_n = \beta_0 + \beta x_n + \epsilon_n, \quad \epsilon_n \sim N(0, \sigma^2)\)

Let us generate a dataset with $\beta_0 = -1$, $\beta = 2.0$, and $\sigma = 0.1$, as sketched below.
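
A minimal sketch of this data generation (the tensor names and the input range are our own choices for illustration):

import torch

# Generate N = 300 points from y = beta0 + beta * x + noise; N = 300 matches the
# dataset size mentioned in the SGLD section below, and the uniform input range
# [-2, 2] is an illustrative assumption.
torch.manual_seed(0)
N = 300
beta0_true, beta_true, sigma_true = -1.0, 2.0, 0.1

x = torch.rand(N) * 4 - 2
y = beta0_true + beta_true * x + sigma_true * torch.randn(N)
data = {'x': x, 'y': y}    # the dict format consumed by BayesNet.forward below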


Bayesian Model

Now, the goal in supervised learning is to find an estimate of the parameters $\theta$. In the Bayesian setting, our objective is to combine the likelihood function with a prior belief about what the parameters could be. In this example we'll use a deliberately naive prior and say that it is equally likely to find the parameters anywhere on the real line: $P(\theta) \propto 1$. This (improper, flat) prior simplifies the analysis, and we can write the posterior as proportional to the likelihood only:

\(P(\theta \mid X, Y) \propto \prod_{n=1}^N N(y_n \mid \beta_0 + \beta x_n, \sigma^2)\)

We can implement this model in PyTorch in the following way:

import torch
import torch.nn as nn
import torch.distributions as dist

class BayesNet(nn.Module):
    def __init__(self, seed=42):
        super().__init__()
        torch.random.manual_seed(seed)
        # Initialize the parameters with some random values
        self.beta0 = nn.Parameter(torch.randn((1,)))
        self.beta = nn.Parameter(torch.randn((1,)))
        self.sigma = nn.Parameter(torch.tensor([1.0]))  # this has to be positive

    def forward(self, data: dict):
        return self.beta0 + self.beta*data['x']

    def loglik(self, data):
        # Evaluates the log of the likelihood given a set of X and Y
        yhat = self.forward(data)
        logprob = dist.Normal(yhat, self.sigma.abs()).log_prob(data['y'])
        return logprob.sum()

    def logprior(self):
        # Return a scalar of 0 since we have a uniform prior
        return torch.tensor(0.0)

    def logpost(self, data):
        return self.loglik(data) + self.logprior()

SGLD Algorithm

Implementing the SGLD algorithm requires us to calculate the gradient of the log posterior (the log likelihood plus the log prior). For this implementation, note that we do not subsample the data, since we only have 300 data points anyway.
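
For reference, the SGLD update of Welling and Teh (2011), which the optimizer below follows, is

\(\Delta\theta_t = \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{i=1}^{n} \nabla \log p(y_i \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim N(0, \epsilon_t)\)

where \(\epsilon_t\) is the step size (alpha in the code) and the factor \(N/n\) reweights a mini-batch of size \(n\); since we use the full dataset, this factor is simply 1.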

import torch
from torch.optim.optimizer import Optimizer

class TrainOptimizerSGLD(Optimizer):
    def __init__(self, net, alpha=1e-4):
        super(TrainOptimizerSGLD, self).__init__(net.parameters(), {})
        self.net = net
        self.alpha = alpha

    def step(self, batch, batch_size=1, num_data=1):
        self.zero_grad()
        weight = num_data/batch_size
        loss = -self.net.loglik(batch)*weight - self.net.logprior()
        loss.backward()

        with torch.no_grad():
            for name, par in self.net.named_parameters():
                newpar = (
                    par - 0.5*self.alpha*par.grad
                    # injected noise has variance equal to the step size,
                    # i.e. standard deviation sqrt(alpha)
                    + torch.normal(torch.zeros_like(par), std=self.alpha**0.5)
                )
                par.copy_(newpar)
        return loss

    def fit(self, data=None, num_steps=1000):
        # We create one tensor per parameter so that we can keep track of the parameter values over time:
        self.parameter_trace = {key: torch.zeros((num_steps,) + par.size()) for key, par in self.net.named_parameters()}

        for s in range(num_steps):
            loss = self.step(data)
            for key, val in self.net.named_parameters():
                self.parameter_trace[key][s,] = val.data

Results

We only have 3 parameters and can visualize all of them.

We also see that the beta parameters quickly snap into place. However, the sigma parameter varies much more than anticipated. This is most likely due to the approximation error introduced by discretizing the stochastic differential equation; reducing the step size should solve this problem:


Testing on Real World Data

We used this basic dummy model as a starting point for writing a Bayesian Neural Network for real-world datasets, using an ensemble approach with MC-Dropout.
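
The Langevin_Layer used in the model below is not reproduced here; a minimal sketch of one plausible implementation, assuming it is simply a fully connected layer with dropout that is kept active at prediction time (MC Dropout), could look like this:

import torch.nn as nn

# Hypothetical sketch of a Langevin_Layer: a linear map preceded by dropout that
# stays active at prediction time for MC-Dropout uncertainty. Our actual layer
# may differ; this is only to make the model below easier to read.
class Langevin_Layer(nn.Module):
    def __init__(self, input_dim, output_dim, drop_prob=0.1):
        super(Langevin_Layer, self).__init__()
        self.dropout = nn.Dropout(p=drop_prob)
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.linear(self.dropout(x))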

class Langevin_Model_UCI(nn.Module):
    def __init__(self, input_dim, output_dim, num_units):
        super(Langevin_Model_UCI, self).__init__()
        
        self.input_dim = input_dim
        self.output_dim = output_dim
        
        # network with two hidden and one output layer
        self.layer1 = Langevin_Layer(input_dim, num_units)
        self.layer2 = Langevin_Layer(num_units, num_units)
        self.layer3 = Langevin_Layer(num_units, 2*output_dim)
        
        # activation to be used between hidden layers
        self.activation = nn.ReLU(inplace = True)
    
    def forward(self, x):
        
        x = x.view(-1, self.input_dim)
        
        x = self.layer1(x)
        x = self.activation(x)
        
        x = self.layer2(x)
        x = self.activation(x)
        
        x = self.layer3(x)
        
        return x


import copy

import numpy as np
from sklearn.model_selection import KFold

def train_mc_dropout(data, n_splits, burn_in, mix_time, num_nets, num_units, learn_rate, weight_decay, log_every):
    
    kf = KFold(n_splits=n_splits)
    in_dim = data.shape[1] - 1
    train_logliks, test_logliks = [], []
    train_rmses, test_rmses = [], []

    for j, idx in enumerate(kf.split(data)):
        print('FOLD %d:' % j)
        train_index, test_index = idx

        x_train, y_train = data[train_index, :in_dim], data[train_index, in_dim:]
        x_test, y_test = data[test_index, :in_dim], data[test_index, in_dim:]

        x_means, x_stds = x_train.mean(axis = 0), x_train.var(axis = 0)**0.5
        y_means, y_stds = y_train.mean(axis = 0), y_train.var(axis = 0)**0.5

        x_train = (x_train - x_means)/x_stds
        y_train = (y_train - y_means)/y_stds

        x_test = (x_test - x_means)/x_stds
        y_test = (y_test - y_means)/y_stds

        # note: batch_size is read from the surrounding scope (it is not an argument of this function)
        net = Langevin_Wrapper(network=Langevin_Model_UCI(input_dim=in_dim, output_dim=1, num_units=num_units),
                               learn_rate=learn_rate, batch_size=batch_size, no_batches=1, weight_decay=weight_decay)

        nets, losses = [], []
        num_epochs = burn_in + mix_time*num_nets + 1
        fit_loss_train = np.zeros(num_epochs)

        for i in range(num_epochs):
            loss = net.fit(x_train, y_train)

            if i % mix_time == 0 and i > burn_in:
                nets.append(copy.deepcopy(net.network))
                
            if i % log_every == 0 or i == num_epochs - 1:
                test_loss = net.test_loss(x_test, y_test).cpu().data.numpy()

                if len(nets) > 10: ensemble_loss, rmse = eval_ensemble(x_test, y_test, nets)
                else: ensemble_loss, rmse = float('nan'), float('nan')

                print('Epoch: %4d, Train loss: %6.3f Test loss: %6.3f Ensemble loss: %6.3f RMSE: %.3f Num. networks: %2d' %
                      (i, loss.cpu().data.numpy()/len(x_train), test_loss/len(x_test), ensemble_loss, rmse*y_stds[0], len(nets)))


        train_loss, train_rmse = eval_ensemble(x_train, y_train, nets)
        test_loss, test_rmse = eval_ensemble(x_test, y_test, nets)

        train_logliks.append(-(train_loss.cpu().data.numpy() + np.log(y_stds)[0]))
        test_logliks.append(-(test_loss.cpu().data.numpy() + np.log(y_stds)[0]))

        train_rmses.append(y_stds[0]*train_rmse.cpu().data.numpy())
        test_rmses.append(y_stds[0]*test_rmse.cpu().data.numpy())


    print('Train log. lik. = %6.3f +/- %6.3f' % (np.array(train_logliks).mean(), np.array(train_logliks).var()**0.5))
    print('Test  log. lik. = %6.3f +/- %6.3f' % (np.array(test_logliks).mean(), np.array(test_logliks).var()**0.5))
    print('Train RMSE      = %6.3f +/- %6.3f' % (np.array(train_rmses).mean(), np.array(train_rmses).var()**0.5))
    print('Test  RMSE      = %6.3f +/- %6.3f' % (np.array(test_rmses).mean(), np.array(test_rmses).var()**0.5))
    
    return nets
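
A hypothetical call to this training function (the file name and hyperparameter values below are illustrative, not the exact ones used for the results reported next):

import numpy as np

# Hypothetical usage: file name and hyperparameters are illustrative only.
data = np.loadtxt('winequality-red.csv', delimiter=';', skiprows=1)

batch_size = len(data)   # train_mc_dropout above reads batch_size from the surrounding scope
nets = train_mc_dropout(data, n_splits=10, burn_in=200, mix_time=20, num_nets=20,
                        num_units=100, learn_rate=1e-3, weight_decay=1e-4, log_every=100)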

Results

Dataset              Train RMSE   Test RMSE
Power Dataset             4.110       4.128
Red Wine Dataset          0.613       0.633

Ending Notes

We faced numerous difficulties while trying to understand the theory behind many of the models we implemented, but the constant guidance provided by Dr. Bapi Chatterjee greatly helped us overcome our theoretical gaps. We are very grateful for it.

Mayank Gupta

A B.Tech. student at DCLL IIIT Delhi.

Yash Mathne

A B.Tech. student at DCLL IIIT Delhi.

Dipankar Jiwani

A B.Tech. student at DCLL IIIT Delhi.
