Kai Xu and Asa Cooper Stickland
Tue 30 Apr 2019, 11:00 - 12:00
IF 4.31/4.33

If you have a question about this talk, please contact: Gareth Beedham (gbeedham)

Kai Xu

Title: Variational Russian Roulette for Deep Bayesian Nonparametrics

Abstract: Bayesian nonparametric models provide a principled way to automatically adapt the complexity of a model to the amount of data available, but computation in such models is difficult. Amortized variational approximations are appealing because of their computational efficiency, but current methods rely on a fixed finite truncation of the infinite model. This truncation level can be difficult to set, and it also interacts poorly with amortized methods due to the over-pruning problem. Instead, we propose a new variational approximation, based on a method from statistical physics called Russian roulette sampling. This allows the variational distribution to adapt its complexity during inference, without relying on a fixed truncation level, while still obtaining an unbiased estimate of the gradient of the original variational objective. We demonstrate this method on infinite-sized variational auto-encoders using a Beta-Bernoulli (Indian buffet process) prior.
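The core trick behind Russian roulette sampling is to estimate an infinite sum without truncating it at a fixed level: randomly stop after finitely many terms, and reweight each computed term by the inverse probability of having reached it, so the estimator stays unbiased. The sketch below (illustrative only; the talk's method applies this to variational gradients, not a scalar series, and the geometric stopping rule is an assumption) demonstrates the idea on a series with a known sum:

```python
import numpy as np

def russian_roulette_estimate(term, q=0.6, rng=None):
    """Unbiased estimate of sum_{k=1}^inf term(k).

    At each step we continue ("survive") with probability q; term k
    is reweighted by 1 / P(reaching step k) = q^{-(k-1)}, which makes
    the expectation equal to the full infinite sum.
    """
    rng = rng or np.random.default_rng()
    total, k, survive_prob = 0.0, 1, 1.0
    while True:
        total += term(k) / survive_prob
        if rng.random() >= q:        # stop with probability 1 - q
            return total
        survive_prob *= q            # P(reaching the next term)
        k += 1

rng = np.random.default_rng(0)
# term(k) = 0.5**k sums to exactly 1, so the mean of many
# estimates should be close to 1 despite each run being finite
est = np.mean([russian_roulette_estimate(lambda k: 0.5 ** k, rng=rng)
               for _ in range(100_000)])
```

Each individual estimate uses only a random, finite number of terms, which is what lets the variational distribution grow or shrink its effective complexity during inference.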


Asa Cooper Stickland

Title: BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

Abstract: There is a huge amount of text data available on the web (at least for some languages). An idea like 'Word2Vec' leverages this data to learn vector representations of single words. Much recent work concentrates on contextual word embeddings, learning vector representations of words given their context. The state of the art in this area is dominated by large (>110M parameters) models based on the 'transformer' architecture. One of these models, BERT (Bidirectional Encoder Representations from Transformers), when combined with a simple feedforward output layer, can be fine-tuned on downstream tasks with impressive, often state-of-the-art, performance.

However, storing many large models can have costs (say, for use on mobile devices). In this talk I will describe our recent paper 'BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning', where we explore the multi-task learning setting for the BERT model, and how best to add task-specific parameters to a pre-trained BERT network with a high degree of parameter sharing between tasks. We introduce new adaptation modules, PALs or 'projected attention layers', which use a low-dimensional multi-head attention mechanism, based on the idea that it is important to include layers with inductive biases useful for the input domain.
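To make the "low-dimensional multi-head attention" idea concrete, here is a minimal NumPy sketch of a projected attention layer: hidden states are projected down to a small dimension, multi-head self-attention runs in that small space, and the result is projected back up and added residually. All dimensions, initialisation scales, and the class name are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class PAL:
    """Sketch of a projected attention layer (PAL).

    Project to a small dimension, run multi-head self-attention
    there, project back, and add to the input. The down/up
    projections could be shared across tasks to save parameters,
    while the small attention weights stay task-specific.
    """
    def __init__(self, d_model=768, d_small=204, n_heads=12, rng=None):
        rng = rng or np.random.default_rng(0)
        s = 0.02                              # illustrative init scale
        self.n_heads = n_heads
        self.d_head = d_small // n_heads
        self.V_down = rng.normal(0, s, (d_model, d_small))
        self.V_up = rng.normal(0, s, (d_small, d_model))
        self.Wq = rng.normal(0, s, (d_small, d_small))
        self.Wk = rng.normal(0, s, (d_small, d_small))
        self.Wv = rng.normal(0, s, (d_small, d_small))

    def __call__(self, h):                    # h: (seq_len, d_model)
        x = h @ self.V_down                   # (seq_len, d_small)
        T, d = x.shape
        def split(z):                         # -> (n_heads, T, d_head)
            return z.reshape(T, self.n_heads, self.d_head).transpose(1, 0, 2)
        q, k, v = split(x @ self.Wq), split(x @ self.Wk), split(x @ self.Wv)
        att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(self.d_head))
        out = (att @ v).transpose(1, 0, 2).reshape(T, d)
        return h + out @ self.V_up            # residual add to the layer
```

Because the attention operates in the small dimension, each task-specific module adds only a small fraction of the parameters of a full BERT layer.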