Berlin Machine Learning Seminar

About

Hide

Seminars are generally of a theoretical nature and concern machine learning. It is assumed that seminar members are interested in, and not afraid of, the relevant mathematics.

The idea is to use the seminar as a vehicle for learning — particularly for the speaker. Speakers volunteer to give talks on subjects that they want to force themselves to learn about. As such, seminars can be fundamental or specialised, beginner or less so, depending on what the speaker wants to learn. We are all learners and the atmosphere of the seminar reflects this.

We also host talks from experts in their field.

The seminar is run by Matthias Leimeister and Marcel Ackermann (previously, until October 2019, by Benjamin Wilson). We meet in the offices of Lateral GmbH, in Schöneberg, Berlin.

Please get in touch if you’d like to come along.

Which U-Net is the best? Tackling a Challenge in Brain Lesion Segmentation
Sebastian Hitzinger
Wednesday 3rd November 2021 more less

Biomedical image segmentation is a challenging field, due to large 3D datasets with highly imbalanced data and limited availability of annotated training data. Variations of the fully convolutional U-Net are a common and promising choice for this type of tasks. This is also reflected in the results of the recent challenge on MS new lesion segmentation (MSSEG-2). However, the questions of the optimal hyperparameter setting remains.

In this talk, we will analyze the methods with highest ranking on the official F1-score leaderboard of the challenge, including our own submission. We will describe the particular design choices and make out factors of success.

Resources:

U-Net: https://arxiv.org/abs/1505.04597
MSSEG-2 challenge website: https://portal.fli-iam.irisa.fr/msseg-2/

Neural networks for Optical Ray Tracing Simulations
Aditya Seshaditya
Wednesday 8th September 2021 more less

Recently several Neural Networks have also been developed for assisting several kinds of simulations such as optical, thermo-mechanical, and so on.
For the development of Optical Lasers system for High Flash LIDARs, Ray tracing simulations have been performed. Along with simulations Neural networks have also been trained to assist the simulations.

In this Talk we would like to cover the main topics related to the above developments, namely

1. Ray Tracing simulations and Point spread Functions (PSFs) parameters estimation using Convolutional Neural Networks (CNNs)
2. Inverse or Invertible Neural network framework: mapping from the PSF parameters space to simulation images
3. Introduction to Physics Inspired Neural Networks (PINNs) for these simulations

Past, present, and future of computational notebooks
Heiki Schmidle
Wednesday 2nd June 2021 more less

In the last few years the usage of computational notebooks in all parts of data related work is dramatically increasing.

Academia, machine-learning, data-scientist, analysts, business intelligence are using more and more notebooks as their tool of choice in many day-to-day tasks.

When did this computational paradigm start and what were the steps to the current landscape?

I will bring some light in the past 30 years of human-computer-interaction.

How are notebooks currently used and what are the biggest pain-points according to recent research studies?

I will try to show some design opportunities that might help to solve some of the problems and point us to where computational notebooks are developing in the future.

Uncertainty quantification for image recognition with deep neural networks
Hanno Gottschalk
Wednesday 17th March 2021 more less

In recent years, deep learning methods have outperformed other methods in image recognition. This has fostered imagination of potential application of deep learning technology including safety relevant applications like the interpretation of medical images or autonomous driving. The passage from assistance of a human decision maker to ever more automated systems however increases the need to properly handle the failure modes of deep learning modules. In this contribution, we review a set of techniques for the self-monitoring of machine-learning algorithms based on uncertainty quantification. In particular, we apply this to the task of semantic segmentation, where the machine learning algorithm decomposes an image according to semantic categories. We discuss false positive and false negative error modes at instance-level and review techniques for the detection of such errors that have been recently proposed by the authors. We also give an outlook on future research directions, including the recognition out of distribution objects and the retrieval of semantic classes outside the semantic space of the training data.

Links to papers:

https://arxiv.org/abs/2012.06575
https://arxiv.org/abs/2012.07504
https://arxiv.org/abs/1911.05075
https://arxiv.org/abs/1912.07420
https://arxiv.org/abs/1811.00648

Links to videos:
https://www.youtube.com/watch?v=6SoGmsAarTI
https://www.youtube.com/watch?v=TQaV5ONCV-Y&t=16s

Attacking Privacy in Neural Networks
Franziska Boenisch
Wednesday 3rd March 2021 more less

Neural networks are increasingly being applied in sensitive domains and on private data. For a long time, no thought was given to what this means for the privacy of the data used for their training. Only in recent years has there emerged an awareness that the process of converting training data into a model is not irreversible as previously thought. Since then, several specific attacks against privacy in neural networks have been developed. Of these, we will discuss two specific ones, namely membership inference and model inversion attacks. First, we will focus on how they retrieve potentially sensitive information from trained models. Then, we will look into several factors that influence the success of both attacks. At the end, we will discuss Differential Privacy as a possible protection measure.

Bio: Franziska has completed a Master’s degree in Computer Science at Freie University Berlin and Technical University Eindhoven. For the past 1,5 years, she has been working at Fraunhofer AISEC as a Research Associate in topics related to Privacy Preserving Machine Learning, Data Protection, and Intellectual Property Protection for Neural Networks. Additionally, she is currently doing her PhD in Berlin.

Towards reliable deep learning
Florian Wenzel
Wednesday 17th February 2021 more less

Deep learning models are bad at detecting their failure. They tend to make over-confident mistakes, especially, under distribution shift. Making deep learning more reliable is important in safety-critical applications including health care, self-driving cars, and recommender systems.
We discuss two approaches to reliable deep learning. First, we will focus on Bayesian neural networks that come with many promises to improved uncertainty estimation. However, why are they rarely used in industrial practice? In this talk, we will cast doubt on the current understanding of Bayes posteriors in deep networks. We show that Bayesian neural networks can be improved significantly through the use of a "cold posterior" that overcounts evidence and hence sharply deviates from the Bayesian paradigm. We will discuss several hypotheses that could explain cold posteriors.
In the second part, we will discuss a classical approach to more robust predictions: ensembles. Deep ensembles combine the predictions of models trained from different initializations. We will show that the diversity of predictions can be improved by considering models with different hyperparameters. Finally, we present an efficient method that leverages hyperparameter diversity within a single model.

Paper:

Florian Wenzel et. al.: How Good is the Bayes Posterior in Deep Neural Networks Really? ICML 2020, PDF

Intro to Causal Inference and Applications
Lars Roemheld
Wednesday 3rd February 2021 more less

This talk aims to give an introduction to causal inference as a new “old” problem being enriched by machine learning. I will try to cover a lot of ground, from Causality as a framework to think about generalization error and data drift, to learning causal effects from observational data, experiments, and causal structure learning. Using some concrete use cases inspired by my own experience in e-commerce, we will discuss modern ideas such as Orthogonal ML and heterogeneous treatment effects. This talk is intended to be approachable for a wide audience.

Benefits of Assistance over Reward Learning
Dima Krasheninnikov
Wednesday 13th January 2021 more less

Much recent work has focused on how an agent can learn what to do from human feedback, leading to two major paradigms. The first paradigm is reward learning, in which the agent learns a reward model through human feedback that is provided externally from the environment. The second is assistance, in which the human is modeled as a part of the environment, and the true reward function is modeled as a latent variable in the environment that the agent may make inferences about. The key difference between the two paradigms is that in the reward learning paradigm, by construction there is a separation between reward learning and control using the learned reward. In contrast, in assistance these functions are performed as needed by a single policy. By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents based on reward learning. We illustrate these advantages in simple environments by showing desirable qualitative behaviors of assistive agents that cannot be found by agents based on reward learning.

Paper:

Rohin Shah et. al.: Benefits of Assistance over Reward Learning, NeurIPS 2020 Cooperative AI Workshop, PDF

Progressive Layered Extraction for Multi Task Learning
Vaibhav Singh
Wednesday 2nd December 2020 more less

In this talk, I will explain the paper Progressive Layered
Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for
Personalized Recommendations which won the Best Long Paper award in
RecSys 2020.

MTL relates to a challenge where we want to learn different tasks eg.
predicting cat breeds and dog breeds using a single DNN. The basic
idea in this type of DNN is to share lower level features across tasks
such that the model learns some general features. However, this DNN
would not work for a task that is unrelated and does not share the
same features, and in this context such as predicting a car model.

Related to this in the field of Recommender Systems we would like to
learn different tasks such as the likelihood of clicking, finishing,
sharing, favoriting, commenting, etc. Some of these tasks are often
loosely correlated or conflicted which may lead to negative transfer.
Scenarios appear where MTL models improve certain tasks by degrading
the performance of other tasks (seesaw phenomenon). Some related works
like cross-stitch networks, sluice networks, multi-gate mixture of
experts address this problem in various ways.

This idea behind this paper is to explicitly separate shared and
task-specific experts to avoid harmful parameter interference. On top
of this multi-level experts and gating networks are introduced to fuse
more abstract representations. Finally, it adopts a novel progressive
separation routing to model interactions between experts and achieve
more efficient knowledge transferring between complicatedly correlated
tasks.

Speaker info: Vaibhav Singh currently heads machine learning work in
areas of Fraud Detection, App Personalization and Consumer Growth
within Klarna.

Langevin Cooling for Domain Translation
Vignesh Srinivasan
Wednesday 4th November 2020 more less

Domain translation is the task of finding correspondence between two domains. Several Deep Neural Network (DNN) models, e.g., CycleGAN and cross-lingual language models, have shown remarkable successes on this task under the unsupervised setting - the mappings between the domainsare learned from two independent sets of training data in both domains (without paired samples). However, those methods typically do not perform well on a significant proportion of test samples. In this paper, we hypothesize that many of such unsuccessful samples lie at the fringe - relatively low-density areas - of data distribution, where the DNN was not trained very well, and propose to perform Langevin dynamics to bring such fringe samples towards high density areas. We demonstrate qualitatively and quantitatively that our strategy, called Langevin Cooling (L-Cool), enhances state-of-the-art methods in image translation and language translation tasks.

Paper

Video

Examples of Reinforcement Learning applications in the financial market
Heuna Kim
Thursday 22nd October 2020 more less

Reinforcement Learning has been broadly employed in financial markets for the last few years by benefiting from its nature of combining the behavior optimization (in this case buy and sell) and the market prediction.
We will first discuss the hierarchical reinforcement learning scheme deployed by JPMorgan (NIPS Workshop 2018, paper) and then take a look at other two examples of DRL applied in trading.
The first one (KDD 2019, paper) is implementing an interpretable network that works similar to a traditional trading strategy (Buying-Winners-and-Selling-Losers).
The second one (ICML 2019, paper) is extending a traditional mathematical model (the Almgren and Chriss model) to a multi-agent setting in order to optimize a liquidation strategy.

Slides

Inductive programming in limited domain: How FlashFill solves string transformation tasks
Ulaş Türkmen
Wednesday 7th October 2020 more less

In this talk, I will take a deep dive into FlashFill, one of the most widely
used applications of inductive programming in consumer software. Inductive programming is the generation of programs, or processing of new input, based on a specification over the input-output space. FlashFill was developed to help with programming problems faced by Excel users, where the input and output are both strings. Due to its limited domain, FlashFill can deliver satisfying
results with low computational and implementation complexity. It works by using a domain-specific language to generate code that is valid for a given set of examples, and ranking them using a complexity heuristic to pick the simplest, which is also the most probable. The user can either run the generated mini-program over further input, correcting it by providing more examples if necessary, or get a pseudocode representation of the program.

References:

- Spreadsheet Data Manipulation Using Examples (pdf)

- Automating String Processing in Spreadsheets Using Input-Output Examples (pdf)

Semantic Maps
Johannes Mosig
Wednesday 5th August 2020 more less

We investigate a novel word embedding, called "Semantic Map". In contrast to now popular embeddings, the Semantic Map Embedding does not represent words in a vector space, but in something related to a topological space. This allows for new and interesting operations on words and we have some promising preliminary results. All of this is ongoing and as-of-yet unpublished research at Rasa.

Learning robust visual representations using data augmentation invariance
Alex Hernández-García
Wednesday 3rd June 2020 more less

We show that the representations learnt by neural networks trained on image object classification are not more robust (in the sense of representational similarity) to image transformations than at the pixel space. This contrasts with a well-studied property of the visual cortex: the increasing invariance along the visual ventral stream to identity-preserving transformations. Taking inspiration from this, we propose a learning objective, data augmentation invariance, that encourages robust visual representations. The models trained with data augmentation invariance learn robust representations without detriment to the classification accuracy.

Paper

Slides

Code

Unsupervised Representation Learning with Contrastive Pre-training
Oğuz Şerbetci
Wednesday 6th May 2020 more less

Recent self-supervised (unsupervised) pre-training methods in NLP have enabled to transfer learning from large unlabeled data to small labeled target data. In 2019, contrastive training has been introduced by [1] which demonstrated strong performance for image and speech in similar setups. Since then, others have improved on the idea and reached performance of supervised pre-training [3]. I found the main approach pretty cool and would like to present it along with interesting additions from more recent work.

Slides

[1] Initial paper introducing contrastive self-supervised pre-training for image and speech domains: http://arxiv.org/abs/1807.03748
[2] Update from first author with others that improves on it: http://arxiv.org/abs/1905.09272
[3] Best performance I'm aware of: http://arxiv.org/abs/2002.05709

Support vector classifiers
Stephen Enright-Ward
Wednesday 22nd April 2020 more less

I always wanted to understand support vector classifiers are (SVCs), but never got around to it. In this general introductory talk, I will remedy this by carefully defining SVCs, paying special attention to explaining the mathematics behind them.

The seminar will be held remotely:
https://tiny.cc/video-conference

Slides

Modelling of non-linear state space systems using deep neural networks
Julian Parker
Wednesday 4th March 2020 more less

'Virtual analog' (VA) is a common area of study in music related signal processing research. VA work focuses on constructing digital models of the many analog electronic circuits used in devices such as synthesisers, guitar amplifiers, effects pedals etc. Recently we introduced a new method of modelling such systems by training a neural network to approximate measurements of the many internal signals of such systems, hence reproducing the manifold the system inhabits in its 'state space'. In this talk, I'll give an overview of the method and examples of its application to several circuits. I'll also discuss how the method fits within the wider space of deep-learning techniques.

Paper:

Julian D. Parker, Fabián Esqueda, André Bergner: Modelling of non-linear state space systems using deep neural networks. DAFx 2019 pdf

Uncertainty quantification in Reinforcement Learning
Baddar
Wednesday 5th February 2020 more less

Uncertainty quantification is one of the hot research topics in machine learning and stochastic modeling. Despite the existence of many papers presenting different approaches to model and quantify uncertainty in machine learning models, few of them links uncertainty quantification with Reinforcement learning. IMHO, it is of great importance to study this link to understand how inherent model uncertainties affect agent strategies in RL.

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
Nick Hoff
Wednesday 15th January 2020 more less

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
pdf

Dialogue Transformers
Vladimir Vlasov
Wednesday 4th December 2019 more less

We introduce a dialogue policy based on a transformer architecture, where the self-attention mechanism operates over the sequence of dialogue turns. Recent work has used hierarchical recurrent neural networks to encode multiple utterances in a dialogue context, but we argue that a pure self-attention mechanism is more suitable. By default, an RNN assumes that every item in a sequence is relevant for producing an encoding of the full sequence, but a single conversation can consist of multiple overlapping discourse segments as speakers interleave multiple topics. A transformer picks which turns to include in its encoding of the current dialogue state, and is naturally suited to selectively ignoring or attending to dialogue history. We compare the performance of the Transformer Embedding Dialogue (TED) policy to an LSTM and to the REDP, which was specifically designed to overcome this limitation of RNNs.

Vladimir Vlasov, Johannes E. M. Mosig, Alan Nichol: Dialogue Transformers. arXiv 2019 pdf

Slides

Introduction to transformers video

Explanations can be manipulated and geometry is to blame
Pan Kessel
Thursday 21st November 2019 more less

Explanations of machine learning decisions have attracted considerable attention in recent years as they aim to make machine decisions more transparent and trustworthy. In this talk, I will briefly review explanation methods for neural networks. I will then show that these methods are surprisingly vulnerable to manipulations. I will explain that this vulnerability can be understood theoretically by using differential geometry. In the last part of the talk, I will explain how these theoretical insights suggest modifications of explanation methods which make them more robust. The talk is based on this paper

Unsupervised machine translation
Joris Dolderer
Wednesday 16th October 2019 more less

Unsupervised machine translation systems require only monolingual text corpora - i.e. no dictionary or parallel sentences - for training. After rapid advances in the last few years, the current state of the art methods [1] achieve BLEU scores comparable to or better than the best supervised systems from 2014. Many of these systems incorporate old-fashioned statistical machine translation (SMT) systems, rather than relying on neural machine translation exclusively. Furthermore, they often start with cross-lingual word or phrase embeddings learned in an unsupervised fashion, using algorithms such as VecMap [2].

The aim of this talk is to give an overview of how an unsupervised MT system works. For concreteness, I will focus on the system developed by Artetxe et al. [3], which incorporates VecMap.

Slides

[1] https://paperswithcode.com/task/unsupervised-machine-translation
[2] https://arxiv.org/abs/1805.06297
[3] https://arxiv.org/abs/1902.01313

Neural Networks Don't Learn Default Rules for German Plurals, But That's Okay, Neither Do Germans
Kate McCurdy
Tuesday 3rd September 2019 more less

Can artificial neural networks learn to represent inflectional morphology and generalize to new words as human speakers do? Some linguists have argued that the German number system cannot be modeled without rule-based symbolic computation, because the `default' plural marker, /-s/, is also the least frequent; they claim that speaker preferences for /-s/ in elsewhere conditions, such as novel and phonologically atypical nouns, require representation of linguistic rules and thus cannot be inferred from data alone.

We present a new dataset of German speakers' production and rating of plural forms for novel nouns, and note that the results provide at best weak support the claimed `default' status for /-s/, reducing its potential challenge for neural models. Nonetheless, we observe that neural encoder-decoder models, while broadly successful on this `wug' task, show distinctive failure modes suggesting they do not generalize in quite the same manner as human speakers.

A Style-Based Generator Architecture for Generative Adversarial Networks
Katharina Rasch
Wednesday 5th June 2019 more less

You might have seen this face generator that creates creepily
realistic faces? Let's take a look
at the paper behind this method and learn how they pulled it off. This
paper is a nice combination of different techniques that have been
developed in GANs over the last years and a chance to learn a bit about
what has been going on in this field. Some things I will talk about are
progressive GANs, GAN evaluation metrics, dataset preparation, maybe a
bit on different loss functions that people use.

Original paper abstract:

arXiv

We propose an alternative generator architecture for generative
adversarial networks, borrowing from style transfer literature. The
new architecture leads to an automatically learned, unsupervised
separation of high-level attributes (e.g., pose and identity when
trained on human faces) and stochastic variation in the generated
images (e.g., freckles, hair), and it enables intuitive,
scale-specific control of the synthesis. The new generator improves
the state-of-the-art in terms of traditional distribution quality
metrics, leads to demonstrably better interpolation properties, and
also better disentangles the latent factors of variation. To quantify
interpolation quality and disentanglement, we propose two new,
automated methods that are applicable to any generator architecture.
Finally, we introduce a new, highly varied and high-quality dataset of
human faces.

GPyTorch: A scalable approach to Gaussian Processes
Ludwig Winkler
Wednesday 15th May 2019 more less

Gaussian Processes are a class of popular and powerful probabilistic models which can be derived straight from a Normal distribution. While vanilla GP’s are very flexible due to their nonparametric formulation, inference and training both rely on the evaluation of the entire data set through several kernel matrices, their products, log determinants and inverses. These kernel matrices over the entire training data set and their numerous use in matrix matrix pdocuts, log determinants and inverses result in poor scalability to large data sets. The advent of GPU’s in machine learning has offered the possibility of computing matrix vector and matrix matrix products efficiently and in parallel. GPyTorch introduces several randomized and parallelized algorithms as replacements for exact computations which reduce the complexity of inference and training in GP’s. These include stochastic approximations of the trace, parallel conjugate gradient descent and several efficient applications of eigendecompositions. More importantly, all of these adaptations allow full utilization of parallel hardware during inference and training. Considerable focus will be on conjugated gradient descent which allows the efficient optimization of quadratic optimization problems through the use of line search and conjugate search directions.

Reinforcement Learning as Probabilistic Inference
Mathias Schmerling
Wednesday 27th March 2019 more less

Among the many challenges facing (Deep) Reinforcement Learning algorithms, most notably when advancing into real-world robotics applications, is their lack of stability: state-of-the-art algorithms are very sensitive to the choice of hyperparameters and asymptotic performance can vary widely across random seeds. A modification of the standard RL framework called Maximum Entropy Reinforcement Learning holds the promise to address this. In maximum entropy RL, the agent optimizes a tradeoff between the expected reward and entropy of its policy, i.e. tries to solve the task while behaving as randomly as possible.

While often framed as a purely algorithmic modification, surprisingly, maximum entropy RL can be identified with probabilistic inference in a particular graphical model. Drawing a link between Reinforcement Learning and Probabilistic Inference is interesting as graphical models allow for intuitive reasoning about partial observability, compositionality and (hierarchical) latent variables. It stands to reason that a probabilistic inference perspective can inspire new algorithms that are able to address the stability issues of Deep RL. This talk will explain the relation between RL and Probabilistic Inference and is based on "Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review" by Sergey Levine.

Inferring evolutionary histories with maximum likelihood in tree space
Benjamin Wilson
Wednesday 27th February 2019 more less

This talk discuss the problem of "phylogenetic inference", i.e. the task of inferring evolutionary history from available (typically present day) genetic samples. No background in biology is assumed -- we'll just be interested in the inference method.
We'll cover:

the Jukes-Cantor model of DNA evolution as a continuous time Markov chain;

the application of maximum likelihood to determine the edge lengths in an evolutionary tree, given DNA sequence data;

the combinatorial tree modifications that are used to move through the "space" of all possible trees.

One of the most interesting aspects (to me) is that the optimisation takes place in a space that is both continuous (you can change the edge lengths on a tree smoothly) and combinatorial (the collection of possible tree shapes, i.e. disregarding the edge lengths, is discrete). So what's the best way to do it?

Inference over graph structures using a diffusion kernel
Charley Wu
Wednesday 5th December 2018 more less

Many types of problems require highly structured representations, which can be better described using graph structures rather than vector space representations. In this talk, I will describe a method for performing inference over graph structures using the Gaussian Process regression framework. I will introduce the Diffusion Kernel, which provides a method for estimating the covariance structure of any weighted or unweighted undirected graph, based on it's adjacency matrix. The Diffusion kernel can be understood as an extension of the Successor Representation for Bayesian inference, allowing us to compute predictive posterior estimates of the expected value and underlying uncertainty of unobserved states. I will provide a tutorial on how to use Diffusion kernel for performing inference, and draw connections to a variety of Machine Learning problems, including re-use of learned structures in Reinforcement Learning problems.

The Transformer
Adriaan Schakel
Thursday 15th November 2018 more less

The attention mechanism introduced by Dzmitry Bahdanau et al. [1] in the context of neural machine translation has found application in a wide range of tasks in natural language processing [2,3] and beyond. Recently, Alec Radford et al. [4] proposed to use the mechanism to effectively train a language model on a corpus of unlabeled text. They demonstrated that their general, task-agnostic pre-trained model outperforms discriminatively trained models with architectures specifically crafted for the task at hand.
This talk introduces the attention mechanism, provides an intuitive application in the context of computer vision, and details the specific iteration of the mechanism used in the Radford paper, viz. the transformer decoder [3].

Neural machine translation by jointly learning to align and translate, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.

Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.

Generating Wikipedia by Summarizing Long Sequences, Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer

Improving Language Understanding by Generative Pre-Training, Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Their code can be found here.

On Laplacian Eigenmaps for Dimensionality Reduction
Juan Camilo Orduz
Thursday 4th October 2018 more less

The aim of this talk is to describe a non-linear dimensionality reduction algorithm based on spectral techniques introduced in Belkin & Niyogi (2003).
The goal of non-linear dimensionality reduction algorithms, so called manifold learning algorithms, is to construct a representation of data on a low dimensional manifold embedded in a high dimensional space. The particular case we are going to present exploits various relations of geometric and spectral methods (discrete and continuous). Spectral methods are sometimes motivated by Marc Kac's question Can One Hear the Shape of a Drum? which makes reference to the idea of recovering geometrical properties from the eigenvalues (spectrum) of a matrix (linear operator). Concretely, the approach followed in Belkin & Niyogi (2003) has its foundation on the spectral analysis of the graph Laplacian of the adjacency graph constructed from the data (von Luxburg 2007). The motivation of the construction relies on the continuous limit analogue, the Laplace-Beltrami operator, in providing an optimal embedding for manifolds. We will also indicate the relation with the associated heat kernel operator (Ham. 2004). Instead of following a pure formal approach we will present the main geometric and computational ideas of the algorithm. Hence, with basic knowledge of linear algebra (eigenvalues) and differential calculus you will be able to follow the talk. Finally we will show a concrete example in Python using scikit-learn.

An Introduction to Homomorphic Encryption
Stephen Enright-Ward
Wednesday 19th September 2018 more less

Classically, one must decrypt encrypted data in order to process it usefully. This restriction forces us to trust the security of third party computational resources, e.g. in the cloud. In the last decade, Gentry and others have invented new methods enabling arbitrary computations on encrypted data, yielding encrypted outputs whose decryptions are the “correct” answers — i.e. the output of the same computation on the original plain text. Such schemes are called “homomorphic encryption”, since encryption and function evaluation commute in this way. I’ll review a simple homomorphic encryption scheme, and explain how and why it works.

No prior reading required, but if you'd like a head-start, here is an introduction from Gentry

Handwritten notes.

Surface Realization with Neural Sequence-to-Sequence Inflection and Incremental Locality-Based Linearization
David King
Wednesday 18th July 2018 more less

Surface realization is a nontrivial task as it involves taking structured data and producing grammatically and semantically correct utterances. Many competing grammar-based and statistical models for realization still struggle with relatively simple sentences. For our submission to the 2018 Surface Realization Shared Task, we tackle the shallow task by first generating inflected wordforms with a neural sequence-to-sequence model before incrementally linearizing them. For linearization, we use a global linear model trained using early update that makes use of features that take into account the dependency structure and dependency locality. Using this pipeline sufficed to produce surprisingly strong results in the shared task. In future work, we intend to pursue joint approaches to linearization and morphological inflection and incorporating a neural language model into the linearization choices.

Authorship modeling and prediction using latent topics (or LDA)
Martin Stamenov
Wednesday 18th July 2018 more less

The advent of faster personal computers and tools such as automatic differentiation has enabled deep learning software like Theano. Software like this leads to wide adoption of deep learning, which in turn leads to an explosion in interest and new discoveries in the field. In a similar fashion, fast inference in probabilistic models with high quality and user-friendly implementations can lead to more interest and hence more research in Bayesian data analysis.

The main algorithm in this research is the fairly recent extension of LDA - Author-topic model. It promises to give data scientists a tool to simultaneously gain insight about authorship and content in terms of topics. The authors can represent many kinds of metadata attached to documents, for example, tags on posts on the web. The model can be used for data exploration, as features in machine learning pipelines, for author (or tag) prediction, or to simply leverage your topic model with existing metadata. Starting with the recent implementation of the Author-topic model in Gensim, we build on top of this work by creating a new feature, which allows inference of topics on a new collection of corpus data. We use it as querying tool for our training data in terms of finding similar authors or “tags”, by means of distance functions. We measure its author-predictive performance on two separate datasets. The research ended with an accepted and merged pull request and and a Jupyter-notebook[1] tutorial in Gensim’s Github repository, as this functionality seemed to be of interest for many [2] people.

[1] https://github.com/RaRe-Technologies/gensim/pull/1766
[2] https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/atmodel_prediction_tutorial.ipynb

Machine learn-a-thon
all-in
Sunday 17th June 2018 more less

It's like a hack-a-thon, but for our machine learning ideas and projects. Let's choose something we want to learn about, break up into small groups (or go solo) and just do our best to learn about it for a day. It could be some theory, or a paper, or some code, whatever you like. We will teach and learn from each other, so please add to the spreadsheet below any topics you would like to learn about, and what help you could offer to others. It will be collaborative, not competitive, so there are no prizes. You can give a short presentation at the end of the day, if you like, or not, if you don’t. Afterwards, we’ll have dinner together. Hoping to see you there!

Neural Machine Translation - Better pay attention when you’re translating
Ludwig Winkler
Wednesday 23rd May 2018 more less

Machine translation made a big step forward through the use of deep recurrent neural networks paired with an attention mechanism. Motivated by the recent advances I will explain how the sequence-to-sequence neural machine translation system used in Google Translate works, what role the attention mechanism plays in it and how zero-shot learning for translation is possible with multi-lingual translation systems.

Furthermore I will explain how the ‘Transformer’ architecture works which relies solely on attention and eliminates the sequential dependencies while processing information.

A great article that sparked my interest in NMT a while back can be found here: https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

The papers which are the basis for this talk are:

Google Neural Machine Translation System: https://arxiv.org/abs/1609.08144

Google Multilingual Neural Machine Translation System: https://arxiv.org/abs/1611.04558

Attention Is All You Need: https://arxiv.org/abs/1706.03762

Logistic regression on Riemannian manifolds
Matthias Leimeister
Wednesday 9th May 2018 more less

In classification problems, item features are usually represented as a vector in Euclidean space. Common classifiers such as logistic regression or support vector machines can then be trained to distinguish two or more classes using these features. For some data sets, however, there is a more natural representation of the data on a lower dimensional manifold. For example, L2-normalized feature vectors are naturally points on a unit sphere. Graph and word embeddings in hyperbolic space, which have gained much attention recently, provide another example [1].

In general, Riemannian manifolds have different geometric properties than flat Euclidean space. For example, the shortest path between two points, also called a geodesic, might not be a straight line when viewed from the ambient space. Also, in the Euclidean setting the scalar product is often used as a similarity function between two vectors. On a Riemannian manifold, a scalar product is only defined within the tangent space at each point, therefore requiring some other way to relate the points to each other. This makes it necessary to rethink the formulation of a linear classifier in terms of these geometric structures. Lebanon and Lafferty [2] propose to reformulate logistic regression on the unit sphere by interpreting the loss function geometrically and transforming it to the manifold setting.

In the talk, I will present some basic concepts from Riemannian geometry, introduce the classifier used by Lebanon and Lafferty and outline an extension for hyperbolic space.

[1] Maximilian Nickel, Douwe Kiela: Poincaré Embeddings for Learning Hierarchical Representations. NIPS 2017.
[2] Guy Lebanon, John Laffertey: Hyperplane Margin Classifiers on the Multinomial Manifold. ICML 2004.

Notebook Notes

Bayesian compression of Deep Neural Networks
Simon Wiedemann
Wednesday 25th April 2018 more less

Every ML practitioner knows that neural networks are state-of-the-art in a wide spectrum of tasks. However, their large amount of weights and computations difficult their application in real world scenarios, in particular in cases where they need to be deployed into resource constrained devices. Therefore, compression and efficiency has recently become an active topic of interest in the deep learning community.
Motivated from Bayes' theorem and the minimum description length principle, in this talk I will present variational dropout, a powerful technique that regularises the training of deep neural networks into being low complex. The series of papers that have developed this technique have shown that: 1) variational dropout can be viewed as generalisation of the famous dropout method and thus preforms as good or better than it, 2) the resulting trained networks can be highly compressed. In fact, authors report that they can prune up to 99.5% of the weights of LeNet5 away, or compress the VGG network by 95x of it’s original size.

Slides from the talk are available here.

Papers:
arxiv.org/pdf/1506.02557
arxiv.org/pdf/1701.05369
arxiv.org/pdf/1705.08665

Hierarchical Reinforcement Learning with Options Framework
Oğuz Şerbetci
Wednesday 4th April 2018 more less

Despite recent breakthrough applications of Reinforcement Learning, temporal abstraction remains a long-standing goal. Instead of making decisions at atomic time-steps, temporal abstractions allow a reinforcement learning agent to act in different time-scales. For instance, while navigating a room an agent needs to make the high-level decision regarding which door to use, but it also need to chose low-level actions: the steps required to reach the door. In this talk, we will look at an approach from RL literature called the Options Framework (1999 Sutton, Precup, Singh) with its recent extension the Option-Critic Architecture (2016 Bacon, Harb, Precup).

You do not have to possess a deep understanding of Reinforcement Learning as the basics will be explained when needed.

The Options Framework

The Option-Critic Architecture

Breaking simple substitution ciphers with simple algorithms
Stephen Enright-Ward
Thursday 22nd March 2018 more less

Some historical texts are now unreadable, either because the script, or the underlying language, or both, are unknown. If record of the language still exists but the script is unfamiliar — for example, because the text is written in German, but in an unfamiliar alphabet — then reading the text is a two-step decipherment problem: First recognise the language, then map the unknown script to a familiar one (this applies both when the script was once widely used but is now forgotten, and when the script is a deliberate encryption scheme). I will discuss naive statistical algorithms to decipher such texts, applicable only for simple encryption techniques. Such techniques have been used successfully by Knight, Megyesi and Schaefer to break the Copiale Cipher, in 2011, and have also been applied, so far unsuccessfully, to the Voynich Manuscript.
This talk contains no original work and no machine learning. Notes are available here.

Recurrent Neural Network Grammars
Kate McCurdy
Tuesday 16th January 2018 more less

Dyer, C., Kuncoro, A., Ballesteros, M., & Smith, N. A. (2016). Recurrent Neural Network Grammars. arXiv:1602.07776 [Cs].
Published in NAACL 2016.
Retrieved from http://arxiv.org/abs/1602.07776

We introduce recurrent neural network grammars, probabilistic models of sentences with explicit phrase structure. We explain efficient inference procedures that allow application to both parsing and language modeling. Experiments show that they provide better parsing in English than any single previously published supervised generative model and better language modeling than state-of-the-art sequential RNNs in English and Chinese.

Learning embeddings in hyperbolic spaces
Benjamin Wilson
Thursday 30th November 2017 more less

Consider the problem of embedding a graph in two-dimensional Euclidean space. Nodes should be close together in the embedding space precisely when they are joined by an edge in the graph (where "close" means e.g. distance <= 1). How would you embed, say, a binary tree? One space-efficient way is to place the root at the origin, and arrange the nodes of depth N on a circle of radius N centred at the root. But things get crowded fast: there are 2^N nodes of depth N, and they have to fit around a circle whose circumference is only linear in N. In fact, already for small N, the distances between the neighbouring nodes on the Nth circle are <=1, and so are implicitly joined by an edge. But then you won't have a tree anymore! In order to proceed, you'll either need to work in a higher dimensional Euclidean space, or work in a space where circles have a longer circumference ...

In hyperbolic space, the circumference of circles grow exponentially with the radius, so it might be a great place for embedding trees, or indeed for embedding data that has a hierarchical structure. In this talk, we'll introduce a model or two for hyperbolic space and talk about how to train embeddings there, and compare this to the problem of embedding points on a sphere, where things are much easier to visualise.

Notes

Learning Semantic Hierarchies via Word Embeddings
Janna Lipenkova
Tuesday 7th November 2017 more less

In this talk, I will first outline the basic concepts of ontology structures, use cases and construction methods for ontologies. Then, I will go into detail on the paper "Learning Semantic Hierarchies via Word Embeddings" by Fu et al. (ACL 2014) which proposes a new method for the construction of semantic hierarchies based on word embeddings, which can be used to measure the semantic relationship between words. The method identifies whether a candidate word pair has a hypernym–hyponym relation by using the word-embedding-based semantic projections between words and their hypernyms.

Mapping paired representations to common spaces with CCA
Stephen Enright-Ward
Wednesday 25th October 2017 more less

Sometimes data scientists have different vectorial representations of the same thing, and wish to interrelate them. For example, a bilingual German/English newspaper might model each article with a pair of vectors: one for the German version, and one for the English. If the two text-to-vector models have been learned independently, the German and English vectors for each article will be dissimilar in general, but related because they represent the same story. Canonical Correlation Analysis, or CCA, is a mathematical technique that both quantifies such relationships, and provides an algorithm for mapping the two sets of vectors into a common space, such that the paired vectors become similar. I will explain what CCA is and how it works.

Notes are available here.

Fooling neural networks
Katharina Rasch
Wednesday 13th September 2017 more less

It is surprisingly easy to fool deep neural networks (and other machine learning models). Change a few pixels in an image (changes not visible to humans!) and suddenly a zebra is classified with high confidence as a microwave. Let's look at some methods for generating such adversarial examples and discuss what this could tell us about what our models are actually learning.

Here is a good read if you can't make it.

Generative classification and spotting outliers
Alan Nichol
Thursday 24th August 2017 more less

Classifiers like SVMs and NNs don't give us a real, meaningful probability along with their prediction. They also lack the capacity to recognise 'outlier' data as not belonging to any class in particular. I posit that we need generative models to do this, and we'll try to build a Chinese Restaurant Process + Gaussian mixture model that can (i) model the full data distribution and hence give meaningful probability estimates and (ii) incorporate outliers into new classes.

Using multi-armed bandits for dealing with the cold start problem
Till Breuer
Wednesday 9th August 2017 more less

Bandit algorithms can be used in the context of recommendation systems to reasonably fill the knowledge gap at cold start. Throughout this talk at least one simple bandit approach will be demonstrated at the example of a playful swipe app for exploring the car offerings of an Austrian car reselling platform.

Locality-Sensitive Hashing
Aaron Levin
Wednesday 26th July 2017 more less

In this talk we’ll explore locality-sensitive hashing, a technique to turn the computationally expensive exact nearest-neighbor search problem into an inexpensive approximate solution (it’s a neat trick and I promise you’ll love it). We’ll see how locality-sensitive hashing is used in image search, recommendations, and other machine learning problems. And of course, we’ll mention deep hashing, because why not?

Hierarchical softmax
Benjamin Wilson
Wednesday 26th July 2017 more less

Hierarchical softmax is an alternative to the softmax that scales well in the number of possible classification outcomes. It has been applied in several word embedding models, where the task is predicting a missing vocabulary word, given the context. Indeed, together with negative sampling, it is what has made large-corpus word embeddings computationally viable. It works via a really neat trick, that begins by choosing an arbitrary binary tree whose leaves are the outcomes of the classification. I'll explain to you exactly how it works and explore some questions about the effect of the choice of tree on the classification quality.

Notes from the talk are written up as a blog post.

Grammatical gender associations outweigh topical gender bias in cross-linguistic word embeddings
Kate McCurdy
Wednesday 5th July 2017 more less

Recent research has demonstrated that the relations captured in vector space semantic models can reflect undesirable biases in human culture. Our investigation of cross-linguistic word embeddings reveals that topical gender bias interacts with, and is surpassed in magnitude by, the effect of grammatical gender associations, and both may be attenuated by corpus lemmatization.

Minsky and Papert's "Perceptrons"
Benjamin Wilson
Wednesday 7th June 2017 more less

Perceptrons are simple neural networks with a single-layer of learned weights. Despite their simplicity, their invention (by Rosenblatt in 1958) brought about a first great wave of optimism for connectionist methods in artificial intelligence. This optimism was brought to a halt by the book of Minsky and Papert, published in 1969, which proved a number of interesting results about the sorts of functions that can and can not be represented by a perceptron. In this talk, we'll review these results and the interesting methods used to prove them (elementary group theory). We'll also take the opportunity to talk also about the perceptron learning algorithm, which uses no knowledge of the error function beyond its evaluation at the training examples - in particular, it doesn't need gradients.

Notes (pages 1-8)

Learning end-to-end optimized lossy compression codecs
Simon Wiedemann
Wednesday 17th May 2017 more less

At the core of all multimedia technologies we find compression codecs. This are information processing methods that aim to eliminate as much redundancies as possible contained in the raw data of a source (thus, reducing the amount of information needed in order to represent them). In lossy compression schemes, at some point during the process, one quantizes the data values which inevitably results in a loss of information. Thus, in the design of such codecs, one searches for the optimal trade-off between information loss (or distortion) and size reduction (or rate). However, due to the complexity of most real world data sources (like images and videos), solving the posed rate-distortion problem becomes intractable. Therefore, most of todays codecs (e.g. JPEG, H264, etc.) rely on clever heuristics and methods that make the lossy compression problem amenable.
Nonetheless, in recent years, neural networks have become commonplace to perform tasks that had for decades been accomplished by ad hoc algorithms and heuristics (e.g. object recognition). In fact, recent papers in this field have shown that this powerful class of methods are able to beat standard codecs (like JPEG) for the task of lossy image compression. This results prove their potential to perhaps some day replace the current paradigms of approaching the compression problem.

In the first part of my talk, I will give a basic introduction to information and source coding theory and explain some of the standard heuristics employed in the field of lossy image compression. In my second part, I will explain the approach taken by Ballé et al. (under review for ICLR 2017), which is based solely on neural networks.

A deeper introduction to reinforcement learning
Ludwig Winkler
Wednesday 3rd May 2017 more less

An introduction to reinforcement learning covering model-based, model-free and approximate methods with deep neural networks.

Sampling-based approaches for language modelling and word embeddings
Matthias Leimeister
Wednesday 26th April 2017 more less

Neural probabilistic language models (NPLM) aim at predicting a word given its context. Using neural networks, those models learn a real-valued dense representation, or word embedding, for each word in the vocabulary that can be used for subsequent tasks. Training an NPLM based on a softmax classifier output for a target word given the context is computationally expensive for large vocabularies, so that various algorithms have been proposed for making the training more efficient. The talk will introduce several sampling-based approximations to the softmax that aim at distinguishing the true target word from a number of noise words, with a focus on noise contrastive estimation and negative sampling (word2vec).

Andriy Mnih and Yee Whye Teh: A fast and simple algorithm for training neural probabilistic language models. ICML 2012. pdf

Tomas Mikolov et. al.: Distributed representations of words and phrases and their compositionality. NIPS 2013. pdf

Notes

Everybody GANs Now
Charley Wu
Wednesday 15th March 2017 more less

Generative Adversarial Networks (GANs) are one of the biggest topics in Machine Learning right now. I will give a tutorial on the principles behind Adversarial Training, and how it involves finding a Nash Equilibrium between two competing agents, the Generator and the Discriminator. I will also show that this solution has interesting implications in terms of informational-theoretic divergence measures (KL-Divergence and Jensen-Shannon Divergence), which underly all traditional approaches to computational modeling (i.e., finding a maximum likelihood estimate). notes

Are RNNs Turing complete?
Benjamin Wilson
Wednesday 1st February 2017 more less

At the recent NIPS conference, I must have heard it said five times that “RNNs are Turing complete”. All fives time were by Juergen Schmidhuber. At least as many times, I heard other attendees grumbling that this was false. In this talk we’ll explore the assumptions and main ideas of the proof as well as consider whether these assumptions hold for “real-world” RNNs implemented using finite-precision arithmetic.

If you feel like a head-start, we’ll be covering the paper of Siegelmann and Sontag “On the Computational Power of Neural Nets” (1995). Notes from the talk are available here.

NIPS report
Marcel Ackermann
Wednesday 11th January 2017

COLING report
Kate McCurdy
Wednesday 11th January 2017

A simple introduction to Bayesian Learning
Stephen Enright-Ward
Wednesday 30th November 2016 more less

Machine learning often means minimising a loss function by randomly initialising its parameters, then updating them repeatedly in response to batches of training data, for example using gradient descent. The idea behind Bayesian Learning is to replace a point estimate of the parameters with a probability distribution over the parameter space, and replace pointwise updates with Bayes' Theorem, to turn this "prior" distribution into a "posterior", which incorporates evidence from the training set. This talk will be an elementary introduction Bayesian Learning, using simple examples and small amount of theory. notes

Deep Neural Networks for YouTube Recommendations
Alexey Rodriguez Yakushev
Wednesday 16th November 2016 more less

A talk on the model and architecture of the YouTube recommendation system (abstract). It should offer some really interesting insights into the workings of one of the world's most heavily-used recommenders!

Connectionist temporal classification
Matthias Leimeister
Wednesday 9th November 2016 more less

Notes
Implementation

Predicting incidental comprehension of macaronic texts
Kate McCurdy
Wednesday 19th October 2016

A word embedding model for explaining compositionality
Stephen Enright-Ward
Wednesday 17th August 2016 more less

Notes

Audio featurisation and speaker recognition
Benjamin Wilson
Wednesday 27th July 2016 more less

Notes

Collaborative Filtering and the NetFlix Prize
David Yu
Wednesday 6th July 2016 more less

We'll describe the Netflix Problem and then go through the paper "Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights” by Yehuda Koren and Robert Bell (Bell Labs) - the eventual winners of the competition. If time permits, description of the Map-Reduce framework and how to apply the approach in the paper to that framework.

Gradient Descent Methods
Benjamin Wilson
Wednesday 22nd June 2016 more less

We study the rate of convergence of gradient descent in the neighbourhood of a local minimum. The eigenvalues of the Hessian at the local minimum determine the maximum learning rate and the rate of convergence along the axes corresponding to the orthonormal eigenvectors. All this material is drawn from Chapter 7 of Bishop’s Neural Networks for Pattern Recognition, 1995. notes

Bayesian Dialog management: how to build a chat interface without a flow chart.
Alan Nichol
Thursday 9th June 2016 more less

Statistical dialogue systems are motivated by the need for a data-driven framework that reduces the cost of laboriously hand-crafting complex dialogue managers and that provides robustness against the errors created by speech recognisers operating in noisy environments. By including an explicit Bayesian model of uncertainty and by optimising the policy via a reward-driven process, partially observable Markov decision processes (POMDPs) provide such a framework. However, ex- act model representation and optimisation is computationally intractable. Hence, the practical application of POMDP-based systems requires efficient algorithms and carefully constructed approximations.

Music similarity and speaker recognition what do they have in common?
Matthias Leimeister
Wednesday 11th May 2016 more less

Music information retrieval has largely been influenced by methods from speech processing. In this talk I want to present a line of research on music similarity models that followed a parallel development in the area of speaker recognition and identification. Specifically I'll discuss a common set of features (MFCCs) and statistical models (single Gaussians, GMM supervectors, and i-vector factor analysis) that have been used successfully in both domains. slides

Factorisation Machines
Stephen Enright-Ward
Wednesday 27th April 2016 more less

Notes available here.

Negative sampling for recommender systems and word embeddings
Benjamin Wilson
Wednesday 13th April 2016 more less

Common to several approaches for training recommenders is contrasting the data observed in a context (e.g. the products purchased by a user) with negative samples chosen from the pool of all possible products. In this talk well review some different ways of choosing negative samples, starting with static approaches like uniform and popularity based sampling and moving through to adaptive approaches (like BPRs adaptive oversampling and WARP) that take into account the context and the models current belief. All of these ideas can be transferred to the world of word embeddings, where the negative sampling techniques are at present less sophisticated.

References:

Steffen Rendle and Christoph Freudenthaler, Improving Pairwise Learning for Item Recommendation from Implicit Feedback, 2014, pdf my notes

Jason Weston, Samy Bengio and Nicolas Usunier, Wsabie: Scaling Up To Large Vocabulary Image Annotation, 2011 pdf my notes

Gaussian Process Models: what to do when you can’t optimize?
Charley Wu
Wednesday 16th March 2016 more less

Optimization problems are ubiquitous in machine learning, but what do we do when we cant optimize? When the problem space is too large, search costs are expensive, or it is not feasible to exhaustively search through all possible solutions, it becomes impossible to find a guaranteed optimal solution. Instead, we need to find the environment-appropriate balance between exploration and exploitation in order to produce a satisfactory solution, within a finite search horizon.

I will be giving a tutorial on Gaussian Process (GP) models as a method to model Bayesian beliefs about unexplored areas of the problem space. GPs give us both the expected value and uncertainty of each location in the problem space, based on knowledge from previous observations. With a belief model of the environment, we can treat the optimization problem as a bandit problem, with an exploration goal (reduction of uncertainty) and an exploitation goal (closer approximation of the global optima).

Lastly, I will also introduce a bit of the literature on human exploration-exploitation behavior, and how untrained undergrads in a non-mathematical discipline are better and more efficient than the best state-of-the art optimization algorithms. What makes us good at this task and how can we use this information to build better search algorithms?

Active evaluation of Predictive Models
Christoph Sawade
Wednesday 2nd March 2016 more less

In order to make an informed decision about the deployment of a predictive model, it is crucial to know the model’s approximate performance. To evaluate performance, a set of labeled test instances is required that is drawn from the distribution the model will be exposed to at application time. In many practical scenarios, unlabeled test instances are readily available, but the process of labeling them can be a time- and cost-intensive task and may involve a human expert.

This talk addresses the problem of evaluating a given predictive model accurately with minimal labeling effort. We study an active model evaluation process that selects certain instances of the data according to an instrumental sampling distribution and queries their labels. We derive sampling distributions that minimize estimation error with respect to different performance measures such as error rate, mean squared error, and F-measures.

Another instance is the problem of evaluation the performance of a ranking function. In practice, ranking performance is estimated by applying a given ranking model to a representative set of test queries and manually assessing the relevance of all retrieved items for each query. We apply the concept of active evaluation to ranking functions and derive optimal sampling distributions for the commonly used performance measures Discounted Cumulative Gain (DCG) and Expected Reciprocal Rank (ERR).

Hierarchical Self-Organising Maps and Random Swapping for 2-d Sorting
Kai Uwe Barthel
Wednesday 17th February 2016 more less

There are many techniques for dimensionality reduction used to visualize high-dimensional datasets. A typical approach is principal component analysis, however a linear PCA projection cannot preserve local pairwise distances. Other approaches are multidimensional scaling, self-organizing maps (SOMs), local linear embedding or t-Distributed Stochastic Neighbor Embedding (t-SNE). In most cases the high-dimensional datasets are non-linearly projected and shown as points on a 2D map. If images are to be arranged in a 2D fashion, there are further constraints. The images should not overlap and the 2D display should be equally filled. I will talk about different approaches how to visually sort/arrange images using hierarchical SOMs and directed random swapping approaches. One example of such a visually sorted image display is picsbuffet.com which is a visual image browsing system to visually explore and search millions of images from stock photo agencies and the like. Similar to map services like Google Maps users may navigate through multiple image layers by zooming and dragging.

Counterfactual Performance Estimation or “how can your ML models have a Groundhog Day”
Alexey Rodriguez Yakushev
Wednesday 3rd February 2016 more less

I am currently very interested in evaluating new recommendation models without having to perform A/B testing every single time. So I have been reading a lot on how to properly use log data to evaluate model performance.

It is very common to use log data to train and perform model selection/hyperparameter search of your models, think for instance of click through rate prediction in which you train models on your users’ click stream, or recommendation models which you train using the consumption histories of your users. Log data is relatively easy to get if you have a large user base but it has serious pitfalls.

Log data usually has only partial feedback, you know that a user clicked or not on an ad, but you don’t know whether she would have clicked or not on a different ad. This bandit like feedback will introduce bias in your evaluation metrics and will likely give you a distorted view of what will happen at A/B test time.

We will use Counterfactual Estimators to ask “what if” questions, that is, what would have been the performance if you use a different model than used at logging time. Of course there are no miracles, we will look at limitations of these estimators and their theoretical guarantees.

An intuitive approach to clustering that works … sometimes.
Vinzenz Schönfelder
Wednesday 20th January 2016 more less

This talk presents a clustering method proposed in a recent Science paper. Generally favouring pragmatic approaches to problem solving (as a physicist…), I immediately fell in love with this approach when I learned about it from my colleagues at SISSA. Rodriguez and Laio devised a method in which the cluster centres are recognised as local density maxima that are far away from any points of higher density. The algorithm depends only on the relative densities rather than their absolute values. The authors tested the method on a series of data sets, and its performance compares favourably to that of established techniques. Looking at the method more closely, however, reveals essential weaknesses that depend on apparently minor details in the distribution of the data.

Rodriguez, A., Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496.

Improving Distributional Similarity with Lessons Learned from Word Embeddings
Marcel Ackermann
Wednesday 6th January 2016 more less

A review of the main highlights of the paper of Levy et al. 2015.

Kernels – Duality between Feature Representation and Similarity Measures
Christoph Sawade
Wednesday 2nd December 2015 more less

The questions of ‘how to represent data points’ and ‘how to measure similarity’ are crucial for the success of all methods in data science. Kernel functions can be seen as inner products in some Hilbert space and thus connect feature encoding with the concept of similarity. In this talk, I will elaborate this relationship in the regularized empirical risk framework, which captures lots of standard algorithms like Support Vector Machines, Ridge Regression and Logistic Regression. The goal is to develop a general understanding that is applicable throughout the machine learning world.

MCMC, a space odyssey!
Simon Pepin Lehalleur
Wednesday 18th November 2015 more less

Computing integrals in high-dimensional spaces is a ubiquitous and challenging problem: traditional numerical quadrature methods fail miserably, because they sample the space uniformly while the integrand is usually supported on a very small volume. Markov Chain Monte Carlo is a family of algorithms which solve this problem by sampling more intelligently, using a Markov chain whose stationary distribution is the one we want to integrate. I will explain the basic ideas, including some of the many ways to set up a suitable Markov chain (Metropolis-Hastings, Gibbs sampling)

References:
Part 1 and Part 2 of M. Bethancourt 2-part course in MLSS 2014.
Survey of the aspects of Markov chains which bear on MCMC.
Handbook of MCMC, esp. chap. 1 for a general overview and 5 for Hamiltonian MCMC.

Limiting Distributions of Markov Chains
Benjamin Wilson
Thursday 5th November 2015 more less

A seminar on limiting distributions for (finite-state, time-homogeneous) Markov chains, drawing on PageRank as an example. We see in particular how the “random teleport” possibility in the PageRank random walk algorithm can be motivated by the theoretical guarantees that result from it: the transition matrix is both irreducible and aperiodic, and iterated application of the transmission matrix converges more rapidly to the unique stationary distribution.

Notes and further reading available here.

Topological data analysis and the Mapper algorithm
Stephen Enright-Ward
Thursday 5th November 2015 more less

Giving a sensible visualisation of point clouds in high dimensions is a difficult problem in data science. Many traditional approaches fit point clouds to surfaces, trying to preserve distances. In the last ten years, people have started to use methods from topology, a branch of mathematics in which “distance” is replaced with the spongier notion of “in the neighbourhood of”, to model data points. The aim is to capture the global shape of the data — the signal — while ignoring local noise. I’ll explain one such approach, called the Mapper algorithm, developed by Singh, Mémoli and Carlsson in this paper

Parameterising the Mahalanobis distances
Benjamin Wilson
Wednesday 21st October 2015 more less

Given pairs of points that are known to be similar, and pairs that are known to the dissimilar, can we learn a metric such that similar points are close together? Well consider the special case of the Mahalanobis distance, which is just the Euclidean distance after applying a linear transform. Posing this problem in different ways leads to very different optimisation problems. Ill share my first steps working on this. The introduction of this survey paper gives a good introduction to the topic.

Notes from the talk are here.

Report on EMNLP 2015
Kate M.
Wednesday 7th October 2015

Report on RecSys 2015
Marcel Ackermann
Wednesday 7th October 2015

Limiting Distributions of Markov Chains
Benjamin Wilson
Tuesday 6th October 2015 more less

Pocket-sized Neural Networks
Adriaan Schakel
Tuesday 8th September 2015

RNNs for Language Modelling
Stephen Enright-Ward
Tuesday 25th August 2015 more less

Notes part 1
Notes part 2

Nonnegative Matrix Factorization for Audio Source Separation
Matthias Leimeister
Tuesday 11th August 2015 more less

Audio source separation deals with the problem of decomposing a mixture of sound sources into their individual parts. Typical examples include noise suppression in speech signals or the extraction of single instruments from a music mix. A popular class of algorithms to approach the problem is nonnegative matrix factorization (NMF) and its extensions. In the talk I will review the basics of NMF for source separation following a recent survey paper [1] and present some concrete examples for speech and music [2].

[1] http://cal.cs.illinois.edu/papers/virtanen-spm2015.pdf

[2] http://www.cs.tut.fi/~tuomasv/demopage.html

Slides from the talk are available here.

Interpreting word embedding arithmetic as set operations
Alexey Rodriguez Yakushev
Tuesday 28th July 2015

Vector length as significance in word embeddings
Benjamin Wilson
Tuesday 28th July 2015 more less

An experimental approach to studying the properties of word embeddings is proposed. Controlled experiments, achieved through modifications of the training corpus, permit the demonstration of direct relations between word properties and word vector direction and length. Written up here.

Attentional models for object recognition in Recurrent Neural Nets.
Dave Kammeyer
Tuesday 14th July 2015 more less

A discussion of an attention-based model for recognizing multiple objects in images, from this paper.

Sparse L1-regularization for Feature Extraction
Vinzenz Schönfelder
Tuesday 16th June 2015 more less

I present the general idea behind regularization in general and its sparse incarnation in particular. As a motivation, I start with the project I worked on during my dissertation. It represents a practical application of sparse regularization for feature selection in the auditory perceptual sciences. By discussing geometric interpretations of L1-regularisation as well as its relation to sparse Bayes-priors, I hope to provide an intuitive understanding of the underlying mechanism. Finally, I discuss standard methods for hyperparameter optimisation in regularised regression.

Slides are available here.

Expectatation Maximisation and Gaussian Mixture Models
Benjamin Wilson
Thursday 21st May 2015 more less

A talk on Expectation Maximisation where Gaussian Mixture Models are considered as an example application. The exposition follows Bishop section 2.6 and Andrew Ng’s CS229 lecture notes. If you weren’t at the seminar, then it is probably better to read one of these instead. Another useful reference is likely the 1977 paper by Dempster et al. that made the technique famous (this is something I would have liked to have read, but didn’t).

Notes are here.

Convolutional Neural Nets NLP
Adriaan Schakel
Wednesday 22nd April 2015

Non-negative Matrix Factorisation
Benjamin Wilson
Thursday 12th February 2015 more less

A talk on non-negative matrix factorization (NMF), its probabilistic interpretation and the optimization problems it poses. I find non-negativity constraints really interesting from the point of view of model interpretability, and NMF is a famous example. Most of us will have see the example of facial image decomposition using NMF before. If you wanted to read something yourself, you could start with Lee and Seungs paper in Nature.

Gaussian processes
Margo K.
Monday 15th December 2014

Negative sampling and noise contrastive estimation in the context of probabilistic language modeling
Adriaan Schakel
Monday 8th December 2014

Deep Belief Nets
Stephen Enright-Ward
Monday 24th November 2014

Energy-based models, inference and Gibbs Sampling
Benjamin Wilson
Monday 27th October 2014 more less

Notes available here.

Latent Dirichlet Allocation
Marija V.
Monday 20th October 2014

Estimator theory and the bias/variance trade-off
Benjamin Wilson
Monday 11th August 2014 more less

A talk on the bias-variance trade-off, including the basics of estimator theory, some examples and maximum likelihood estimation.

Notes are here. See also squared pair-wise differences estimate the variance.

Dependently-typed programming languages
Andreas G.
Friday 8th August 2014 more less

The presentation will be an introduction to dependently typed programming languages. I hope to give a brief glimpse into how advanced type systems allows the programmer to be more explicit in the way he expresses properties of his programs. In fact, the system is rigorous to the point where we can represent logical propositions by types, and further prove our proposition by writing a program of the corresponding type.

Restricted Boltzmann Machines
Stephen Enright-Ward
Monday 21st July 2014

Feasibility of Learning
Marija V.
Monday 30th June 2014 more less

The VC dimension and learning feasibility.

Mathematical properties of Principal Component Analysis
Benjamin Wilson
Monday 19th May 2014 more less

Here are some notes from the talk.

Transfer learning
Heiko Schmidle
Monday 28th April 2014

Word2vec and the Distributional Hypothesis
Benjamin Wilson
Monday 14th April 2014 more less

An overview of word2vec, covering the CBOW learning task, hierarchical softmax, and the fundamental linguistic assumption: the distributional hypothesis.

Here are better slides from a later talk on the same subject.