Berlin Machine Learning Seminar

About

Seminars are generally of a theoretical nature and concern machine learning. It is assumed that seminar members are interested in, and not afraid of, the relevant mathematics.

The idea is to use the seminar as a vehicle for learning — particularly for the speaker. Speakers volunteer to give talks on subjects that they want to force themselves to learn about. As such, seminars can be fundamental or specialised, beginner or less so, depending on what the speaker wants to learn. We are all learners and the atmosphere of the seminar reflects this.

We also host talks from experts in their field.

The seminar is run by Matthias Leimeister and Marcel Ackermann (previously, until October 2019, by Benjamin Wilson). We meet in the offices of Lateral GmbH, in Schöneberg, Berlin.

Please get in touch if you’d like to come along.

  1. Which U-Net is the best? Tackling a Challenge in Brain Lesion Segmentation
    Sebastian Hitzinger
    Wednesday 3rd November 2021
    Biomedical image segmentation is a challenging field, due to large 3D datasets with highly imbalanced data and limited availability of annotated training data. Variations of the fully convolutional U-Net are a common and promising choice for this type of task. This is also reflected in the results of the recent challenge on MS new lesion segmentation (MSSEG-2). However, the question of the optimal hyperparameter setting remains.

    In this talk, we will analyze the highest-ranking methods on the official F1-score leaderboard of the challenge, including our own submission. We will describe their particular design choices and identify the factors behind their success.


    MSSEG-2 challenge website:

  2. Neural networks for Optical Ray Tracing Simulations
    Aditya Seshaditya
    Wednesday 8th September 2021
    Recently, neural networks have been developed to assist several kinds of simulation, such as optical and thermo-mechanical simulations.
    Ray tracing simulations were performed for the development of optical laser systems for High Flash LIDARs, and neural networks were trained to assist these simulations.

    In this talk we would like to cover the main topics related to the above developments, namely

    1. Ray tracing simulations and estimation of point spread function (PSF) parameters using convolutional neural networks (CNNs)
    2. An inverse or invertible neural network framework: mapping from the PSF parameter space to simulation images
    3. An introduction to Physics Inspired Neural Networks (PINNs) for these simulations

  3. Past, present, and future of computational notebooks
    Heiki Schmidle
    Wednesday 2nd June 2021
    In the last few years, the use of computational notebooks has increased dramatically in all parts of data-related work.

    Academics, machine-learning practitioners, data scientists, analysts and business intelligence teams increasingly use notebooks as their tool of choice for many day-to-day tasks.

    When did this computational paradigm start and what were the steps to the current landscape?

    I will shed some light on the past 30 years of human-computer interaction.

    How are notebooks currently used and what are the biggest pain points according to recent research studies?

    I will try to show some design opportunities that might help to solve some of these problems and indicate where computational notebooks are heading.

  4. Uncertainty quantification for image recognition with deep neural networks
    Hanno Gottschalk
    Wednesday 17th March 2021
    In recent years, deep learning methods have outperformed other methods in image recognition. This has sparked interest in potential applications of deep learning technology, including safety-relevant applications like the interpretation of medical images or autonomous driving. The passage from assisting a human decision maker to ever more automated systems, however, increases the need to properly handle the failure modes of deep learning modules. In this contribution, we review a set of techniques for the self-monitoring of machine learning algorithms based on uncertainty quantification. In particular, we apply this to the task of semantic segmentation, where the machine learning algorithm decomposes an image according to semantic categories. We discuss false positive and false negative error modes at instance level and review techniques for the detection of such errors that have been recently proposed by the authors. We also give an outlook on future research directions, including the recognition of out-of-distribution objects and the retrieval of semantic classes outside the semantic space of the training data.

    Links to papers:

    Links to videos:

  5. Attacking Privacy in Neural Networks
    Franziska Boenisch
    Wednesday 3rd March 2021
    Neural networks are increasingly being applied in sensitive domains and on private data. For a long time, no thought was given to what this means for the privacy of the data used for their training. Only in recent years has there emerged an awareness that the process of converting training data into a model is not as irreversible as previously thought. Since then, several specific attacks against privacy in neural networks have been developed. Of these, we will discuss two, namely membership inference and model inversion attacks. First, we will focus on how they retrieve potentially sensitive information from trained models. Then, we will look into several factors that influence the success of both attacks. At the end, we will discuss Differential Privacy as a possible protection measure.

    Bio: Franziska has completed a Master’s degree in Computer Science at Freie Universität Berlin and Technical University Eindhoven. For the past 1.5 years, she has been working at Fraunhofer AISEC as a Research Associate on topics related to Privacy Preserving Machine Learning, Data Protection, and Intellectual Property Protection for Neural Networks. Additionally, she is currently doing her PhD in Berlin.

  6. Towards reliable deep learning
    Florian Wenzel
    Wednesday 17th February 2021
    Deep learning models are bad at detecting their own failures. They tend to make over-confident mistakes, especially under distribution shift. Making deep learning more reliable is important in safety-critical applications including health care, self-driving cars, and recommender systems.
    We discuss two approaches to reliable deep learning. First, we will focus on Bayesian neural networks, which promise improved uncertainty estimation. Why, then, are they rarely used in industrial practice? In this talk, we will cast doubt on the current understanding of Bayes posteriors in deep networks. We show that Bayesian neural networks can be improved significantly through the use of a "cold posterior" that overcounts evidence and hence sharply deviates from the Bayesian paradigm. We will discuss several hypotheses that could explain cold posteriors.
    In the second part, we will discuss a classical approach to more robust predictions: ensembles. Deep ensembles combine the predictions of models trained from different initializations. We will show that the diversity of predictions can be improved by considering models with different hyperparameters. Finally, we present an efficient method that leverages hyperparameter diversity within a single model.
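
    As a reading aid for the "cold posterior" mentioned above, the tempered posterior can be written as follows. This is a sketch in the notation of the paper cited below; the symbols U, T and n are not part of the abstract itself.

```latex
p_T(\theta \mid \mathcal{D}) \propto \exp\!\big(-U(\theta)/T\big),
\qquad
U(\theta) = -\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) - \log p(\theta)
```

    Setting T = 1 recovers the usual Bayes posterior, while T < 1 gives a cold posterior, which sharpens the distribution and in this sense overcounts the evidence.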


    Florian Wenzel et al.: How Good is the Bayes Posterior in Deep Neural Networks Really? ICML 2020, PDF

  7. Intro to Causal Inference and Applications
    Lars Roemheld
    Wednesday 3rd February 2021
    This talk aims to give an introduction to causal inference as a new “old” problem being enriched by machine learning. I will try to cover a lot of ground, from Causality as a framework to think about generalization error and data drift, to learning causal effects from observational data, experiments, and causal structure learning. Using some concrete use cases inspired by my own experience in e-commerce, we will discuss modern ideas such as Orthogonal ML and heterogeneous treatment effects. This talk is intended to be approachable for a wide audience.

  8. Benefits of Assistance over Reward Learning
    Dima Krasheninnikov
    Wednesday 13th January 2021
    Much recent work has focused on how an agent can learn what to do from human feedback, leading to two major paradigms. The first paradigm is reward learning, in which the agent learns a reward model through human feedback that is provided externally from the environment. The second is assistance, in which the human is modeled as a part of the environment, and the true reward function is modeled as a latent variable in the environment that the agent may make inferences about. The key difference between the two paradigms is that in the reward learning paradigm, by construction there is a separation between reward learning and control using the learned reward. In contrast, in assistance these functions are performed as needed by a single policy. By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents based on reward learning. We illustrate these advantages in simple environments by showing desirable qualitative behaviors of assistive agents that cannot be found by agents based on reward learning.


    Rohin Shah et al.: Benefits of Assistance over Reward Learning, NeurIPS 2020 Cooperative AI Workshop, PDF

  9. Progressive Layered Extraction for Multi Task Learning
    Vaibhav Singh
    Wednesday 2nd December 2020
    In this talk, I will explain the paper Progressive Layered
    Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for
    Personalized Recommendations which won the Best Long Paper award in
    RecSys 2020.

    MTL addresses the challenge of learning different tasks, e.g.
    predicting cat breeds and dog breeds, using a single DNN. The basic
    idea in this type of DNN is to share lower-level features across tasks
    so that the model learns some general features. However, such a DNN
    would not work for an unrelated task that does not share the same
    features, such as, in this context, predicting a car model.

    Similarly, in the field of Recommender Systems we would like to
    learn different tasks such as the likelihood of clicking, finishing,
    sharing, favoriting, commenting, etc. Some of these tasks are often
    loosely correlated or conflicting, which may lead to negative transfer.
    Scenarios appear where MTL models improve certain tasks by degrading
    the performance of other tasks (seesaw phenomenon). Some related works
    like cross-stitch networks, sluice networks and multi-gate mixture of
    experts address this problem in various ways.

    The idea behind this paper is to explicitly separate shared and
    task-specific experts to avoid harmful parameter interference. On top
    of this, multi-level experts and gating networks are introduced to fuse
    more abstract representations. Finally, it adopts a novel progressive
    separation routing to model interactions between experts and achieve
    more efficient knowledge transfer between complicatedly correlated
    tasks.

    Speaker info: Vaibhav Singh currently heads machine learning work in
    areas of Fraud Detection, App Personalization and Consumer Growth
    within Klarna.

  10. Langevin Cooling for Domain Translation
    Vignesh Srinivasan
    Wednesday 4th November 2020
    Domain translation is the task of finding correspondence between two domains. Several Deep Neural Network (DNN) models, e.g., CycleGAN and cross-lingual language models, have shown remarkable successes on this task under the unsupervised setting - the mappings between the domains are learned from two independent sets of training data in both domains (without paired samples). However, those methods typically do not perform well on a significant proportion of test samples. In this paper, we hypothesize that many such unsuccessful samples lie at the fringe - relatively low-density areas - of the data distribution, where the DNN was not trained very well, and propose to perform Langevin dynamics to bring such fringe samples towards high-density areas. We demonstrate qualitatively and quantitatively that our strategy, called Langevin Cooling (L-Cool), enhances state-of-the-art methods in image translation and language translation tasks.



  11. Examples of Reinforcement Learning applications in the financial market
    Heuna Kim
    Thursday 22nd October 2020
    Reinforcement Learning has been broadly employed in financial markets over the last few years, because it naturally combines behavior optimization (in this case, buying and selling) with market prediction.
    We will first discuss the hierarchical reinforcement learning scheme deployed by JPMorgan (NIPS Workshop 2018, paper) and then take a look at two other examples of DRL applied to trading.
    The first one (KDD 2019, paper) implements an interpretable network that works similarly to a traditional trading strategy (Buying-Winners-and-Selling-Losers).
    The second one (ICML 2019, paper) extends a traditional mathematical model (the Almgren and Chriss model) to a multi-agent setting in order to optimize a liquidation strategy.


  12. Inductive programming in limited domain: How FlashFill solves string transformation tasks
    Ulaş Türkmen
    Wednesday 7th October 2020
    In this talk, I will take a deep dive into FlashFill, one of the most widely used applications of inductive programming in consumer software. Inductive programming is the generation of programs, or processing of new input, based on a specification over the input-output space. FlashFill was developed to help with programming problems faced by Excel users, where the input and output are both strings. Due to its limited domain, FlashFill can deliver satisfying results with low computational and implementation complexity. It works by using a domain-specific language to generate candidate programs that are valid for a given set of examples, and ranking them using a complexity heuristic to pick the simplest, which is also the most probable. The user can either run the generated mini-program over further input, correcting it by providing more examples if necessary, or get a pseudocode representation of the program.


    - Spreadsheet Data Manipulation Using Examples (pdf)

    - Automating String Processing in Spreadsheets Using Input-Output Examples (pdf)

  13. Semantic Maps
    Johannes Mosig
    Wednesday 5th August 2020
    We investigate a novel word embedding, called "Semantic Map". In contrast to now popular embeddings, the Semantic Map Embedding does not represent words in a vector space, but in something related to a topological space. This allows for new and interesting operations on words and we have some promising preliminary results. All of this is ongoing and as-of-yet unpublished research at Rasa.

  14. Learning robust visual representations using data augmentation invariance
    Alex Hernández-García
    Wednesday 3rd June 2020
    We show that the representations learnt by neural networks trained on image object classification are no more robust (in the sense of representational similarity) to image transformations than the pixel space itself. This contrasts with a well-studied property of the visual cortex: the increasing invariance along the visual ventral stream to identity-preserving transformations. Taking inspiration from this, we propose a learning objective, data augmentation invariance, that encourages robust visual representations. The models trained with data augmentation invariance learn robust representations without detriment to the classification accuracy.




  15. Unsupervised Representation Learning with Contrastive Pre-training
    Oğuz Şerbetci
    Wednesday 6th May 2020
    Recent self-supervised (unsupervised) pre-training methods in NLP have made it possible to transfer learning from large unlabeled datasets to small labeled target datasets. In 2019, contrastive training was introduced in [1], which demonstrated strong performance for image and speech in similar setups. Since then, others have improved on the idea and reached the performance of supervised pre-training [3]. I found the main approach pretty cool and would like to present it along with interesting additions from more recent work.


    [1] Initial paper introducing contrastive self-supervised pre-training for image and speech domains:
    [2] Update from first author with others that improves on it:
    [3] Best performance I'm aware of:

  16. Support vector classifiers
    Stephen Enright-Ward
    Wednesday 22nd April 2020
    I always wanted to understand what support vector classifiers (SVCs) are, but never got around to it. In this general introductory talk, I will remedy this by carefully defining SVCs, paying special attention to explaining the mathematics behind them.

    The seminar will be held remotely:


  17. Modelling of non-linear state space systems using deep neural networks
    Julian Parker
    Wednesday 4th March 2020
    'Virtual analog' (VA) is a common area of study in music-related signal processing research. VA work focuses on constructing digital models of the many analog electronic circuits used in devices such as synthesisers, guitar amplifiers, effects pedals etc. Recently we introduced a new method of modelling such systems by training a neural network to approximate measurements of the many internal signals of such systems, hence reproducing the manifold the system inhabits in its 'state space'. In this talk, I'll give an overview of the method and examples of its application to several circuits. I'll also discuss how the method fits within the wider space of deep-learning techniques.


    Julian D. Parker, Fabián Esqueda, André Bergner: Modelling of non-linear state space systems using deep neural networks. DAFx 2019 pdf

  18. Uncertainty quantification in Reinforcement Learning
    Wednesday 5th February 2020
    Uncertainty quantification is one of the hot research topics in machine learning and stochastic modeling. Despite the existence of many papers presenting different approaches to model and quantify uncertainty in machine learning models, few of them link uncertainty quantification with reinforcement learning. In my opinion, it is of great importance to study this link to understand how inherent model uncertainties affect agent strategies in RL.

  19. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
    Nick Hoff
    Wednesday 15th January 2020
    This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.
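
    For orientation, the mutual information objective and its lower bound described in the abstract can be sketched as below. The notation follows the InfoGAN paper rather than the abstract: V(D, G) is the standard GAN value function, c the structured latent code, z the remaining noise, Q an auxiliary distribution approximating the posterior over c, and λ a weighting hyperparameter.

```latex
\min_{G, Q} \max_{D} \; V_{\text{InfoGAN}}(D, G, Q) = V(D, G) - \lambda \, L_I(G, Q),
\qquad
L_I(G, Q) = \mathbb{E}_{c \sim P(c),\, x \sim G(z, c)}\big[\log Q(c \mid x)\big] + H(c) \;\le\; I\big(c;\, G(z, c)\big)
```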

    Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

  20. Dialogue Transformers
    Vladimir Vlasov
    Wednesday 4th December 2019
    We introduce a dialogue policy based on a transformer architecture, where the self-attention mechanism operates over the sequence of dialogue turns. Recent work has used hierarchical recurrent neural networks to encode multiple utterances in a dialogue context, but we argue that a pure self-attention mechanism is more suitable. By default, an RNN assumes that every item in a sequence is relevant for producing an encoding of the full sequence, but a single conversation can consist of multiple overlapping discourse segments as speakers interleave multiple topics. A transformer picks which turns to include in its encoding of the current dialogue state, and is naturally suited to selectively ignoring or attending to dialogue history. We compare the performance of the Transformer Embedding Dialogue (TED) policy to an LSTM and to the REDP, which was specifically designed to overcome this limitation of RNNs.

    Vladimir Vlasov, Johannes E. M. Mosig, Alan Nichol: Dialogue Transformers. arXiv 2019 pdf


    Introduction to transformers video

  21. Explanations can be manipulated and geometry is to blame
    Pan Kessel
    Thursday 21st November 2019
    Explanations of machine learning decisions have attracted considerable attention in recent years as they aim to make machine decisions more transparent and trustworthy. In this talk, I will briefly review explanation methods for neural networks. I will then show that these methods are surprisingly vulnerable to manipulations. I will explain that this vulnerability can be understood theoretically by using differential geometry. In the last part of the talk, I will explain how these theoretical insights suggest modifications of explanation methods which make them more robust. The talk is based on this paper

  22. Unsupervised machine translation
    Joris Dolderer
    Wednesday 16th October 2019
    Unsupervised machine translation systems require only monolingual text corpora - i.e. no dictionary or parallel sentences - for training. After rapid advances in the last few years, the current state-of-the-art methods [1] achieve BLEU scores comparable to or better than the best supervised systems from 2014. Many of these systems incorporate old-fashioned statistical machine translation (SMT) systems, rather than relying on neural machine translation exclusively. Furthermore, they often start with cross-lingual word or phrase embeddings learned in an unsupervised fashion, using algorithms such as VecMap [2].

    The aim of this talk is to give an overview of how an unsupervised MT system works. For concreteness, I will focus on the system developed by Artetxe et al. [3], which incorporates VecMap.



  23. Neural Networks Don't Learn Default Rules for German Plurals, But That's Okay, Neither Do Germans
    Kate McCurdy
    Tuesday 3rd September 2019
    Can artificial neural networks learn to represent inflectional morphology and generalize to new words as human speakers do? Some linguists have argued that the German number system cannot be modeled without rule-based symbolic computation, because the `default' plural marker, /-s/, is also the least frequent; they claim that speaker preferences for /-s/ in elsewhere conditions, such as novel and phonologically atypical nouns, require representation of linguistic rules and thus cannot be inferred from data alone.

    We present a new dataset of German speakers' production and rating of plural forms for novel nouns, and note that the results provide at best weak support for the claimed `default' status for /-s/, reducing its potential challenge for neural models. Nonetheless, we observe that neural encoder-decoder models, while broadly successful on this `wug' task, show distinctive failure modes suggesting they do not generalize in quite the same manner as human speakers.

  24. A Style-Based Generator Architecture for Generative Adversarial Networks
    Katharina Rasch
    Wednesday 5th June 2019
    You might have seen this face generator that creates creepily
    realistic faces? Let's take a look
    at the paper behind this method and learn how they pulled it off. This
    paper is a nice combination of different techniques that have been
    developed in GANs over the last few years and a chance to learn a bit about
    what has been going on in this field. Some things I will talk about are
    progressive GANs, GAN evaluation metrics, dataset preparation, maybe a
    bit on different loss functions that people use.

    Original paper abstract:


    We propose an alternative generator architecture for generative
    adversarial networks, borrowing from style transfer literature. The
    new architecture leads to an automatically learned, unsupervised
    separation of high-level attributes (e.g., pose and identity when
    trained on human faces) and stochastic variation in the generated
    images (e.g., freckles, hair), and it enables intuitive,
    scale-specific control of the synthesis. The new generator improves
    the state-of-the-art in terms of traditional distribution quality
    metrics, leads to demonstrably better interpolation properties, and
    also better disentangles the latent factors of variation. To quantify
    interpolation quality and disentanglement, we propose two new,
    automated methods that are applicable to any generator architecture.
    Finally, we introduce a new, highly varied and high-quality dataset of
    human faces.

  25. GPyTorch: A scalable approach to Gaussian Processes
    Ludwig Winkler
    Wednesday 15th May 2019
    Gaussian Processes are a class of popular and powerful probabilistic models which can be derived straight from a Normal distribution. While vanilla GPs are very flexible due to their nonparametric formulation, inference and training both rely on evaluating the entire data set through several kernel matrices, their products, log determinants and inverses. Because these kernel matrices span the entire training data set and are used repeatedly in matrix-matrix products, log determinants and inverses, GPs scale poorly to large data sets. The advent of GPUs in machine learning has offered the possibility of computing matrix-vector and matrix-matrix products efficiently and in parallel. GPyTorch introduces several randomized and parallelized algorithms as replacements for exact computations, which reduce the complexity of inference and training in GPs. These include stochastic approximations of the trace, parallel conjugate gradient descent and several efficient applications of eigendecompositions. More importantly, all of these adaptations allow full utilization of parallel hardware during inference and training. Considerable focus will be on conjugate gradient descent, which allows the efficient solution of quadratic optimization problems through the use of line search and conjugate search directions.
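
    To make the last point concrete, here is a minimal NumPy sketch of conjugate gradients applied to a GP-style linear system K x = y. It is not GPyTorch's implementation; the function names, the RBF kernel, the noise level of 0.5 and the toy data are all assumptions chosen for illustration. The solver only touches K through matrix-vector products, which is the property that makes the method attractive on parallel hardware.

```python
import numpy as np


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A, given only v -> A @ v."""
    max_iter = b.shape[0] if max_iter is None else max_iter
    x = np.zeros_like(b, dtype=float)
    r = b - matvec(x)          # residual = negative gradient of 0.5 x'Ax - b'x
    p = r.copy()               # first search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)        # exact line search along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p    # new direction, conjugate to earlier ones
        rs_old = rs_new
    return x


def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential kernel matrix (illustrative choice of kernel)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.exp(-0.5 * (diff ** 2).sum(axis=-1) / lengthscale ** 2)


# Toy regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sin(X[:, 0])

K = rbf_kernel(X) + 0.5 * np.eye(50)     # kernel matrix plus observation noise
alpha_cg = conjugate_gradient(lambda v: K @ v, y, max_iter=200)
alpha_exact = np.linalg.solve(K, y)      # direct solve, for comparison only
```

    The vector alpha_cg plays the role of K^{-1} y in the GP predictive mean; comparing it with alpha_exact checks that the iterative solve agrees with a direct one on this small problem.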

  26. Reinforcement Learning as Probabilistic Inference
    Mathias Schmerling
    Wednesday 27th March 2019
    One of the most notable challenges facing (Deep) Reinforcement Learning algorithms, especially when advancing into real-world robotics applications, is their lack of stability: state-of-the-art algorithms are very sensitive to the choice of hyperparameters and asymptotic performance can vary widely across random seeds. A modification of the standard RL framework called Maximum Entropy Reinforcement Learning holds the promise to address this. In maximum entropy RL, the agent optimizes a tradeoff between the expected reward and entropy of its policy, i.e. tries to solve the task while behaving as randomly as possible.
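
    The tradeoff described above is commonly written as the objective below. This is a sketch in standard notation, not taken from the abstract: τ is a trajectory sampled by following policy π, r the reward, and α an assumed temperature parameter weighting the entropy term (it is often set to 1).

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{T} \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log \pi(a \mid s)\big]
```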

    While often framed as a purely algorithmic modification, surprisingly, maximum entropy RL can be identified with probabilistic inference in a particular graphical model. Drawing a link between Reinforcement Learning and Probabilistic Inference is interesting as graphical models allow for intuitive reasoning about partial observability, compositionality and (hierarchical) latent variables. It stands to reason that a probabilistic inference perspective can inspire new algorithms that are able to address the stability issues of Deep RL. This talk will explain the relation between RL and Probabilistic Inference and is based on "Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review" by Sergey Levine.

  27. Inferring evolutionary histories with maximum likelihood in tree space
    Benjamin Wilson
    Wednesday 27th February 2019
    This talk discusses the problem of "phylogenetic inference", i.e. the task of inferring evolutionary history from available (typically present day) genetic samples. No background in biology is assumed -- we'll just be interested in the inference method.
    We'll cover:

    1. the Jukes-Cantor model of DNA evolution as a continuous time Markov chain;

    2. the application of maximum likelihood to determine the edge lengths in an evolutionary tree, given DNA sequence data;

    3. the combinatorial tree modifications that are used to move through the "space" of all possible trees.

    One of the most interesting aspects (to me) is that the optimisation takes place in a space that is both continuous (you can change the edge lengths on a tree smoothly) and combinatorial (the collection of possible tree shapes, i.e. disregarding the edge lengths, is discrete). So what's the best way to do it?
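
    As a small illustration of the first two points, the sketch below implements the Jukes-Cantor transition probabilities and the maximum likelihood edge length for the simplest possible tree: two aligned sequences joined by a single edge. The function names and the two toy sequences are made up for the example, and edge length is measured in expected substitutions per site. Full tree inference is not attempted here.

```python
import math


def jc_transition_prob(same, t):
    """Jukes-Cantor probability that a site ends in a given base after time t.

    `same` is True when the end base equals the start base; t is the edge
    length in expected substitutions per site.
    """
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def log_likelihood(seq_a, seq_b, t):
    """Log-likelihood of two aligned sequences joined by one edge of length t."""
    # Each site contributes (stationary frequency 1/4) * (transition probability).
    return sum(
        math.log(0.25 * jc_transition_prob(a == b, t))
        for a, b in zip(seq_a, seq_b)
    )


def ml_edge_length(seq_a, seq_b):
    """Closed-form maximiser of log_likelihood over t for the two-sequence case."""
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        return math.inf   # sequences look saturated; the likelihood has no finite maximum
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned sequences that differ at 2 of 10 sites.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTCGTAA"
t_hat = ml_edge_length(seq_a, seq_b)
```

    On a tree with more than two leaves there is no closed form, so each edge length is optimised numerically while the tree shape is changed by combinatorial moves, which is the mixed continuous and discrete search the abstract describes.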

  28. Inference over graph structures using a diffusion kernel
    Charley Wu
    Wednesday 5th December 2018
    Many types of problems require highly structured representations, which can be better described using graph structures rather than vector space representations. In this talk, I will describe a method for performing inference over graph structures using the Gaussian Process regression framework. I will introduce the Diffusion Kernel, which provides a method for estimating the covariance structure of any weighted or unweighted undirected graph, based on its adjacency matrix. The Diffusion Kernel can be understood as an extension of the Successor Representation for Bayesian inference, allowing us to compute predictive posterior estimates of the expected value and underlying uncertainty of unobserved states. I will provide a tutorial on how to use the Diffusion Kernel for performing inference, and draw connections to a variety of Machine Learning problems, including re-use of learned structures in Reinforcement Learning problems.
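
    A minimal sketch of the idea, assuming one common definition of the diffusion kernel, K = exp(-βL), where L = D - A is the graph Laplacian. The function names, the value of β, the noise level and the five-node path graph are illustrative assumptions rather than details from the talk.

```python
import numpy as np


def diffusion_kernel(adjacency, beta=1.0):
    """K = exp(-beta * L) for the graph Laplacian L = D - A of an undirected graph."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)          # L is symmetric
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T


def gp_posterior(K, observed_idx, y, noise=1e-2):
    """GP regression posterior mean and variance at every node of the graph."""
    K_oo = K[np.ix_(observed_idx, observed_idx)] + noise * np.eye(len(observed_idx))
    K_ao = K[:, observed_idx]
    mean = K_ao @ np.linalg.solve(K_oo, y)
    cov = K - K_ao @ np.linalg.solve(K_oo, K_ao.T)
    return mean, np.diag(cov)


# Path graph 0 - 1 - 2 - 3 - 4 with observations at the two end nodes.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

K = diffusion_kernel(A, beta=1.0)
mean, var = gp_posterior(K, observed_idx=[0, 4], y=np.array([1.0, -1.0]))
```

    The posterior mean interpolates between the two observed values along the path, and the posterior variance is largest at the middle node, which is furthest from any observation.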

  29. The Transformer
    Adriaan Schakel
    Thursday 15th November 2018
    The attention mechanism introduced by Dzmitry Bahdanau et al. [1] in the context of neural machine translation has found application in a wide range of tasks in natural language processing [2,3] and beyond. Recently, Alec Radford et al. [4] proposed to use the mechanism to effectively train a language model on a corpus of unlabeled text. They demonstrated that their general, task-agnostic pre-trained model outperforms discriminatively trained models with architectures specifically crafted for the task at hand.
    This talk introduces the attention mechanism, provides an intuitive application in the context of computer vision, and details the specific iteration of the mechanism used in the Radford paper, viz. the transformer decoder [3].

    1. Neural machine translation by jointly learning to align and translate, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.

    2. Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.

    3. Generating Wikipedia by Summarizing Long Sequences, Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer

    4. Improving Language Understanding by Generative Pre-Training, Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Their code can be found here.
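
    The core computation behind the attention mechanism discussed in this talk can be sketched in a few lines. The example below is a single-head, scaled dot-product self-attention with a causal mask, as used in decoder-style models. It leaves out the learned query, key and value projections, and the input `X` is random data, so treat it as an assumption-laden illustration rather than the architecture of any cited paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention where position i attends only to positions <= i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out future positions (strict upper triangle) before the softmax.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Hypothetical input: 4 positions with 8-dimensional representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, weights = causal_self_attention(X, X, X)
```

    Each row of `weights` is a probability distribution over the current and earlier positions, so the first output row is simply the first value vector.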

  30. On Laplacian Eigenmaps for Dimensionality Reduction
    Juan Camilo Orduz
    Thursday 4th October 2018
    The aim of this talk is to describe a non-linear dimensionality reduction algorithm based on spectral techniques introduced in Belkin & Niyogi (2003).
    The goal of non-linear dimensionality reduction algorithms, so-called manifold learning algorithms, is to construct a representation of data on a low-dimensional manifold embedded in a high-dimensional space. The particular case we are going to present exploits various relations between geometric and spectral methods (discrete and continuous). Spectral methods are sometimes motivated by Mark Kac's question "Can One Hear the Shape of a Drum?", which refers to the idea of recovering geometric properties from the eigenvalues (spectrum) of a matrix (linear operator). Concretely, the approach followed in Belkin & Niyogi (2003) is founded on the spectral analysis of the graph Laplacian of the adjacency graph constructed from the data (von Luxburg 2007). The motivation for the construction relies on its continuous-limit analogue, the Laplace-Beltrami operator, which provides an optimal embedding for manifolds. We will also indicate the relation with the associated heat kernel operator (Ham. 2004). Instead of following a purely formal approach, we will present the main geometric and computational ideas of the algorithm. Hence, with basic knowledge of linear algebra (eigenvalues) and differential calculus, you will be able to follow the talk. Finally, we will show a concrete example in Python using scikit-learn.
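
    The algorithm outlined above can be sketched with NumPy alone. This is a simplified version under stated assumptions: a symmetrised k-nearest-neighbour graph with 0/1 weights (instead of heat-kernel weights), a hypothetical helix dataset, and illustrative function names. It is not the scikit-learn example from the talk.

```python
import numpy as np

def laplacian_eigenmaps(X, n_neighbors=5, n_components=2):
    """Embed the rows of X using the bottom non-trivial eigenvectors of the graph Laplacian."""
    n = X.shape[0]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Symmetric k-nearest-neighbour adjacency with 0/1 weights (column 0 is the point itself).
    neighbors = np.argsort(sq_dists, axis=1)[:, 1:n_neighbors + 1]
    W = np.zeros((n, n))
    for i in range(n):
        W[i, neighbors[i]] = 1.0
    W = np.maximum(W, W.T)
    d = W.sum(axis=1)
    # Solve the generalised problem L y = lambda D y via the symmetric normalised Laplacian.
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    # Skip the trivial eigenvector belonging to eigenvalue 0.
    embedding = d_inv_sqrt[:, None] * eigvecs[:, 1:n_components + 1]
    return embedding, eigvals

# Hypothetical data: 60 points along a helix in three dimensions.
t = np.linspace(0.0, 4.0 * np.pi, 60)
X = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])
embedding, eigvals = laplacian_eigenmaps(X, n_neighbors=5, n_components=2)
```

    In scikit-learn, the class `sklearn.manifold.SpectralEmbedding` provides a ready-made implementation of this family of methods.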

  31. An Introduction to Homomorphic Encryption
    Stephen Enright-Ward
    Wednesday 19th September 2018
    Classically, one must decrypt encrypted data in order to process it usefully. This restriction forces us to trust the security of third party computational resources, e.g. in the cloud. In the last decade, Gentry and others have invented new methods enabling arbitrary computations on encrypted data, yielding encrypted outputs whose decryptions are the “correct” answers — i.e. the output of the same computation on the original plain text. Such schemes are called “homomorphic encryption”, since encryption and function evaluation commute in this way. I’ll review a simple homomorphic encryption scheme, and explain how and why it works.

    No prior reading required, but if you'd like a head-start, here is an introduction from Gentry

    Handwritten notes.
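
    To make the phrase "encryption and function evaluation commute" concrete, here is a toy symmetric scheme over the integers, in the spirit of the simple "somewhat homomorphic" constructions from the literature. It is not necessarily the scheme reviewed in the talk, the parameter sizes are arbitrary assumptions, and it is far too small to be secure.

```python
import random

def keygen(bits=64, rng=random):
    """Secret key: a large odd integer p."""
    return rng.getrandbits(bits) | (1 << (bits - 1)) | 1

def encrypt(p, bit, rng=random, noise_bits=8, q_bits=128):
    """Hide a bit as p*q + 2*r + bit, with random q and small random noise r."""
    q = rng.getrandbits(q_bits)
    r = rng.getrandbits(noise_bits)
    return p * q + 2 * r + bit

def decrypt(p, ciphertext):
    """Reduce mod p to strip the multiple of p, then mod 2 to strip the noise."""
    return (ciphertext % p) % 2

rng = random.Random(0)
p = keygen(rng=rng)
c0, c1 = encrypt(p, 0, rng=rng), encrypt(p, 1, rng=rng)

# Adding ciphertexts XORs the plaintext bits; multiplying them ANDs the bits.
xor_result = decrypt(p, c0 + c1)
and_result = decrypt(p, c1 * c1)
```

    Each operation grows the noise term, and decryption stays correct only while the noise remains smaller than p. Keeping that noise under control for arbitrary computations is the hard part that this toy example ignores.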

  32. Surface Realization with Neural Sequence-to-Sequence Inflection and Incremental Locality-Based Linearization
    David King
    Wednesday 18th July 2018
    Surface realization is a nontrivial task as it involves taking structured data and producing grammatically and semantically correct utterances. Many competing grammar-based and statistical models for realization still struggle with relatively simple sentences. For our submission to the 2018 Surface Realization Shared Task, we tackle the shallow task by first generating inflected wordforms with a neural sequence-to-sequence model before incrementally linearizing them. For linearization, we use a global linear model trained using early update that makes use of features that take into account the dependency structure and dependency locality. Using this pipeline sufficed to produce surprisingly strong results in the shared task. In future work, we intend to pursue joint approaches to linearization and morphological inflection and to incorporate a neural language model into the linearization choices.

  33. Authorship modeling and prediction using latent topics (or LDA)
    Martin Stamenov
    Wednesday 18th July 2018
    The advent of faster personal computers and tools such as automatic differentiation has enabled deep learning software like Theano. Software like this leads to wide adoption of deep learning, which in turn leads to an explosion in interest and new discoveries in the field. In a similar fashion, fast inference in probabilistic models with high quality and user-friendly implementations can lead to more interest and hence more research in Bayesian data analysis.

    The main algorithm in this research is a fairly recent extension of LDA, the author-topic model. It promises to give data scientists a tool to simultaneously gain insight about authorship and content in terms of topics. The authors can represent many kinds of metadata attached to documents, for example, tags on posts on the web. The model can be used for data exploration, as features in machine learning pipelines, for author (or tag) prediction, or to simply leverage your topic model with existing metadata. Starting with the recent implementation of the author-topic model in Gensim, we build on top of this work by creating a new feature, which allows inference of topics on a new collection of corpus data. We use it as a querying tool for our training data in terms of finding similar authors or “tags”, by means of distance functions. We measure its author-predictive performance on two separate datasets. The research ended with an accepted and merged pull request and a Jupyter notebook [1] tutorial in Gensim’s GitHub repository, as this functionality seemed to be of interest to many [2] people.


  34. Machine learn-a-thon
    Sunday 17th June 2018
    It's like a hack-a-thon, but for our machine learning ideas and projects. Let's choose something we want to learn about, break up into small groups (or go solo) and just do our best to learn about it for a day. It could be some theory, or a paper, or some code, whatever you like. We will teach and learn from each other, so please add to the spreadsheet below any topics you would like to learn about, and what help you could offer to others. It will be collaborative, not competitive, so there are no prizes. You can give a short presentation at the end of the day, if you like, or not, if you don’t. Afterwards, we’ll have dinner together. Hoping to see you there!

  35. Neural Machine Translation - Better pay attention when you’re translating
    Ludwig Winkler
    Wednesday 23rd May 2018
    Machine translation made a big step forward through the use of deep recurrent neural networks paired with an attention mechanism. Motivated by the recent advances, I will explain how the sequence-to-sequence neural machine translation system used in Google Translate works, what role the attention mechanism plays in it, and how zero-shot learning for translation is possible with multilingual translation systems.

    Furthermore, I will explain the ‘Transformer’ architecture, which relies solely on attention and eliminates sequential dependencies while processing information.

    A great article that sparked my interest in NMT a while back can be found here:

    The papers which are the basis for this talk are:

  36. Logistic regression on Riemannian manifolds
    Matthias Leimeister
    Wednesday 9th May 2018
    In classification problems, item features are usually represented as a vector in Euclidean space. Common classifiers such as logistic regression or support vector machines can then be trained to distinguish two or more classes using these features. For some data sets, however, there is a more natural representation of the data on a lower dimensional manifold. For example, L2-normalized feature vectors are naturally points on a unit sphere. Graph and word embeddings in hyperbolic space, which have gained much attention recently, provide another example [1].

    In general, Riemannian manifolds have different geometric properties than flat Euclidean space. For example, the shortest path between two points, also called a geodesic, might not be a straight line when viewed from the ambient space. Also, in the Euclidean setting the scalar product is often used as a similarity function between two vectors. On a Riemannian manifold, a scalar product is only defined within the tangent space at each point, therefore requiring some other way to relate the points to each other. This makes it necessary to rethink the formulation of a linear classifier in terms of these geometric structures. Lebanon and Lafferty [2] propose to reformulate logistic regression on the unit sphere by interpreting the loss function geometrically and transforming it to the manifold setting.

    In the talk, I will present some basic concepts from Riemannian geometry, introduce the classifier used by Lebanon and Lafferty and outline an extension for hyperbolic space.

    [1] Maximilian Nickel, Douwe Kiela: Poincaré Embeddings for Learning Hierarchical Representations. NIPS 2017.
    [2] Guy Lebanon, John Lafferty: Hyperplane Margin Classifiers on the Multinomial Manifold. ICML 2004.

    Notebook Notes
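
    As a small illustration of why the geometry matters, the sketch below compares the geodesic (great-circle) distance between two L2-normalised vectors with the straight-line distance through the ambient space. The vectors and helper names are made up for the example, and this is not the classifier of Lebanon and Lafferty.

```python
import math

def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def geodesic_distance(u, v):
    """Length of the shortest path on the unit sphere: the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error

def chord_distance(u, v):
    """Euclidean distance measured through the ambient space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = normalize([3.0, 0.0, 0.0])
v = normalize([0.0, 2.0, 0.0])
on_sphere = geodesic_distance(u, v)    # a quarter of a great circle
through_space = chord_distance(u, v)   # the straight line cuts through the ball
```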

  37. Bayesian compression of Deep Neural Networks
    Simon Wiedemann
    Wednesday 25th April 2018
    Every ML practitioner knows that neural networks are state-of-the-art in a wide spectrum of tasks. However, their large number of weights and computations hinders their application in real-world scenarios, in particular when they need to be deployed on resource-constrained devices. Therefore, compression and efficiency have recently become an active topic of interest in the deep learning community.
    Motivated by Bayes' theorem and the minimum description length principle, in this talk I will present variational dropout, a powerful technique that regularises the training of deep neural networks towards low complexity. The series of papers that developed this technique has shown that: 1) variational dropout can be viewed as a generalisation of the famous dropout method and thus performs as well as or better than it, 2) the resulting trained networks can be highly compressed. In fact, the authors report that they can prune away up to 99.5% of the weights of LeNet5, or compress the VGG network by a factor of 95.

    Slides from the talk are available here.


  38. Hierarchical Reinforcement Learning with Options Framework
    Oğuz Şerbetci
    Wednesday 4th April 2018
    Despite recent breakthrough applications of Reinforcement Learning, temporal abstraction remains a long-standing goal. Instead of making decisions at atomic time-steps, temporal abstractions allow a reinforcement learning agent to act at different time-scales. For instance, while navigating a room an agent needs to make the high-level decision regarding which door to use, but it also needs to choose low-level actions: the steps required to reach the door. In this talk, we will look at an approach from the RL literature called the Options Framework (Sutton, Precup, Singh 1999) and its recent extension, the Option-Critic Architecture (Bacon, Harb, Precup 2016).

    You do not have to possess a deep understanding of Reinforcement Learning as the basics will be explained when needed.

    The Options Framework

    The Option-Critic Architecture

  39. Breaking simple substitution ciphers with simple algorithms
    Stephen Enright-Ward
    Thursday 22nd March 2018
    Some historical texts are now unreadable, either because the script, or the underlying language, or both, are unknown. If a record of the language still exists but the script is unfamiliar (for example, because the text is written in German, but in an unfamiliar alphabet), then reading the text is a two-step decipherment problem: first recognise the language, then map the unknown script to a familiar one. This applies both when the script was once widely used but is now forgotten, and when the script is a deliberate encryption scheme. I will discuss naive statistical algorithms to decipher such texts, applicable only to simple encryption techniques. Such techniques were used successfully by Knight, Megyesi and Schaefer to break the Copiale Cipher in 2011, and have also been applied, so far unsuccessfully, to the Voynich Manuscript.
    This talk contains no original work and no machine learning. Notes are available here.
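
    The most naive statistical attack on a simple substitution cipher matches letter-frequency ranks: the most common cipher symbol is guessed to be the most common letter of the language, and so on. The sketch below shows only this first step, which on real texts is usually refined by further search. The reference frequency order is an approximation assumed for English, and the tiny example text is constructed so that the ranks are unambiguous.

```python
from collections import Counter

# Approximate order of English letters from most to least frequent (an assumption).
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"

def guess_key(ciphertext, reference=ENGLISH_BY_FREQUENCY):
    """Map cipher letters to plaintext letters by matching frequency ranks."""
    counts = Counter(ch for ch in ciphertext if ch.isalpha())
    ranked = [ch for ch, _ in counts.most_common()]
    return dict(zip(ranked, reference))

def decipher(ciphertext, key):
    """Apply a guessed key, leaving unknown symbols untouched."""
    return "".join(key.get(ch, ch) for ch in ciphertext)

# Constructed example: "eeeee tttt aaa oo i" with every letter shifted by three.
ciphertext = "hhhhh wwww ddd rr l"
key = guess_key(ciphertext)
plaintext = decipher(ciphertext, key)
```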

  40. Recurrent Neural Network Grammars
    Kate McCurdy
    Tuesday 16th January 2018
    Dyer, C., Kuncoro, A., Ballesteros, M., & Smith, N. A. (2016). Recurrent Neural Network Grammars. arXiv:1602.07776 [Cs].
    Published in NAACL 2016.
    Retrieved from

    We introduce recurrent neural network grammars, probabilistic models of sentences with explicit phrase structure. We explain efficient inference procedures that allow application to both parsing and language modeling. Experiments show that they provide better parsing in English than any single previously published supervised generative model and better language modeling than state-of-the-art sequential RNNs in English and Chinese.

  41. Learning embeddings in hyperbolic spaces
    Benjamin Wilson
    Thursday 30th November 2017
    Consider the problem of embedding a graph in two-dimensional Euclidean space. Nodes should be close together in the embedding space precisely when they are joined by an edge in the graph (where "close" means e.g. distance <= 1). How would you embed, say, a binary tree? One space-efficient way is to place the root at the origin, and arrange the nodes of depth N on a circle of radius N centred at the root. But things get crowded fast: there are 2^N nodes of depth N, and they have to fit around a circle whose circumference is only linear in N. In fact, already for small N, the distances between the neighbouring nodes on the Nth circle are <=1, and so these nodes are implicitly joined by an edge. But then you won't have a tree anymore! In order to proceed, you'll either need to work in a higher dimensional Euclidean space, or work in a space where circles have a longer circumference ...

    In hyperbolic space, the circumference of a circle grows exponentially with the radius, so it might be a great place for embedding trees, or indeed for embedding data that has a hierarchical structure. In this talk, we'll introduce a model or two for hyperbolic space and talk about how to train embeddings there, and compare this to the problem of embedding points on a sphere, where things are much easier to visualise.
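
    The crowding argument can be checked numerically. A circle of radius r has circumference 2*pi*r in the Euclidean plane and 2*pi*sinh(r) in the hyperbolic plane of curvature -1. The sketch below divides each circumference by the 2^N nodes of depth N to get the arc length available per node. The function names are illustrative only.

```python
import math

def euclidean_circumference(r):
    return 2.0 * math.pi * r

def hyperbolic_circumference(r):
    """Circumference of a circle of radius r in the hyperbolic plane (curvature -1)."""
    return 2.0 * math.pi * math.sinh(r)

def arc_length_per_node(depth, circumference):
    """Space along the circle of radius `depth` for each of the 2**depth tree nodes."""
    return circumference(depth) / 2 ** depth

euclidean_spacing = [arc_length_per_node(n, euclidean_circumference) for n in range(1, 11)]
hyperbolic_spacing = [arc_length_per_node(n, hyperbolic_circumference) for n in range(1, 11)]
```

    In the Euclidean case the spacing shrinks towards zero, while in the hyperbolic case it grows, because sinh(N) grows like e^N / 2 and e > 2.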


  42. Learning Semantic Hierarchies via Word Embeddings
    Janna Lipenkova
    Tuesday 7th November 2017
    In this talk, I will first outline the basic concepts of ontology structures, use cases and construction methods for ontologies. Then, I will go into detail on the paper "Learning Semantic Hierarchies via Word Embeddings" by Fu et al. (ACL 2014) which proposes a new method for the construction of semantic hierarchies based on word embeddings, which can be used to measure the semantic relationship between words. The method identifies whether a candidate word pair has a hypernym–hyponym relation by using the word-embedding-based semantic projections between words and their hypernyms.

  43. Mapping paired representations to common spaces with CCA
    Stephen Enright-Ward
    Wednesday 25th October 2017
    Sometimes data scientists have different vectorial representations of the same thing, and wish to interrelate them. For example, a bilingual German/English newspaper might model each article with a pair of vectors: one for the German version, and one for the English. If the two text-to-vector models have been learned independently, the German and English vectors for each article will be dissimilar in general, but related because they represent the same story. Canonical Correlation Analysis, or CCA, is a mathematical technique that both quantifies such relationships, and provides an algorithm for mapping the two sets of vectors into a common space, such that the paired vectors become similar. I will explain what CCA is and how it works.

    Notes are available here.
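
    One standard way to compute CCA is to whiten each set of vectors and take a singular value decomposition of the whitened cross-covariance. The sketch below does this with NumPy on synthetic data that share a hidden signal `z`. The data, the small regularisation term and the function name are assumptions for illustration.

```python
import numpy as np

def cca(X, Y, n_components=1, reg=1e-8):
    """Return projection matrices A, B and the canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Inverse square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return (V / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :n_components]
    B = Wy @ Vt.T[:, :n_components]
    return A, B, s[:n_components]

# Synthetic paired data: both views contain a noisy copy of the same hidden signal z.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([rng.normal(size=n), -z + 0.1 * rng.normal(size=n), rng.normal(size=n)])

A, B, correlations = cca(X, Y)
x_proj = (X - X.mean(axis=0)) @ A[:, 0]
y_proj = (Y - Y.mean(axis=0)) @ B[:, 0]
```

    After projection, the paired one-dimensional coordinates `x_proj` and `y_proj` are highly correlated even though the raw vectors have different dimensions and look unrelated coordinate by coordinate.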

  44. Fooling neural networks
    Katharina Rasch
    Wednesday 13th September 2017
    It is surprisingly easy to fool deep neural networks (and other machine learning models). Change a few pixels in an image (changes not visible to humans!) and suddenly a zebra is classified with high confidence as a microwave. Let's look at some methods for generating such adversarial examples and discuss what this could tell us about what our models are actually learning.

    Here is a good read if you can't make it.
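
    One well-known way to generate adversarial examples is the fast gradient sign method: move every input coordinate by a small step in the direction that increases the loss. The sketch below applies it to a hand-written logistic regression model rather than a deep network. The weights and the input are made-up numbers, and the talk may cover different methods.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(g):
    return (g > 0) - (g < 0)

def fgsm(w, b, x, y, eps):
    """Shift each coordinate of x by eps in the direction that increases the loss."""
    p = predict(w, b, x)
    # For cross-entropy loss, the gradient with respect to the input is (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical model and input, correctly classified as class 1.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.2, -0.1, 0.1], 1
x_adv = fgsm(w, b, x, y, eps=0.2)
```

    No coordinate moves by more than `eps`, yet the predicted class flips. In high-dimensional inputs such as images, many tiny coordinate-wise changes add up in the same way.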

  45. Generative classification and spotting outliers
    Alan Nichol
    Thursday 24th August 2017
    Classifiers like SVMs and NNs don't give us a real, meaningful probability along with their prediction. They also lack the capacity to recognise 'outlier' data as not belonging to any class in particular. I posit that we need generative models to do this, and we'll try to build a Chinese Restaurant Process + Gaussian mixture model that can (i) model the full data distribution and hence give meaningful probability estimates and (ii) incorporate outliers into new classes.
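
    The Chinese Restaurant Process is what lets such a mixture model open new classes as data arrives: each new point joins an existing cluster with probability proportional to its size, or starts a new one with probability proportional to a concentration parameter alpha. The sketch below samples only this prior over partitions and leaves out the Gaussian likelihoods. The value of alpha and the names are assumptions.

```python
import random

def sample_crp(n_customers, alpha, rng):
    """Sample table assignments from a Chinese Restaurant Process prior."""
    table_sizes = []
    assignments = []
    for _ in range(n_customers):
        # An existing table is chosen in proportion to its size, a new one in proportion to alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[k] += 1
        assignments.append(k)
    return assignments, table_sizes

assignments, table_sizes = sample_crp(n_customers=100, alpha=1.0, rng=random.Random(0))
```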

  46. Using multi-armed bandits for dealing with the cold start problem
    Till Breuer
    Wednesday 9th August 2017
    Bandit algorithms can be used in the context of recommendation systems to reasonably fill the knowledge gap at cold start. Throughout this talk, at least one simple bandit approach will be demonstrated using the example of a playful swipe app for exploring the car offerings of an Austrian car reselling platform.
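
    A very simple bandit approach is epsilon-greedy: mostly show the item with the best observed reward, but explore a random item a small fraction of the time. The sketch below simulates this for three hypothetical items with made-up "like" probabilities. It is not the approach or data from the talk.

```python
import random

class EpsilonGreedy:
    def __init__(self, n_arms, epsilon, rng):
        self.counts = [0] * n_arms    # how often each arm was shown
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.epsilon = epsilon
        self.rng = rng

    def select(self):
        """Explore with probability epsilon, otherwise exploit the best estimate."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        """Incrementally update the mean reward of the chosen arm."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

rng = random.Random(0)
like_probability = [0.1, 0.5, 0.8]  # hypothetical, unknown to the algorithm
bandit = EpsilonGreedy(n_arms=3, epsilon=0.1, rng=rng)
for _ in range(2000):
    arm = bandit.select()
    reward = 1.0 if rng.random() < like_probability[arm] else 0.0
    bandit.update(arm, reward)
```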

  47. Locality-Sensitive Hashing
    Aaron Levin
    Wednesday 26th July 2017
    In this talk we’ll explore locality-sensitive hashing, a technique to turn the computationally expensive exact nearest-neighbor search problem into an inexpensive approximate solution (it’s a neat trick and I promise you’ll love it). We’ll see how locality-sensitive hashing is used in image search, recommendations, and other machine learning problems. And of course, we’ll mention deep hashing, because why not?
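
    One common locality-sensitive hash for cosine similarity uses random hyperplanes: each hyperplane contributes one bit, recording on which side of it a vector lies, so vectors separated by a small angle tend to share a signature. The sketch below is a minimal single-table version with made-up vectors. Practical systems use several tables, and the talk may present other hash families.

```python
import random

def random_hyperplanes(n_planes, dim, rng):
    """Draw normal vectors of random hyperplanes through the origin."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def signature(vector, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(int(sum(p * x for p, x in zip(plane, vector)) >= 0.0) for plane in planes)

def build_index(vectors, planes):
    """Group item ids by signature so that lookups only scan one bucket."""
    buckets = {}
    for item_id, vector in vectors.items():
        buckets.setdefault(signature(vector, planes), []).append(item_id)
    return buckets

def candidates(query, planes, buckets):
    """Approximate nearest-neighbour candidates: items in the query's bucket."""
    return buckets.get(signature(query, planes), [])

planes = random_hyperplanes(n_planes=8, dim=3, rng=random.Random(0))
vectors = {
    "a": [1.0, 0.2, 0.0],
    "a_scaled": [2.0, 0.4, 0.0],  # same direction as "a"
    "b": [-1.0, -0.2, 0.0],       # opposite direction
}
buckets = build_index(vectors, planes)
matches = candidates(vectors["a"], planes, buckets)
```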

  48. Hierarchical softmax
    Benjamin Wilson
    Wednesday 26th July 2017
    Hierarchical softmax is an alternative to the softmax that scales well in the number of possible classification outcomes. It has been applied in several word embedding models, where the task is predicting a missing vocabulary word, given the context. Indeed, together with negative sampling, it is what has made large-corpus word embeddings computationally viable. It works via a really neat trick that begins by choosing an arbitrary binary tree whose leaves are the outcomes of the classification. I'll explain to you exactly how it works and explore some questions about the effect of the choice of tree on the classification quality.

    Notes from the talk are written up as a blog post.
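
    The trick can be sketched as follows: every internal node of the tree holds a parameter vector, and the probability of a leaf is the product of sigmoid "turn left or right" decisions along the path from the root. The example assumes a complete binary tree stored in heap order and random parameters. These are illustrative choices and not taken from the blog post.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_probability(leaf, depth, node_vectors, h):
    """Probability of a leaf in a complete binary tree with internal nodes in heap order (root = 1)."""
    prob = 1.0
    node = 1
    for level in reversed(range(depth)):
        go_right = (leaf >> level) & 1
        score = sum(a * b for a, b in zip(node_vectors[node], h))
        # sigmoid(score) + sigmoid(-score) == 1, so each node splits its probability mass.
        prob *= sigmoid(score) if go_right else sigmoid(-score)
        node = 2 * node + go_right
    return prob

rng = random.Random(0)
depth, dim = 3, 4  # 8 outcomes, 4-dimensional context vector (assumed sizes)
node_vectors = {node: [rng.gauss(0.0, 1.0) for _ in range(dim)] for node in range(1, 2 ** depth)}
h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
probs = [leaf_probability(leaf, depth, node_vectors, h) for leaf in range(2 ** depth)]
```

    Evaluating one outcome touches only `depth` nodes, that is, logarithmically many in the number of outcomes, and the leaf probabilities sum to one without an explicit normalisation.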

  49. Grammatical gender associations outweigh topical gender bias in cross-linguistic word embeddings
    Kate McCurdy
    Wednesday 5th July 2017
    Recent research has demonstrated that the relations captured in vector space semantic models can reflect undesirable biases in human culture. Our investigation of cross-linguistic word embeddings reveals that topical gender bias interacts with, and is surpassed in magnitude by, the effect of grammatical gender associations, and both may be attenuated by corpus lemmatization.

  50. Minsky and Papert's "Perceptrons"
    Benjamin Wilson
    Wednesday 7th June 2017
    Perceptrons are simple neural networks with a single layer of learned weights. Despite their simplicity, their invention (by Rosenblatt in 1958) brought about a first great wave of optimism for connectionist methods in artificial intelligence. This optimism was brought to a halt by the book of Minsky and Papert, published in 1969, which proved a number of interesting results about the sorts of functions that can and cannot be represented by a perceptron. In this talk, we'll review these results and the interesting methods used to prove them (elementary group theory). We'll also take the opportunity to talk about the perceptron learning algorithm, which uses no knowledge of the error function beyond its evaluation at the training examples - in particular, it doesn't need gradients.

    Notes (pages 1-8)
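
    The perceptron learning algorithm itself fits in a few lines: whenever a training example is misclassified, add (or subtract) it to the weights. The sketch below runs it on two Boolean functions, AND (linearly separable) and XOR (not representable by a single-layer perceptron). The epoch limit and the function names are assumptions of this example.

```python
def train_perceptron(samples, epochs=100):
    """Perceptron learning rule; labels are -1 or +1. Returns (weights, bias, converged)."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False

AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
XOR = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
```

    Training on AND stops after a handful of epochs, whereas on XOR every epoch contains at least one mistake, because no separating line exists.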

  51. Learning end-to-end optimized lossy compression codecs
    Simon Wiedemann
    Wednesday 17th May 2017
    At the core of all multimedia technologies we find compression codecs. These are information processing methods that aim to eliminate as much as possible of the redundancy contained in the raw data of a source (thus reducing the amount of information needed to represent it). In lossy compression schemes, at some point during the process, one quantizes the data values, which inevitably results in a loss of information. Thus, in the design of such codecs, one searches for the optimal trade-off between information loss (or distortion) and size reduction (or rate). However, due to the complexity of most real-world data sources (like images and videos), solving the posed rate-distortion problem becomes intractable. Therefore, most of today's codecs (e.g. JPEG, H264, etc.) rely on clever heuristics and methods that make the lossy compression problem tractable.
    Nonetheless, in recent years, neural networks have become commonplace for tasks that had for decades been accomplished by ad hoc algorithms and heuristics (e.g. object recognition). In fact, recent papers in this field have shown that this powerful class of methods is able to beat standard codecs (like JPEG) at the task of lossy image compression. These results demonstrate their potential to perhaps some day replace the current paradigms for approaching the compression problem.

    In the first part of my talk, I will give a basic introduction to information and source coding theory and explain some of the standard heuristics employed in the field of lossy image compression. In the second part, I will explain the approach taken by Ballé et al. (under review for ICLR 2017), which is based solely on neural networks.

  52. A deeper introduction to reinforcement learning
    Ludwig Winkler
    Wednesday 3rd May 2017
    An introduction to reinforcement learning covering model-based, model-free and approximate methods with deep neural networks.

  53. Sampling-based approaches for language modelling and word embeddings
    Matthias Leimeister
    Wednesday 26th April 2017
    Neural probabilistic language models (NPLM) aim at predicting a word given its context. Using neural networks, those models learn a real-valued dense representation, or word embedding, for each word in the vocabulary that can be used for subsequent tasks. Training an NPLM based on a softmax classifier output for a target word given the context is computationally expensive for large vocabularies, so various algorithms have been proposed to make training more efficient. The talk will introduce several sampling-based approximations to the softmax that aim at distinguishing the true target word from a number of noise words, with a focus on noise contrastive estimation and negative sampling (word2vec).

    Andriy Mnih and Yee Whye Teh: A fast and simple algorithm for training neural probabilistic language models. ICML 2012. pdf

    Tomas Mikolov et al.: Distributed representations of words and phrases and their compositionality. NIPS 2013. pdf
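
    For a single (context, target) pair, the negative sampling objective used in word2vec replaces the full softmax by a handful of binary classifications: the true target word should score high against the context vector, and each sampled noise word should score low. The sketch below evaluates that loss for hand-picked two-dimensional vectors. The vectors and names are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(context_vec, target_vec, noise_vecs):
    """-log sigma(target . context) - sum over noise words of log sigma(-noise . context)."""
    loss = -log_sigmoid(dot(target_vec, context_vec))
    for noise_vec in noise_vecs:
        loss -= log_sigmoid(-dot(noise_vec, context_vec))
    return loss

context = [1.0, 0.0]
good_loss = negative_sampling_loss(context, target_vec=[2.0, 0.0], noise_vecs=[[-2.0, 0.0]])
bad_loss = negative_sampling_loss(context, target_vec=[-2.0, 0.0], noise_vecs=[[2.0, 0.0]])
```

    The cost per training pair depends on the number of noise words rather than on the vocabulary size, which is the source of the speed-up.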


  54. Everybody GANs Now
    Charley Wu
    Wednesday 15th March 2017
    Generative Adversarial Networks (GANs) are one of the biggest topics in Machine Learning right now. I will give a tutorial on the principles behind Adversarial Training, and how it involves finding a Nash Equilibrium between two competing agents, the Generator and the Discriminator. I will also show that this solution has interesting implications in terms of information-theoretic divergence measures (KL-Divergence and Jensen-Shannon Divergence), which underlie all traditional approaches to computational modeling (i.e., finding a maximum likelihood estimate). notes

  55. Are RNNs Turing complete?
    Benjamin Wilson
    Wednesday 1st February 2017
    At the recent NIPS conference, I must have heard it said five times that “RNNs are Turing complete”. All five times were by Juergen Schmidhuber. At least as many times, I heard other attendees grumbling that this was false. In this talk we’ll explore the assumptions and main ideas of the proof as well as consider whether these assumptions hold for “real-world” RNNs implemented using finite-precision arithmetic.

    If you feel like a head-start, we’ll be covering the paper of Siegelmann and Sontag “On the Computational Power of Neural Nets” (1995). Notes from the talk are available here.

  56. NIPS report
    Marcel Ackermann
    Wednesday 11th January 2017
  57. COLING report
    Kate McCurdy
    Wednesday 11th January 2017
  58. A simple introduction to Bayesian Learning
    Stephen Enright-Ward
    Wednesday 30th November 2016
    Machine learning often means minimising a loss function by randomly initialising its parameters, then updating them repeatedly in response to batches of training data, for example using gradient descent. The idea behind Bayesian Learning is to replace a point estimate of the parameters with a probability distribution over the parameter space, and replace pointwise updates with Bayes' Theorem, to turn this "prior" distribution into a "posterior", which incorporates evidence from the training set. This talk will be an elementary introduction to Bayesian Learning, using simple examples and a small amount of theory. notes
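
    Perhaps the simplest instance of this idea is estimating the bias of a coin. With a Beta(alpha, beta) prior over the probability of heads, Bayes' Theorem gives a Beta posterior whose parameters are the prior parameters plus the observed counts. The sketch below is a generic textbook example and is not taken from the talk's notes.

```python
def update_beta(alpha, beta, observations):
    """Posterior Beta parameters after observing a list of 0/1 outcomes."""
    heads = sum(observations)
    tails = len(observations) - heads
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (1.0, 1.0)  # uniform prior over the probability of heads
posterior = update_beta(*prior, observations=[1, 1, 0, 1])

# Updating batch by batch gives the same posterior as a single update.
step = update_beta(*prior, observations=[1, 1])
sequential = update_beta(*step, observations=[0, 1])
```

    Here yesterday's posterior serves as today's prior, which is the sense in which Bayes' Theorem replaces the pointwise update.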

  59. Deep Neural Networks for YouTube Recommendations
    Alexey Rodriguez Yakushev
    Wednesday 16th November 2016
    A talk on the model and architecture of the YouTube recommendation system (abstract). It should offer some really interesting insights into the workings of one of the world's most heavily-used recommenders!

  60. Connectionist temporal classification
    Matthias Leimeister
    Wednesday 9th November 2016
  61. Predicting incidental comprehension of macaronic texts
    Kate McCurdy
    Wednesday 19th October 2016
  62. A word embedding model for explaining compositionality
    Stephen Enright-Ward
    Wednesday 17th August 2016
  63. Audio featurisation and speaker recognition
    Benjamin Wilson
    Wednesday 27th July 2016
  64. Collaborative Filtering and the Netflix Prize
    David Yu
    Wednesday 6th July 2016
    We'll describe the Netflix Problem and then go through the paper "Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights" by Yehuda Koren and Robert Bell (Bell Labs), the eventual winners of the competition. If time permits, we'll also describe the Map-Reduce framework and how to apply the approach in the paper to that framework.

  65. Gradient Descent Methods
    Benjamin Wilson
    Wednesday 22nd June 2016
    We study the rate of convergence of gradient descent in the neighbourhood of a local minimum. The eigenvalues of the Hessian at the local minimum determine the maximum learning rate and the rate of convergence along the axes corresponding to the orthonormal eigenvectors. All this material is drawn from Chapter 7 of Bishop’s Neural Networks for Pattern Recognition, 1995. notes
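
    Near a local minimum the error is approximately quadratic, and in the eigenbasis of the Hessian each coordinate evolves independently: one gradient step multiplies the coordinate along eigenvector i by (1 - lr * lambda_i). Gradient descent therefore converges only if lr < 2 / lambda_max. The sketch below checks this on a two-dimensional quadratic. The eigenvalues and learning rates are arbitrary illustrative values.

```python
def gradient_descent(eigenvalues, w0, lr, steps):
    """Gradient descent on E(w) = 0.5 * sum(lambda_i * w_i**2), written in the Hessian's eigenbasis."""
    w = list(w0)
    for _ in range(steps):
        # The gradient along axis i is lambda_i * w_i, so w_i is scaled by (1 - lr * lambda_i).
        w = [wi - lr * lam * wi for wi, lam in zip(w, eigenvalues)]
    return w

eigenvalues = [1.0, 10.0]  # the maximum stable learning rate is 2 / 10 = 0.2
converged = gradient_descent(eigenvalues, w0=[1.0, 1.0], lr=0.19, steps=200)
diverged = gradient_descent(eigenvalues, w0=[1.0, 1.0], lr=0.21, steps=200)
```

    With lr = 0.19 the steep direction shrinks by a factor of 0.9 per step while the shallow one shrinks by 0.81. With lr = 0.21 the steep direction grows by a factor of 1.1 per step and the iteration diverges.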

  66. Bayesian Dialog management: how to build a chat interface without a flow chart.
    Alan Nichol
    Thursday 9th June 2016
    Statistical dialogue systems are motivated by the need for a data-driven framework that reduces the cost of laboriously hand-crafting complex dialogue managers and that provides robustness against the errors created by speech recognisers operating in noisy environments. By including an explicit Bayesian model of uncertainty and by optimising the policy via a reward-driven process, partially observable Markov decision processes (POMDPs) provide such a framework. However, exact model representation and optimisation is computationally intractable. Hence, the practical application of POMDP-based systems requires efficient algorithms and carefully constructed approximations.

  67. Music similarity and speaker recognition: what do they have in common?
    Matthias Leimeister
    Wednesday 11th May 2016
    Music information retrieval has largely been influenced by methods from speech processing. In this talk I want to present a line of research on music similarity models that followed a parallel development in the area of speaker recognition and identification. Specifically I'll discuss a common set of features (MFCCs) and statistical models (single Gaussians, GMM supervectors, and i-vector factor analysis) that have been used successfully in both domains. slides

  68. Factorisation Machines
    Stephen Enright-Ward
    Wednesday 27th April 2016
    Notes available here.

  69. Negative sampling for recommender systems and word embeddings
    Benjamin Wilson
    Wednesday 13th April 2016
    Common to several approaches for training recommenders is contrasting the data observed in a context (e.g. the products purchased by a user) with negative samples chosen from the pool of all possible products. In this talk we'll review some different ways of choosing negative samples, starting with static approaches like uniform and popularity-based sampling and moving through to adaptive approaches (like BPR's adaptive oversampling and WARP) that take into account the context and the model's current belief. All of these ideas can be transferred to the world of word embeddings, where the negative sampling techniques are at present less sophisticated.


    1. Steffen Rendle and Christoph Freudenthaler, Improving Pairwise Learning for Item Recommendation from Implicit Feedback, 2014, pdf my notes

    2. Jason Weston, Samy Bengio and Nicolas Usunier, Wsabie: Scaling Up To Large Vocabulary Image Annotation, 2011 pdf my notes

  70. Gaussian Process Models: what to do when you can’t optimize?
    Charley Wu
    Wednesday 16th March 2016
    Optimization problems are ubiquitous in machine learning, but what do we do when we can't optimize? When the problem space is too large, search costs are expensive, or it is not feasible to exhaustively search through all possible solutions, it becomes impossible to find a guaranteed optimal solution. Instead, we need to find the environment-appropriate balance between exploration and exploitation in order to produce a satisfactory solution, within a finite search horizon.

    I will be giving a tutorial on Gaussian Process (GP) models as a method to model Bayesian beliefs about unexplored areas of the problem space. GPs give us both the expected value and uncertainty of each location in the problem space, based on knowledge from previous observations. With a belief model of the environment, we can treat the optimization problem as a bandit problem, with an exploration goal (reduction of uncertainty) and an exploitation goal (closer approximation of the global optimum).

    Lastly, I will also introduce a bit of the literature on human exploration-exploitation behavior, and how untrained undergrads in a non-mathematical discipline are better and more efficient than the best state-of-the-art optimization algorithms. What makes us good at this task and how can we use this information to build better search algorithms?

  71. Active evaluation of Predictive Models
    Christoph Sawade
    Wednesday 2nd March 2016
    In order to make an informed decision about the deployment of a predictive model, it is crucial to know the model’s approximate performance. To evaluate performance, a set of labeled test instances is required that is drawn from the distribution the model will be exposed to at application time. In many practical scenarios, unlabeled test instances are readily available, but the process of labeling them can be a time- and cost-intensive task and may involve a human expert.

    This talk addresses the problem of evaluating a given predictive model accurately with minimal labeling effort. We study an active model evaluation process that selects certain instances of the data according to an instrumental sampling distribution and queries their labels. We derive sampling distributions that minimize estimation error with respect to different performance measures such as error rate, mean squared error, and F-measures.

    Another instance is the problem of evaluating the performance of a ranking function. In practice, ranking performance is estimated by applying a given ranking model to a representative set of test queries and manually assessing the relevance of all retrieved items for each query. We apply the concept of active evaluation to ranking functions and derive optimal sampling distributions for the commonly used performance measures Discounted Cumulative Gain (DCG) and Expected Reciprocal Rank (ERR).
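
    To make the idea of sampling from an instrumental distribution concrete, here is a minimal sketch of a self-normalised importance-weighted error-rate estimate over a pool of unlabeled instances. The function name, the uniform test distribution over the pool and the `label_oracle` callback are assumptions for illustration; the optimal sampling distributions derived in the talk are not reproduced here.

```python
import random

def active_error_estimate(predictions, label_oracle, q, n_queries, rng):
    """Estimate the error rate by querying labels for instances drawn from q.

    predictions  : model predictions for every instance in the pool
    label_oracle : hypothetical callback returning the true label of instance i
    q            : instrumental sampling weights, strictly positive for every i
    """
    n_pool = len(predictions)
    drawn = rng.choices(range(n_pool), weights=q, k=n_queries)
    weighted_loss = 0.0
    weight_sum = 0.0
    for i in drawn:
        # Importance weight: uniform pool probability over sampling probability.
        w = (1.0 / n_pool) / q[i]
        loss = 1.0 if predictions[i] != label_oracle(i) else 0.0
        weighted_loss += w * loss
        weight_sum += w
    # Self-normalisation keeps the estimate inside [0, 1].
    return weighted_loss / weight_sum
```

    Any strictly positive `q` gives a consistent estimate; the choice of `q` only changes how many label queries are needed for a given accuracy.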

  72. Hierarchical Self-Organising Maps and Random Swapping for 2-d Sorting
    Kai Uwe Barthel
    Wednesday 17th February 2016
    There are many techniques for dimensionality reduction used to visualize high-dimensional datasets. A typical approach is principal component analysis; however, a linear PCA projection cannot preserve local pairwise distances. Other approaches are multidimensional scaling, self-organizing maps (SOMs), local linear embedding or t-Distributed Stochastic Neighbor Embedding (t-SNE). In most cases the high-dimensional datasets are non-linearly projected and shown as points on a 2D map. If images are to be arranged in a 2D fashion, there are further constraints: the images should not overlap and the 2D display should be equally filled. I will talk about different approaches to visually sorting and arranging images using hierarchical SOMs and directed random swapping. One example of such a visually sorted image display is a visual image browsing system for exploring and searching millions of images from stock photo agencies and the like. Similar to map services like Google Maps, users may navigate through multiple image layers by zooming and dragging.

  73. Counterfactual Performance Estimation or “how can your ML models have a Groundhog Day”
    Alexey Rodriguez Yakushev
    Wednesday 3rd February 2016
    I am currently very interested in evaluating new recommendation models without having to perform A/B testing every single time. So I have been reading a lot on how to properly use log data to evaluate model performance.

    It is very common to use log data to train your models and to perform model selection or hyperparameter search. Think, for instance, of click-through rate prediction, in which you train models on your users’ click stream, or recommendation models, which you train using the consumption histories of your users. Log data is relatively easy to get if you have a large user base, but it has serious pitfalls.

    Log data usually has only partial feedback: you know whether a user clicked on an ad, but you don’t know whether she would have clicked on a different ad. This bandit-like feedback will introduce bias in your evaluation metrics and will likely give you a distorted view of what will happen at A/B test time.

    We will use Counterfactual Estimators to ask “what if” questions, that is, what would the performance have been had you used a different model from the one used at logging time? Of course there are no miracles: we will look at the limitations of these estimators and their theoretical guarantees.
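
    One widely used counterfactual estimator is inverse propensity scoring (IPS). The sketch below assumes that each log record stores the probability with which the logging policy chose the displayed action; the record layout and the function names are hypothetical, and the talk may well have covered other estimators.

```python
def ips_estimate(logs, target_policy):
    """Inverse-propensity estimate of the average reward of target_policy.

    logs          : list of (context, action, reward, logging_prob) tuples, where
                    logging_prob is the probability the logging policy gave to
                    the action it showed (must be greater than zero)
    target_policy : function (context, action) -> probability that the new
                    policy would have chosen that action
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely the
        # new policy is to take the logged action.
        total += reward * target_policy(context, action) / logging_prob
    return total / len(logs)
```

    The estimate is unbiased only if the logging policy gives non-zero probability to every action the new policy might take, and its variance grows when the two policies disagree strongly. That is one of the limitations alluded to above.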

  74. An intuitive approach to clustering that works … sometimes.
    Vinzenz Schönfelder
    Wednesday 20th January 2016
    This talk presents a clustering method proposed in a recent Science paper. Generally favouring pragmatic approaches to problem solving (as a physicist…), I immediately fell in love with this approach when I learned about it from my colleagues at SISSA. Rodriguez and Laio devised a method in which the cluster centres are recognised as local density maxima that are far away from any points of higher density. The algorithm depends only on the relative densities rather than their absolute values. The authors tested the method on a series of data sets, and its performance compares favourably to that of established techniques. Looking at the method more closely, however, reveals essential weaknesses that depend on apparently minor details in the distribution of the data.

    Rodriguez, A., Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496.
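
    The two quantities the method relies on can be sketched in a few lines of Python. This is a simplified illustration: it uses a Gaussian-kernel density rather than a hard cutoff count (to avoid ties), and it ranks candidate centres by the product of density and distance. Both choices, along with the function names, are assumptions for this example.

```python
import math

def density_peaks(points, d_c):
    """Return (rho, delta) for each point.

    rho[i]   : local density, here a Gaussian-kernel sum with length scale d_c
    delta[i] : distance to the nearest point of higher density; for the densest
               point, the distance to the farthest point
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

def pick_centres(rho, delta, k):
    """Indices of the k points with the largest rho * delta, in ascending order."""
    order = sorted(range(len(rho)), key=lambda i: rho[i] * delta[i], reverse=True)
    return sorted(order[:k])
```

    Cluster centres are the points where both values are large. Because only comparisons between densities enter `delta`, the result depends on relative rather than absolute densities.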

  75. Improving Distributional Similarity with Lessons Learned from Word Embeddings
    Marcel Ackermann
    Wednesday 6th January 2016
    A review of the main highlights of the paper of Levy et al. 2015.

  76. Kernels – Duality between Feature Representation and Similarity Measures
    Christoph Sawade
    Wednesday 2nd December 2015
    The questions of ‘how to represent data points’ and ‘how to measure similarity’ are crucial for the success of all methods in data science. Kernel functions can be seen as inner products in some Hilbert space and thus connect feature encoding with the concept of similarity. In this talk, I will elaborate on this relationship in the regularized empirical risk framework, which captures lots of standard algorithms like Support Vector Machines, Ridge Regression and Logistic Regression. The goal is to develop a general understanding that is applicable throughout the machine learning world.
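
    A small worked example of the duality: for two-dimensional inputs, the homogeneous quadratic kernel equals an ordinary inner product after an explicit feature map. The kernel and the function names are chosen for illustration only and are not taken from the talk.

```python
import math

def poly_kernel(x, y):
    """Homogeneous quadratic kernel k(x, y) = (x . y)^2 as a similarity measure."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces poly_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def inner(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
```

    Evaluating `poly_kernel(x, y)` and `inner(phi(x), phi(y))` gives the same number up to rounding, so an algorithm that only needs inner products can work with the kernel without ever constructing the features.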

  77. MCMC, a space odyssey!
    Simon Pepin Lehalleur
    Wednesday 18th November 2015
    Computing integrals in high-dimensional spaces is a ubiquitous and challenging problem: traditional numerical quadrature methods fail miserably, because they sample the space uniformly while the integrand is usually supported on a very small volume. Markov Chain Monte Carlo is a family of algorithms which solve this problem by sampling more intelligently, using a Markov chain whose stationary distribution is the one we want to integrate against. I will explain the basic ideas, including some of the many ways to set up a suitable Markov chain (Metropolis-Hastings, Gibbs sampling).

    Part 1 and Part 2 of M. Betancourt’s 2-part course at MLSS 2014.
    Survey of the aspects of Markov chains which bear on MCMC.
    Handbook of MCMC, esp. chap. 1 for a general overview and 5 for Hamiltonian MCMC.
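
    As a minimal sketch of one way to set up such a chain, here is random-walk Metropolis, the special case of Metropolis-Hastings with a symmetric proposal. The one-dimensional target, the step size and the function names are assumptions for illustration.

```python
import math
import random

def metropolis(log_density, x0, n_samples, step, rng):
    """Random-walk Metropolis sampler for an unnormalised log density."""
    x = x0
    logp = log_density(x0)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        samples.append(x)
    return samples

def standard_normal_logpdf(x):
    """Log density of N(0, 1) up to an additive constant."""
    return -0.5 * x * x
```

    Averaging a function over the returned samples approximates its integral against the target distribution. For the standard normal, the average of `x * x` should be close to 1 once an initial burn-in is discarded.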

  78. Limiting Distributions of Markov Chains
    Benjamin Wilson
    Thursday 5th November 2015
    A seminar on limiting distributions for (finite-state, time-homogeneous) Markov chains, drawing on PageRank as an example. We see in particular how the “random teleport” possibility in the PageRank random walk algorithm can be motivated by the theoretical guarantees that result from it: the transition matrix is both irreducible and aperiodic, and iterated application of the transition matrix converges more rapidly to the unique stationary distribution.

    Notes and further reading available here.
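
    The iterated application of the transition matrix can be sketched as a short power iteration. The damping value of 0.85, the uniform handling of dangling nodes and the function name are assumptions made for this example, not details from the seminar notes.

```python
def pagerank(links, damping=0.85, tol=1e-12, max_iter=1000):
    """Stationary distribution of the PageRank random walk by power iteration.

    links : dict mapping every node to the list of nodes it links to
            (every node must appear as a key, possibly with an empty list)
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Teleport step: with probability 1 - damping jump to a uniform node.
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links[v] or nodes  # dangling node: jump anywhere
            share = damping * rank[v] / len(targets)
            for t in targets:
                new[t] += share
        if sum(abs(new[v] - rank[v]) for v in nodes) < tol:
            return new
        rank = new
    return rank
```

    Because of the teleport term, every node can reach every other node in one step with positive probability, which is what makes the chain irreducible and aperiodic.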

  79. Topological data analysis and the Mapper algorithm
    Stephen Enright-Ward
    Thursday 5th November 2015
    Giving a sensible visualisation of point clouds in high dimensions is a difficult problem in data science. Many traditional approaches fit point clouds to surfaces, trying to preserve distances. In the last ten years, people have started to use methods from topology, a branch of mathematics in which “distance” is replaced with the spongier notion of “in the neighbourhood of”, to model data points. The aim is to capture the global shape of the data (the signal) while ignoring local noise. I’ll explain one such approach, called the Mapper algorithm, developed by Singh, Mémoli and Carlsson in this paper.

  80. Parameterising the Mahalanobis distances
    Benjamin Wilson
    Wednesday 21st October 2015
    Given pairs of points that are known to be similar, and pairs that are known to be dissimilar, can we learn a metric such that similar points are close together? We’ll consider the special case of the Mahalanobis distance, which is just the Euclidean distance after applying a linear transform. Posing this problem in different ways leads to very different optimisation problems. I’ll share my first steps working on this. The introduction of this survey paper gives a good overview of the topic.

    Notes from the talk are here.
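
    The two equivalent parameterisations mentioned above can be checked numerically: the Euclidean distance after applying a linear map L, and the quadratic form with the positive semi-definite matrix M = LᵀL. The function names below are hypothetical and the code is a sketch, not the material from the notes.

```python
import math

def transformed_distance(x, y, L):
    """Euclidean distance between L x and L y."""
    diff = [a - b for a, b in zip(x, y)]
    z = [sum(l * d for l, d in zip(row, diff)) for row in L]
    return math.sqrt(sum(v * v for v in z))

def gram(L):
    """M = L^T L, the matrix parameterising the same distance."""
    cols = list(zip(*L))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a positive semi-definite M."""
    diff = [a - b for a, b in zip(x, y)]
    Md = [sum(m * d for m, d in zip(row, diff)) for row in M]
    return math.sqrt(sum(d * v for d, v in zip(diff, Md)))
```

    Optimising over L is unconstrained but non-convex in general, while optimising over M requires a positive semi-definiteness constraint. This is one reason the different formulations lead to such different optimisation problems.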

  81. Report on EMNLP 2015
    Kate M.
    Wednesday 7th October 2015
  82. Report on RecSys 2015
    Marcel Ackermann
    Wednesday 7th October 2015
  83. Limiting Distributions of Markov Chains
    Benjamin Wilson
    Tuesday 6th October 2015
    A seminar on limiting distributions for (finite-state, time-homogeneous) Markov chains, drawing on PageRank as an example. We see in particular how the “random teleport” possibility in the PageRank random walk algorithm can be motivated by the theoretical guarantees that result from it: the transition matrix is both irreducible and aperiodic, and iterated application of the transition matrix converges more rapidly to the unique stationary distribution.

    Notes and further reading available here.

  84. Pocket-sized Neural Networks
    Adriaan Schakel
    Tuesday 8th September 2015
  85. RNNs for Language Modelling
    Stephen Enright-Ward
    Tuesday 25th August 2015
  86. Nonnegative Matrix Factorization for Audio Source Separation
    Matthias Leimeister
    Tuesday 11th August 2015
    Audio source separation deals with the problem of decomposing a mixture of sound sources into their individual parts. Typical examples include noise suppression in speech signals or the extraction of single instruments from a music mix. A popular class of algorithms to approach the problem is nonnegative matrix factorization (NMF) and its extensions. In the talk I will review the basics of NMF for source separation following a recent survey paper [1] and present some concrete examples for speech and music [2].



    Slides from the talk are available here.

  87. Interpreting word embedding arithmetic as set operations
    Alexey Rodriguez Yakushev
    Tuesday 28th July 2015
  88. Vector length as significance in word embeddings
    Benjamin Wilson
    Tuesday 28th July 2015
    An experimental approach to studying the properties of word embeddings is proposed. Controlled experiments, achieved through modifications of the training corpus, permit the demonstration of direct relations between word properties and word vector direction and length. Written up here.

  89. Attentional models for object recognition in Recurrent Neural Nets.
    Dave Kammeyer
    Tuesday 14th July 2015
    A discussion of an attention-based model for recognizing multiple objects in images, from this paper.

  90. Sparse L1-regularization for Feature Extraction
    Vinzenz Schönfelder
    Tuesday 16th June 2015
    I present the idea behind regularization in general and its sparse incarnation in particular. As a motivation, I start with the project I worked on during my dissertation. It represents a practical application of sparse regularization for feature selection in the auditory perceptual sciences. By discussing geometric interpretations of L1-regularisation as well as its relation to sparse Bayes-priors, I hope to provide an intuitive understanding of the underlying mechanism. Finally, I discuss standard methods for hyperparameter optimisation in regularised regression.

    Slides are available here.

  91. Expectation Maximisation and Gaussian Mixture Models
    Benjamin Wilson
    Thursday 21st May 2015
    A talk on Expectation Maximisation where Gaussian Mixture Models are considered as an example application. The exposition follows Bishop section 2.6 and Andrew Ng’s CS229 lecture notes. If you weren’t at the seminar, then it is probably better to read one of these instead. Another useful reference is likely the 1977 paper by Dempster et al. that made the technique famous (this is something I would have liked to have read, but didn’t).

    Notes are here.

  92. Convolutional Neural Nets NLP
    Adriaan Schakel
    Wednesday 22nd April 2015
  93. Non-negative Matrix Factorisation
    Benjamin Wilson
    Thursday 12th February 2015
    A talk on non-negative matrix factorization (NMF), its probabilistic interpretation and the optimization problems it poses. I find non-negativity constraints really interesting from the point of view of model interpretability, and NMF is a famous example. Most of us will have seen the example of facial image decomposition using NMF before. If you wanted to read something yourself, you could start with Lee and Seung’s paper in Nature.
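
    One of the optimization problems NMF poses is minimising the squared reconstruction error subject to non-negativity. A minimal sketch of the well-known multiplicative update rules for that objective follows; the random initialisation, the small `eps` guard against division by zero, and the function names are assumptions for illustration, and the talk may have treated a different objective.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Factorise a non-negative matrix V as W H with W, H non-negative."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H
```

    Because each update multiplies by a ratio of non-negative quantities, the factors stay non-negative without any explicit projection step.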

  94. Gaussian processes
    Margo K.
    Monday 15th December 2014
  95. Negative sampling and noise contrastive estimation in the context of probabilistic language modeling
    Adriaan Schakel
    Monday 8th December 2014
  96. Deep Belief Nets
    Stephen Enright-Ward
    Monday 24th November 2014
  97. Energy-based models, inference and Gibbs Sampling
    Benjamin Wilson
    Monday 27th October 2014
    Notes available here.

  98. Latent Dirichlet Allocation
    Marija V.
    Monday 20th October 2014
  99. Estimator theory and the bias/variance trade-off
    Benjamin Wilson
    Monday 11th August 2014
    A talk on the bias-variance trade-off, including the basics of estimator theory, some examples and maximum likelihood estimation.

    Notes are here. See also squared pair-wise differences estimate the variance.
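
    The pairwise-differences identity referred to above is easy to check numerically: half the mean squared difference over all distinct pairs equals the unbiased sample variance, whereas dividing the sum of squared deviations by n gives a biased estimate. The function names are chosen for this sketch.

```python
import itertools
import statistics

def pairwise_variance(xs):
    """Half the mean squared difference over all distinct pairs of observations."""
    pairs = list(itertools.combinations(xs, 2))
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))

def biased_variance(xs):
    """Maximum-likelihood (divide-by-n) variance estimate."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)
```

    For the sample `[1, 2, 4, 7]`, `pairwise_variance` and `statistics.variance` both give 7, while `biased_variance` gives 5.25, that is, the unbiased value scaled by (n - 1) / n.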

  100. Dependently-typed programming languages
    Andreas G.
    Friday 8th August 2014
    The presentation will be an introduction to dependently typed programming languages. I hope to give a brief glimpse into how advanced type systems allow the programmer to be more explicit in the way he expresses properties of his programs. In fact, the system is rigorous to the point where we can represent logical propositions by types, and further prove a proposition by writing a program of the corresponding type.

  101. Restricted Boltzmann Machines
    Stephen Enright-Ward
    Monday 21st July 2014
  102. Feasibility of Learning
    Marija V.
    Monday 30th June 2014
    The VC dimension and learning feasibility.

  103. Mathematical properties of Principal Component Analysis
    Benjamin Wilson
    Monday 19th May 2014
    Here are some notes from the talk.

  104. Transfer learning
    Heiko Schmidle
    Monday 28th April 2014
  105. Word2vec and the Distributional Hypothesis
    Benjamin Wilson
    Monday 14th April 2014
    An overview of word2vec, covering the CBOW learning task, hierarchical softmax, and the fundamental linguistic assumption: the distributional hypothesis.

    Here are better slides from a later talk on the same subject.