Bayesian Inference and Functional Programming - Bayesian Statistics and Functional Programming

Model Comparison with Hierarchical Models

Comparing the performance of multiple machine learning models using Bayesian Hierarchical models.

Uncertainty in Neural Networks

Python

Deep Learning

Bayesian

Using MC Dropout to get probability intervals for neural network predictions.

Entity Embeddings

Python

Deep Learning

Creating entity embeddings for categorical predictors using Python.

Neural Networks in R

R

Deep Learning

This post explores how to create a simple neural network to learn a linear function and a non-linear function using both standard R and the Torch library for R.

Functional Programming and Hidden Markov Models

Bayesian

R

The hidden Markov model is a state-space model with a discrete latent state \(x_{1:T}\) and noisy observations \(y_{1:T}\). The model can be described mathematically as

Multi State Models

R

Bayesian

Multi-state models are used to model disease progression. The model is a continuous time Markov process. The states and time of transitions are fully observed. There are three states a patient can be in, “healthy”, “illness” and “deceased”. The possible pairs of transitions between these states include healthy -> illness, illness -> healthy, illness -> death and healthy -> death. The model can be expressed as a directed graph.
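As a rough illustration of the directed graph described above, the allowed transitions can be stored as an adjacency list. This is only a sketch in Python (the post itself is tagged R); the names `transitions`, `is_allowed` and `absorbing_states` are hypothetical and chosen for this example.

```python
# Allowed transitions of the three-state model, stored as an adjacency list.
# State names follow the text; "deceased" has no outgoing edges.
transitions = {
    "healthy": ["illness", "deceased"],
    "illness": ["healthy", "deceased"],
    "deceased": [],
}


def is_allowed(origin, destination):
    """Return True if the graph has a direct edge origin -> destination."""
    return destination in transitions[origin]


def absorbing_states(graph):
    """Return the states that have no outgoing transitions."""
    return [state for state, targets in graph.items() if not targets]


# Count the edges: the text lists four possible transitions.
n_edges = sum(len(targets) for targets in transitions.values())
```

Here `absorbing_states(transitions)` picks out the state that cannot be left, which is the only state without outgoing edges in the graph.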

Tidy Tuesday: Tour de France

R

The Tour de France is the biggest annual sporting event in the world, featuring 21 days of bicycle racing and two rest days around France (sometimes starting in other countries, such as the 2014 start in Yorkshire). There have been 106 editions up to the 2019 race, with the first event held 116 years ago in July 1903. During that time the race has evolved. Initially the races were entirely self-supported, meaning you had to carry your own spare tyres and fix any mechanical issues. Additionally, many of the mountain passes had gravel roads instead of the pristine tarmac of the modern day. In the 1913 race, Eugène Christophe was hit by a race vehicle on his descent from the Tourmalet, a 2,115m mountain pass in the French Pyrenees. This caused Christophe’s front fork to break, which he was forced to repair himself. He walked 10km to the nearest village and used a forge to render a new fork and thus repair his bicycle. However, Christophe paid a boy to operate the bellows on the forge, meaning he received a ten-minute penalty! The image below is Christophe during the 1913 tour, credit Bike Race Info.

Bayesian Inference for an SIR Model

R

Bayesian

Johns Hopkins University have put together a repository containing confirmed cases of COVID-19, deaths and recovered patients. Below we plot the confirmed cases, confirmed recoveries and deaths.

Tidy Tuesday: The Office

tidy-tuesday

R

Bayesian

First we download the ratings for each episode of The Office using the tidytuesdayR package.

Tidy Tuesday: US Tuition Data

tidy-tuesday

R

This week's data consists of tuition costs, salary potential and diversity information of US colleges. This includes 2 year colleges which offer associate degrees, certificates and diplomas and 4 year colleges which offer bachelor's and master's degrees. These are further split into private, public and for-profit institutions. Additionally, universities in the US charge different tuition fees for in-state and out-of-state students. Also, the ticket price is not always reflective of the student's costs. The fees can be wholly or partially subsidised by scholarships and financial aid.

Releasing Harrier League Data

R

The North East Harrier League is a series of cross country running races in the North East of England, taking place over the winter from September to March. Results are available online in HTML format from the 2012-13 season to the present 2019-20 season. I have downloaded and cleaned the data so that it can be used for analysis or exploration. The data for senior men and women is available in a tabular format in my blog package - see the file which contains the parsing functions here to get an insight into what it takes to parse this kind of data.

Tidy Tuesday: NHL Goalscorers

R

tidy-tuesday

First install the Tidy Tuesday R package.

Analysing .fit files in R

R

Garmin running watches output a file type called .fit; the developer SDK can be downloaded from the ANT website. There is also a Python library named fitparse which has been written to parse .fit files. This blog post will show you how to use reticulate to parse a .fit file.

Bayesian Survival Analysis: Exponential Model

R

Bayesian

Consider an arbitrary interval where the expected number of events in the interval is denoted as \(\lambda\). The number of events in this interval is Poisson distributed with rate \(\lambda\). To see this, subdivide the interval into \(n\) smaller intervals \(t_1, \dots, t_n\) in which the probability of an event occurring in each small interval is \(\lambda / n\) and can be represented as an independent Bernoulli trial. The number of events in the entire interval is distributed according to a Binomial distribution with number of trials \(n\) and probability of success \(\lambda / n\). If the intervals are infinitesimally small, in the limit as \(n \rightarrow \infty\), then the number of trials increases and the Binomial distribution tends to the Poisson distribution:
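The limit described above can also be checked numerically. The following is a minimal sketch in Python using only the standard library; the value \(\lambda = 3\), the grid of \(n\) values and the helper names are assumptions made for this illustration, not taken from the post.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)


def max_abs_gap(n, lam, k_max=10):
    """Largest pointwise gap between Binomial(n, lam / n) and Poisson(lam)
    over k = 0, ..., k_max."""
    return max(
        abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
        for k in range(k_max + 1)
    )


lam = 3.0  # assumed expected number of events in the interval
gaps = [max_abs_gap(n, lam) for n in (10, 100, 1000, 10000)]
```

The entries of `gaps` shrink as \(n\) grows, which is the numerical counterpart of the Binomial distribution tending to the Poisson distribution.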

Forward Mode AD in R

R

Automatic differentiation can be used to calculate the exact derivative of a function at a point using applications of the chain rule. Dual numbers provide a straightforward implementation in R using S3 generic methods. A dual number has a real component and a “dual” component which can be used to exactly calculate the expression and derivative at a specific value of \(x\). Consider the quadratic form \(f(x) = 5x^2 + 3x + 10\) with derivative \(f^\prime(x) = 10x + 3\). The function and derivative can be evaluated at a value, say \(x = 5\), using the dual number \(5 + \varepsilon\). The dual component \(\varepsilon\) is considered small such that \(\varepsilon^2 = 0\), then calculating \(f(5 + \varepsilon)\):
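The same calculation can be sketched with a small dual number type. The post uses R with S3 generics; the version below is a Python stand-in, and the class name `Dual` and its `lift` helper are hypothetical names chosen for this example. Only addition and multiplication are implemented, which is enough for the quadratic above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    real: float       # value of the expression
    eps: float = 0.0  # coefficient of epsilon, i.e. the derivative

    @staticmethod
    def lift(x):
        """Treat a plain number as a dual number with zero dual part."""
        return x if isinstance(x, Dual) else Dual(float(x), 0.0)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.real + other.real, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, because eps^2 = 0
        other = Dual.lift(other)
        return Dual(
            self.real * other.real,
            self.real * other.eps + self.eps * other.real,
        )

    __rmul__ = __mul__


def f(x):
    return 5 * x * x + 3 * x + 10


# Evaluate at x = 5 by seeding the dual component with 1.
result = f(Dual(5.0, 1.0))
print(result)  # Dual(real=150.0, eps=53.0)
```

The real part is \(f(5) = 150\) and the dual part is \(f^\prime(5) = 10 \cdot 5 + 3 = 53\), matching the analytic derivative.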

Hamiltonian Monte Carlo in R

R

Bayesian

Determining the posterior distribution for the parameters of a real-world Bayesian model inevitably requires calculating high-dimensional integrals. Often these are tedious or impossible to calculate by hand. Markov chain Monte Carlo (MCMC) algorithms are popular approaches: samplers such as the Gibbs sampler can be used to sample from models with conditionally conjugate specifications, and the Metropolis-Hastings algorithm can be used when the conditionally conjugate form is not present.

Bayesian Linear Regression with Gibbs Sampling in R

R

Bayesian

Linear regression models are commonly used to explain relationships between predictor variables and outcome variables. The data consists of pairs of independent observations \((y_i, x_i)\) where \(y_i \in \mathbb{R}\) represents the outcome variable of the \(i^\text{th}\) observation and \(x_i \in \mathbb{R}^m\) represents the predictors (or covariates) of the \(i^\text{th}\) observation. The specification for this model is:

Multi-armed Bandits in Scala

Scala

This post uses Almond in order to run Scala code in a Jupyter notebook. See my previous post to learn how to set up Jupyter, Ammonite and Almond. That post examined using the Scala libraries EvilPlot (including inline plotting in the Jupyter notebook) and Rainier for Bayesian inference in a simple linear model.

Scala and Jupyter Notebook with Almond

Scala

Typically, when programming with Scala I use a combination of ensime in emacs, sbt and the Scala repl. However, when working on a new project which requires a lot of data exploration and graphics, it is sometimes more useful to have a notebook where figures are rendered inline, alongside descriptions of why each figure has been generated and what it shows, for future reference. Jupyter notebooks have long been the standard in Python (although I prefer rmarkdown and knitr when using R).

Sampling from a distribution with a known CDF

R

A distribution with an inverse cumulative distribution function (CDF) can be sampled from using just samples from \(U[0, 1]\). The inverse CDF (sometimes called the quantile function) returns, for a probability \(p\), the value of \(x\) such that \(F_X(x) = Pr(X \leq x) = p\). Consider that a transformation \(g: [0, 1] \rightarrow \mathbb{R}\) exists which takes a value sampled from the standard uniform distribution \(u \sim U[0, 1]\) and returns a value distributed according to the target distribution. Then the inverse CDF can be written as:
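To make the idea concrete, here is a minimal sketch in Python (the post itself is tagged R). The Exponential distribution is an assumed example target, chosen because its CDF \(F(x) = 1 - e^{-\text{rate} \cdot x}\) inverts in closed form to \(F^{-1}(u) = -\log(1 - u) / \text{rate}\); the function names are hypothetical.

```python
import math
import random


def exponential_inverse_cdf(u, rate):
    """Quantile function of the Exponential(rate) distribution."""
    return -math.log(1.0 - u) / rate


def sample_exponential(n, rate, seed=0):
    """Draw n samples by pushing standard uniform draws through the inverse CDF."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - u is never zero inside the log.
    return [exponential_inverse_cdf(rng.random(), rate) for _ in range(n)]


samples = sample_exponential(100_000, rate=2.0)
sample_mean = sum(samples) / len(samples)
```

With `rate=2.0` the sample mean should sit close to the theoretical mean \(1 / \text{rate} = 0.5\), which gives a quick sanity check on the transformation.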

Bayesian Inference using rejection sampling

R

Bayesian

As an example, consider a (possibly biased) coin flip experiment. The parameter of interest is the probability of heads \(p_h\). A Beta distribution is chosen for the prior of \(p_h\), \(p(p_h) = \mathcal{B}(\alpha, \beta)\). The Beta distribution has support between 0 and 1, which is appropriate for a probability. The likelihood of a coin flip is Bernoulli, however the coin should be flipped several times in order to learn the parameter \(p_h\). The distribution for \(n\) independent Bernoulli trials is the Binomial distribution, hence the likelihood can be written as \(\textrm{Bin}(Y;n,p_h)\). The coin is flipped \(n = 10\) times and the results are displayed below:

A Statistical Model for Finishing Positions at the National Cross Country

R

Download the results from Power of ten for northern, midlands, southern and national.

Efficient Markov chain Monte Carlo in R with Rcpp

R

Bayesian

This post considers how to implement a simple Metropolis scheme to determine the parameter posterior distribution of a bivariate Normal distribution. The implementation is generic, using higher-order functions, and hence can be re-used with new algorithms by specifying the un-normalised log-posterior density and a proposal distribution for the parameters. The built-in parallel package is used to fit multiple chains in parallel. Finally, the Metropolis algorithm is reimplemented in C++ using Rcpp, which seamlessly integrates with R.
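The higher-order-function design described above might look like the following. This is a sketch in Python rather than the R and Rcpp code of the post: `metropolis`, `log_target` and `random_walk` are hypothetical names, the proposal is assumed to be symmetric, and the target is a toy un-normalised two-dimensional Normal density rather than the posterior from the post.

```python
import math
import random


def metropolis(log_posterior, propose, initial, n_iters, seed=0):
    """Metropolis sampler for a symmetric proposal.

    log_posterior: function from parameters to un-normalised log density.
    propose: function (parameters, rng) -> candidate parameters.
    """
    rng = random.Random(seed)
    current = initial
    current_lp = log_posterior(current)
    chain = []
    for _ in range(n_iters):
        candidate = propose(current, rng)
        candidate_lp = log_posterior(candidate)
        # Accept with probability min(1, exp(candidate_lp - current_lp)).
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < candidate_lp - current_lp:
            current, current_lp = candidate, candidate_lp
        chain.append(current)
    return chain


def log_target(theta):
    """Toy target: independent unit-variance Normals centred at (1, -2)."""
    x, y = theta
    return -0.5 * ((x - 1.0) ** 2 + (y + 2.0) ** 2)


def random_walk(theta, rng, step=1.0):
    """Symmetric Gaussian random-walk proposal."""
    return tuple(t + rng.gauss(0.0, step) for t in theta)


chain = metropolis(log_target, random_walk, initial=(0.0, 0.0), n_iters=20_000)
kept = chain[2_000:]  # discard the first draws as burn-in
mean_x = sum(t[0] for t in kept) / len(kept)
mean_y = sum(t[1] for t in kept) / len(kept)
```

Because the sampler only sees `log_posterior` and `propose` as arguments, a different model or proposal can be swapped in without touching `metropolis` itself.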

Harrier League Cross Country

R

The Harrier League is a cross country running league in the North East of England, with seven fixtures in the 2017/18 season spread over the winter months from September ’17 until March ’18.

MCMC with Scala Breeze

Scala

Bayesian

Scala Breeze is a numerical computing library which also provides facilities for statistical computing, for instance implementations of distributions and Markov chain Monte Carlo (MCMC), which can be used for solving the integrals required in Bayesian modelling. In this post, I am going to simulate data from a bivariate Gaussian model and use the Scala Breeze library to recover the mean and the variance of the bivariate Gaussian distribution.

An Akka HTTP Client with JSON Parsing

Scala

There are many sources of open data on the web, freely accessible via an Application Programming Interface (API). A common interchange format for these APIs is JavaScript Object Notation (JSON), which is human readable and predictable; however, it is not in the correct format for analysis. The data needs to be parsed from the JSON string and made available as an object we can work with. This blog post considers a simple Akka HTTP client to read data from the Urban Observatory in Newcastle. If you just want to read the code, see this Gist.

Using Monads for Handling Failures and Exceptions

Scala

In this post I will give a practical introduction to some useful structures for handling failure in functional programming.

Seasonal DLM

Bayesian

Scala

I introduced the class of state space models called DLMs in a previous post covering the Kalman Filter. The seasonal DLM is similar to the first order DLM, however it incorporates a deterministic transformation to the state, in order to capture cyclic trends. Remember a general DLM can be written as:

The Kalman Filter in Scala

Scala

Bayesian

A Dynamic Linear Model (DLM) is a special type of state space model, where the state and observation equations are Normally distributed and linear. A general DLM can be written as follows:

Practical Introduction to Akka Streaming

Scala

Akka Streaming is a streaming IO engine used to build high performance, fault tolerant and scalable streaming data services. In this post I will describe how you can implement some of the features included in Akka Streaming using only simple streams of integers and strings, although the true power of Akka streams only becomes apparent when we are consuming data from real sources such as Websockets, databases and files. Akka is available in Java and Scala, but I will be focusing on the Scala API in this post.