Bayesian Statistics and Functional Programming
https://jonnylaw.rocks/
Recent content on Bayesian Statistics and Functional Programming

Functional Programming and Hidden Markov Models
https://jonnylaw.rocks/blog/multi-state-survival-models-part-2/
Fri, 01 May 2020 00:00:00 +0000
The hidden Markov model is a state-space model with a discrete latent state, \(x_{1:T}\), and noisy observations, \(y_{1:T}\). The model can be described mathematically as
\[p(y_{1:T}, x_{1:T}) = p(x_1)p(y_1|x_1)\prod_{t=2}^Tp(y_t|x_t)p(x_t|x_{t-1})\]
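The factorisation above can be evaluated term by term on the log scale. The following is a minimal Python sketch, not code from the post: the function name `hmm_joint_log_density`, the 0-based state indices and the table-based emission distribution are all illustrative assumptions.

```python
import math


def hmm_joint_log_density(xs, ys, init, trans, emit_logpdf):
    """Return log p(y_{1:T}, x_{1:T}) for a discrete-state HMM.

    xs          -- latent states as 0-based indices (an assumption of this sketch)
    ys          -- observations, same length as xs
    init[k]     -- p(x_1 = k)
    trans[j][k] -- p(x_t = k | x_{t-1} = j)
    emit_logpdf -- function (y, k) -> log p(y_t = y | x_t = k)
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    # First term: log p(x_1) + log p(y_1 | x_1)
    logp = math.log(init[xs[0]]) + emit_logpdf(ys[0], xs[0])
    # Remaining terms: log p(x_t | x_{t-1}) + log p(y_t | x_t) for t = 2..T
    for t in range(1, len(xs)):
        logp += math.log(trans[xs[t - 1]][xs[t]]) + emit_logpdf(ys[t], xs[t])
    return logp


# Hypothetical two-state example with binary observations.
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.3], [0.1, 0.9]]  # emit[k][y] = p(y | x = k)


def emit_logpdf(y, k):
    return math.log(emit[k][y])


log_joint = hmm_joint_log_density([0, 0, 1], [0, 0, 1], init, trans, emit_logpdf)
print(log_joint)
```

Working with log probabilities avoids numerical underflow when the sequence is long, since the product of many small probabilities becomes a sum.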
Here \(y_{1:T} = y_1, \dots, y_T\) represents the sequence of observed values and \(x_{1:T} = x_1, \dots, x_T\) is the sequence of latent, unobserved values. The state space is assumed to be finite, \(x_t \in \{1,\dots,K\}\), and the time gaps between observations are constant.

Multi State Models
https://jonnylaw.rocks/blog/multi-state-survival-models/
Sun, 19 Apr 2020 00:00:00 +0000
Multi-state models are used to model disease progression. The model is a continuous-time Markov process in which the states and the times of transitions are fully observed. A patient can be in one of three states: “healthy”, “illness” and “deceased”. The possible transitions between these states are healthy -> illness, illness -> healthy, illness -> deceased and healthy -> deceased. The model can be expressed as a directed graph.
[Figure: state transition diagram]

Bayesian Inference for an SIR Model
https://jonnylaw.rocks/blog/bayesian-inference-for-an-sir-model/
Fri, 27 Mar 2020 00:00:00 +0000
Johns Hopkins University have put together a repository containing counts of confirmed COVID-19 cases, deaths and recovered patients. Below we plot the confirmed cases, confirmed recoveries and deaths.

SIR Model
The system of ordinary differential equations (ODEs) for the Susceptible Infected Recovered (SIR) model is given by
\[\begin{align} \frac{dS}{dt} &= - \frac{\beta I S}{N}, \\ \frac{dI}{dt} &= \frac{\beta I S}{N} - \gamma I, \\ \frac{dR}{dt} &= \gamma I, \\ N &= S + I + R. \end{align}\]

Tidy Tuesday: The Office
https://jonnylaw.rocks/blog/tidy-tuesday-the-office/
Tue, 17 Mar 2020 00:00:00 +0000
First we download the ratings for each episode of The Office using the tidytuesdayR package.

    office <- tidytuesdayR::tt_load(2020, week = 12)
    episode_ratings <- office$office_ratings

We can use glimpse from the tibble package to see the column types and some example data from the head of the table.

    glimpse(episode_ratings)
    ## Rows: 188
    ## Columns: 6
    ## $ season      <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
    ## $ episode     <dbl> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
    ## $ title       <chr> "Pilot", "Diversity Day", "Health Care", "The Alliance", …
    ## $ imdb_rating <dbl> 7.

Tidy Tuesday: US Tuition Data
https://jonnylaw.rocks/blog/tidy-tuesday-us-tuition-data/
Tue, 10 Mar 2020 00:00:00 +0000

    tuesdata <- tidytuesdayR::tt_load(2020, week = 11)

This week's data consists of tuition costs, salary potential and diversity information for US colleges. This includes 2-year colleges, which offer associate degrees, certificates and diplomas, and 4-year colleges, which offer bachelor's and master's degrees. These are further split into private, public and for-profit institutions. Additionally, universities in the US charge different tuition fees for in-state and out-of-state students. Also, the ticket price is not always reflective of the student's costs.

Releasing Harrier League Data
https://jonnylaw.rocks/blog/harrier_league_open_data/
Wed, 04 Mar 2020 00:00:00 +0000
The North East Harrier League is a series of cross country running races in the North East of England, taking place over the winter from September to March. Results from the 2012-13 season to the present 2019-20 season are available online in HTML format. I have downloaded and cleaned the data so that it can be used for analysis or exploration. The data for senior men and women is available in a tabular format in my blog package - see the file which contains the parsing functions here to get an insight into what it takes to parse this kind of data.

Tidy Tuesday: NHL Goalscorers
https://jonnylaw.rocks/blog/tidy-tuesday/
Tue, 03 Mar 2020 00:00:00 +0000
First install the Tidy Tuesday R package.

    # install.packages("remotes")
    remotes::install_github("thebioengineer/tidytuesdayR")

The data for this Tuesday can be downloaded using tt_load.

    tuesdata <- tidytuesdayR::tt_load('2020-03-03')

Unfortunately this only grabbed one file - the top 250 goalscorers. First look at this file.

Top Career Goalscorers

    tuesdata$top_250 %>%
      top_n(30, wt = total_goals) %>%
      mutate(player = forcats::fct_reorder(player, total_goals)) %>%
      ggplot(aes(x = player, y = total_goals)) +
      geom_col() +
      coord_flip()

Average goals per season

    parse_end_year <- function(years) {
      end_tens <- substr(years, 6, 7)
      possible_end <- as.

Analysing .fit files in R
https://jonnylaw.rocks/blog/analysing-fit-files-in-r/
Mon, 04 Nov 2019 00:00:00 +0000
Garmin running watches output a file type called .fit. The developer SDK can be downloaded from the ANT website. There is also a Python library named fitparse, which has been written to parse .fit files. This blog post will show you how to use reticulate to parse a .fit file.
First create a Python virtual environment. This is commonly used to keep a project's package collection together, which makes reproducibility more straightforward.

Bayesian Survival Analysis: Exponential Model
https://jonnylaw.rocks/blog/bayesian-survival-analysis/
Fri, 09 Aug 2019 00:00:00 +0000
Poisson Distribution
Consider an arbitrary interval where the expected number of events in the interval is denoted as \(\lambda\). The number of events in this interval is Poisson distributed with rate \(\lambda\). To see this, subdivide the interval into \(n\) smaller intervals \(t_1, \dots, t_n\), where the probability of an event occurring in each small interval is \(\lambda / n\), so that each small interval can be represented as an independent Bernoulli trial.

Forward Mode AD in R
https://jonnylaw.rocks/blog/forward-mode-automatic-differentiation-r/
Mon, 05 Aug 2019 00:00:00 +0000
Forward Mode Automatic Differentiation
Automatic differentiation can be used to calculate the exact derivative of a function at a point using repeated applications of the chain rule. Dual numbers provide a straightforward implementation in R using S3 generic methods. A dual number has a real component and a “dual” component, which can be used to calculate exactly the value of an expression and its derivative at a specific value of \(x\). Consider the quadratic function \(f(x) = 5x^2 + 3x + 10\) with derivative \(f^\prime(x) = 10x + 3\).

Hamiltonian Monte Carlo in R
https://jonnylaw.rocks/blog/hamiltonian_monte_carlo_in_r/
Wed, 31 Jul 2019 00:00:00 +0000
Introduction
Determining the posterior distribution for the parameters of a real-world Bayesian model inevitably requires calculating high-dimensional integrals. Often these are tedious or impossible to calculate by hand. Markov chain Monte Carlo (MCMC) algorithms are a popular approach: the Gibbs sampler can be used to sample from models with conditionally conjugate specifications, and the Metropolis-Hastings algorithm can be used when the conditionally conjugate form is not present.
There are downsides to these established methods. Gibbs sampling puts a restrictive form on the prior distribution.

Bayesian Linear Regression with Gibbs Sampling in R
https://jonnylaw.rocks/blog/bayesian-linear-regression-gibbs/
Fri, 14 Jun 2019 00:00:00 +0000
Linear regression models are commonly used to explain relationships between predictor variables and outcome variables. The data consists of pairs of independent observations \((y_i, x_i)\) where \(y_i \in \mathbb{R}^p\) represents the outcome variable of the \(i^{th}\) observation and \(x_i \in \mathbb{R}^{m \times 1}\) represents the predictor variable of the \(i^{th}\) observation. The specification for this model is:
\[y_i = \beta^T x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \Sigma).\]
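To make the specification concrete, here is a minimal Python sketch that simulates outcomes from the model in the simplest case of a scalar outcome (\(p = 1\)), so that \(\Sigma\) reduces to a single variance \(\sigma^2\). The function name `simulate_linear_regression`, the example coefficients and the restriction to \(p = 1\) are assumptions made for illustration, not part of the post.

```python
import random


def simulate_linear_regression(beta, xs, sigma, rng):
    """Draw y_i = beta^T x_i + eps_i with eps_i ~ N(0, sigma^2).

    beta  -- list of m coefficients
    xs    -- list of predictor vectors, each of length m
    sigma -- standard deviation of the noise (scalar outcome, p = 1)
    rng   -- a random.Random instance, passed in for reproducibility
    """
    ys = []
    for x in xs:
        if len(x) != len(beta):
            raise ValueError("each predictor vector must have length m")
        mean = sum(b * xj for b, xj in zip(beta, x))  # beta^T x_i
        ys.append(mean + rng.gauss(0.0, sigma))       # add Normal noise
    return ys


# Hypothetical example: an intercept (first predictor fixed at 1) and one slope.
rng = random.Random(42)
beta = [1.0, 2.0]
xs = [[1.0, rng.uniform(-1.0, 1.0)] for _ in range(5)]
ys = simulate_linear_regression(beta, xs, sigma=0.5, rng=rng)
print(ys)
```

Simulating from the model with known parameters gives data against which any inference scheme can be checked, since the posterior should concentrate near the values used in the simulation.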
The parameters of the model include the coefficients of the predictor variables, \(\beta \in \mathbb{R}^{1 \times m}\), and the variance of the unmodelled noise, \(\Sigma \in \mathbb{R}^{p \times p}\).

Multi-armed Bandits in Scala
https://jonnylaw.rocks/blog/multi-armed-bandits/
Tue, 16 Apr 2019 00:00:00 +0000
Setting up the Environment
This post uses Almond in order to run Scala code in a Jupyter notebook. See my previous post to learn how to set up Jupyter, Ammonite and Almond. That post examined using the Scala libraries EvilPlot (including inline plotting in the Jupyter notebook) and Rainier for Bayesian inference in a simple linear model.
The imports required for this post are:

    import coursier.MavenRepository
    interp.repositories() ++= Seq(MavenRepository(
      "http://dl.bintray.com/cibotech/public"
    ))
    import $ivy.

Scala and Jupyter Notebook with Almond
https://jonnylaw.rocks/blog/scala-and-jupyter-notebook-with-almond/
Mon, 15 Apr 2019 00:00:00 +0000
Typically, when programming with Scala I use a combination of ensime in Emacs, sbt and the Scala REPL. However, when working on a new project which requires a lot of data exploration and graphics, it is sometimes more useful to have a notebook where figures are rendered inline, with descriptions of why each figure has been generated and what it shows, for future reference. Jupyter notebooks have long been the standard in Python (although I prefer rmarkdown and knitr when using R).

Bayesian Inference using rejection sampling
https://jonnylaw.rocks/blog/rejection-sampling/
Mon, 25 Feb 2019 00:00:00 +0000
Coin Flip Model
As an example, consider a (possibly biased) coin flip experiment. The parameter of interest is the probability of heads, \(p_h\). A Beta distribution is chosen for the prior of \(p_h\), \(p(p_h) = \mathcal{B}(\alpha, \beta)\). The Beta distribution has support between 0 and 1, which is appropriate for a probability. The likelihood of a single coin flip is Bernoulli; however, the coin should be flipped several times in order to learn the parameter \(p_h\).

Sampling from a distribution with a known CDF
https://jonnylaw.rocks/blog/inverse-sampling/
Mon, 25 Feb 2019 00:00:00 +0000
A distribution with a known inverse cumulative distribution function (CDF) can be sampled from using just samples from \(U[0, 1]\). The inverse CDF (sometimes called the quantile function) evaluated at a probability \(p\) is the value of \(x\) such that \(F_X(x) = Pr(X \leq x) = p\). Suppose that a transformation \(g: [0, 1] \rightarrow \mathbb{R}\) exists which takes a value sampled from the standard uniform distribution, \(u \sim U[0, 1]\), and returns a value distributed according to the target distribution.

A Statistical Model for Finishing Positions at the National Cross Country
https://jonnylaw.rocks/blog/national-cross-country/
Fri, 22 Feb 2019 00:00:00 +0000
Area Results
Download the results from Power of ten for the northern, midlands, southern and national championships.
A linear model
The goal is to fit a model where the outcome is the position at the national and the input is the position at the northern XC. This allows us to determine the quality of the field at each XC and to predict the position you are likely to finish in at the National this season given a result in the area championships.

Efficient Markov Chain Monte Carlo in R with Rcpp
https://jonnylaw.rocks/blog/efficient_mcmc_using_rcpp/
Mon, 11 Feb 2019 00:00:00 +0000
Bivariate Normal Model
This post considers how to implement a simple Metropolis scheme to determine the parameter posterior distribution of a bivariate Normal distribution. The implementation is generic, using higher-order functions, and hence can be re-used with new models by specifying the un-normalised log-posterior density and a proposal distribution for the parameters. The built-in parallel package is used to fit multiple chains in parallel. Finally, the Metropolis algorithm is reimplemented in C++ using Rcpp, which seamlessly integrates with R.

Harrier League Cross Country
https://jonnylaw.rocks/blog/harrier-league-cross-country/
Thu, 26 Oct 2017 00:00:00 +0000
The Harrier League is a cross country running league with seven fixtures across the North East of England in the 2017/18 season, held over the winter months from September ’17 until March ’18.
The Harrier League differs from other cross country fixtures because the senior runners are divided up into slow, medium and fast packs. In the senior men’s race, the slow runners start first, followed 2 minutes 30 seconds later by the medium pack runners and then, a further 2 minutes 30 seconds later, by the fast pack runners.

MCMC with Scala Breeze
https://jonnylaw.rocks/blog/breezemcmc/
Sun, 23 Apr 2017 14:13:12 -0500
Bivariate Gaussian Model
Scala Breeze is a numerical computing library which also provides facilities for statistical computing. For instance, it includes implementations of distributions and of Markov chain Monte Carlo, which is typically used for Bayesian inference in intractable models. Today I am going to build a simple bivariate Gaussian model, simulate some realisations from the model and use the Breeze library to recover the mean and the variance of the bivariate Gaussian distribution.

An Akka HTTP Client with JSON Parsing
https://jonnylaw.rocks/blog/akkaclient/
Tue, 21 Feb 2017 12:13:14 -0500
There are many sources of open data on the web, freely accessible via an Application Programming Interface (API). A common interchange format for these APIs is JavaScript Object Notation (JSON), which is human readable and predictable but is not in a format suitable for analysis. The data needs to be parsed from the JSON string and made available as an object we can work with. This blog post considers a simple Akka HTTP client to read data from the Urban Observatory in Newcastle.

Using Monads for Handling Failures and Exceptions
https://jonnylaw.rocks/blog/failureinfunctionalprogramming/
Wed, 04 Jan 2017 12:13:14 -0500
In this post I will give a practical introduction to some useful structures for handling failure in functional programming.
Referential Transparency
One of the most important properties of functional programming is referential transparency, which comes from programming with pure functions. This means we can substitute a call to a pure function with its result. For instance, if we have the function def f = 1 + 2, we can replace every occurrence of f with 3 and the final evaluation will remain unchanged.

Seasonal DLM
https://jonnylaw.rocks/blog/seasonaldlm/
Tue, 13 Dec 2016 12:13:14 -0500
The Seasonal DLM
I introduced the class of state space models called DLMs in a previous post covering the Kalman Filter. The seasonal DLM is similar to the first order DLM; however, it incorporates a deterministic transformation to the state in order to capture cyclic trends. Recall that a general DLM can be written as:
\[\begin{align} y_t &= F_t x_t + \nu_t, \qquad \nu_t \sim \mathcal{N}(0, V_t), \\ x_t &= G_t x_{t-1} + \omega_t, \quad \omega_t \sim \mathcal{N}(0, W_t). \end{align}\]

The Kalman Filter in Scala
https://jonnylaw.rocks/blog/kalmanfilter/
Mon, 12 Dec 2016 12:13:14 -0500
A Dynamic Linear Model (DLM) is a special type of state space model in which the state and observation equations are linear with Normally distributed noise. A general DLM can be written as follows:
\[\begin{aligned} y_t &= F_t x_t + \nu_t, \qquad \nu_t \sim \mathcal{N}(0, V_t), \\ x_t &= G_t x_{t-1} + \omega_t, \qquad \omega_t \sim \mathcal{N}(0, W_t). \end{aligned}\]
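The two equations can be read as a recipe for simulating data: advance the state, then observe it with noise. The Python sketch below does this for the simplest case, where the state and observation are scalars and \(F\), \(G\), \(V\) and \(W\) do not change over time. These simplifications, the function name `simulate_dlm` and the example parameter values are assumptions for illustration and are not taken from the post, which uses Scala.

```python
import math
import random


def simulate_dlm(F, G, V, W, x0, n_steps, rng):
    """Simulate a scalar, time-invariant DLM.

    x_t = G * x_{t-1} + omega_t,  omega_t ~ N(0, W)   (state equation)
    y_t = F * x_t     + nu_t,     nu_t    ~ N(0, V)   (observation equation)

    V and W are variances; x0 is the state before the first step.
    Returns the lists (xs, ys) of latent states and observations.
    """
    xs, ys = [], []
    x = x0
    for _ in range(n_steps):
        x = G * x + rng.gauss(0.0, math.sqrt(W))  # advance the latent state
        y = F * x + rng.gauss(0.0, math.sqrt(V))  # noisy observation of the state
        xs.append(x)
        ys.append(y)
    return xs, ys


# Hypothetical example: a first order DLM (random walk observed with noise).
rng = random.Random(2016)
states, observations = simulate_dlm(F=1.0, G=1.0, V=1.0, W=0.1, x0=0.0, n_steps=10, rng=rng)
print(observations)
```

Setting \(F = G = 1\) gives a random walk observed with noise, a convenient test case because the latent states are known and can be compared with the output of a filter.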
\(y_t\) represents the observation of the process at time \(t\) and \(x_t\) is the value of the unobserved state at time \(t\).

Practical Introduction to Akka Streaming
https://jonnylaw.rocks/blog/practicalakkastreams/
Thu, 01 Dec 2016 12:13:14 -0500
Akka Streaming is a streaming IO engine used to build high performance, fault tolerant and scalable streaming data services. In this post I will describe how you can implement some of the features included in Akka Streaming using only simple streams of integers and strings, although the true power of Akka streams only becomes apparent when we are consuming data from real sources such as WebSockets, databases and files. Akka is available in Java and Scala, but I will be focusing on the Scala API in this post.

About Me
https://jonnylaw.rocks/about/
Jonny Law is a data scientist at the National Innovation Centre for Data, working with local and national companies to allow them to make the most of their data. Previously he was a Statistics PhD student on the Cloud Computing for Big Data CDT. His primary research interest is in using functional programming for Bayesian Inference (often called probabilistic programming), using Scala.
Skills: R, Scala, Python, SQL, Bayesian Modelling, Time series analysis, Machine Learning
Projects: Bayesian State Space Model