The North East Harrier League is a series of cross country running races held in the North East of England over the winter, from September to March. Results are available online in HTML format from the 2012-13 season to the present 2019-20 season. I have downloaded and cleaned the data so that it can be used for analysis or exploration. The data for senior men and women are available in a tabular format in my blog package; see the file which contains the parsing functions here for an insight into what it takes to parse this kind of data.
I used the following R packages to download and parse the HTML and to clean the resulting data:
- rvest is a web scraping library for R. The functions read_html and html_table were used to download the raw HTML and extract tables created using <table> tags. html_node can be used to find a specific tag; for instance, the date of a fixture was often in one of the main header tags.
- dplyr provides a suite of functions which can be used to manipulate columns of a dataframe in a declarative, functional way.
- stringr was used to parse strings in the raw data and the HTML. Information such as the division is included in the same table cell as the name. This can be extracted using str_extract(name, pattern = "[1-3]"): the regex matches the first occurrence of 1, 2 or 3 and extracts the matching number.
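The scraping steps described above can be sketched on a small inline page. This is only an illustration: the HTML, the heading text, the runner names and the "(Div n)" format are made up for the example and are not the real Harrier League markup or the parsing code from my package.

```r
library(rvest)
library(stringr)

# A made-up page standing in for one fixture's results.
# The real pages are laid out differently; this only mimics the idea.
page <- read_html('
  <h1>Example fixture heading</h1>
  <table>
    <tr><th>Pos</th><th>Name</th><th>Time</th></tr>
    <tr><td>1</td><td>A Runner (Div 1)</td><td>35:12</td></tr>
    <tr><td>2</td><td>B Runner (Div 3)</td><td>36:40</td></tr>
  </table>')

# html_node finds the first matching tag; html_text returns its contents
heading <- page %>% html_node("h1") %>% html_text()

# html_table converts the <table> node into a data frame
results <- page %>% html_node("table") %>% html_table()

# The division shares a cell with the name, so pull out the first 1, 2 or 3
results$division <- as.integer(str_extract(results$Name, "[1-3]"))
```

On the real pages the same pattern applies, but every season's HTML needs its own checks, since a name or club containing a digit from 1 to 3 would also match this regex.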
The data can be accessed by installing my R package which contains a selection of R code relating to this blog.
# install.packages("remotes")
# remotes::install_github("jonnylaw/jonnylaw")
data("harrier_league_results", package = "jonnylaw")
Determining the most difficult course
As a quick example of what can be done with the data, I will consider the running time by course. The data can be split by male and female. However, the men and women don’t compete over the same distance: the women complete two laps and the men complete three. Therefore, we can plot the average time for a single lap of the course (this doesn’t account for changing pace throughout the race). It appears that the hardest (or longest) course is Aykley Heads, which has the highest median time.
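The per-lap calculation can be sketched with dplyr on a toy data frame. The column names (course, sex, time) and all the numbers are hypothetical and chosen only to make the arithmetic easy to follow; the real dataset may be structured differently.

```r
library(dplyr)

# Toy data with hypothetical column names; times are invented, in seconds
toy_results <- tibble(
  course = c("Course A", "Course A", "Course B", "Course B"),
  sex    = c("male", "female", "male", "female"),
  time   = c(2400, 1800, 2100, 1500)
)

lap_times <- toy_results %>%
  mutate(
    laps = if_else(sex == "male", 3, 2), # men run three laps, women two
    lap_time = time / laps               # average time for a single lap
  ) %>%
  group_by(course) %>%
  summarise(median_lap_time = median(lap_time)) %>%
  arrange(desc(median_lap_time))         # slowest course first
```

Dividing by the number of laps puts the men's and women's races on the same scale, so courses can be compared using both sets of results at once.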