Clustering for understanding

Drowning in data and need a way to reduce it meaningfully? We have thousands of data sets that tell us something about all the communities in Essex, but we can’t possibly explore each of these individually, which is where cluster analysis becomes a useful analytical tool.

Grouping data together in a way that is meaningful or useful (preferably both) allows us to summarise data effectively and understand how it is related. Cluster analysis helps to classify data according to shared characteristics and to profile behaviours within samples, which, in turn, can help inform decisions on how best to meet the needs of a specific audience, or what services are needed in an area.

For our recent work on identifying physical inactivity in Essex, we wanted to see whether the Essex LSOAs (Lower Layer Super Output Areas) in our data ‘cluster’ together based on how similar or different their deciles are for each theme (e.g. health, crime, education, vulnerable groups, economy). This approach enabled us to identify which of these small geographic areas share common issues, what those issues are, and how the clusters differ from one another.

Essex is known for being a diverse county, with picture-perfect seasides, rural landscapes, new and historic towns, and all sorts of different communities. It turns out our communities have more in common than we thought, and there are actually only FOUR different types of community in the county when it comes to our activity levels (at least according to the data!).

We determined the optimal number of clusters needed to describe the LSOAs using the elbow method. Based on the within-cluster and between-cluster similarity measures, four clusters were created. Four was the point where the decrease in distance levelled off as the number of clusters increased (e.g. the drop in distance from 4 to 5 clusters was a lot smaller than the drops from 1 to 2, 2 to 3 and 3 to 4).

K-means clustering was then used to identify the four clusters (‘types’ of LSOAs) across themes. The average decile by theme for each cluster was used to produce descriptions of the LSOA clusters, which were then corroborated and sense-checked against the weighted average of the deciles of the LSOAs within each cluster across all the themes. A lower decile indicates more of a problem in that area.
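The elbow check, k-means step and cluster profiles described above can be sketched in a few lines. The original analysis was carried out in R; what follows is only an illustrative Python sketch using made-up decile scores, not the real Essex data or the original code. The theme list, the function names and the deterministic farthest-point seeding are all assumptions introduced here for the example.

```python
import random

THEMES = ["health", "crime", "education", "economy"]  # illustrative subset of themes

def make_synthetic_lsoas(seed=0, per_group=30):
    """Made-up decile scores (1-10) for four artificial groups of areas."""
    rng = random.Random(seed)
    group_centres = [(2, 2, 2, 2), (9, 9, 9, 9), (2, 9, 2, 9), (9, 2, 9, 2)]
    return [[c + rng.choice([-1, 0, 1]) for c in centre]
            for centre in group_centres for _ in range(per_group)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def column_means(points):
    return [sum(col) / len(points) for col in zip(*points)]

def assign(rows, centres):
    """Label each row with the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda j: sq_dist(r, centres[j]))
            for r in rows]

def kmeans(rows, k, max_iter=100):
    # Deterministic farthest-point seeding: an assumption made here so the
    # sketch is reproducible, not necessarily what the original analysis did.
    centres = [list(rows[0])]
    while len(centres) < k:
        centres.append(list(max(
            rows, key=lambda r: min(sq_dist(r, c) for c in centres))))
    for _ in range(max_iter):
        labels = assign(rows, centres)
        updated = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            updated.append(column_means(members) if members else centres[j])
        if updated == centres:  # centres stopped moving: converged
            break
        centres = updated
    labels = assign(rows, centres)
    wss = sum(sq_dist(r, centres[lab]) for r, lab in zip(rows, labels))
    return labels, centres, wss

def elbow_curve(rows, ks):
    """Total within-cluster sum of squares (WSS) for each candidate k."""
    return {k: kmeans(rows, k)[2] for k in ks}

def cluster_profiles(rows, labels, k):
    """Average decile per theme for each cluster (lower = more of a problem)."""
    profiles = {}
    for j in range(k):
        members = [r for r, lab in zip(rows, labels) if lab == j]
        profiles[j] = dict(zip(THEMES, column_means(members)))
    return profiles

rows = make_synthetic_lsoas()
wss_by_k = elbow_curve(rows, range(1, 7))
for k in range(2, 7):
    drop = wss_by_k[k - 1] - wss_by_k[k]
    print(f"{k - 1} -> {k} clusters: WSS falls by {drop:.1f}")

labels, centres, _ = kmeans(rows, 4)
for cluster, profile in cluster_profiles(rows, labels, 4).items():
    print(cluster, {theme: round(avg, 1) for theme, avg in profile.items()})
```

On this synthetic data the fall in within-cluster distance from 4 to 5 clusters is much smaller than the fall from 3 to 4, which is the ‘elbow’ pattern described above. Real LSOA data is unlikely to separate this cleanly, so the choice of cluster number still needs the kind of sense-checking described in this post.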

Applying cluster analysis to the data produced valuable insight. We were able to determine which areas in Essex were most likely to participate in less than 30 minutes of physical activity, identify barriers to activity, and recommend whether an area may benefit from holistic interventions spanning a number of themes or a targeted intervention on a specific theme.

Fancy testing your rhythm as part of a drumming circle, or maybe cycling is more your thing?

A physical inactivity dashboard was created as a tool for colleagues in the LDP to commission and develop interventions that support people to overcome the barriers to activity within the areas identified as most in need. The dashboard is helping the LDP team to determine exactly what works and for whom, so that they can judge whether free bikes or bongos will have more success in initiating activity and kickstarting new community groups. The tool is available via the Open Data website.

All data transformation and cluster analysis was carried out in R, which enabled me to improve my coding skills and build experience following training I received from the University of Essex BLG team. I have since adapted the cluster analysis R code to complete some analysis as part of our work to identify communities vulnerable to COVID-19. This work has also been published on the Open Data website: https://data.essex.gov.uk/dataset/2ydz7/covid19-risk-and-vulnerability-mapping-dashboard

Emma Farrow

Analyst, Population Health, Essex County Council and ecda

Updated 20/07/20