Ohalo Data X-Ray anonymisation evaluation for phase one of the Lloyds Accelerator Challenge

Ohalo, Jo Sproston (HSE)

Ohalo & HSE

The effectiveness of Ohalo Data X-Ray anonymisation was evaluated against manual anonymisation. The evaluation used the RIDDOR data behind the HSE Construction Division RIDDOR dashboard [1], which was manually anonymised in 2017 and made public. A significant data breach was defined as one where sharing HSE data externally would be likely to pose a risk to people’s rights and freedoms. This is the same standard as that used by the UK Information Commissioner’s Office (ICO) [2].

Of the 1998 RIDDOR reports analysed, 743 contained sensitive text, including some personally identifiable information (PII). After anonymisation using Ohalo’s Data X-Ray, 94 records retained some PII, of which 19 would be considered sufficient for a significant breach. Based on these figures, Ohalo’s Data X-Ray reduced the number of sensitive records by 97% (i.e. 724/743 were adequately anonymised and 19/743 remained sensitive). The 3% of records that remained sensitive generally did so because information on a named individual, together with details of a specific injury or other event, remained after anonymisation.
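The percentages quoted above follow directly from the record counts. The short sketch below reproduces that arithmetic; the variable names are illustrative only and are not taken from the evaluation itself.

```python
# Record counts quoted in the summary (variable names are illustrative).
sensitive_before = 743    # records containing sensitive text before anonymisation
residual_pii = 94         # records retaining some PII after anonymisation
significant_breach = 19   # of those, records sufficient for a significant breach

# Records treated as adequately anonymised in the summary's calculation.
adequately_anonymised = sensitive_before - significant_breach


def pct(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


print(f"Adequately anonymised: {adequately_anonymised} "
      f"({pct(adequately_anonymised, sensitive_before):.1f}%)")
print(f"Remained sensitive:    {significant_breach} "
      f"({pct(significant_breach, sensitive_before):.1f}%)")
print(f"Retained some PII:     {residual_pii} "
      f"({pct(residual_pii, sensitive_before):.1f}%)")
```

Rounded to whole numbers, the first two figures give the 97% and 3% reported above.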

Aims and objectives

Ohalo Data X-Ray is a server-based, customisable tool for anonymising data. However, it had not previously been used in a health and safety context, and some types of data that we would wish to anonymise were not originally recognised. Additionally, the methods of anonymisation meant that the entity type associated with each piece of redacted text (e.g. ‘Date’, ‘Person’, ‘Organisation’) was not retained as part of the anonymisation process. The aim of this project was to evaluate and improve the anonymisation of health and safety data, and to add the capability for context-specific anonymisation and entity association, to improve its use as part of an integrated research desensitisation pipeline.
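To illustrate what retaining entity association means in practice, the sketch below replaces each detected span with a placeholder naming its entity type instead of a generic mask. It is a minimal illustration only: the Entity type, the find_span helper and the placeholder format are assumptions made for this example and do not describe Data X-Ray's actual interface or output. The sample sentence is fictitious.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    """A detected span of sensitive text (hypothetical structure)."""
    start: int
    end: int
    label: str


def find_span(text: str, fragment: str, label: str) -> Entity:
    """Helper for the example: locate a fragment and tag it with a label."""
    start = text.index(fragment)
    return Entity(start, start + len(fragment), label)


def redact_with_labels(text: str, entities: List[Entity]) -> str:
    """Replace each span with its entity type, assuming spans do not overlap."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        pieces.append(text[cursor:ent.start])      # keep text before the span
        pieces.append(f"[{ent.label.upper()}]")    # keep the type, drop the value
        cursor = ent.end
    pieces.append(text[cursor:])                   # keep text after the last span
    return "".join(pieces)


sample = "John Smith of Acme Ltd was injured on 3 May 2016."
found = [
    find_span(sample, "John Smith", "Person"),
    find_span(sample, "Acme Ltd", "Organisation"),
    find_span(sample, "3 May 2016", "Date"),
]
print(redact_with_labels(sample, found))
```

Keeping the entity type in the placeholder means downstream analysis can still tell that a person, organisation or date was mentioned, without seeing the value itself.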

Because the methods that can be used to identify individuals from their data are complex and evolving, complete anonymisation is not expected to be possible in all cases. However, this evaluation methodology quantifies the reduction in risk that anonymisation provides, so that the residual risk can be properly understood and suitable controls put in place to manage it.

Key findings

After using the Ohalo Data X-Ray platform to auto-anonymise the sample of HSE RIDDOR data, the final dataset cannot be considered fully anonymised. However, the sensitivity of the data, and the risk of a significant breach if it were shared externally, have been reduced very significantly. Whether the residual risk of sharing the data is still significant needs to be considered in relation to the wider measures that would be in place to reduce risk and prevent a significant breach (e.g. non-disclosure agreements, IT security).

The level of anonymisation achieved can reasonably be considered a base level on which future work can build; work is already planned to introduce new features shortly that will further reduce the sensitivity of data after anonymisation. The free text data in RIDDOR is very sensitive, relatively short and contains a wide variety of syntactic variations, so effective anonymisation is challenging. With established tools and processes in place, future anonymisation analysis will be comparatively straightforward.


The report provides a detailed list of the recurring types of text that are currently over- or under-redacted. Of these, the most significant is the under-redaction of some names, whether a full name or just the first or last name. However, in many cases Data X-Ray does successfully redact the same name or names where they appear elsewhere in the same body of text. A programmatic solution has been proposed to minimise this under-redaction: extra processing that searches the full text a second time for variants of name entities that were successfully identified. It is anticipated that resolving this issue could reduce the ‘Significant Breach’ count to just 10 records, around 1% of the 743 sensitive records.
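The proposed second pass can be sketched as follows. This is an illustration of the general idea only, not the implementation planned for Data X-Ray: the function name, the placeholder text and the way detected names are passed in are all assumptions, and the names in the example are fictitious.

```python
import re
from typing import Iterable


def second_pass_redact(text: str, detected_names: Iterable[str],
                       placeholder: str = "[PERSON]") -> str:
    """Redact remaining occurrences of names the first pass already found.

    detected_names is assumed to hold the name strings that the first
    anonymisation pass identified in this same body of text.
    """
    variants = set()
    for name in detected_names:
        variants.add(name)
        # Also search for the first or last name on its own; very short
        # tokens (e.g. initials) are skipped to limit over-redaction.
        variants.update(part for part in name.split() if len(part) > 2)
    if not variants:
        return text
    # Try longer variants first so a full name is matched before its parts.
    ordered = sorted(variants, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    pattern = re.compile(rf"\b(?:{alternation})\b", flags=re.IGNORECASE)
    return pattern.sub(placeholder, text)


# Text as it might look after the first pass: one mention was redacted,
# but the surname and first name used alone were missed.
first_pass_output = ("[PERSON] fell from a ladder. "
                     "Smith was taken to hospital by John.")
print(second_pass_redact(first_pass_output, ["John Smith"]))
```

A trade-off of this approach is that a name which is also an ordinary word would be redacted wherever it appears in the same text, so it may increase over-redaction in some records.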

This documented list of identified issues can be used to direct and prioritise efforts to improve results, and forms a baseline for future anonymisation assessments.