Phase one outputs from development of capacities to extract health and safety insights from free-text sources

The text mining project, delivered as part of the LRF Discovering Safety Programme, aims to build on state-of-the-art text mining and natural language processing (NLP) to develop a suite of tools and techniques for specific use on unstructured health and safety datasets. Applied to the HSE datasets available to the programme, these tools are intended to generate new health and safety insights and learning. The intention is also to make the tools available for industry to use on its own datasets, for its own purposes, further leveraging the benefits of the work undertaken on the programme.

Aims and objectives

The core aim of the Phase 1 work has been to convert the HSE reports archive into a format more amenable to collective analysis and to demonstrate how it might be applied in practice.

Key findings

The major outcomes from this stage of the project are the tools developed. Their performance is expected to improve significantly once HSE can share larger datasets and these have been annotated, allowing models to be trained on a more substantial document corpus. The first phase of annotation of RIDDOR reports has been completed by a team of nine annotators, and the inter-annotator agreement has been assessed. Detailed analysis of the results has highlighted concepts and entities that are not consistently labelled (for example, confusion between Materials, Equipment and Physical Environment), and which could consequently reduce the performance of machine learning algorithms.
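As an illustration of how agreement among several annotators can be quantified, the sketch below computes Fleiss' kappa, a common statistic for more than two raters. The report does not state which agreement measure the project used, so the choice of Fleiss' kappa is an assumption, and the function name and the small sample of labels are hypothetical and invented for illustration only.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items each labelled by the same number of annotators.

    ratings: list of items; each item is a list of labels, one per annotator.
    categories: the full set of labels annotators could choose from.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of labels")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: proportion of agreeing annotator pairs per item.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    total = n_items * n_raters
    p_expected = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )

    if p_expected == 1.0:
        # Every label is the same category; agreement is trivially perfect.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels: three text spans, each labelled by nine annotators.
CATEGORIES = ["Materials", "Equipment", "Physical Environment"]
sample = [
    ["Equipment"] * 9,
    ["Materials"] * 5 + ["Equipment"] * 4,
    ["Physical Environment"] * 6 + ["Materials"] * 3,
]
kappa = fleiss_kappa(sample, CATEGORIES)
print(f"Fleiss' kappa: {kappa:.3f}")
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance. Computing the statistic separately for subsets of categories is one way to surface labels that annotators tend to confuse.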


Planned work aims to develop tools to support two specific tasks: 1) enhanced search and retrieval of individual documents and of specific content within documents, and 2) targeted health and safety knowledge discovery, both descriptive (i.e. identifying existing knowledge within the knowledge base) and inferential (i.e. generating new knowledge through inference). The ability to apportion documents into different clusters based on content, and to auto-label specific content, are both key to performing enhanced search and retrieval tasks on the reports corpus and subsequently summarising the returned content.
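To make the search and retrieval task concrete, the following is a minimal sketch of content-based document ranking using TF-IDF weighting and cosine similarity, written with the Python standard library only. It is not the project's tooling: the function names are hypothetical, the three example reports are invented for illustration, and a production system would likely rely on an established NLP library and richer models.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic tokens only (a deliberately crude tokenizer).
    return re.findall(r"[a-z]+", text.lower())


def build_tfidf(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for tf in term_counts:
        doc_freq.update(tf.keys())
    # Smoothed inverse document frequency; rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: f * idf[t] for t, f in tf.items()} for tf in term_counts]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, top_k=3):
    """Rank documents by cosine similarity to the query; drop zero-score matches."""
    vectors, idf = build_tfidf(docs)
    query_tf = Counter(tokenize(query))
    # Query terms never seen in the corpus carry no weight and are ignored.
    query_vec = {t: f * idf[t] for t, f in query_tf.items() if t in idf}
    scored = sorted(
        ((cosine(query_vec, v), i) for i, v in enumerate(vectors)), reverse=True
    )
    return [(i, score) for score, i in scored[:top_k] if score > 0]


# Invented example reports, for illustration only.
reports = [
    "Worker fell from ladder while painting ceiling",
    "Chemical spill in laboratory storage area",
    "Scaffold collapse; operative fell from height",
]
for index, score in search("fell from ladder", reports):
    print(f"{score:.3f}  {reports[index]}")
```

The same document vectors could be passed to a clustering algorithm to group reports by content, which is the other capability the planned work describes.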