Web scraping health and safety intelligence: phase one report

There are huge volumes of useful health and safety information openly available on the Internet. This includes information pertaining to industrial accidents that have happened around the world, information on new and emerging industry risks, new approaches to mitigating such risks, new regulatory requirements, industry standards, industry guidance and industry good practice. All such information has huge learning value for the Discovering Safety Programme.

Aims and objectives

The aim of this feasibility study was to explore the potential of using the website monitoring and web scraping capabilities of technology company Polecat, together with its RepVault platform, to identify and harvest health and safety information from websites of value to the Discovering Safety Programme, integrate it within the RepVault platform, and develop enhanced functionality so that it can be interrogated quickly using the platform. The vision for the deliverable was an interactive map of the world made up of a number of information layers, similar to a geographic information system, in which each layer holds a different category of health and safety information that can be explored using the map as an intelligent interface.

Specific objectives were five-fold: 

  1. Agree health and safety topic areas to provide the focus of the map layers.

  2. Build web search terms, then harvest information content and associated metadata (e.g. information source, date of publication, company name).

  3. Geotag and time stamp information records, and flag other pertinent contextual data relating to them.

  4. Integrate the records within the RepVault platform and create search functionality enabling intelligent interrogation by a user.

  5. Test the prototype platform.
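As an illustration of the objectives above, the following is a minimal sketch of how a harvested record and its map layers might be represented. It is not Polecat's or RepVault's actual data model: the `HarvestedRecord` class, its field names and the `build_layers` helper are all hypothetical, chosen only to mirror the metadata the objectives mention (source, publication date, company name, geotag, contextual flags).

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class HarvestedRecord:
    """One harvested item plus the metadata named in the objectives (hypothetical schema)."""
    title: str
    source: str                         # information source, e.g. a website name
    published: datetime                 # time stamp (date of publication)
    topic: str                          # agreed topic area, i.e. the map layer
    company: Optional[str] = None
    latitude: Optional[float] = None    # geotag; None until the record is geocoded
    longitude: Optional[float] = None
    flags: list[str] = field(default_factory=list)  # other contextual data


def build_layers(records):
    """Group geotagged records by topic so that each topic forms one map layer."""
    layers = defaultdict(list)
    for rec in records:
        if rec.latitude is None or rec.longitude is None:
            continue  # records without a geotag cannot be placed on the map
        layers[rec.topic].append(rec)
    # Sort each layer newest first so recent items surface at the top.
    for recs in layers.values():
        recs.sort(key=lambda r: r.published, reverse=True)
    return dict(layers)
```

Keeping the geotag optional reflects the order of the objectives: a record is harvested first and geotagged afterwards, and only geotagged records can appear on a layer.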

Key findings

The pilot demonstrated that unstructured online and social media data can be interrogated to deliver intelligence on safety concerns, and that the results can be visualised effectively on an interactive, searchable geovisualisation. The pilot also surfaced clear areas for further research and development:


  • Enhance specificity by applying the approach to selected use cases, such as confined spaces and/or safety at sea.

  • Enhance ease of access to country-level results, e.g. by providing interactive indexes of countries ranked by association with certain macro-themes, sub-topics or use cases, with search through to the underlying posts.

  • Enrich context of results by bringing more insights to bear regarding sectors, related topics (e.g. ESG or risk terminology) and influencers (e.g. civil society groups, politicians, investors).  

  • Explore synthesis with Manchester University’s text mining work around 10 key concepts underpinning safety performance.  

  • Sharpen exclusion terms at macro-theme and sub-topic level to hone results.

  • Enhance the user interface to simplify the layout and ease navigation, filtering and searching through to the underlying posts.

  • Explore the value of integration into HSE systems, and the appetite of Lloyd’s Register Foundation to use the platform as part of its contribution to enriching understanding of, and insight into, global safety concerns.
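Two of the development areas above, sharper excludes and country-level rankings, can be sketched together in a few lines. This is only an illustration under stated assumptions: posts are taken to be `(country, text)` pairs, matching is a plain lowercase substring test, and the function names and sample terms are hypothetical rather than anything used in the pilot.

```python
from collections import Counter


def matches(text, include_terms, exclude_terms):
    """Return True if text mentions an include term and no exclude term.

    Terms are expected in lowercase; the text is lowercased before comparison.
    """
    lowered = text.lower()
    if any(term in lowered for term in exclude_terms):
        return False
    return any(term in lowered for term in include_terms)


def rank_countries(posts, include_terms, exclude_terms):
    """Rank countries by their number of matching posts, highest first."""
    counts = Counter(
        country
        for country, text in posts
        if matches(text, include_terms, exclude_terms)
    )
    return counts.most_common()


# Hypothetical sample data for a "confined spaces" use case.
posts = [
    ("India", "Two workers overcome by fumes in a confined space"),
    ("UK", "Confined space entry permit breached at refinery"),
    ("India", "Confined space fatality during sewer cleaning"),
    ("UK", "New video game features confined space puzzles"),
]
ranking = rank_countries(posts, ["confined space"], ["video game"])
```

Here the exclude term removes the off-topic gaming post before countries are counted, which is the kind of honing the excludes are meant to achieve; a production system would likely need tokenisation and phrase matching rather than substring tests.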