What is text mining?

Many organisations hold a huge volume of data, which may be numeric (perhaps measurements from scientific instruments, or financial book-keeping data) or written text. Numeric data can be readily analysed by computer using standard statistical techniques, which were devised for data expressed as numbers. In contrast, concepts expressed in language are easily understood by a human but not by a computer, which makes them difficult to analyse automatically. If a collection of reports and other documents held by an organisation is too large for a human to read and analyse, computer techniques are needed to extract useful information, summarise it, and make it available for further analysis.

Text mining refers to a range of methods that allow computers to extract information from text, for example to identify concepts such as activities, equipment, and places, and to explore the relationships between those concepts. Text mining goes much further than simple keyword searches and requires the computer to keep track of the meaning of words. For example, if you are investigating occupational causes of back pain, you want to analyse only those documents that use the word ‘back’ to refer to the rear torso, and to exclude uses such as ‘she came back into the room’ or ‘he stepped back’. Moreover, text mining techniques can go beyond analysing individual documents by making links between concepts, identifying common topics, and automatically summarising content across multiple documents. By analysing a large collection of documents, we can gather evidence and gain insights that are unlikely to be discovered by manually reading a smaller sample of reports.
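The difference between a keyword search and a search that takes some account of meaning can be illustrated with a toy Python sketch. It is not a description of any particular text mining system: the cue-word list, the two-word window, and the function names are all assumptions made for this illustration, and practical tools use far richer language models than a fixed list of neighbouring words.

```python
import re

# Hypothetical cue words suggesting that 'back' refers to the body part.
# This list is an assumption for illustration only.
ANATOMY_CUES = {"pain", "injury", "strain", "lower", "upper", "spine", "muscle"}


def keyword_match(text):
    """Plain keyword search: true whenever the word 'back' appears."""
    return re.search(r"\bback\b", text, flags=re.IGNORECASE) is not None


def anatomical_match(text, window=2):
    """Crude sense filter: 'back' must have a cue word within `window` words."""
    words = re.findall(r"[a-z]+", text.lower())
    for i, word in enumerate(words):
        if word == "back":
            nearby = words[max(0, i - window): i + window + 1]
            if ANATOMY_CUES & set(nearby):
                return True
    return False


documents = [
    "The operator reported lower back pain after lifting crates.",
    "She came back into the room.",
    "He stepped back from the machine.",
]

# The keyword search keeps every document that contains 'back'.
keyword_hits = [d for d in documents if keyword_match(d)]

# The sense filter keeps only documents where 'back' sits near a cue word.
anatomical_hits = [d for d in documents if anatomical_match(d)]

print(len(keyword_hits), "keyword hits;", len(anatomical_hits), "anatomical hits")
```

On these three made-up sentences the keyword search returns all of them, while the context filter keeps only the sentence about back pain. A fixed cue list like this would miss many genuine cases and wrongly accept others, which is one reason text mining relies on more capable methods for tracking word meaning.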

Why is text mining important for health and safety? 

Much of the valuable information relating to health and safety is recorded in written documents, from work schedules and machinery maintenance contracts held by stakeholders, to accident investigation reports and safety guidance written by regulators. Beyond its value as an archive of previous events, this vast source of data can provide an evidence base for making informed decisions about health and safety, such as the design of effective regulatory interventions, the identification of leading indicators of risk, and more efficient resourcing of inspection activity based on predictive analytics. Applications based on the extracted information might also be used by stakeholders to improve their risk assessments, to benchmark the health and safety performance of their organisation, to avoid unsafe working practices, or to develop safer equipment for use within their sector.

Written by Tim Yates, the Health and Safety Executive's technical lead for the text mining project