Express Pharma

Data cleaning and harmonisation: A catalyst for innovation

Messy data remains a persistent challenge in modern drug discovery. The complementary processes of data cleaning and data harmonisation play a pivotal role in optimising workflows, each addressing distinct but interconnected challenges


Scientific discovery has always relied on good, clean data. Bad data leads to irreproducible outcomes, problematic solutions, and, ultimately, the need to revisit and acquire better data. Now that AI-driven predictive models are becoming more commonplace in early drug discovery workflows, the importance of data that is accurate and consistent across entire research workflows has never been greater.

Messy data remains a persistent challenge in modern drug discovery, causing scientists to spend considerable time resolving inconsistencies in entity naming, catching mismatched database formats, and correcting errors. Machines struggle to handle the nuances of complex datasets because they lack the contextual understanding needed to interpret ambiguity and inconsistency. Human scientists are frequently tasked with identifying and then rectifying issues hidden in data that technology is not sufficiently equipped to address.

For drug discovery teams, messy data creates a ripple effect, leading researchers to design experiments and build models based on flawed assumptions or incomplete information, ultimately wasting valuable resources. This is where the complementary processes of data cleaning and data harmonisation play a pivotal role in optimising workflows, each addressing distinct but interconnected challenges.

◆ Data cleaning: The process of having experts identify and correct errors, fill in missing values, and remove irrelevant information to ensure accuracy and reliability within individual datasets 

◆ Data harmonisation: The human integration of information from multiple sources, standardising and unifying it into a cohesive framework to enable seamless analysis, comparability, and collaboration.

The key to successful cleaning and harmonisation lies in human curation.
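As a minimal illustration of what dataset-level cleaning involves, the sketch below works on hypothetical assay records (the compound names, values, and fields are invented for illustration): it fills a missing unit with a default, drops rows with missing or implausible measurements, and removes exact duplicates.

```python
def clean_records(records, default_unit="nM", max_value=1e6):
    """Sketch of basic dataset cleaning: fill gaps, drop bad rows, dedupe."""
    seen = set()
    cleaned = []
    for rec in records:
        # Fill in a missing unit with the dataset default.
        unit = rec.get("unit") or default_unit
        value = rec.get("value")
        # Remove rows with missing or implausible measurements.
        if value is None or not (0 < value < max_value):
            continue
        # Drop exact duplicates.
        key = (rec["compound"], value, unit)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"compound": rec["compound"], "value": value, "unit": unit})
    return cleaned

raw = [
    {"compound": "CPD-1", "value": 42.0, "unit": "nM"},
    {"compound": "CPD-1", "value": 42.0, "unit": "nM"},   # duplicate
    {"compound": "CPD-2", "value": None, "unit": "nM"},   # missing value
    {"compound": "CPD-3", "value": 17.5, "unit": None},   # missing unit
]
print(clean_records(raw))  # two clean rows survive
```

Real curation pipelines add many more checks (unit conversion, range rules per assay type, provenance tracking), but the shape is the same: each rule encodes a judgment a human curator would otherwise make by hand.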

Data cleaning: The critical first step
Before scientists can harmonise data, they must clean it. By eliminating the errors that arise from data entry, miscalculations, sensor malfunctions, or system glitches, scientists can build the data foundation required for harmonisation. Teams that rely on properly cleaned data gain:

◆ Improved accuracy, ensuring that the findings drawn from this data are reliable and reproducible, leading to high-quality drug candidates.

◆ Increased efficiency, reducing the time spent on troubleshooting and re-running analyses due to errors.

◆ Enhanced collaboration, facilitating better teamwork between teams, institutions, and industries by eliminating discrepancies between datasets.

◆ Refined predictive power, using clean data to form the foundation for predictive models that can accurately forecast drug-target interactions, disease relationships, and more.

Harmonisation without proper data cleaning is like building on quicksand—fragile and unsustainable. When scientists clean data, they ensure that only the most relevant, reliable information is incorporated into downstream processes, providing an accurate view of the research landscape and a sturdy data foundation.

A structured approach for successful data harmonisation
Data harmonisation begins with establishing authority constructs and naming standards. This painstaking work ensures that entities such as proteins are uniformly named and categorised across all sources and datasets, enabling reliable target identification in drug development.
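A simple way to picture this naming work is a curated synonym table that maps variant target names to one canonical entry. The sketch below is a small, hand-built example for EGFR (whose historical synonyms include ERBB1 and HER1); real authority files are far larger and maintained by curators.

```python
# Hand-curated synonym table: variant names -> canonical target name.
CANONICAL_TARGETS = {
    "egfr": "EGFR",
    "erbb1": "EGFR",                                  # historical gene symbol
    "her1": "EGFR",                                   # alternative receptor name
    "epidermal growth factor receptor": "EGFR",      # full protein name
}

def normalise_target(name):
    """Map a target name to its canonical form; pass unknowns through."""
    key = name.strip().lower()
    return CANONICAL_TARGETS.get(key, name.strip())

print(normalise_target("ErbB1"))  # EGFR
print(normalise_target("BRAF"))   # unknown names pass through unchanged
```

Passing unknown names through unchanged, rather than guessing, is deliberate: unmatched entries are exactly the cases that get routed to a human curator.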

The next phase in harmonisation is substance linking, in which scientists identify and connect references to the same chemical substance across disparate datasets or databases. This process unifies different substance representations, synonyms, and identifiers into a single, consistent entity. This effort is essential to pharmacology and drug discovery, where varying conventions across sources often result in the same compound being described in different ways.
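One way to sketch substance linking is to treat source records that share any identifier (a trade name, systematic name, or registry number) as referring to the same entity, and merge them transitively, for example with a union-find over identifiers. The records below are illustrative (aspirin, its systematic name acetylsalicylic acid, and its CAS Registry Number 50-78-2).

```python
class UnionFind:
    """Minimal union-find: merge identifiers into connected groups."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def link_substances(records):
    """Merge records that share any identifier into single entities."""
    uf = UnionFind()
    for ids in records:
        for other in ids[1:]:
            uf.union(ids[0], other)
    groups = {}
    for ids in records:
        groups.setdefault(uf.find(ids[0]), set()).update(ids)
    return list(groups.values())

# Three source records; the first two overlap on "acetylsalicylic acid".
records = [
    ["aspirin", "acetylsalicylic acid"],
    ["acetylsalicylic acid", "CAS 50-78-2"],
    ["ibuprofen"],
]
print(link_substances(records))  # two linked entities
```

The hard part in practice is not the merging but deciding which identifiers are trustworthy links, which is where human curation earns its keep: one wrong synonym can transitively fuse two distinct compounds.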

Further along in the harmonisation process, data scientists identify and manage exact and related documents, preventing data duplication and ensuring that only the most relevant information is retained. The final step focuses on ensuring consistency of data definitions across datasets, which is essential for producing a cohesive foundation built with data from multiple sources.
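Exact-duplicate detection, the simplest of these steps, can be sketched as hashing a normalised form of each document and keeping only the first occurrence per hash; detecting merely *related* documents needs fuzzier matching, which this minimal example omits.

```python
import hashlib

def dedupe_documents(docs):
    """Keep the first document per normalised-text fingerprint."""
    seen, kept = set(), []
    for doc in docs:
        # Normalise case and whitespace before hashing, so trivial
        # formatting differences do not hide an exact duplicate.
        fingerprint = hashlib.sha256(
            " ".join(doc.lower().split()).encode("utf-8")
        ).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(doc)
    return kept

docs = [
    "Kinase inhibitors in oncology.",
    "Kinase  inhibitors in ONCOLOGY.",   # same text, different formatting
    "GPCR signalling pathways.",
]
print(len(dedupe_documents(docs)))  # 2
```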

Drug discovery teams can navigate complex data landscapes confidently by implementing a harmonisation workflow powered by human curation, resulting in reliable datasets primed for advanced analytics and predictive modelling. This systematic approach minimises errors and ensures that all subsequent research is built on a solid, clean data foundation.

Harmonised data enhances the accuracy of predictive models
One of the most tangible benefits of data harmonisation is its impact on predictive models. To demonstrate the positive effect on prediction accuracy, CAS scientists used a newly harmonised dataset to retrain an existing ensemble model that predicts the activity of a ligand-target pair.

The retrained model demonstrated significant accuracy improvements, reducing the standard deviation between predicted and experimental results by 23 per cent and decreasing the discrepancy in predicted versus experimental ligand-target interactions by 56 per cent. By normalising the target name and improving substance linking, scientists enhanced the data to describe the relationship between a substance and its target more consistently and accurately.

This predictive modelling underscores the essential role of human data harmonisation in refining model performance. By identifying and focusing on the most promising candidates earlier in the screening process, teams may move faster through the hit-to-lead phase and proceed with development and trials.

Harmonised data fuels advanced analytics
Data harmonisation also optimises predictive models and advanced analytical tools like knowledge graphs and interaction networks that drive innovative drug discovery workflows. These tools help researchers explore relationships between targets, substances, and biological pathways to identify disease associations and new therapeutic modalities.

A unified, human-curated data foundation allows scientists to trace complex interactions across various biological levels, such as gene expression, protein interactions, and metabolic pathways, providing insights otherwise obscured by fragmented data sources.

This approach improves the precision of drug discovery and accelerates the identification of potential drug repurposing opportunities, as it reveals hidden connections between established compounds and emerging therapeutic targets.

Human curation forms the foundation of innovation 

Without the ability to appreciate contextual nuances, machines struggle to handle the ambiguity and inconsistencies inherent in biological datasets appropriately. Skilled professionals play a vital role by recognising subtle variations, resolving errors, and aligning data to ensure accuracy and relevance in ways that automated systems cannot.

This process is vital for those doing scientific research and organisations providing related services. For example, hundreds of CAS scientists clean, harmonise, and curate the data used to build the CAS Content Collection™, the world's largest collection of human-curated scientific knowledge.

This effort pays off with enhanced reliability of downstream analyses and the acceleration of discovering potential drug targets and effective disease treatments. When applied to predictive models and advanced tools like network diagrams, human-curated, harmonised data drives breakthroughs in the life sciences and beyond. As organisations continue to prioritise data cleaning in research, they can ensure the quality of their findings and accelerate innovation.


