Skip to Main Content

Research Data Management

Make informed choices for research data. RDM, policy, practical guidelines, software and tools at VU Amsterdam. FAIR data, archiving, storage, publication

Data Cleaning

The process of detecting and correcting (or removing) corrupt or inaccurate information or records, is called data cleaning. In essence, it refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting this data (Wikipedia). Depending on the type of analysis that is done, different pieces of software can be used to do this data cleaning. More often than not, the same software can also be used to perform the analysis. Licensed software may sometimes also be installed on personal computers or laptops.

Software especially designed to clean re-used data is OpenRefine. It cleans starting and trailing blank spaces in cell field, clusters values based on similarities (e.g. in free text fields: Alphen a/d Rhijn, alfen ad rijn, etc. can be easily clustered), normalise data fields into one standard, etc. See below for several tutorials.

In some cases, researchers write their own scripts (in programming languages such as Python, R or SQL) to clean data, in which case the process must be documented. Researchers should include their scripts when they archive the datasets to allow for replication and verification.

Extra background information:

For every step of your data cleaning, good documentation and clarifying the data provenance is necessary.