
Research Data Management

When you are doing research, good data management practices and transparency are essential. This toolbox provides practical information and guidelines for PhD students and researchers working with research data.

Data cleaning

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records. In essence, it means identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting them (Wikipedia). Depending on the type of analysis, different software can be used for data cleaning; more often than not, the same software can also be used to perform the analysis. Licensed software may sometimes also be installed on personal computers or laptops.
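As a rough illustration of the replace/modify/delete choices described above, the sketch below cleans a few invented records in plain Python. The records, field names, and correction rules are all hypothetical examples, not part of the guide.

```python
records = [
    {"name": "Alice", "age": 34},
    {"name": "", "age": 29},         # incomplete: missing name
    {"name": "Bob", "age": -5},      # inaccurate: impossible age
    {"name": "Carol", "age": None},  # incomplete: missing age
]

def clean(records):
    cleaned = []
    for rec in records:
        if not rec["name"]:         # incomplete record: delete it
            continue
        rec = dict(rec)             # copy, so the raw data stays intact
        if rec["age"] is None:
            rec["age"] = "unknown"  # replace a missing value with a marker
        elif rec["age"] < 0:
            rec["age"] = abs(rec["age"])  # modify an impossible value
        cleaned.append(rec)
    return cleaned
```

Note that the raw records are copied rather than edited in place, so the original data remains available for verification.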

OpenRefine is software designed especially for cleaning re-used data. It removes leading and trailing blank spaces in cell fields, clusters values based on similarity (e.g. in free-text fields, variants such as "Alphen a/d Rhijn" and "alfen ad rijn" can easily be clustered), normalises data fields to one standard, and more. See below for several tutorials.
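The trimming and clustering operations attributed to OpenRefine can be sketched in a few lines of Python. The keying function below approximates OpenRefine's "fingerprint" clustering method (lowercase, strip punctuation, sort unique tokens); the example values are invented variants of the place name quoted above.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Key a value by its lowercased, punctuation-free, sorted tokens."""
    tokens = re.split(r"[^a-z0-9]+", value.strip().lower())
    return " ".join(sorted(set(t for t in tokens if t)))

values = ["  Alphen a/d Rhijn ", "alphen a.d. rhijn",
          "ALPHEN A/D RHIJN", "alfen ad rijn"]

clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v.strip())  # trim, then cluster

# Case, whitespace, and punctuation variants share one key; a genuinely
# different spelling ("alfen ad rijn") stays in its own cluster and would
# need one of OpenRefine's fuzzier nearest-neighbour methods to merge.
```

The fingerprint approach is deliberately conservative: it merges only values that differ in case, spacing, or punctuation, which keeps false merges rare.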

The software offered and licensed by the university currently includes Stata, SPSS, and ATLAS.ti. Some of the software is available for download at:

In some cases researchers write their own scripts to analyse the data, using programming languages such as R, SQL, or Python. Scripts can also be used to clean data; in that case the process must be documented. Researchers should include their scripts when they archive the datasets, to allow for replication and verification.
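A cleaning script of the kind described above can document itself as it runs. This is only a sketch under invented data and step descriptions: each transformation appends a human-readable entry to a log that could be archived alongside the cleaned dataset.

```python
steps = []  # human-readable log, archived alongside the cleaned dataset

def log_step(description):
    steps.append(description)

raw = ["  12.5", "13,0", "", "14.2"]  # messy numeric strings (invented)

data = [v.strip() for v in raw]
log_step("Stripped surrounding whitespace from all values")

data = [v.replace(",", ".") for v in data]
log_step("Replaced decimal commas with decimal points")

non_empty = [v for v in data if v]
log_step(f"Dropped {len(data) - len(non_empty)} empty value(s)")

data = [float(v) for v in non_empty]
log_step("Converted all values to floating-point numbers")
```

Archiving both the script and its step log lets others replicate and verify exactly how the dataset was cleaned.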

Extra background information:

For every step of your data cleaning, good documentation is necessary.