LibGuides: Research Data Management (out of date). Please visit rdm.vu.nl: Data Processing

This LibGuide is being phased out and the information in it is no longer up to date. The new RDM Handbook is now available at https://rdm.vu.nl/

Data cleaning

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records. In essence, it means identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting them (Wikipedia). Different software can be used for data cleaning, depending on the type of analysis; more often than not, the same software can also be used to perform the analysis. Licensed software may sometimes also be installed on personal computers or laptops.

OpenRefine is software designed especially for cleaning re-used data. It removes leading and trailing blank spaces in cell fields, clusters values based on similarity (e.g. free-text entries such as Alphen a/d Rhijn, alfen ad rijn, etc. can easily be clustered), normalises data fields to one standard, and so on. See below for several tutorials.
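To give an idea of what trimming and clustering involve, here is a minimal Python sketch. It is not OpenRefine's own algorithm: it swaps in a simple greedy comparison based on the standard library's `difflib.SequenceMatcher`. The function names `normalise` and `cluster` and the similarity threshold of 0.8 are assumptions made for this illustration.

```python
import difflib
import re
import unicodedata


def normalise(value: str) -> str:
    """Trim, lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation such as "/"
    return re.sub(r"\s+", " ", text).strip()


def cluster(values, threshold=0.8):
    """Greedy clustering: each value joins the first cluster whose
    normalised representative is at least `threshold` similar."""
    clusters = []  # list of (representative key, original members)
    for value in values:
        key = normalise(value)
        for rep_key, members in clusters:
            ratio = difflib.SequenceMatcher(None, key, rep_key).ratio()
            if ratio >= threshold:
                members.append(value)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append((key, [value]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    places = ["  Alphen a/d Rhijn ", "alfen ad rijn", "Amsterdam", "amsterdam  "]
    for group in cluster(places):
        print(group)
```

With this threshold, the two spellings of the first place name end up in one cluster and the two Amsterdam entries in another. A threshold that is too low merges values that are genuinely different, so clusters should always be reviewed by hand before values are overwritten.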

In some cases, researchers write their own scripts (in programming languages such as Python, R or SQL) to clean data, in which case the process must be documented. Researchers should include their scripts when they archive the datasets to allow for replication and verification.

Extra background information:

EMGO Quality Handbook on data cleaning
Making sense of data I: a practical guide to exploratory data analysis and data mining / Glenn J. Myatt, Wayne P. Johnson, 2014 (eBook)
Free your metadata website
Open Refine
- Introduction to Open Refine on the Open Refine website
- Data Carpentry Open Refine website
- Tutorial by the Programming Historian
- Tutorial by Digitalnomad
- Introduction to Digital Humanities with Open Refine

Every step of your data cleaning needs good documentation, including a clear record of the data provenance.
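One way to document cleaning steps when you write your own script is to let the script keep a log of what it did. The sketch below is only an illustration under stated assumptions: the step functions, the `run_pipeline` helper and the log fields are hypothetical names chosen for this example, not a VU-prescribed format.

```python
import json
from datetime import datetime, timezone


def strip_whitespace(rows):
    """Remove leading and trailing spaces from all text values."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def drop_incomplete(rows):
    """Remove records that contain a missing or empty value."""
    return [row for row in rows if all(v not in (None, "") for v in row.values())]


def run_pipeline(rows, steps):
    """Apply each cleaning step in order and record what it changed."""
    log = []
    for step in steps:
        rows_before = len(rows)
        rows = step(rows)
        log.append({
            "step": step.__name__,
            "description": (step.__doc__ or "").strip(),
            "rows_before": rows_before,
            "rows_after": len(rows),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
    return rows, log


if __name__ == "__main__":
    raw = [
        {"id": 1, "city": " Amsterdam "},
        {"id": 2, "city": "  "},
        {"id": 3, "city": None},
    ]
    cleaned, log = run_pipeline(raw, [strip_whitespace, drop_incomplete])
    print(json.dumps(log, indent=2))  # archive this log next to the dataset
```

Archiving the log together with the script and the raw data lets others see which records were changed or removed at each step.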

Data transcription

In many fields it is common to hold interviews or focus group sessions, or to make other observations that are recorded on video or audio. If you need such recordings transcribed, there are several ways to do this. One option is to transcribe by hand, although this is very time-consuming.

Another option is to pay a transcription service to make the transcription or to use specialised software. The VU has drawn up processing agreements with one transcription service, Transcript Online, and one transcription software service, Amberscript.

The VU Library page has more information on what these transcription options do, how they work, how much they cost, and how they can be used.

Anonymisation/Pseudonymisation

When you process personal data, you as a researcher must make sure that any personal data collected from human subjects is handled in accordance with the EU General Data Protection Regulation (GDPR). Anonymisation and pseudonymisation are two ways to make it harder to identify individuals from personal data; in other words, they allow you to de-identify personal data.
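As a rough illustration of pseudonymisation, the Python sketch below replaces a direct identifier with a keyed hash (HMAC-SHA256) and drops fields that identify a person directly. This is one common technique and an assumption of this example; it is not the method used by any specific tool. The function names, the field names and the 16-character pseudonym length are hypothetical. Note that pseudonymised data still counts as personal data under the GDPR, because anyone holding the key can re-link the records.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for an identifier using a keyed hash."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability (assumption)


def pseudonymise_records(records, id_field, secret_key, drop_fields=()):
    """Replace `id_field` with a pseudonym and remove directly identifying fields."""
    result = []
    for record in records:
        kept = {k: v for k, v in record.items() if k not in drop_fields}
        kept[id_field] = pseudonymise(str(record[id_field]), secret_key)
        result.append(kept)
    return result


if __name__ == "__main__":
    # Hypothetical key: in practice, store it securely and separately from the data.
    key = b"replace-with-a-long-random-secret"
    participants = [
        {"participant_id": "P001", "name": "Jane Doe", "score": 7},
        {"participant_id": "P002", "name": "John Roe", "score": 5},
    ]
    for row in pseudonymise_records(participants, "participant_id", key, drop_fields=("name",)):
        print(row)
```

The same identifier always maps to the same pseudonym under the same key, so records from different files can still be linked without exposing the original identifier.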

Various online tools can help with these processes. The VU recommends Amnesia as one of the tools to assist in the anonymisation/pseudonymisation of data.

VU Amsterdam is preparing a decision guide on anonymisation and pseudonymisation.

You can find more information here on how Amnesia works, how it can be used and how you can make your data compliant with the GDPR.