6.2 Data cleaning: Quality assurance and control
When we collect and aggregate data from different sources, such as data repositories, data quality issues like duplicate records, empty values, spelling inconsistencies and inconsistent formats inevitably appear. In addition, when sharing data it is important to prepare it in a consistent format, as discussed in the previous section ('Data cleaning: Tidy data formats'). This section applies to both quantitative and qualitative data, mainly in tabular form. In the second part of this section, we will also discuss tools that help you transform text data into tabular form.
After you have structured your data table following the tidy data format, you can use several techniques to ensure that the data is free of errors. These fall into two categories: techniques applied before entering data (quality assurance) and techniques applied after entering data to check for errors (quality control).
It is important to keep in mind that different datasets will have different data cleaning needs. However, no matter what kind of data you are working on, there is one Golden Rule: never make changes to your original data file. Always store the original file and make your changes in a copy, saved with an appropriate name and version number.
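As a minimal illustration in Python (the file names are hypothetical), the copy step itself can be scripted so that the original is never edited by hand:

```python
import shutil

# Keep the original file untouched; all cleaning happens in a
# clearly named, versioned working copy.
shutil.copy("observations_original.csv", "observations_v2_working_copy.csv")
```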
Quality assurance
Some techniques to implement and explore when entering data:
Data validation – Most spreadsheet applications allow you to set up rules for the data entered. This way you can make sure that only values within a certain range are accepted and that words are spelled correctly. This reduces the chance of entering erroneous values when you are collecting data in haste.
Null values – e.g. 0, -999, 999, blank, NA, NULL, None, No Data. Null values represent missing values, and it is essential to use a clearly defined and consistent null indicator. There might be different reasons why the data is missing, and this is important information to capture. However, do not capture it in the same column. If appropriate, create a new column like 'data_missing' and use that column to record the reason. See also section 5 of White et al., 2013.
Data formats – Pay attention to how your spreadsheet application handles date formats. Dates are often misread when datasets are imported from different sources; see the sketch after this list for one way to make date handling explicit.
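How these quality assurance points might look in code is sketched below, assuming Python with the pandas library and hypothetical file and column names:

```python
import pandas as pd

# Hypothetical file and column names; adapt them to your own dataset.
df = pd.read_csv(
    "observations_v2_working_copy.csv",
    # Map all agreed-upon null indicators to a single missing-value marker.
    na_values=["-999", "999", "NA", "NULL", "None", "No Data"],
)

# Parse dates explicitly instead of relying on automatic detection,
# which often confuses day and month order between sources.
df["sampling_date"] = pd.to_datetime(df["sampling_date"], format="%Y-%m-%d")

# A simple validation rule: flag values outside an accepted range.
out_of_range = df[(df["temperature_c"] < -50) | (df["temperature_c"] > 60)]
print(f"{len(out_of_range)} rows fall outside the accepted temperature range")
```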
Quality control
Some techniques to consider and explore when checking your data for errors:
Redundant data – Identify and potentially remove irrelevant observations (e.g. data points that do not fit the specific problem you are trying to solve) and duplicate entries. However, always keep in mind that observations that are irrelevant to you might be of interest to others. It is therefore important to consider archiving these kinds of data points as well.
Deviating and missing values – Identify deviating and missing data entries. Should these entries be corrected, or can they be removed? At the same time, be careful before dismissing data points as outliers: an extreme value may be a genuine observation rather than an error.
Text errors – These include grammatical errors, inconsistent use of upper and lower case, inconsistent column titles and overlapping naming of variables and values.
Separate values – When more than one value is entered into a single column, split it into separate columns; the sketch after this list shows how these checks can be scripted.
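As with quality assurance, these checks can be scripted. The following sketch (again pandas, with hypothetical column names) shows one possible approach to each point:

```python
import pandas as pd

df = pd.read_csv("observations_v2_working_copy.csv")

# Redundant data: archive exact duplicate rows before dropping them,
# since they might still be of interest to others.
df[df.duplicated()].to_csv("removed_duplicates.csv", index=False)
df = df.drop_duplicates()

# Deviating and missing values: summary statistics and missing-value
# counts make unusual entries easy to spot before deciding what to do.
print(df.describe())
print(df.isna().sum())

# Text errors: normalise stray whitespace and inconsistent casing.
df["species"] = df["species"].str.strip().str.lower()

# Separate values: split a combined column such as "A-03" into
# dedicated 'site' and 'plot' columns.
df[["site", "plot"]] = df["site_plot"].str.split("-", expand=True)
```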
Tools
Although many of the above steps can be performed in, for example, Excel and LibreOffice Calc, there are software tools that are better suited to the task. These include the R environment and the Python programming language, which will be introduced in more detail in Section 6.4 'Data analysis: Analysis tools'.
If you prefer a graphical interface that does not require scripting, then OpenRefine might be a tool to explore. OpenRefine can be used for quantitative and qualitative data in spreadsheet form, and is used in a wide range of disciplines, for example in the social sciences, ecology and history. No matter the discipline, OpenRefine can help you get an overview of large datasets, identify and correct inconsistencies, and extend your data by combining datasets or fetching online data. It also allows you to automate the process by saving a script that can be rerun on new datasets. Make sure to save this script as part of the dataset; this, in turn, will make your work more replicable. See also Section 6.4 'Data analysis: Analysis tools'.
If you are working with sensitive data, it might be reassuring to know that OpenRefine uses your web browser as a graphical interface, but the software runs only locally. The data will therefore not leave your computer unless you want it to.
If you are not working with tabular data, OpenRefine can still help you parse text files from web sources and transform them into tabular form. However, it cannot parse other types of non-numeric or unstructured data. In that case, CAQDAS tools (Computer-Assisted Qualitative Data Analysis Software) such as NVivo might be a better solution.
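If you prefer to stay in a scripting environment for the web-parsing step, pandas offers comparable functionality for simple cases: it can read HTML tables from a web page directly into tabular form. The URL below is a placeholder:

```python
import pandas as pd

# Fetch every <table> element from a web page as a list of DataFrames.
# Hypothetical URL; requires an HTML parser such as lxml to be installed.
tables = pd.read_html("https://example.org/field-stations.html")
print(f"Found {len(tables)} tables; first table preview:")
print(tables[0].head())
```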
The videos below will introduce you to OpenRefine, previously known as Google Refine.
Explore and combine data with OpenRefine:
"Google Refine 2.0 - Introduction (1 of 3) (video version 2)". YouTube, uploaded by GoogleRefine, 19 July 2011, https:// www. youtube.com/watch?v=B70J_H_zAWM. Permissions: YouTube Terms of Service
Clean and transform data with OpenRefine, including parsing text files into tabular form:
"Google Refine 2.0 - Data Transformation (2 of 3)". YouTube, uploaded by GoogleRefine, 19 July 2011, https:// www. youtube.com/watch?v=cO8NVCs_Ba0&t=1s. Permissions: YouTube Terms of Service
Additionally, there is plenty of excellent learning material available online to help you learn at least the basics of the tool.
Lessons learned
- Use quality assurance and quality control techniques to ensure that your data is in a consistent format and free of errors.
- Never make changes to your original data file.
- Keep track of the changes made during the cleaning process.
Food for thought
- Which data cleaning steps might apply to your data?
- How can your data benefit from using a tool like OpenRefine?
- If you are not working with tabular data: What tools might be useful for you to clean your data? Are there any free open-source alternatives available?