6.2 Data cleaning: Quality assurance and control
When we collect and aggregate data from different sources, such as data repositories, data quality issues like duplicate records, empty values, spelling inconsistencies and inconsistent formats inevitably appear. In addition, when sharing data it is important to prepare it in a consistent format, as discussed in the previous section ('Data cleaning: Tidy data formats'). This section applies to both quantitative and qualitative data, mainly in tabular form. In the second part of this section, we will also discuss tools that help you transform text data into tabular form.
After you have structured your data table following the tidy data format, you can use several techniques to ensure that the data is free of errors. These fall into two categories: techniques applied before entering data (quality assurance) and techniques applied after entering data to check for errors (quality control).
It is important to keep in mind that different datasets will have different data cleaning needs. However, no matter what kind of data you are working on, there is one Golden Rule: never make changes to your original data file. Always store the original file and make your changes in a copy, saved with an appropriate name and version number.
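As a minimal illustration in Python (the file names are hypothetical), the copy step itself can be scripted so that the original is never edited by hand:

```python
import shutil

# Keep the original file untouched; all cleaning happens in a
# clearly named, versioned working copy.
shutil.copy("observations_original.csv", "observations_v2_working_copy.csv")
```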
Quality assurance
Some techniques to implement and explore when entering data:
Data validation – Most spreadsheet applications allow you to set up rules for the data entered. This way you can make sure that only values within a certain range are accepted and that words are spelled correctly. This reduces the chance of entering erroneous values when you are collecting data in haste.
Null values – e.g. 0, -999, 999, blank, NA, NULL, None, No Data. Null values represent missing values, and it is essential to use a clearly defined and consistent null indicator. There might be different reasons why the data is missing, and this is important information to capture. However, do not capture it in the same column. If appropriate, create a new column like 'data_missing' and use that column to record the reason. See also section 5 of White et al., 2013.
Data formats – Pay attention to how your spreadsheet application handles date formats. Dates are often misread when datasets are imported from different sources; see the sketch after this list for one way to make date handling explicit.
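How these quality assurance points might look in code is sketched below, assuming Python with the pandas library and hypothetical file and column names:

```python
import pandas as pd

# Hypothetical file and column names; adapt them to your own dataset.
df = pd.read_csv(
    "observations_v2_working_copy.csv",
    # Map all agreed-upon null indicators to a single missing-value marker.
    na_values=["-999", "999", "NA", "NULL", "None", "No Data"],
)

# Parse dates explicitly instead of relying on automatic detection,
# which often confuses day and month order between sources.
df["sampling_date"] = pd.to_datetime(df["sampling_date"], format="%Y-%m-%d")

# A simple validation rule: flag values outside an accepted range.
out_of_range = df[(df["temperature_c"] < -50) | (df["temperature_c"] > 60)]
print(f"{len(out_of_range)} rows fall outside the accepted temperature range")
```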
Quality control
Some techniques to consider and explore when checking your data for errors:
Redundant data – Identify and potentially remove irrelevant observations (e.g. data points that do not fit the specific problem you are trying to solve) and duplicate entries. However, always keep in mind that observations that are irrelevant to you might be of interest to others. It is therefore important to consider archiving these kinds of data points as well.
Deviating and missing values – Identify deviating and missing data entries. Should these entries be corrected, or can they be removed? At the same time, be careful before dismissing data points as outliers: an extreme value may be a genuine observation rather than an error.
Text errors – These include grammatical errors, inconsistent use of upper and lower case, inconsistent column titles and overlapping naming of variables and values.
Separate values – When more than one value is entered into a single column, split it into separate columns; the sketch after this list shows how these checks can be scripted.
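As with quality assurance, these checks can be scripted. The following sketch (again pandas, with hypothetical column names) shows one possible approach to each point:

```python
import pandas as pd

df = pd.read_csv("observations_v2_working_copy.csv")

# Redundant data: archive exact duplicate rows before dropping them,
# since they might still be of interest to others.
df[df.duplicated()].to_csv("removed_duplicates.csv", index=False)
df = df.drop_duplicates()

# Deviating and missing values: summary statistics and missing-value
# counts make unusual entries easy to spot before deciding what to do.
print(df.describe())
print(df.isna().sum())

# Text errors: normalise stray whitespace and inconsistent casing.
df["species"] = df["species"].str.strip().str.lower()

# Separate values: split a combined column such as "A-03" into
# dedicated 'site' and 'plot' columns.
df[["site", "plot"]] = df["site_plot"].str.split("-", expand=True)
```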
Tools
Although many of the above steps can be performed in, for example, Excel and LibreOffice Calc, there are software tools that are better suited to the task. These include the R environment and the Python programming language, which will be introduced in more detail in Section 6.4 'Data analysis: Analysis tools'.
If you prefer a graphical interface that does not require scripting, then OpenRefine might be a tool to explore. OpenRefine can be used for quantitative and qualitative data in spreadsheet form, and is used in a wide range of disciplines, for example in the social sciences, ecology and history. No matter the discipline, OpenRefine can help you get an overview of large datasets, identify and correct inconsistencies, and extend your data by combining datasets or fetching online data. It also allows you to automate the process by saving a script that can be rerun on new datasets. Make sure to save this script as part of the dataset; this, in turn, will make your work more replicable. See also Section 6.4 'Data analysis: Analysis tools'.
If you are working with sensitive data, it might be reassuring to know that OpenRefine uses your web browser as a graphical interface, but the software runs only locally. The data will therefore not leave your computer unless you want it to.
If you are not working with tabular data, OpenRefine can still help you parse text files from web sources and transform them into tabular form. However, it cannot parse other types of non-numeric or unstructured data. In that case, CAQDAS tools (Computer-Assisted Qualitative Data Analysis Software) such as NVivo might be a better solution.
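If you prefer to stay in a scripting environment for the web-parsing step, pandas offers comparable functionality for simple cases: it can read HTML tables from a web page directly into tabular form. The URL below is a placeholder:

```python
import pandas as pd

# Fetch every <table> element from a web page as a list of DataFrames.
# Hypothetical URL; requires an HTML parser such as lxml to be installed.
tables = pd.read_html("https://example.org/field-stations.html")
print(f"Found {len(tables)} tables; first table preview:")
print(tables[0].head())
```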
The videos below will introduce you to OpenRefine, previously known as Google Refine.
Explore and combine data with OpenRefine:
"Google Refine 2.0 - Introduction (1 of 3) (video version 2)". YouTube, uploaded by GoogleRefine, 19 July 2011, https:// www. youtube.com/watch?v=B70J_H_zAWM. Permissions: YouTube Terms of Service
Clean and transform data with OpenRefine, including parsing text files into tabular form:
"Google Refine 2.0 - Data Transformation (2 of 3)". YouTube, uploaded by GoogleRefine, 19 July 2011, https:// www. youtube.com/watch?v=cO8NVCs_Ba0&t=1s. Permissions: YouTube Terms of Service
Additionally, there is plenty of excellent learning material available online to help you learn at least the basics of the tool.
Lessons learned
- Use quality assurance and quality control techniques to ensure that your data is in a consistent format and free of errors.
- Never make changes to your original data file.
- Keep track of the changes made during the cleaning process.
Food for thought
- Which data cleaning steps might apply to your data?
- How can your data benefit from using a tool like OpenRefine?
- If you are not working with tabular data: What tools might be useful for you to clean your data? Are there any free open-source alternatives available?