6.0 About: Data cleaning, analysis and visualisation
Analysing your data is crucial for finding trends and patterns in it, and for obtaining the information that you need to answer your research questions and meet your objectives.
This section mainly applies to data in tabular or spreadsheet form, as the analysis and visualisation tools discussed here take tabular data as input. However, the basic principles of quality assurance and control (section 7.2), data analysis (section 7.3) and data visualisation (sections 7.4 to 7.9) also apply to those of you working with other types of data, e.g. textual data.
Before you can analyse your data, you must make sure your spreadsheets are well organised, with valid entries and manageable variables. In the first part of this section, you will learn about best practices for organising spreadsheets and the basics of cleaning data.
In the second part of this section, you will learn how to choose useful tools to analyse your data. It is not our intent to tell you how to proceed in your analysis, as this depends on the type of data you have collected, the methods you have used, and the research question that forms the basis of your research project. Our goal is rather to put you on a good track when selecting those tools.
Visualisation is the first critical step in interpreting your data, and can have a huge impact on your data analysis and results. It can also be very useful for communicating the information that your data holds. In the third and final part of this section, you will learn the basics of good data visualisation, the different plot types, and useful tools.
While working through this section, keep reproducibility in mind. Make your research workflow available to others so that they can redo the analysis and see whether they reach the same results as those claimed in your outputs. Remember, therefore, to share your materials, methods, and workflow in addition to your data.
For the purpose of reproducibility, you will probably need to learn some scripting in Python or R. Our aim is not to teach those programming languages. Instead, we hope to give you a starting point from which to explore such tools. In addition to making research reproducible, such tools will dramatically speed up data analysis and visualisation tasks and make the data available for data mining. Again, these tools are mainly useful for statistically solid, quantitative data in tabular form. Good basic knowledge of statistics is mandatory, and no tool can compensate for a lack of it. However, these tools are based on principles that are relevant for all types and forms of data, including having a transparent and reproducible workflow and continuously keeping track of changes made to your data.
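As an illustration of what a scripted workflow that keeps track of changes might look like, here is a minimal Python sketch. It is not a recommended procedure: the data, the column names (`site`, `species`, `count`) and the function `clean_rows` are all hypothetical, and the cleaning rules are assumptions chosen only to show the idea that every change to the data is made by code and recorded in a log, while the raw data stay untouched.

```python
import csv
import io

# Hypothetical raw data, inlined so that the sketch is self-contained.
RAW_CSV = """site,species,count
A,Parus major,12
A,parus major ,7
B,Parus major,n/a
"""


def clean_rows(rows):
    """Return cleaned copies of the rows plus a log of every change made."""
    cleaned, log = [], []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        new_row = dict(row)  # work on a copy; the raw row is left as it was
        # Assumed rule: trim whitespace and normalise capitalisation.
        species = row["species"].strip().capitalize()
        if species != row["species"]:
            log.append(f"line {line_no}: species {row['species']!r} -> {species!r}")
            new_row["species"] = species
        # Assumed rule: non-numeric counts are marked missing, not guessed.
        if not row["count"].strip().isdigit():
            log.append(f"line {line_no}: count {row['count']!r} -> missing")
            new_row["count"] = ""
        cleaned.append(new_row)
    return cleaned, log


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
cleaned, change_log = clean_rows(rows)
for entry in change_log:
    print(entry)
```

Because the rules live in a script rather than in manual spreadsheet edits, someone else can rerun the same steps on the same raw file and check whether they arrive at the same cleaned data.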
After working through this section, you should:
- Be familiar with the notion of the Tidy Data format and data cleaning.
- Know why it is important to have a transparent and reproducible workflow, and to keep track of changes in your data.
- Know how to proceed in the selection of reproducible data analysis tools.
- Understand the design process and the design principles of data visualisation.
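To give a first impression of the Tidy Data format mentioned in the first objective, the sketch below reshapes a small "wide" table, with one column per year, into a "long" table in which each row holds one observation and each column one variable. The table contents and the helper `to_long` are hypothetical and written in plain Python for illustration only; in practice you would more likely use a dedicated function from a data analysis library.

```python
# Hypothetical "wide" table: the year, which is a variable, is spread over columns.
wide = [
    {"site": "A", "2021": 12, "2022": 15},
    {"site": "B", "2021": 8, "2022": 9},
]


def to_long(rows, id_column, variable_name, value_name):
    """Turn every non-identifier column into its own row (one observation per row)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on each new row instead
            long_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return long_rows


tidy = to_long(wide, id_column="site", variable_name="year", value_name="count")
for record in tidy:
    print(record)
```

In the resulting table, `site`, `year` and `count` are each a column of their own, which is the layout that most tabular analysis and plotting tools expect as input.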