Replicable data analysis is an important element of reproducible and replicable research: it allows others to repeat the analyses of our data. This requires that we share not only our data, but also our analysis code and information about the system on which the analysis was run.
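
For example, a few lines of Python can record basic system information alongside the results. This is a minimal sketch; the output file name is an assumption:

    import sys
    import platform

    # Record basic information about the system the analysis ran on.
    with open("system_info.txt", "w") as f:
        f.write(f"Python: {sys.version}\n")
        f.write(f"Platform: {platform.platform()}\n")

In R, the built-in sessionInfo() function serves a similar purpose.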

It is important to realise that there is no one-size-fits-all software package, analysis tool or method that suits every scientific discipline or kind of data. In addition, new analysis methods and tools continuously emerge. However, there are some important points to keep in mind when choosing a tool to analyse our data:

1) Free open-source software
If we want others to be able to reproduce our data analyses, they will need the software packages or tools that we have used. By choosing free and open-source software, we not only make our analyses more accessible, but also allow others to reproduce them without having to buy software. Some examples of such software and tools are discussed below.

2) Automated analyses
Software packages or tools that allow for scripted analyses are preferable to software that only allows manual steps, since a scripted process is easier to document. Even when using graphical tools, however, it may be possible to document the analysis steps in the form of a script, program or file. The key is to make the analyses reproducible.

Scripted analyses also make it easier to update statistical results when you add data to a dataset or revise it. For example, say we analysed our data after collecting 50 data points and then find we need another 50 data points to reach sufficient statistical power. In that case, a scripted analysis lets us update our results simply by re-running the script after adding the new data points, as in the sketch below. This not only makes our workflow more efficient, but may also allow others to adapt our analyses at a later point in time.
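
As a minimal sketch of this idea in Python (the file name and column name are hypothetical), the script below recomputes the same summary statistics every time it is run, so adding rows to the data file automatically updates the results:

    import csv
    import statistics

    # Read all measurements from the (hypothetical) data file.
    with open("measurements.csv", newline="") as f:
        values = [float(row["value"]) for row in csv.DictReader(f)]

    # Re-running the script reports updated results whenever the file grows.
    print(f"n = {len(values)}")
    print(f"mean = {statistics.mean(values):.2f}")
    print(f"sd = {statistics.stdev(values):.2f}")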

Another advantage of automated analyses is that they can make it easier to detect mistakes early and less likely that inconsistencies are introduced, because a script cannot forget to apply a step to one file among dozens or hundreds of input files, the way a person might.
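
For instance, a script can apply exactly the same processing step to every input file, which is error-prone to do by hand. This is a minimal sketch; the directory names and the cleaning step itself are assumptions:

    from pathlib import Path

    # Apply the same (hypothetical) cleaning step to every input file.
    for path in sorted(Path("raw_data").glob("*.csv")):
        lines = path.read_text().splitlines()
        # Example step: drop empty lines and write a cleaned copy.
        cleaned = [line for line in lines if line.strip()]
        out = Path("clean_data") / path.name
        out.parent.mkdir(exist_ok=True)
        out.write_text("\n".join(cleaned) + "\n")
        print(f"Processed {path.name}: kept {len(cleaned)} of {len(lines)} lines")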

3) Versioning
It is beneficial to choose software or a tool that allows for version control. Version control tracks all changes to our data over time, allowing us to go back to a previous version. This is especially useful if a mistake occurs or if you made a change that you wish to undo. If the software package does not allow for version control, it is important to document changes yourself, for example with the help of a ReadMe file, and to save each version as a separate data file.
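
A minimal sketch of such manual version documentation in a ReadMe file might look like this (the file names, dates and descriptions are hypothetical):

    2023-01-10  survey_data_v2.csv  Added 50 new responses collected in December.
    2022-11-02  survey_data_v1.csv  Initial export from the survey platform.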

An example of an open-source platform that allows for version control is GitLab. GitHub also allows for version control; although GitHub itself is not open-source, a large benefit is that it integrates well with many open-source software tools for data analysis.

Some examples
Some examples of commonly used software packages that allow for automated analyses, and that are in addition free and open-source, are tools from the R environment and Python. Note that these tools are mainly useful for quantitative, structured and tabular data. If you are working with unstructured or qualitative data, then tools like NVivo might be useful; note, however, that NVivo is not free or open-source.

R and Python require you to learn some scripting before you can use them, which may be time-consuming at first. However, the initial effort pays off quickly, and performing your analyses in an automated way will save time later on. Additionally, there is a lot of excellent learning material and help available online for picking up at least the basics of one of these languages. A great resource for getting started with programming languages is Data Carpentry, a lesson program within The Carpentries.

In the video below you will learn more about Data Carpentry:

[Video: "Data Carpentry"]

Transcript of video "Data Carpentry"

Links used in video "Data Carpentry"


Lessons learned

  • Aim to make your data analysis reproducible and replicable
  • Use free open-source software for your data analyses 
  • Choose analysis tools that allow for automated analyses and versioning

Food for thought
  • What are the benefits of making your data analyses reproducible, replicable and transparent? For others and for yourself?
  • How can you benefit from the Data Carpentry program?
