Foundations to lay when teaching computational and data skills

By Miranda Prynne, 7 July, 2022
Traditional teaching in data analysis focuses on statistics and visualisation but an emphasis on foundational data and computational skills is needed to prepare students to work with real data, explains Philip Leftwich

Responsible research data management is essential to good scientific practice and provides a solid foundation for excellent research and careers in data science. While researchers' training in data management skills has been established in many universities, teaching these explicitly at an undergraduate level is rare.

We often assume that students are “digital natives”, that growing up under the ubiquitous influence of the internet and other modern information technologies makes students more computer literate than their instructors. Unchecked assumptions about computer literacy and data management can lead to gaps in student understanding and undermine students' confidence in their studies.

I offer five discussion points for educators drawn from my experience teaching programming and data science to biological sciences students. Checking these fundamentals will ensure that you and your students have the foundations you need to teach and work with data.

1. Understanding files

Can your students define what a file is? What about file formats and file-naming conventions?

Consider a short introduction to each file format you expect your students to know and discuss sensible file-naming conventions. In the example below, I start with the date, a short description of the assignment and a version number:

2022-08-11-report-text-cell-biology-v1.docx
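
As a minimal sketch of how such a convention could be applied consistently, the short Python function below assembles a name from a date, a description and a version number. The function name `build_filename` and its parameters are illustrative assumptions, not part of any standard. It writes the date in zero-padded YYYY-MM-DD form so that names sort chronologically.

```python
import re
from datetime import date


def build_filename(day: date, description: str, version: int, extension: str) -> str:
    """Return a name of the form YYYY-MM-DD-description-vN.ext (hypothetical helper)."""
    # Lower-case the description and turn any run of other characters into one hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    # date.isoformat() always zero-pads the month and day
    return f"{day.isoformat()}-{slug}-v{version}.{extension}"


print(build_filename(date(2022, 8, 11), "Report text cell biology", 1, "docx"))
# prints 2022-08-11-report-text-cell-biology-v1.docx
```

Students can compare the output with names they would have chosen themselves and discuss which are easier to find and sort.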

2. Understanding directories

As the excellent article File not found neatly summarises, modern operating systems increasingly encourage the “laundry basket” model for file storage and retrieval, where everything is stored together and instantly searchable. This way of thinking does not require knowledge of directory structure or file paths, but for now that knowledge remains crucial when teaching programming, because code that runs at the command line needs to be told precisely where to find files. The same applies when providing files for external repository storage or cloud-computing tools. In my experience, when using cloud servers, students benefit from being taught the distinction between their own computer and a cloud server: files accessible to one are not immediately accessible to the other.
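
To make file paths concrete, a small Python sketch like the one below can help. It takes a path apart into its folders, file name and extension, and reports whether the file exists relative to the current working directory. The path `data/raw/field-observations.csv` and the function name `describe_path` are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def describe_path(path_text: str) -> dict:
    """Break a path into the parts a program needs to locate a file."""
    path = Path(path_text)
    return {
        "is_absolute": path.is_absolute(),    # does it start from the root of the drive?
        "folders": list(path.parent.parts),   # directories to walk through, in order
        "file_name": path.name,
        "extension": path.suffix,
        "exists_here": path.exists(),         # relative paths are checked from the working directory
    }


# A relative path is interpreted from wherever the code is run
print("Working directory:", Path.cwd())
info = describe_path("data/raw/field-observations.csv")
for key, value in info.items():
    print(f"{key}: {value}")
```

Running the same script from two different folders, or on a laptop and then on a cloud server, shows students why a file that "exists" in one place may not be found in another.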

3. Organising a spreadsheet

Simple rules for reproducible data management can start with tips and structures for organising a spreadsheet. Practical recommendations encourage students to consider how they organise their data, emphasising consistency and readability. A simple exercise is to present students with archived data, or templates they have produced themselves, and score them against 12 simple rules:

  • be consistent
  • write dates using YYYY-MM-DD
  • do not leave any cells empty
  • put just one thing in a cell
  • organise the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row)
  • create a data dictionary
  • do not include calculations in the raw data files
  • do not use font colour or highlighting as data
  • choose good names for things
  • make backups
  • use data validation to avoid data entry errors
  • save the data in plain text files.

Well-organised data are less error-prone, easier for computers to process, and easier to share.
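
Some of the rules above can be checked automatically, which makes a useful follow-up to scoring by hand. The Python sketch below is one possible approach, not an established tool: it reads a table saved as plain text (CSV) and reports rows that break the single rectangle, empty cells, and dates not written as YYYY-MM-DD. The function name `check_table`, the column names and the sample data are all invented for illustration.

```python
import csv
import io
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_table(csv_text: str, date_column: str) -> list[str]:
    """Return a list of readable problems found in a CSV table (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    date_index = header.index(date_column)
    problems = []
    # The header is line 1 of the file, so data rows start at line 2
    for line_number, record in enumerate(records, start=2):
        if len(record) != len(header):
            # The data should form a single rectangle
            problems.append(f"line {line_number}: expected {len(header)} cells, found {len(record)}")
            continue
        for name, cell in zip(header, record):
            if cell.strip() == "":
                problems.append(f"line {line_number}: empty cell in '{name}'")
        if not ISO_DATE.match(record[date_index]):
            problems.append(f"line {line_number}: date '{record[date_index]}' is not YYYY-MM-DD")
    return problems


# Invented sample data containing three deliberate problems
sample = """subject,date,weight_g
A1,2022-08-11,12.5
A2,11/08/2022,
A3,2022-08-12,13.1,extra
"""
problems = check_table(sample, "date")
for problem in problems:
    print(problem)
```

The check only inspects the shape of the date, not whether it is a real calendar date, and it covers just three of the 12 rules. The rest, such as choosing good names, still need human judgement.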

4. Keep data raw

By raw data, I mean the original data that have been collected from a source and not yet processed or analysed. Raw data will provide the foundation for any downstream analyses. In many cases, the captured or collected data may be unique and impossible to reproduce, such as measurements in a lab or field observations. For this reason, they should be protected from any possible loss. Explain that every edit to a raw data file threatens the integrity of that information.

In practice, this can mean that if calculations must be made in Excel, there is a locked, read-only version of the data spreadsheet that is never edited, and all data are copied to a new file before any calculations are added. Better yet, run analyses in entirely separate software and output the results to a new file.
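
The same idea can be sketched in code. In the hedged Python example below, the raw file is marked read-only and the calculation is written to a separate output file, so the raw data are never modified. The file names, the column names `count_a` and `count_b`, and both function names are assumptions made for this illustration.

```python
import csv
import stat
import tempfile
from pathlib import Path

RAW_TEXT = "site,count_a,count_b\nnorth,3,4\nsouth,10,1\n"  # invented sample data


def lock_raw_file(path: Path) -> None:
    """Remove write permission so the raw file is not edited by accident."""
    path.chmod(stat.S_IREAD)


def add_total_column(raw_path: Path, output_path: Path) -> None:
    """Read the raw file and write a copy with an extra 'total' column."""
    with raw_path.open(newline="") as source, output_path.open("w", newline="") as target:
        reader = csv.DictReader(source)
        writer = csv.DictWriter(target, fieldnames=list(reader.fieldnames) + ["total"])
        writer.writeheader()
        for row in reader:
            row["total"] = int(row["count_a"]) + int(row["count_b"])
            writer.writerow(row)


# Demonstration in a temporary folder so nothing on the reader's disk is touched
with tempfile.TemporaryDirectory() as folder:
    raw = Path(folder) / "counts-raw.csv"
    raw.write_text(RAW_TEXT)
    lock_raw_file(raw)
    derived = Path(folder) / "counts-with-totals.csv"
    add_total_column(raw, derived)
    derived_text = derived.read_text()
    raw_unchanged = raw.read_text() == RAW_TEXT

print(derived_text)
print("Raw file unchanged:", raw_unchanged)
```

A read-only flag guards against accidental edits rather than deliberate ones, so it complements backups rather than replacing them.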

5. Keeping data safe and using version control

Most UK university systems provide students with access to cloud storage and version control through tools such as OneDrive. Version control is the practice of tracking changes to files as they are made. Explicitly taking students through managing multiple versions of files in cloud storage lets you explain the advantages of version control and multiple-user access. This can help improve data integrity and prepare students for learning software such as Git at the university level.
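
For students who are curious about what "tracking changes" means, a toy model can help. The Python sketch below is an illustration only and does not describe how OneDrive or Git work internally. It keeps every saved version of a text, uses a checksum to skip saves where nothing changed, and can return any earlier version. The class name `VersionHistory` and its methods are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class VersionHistory:
    """Toy version store: keeps (checksum, text) pairs, oldest first."""
    versions: list = field(default_factory=list)

    def save(self, text: str) -> bool:
        """Store the text as a new version; return False if nothing has changed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.versions and self.versions[-1][0] == digest:
            return False  # identical to the latest version, so no new entry
        self.versions.append((digest, text))
        return True

    def restore(self, version_number: int) -> str:
        """Return the text of a given version, counting from 1."""
        return self.versions[version_number - 1][1]


history = VersionHistory()
print(history.save("draft one"))   # prints True
print(history.save("draft one"))   # prints False, because nothing changed
print(history.save("draft two"))   # prints True
print(len(history.versions))       # prints 2
print(history.restore(1))          # prints draft one
```

The point for students is that earlier versions are kept rather than overwritten, which is what makes it safe to experiment.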

These are just a few brief examples of where we can build foundational skills in data handling. By taking account of the fundamentals, teachers can better support students in learning the increasingly in-demand skills of programming, data science and reproducible research.

Philip Leftwich is a lecturer in genetics and data science in the School of Biological Sciences at the University of East Anglia.

This advice is based on a presentation given at a HUBS-funded workshop, Fundamental Biosciences, hosted by the University of East Anglia.

