Data selection guidelines

You must have a dataset to work with over the course of this quarter. Ideally, it is a dataset that matters to you right now and that you want to write up into a scientific report soon. Students in this “ideal” situation will get the most out of this course.

It is hard to define comprehensively what counts as an appropriate, ready dataset. But here are a few core features that the dataset should have for the purposes of this course (and, less strictly, for quantitative/automated analysis and open, replicable science in general):

Complete (at least a complete subpart)

In order to think through the steps from raw data to analysis-integrated manuscript, you need a dataset that covers all aspects of your analysis. That is easy enough if you have completed all data collection and preparation for a study. But what if data collection is ongoing, or even just planned? In that case, plan to have a mini version of your complete dataset: either a complete subset (e.g., pilot data prepared with all the features of the real data still to come) or a simulated version of your dataset (i.e., one to be replaced later with the real data).

Let’s say, for example, you are planning to run an experiment with the hypothesis that you will see a difference between children of different ages in two conditions. You will at least need (a) real or simulated behavioral/response data for a few participants in each condition and (b) metadata about the participants (e.g., their ages) and experimental sessions (e.g., the condition they were in, the order of the items, etc.). For many people that information might be spread across multiple different files, e.g., one output file from the experiment software for each participant plus a spreadsheet with participant metadata and a spreadsheet with experiment session metadata. For some people that information might already be integrated into one document. Either way is fine, but we need some complete picture of the intended dataset, even if it’s a mini one.
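As a rough illustration of that "complete picture", here is a minimal sketch in base R that simulates a mini dataset and joins the response data with the two kinds of metadata. All column names, participant IDs, and values are made up for illustration; your own files will look different.

```r
# Simulated stand-in for the per-participant output files
# (hypothetical columns: participant, item, rt_ms).
responses <- data.frame(
  participant = rep(c("p01", "p02", "p03", "p04"), each = 3),
  item        = rep(1:3, times = 4),
  rt_ms       = c(512, 640, 498, 701, 655, 720, 450, 470, 530, 810, 790, 760),
  stringsAsFactors = FALSE
)

# Simulated participant metadata (hypothetical column: age_years).
participants <- data.frame(
  participant = c("p01", "p02", "p03", "p04"),
  age_years   = c(4, 6, 4, 6),
  stringsAsFactors = FALSE
)

# Simulated session metadata (hypothetical column: condition).
sessions <- data.frame(
  participant = c("p01", "p02", "p03", "p04"),
  condition   = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Join everything on the shared participant ID.
full_data <- merge(
  merge(responses, participants, by = "participant"),
  sessions,
  by = "participant"
)

# Sanity check: no response rows were lost or duplicated by the joins.
stopifnot(nrow(full_data) == nrow(responses))
```

If the final check fails with your own data, that usually means some participant IDs are missing from, or duplicated in, one of the metadata tables.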

Consistent in labeling and structure

Computers are wonderful, but stupid. If you have a data file for one participant with the field/column header “Name”, another with “name”, and another with “ name”, the computer will treat those headers as being as different as “apples”, “oranges”, and “bananas”. If you have any inconsistencies in how you have labeled or structured your data, you will surely come across some of them in the process of cleaning up your data for analysis. The scarier part is that you probably won’t come across all of them, leaving dreaded “silent” errors in your analysis. So, early on, you should brainstorm where these sources of inconsistency could arise in your dataset, list them out, and do as much double- and triple-checking as you need to satisfy yourself that you have, to the best of your abilities, eliminated these issues.
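One way to start that checking is to compare column headers across files. The sketch below uses base R and three made-up sets of headers (the file names and column names are hypothetical). It shows that a strict comparison flags the mismatches, and that trimming whitespace and lowercasing is one possible way to reconcile them.

```r
# Column headers as they might appear in three participants' files
# (made-up examples of the kinds of inconsistencies described above).
headers <- list(
  file_a = c("Name", "Age", "Response"),
  file_b = c("name", "age", "response"),
  file_c = c(" name", "age ", "response")
)

# Strict comparison: which header sets are exactly identical to the first?
identical_to_first <- vapply(
  headers,
  function(h) identical(h, headers[[1]]),
  logical(1)
)
print(identical_to_first)

# One possible normalization: strip surrounding whitespace, then lowercase.
normalize <- function(h) tolower(trimws(h))
normalized <- lapply(headers, normalize)

# After normalizing, do all files agree on their headers?
consistent_after <- all(vapply(
  normalized,
  function(h) identical(h, normalized[[1]]),
  logical(1)
))
print(consistent_after)
```

The same idea applies to the values inside columns (e.g., condition labels), not just to the headers.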

Plain text

R works with plain text. Though it may not seem like it to us human users, documents like Word and Excel files have a lot of extra information in them beyond the text content (e.g., cell and border colors, conditional formatting, etc.). While there are some special R packages that have been built to read and write common special formats (like Excel), R itself will convert the contents to a plain text format and work with them in that way. For that reason, you should consider whether your dataset can be easily converted to a plain text format. If you aren’t sure, check with Dr. Casillas. We may be able to program a custom solution for you. If not, though, your dataset may be unsuitable for direct analysis in R.
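To make "plain text" concrete, here is a small base R sketch that writes a toy table to a CSV file, shows the raw text that ends up on disk, and reads it back in. The data and column names are invented, and the file goes to a temporary location.

```r
# A toy table (made-up values).
dat <- data.frame(
  participant = c("p01", "p02"),
  score       = c(3.5, 4.0),
  stringsAsFactors = FALSE
)

# Write it as comma-separated plain text to a temporary file.
path <- tempfile(fileext = ".csv")
write.csv(dat, path, row.names = FALSE)

# The file is nothing but lines of text: one header line, then one line per row.
raw_lines <- readLines(path)
print(raw_lines)

# Reading the text back in recovers the same table.
dat_back <- read.csv(path, stringsAsFactors = FALSE)
print(dat_back)
```

Anything that can be exported to a file like this (CSV, tab-separated, and so on) is straightforward to work with in R.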


Tabular

Your data should be organized into a table (i.e., with rows and columns). The canonical format in the tidyverse is that every column is a variable and every row is an observation, but you might have good reasons for organizing your data the other way around or in some other fashion. As long as your data are convertible to a tabular format, you’ll be fine!
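As an example of what "convertible" can mean, the sketch below takes a toy table with one column per item (a "wide" layout) and restructures it so that every row is a single observation (a "long" layout). It uses base R only, and the column names are hypothetical; tidyverse tools offer more convenient ways to do the same thing.

```r
# Wide layout: one row per participant, one column per item (made-up data).
wide <- data.frame(
  participant = c("p01", "p02"),
  item1       = c(1, 0),
  item2       = c(1, 1),
  stringsAsFactors = FALSE
)

item_cols <- c("item1", "item2")

# Long layout: one row per participant-item observation.
# unlist() stacks the item columns one after another, so the participant
# IDs are repeated once per item column and each item label once per row.
long <- data.frame(
  participant = rep(wide$participant, times = length(item_cols)),
  item        = rep(item_cols, each = nrow(wide)),
  correct     = unlist(wide[item_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(long)
```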

Structured around a research question

In principle, you could just download and play around with any dataset that meets the above criteria. But in order for this course to be useful to you in learning about the process of generating a scientific report, you should choose a dataset for which (a) you have a motivated research question and (b) the contents are structured in such a way that you are able to conduct a study to address your research question.


Anonymized

In this course we’ll be learning how to develop a replicable scientific report via GitHub. Because GitHub records everything you ever commit in its history, you want to be 100% certain that you only ever commit anonymized data. For some students, anonymizing the data is as simple as making sure that the participant metadata is anonymous (e.g., using random strings of letters for participant IDs with a set of log files from experiment software). For other students, it might take a little effort to anonymize the data (e.g., comprehensively scanning and inserting pseudonyms where necessary or selectively removing parts of the files used). If you’re unsure how to go about this process, try to think through some options and then ask Dr. Casillas to brainstorm with you about what would be practical.
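For the simple case, here is a minimal base R sketch of swapping names for random letter strings. The names, scores, and the six-letter ID format are all invented for illustration, and this is only one possible approach; it does nothing about identifying information hidden elsewhere in your files.

```r
# Toy data containing identifying names (made-up values).
raw <- data.frame(
  name  = c("Alice", "Bob", "Alice", "Carla"),
  score = c(10, 12, 11, 9),
  stringsAsFactors = FALSE
)

people <- unique(raw$name)

# Hypothetical ID format: six random lowercase letters.
make_id <- function() paste(sample(letters, 6, replace = TRUE), collapse = "")

# Draw IDs until there is one unique ID per person.
ids <- character(0)
while (length(ids) < length(people)) {
  ids <- unique(c(ids, make_id()))
}

# Lookup table from name to ID. Keep this private and never commit it.
key <- setNames(ids, people)

# Anonymized table: the name column is replaced by the random ID.
anon <- data.frame(
  participant = unname(key[raw$name]),
  score       = raw$score,
  stringsAsFactors = FALSE
)

print(anon)
```

Only the anonymized table should ever be committed; the raw data and the lookup table stay outside the repository.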

Whether you are sure or unsure about your dataset…

You must confirm with Dr. Casillas that it’s suitable for use in the course.