“There is a time, much greater in amount than commonly allowed, which should be devoted to free and unguided exploratory work (call it play if you wish; I call it work).” – David Hawkins, 1974*
We often get a feel for things by messing around with them. When you toss a paper airplane in the air you’re probably not thinking about the physics of aerodynamics, but you get a feel for air resistance, gravity, projectile motion. When you push a toy boat into a pond, you’re not calculating hydrodynamics, but you experience the effects of mass, fluidity, surface tension. In both cases, this preliminary stage of “messing about” is a necessary first step toward an intellectual understanding of the science and engineering principles involved in these simple pleasures. Students who mess about in this way are learning, whether they are aware of it or not.
Why is it important for students to mess about? Why not get them started using a guided exploration or investigation right away? Because, when messing about, students get a feel for the subject, and they have opportunities for surprise and discovery, which, when they happen, help students establish ownership of an investigation and a belief in themselves as explorers. As they follow their noses from one question to the next, they gain proficiency with the tools of exploration and an intuitive understanding of the discovery process, especially when they share their discoveries and techniques with each other.
When deciphering a dataset, messing about is a particularly informative first step. The availability of large, interesting datasets that describe everything from elections to climate change has exploded online. Finding the facts within the figures has become a pressing need, as well as a profession. But how does a student mess about with data, and why should they?
Let’s consider the “why” first. During the messing about phase with data, a student might learn where the data came from, how it was gathered and how much junk it contains, how many attributes (columns) and cases (rows) there are, whether the data are confusing or straightforward, and if it seems like there are exciting discoveries to be made. All of this preliminary information that results from messing about helps make sense of the data, and provides students with a deeper understanding of the complexities, detours, and unexpected results that may lie ahead.
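For readers who like to see that first look spelled out, here is a minimal sketch in Python of the same checks: counting cases and attributes and noting where values are missing. It is not part of CODAP, which needs no code at all. The five-row sample, its column names, and the `first_look` function are invented for illustration and are not taken from the example database.

```python
import csv
import io

# Invented five-row sample, for illustration only.
# It is NOT taken from the example database described in this article.
SAMPLE_CSV = """Age,Sex,Marital_Status,Education
34,Female,Married,Bachelor
71,Female,Widowed,
19,Male,Never married,High school
,Male,Married,High school
45,Female,Divorced,Master
"""


def first_look(csv_text):
    """Return (number of cases, attribute names, missing-value count per attribute)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    attributes = list(rows[0].keys()) if rows else []
    # A value counts as missing when the cell is empty or only whitespace.
    missing = {
        name: sum(1 for row in rows if not (row[name] or "").strip())
        for name in attributes
    }
    return len(rows), attributes, missing


cases, attributes, missing = first_look(SAMPLE_CSV)
print(f"{cases} cases, {len(attributes)} attributes: {', '.join(attributes)}")
for name, count in missing.items():
    print(f"{name}: {count} missing")
```

A summary like this only hints at how much junk a dataset contains. Where the data came from and how it was gathered still have to be learned from its documentation.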
“[Students need] to build an apperceptive background, against which a more analytical sort of knowledge could take form and make sense.” – Hawkins, 1974
Now let’s look at the “how.” How can students independently undertake an open-ended data exploration without getting frustrated or feeling as though they’re not learning anything?
A good tool helps. The Common Online Data Analysis Platform (CODAP) is free, easy to use, runs in a browser, and is highly interactive. We created CODAP with messing about as well as detailed data analysis in mind. Let’s look at an example.
The first question a student might consider when messing about with a large dataset is which attributes they want to look at. In this example database, students can choose the number of people they want to look at and various attributes from age and income to military status and language. There’s a lot to choose from! Where to start?
When messing about, one student might be curious about age and military status and whether there is a connection to level of education in a sample of 5,000 people. They could compare the results to a smaller, more select group of people to see if the results are different. Or they can look at a completely different set of attributes. Driven by their curiosity, students will have multiple questions, and at this stage, no clear answers . . . yet.
In the simple example below, we use CODAP to look at 1,000 people and seven attributes.
Using the example database, we select a sample size and the attributes to investigate. The initial graph plots one point per person, distributed randomly.
One of the best ways to mess about with data in CODAP is to create graphs. The initial graph always plots one point for each person, distributed randomly, giving a feel for how much data there is. Let’s put Age on the x-axis and see what happens. We could just as easily choose another attribute, but Age is easy to understand.
Assigning an attribute to the x-axis provides a useful overall distribution.
With Age on the x-axis, we have a sense of the overall distribution of ages we’re looking at: generally high counts until middle age and then a quick decrease for ages over 60. Let’s add another attribute to the y-axis. Maybe Marital Status. Since we’re just messing about, we could have used any attribute we’re curious about. Separating the points in this direction gives a sense of the variation in number and age range for each category.
Assigning an attribute to the y-axis provides further information about the data.
Are the results what you might expect? What questions arise?
Now if we drop Sex into the middle of the graph, what do you see? What about adding Education or Military Status? What questions arise? What observations?
Assigning additional attributes brings up more questions and possibly unexpected results.
Are there any unexpected differences between males and females? Why are there so many widowed females after age 60? What other questions arise from this preliminary messing about with the data?
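A question like the one about widowed females can also be checked with a simple tally. The sketch below, in Python, counts people over a chosen age by sex and marital status, which is roughly what the graph shows as clusters of points. The six records and the `tally` function are invented for illustration, so the counts they produce say nothing about the real dataset.

```python
from collections import Counter

# Invented records, for illustration only.
# They are NOT drawn from the example database, so the counts carry no meaning.
PEOPLE = [
    {"Age": 72, "Sex": "Female", "Marital_Status": "Widowed"},
    {"Age": 68, "Sex": "Male", "Marital_Status": "Married"},
    {"Age": 81, "Sex": "Female", "Marital_Status": "Widowed"},
    {"Age": 35, "Sex": "Female", "Marital_Status": "Married"},
    {"Age": 64, "Sex": "Male", "Marital_Status": "Widowed"},
    {"Age": 29, "Sex": "Male", "Marital_Status": "Never married"},
]


def tally(people, min_age):
    """Count people older than min_age, grouped by (Sex, Marital_Status)."""
    return Counter(
        (person["Sex"], person["Marital_Status"])
        for person in people
        if person["Age"] > min_age
    )


counts = tally(PEOPLE, 60)
for (sex, status), n in sorted(counts.items()):
    print(f"{sex:8} {status:15} {n}")
```

Changing `min_age`, or grouping by a different pair of attributes, is the code equivalent of dragging a new attribute onto the graph.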
Each time students create a new visualization of the data, they learn something about the dataset—even as they follow their curiosity rather than a prescribed set of steps. They’re asking questions: What am I seeing? How has the data changed? What is surprising or interesting? Ideally they’re talking with other students about what they see, what draws their attention.
When messing about, students are exploring. They’re not yet investigating. They’re becoming familiar with the data, finding out which attributes are easy or hard to work with, where there is missing data, what interesting relationships arise. Spending time “playing” with data is a critical step in giving students a feel for what the data might tell them, and it is very different from many traditional activities in science or math class.
Go ahead and mess about with data in CODAP. Let us know what you learn.
* In 1974 the educational philosopher David Hawkins published the essay “Messing About in Science” in his book The Informed Vision: Essays on Learning and Human Nature. His belief that children learn best when they follow their natural curiosity has inspired our work in data science education.