Tuesday 13 October, 3:00 - 4:00, Library 128

In this session we will look at the types and formats of data, and the contextual details needed to make data meaningful. We will also explore a useful tool for extracting and wrangling difficult data.

Data and Metadata

What is data? What types of data exist, and in what formats? Are you sure that something is or isn't data? Data can take on many meanings, as we will learn in our talk. We will also examine the contextual details used to make data understandable to others (and yourself!). These details are commonly called metadata, and they are critical to producing research that is reproducible and understood by others.
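To make the idea of metadata concrete, here is a minimal sketch in Python. The readings, the field names (title, creator, date_collected, units, method) and the describe helper are all hypothetical, chosen only for illustration; they are not drawn from any particular metadata standard.

```python
import json

# Raw data: without context, these numbers are hard to interpret.
readings = [21.4, 22.0, 19.8]

# Hypothetical metadata record supplying the missing context.
metadata = {
    "title": "Room temperature readings",
    "creator": "Example Researcher",
    "date_collected": "2020-01-15",
    "units": "degrees Celsius",
    "method": "Handheld digital thermometer, one reading per hour",
}


def describe(values, meta):
    """Return a one-line summary that combines the data with its context."""
    mean = sum(values) / len(values)
    return f"{meta['title']}: mean {mean:.1f} {meta['units']} ({len(values)} readings)"


print(describe(readings, metadata))
# Metadata is often stored alongside the data in a plain, readable format.
print(json.dumps(metadata, indent=2))
```

The same three numbers could be temperatures, prices, or distances; only the metadata tells a reader (or your future self) which.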


One tool for your work with data extraction and wrangling is OpenRefine. Formerly known as Google Refine, OpenRefine is a tool that assists users in the critical step of data maintenance. Advertised as "a free, open source, powerful tool for working with messy data", OpenRefine could help you format those hard-to-understand datasets into something that you can better use. You can find a download for OpenRefine at the link provided. We won't be able to download OpenRefine to the lab computers, but we will explore what it is and how it can help you.
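As a rough illustration of what "messy data" can look like, the sketch below uses plain Python rather than OpenRefine itself. The sample values and the normalize helper are hypothetical, and this is only one simple kind of cleanup (trimming whitespace and unifying capitalization), not a description of how OpenRefine works internally.

```python
from collections import Counter

# Hypothetical messy column: the same value entered several different ways.
raw_values = ["Sydney", " sydney", "SYDNEY ", "Melbourne", "melbourne  ", "Perth"]


def normalize(value):
    """Trim surrounding whitespace, collapse repeated spaces, and title-case."""
    return " ".join(value.split()).title()


cleaned = [normalize(v) for v in raw_values]

# Counting before and after shows how inconsistent entry hides duplicates.
print(Counter(raw_values))
print(Counter(cleaned))
```

Before cleaning, the six entries look like six distinct values; after cleaning, they collapse to three.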

What will we do today?

Today's class will be a lecture that alternates between a PowerPoint slideshow and the OpenRefine application, and we will have a chance to use OpenRefine directly to complete our example. But first we will talk about data and metadata to give you a grounding in what you can expect.

Our OpenRefine test document is the Powerhouse Museum collection, from Sydney, Australia. This dataset is large enough for us to play with, yet not so big that we will get lost in it. A link to find out more about the Museum is here, and the data we will use is located below.

Lecture Outline

Lecture materials from PowerPoint (Gross)
Introduction to OpenRefine (Painter)
Common principles connecting OpenRefine to the underlying theory (Painter, Gross)
Hands-on demonstration of OpenRefine with a chance to complete a simple exercise (Painter)

Slides and Materials

Metadata Presentation Slides

New England Collaborative Data Management Curriculum (revised) - Modules 2 & 3

OpenRefine Sample Data Set (This file needs to be unpacked before use. It is quite large).