dunnhumby
Source files

Real-world data to put your theory into practice

(Nearly) Real-world data

Here at dunnhumby, we understand the importance of great data and the analysts who make sense of it. Uncovering patterns, predicting trends, validating theories — insight gained through analysing customer data is the foundation of our business and key to the success of every one of our clients.

But more than that, we just really love data. We love connecting the dots. We love the human stories data can help you tell. And we love the people who love data as much as we do. That’s why we created Source Files, a platform for sharing datasets inspired by the real world, where fellow data geeks – from professors to students to data scientists – can easily access rich data sources. Whether you’re teaching a course, completing a class project, testing an algorithm, or running a hackathon, Source Files is the place to go to put your theory into practice.

Breakfast at the Frat

What’s inside?

  • A representation of sales and promotion information on five products from three brands within four categories (mouthwash, pretzels, frozen pizza, and boxed cereal) over 156 weeks.
  • Unit sales, households, visits, and spend data by product, store, and week
  • Base Price and Shelf Price, to determine a product’s discount, if any
  • Promotional support details (e.g. sale tag, in-store display), if applicable
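As a rough illustration of how Base Price and Shelf Price can be combined, the sketch below computes a discount as a fraction of the base price. The field names `base_price` and `shelf_price` and the helper `discount_fraction` are hypothetical stand-ins chosen for this example; check the downloaded files for the actual column names.

```python
def discount_fraction(base_price: float, shelf_price: float) -> float:
    """Return the discount as a fraction of base price (0.0 if not discounted)."""
    if base_price <= 0:
        raise ValueError("base_price must be positive")
    # A shelf price at or above the base price is treated as "no discount".
    return max(0.0, (base_price - shelf_price) / base_price)


# Made-up rows for illustration only; these are not values from the dataset.
rows = [
    {"product": "A", "base_price": 4.00, "shelf_price": 3.00},
    {"product": "B", "base_price": 2.50, "shelf_price": 2.50},
]

discounts = {
    row["product"]: discount_fraction(row["base_price"], row["shelf_price"])
    for row in rows
}

for product, fraction in discounts.items():
    print(f"Product {product}: {fraction:.0%} off base price")
```

Expressing the discount as a fraction rather than an absolute amount makes it easier to compare products at different price points.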

What’s it for?

This dataset is designed to facilitate time series analyses, including:

  • Price sensitivity analysis
  • Promotional effectiveness analysis
  • Comparing/contrasting results across products, categories or store geographies

Download 'Breakfast at the Frat'

Carbo-Loading

What’s inside?

  • A representation of household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix

What’s it for?

  • Classroom projects and case studies
  • Understanding the process required to mine data
  • Learning how to merge data tables and aggregate data

How should I use it?

Professors have had success asking students questions such as:

  • What is the household penetration of Product X? That is, out of all customers purchasing Pasta Sauce, what percent purchase Product X or Brand Z?
  • Did any customers first purchase an item or category using a coupon? If so, how many of these customers made additional purchases of the item or category?
  • In two complementary categories (e.g. Pasta and Pasta Sauce), what products, if any, are commonly purchased together?
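The household penetration question above could be approached along these lines. This is a minimal sketch in plain Python, assuming each transaction row carries a household identifier, a category and a product; the keys `household`, `category` and `product` and the sample rows are invented for illustration and are not the dataset's real column names or values.

```python
def household_penetration(transactions, category, product):
    """Share of households buying `category` that bought `product` in it."""
    category_households = set()
    product_households = set()
    for row in transactions:
        if row["category"] != category:
            continue
        category_households.add(row["household"])
        if row["product"] == product:
            product_households.add(row["household"])
    if not category_households:
        return 0.0  # nobody bought the category, so penetration is zero
    return len(product_households) / len(category_households)


# Invented sample rows, for illustration only.
transactions = [
    {"household": 1, "category": "Pasta Sauce", "product": "Product X"},
    {"household": 2, "category": "Pasta Sauce", "product": "Product Y"},
    {"household": 3, "category": "Pasta Sauce", "product": "Product X"},
    {"household": 3, "category": "Pasta Sauce", "product": "Product Y"},
    {"household": 4, "category": "Pasta Sauce", "product": "Product Y"},
    {"household": 5, "category": "Syrup", "product": "Product S"},
]

penetration = household_penetration(transactions, "Pasta Sauce", "Product X")
print(f"Household penetration of Product X: {penetration:.0%}")
```

Counting distinct households rather than rows matters here: a household that buys Product X ten times should still count only once.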

Special considerations

Don’t forget, you’re dealing with Big Data! Large files may take five minutes or more to download, and importing the millions of rows they contain will require specialised software such as R, Microsoft Excel with PowerPivot, Microsoft Access, SAS, SPSS, or SQL.

Download 'Carbo-Loading'

The Complete Journey

What’s inside?

  • A representation of household level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer
  • All of a household’s purchases within the store, not just those from a limited number of categories
  • Customer attributes and direct marketing contact history for select households

What’s it for?

  • More advanced classroom settings
  • Academic research on the effects of direct marketing to customers

How should I use it?

Professors have had success asking students questions such as:

  • How many customers are spending more/less over time?
  • Which customer attributes appear to affect customer spend?
  • Is there evidence to suggest that direct marketing improves overall customer engagement?
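One simple way to start on the first question is to compare each household's spend in an early period with its spend in a later one. The sketch below assumes rows with hypothetical keys `household`, `week` and `spend`, and a cut-off week chosen by the analyst; none of these names come from the dataset itself, and a real analysis would likely want something more robust than a two-period split.

```python
from collections import defaultdict


def spend_trend(transactions, split_week):
    """Label each household 'more', 'less' or 'same' by later vs. earlier spend."""
    early = defaultdict(float)
    late = defaultdict(float)
    for row in transactions:
        bucket = early if row["week"] <= split_week else late
        bucket[row["household"]] += row["spend"]

    trend = {}
    # Households missing from one period count as zero spend in that period.
    for household in set(early) | set(late):
        change = late[household] - early[household]
        if change > 0:
            trend[household] = "more"
        elif change < 0:
            trend[household] = "less"
        else:
            trend[household] = "same"
    return trend


# Invented sample rows, for illustration only.
transactions = [
    {"household": "H1", "week": 1, "spend": 10.0},
    {"household": "H1", "week": 60, "spend": 15.0},
    {"household": "H2", "week": 2, "spend": 20.0},
    {"household": "H2", "week": 70, "spend": 5.0},
    {"household": "H3", "week": 3, "spend": 8.0},
    {"household": "H3", "week": 80, "spend": 8.0},
]

trend = spend_trend(transactions, split_week=52)
counts = {label: list(trend.values()).count(label) for label in ("more", "less", "same")}
print(counts)
```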

Special considerations

Don’t forget, you’re dealing with Big Data! Large files may take five minutes or more to download, and importing the millions of rows they contain will require specialised software such as R, Microsoft Excel with PowerPivot, Microsoft Access, SAS, SPSS, or SQL.

Download 'The Complete Journey'

Let’s Get Sort-of-Real

What’s inside?

By the numbers

  • 117: Weeks of dummy till transaction data
  • 300M: Total number of transactions
  • 47M: Total number of baskets
  • 400,000: Average number of baskets per week
  • 2.6M: Average number of transactions per week
  • ~500,000: Distinct number of customers
  • ~5,000: Distinct number of products
  • ~760: Distinct number of stores
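The weekly averages above follow from the totals, which can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures listed here, so its results are approximate; the transactions-per-basket ratio is derived for illustration and is not a figure published on this page.

```python
# Rounded figures taken from the "By the numbers" list.
WEEKS = 117
TOTAL_TRANSACTIONS = 300_000_000
TOTAL_BASKETS = 47_000_000

baskets_per_week = TOTAL_BASKETS / WEEKS
transactions_per_week = TOTAL_TRANSACTIONS / WEEKS
# Derived ratio, not stated on the page: average transactions in one basket.
transactions_per_basket = TOTAL_TRANSACTIONS / TOTAL_BASKETS

print(f"Baskets per week:        {baskets_per_week:,.0f}")
print(f"Transactions per week:   {transactions_per_week:,.0f}")
print(f"Transactions per basket: {transactions_per_basket:.1f}")
```

Both weekly figures round to the published averages of 400,000 baskets and 2.6M transactions.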

What’s it for?

We’ve replicated the typical patterns found in real in-store data to help data scientists test their techniques and algorithms in a (nearly) real-world environment.

A note on download times

Please remember, you’re dealing with Big Data! Large file sizes can result in download times of five minutes or more. Please be patient.

Samples available

  • Data preview
  • 2,000 baskets, randomly selected, over a period of two weeks
  • All transactions for a randomly selected sample of 5,000 customers
  • All transactions for a randomly selected sample of 50,000 customers

Download 'Data Sample'

Download 'Sample 2K baskets'

Download 'Sample 5K customers'

Download 'Sample 50K customers'

Full dataset

Ready to get real? Grab the full 4.3GB dataset below (in nine ~500MB files, for your downloading convenience).

Download 'Part One'

Download 'Part Two'

Download 'Part Three'

Download 'Part Four'

Download 'Part Five'

Download 'Part Six'

Download 'Part Seven'

Download 'Part Eight'

Download 'Part Nine'