10 vital ingredients for the dunnhumby data scientist
At dunnhumby, our team of 500 data scientists are a vital part of our services for retailers, brands and customers. Julie Sharrocks, Head of Science for Category Management and Price & Promotions, looks at the ten key traits that are a must for a good dunnhumby data scientist.
- A strong background in mathematics
Mathematics is the foundation of any modern field of science, and that includes machine learning. While some of your colleagues in other organisations might be happy to apply standardised algorithms and approaches to answer their business question, you will have the edge if you can build a deeper understanding of what’s going on. Many data scientists study mathematics, science or engineering at university, but some of these courses fall short of covering the right level of detail. A good grounding in statistics, linear algebra and calculus is equally important, as these underpin many of the techniques you’ll go on to learn.
- Flexible programming skills
Only a few years ago, a combination of R, Hadoop, Impala and Mahout was considered cutting edge, whereas nowadays these are considered old-fashioned. Today we talk about combining Python for machine learning with Spark for its processing power. In the future we can expect even more change. A data scientist must have the ability and appetite to pick up new tools quickly, so that they’re not left behind.
To get a job as a data scientist you’ll need to demonstrate some experience with programming and manipulating large data sets. Even more important will be the ability to quickly learn new coding languages. The technologies and software that we make available to our data scientists through our Data Science Toolkit will evolve over time as they bring new capability to the business.
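As a flavour of that Python-and-Spark pairing, here is a minimal sketch: Spark does the heavy lifting by condensing a large transactions table down to one row per customer, and Python fits a simple model on the result. The table and column names are illustrative, not a real dunnhumby schema.

```python
from pyspark.sql import SparkSession, functions as F
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.appName("spend-model").getOrCreate()

# Spark aggregates billions of rows down to one summary row per customer.
summary = (
    spark.table("transactions")  # illustrative table name
    .groupBy("customer_id")
    .agg(F.sum("spend").alias("total_spend"),
         F.countDistinct("basket_id").alias("visits"))
)

# The aggregated result is small enough to model locally in Python.
pdf = summary.toPandas()
model = LinearRegression().fit(pdf[["visits"]], pdf["total_spend"])
print(model.coef_, model.intercept_)
```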
- The ability to communicate to a variety of audiences
Soft skills have become increasingly important, and the best data scientists differentiate themselves on this basis. Employers look for professional skills, including managing timelines, priorities and stakeholders, as well as the ability to communicate difficult concepts to a non-technical audience. Good customer data scientists understand the commercial realities of retail: what retailers are trying to achieve and the issues they need to solve. A solid understanding of the sector’s pains and gains will help scientists choose the right strategy to solve retail problems.
At dunnhumby we have a comprehensive training plan, including courses in critical thinking, problem solving and effective communication, with data visualisation taught as a powerful communication channel in its own right.
- A high awareness of privacy and security
New privacy legislation, including GDPR in Europe and CCPA in California, is rightly giving consumers more rights over how their data is used. As well as keeping consumer data secure and private, we must understand the implications of this legislation for data science. As a data scientist, you can’t simply blame the computer: the author needs to take full responsibility for the outputs. New legislation gives people the right to be forgotten, and this also needs to be considered for copies of data used for building science. And we all have a responsibility to keep individual data private, employing techniques to analyse data without any individual ever being identifiable.
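As a simple illustration of one such technique, the sketch below releases only group-level statistics and suppresses any group too small to protect anonymity (a k-anonymity-style threshold). The data, column names and threshold are illustrative, not a dunnhumby standard.

```python
import pandas as pd

K = 10  # illustrative minimum group size before a statistic may be released

def safe_aggregate(df: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Return mean value per group, suppressing groups with fewer than K members."""
    agg = df.groupby(group_col)[value_col].agg(["mean", "count"])
    agg.loc[agg["count"] < K, "mean"] = None  # suppress small cells
    return agg.drop(columns="count")
```

The same discipline applies to the right to be forgotten: derived and copied datasets must drop a customer’s rows just as the source system does.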
- A thirst for the new…
At dunnhumby we stay at the cutting edge of scientific method and machine learning. We benefit enormously from our work with world-class universities to introduce new thinking, allowing us to ‘bring the outside in’, and we couple this with our insatiable appetite for the new. Our Data Science Club is a global movement that engages colleagues across the world and extends to:
- Sessions to celebrate the projects undertaken as part of our academic partnerships
- Updates from global conferences to gain insight into new machine learning techniques used in multiple sectors
- Advanced onsite training programmes led by our academic partnership students
- Local reading groups deep-diving into research papers and investigating new machine learning techniques
- …but also a thirst for the old
While staying up to date with new data science trends, we see some of our best-loved paradoxes popping up across a whole host of our data science solutions. These paradoxes are far from new, with some more than 100 years old. Two of our favourites are Simpson’s Paradox (where trends within groups appear to reverse when the groups are combined) and the Hitchhiker’s Paradox (look this one up the next time you are waiting for a bus!).
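To see Simpson’s Paradox in action, here is a small, entirely illustrative example: a promotion that wins within every store format yet appears to lose once the formats are pooled, because it ran mostly in convenience stores where baskets are smaller anyway.

```python
import pandas as pd

# Illustrative numbers only: average basket spend by store format and promo flag.
data = pd.DataFrame({
    "format":    ["convenience"] * 2 + ["superstore"] * 2,
    "promo":     [True, False, True, False],
    "baskets":   [900, 100, 100, 900],
    "avg_spend": [6.0, 5.0, 21.0, 20.0],
})

# Within each format, promo beats non-promo by +1.0.
print(data.pivot(index="format", columns="promo", values="avg_spend"))

# Pooled across formats, the trend reverses:
pooled = (data.assign(total=data.baskets * data.avg_spend)
              .groupby("promo")[["baskets", "total"]].sum())
print(pooled.total / pooled.baskets)
# promo:     (900*6 + 100*21) / 1000 = 7.5
# non-promo: (100*5 + 900*20) / 1000 = 18.5  -> promo now looks worse
```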
- An eye for false detail
As data scientists we must know the limits of our science. We are often asked to deliver a level of precision that may look impressive but that we know would, in practice, come at the cost of accuracy. For example, if we build a predictive model at too granular a level, the results may look better on average, but the spread around those averages will be wide, so each individual prediction will actually be worse.
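A quick simulation makes the point. Below, the same 10,000 observations feed one coarse estimate and 1,000 granular ones: the granular estimates are right on average but individually far noisier. The numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean, noise = 10.0, 5.0
sales = rng.normal(true_mean, noise, 10_000)

# Coarse model: one estimate from all 10,000 observations.
coarse = sales.mean()

# Granular model: 1,000 segments with only 10 observations each.
granular = sales.reshape(1000, 10).mean(axis=1)

print(f"coarse error:          {abs(coarse - true_mean):.3f}")      # ~0.05
print(f"granular average error: {np.abs(granular - true_mean).mean():.3f}")
print(f"granular spread (sd):   {granular.std():.3f}")  # ~5/sqrt(10) = 1.58
```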
- Knowledge of the value of an explainable model
Standard machine learning techniques such as cross-validation and regularisation will help reduce overfitting. But at dunnhumby our advice is to go one step further and ensure your model is explainable. In short, we need to fully understand what is going on under the bonnet. A black-box model may contain many spurious correlations that will break when fed a new dataset.
Our science must be robust enough to be effective across the dozens of retailers that we work with. Our unified demand model for price and promotions, for example, is econometric at its core to ensure we capture the most important and explainable effects.
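As a hedged sketch of what explainable-first can look like (illustrative synthetic data, not our unified demand model): a regularised log-log regression whose coefficients read directly as a price elasticity and a promotional uplift, giving a commercial sanity check that a black box cannot offer.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n = 500
log_price = rng.normal(0, 0.3, n)
on_promo = rng.integers(0, 2, n)
# Synthetic demand: elasticity -1.8, promo uplift +0.4 (in log units).
log_units = 3.0 - 1.8 * log_price + 0.4 * on_promo + rng.normal(0, 0.2, n)

X = np.column_stack([log_price, on_promo])
model = RidgeCV(alphas=np.logspace(-3, 2, 20)).fit(X, log_units)

# Each coefficient has a direct commercial reading we can sanity-check.
print(dict(zip(["price_elasticity", "promo_uplift"], model.coef_.round(2))))
```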
- The ability to build at scale
Our data architecture and platforms have come a long way over the last 25 years. Today we work with vast quantities of data (into the petabytes) across a flexible combination of multi-cloud and dunnhumby-hosted data platforms. Despite this new environment, there will always be limitations to what is practical to run in a productionised environment. Storage may be near-infinite, but it is costly, and processing excessively large volumes of data is not likely to curry favour with your infrastructure team, or your team members when the lights start to dim.
As we build science products, we need to consider the balance between complexity and value. New science will be productionised through the science engineering team and made available as reusable science modules for the wider dunnhumby teams.
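One simple pattern for staying within those practical limits is to stream data and keep only running aggregates, rather than holding raw rows in memory. A minimal sketch, with an illustrative file name and schema:

```python
import pandas as pd

totals: dict[str, float] = {}
# Process the file a million rows at a time, keeping only per-store totals.
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    for store, spend in chunk.groupby("store_id")["spend"].sum().items():
        totals[store] = totals.get(store, 0.0) + spend

print(f"{len(totals)} stores aggregated without loading the full dataset")
```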
- Awareness that the ultimate goal is automation
We want our data scientists to be able to easily plug in and use each other’s code. Doing this saves time and helps us to focus our energy on true innovation. Code needs to be well written, efficient and commented so it can be shared and picked up elsewhere. Equally important is proactively sharing or publishing your approach and code, so that colleagues in the same community aren’t unknowingly inventing the same thing in parallel.
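As an illustration of what “written for reuse” means in practice (a generic example, not an actual dunnhumby module): typed inputs, a docstring, and explicit failure rather than hidden assumptions, so a colleague can plug it in with confidence.

```python
import pandas as pd

def index_vs_baseline(sales: pd.Series, baseline: pd.Series) -> pd.Series:
    """Return sales indexed against a baseline (100 = in line with baseline).

    Both series must share the same index (e.g. product codes).
    Raises ValueError rather than silently misaligning the inputs.
    """
    if not sales.index.equals(baseline.index):
        raise ValueError("sales and baseline must share the same index")
    return 100 * sales / baseline
```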
At dunnhumby, we have long been applying automated science through our products. Recent examples include our assortment planning software, where we have automated the process of identifying shopper need states and customer decision trees, making our assortment science easily accessible to retailers, whatever their size or sophistication.
Find out more
Interested in joining dunnhumby’s data science teams? Find out more about our science here and learn how we apply data science to help retailers and brands put their customers first.
Julie Sharrocks is Head of Science for Category Management and Price & Promotions at dunnhumby. She has been a member of the data science community at dunnhumby for more than 15 years and currently manages a large team of expert research data scientists and science engineers.