It is good data science practice to visualise and explore your data before trying to answer specific research questions. This exploration will inform your methods and identify potential pitfalls that could lead you to the wrong conclusions later on. It may also help you to generate new research questions.
As we are aiming for transparency and collaboration, I think it is a good idea to do some of this data exploration openly. Doing so will help us build a shared understanding of what the data looks like, making it easier for us to collaborate. Think of it like this: if we at Counting What Counts are the only ones with a clear idea of what the data looks like, how can we expect anyone else to make good suggestions for what to do with it?
Volume and Velocity
Big Data is a commonly used term nowadays. In the simplest terms, it means that there is so much data that your personal computer or laptop couldn’t store it, let alone analyse it.
Using this definition, even if we gather together all the data collected by NPOs using the Toolkit (AKA the aggregate dataset), it is still not ‘Big Data’. The entire dataset could be downloaded and opened in a spreadsheet (although it wouldn’t be pretty!).
More detailed descriptions of Big Data talk about ‘the four Vs’ of the data. These are Volume, Velocity, Variety and Veracity.
- Volume refers to the amount of data there is.
- Velocity refers to the speed with which the data is collected.
- Variety refers to data coming from a range of different sources. For example, as well as simple text and numbers in spreadsheets, you might have images, videos or emails in the thousands.
- Veracity refers to how accurate or trustworthy the data is.
In this blogpost we’ll look at the Volume and Velocity of the aggregate dataset.
Velocity of the data
The Toolkit mostly collects survey data. That means that every data point is an answer to a question which a person has taken the time to read and respond to. Considering this manual process, how fast could data possibly be collected?
The charts below show the number of surveys completed by members of the public per day (in the top chart in orange), and the number of surveys completed per month (in the bottom chart in blue).
- On average, 142 surveys have been completed per day since the project began.
- In the 2019 evaluation year, the average was 223 surveys per day.
- In the 2020 evaluation year so far, the average has been 83 surveys per day.
- The most collected on a single day has been 2,153.
- The most collected in a single month was 15,741!
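For anyone who wants to reproduce figures like these from their own data, here is a minimal sketch of how the daily and monthly counts could be calculated. It assumes you have one completion date per survey; the function names and the sample dates are made up for illustration and are not taken from the Toolkit itself. It also assumes that days with no surveys count as zero when averaging, which may differ from how the figures above were produced.

```python
from collections import Counter
from datetime import date


def daily_counts(completion_dates):
    """Count completed surveys per calendar day."""
    return Counter(completion_dates)


def monthly_counts(completion_dates):
    """Count completed surveys per (year, month) pair."""
    return Counter((d.year, d.month) for d in completion_dates)


def mean_per_day(completion_dates, start, end):
    """Average surveys per calendar day from start to end (inclusive).

    Days with no completed surveys still count towards the number of days.
    """
    n_days = (end - start).days + 1
    in_range = sum(1 for d in completion_dates if start <= d <= end)
    return in_range / n_days


# Made-up sample data: one date per completed survey (not real Toolkit data)
sample = [date(2019, 3, 1), date(2019, 3, 1), date(2019, 3, 3), date(2019, 4, 2)]

per_day = daily_counts(sample)
per_month = monthly_counts(sample)

# Busiest single day and busiest single month in the sample
busiest_day, busiest_day_count = per_day.most_common(1)[0]
busiest_month, busiest_month_count = per_month.most_common(1)[0]

# Average per day over the first four days of the sample
average = mean_per_day(sample, date(2019, 3, 1), date(2019, 3, 4))
```

Whether to divide by every calendar day or only by days on which surveys were collected is a choice worth stating whenever an average like this is reported, as the two can give quite different answers.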
Looking at the charts we can see that surveying activity increased from the start of the project and peaked in January 2020. This gradual increase in activity is likely a combination of two things:
- The project only started in March 2019, and so it took time for people to familiarise themselves with the tools and the processes.
- The approach of the end of the evaluation year prompted more data collection.
Unfortunately for everyone, in March 2020 we entered a national lockdown, represented by the black line in the top chart. Understandably, this brought data collection almost to a halt.
It’s possible that we will see a yearly peak and trough in the Velocity of data collection, although probably less pronounced than in the first year, as people are now more familiar with the system.
Volume of the data
To date, we have collected answers to 781,669 questions asked of 105,218 individuals. These individuals were each sent one of 3,431 different surveys, representing 1,445 different arts and cultural works. All of this has been carried out by 281 different NPOs.
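One quick way to get a feel for these totals is to turn them into rough per-unit averages. The sketch below simply divides the totals quoted above by one another; the variable names are my own. These are simple averages only, so they say nothing about how evenly the data is spread across surveys, works or NPOs.

```python
# Totals quoted in this post
answers = 781_669
individuals = 105_218
surveys = 3_431
works = 1_445
npos = 281

# Simple ratios between the totals (averages only; the spread could be very uneven)
averages = {
    "answers per individual": answers / individuals,
    "individuals per survey": individuals / surveys,
    "surveys per work": surveys / works,
    "works per NPO": works / npos,
}

for label, value in averages.items():
    print(f"{label}: {value:.1f}")
```

That works out at roughly 7 answers per person, 31 people per survey, a little over 2 surveys per work and about 5 works per NPO.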
To get a sense of how much that is and whether this counts as a big dataset we can compare it to some other things of a similar scale:
- The Office for National Statistics (ONS) carries out a yearly survey of the population. This has the largest coverage of any household survey carried out in the country. In 2019 the survey collected data from around 320,000 people.
- A large theatre audience is approximately 1,200 people.
The circles below are scaled in size to represent the number of people involved in each of the above.
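If you want to draw a comparison like this yourself, one common convention is to make the area of each circle proportional to the number of people, which means the radius grows with the square root of the count. The sketch below assumes that convention, which may or may not match how the circles here were drawn; the function name and labels are hypothetical.

```python
import math


def scaled_radii(counts, max_radius=1.0):
    """Return a radius per label so that circle AREA is proportional to the count.

    The largest count gets max_radius; every other radius is scaled by the
    square root of its share of the largest count.
    """
    largest = max(counts.values())
    return {
        label: max_radius * math.sqrt(n / largest)
        for label, n in counts.items()
    }


# Numbers of people quoted in this post
people = {
    "ONS survey (2019)": 320_000,
    "Toolkit so far": 105_218,
    "Large theatre audience": 1_200,
}

radii = scaled_radii(people)
```

Scaling the radius directly by the count instead would make the larger circles look far bigger than they should, because the eye tends to compare areas rather than widths.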
Whilst the ONS survey is around three times the size of the Toolkit survey so far, it is worth noting that Toolkit data collection only began in March 2019, and it took time for the rate of data collection to increase from a cold start (as we see in the Velocity section above). A Toolkit evaluation year with existing momentum and no interruptions might collect significantly more data than we’ve seen so far.
At the beginning of this blogpost I said it’s a good idea to explore your data. Having done some exploration, what have we learned?
Looking at the Velocity of the data, we see a data collection cycle with slower collection for part of the year followed by a build-up through to March. Because of this, we might not see much change in aggregate patterns until after this peak data collection period.
We can monitor this to confirm the peak/trough pattern over the next year once the sector opens back up and cultural activities restart. If the pattern persists, the best time to look for ‘new’ patterns in the data will be each year after the peak data collection period.
Looking at the initial snapshot of total data volume, it appears that we have a lot of data to work with. However, we would need to break it down more to understand whether we can use it all to address specific research questions. For example, if we want to analyse a specific artform then only a fraction of this data might be of use.
We can do further exploration to get a more in-depth understanding of the aggregate dataset and how we might use it.