When working with big data, it’s important to understand what makes data ‘big’, the limitations of working with it, and what can be done to combat those limitations. For the uninitiated, ‘big data’ doesn’t refer only to how much storage the data occupies; it covers a whole host of qualities.
Many people define big data as data that scores highly on one or more of the four V’s:
- Volume. The size of the data is so big that it can’t fit on a single computer (e.g. > 1TB).
- Velocity. The speed with which data is being created is fast and accumulates quickly (e.g. 100GB per hour).
- Variety. The data formats are complex and varied (e.g. text, images, video).
- Veracity. Uncertainty around the reliability and quality of the data.
The four V’s are not the only attributes of big data, but they are the most fundamental ones. Each of these qualities increases every year. Take volume, for example: in 2010, the amount of stored data in the world was approximately 1.2 trillion gigabytes; by 2020, that number was estimated to have grown to 44 trillion gigabytes!
In constructing a model to identify traffic signs, we’ve taken video footage of our state highway network and extracted each video frame as an image. A single pass of the network yields over 4 million images. Using all of these images in a model would be time consuming, demand a lot of computing power, and be difficult to store. We can combat some of these issues with cloud-based computing and storage, but for a practical solution we don’t want to use all of the data.
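As a rough illustration of the extraction step, frame extraction might look something like the sketch below. The file paths and naming scheme are hypothetical, and OpenCV is just one library that could do this job; it isn’t necessarily what was used on the project.

```python
from pathlib import Path

import cv2  # OpenCV; one common choice for reading video frames


def extract_frames(video_path: str, output_dir: str) -> int:
    """Save every frame of a video as a numbered JPEG image and return the count."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    frame_count = 0
    while True:
        success, frame = capture.read()
        if not success:  # no more frames left in the video
            break
        cv2.imwrite(f"{output_dir}/frame_{frame_count:07d}.jpg", frame)
        frame_count += 1
    capture.release()
    return frame_count


# Hypothetical usage for one pass of the highway network:
# total_frames = extract_frames("highway_pass.mp4", "frames")
```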
This leads to the very logical question: how much data do I need in practice? There’s no easy answer. The solution will look different depending on the data format and the machine learning method you want to use. More complex tasks like computer vision require larger amounts of data than other problems, especially when you consider how much bigger an image is than a row in a spreadsheet. A single colour image from our data contains over 3 million pieces of information (each pixel carries three colour values, so an image of roughly 1,000 by 1,000 pixels already holds around 3 million numbers). The most effective way to manage the vast amount of data in this project is to implement a sampling strategy: select a portion of the data and use it to develop the solution. By reducing the number of images used, we can drastically reduce the time and computing power required.
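As a minimal sketch, the simplest version of this, randomly sampling the extracted image files, might look something like the following. The directory name, file pattern, and sample fraction are placeholders rather than values from the project.

```python
import random
from pathlib import Path


def sample_image_paths(image_dir: str, fraction: float, seed: int = 42) -> list[Path]:
    """Randomly select a fraction of the image files in a directory."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    sample_size = int(len(paths) * fraction)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(paths, sample_size)


# Hypothetical usage: keep 5% of the ~4 million extracted frames
# subset = sample_image_paths("frames", fraction=0.05)
```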
As we implement the model, we can evaluate how its performance responds to these changes and visualise how much the model improves as we introduce more images. These visualisations are called learning curves. While more data tends to bring a corresponding improvement in performance, there comes a point where the costs outweigh the benefits: a 1% improvement in accuracy could equate to many additional hours of compute time, which often has a direct cost attached.
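A learning curve can be produced by training the same model on progressively larger subsets of the data and plotting a performance metric against the subset size. The sketch below uses scikit-learn’s learning_curve helper with a generic classifier and synthetic data purely as an illustration; the project itself trains on images, so the real training code would differ.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Stand-in dataset and model; placeholders for the project's image data and model.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# Train on 10%, 25%, 50%, 75% and 100% of the available training data.
train_sizes, train_scores, valid_scores = learning_curve(
    model, X, y,
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0],
    cv=5,
)

# Plot mean validation accuracy against the amount of training data used.
plt.plot(train_sizes, valid_scores.mean(axis=1), marker="o")
plt.xlabel("Number of training examples")
plt.ylabel("Validation accuracy")
plt.title("Learning curve")
plt.show()
```

The validation score is plotted rather than the training score because it reflects how well the model generalises as more data is added, which is what the cost-benefit decision hinges on.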
If we’re going to use only a portion of the data to construct the model, then we need to be careful about how we select our sample so that we retain as much information as possible. These sampling strategies range from randomly sampling image files to more sophisticated sampling that uses machine learning itself. In my next blog post, I’ll cover these strategies in the context of my project and their potential implications for the model.