
Abley · Apr 2024 · 2 min read

The realities of “big data” in computer vision

When working with big data, it's important to understand what makes data 'big', the limitations of working with it, and what can be done to manage those limitations. For the uninitiated, 'big data' doesn't only refer to how much storage the data occupies; it refers to a whole host of qualities.

Big data is commonly defined as exhibiting one or more of the four V's:

  • Volume. The size of the data is so big that it can’t fit on a single computer (e.g. > 1TB).
  • Velocity. The speed with which data is being created is fast and accumulates quickly (e.g. 100GB per hour).
  • Variety. The data formats are complex and varied (e.g. text, images, video).
  • Veracity. Uncertainty around the reliability and quality of the data.

The four V’s are not the only attributes of big data, but they are the most fundamental ones. Each of these qualities increases every year. Take volume, for example. In 2010, the amount of stored data in the world was approximately 1.2 trillion gigabytes. By 2020, that figure was estimated to have grown to 44 trillion gigabytes!

In constructing a model to identify traffic signs, we’ve taken video footage of our state highway network and extracted each video frame as an image. A single pass of the network consists of over 4 million images. Using all of these images to build a model would be time consuming, require a lot of computing power, and the images themselves would be difficult to store. We can combat some of these issues with cloud-based computing and storage, but for a practical solution we don’t want to use all of the data. This leads to a very logical question: how much data do I need in practice?
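To make the frame-extraction step concrete, here is a minimal sketch of how video frames can be written out as individual images using OpenCV. The file names, paths and output format are illustrative assumptions, not our actual processing pipeline.

```python
# A minimal sketch of extracting every frame from a survey video with OpenCV.
# Paths and file names are illustrative only.
import cv2
from pathlib import Path

def extract_frames(video_path: str, output_dir: str) -> int:
    """Write each frame of the video to disk as a JPEG and return the count."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    count = 0
    while True:
        success, frame = capture.read()
        if not success:  # end of the video
            break
        cv2.imwrite(str(out / f"frame_{count:07d}.jpg"), frame)
        count += 1
    capture.release()
    return count

if __name__ == "__main__":
    n = extract_frames("highway_survey.mp4", "frames/")
    print(f"Extracted {n} frames")
```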

There’s no easy answer to this question. The solution will look different depending on the data format and the machine learning method you want to use. More complex tasks like computer vision require larger amounts of data than other applications, especially when you consider how much bigger an image is than a row in a spreadsheet. A single colour image from our data contains over 3 million pieces of information. The most effective way of managing the vast amount of data in this project is to implement a sampling strategy: select a portion of the data to use in developing the solution. By reducing the number of images used, we can drastically reduce the time and computing power required.
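As a rough illustration of both points, the sketch below shows how quickly pixel counts add up and how a simple random sample of frame files might be drawn. The frame resolution and the 5% sample rate are illustrative assumptions, not figures from the project.

```python
# Illustrative only: pixel counts per image and a simple random sample of frames.
import random
from pathlib import Path

# Each colour pixel carries three values (red, green, blue), so even a
# modest 1024 x 1024 frame holds over 3 million numbers.
values_per_image = 1024 * 1024 * 3
print(f"Values per image: {values_per_image:,}")  # 3,145,728

# Simple random sampling: keep an assumed 5% of the extracted frames.
all_frames = sorted(Path("frames/").glob("*.jpg"))
random.seed(42)  # make the sample reproducible
sample_size = int(0.05 * len(all_frames))
sample = random.sample(all_frames, k=sample_size)
print(f"Using {len(sample)} of {len(all_frames)} extracted frames")
```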

As we implement the model, we can evaluate how its performance responds to these changes. We can visualise how much the model improves as we introduce more images. These visualisations are called learning curves. While more data tends to bring a corresponding improvement in performance, there will be a point where the costs outweigh the benefits: a 1% improvement in accuracy could equate to many hours of additional compute time, which often has a direct cost implication.
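A learning curve can be produced with standard tooling. The toy example below uses scikit-learn's learning_curve on its small built-in digits dataset, purely to illustrate the shape of the curve; it is not our traffic-sign model.

```python
# A toy learning curve: accuracy as a function of how much training data the
# model sees. Uses a small built-in dataset and a simple classifier.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% up to 100% of the training data
    cv=5,
    scoring="accuracy",
)

# Plot validation accuracy against training set size: gains shrink as data grows.
plt.plot(train_sizes, val_scores.mean(axis=1), marker="o", label="validation accuracy")
plt.xlabel("Number of training images")
plt.ylabel("Accuracy")
plt.title("Learning curve")
plt.legend()
plt.show()
```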

If we’re going to use only a portion of the data to construct the model, then we need to be careful in how we select our sample so that we retain as much information as possible. These sampling strategies range from randomly sampling image files to more sophisticated approaches that use machine learning to guide the selection. In my next blog, I’ll cover these strategies in the context of my project and their potential implications for the model.

Image source: https://openautomationsoftware.com/open-automation-systems-blog/what-is-big-data/