Five signs your spreadsheet might smell (and what you can do about it)

In the world of software development, a code smell is a readily identified surface indication that often corresponds to a deeper problem in a system. Code smells are not bugs but can reveal a codebase that is fragile or may be difficult to maintain in the long term. With a little bit of practice code smells, which have entertaining names like “feature envy” and “shotgun surgery”, are generally easy to spot but understanding how to fix smelly code can take a lot of knowledge and experience.

Spreadsheets can be an incredibly powerful tool for quickly testing concepts, analysing data and visualising outputs, but their ubiquity and apparent ease of use often means they get taken beyond their point of efficiency. The following “spreadsheet smells”, which you may even find combining in the same spreadsheet, are intended to help you quickly identify when things might be getting out of control and present a few possible solutions.

Confounding concerns

Because spreadsheet programs are kind of a “Swiss army knife” of tools, including everything from data operations and advanced mathematical functions to charting, it is dangerously easy to muddle input data, processing, analysis and outputs within the same spreadsheet. Although this flexibility makes spreadsheets so powerful, it can become nearly impossible to update data or unpick problems in a complex spreadsheet. Examples of this smell can include business processes or calculation “tools” which are implemented in spreadsheets.

Solutions

Having a very clear workflow from input data through calculations and intermediate data to outputs makes any process a lot more legible. Not only is it quicker to find and fix errors, but changes or data updates are a lot simpler to implement. There are a wide range of potential options available for better separation of concerns, which may be suitable alone or in combination depending upon the specific use case:

  • A carefully designed spreadsheet, which clearly delineates inputs, calculations, intermediate data and outputs, may be appropriate in some situations.
  • Displaying outputs: analytics platforms like Power BI or Tableau provide a rich set of visualisation tools, including charts and tables and enabling handy tools like smart filtering. They can read from various data sources (including spreadsheets), provide some calculation functionality, and can readily be shared.
  • Calculation and analysis: programming languages, such as R or Python, can connect to a wide variety of data sources, include powerful statistical and numerical functions, can leverage a huge range of community libraries and produce compelling visual outputs.
  • Moving data between systems: an data integration tool, such as FME, can read from and write to a wide variety of systems and perform various data transformations.
  • Business processes or calculation tools: a web application can often be a better solution, as concerns can be separated by design and data requirements, process steps and validations can be rigorously enforced.

 

Prodigious proportions

Although Excel will let you create a spreadsheet with a million rows, sixteen thousand columns and as many sheets as your computer’s memory can handle, you probably shouldn’t… There is no hard and fast rule for when a spreadsheet is too bloated, but there does come a point where Excel will take forever to open, slow to a crawl on any change and the file will become increasingly difficult to share.

Solutions

Migrate your data into a relational database. These are specifically designed to handle huge amounts of data and perform highly optimised queries using a Structured Query Language (SQL). Getting started with databases does require a bit of learning and set up but in the end promotes a much stronger separation of concerns. Databases broadly fall into two categories, and picking the appropriate one depends on the use case:

  • Single-user databases: a single file stores all the data, only requiring an application to open the database file, popular options include SQLite and Microsoft Access. These kinds of databases are great if only one user needs to access the data, but can fall into some of the pitfalls outlined below if there are multiple users and different data versions.
  • Multi-user relational database management system (RDMS): these typically run on a server and are designed to support lots of people viewing and editing the data at the same time, common examples include PostgreSQL, MySQL or Microsoft SQL Server.

Databases focus on data storage, and SQL enables queries and simple calculations on the data, but typically the presentation of this data will require something like Power BI or a web application.

Version hell

Because spreadsheets are so ubiquitous and portable it is almost inevitable that version management becomes an issue: the file gets emailed and changed by the recipient or someone already has it open when you need it and suddenly there are a raft of versions. In larger projects identifying the appropriate version and reconciling differences can become a major headache and trying to find the correct version a few months down the track can be almost impossible.

Solutions

Defining strict file naming conventions can help in some instances, although in practice these are typically difficult to adhere to and don’t solve issues around multi-user access and reconciliation. Cloud document management tools like Google Drive, Dropbox or OneDrive enable shared editing and all provide some form of revision control, although I’ve rarely seen this used well. Additionally, they are still prone to people creating new versions, presenting the same reconciliation problems.

If concerns are well separated and some form of code is used for the analysis/calculation, distributed software version control tools like Git can make version management easier. Git uses a “branching” model that simplifies workflows when multiple people are working on the same project, clarifying the merging process, it also stores a permanent record of every single change committed to the codebase.

Fiendish formulae

Short formulae in Excel can be useful – once you match up the colours and cells you can usually tell what is going on without too much head scratching. However, once a formula gets beyond a certain size it can become extremely difficult to decipher, especially in confounded spreadsheets without a good process flow. You will see this smell exhibiting itself through length of formulae, number of formulae, formulae pointing all over the place and deeply nested logic (such as chained IF, AND, OR statements).

Spreadsheet formulae are easy to create, but can quickly get out of hand, becoming difficult to debug and test.

Solutions

Sometimes a spreadsheet with fiendish formulae can be simplified by breaking parts of the formula into smaller sequential calculation steps, each with a clearly labelled column, although this can introduce new problems if it results in an inappropriate number of columns or sheets. The best solution is typically to migrate the calculations to something like Python scripts, using spreadsheets as data input/output only (or, if applicable, a database). Python scripts can make long or complex calculations a lot more legible, as it is much easier to visually step through the calculation process, with the added advantage that scripts can readily be version controlled. The Python ecosystem also provides great tools for testing your code, so you can ensure that each of your calculation steps is behaving as it should and have confidence that you aren’t introducing new bugs as you modify it.

Tests in Python help give you confidence your code is doing what you expect.

Spatial leanings

Although mapping technology has become increasingly accessible, spatial data and the tools and techniques used to solve various spatial problems are still a relatively specialised domain. Recent versions of Excel have incorporated the ability to show maps and plot spatial data, but caution should be exercised in doing much more than simple visualisations. You may see this smell exhibited in a few ways:

  1. A spreadsheet trying to deal with a spatial problem but representing inherently spatial information in a non-spatial way, which can disastrously lose precision or introduce ambiguity. For example, storing locations by descriptions rather than coordinates (there can be valid reasons for this kind of approach, but it needs to be carefully considered and the pros/cons well understood for the specific use case).
  2. Performing any kind of spatial analysis in a spreadsheet: although coordinates are usually just a series of numbers, they typically won’t map cleanly onto the “cartesian” X/Y coordinates we are used to thinking about, and working with them properly involves a fair amount of math. Although it can be tempting to chuck these numbers into a formula, the results are likely to be somewhere on the spectrum between subtly and disastrously wrong, depending upon the coordinate system of the data and analysis being conducted.
  3. Using spreadsheets to transfer data between different systems can seem convenient, but can be risky, during the pandemic Public Health England lost nearly 16,000 Covid test results due to Excel’s row number limitations. Even if data isn’t lost, transferring through Excel typically introduces a loss of precision and the risk that relevant metadata (information about the data) can be lost. Additionally, because spreadsheets blur the lines between data storage and presentation, it is common to end up with problems like numbers shown as text, and vice versa, and broken dates or times.

Solutions

For the examples above:

  1. Representing spatial information in a non-spatial way: carefully think through your solution – if you may later want to show the data on a map or do spatial analysis on it: use spatial tools from the outset. Data can always be presented without a spatial component down the line, but it can be difficult to add spatial information retrospectively.
  2. Spatial analysis: spatially aware applications, like ArcGIS or QGIS, are fundamentally designed to project between different coordinate systems, include a wide range of spatial analysis tools “out of the box” and can be readily customised to chain operations together.
  3. Data transfer: in some situations it may be possible to export directly between systems (or there may be a data exchange format available that is more appropriate than spreadsheets), otherwise spatially-enabled tools such as FME specialise in this “extract, transform, load” workflow, reading from and writing to a wide variety of systems and data sources, and enabling a wide array of transformations to be performed on data in the process.

Whatever your spreadsheet woes it is always worth re-evaluating what you are doing. Is a spreadsheet still working for you or can you detect some of these smells?  There are an array of tools and software that can help make sense of data. Do you need a web application to help guide a data flow or enforce data standards? Contact Stacy Rendall to discuss options with our software team. Would visualising your data help? Contact Fiona Mohan for spatial analyses. Do you need to move large quantities of data into a single source and manipulate it from there? Contact Gavin Jeter for specialised FME support. Or just ring up for a chat – your spreadsheet might just be fine as is with a little bit of tweaking from our experts.