Don’t Let Monitoring Get You Down

Note: This article is not sponsored by UptimeRobot. Abley has been an UptimeRobot customer for over five years and we have found their services invaluable. However, the principles of system monitoring apply to other monitoring platforms as well.

Abley works with a wide range of clients across many different industries, and for some of them we manage aspects of their infrastructure and systems. While these industries are professionally diverse, they rely on similar backend spatial and integration-orientated systems that we have expertise in, such as Esri ArcGIS Enterprise or FME Server.

Internally, ArcGIS Enterprise and FME Server are designed to reduce downtime through robust error handling, recovery, and redundancy. However, things do sometimes go wrong: issues can arise from internal system factors or from external dependencies such as networking infrastructure or database services. As the managers responsible for these systems, we need to work proactively rather than reactively, and not wait until a user tells us there is a problem. It is therefore paramount that we are alerted promptly to such issues, so that we can investigate, action any required resolutions, and minimise disruption.

Using UptimeRobot

This is where a monitoring tool such as UptimeRobot comes into its own. The simplest explanation of UptimeRobot is that it periodically checks an internet-facing service (such as a website) to ensure it is up and running as expected. For example, we use this basic test (on the right) on our own website to check its availability.

If UptimeRobot is unable to reach the Abley website during one of the scheduled attempts, an alert is sent through to us via our specified channels. These include notifications by email, Microsoft Teams messages, SMS and voice calls through Twilio, and many more! In fact, the options are almost endless as UptimeRobot can also call webhooks to initiate processes in other services.
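To make the idea concrete, here is a minimal Python sketch of what an availability check with alerting could look like. It is not how UptimeRobot is implemented; the function names (site_is_up, run_check) and the notifier callbacks are hypothetical and stand in for channels such as email, Teams, or a webhook.

```python
import urllib.request
from typing import Callable, Iterable


def site_is_up(url: str, timeout: float = 10.0,
               opener: Callable = urllib.request.urlopen) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def run_check(url: str, notifiers: Iterable[Callable[[str], None]],
              opener: Callable = urllib.request.urlopen) -> bool:
    """Check the site once and call every notifier if it is unreachable."""
    up = site_is_up(url, opener=opener)
    if not up:
        for notify in notifiers:
            notify(f"{url} is not responding")
    return up
```

A scheduler (cron, for example) would call run_check at a fixed interval, passing one callback per notification channel.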

As well as simply checking that a website returns a successful response, a keyword monitor can be used to check that a URL is returning an expected result. We often use this functionality to check the status of FME Servers by calling the ‘healthcheck’ endpoint within the REST API service. There is also a similar call you can make to check Esri ArcGIS Enterprise servers.
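A keyword check can be sketched in a few lines of Python. This is only an illustration of the idea: the function name keyword_check is hypothetical, and the URL path and keyword in the usage comment are assumptions that should be confirmed against the REST API documentation for your own server version.

```python
import urllib.request
from typing import Callable


def keyword_check(url: str, keyword: str, timeout: float = 10.0,
                  opener: Callable = urllib.request.urlopen) -> bool:
    """Return True only if the response body contains the expected keyword."""
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # Unreachable, timed out, or an HTTP error status: treat as a failure.
        return False
    return keyword in body


# Hypothetical usage; host, path and keyword are assumptions:
# healthy = keyword_check("https://fme.example.com/fmerest/v3/healthcheck", "ok")
```

Checking for a keyword rather than only a status code catches the case where a server answers successfully but reports an unhealthy state in the body.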

Recently we wanted to check that FME Server was running jobs as expected within predefined parameters. The purpose was to allow us to proactively find potential processing slowdowns or large queue volumes. Whilst we could achieve this functionality natively within FME Server using automations or scheduled jobs, the monitoring needed to be kept as a separate system to improve redundancy.

Using FME with UptimeRobot

To achieve the desired monitoring outcome, a combination of FME and UptimeRobot was used. Every 8 hours, UptimeRobot calls a streaming service on FME Server. This FME process checks the validity of the previously run jobs and returns either True or False. UptimeRobot picks up this response. If it is False, or the response takes longer than 30 seconds, UptimeRobot will alert our specified notification channels to say that something may not be working as expected on the monitored FME Server.
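The decision logic on both sides can be sketched in Python. The real check runs as an FME process, so this is an illustration only: the Job fields, the runtime and queue thresholds, and the function names are all hypothetical. Only the True/False response and the 30-second limit come from the setup described above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Job:
    succeeded: bool          # hypothetical field: did the job finish without error?
    runtime_seconds: float   # hypothetical field: how long the job ran


def jobs_look_healthy(recent_jobs: Sequence[Job], queued_count: int,
                      max_runtime_seconds: float = 600.0,
                      max_queued: int = 20) -> bool:
    """Server side: True if recent jobs ran within the assumed parameters."""
    if queued_count > max_queued:
        return False
    return all(job.succeeded and job.runtime_seconds <= max_runtime_seconds
               for job in recent_jobs)


def should_alert(response_body: str, elapsed_seconds: float,
                 timeout_seconds: float = 30.0) -> bool:
    """Monitor side: alert on a slow response or anything other than True."""
    if elapsed_seconds > timeout_seconds:
        return True
    return response_body.strip().lower() != "true"
```

Treating any response other than True as a failure is a deliberate choice in this sketch: an error page or an empty body then raises an alert rather than passing silently.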

Dealing with outages

Previously, outages in these systems could go unnoticed for some time. By building and implementing these monitoring services, we are now alerted to outages as they occur and can respond much sooner.

So, if you have business-critical systems that aren’t being monitored, please reach out – we’d be happy to discuss how we can help you improve your own monitoring capabilities.