
Outlier detection is all the rage now when it comes to analytics and business intelligence (BI), given its usefulness in spotting important signals in business data. It's hardly a new field, though: outlier detection techniques have been used for decades in areas such as imaging and image processing, where separating outlier pixel values from real data helps astronomers remove noise, resolve features and increase the accuracy of data from the heavens.
Back on Earth, however, when outlier detection is applied to time series data of business metrics, companies can spot important events occurring in their business, and respond quickly, avoiding astronomical losses. In order to fully enjoy the benefits of outlier detection, though, you must first have a working automated outlier detection system in place.
Many companies begin the process of building their own outlier detection system, only to encounter the complexities, challenges and nuances of accurate, real-time, large-scale outlier detection, which lead to delays, cost overruns and missed signals. These companies learn too late that successfully applying outlier detection algorithms to many different types of data is very different from understanding how they work in the clean, clear examples of a statistics textbook.
To understand why, let’s go back for a moment to outlier detection in imaging. Simple examples of identifying outliers include using multiple images or histograms to detect pixels with unusually high values due to non-optical transient events like electronic noise or even particles of background radiation hitting the pixel on the imaging sensor chip.
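To make that concrete, here is a minimal sketch in Python with NumPy, using simulated data rather than real telescope frames: compare several exposures of the same scene and flag pixel values that sit far above the per-pixel median, the way a transient noise or radiation event would. The frame sizes, injected "hits" and five-sigma cutoff are all illustrative assumptions, not a production pipeline.

```python
import numpy as np

# Illustrative sketch (not an astronomy pipeline): compare several exposures
# of the same scene and flag pixel values that jump far above the per-pixel
# median, as transient events such as electronic noise or radiation hits would.

rng = np.random.default_rng(seed=0)

# Simulate five 256x256 exposures of one scene, then inject two "hits".
scene = rng.normal(loc=100.0, scale=5.0, size=(256, 256))
frames = np.stack([scene + rng.normal(0.0, 2.0, scene.shape) for _ in range(5)])
frames[2, 40, 80] += 300.0   # radiation hit in frame 2
frames[4, 10, 10] += 250.0   # another hit in frame 4

# The per-pixel median across the stack is a robust estimate of the true scene.
median = np.median(frames, axis=0)
residual = frames - median

# Flag samples that sit well above the typical frame-to-frame scatter.
sigma = residual.std()
outlier_mask = residual > 5 * sigma
print("Flagged samples per frame:", outlier_mask.sum(axis=(1, 2)))
```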
Like images, scatter plots are essentially two-dimensional arrays. One key difference is that points on a scatter plot often don't carry a value of their own (other than being present or absent), whereas an image has a color intensity value at each location on its grid, so each pixel has an x coordinate, a y coordinate and an intensity. In essence, scatter plots are like the traffic accident maps you often see on the local news, while images are more like the color-coded maps on state and regional weather reports (you know, the ones where Phoenix is always in the red zone).
Although they are different types of data sets, scatter plots and images are similar enough that the same kinds of outlier detection methods can be used on both: namely, methods that flag as outliers any data points lying farther than some threshold distance from the centroid of one or more clusters in the data (with the clusters themselves determined by an algorithm such as k-means clustering).
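As a rough illustration of that approach, here is a short Python sketch using scikit-learn's k-means. The synthetic blobs, the choice of two clusters and the three-times-mean-distance threshold are illustrative assumptions, not a recipe for any particular product or dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of distance-from-centroid outlier detection: cluster the points with
# k-means, then flag any point whose distance to its assigned centroid exceeds
# a threshold derived from that cluster's typical spread.

rng = np.random.default_rng(seed=1)

# Two well-separated blobs plus a few stray points.
blob_a = rng.normal(loc=(0, 0), scale=0.5, size=(200, 2))
blob_b = rng.normal(loc=(6, 6), scale=0.5, size=(200, 2))
strays = np.array([[3.0, 3.0], [0.0, 6.0], [8.5, 1.0]])
points = np.vstack([blob_a, blob_b, strays])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

# Distance of each point to the centroid of its own cluster.
distances = np.linalg.norm(points - centroids[labels], axis=1)

# Flag points farther than three times their cluster's mean distance.
threshold = 3 * np.array([distances[labels == c].mean() for c in range(2)])
outliers = distances > threshold[labels]
print(points[outliers])
```

Even in this toy version, two tuning decisions (the number of clusters and the distance threshold) had to be made by hand, which is exactly the kind of judgment call the rest of this article is about.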
Business metrics can be a lot like scatter plots and digital images: different from one another, yet similar enough that the same outlier detection tools can be applied effectively. Why, then, is building an automated outlier detection system for business metrics such a challenge? The short answer: there are many outlier detection tools to choose from, the right one depends on the characteristics of the data you're working with, and each tool has parameters that require fine tuning.
Outlier detection methods: specialized tools, effective in trained hands
Both clustering and histogram-based approaches rely on thresholds, which can be a bit arbitrary. Although three sigma is a standard statistical threshold for flagging an outlier, it isn't divinely handed down, and the three-sigma test assumes a Gaussian distribution, in which the mean is also the median. Other distributions are possible, and quite common. As author Malcolm Gladwell has pointed out, many real-world problems become easier to solve once you discover they aren't tied to a bell curve but to some other type of distribution. You must first determine what kind of distribution a particular metric follows before selecting the best statistical tools for finding outliers in that metric.
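To illustrate the point, here is a small Python sketch that checks the Gaussian assumption (with SciPy's normaltest) before deciding whether a three-sigma rule is even appropriate. The simulated metric, the 0.05 cutoff and the quantile fallback are illustrative choices only, not the one right way to do it.

```python
import numpy as np
from scipy import stats

# Sketch: before applying the three-sigma rule, check whether the Gaussian
# assumption is plausible for this metric; if not, fall back to a rule that
# doesn't depend on the bell curve.

rng = np.random.default_rng(seed=2)
metric = rng.lognormal(mean=3.0, sigma=0.4, size=1_000)  # skewed, not Gaussian

stat, p_value = stats.normaltest(metric)

if p_value > 0.05:
    # Looks roughly Gaussian: mean +/- 3 standard deviations is defensible.
    mu, sigma = metric.mean(), metric.std()
    outliers = metric[np.abs(metric - mu) > 3 * sigma]
    print(f"Three-sigma rule flagged {outliers.size} points")
else:
    # Clearly non-Gaussian: use a quantile-based rule instead.
    lo, hi = np.percentile(metric, [0.5, 99.5])
    outliers = metric[(metric < lo) | (metric > hi)]
    print(f"Quantile rule flagged {outliers.size} points")
```

A quantile cut is only one of many possible fallbacks; the point is that the choice depends on what the data actually looks like.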
Let’s put it this way – time series plots are to business data analysts what raw intelligence reports are to the CIA: an indispensable source of insight and information if the proper analysis is performed by trained experts.
The many outlier detection algorithms out there often require preliminary visualization and analysis by humans before the most appropriate algorithm can be applied and tuned to give the best results. This may work great when your company can afford to have a single analyst spend an hour on one dataset, but it’s a non-starter if you’re a business with thousands or millions of metrics, any one of which could hold clues to the business equivalent of a terrorist plot unfolding.
The challenges of automating outlier detection
If you're already used to applying these outlier detection methods manually, it may seem easy to automate human eyes out of the loop. As outlier detection system vendor Anodot explains, however, building your own system is almost always far more expensive than buying one off the shelf. One reason is that nuanced distinctions between metrics and between outlier detection methods arise at every step of designing such a system, not unlike the distinctions between scatter plots and images discussed above. Getting those nuances and distinctions wrong leads to bad design choices that let important outliers slip by (false negatives) and cause nominal data points to trigger alerts (false positives). The real 'gotcha' is that the cost incurred by only a few of those false positives and negatives can be more than enough to wipe out any savings gained by developing your own outlier detection system.
The data science behind outlier detection has produced many algorithms, each suited to specific types of data and many of them useful for helping companies spot problems and opportunities in their data. While the benefit of automated anomaly detection to your business is real, the apparent benefits of designing, implementing, testing and refining your own solution are not.