After a brief introduction to outliers that answered questions such as what outliers are, what is needed to detect them, and where outlier detection is applied, let us familiarize ourselves with outlier detection methods.
Outlier Detection Methods
There are many outlier detection methods in practice. We will get acquainted with two main ways of grouping them: by the learning model they use and by the computation model they use.
Based on Learning model
If we have expert-labeled examples of normal and/or outlier objects, they can be used to build outlier detection models. Depending on how many such labels are available, the methods can be divided into supervised, semi-supervised, and unsupervised methods.
Supervised Methods
Supervised methods model both data normality and abnormality. Domain experts examine and label a sample of the underlying data. Outlier detection can then be treated as a classification problem: the task is to learn a classifier that recognizes outliers.
In supervised models, we must be careful while training and while interpreting classification rates, because outliers are rare in comparison to normal data objects, so the two classes are heavily imbalanced.
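To see why classification rates need careful reading, here is a minimal sketch with made-up labels (990 normal objects and 10 outliers are an assumption for illustration). A "classifier" that never flags anything still scores high accuracy, while its recall on outliers is zero.

```python
# Hypothetical toy labels: 1 = outlier, 0 = normal (990 normal, 10 outliers).
y_true = [0] * 990 + [1] * 10

# A useless "classifier" that always predicts normal.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of all objects that were classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def outlier_recall(y_true, y_pred):
    """Fraction of true outliers that were actually flagged."""
    flagged = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    total = sum(y_true)
    return flagged / total if total else 0.0


acc = accuracy(y_true, y_pred)
rec = outlier_recall(y_true, y_pred)
print(acc)  # 0.99
print(rec)  # 0.0
```

So for outlier detection, a measure such as recall on the outlier class is usually more informative than overall accuracy.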
Unsupervised Methods
In some cases, objects labeled as normal or outlier are not available. An unsupervised learning method can then be used.
In the unsupervised learning model, we make an assumption that:
Normal objects are somewhat clustered, i.e., they lie close together.
In other words, for the unsupervised outlier detection method, normal objects must follow a pattern more frequently than outliers. Many clustering techniques can be used for unsupervised outlier detection. The underlying idea is to find clusters first, and then data objects that are not included in any cluster are detected as outliers.
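The idea of "find clusters first, then treat leftover objects as outliers" can be sketched as follows. The clustering algorithm here is a small DBSCAN-style density clustering, chosen only for illustration; the points and the `eps` and `min_pts` values are assumptions, not something prescribed above.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: cluster id 0, 1, ... or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point of this cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical 2-D data: two dense groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
    (9.0, 0.5),
]
labels = dbscan(points, eps=0.5, min_pts=3)

# Objects that ended up in no cluster are reported as outliers.
outliers = [p for p, label in zip(points, labels) if label == -1]
print(outliers)  # [(9.0, 0.5)]
```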
Semi-Supervised Methods
In many cases, getting some labeled examples is feasible, but the number of such labeled data objects is quite low. Semi-supervised methods are developed for such scenarios.
When some labeled normal objects are available, we can use them, together with unlabeled objects that are close by, to train a model for normal objects. That model can then be used to detect outliers: objects that do not fit it are flagged. When only a few labeled outliers are available, the task is trickier, because a handful of labeled outliers is unlikely to represent all possible outliers.
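A minimal sketch of this idea, assuming 1-D data and a hypothetical `radius` parameter: start from the few labeled normal objects, repeatedly absorb unlabeled objects that lie close to the current normal set, and report whatever is left over as outliers. The function name and the values are made up for illustration.

```python
def expand_normal_set(labeled_normal, unlabeled, radius):
    """Grow the set of normal objects with nearby unlabeled ones.

    Returns (normal, remaining), where remaining holds the unlabeled
    objects that never came within `radius` of a normal object.
    """
    normal = list(labeled_normal)
    remaining = list(unlabeled)
    changed = True
    while changed:
        changed = False
        still_remaining = []
        for x in remaining:
            if any(abs(x - n) <= radius for n in normal):
                normal.append(x)  # x is close to a normal object: treat as normal
                changed = True
            else:
                still_remaining.append(x)
        remaining = still_remaining
    return normal, remaining


# Hypothetical data: two labeled normal readings and a pool of unlabeled ones.
labeled_normal = [10.0, 10.5]
unlabeled = [10.8, 11.2, 9.6, 25.0, 11.5]

normal, semi_outliers = expand_normal_set(labeled_normal, unlabeled, radius=0.5)
print(sorted(normal))   # [9.6, 10.0, 10.5, 10.8, 11.2, 11.5]
print(semi_outliers)    # [25.0]
```

Note that chained expansion like this can also absorb outliers if they happen to sit near the edge of the normal region, so the radius needs care.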
Based on Computation model
Outlier detection methods make assumptions about how outliers differ from the rest of the data. Based on the assumptions made, we can distinguish three types of methods, viz. statistical methods, proximity-based methods, and clustering-based methods.
Statistical Methods
Statistical methods take an assumption about data normality as their base. They assume that normal data objects are generated by a statistical model, and data that does not fit the model is considered an outlier.
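As a rough illustration, assume the data come from a single Gaussian distribution and flag any value more than three standard deviations from the mean. Both the Gaussian assumption and the threshold of 3 are choices made for this sketch, and the readings are made up.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds threshold * std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing deviates from the model
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical temperature readings: 20 ordinary values and one extreme one.
readings = [24.0, 24.5, 23.8, 24.2, 24.1, 23.9, 24.3, 24.0, 24.4, 23.7] * 2 + [35.0]

stat_outliers = zscore_outliers(readings)
print(stat_outliers)  # [35.0]
```

With very small samples a single extreme value inflates the standard deviation so much that it may not cross the threshold, which is one reason the choice of model matters.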
Proximity-Based Methods
Proximity-based methods consider a data object an outlier if its nearest neighbors are far away in feature space, that is, if the proximity of the object to its neighbors deviates significantly from the proximity of most other objects to their neighbors in the same data set.
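One simple way to turn this into code is to score each object by the distance to its k-th nearest neighbor and flag objects whose score exceeds a threshold. The values of `k` and `threshold` below, and the points themselves, are assumptions for this sketch.

```python
import math


def kth_neighbor_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (k >= 1)."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return dists[k - 1]


def knn_outliers(points, k, threshold):
    """Objects whose k-th nearest neighbor is farther away than threshold."""
    return [
        p for i, p in enumerate(points)
        if kth_neighbor_distance(points, i, k) > threshold
    ]


# Hypothetical 2-D data: five nearby points and one far away.
knn_points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (1.3, 1.1),
    (6.0, 6.0),
]

proximity_outliers = knn_outliers(knn_points, k=2, threshold=1.0)
print(proximity_outliers)  # [(6.0, 6.0)]
```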
Clustering-Based Methods
Clustering-based methods consider that the normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any clusters.
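A small sketch of the "small clusters are outliers" idea, using 1-D values grouped by gaps between sorted neighbors. The gap-based grouping, `max_gap`, and `min_size` are assumptions picked to keep the example short; any clustering algorithm could take their place.

```python
def gap_clusters(values, max_gap):
    """Group 1-D values: a new cluster starts when the gap exceeds max_gap."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] > max_gap:
            clusters.append([v])  # large gap: start a new cluster
        else:
            clusters[-1].append(v)
    return clusters


def small_cluster_outliers(values, max_gap, min_size):
    """Members of clusters with fewer than min_size objects."""
    return [
        v
        for cluster in gap_clusters(values, max_gap)
        if len(cluster) < min_size
        for v in cluster
    ]


# Hypothetical values: two large groups, one tiny group, one isolated value.
values = [1.0, 1.2, 1.1, 0.9, 1.3, 5.0, 5.1, 4.9, 5.2, 9.0, 9.1, 20.0]

cluster_outliers = small_cluster_outliers(values, max_gap=1.0, min_size=3)
print(cluster_outliers)  # [9.0, 9.1, 20.0]
```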
Example
Intrusion detection by clustering-based outlier detection
Consider a method that was developed to detect intrusions in TCP connection data by considering the similarity between data points and the clusters in a training data set.
- A training data set is used to find patterns of normal data. The TCP connection data is segmented according to some attribute; for this example, consider dates. Frequent itemsets can be found in each segment. The itemsets that are frequent in a majority of the segments are the patterns of normal data, i.e., the base connections.
- Connections in the training data set that contain base connections are treated as attack-free. Such attack-free connections are clustered into groups.
- The data points in the original data are compared with the clusters obtained. Any point that deviates from the clusters is an outlier and is declared a possible attack.
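The three steps above can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the connection records, the attribute names, the support and similarity thresholds, the use of Jaccard similarity, and the very crude "clustering" that just groups identical connections. It is a toy outline of the workflow, not the actual method.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical connection records: (date, set of attribute=value items).
connections = [
    ("2024-01-01", frozenset({"service=http", "flag=SF", "port=80"})),
    ("2024-01-01", frozenset({"service=http", "flag=SF", "port=80"})),
    ("2024-01-01", frozenset({"service=ssh", "flag=SF", "port=22"})),
    ("2024-01-02", frozenset({"service=http", "flag=SF", "port=80"})),
    ("2024-01-02", frozenset({"service=ssh", "flag=SF", "port=22"})),
    ("2024-01-02", frozenset({"service=telnet", "flag=REJ", "port=23"})),
    ("2024-01-03", frozenset({"service=http", "flag=SF", "port=80"})),
    ("2024-01-03", frozenset({"service=ssh", "flag=SF", "port=22"})),
]


def frequent_itemsets(records, min_support, max_size=2):
    """Itemsets (up to max_size items) present in >= min_support of records."""
    counts = Counter()
    for items in records:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[frozenset(combo)] += 1
    return {s for s, c in counts.items() if c / len(records) >= min_support}


def base_patterns(connections, min_support):
    """Step 1: itemsets that are frequent in a majority of date segments."""
    segments = defaultdict(list)
    for date, items in connections:
        segments[date].append(items)
    seen_in = Counter()
    for records in segments.values():
        seen_in.update(frequent_itemsets(records, min_support))
    return {s for s, n in seen_in.items() if 2 * n > len(segments)}


def jaccard(a, b):
    """Similarity of two non-empty item sets."""
    return len(a & b) / len(a | b)


def possible_attacks(connections, min_support=0.5, min_similarity=0.5):
    patterns = base_patterns(connections, min_support)
    # Step 2: connections containing a base pattern are treated as attack-free.
    attack_free = [
        items for _, items in connections if any(p <= items for p in patterns)
    ]
    clusters = set(attack_free)  # crude clustering: identical connections
    # Step 3: connections dissimilar to every cluster are possible attacks.
    return [
        (date, items)
        for date, items in connections
        if all(jaccard(items, c) < min_similarity for c in clusters)
    ]


attacks = possible_attacks(connections)
print([(date, sorted(items)) for date, items in attacks])
```

In this toy data, only the rejected telnet connection fails to match any base pattern or cluster, so it is the one reported.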
Outlier detection methods for high-dimensional data can be divided into three main approaches i.e. extending conventional outlier detection, finding outliers in subspaces, and modeling high-dimensional outliers.
Try building an outlier detection model that detects fraudulent purchases from a record of purchases.
Hope it helps!
May the force be with you!