Data mining is the process of discovering patterns in large data sets using machine learning and statistics. It is also called Knowledge Discovery in Data (KDD). One of the most important tasks in data mining is outlier analysis, or outlier detection. In statistics and data science, an outlier is a point that lies far from the other points. It may be anomalous in a single feature, or only when several attributes are considered together. It may also be an extreme value of a variable. In other words, an outlier is a pattern that is dissimilar from the other patterns in the data. Outlier detection, also known as outlier mining, helps clean the data by flagging points that would otherwise bias the results.
Methods for Outlier Detection
There are many approaches to analyzing outliers. Some of them are:
a) Extreme Value Analysis. The most basic form of outlier analysis is extreme value analysis. In this method, we assume that values which are much smaller or much larger than the others are outliers.
b) Data Visualization. A pictorial representation, such as a box plot or a scatter plot, is a good way to present a result and to spot points that sit apart from the rest. This approach comes in handy when we deal with one- or two-dimensional data.
c) Univariate Techniques. A univariate outlier is one that has an extreme value for a single variable. Common ways to identify one include calculating z-scores; linear regression models and principal component analysis are also used, although they involve more than one variable. You can treat such an outlier by replacing it with the mean, median, or mode, by deleting the record if possible, or by substituting a domain-appropriate value for the variable.
d) Multivariate Techniques. Here the outliers are not necessarily extreme in any one dimension; rather, they deviate from the main data structure when several dimensions are considered together. Ways to detect them include DBSCAN clustering and isolation forests.
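The univariate and multivariate ideas above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production implementation. The function names, the thresholds (3.0 for z-scores, 1.5 for the interquartile range rule), the DBSCAN parameters eps and min_pts, and the sample data are all assumptions chosen for the example. The interquartile range rule is included as a common companion to the z-score, and dbscan_noise only reports the points DBSCAN would label as noise, without assigning cluster labels.

```python
import math
import statistics


def zscore_outliers(values, threshold=3.0):
    """Univariate: values whose absolute z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:  # all values identical, nothing can be extreme
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Univariate: values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def dbscan_noise(points, eps, min_pts):
    """Multivariate: points that DBSCAN would label as noise."""
    def near(p, q):
        return math.dist(p, q) <= eps

    # A core point has at least `min_pts` points (itself included) within eps.
    core = {p for p in points if sum(near(p, q) for q in points) >= min_pts}
    # Noise is neither a core point nor within eps of a core point.
    return [p for p in points
            if p not in core and not any(near(p, c) for c in core)]


# Hypothetical one-dimensional sample with a single extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print(zscore_outliers(readings))
print(iqr_outliers(readings))

# Hypothetical two-dimensional sample lying close to the line y = x.
# The last point is inside the range of both x and y, yet far from the line.
points = [(1.0, 1.1), (1.5, 1.4), (2.0, 2.1), (2.5, 2.4), (3.0, 3.1),
          (3.5, 3.4), (4.0, 4.1), (4.5, 4.4), (5.0, 5.1), (1.5, 4.5)]
print(dbscan_noise(points, eps=1.0, min_pts=3))
```

On this sample, both univariate rules flag only the value 95, while the point (1.5, 4.5) is caught only by the multivariate check, because neither of its coordinates is extreme on its own. In practice the thresholds and the DBSCAN parameters need tuning to the data set and the domain.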
Application of Outlier Analysis
Outlier analysis is an important step in data mining. The nature of the attributes determines which techniques are applicable. Outlier detection matters because it supports tasks such as fraud detection and the detection of anomalies in computer networks. It also poses challenges, primarily because of the large volume of high-dimensional data associated with most data mining applications and the performance requirements that come with it. Outright elimination of outliers sometimes leads to the loss of hidden information. Analyzing them is therefore important, as doing so can reveal previously unknown trends that may help a business.
Data science helps businesses extract new insights by combining data with feedback and by finding trends and patterns. Data mining helps capture these, because manual analysis cannot cope with such large amounts of data. Outlier analysis is thus important for getting an accurate picture of the situation. Handling noise is always a challenge in outlier detection. Whether to delete or retain outliers depends on the type of data set and the domain. Sometimes new outliers emerge that were previously masked by more extreme ones. Re-running the outlier analysis then becomes necessary to reduce the influence of the remaining outliers.
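The masking effect can be illustrated with a small sketch. It assumes a plain z-score rule with a threshold of 3.0 and repeats the analysis after removing whatever was flagged in the previous round. The function names, the round limit, and the sample values are hypothetical and chosen only to make the effect visible.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds `threshold`."""
    if len(values) < 2:
        return []
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


def rerun_outlier_analysis(values, threshold=3.0, max_rounds=10):
    """Repeat the z-score check, removing flagged values after each round.

    Returns the outliers found in each round and the values that remain.
    """
    remaining = list(values)
    rounds = []
    for _ in range(max_rounds):
        flagged = zscore_outliers(remaining, threshold)
        if not flagged:
            break
        rounds.append(flagged)
        remaining = [v for v in remaining if v not in flagged]
    return rounds, remaining


# Hypothetical sample: 500 inflates the mean and standard deviation so much
# that 40 looks ordinary until 500 has been removed.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 11, 10, 12, 40, 500]
rounds, remaining = rerun_outlier_analysis(values)
for number, flagged in enumerate(rounds, start=1):
    print(f"round {number}: {flagged}")
```

In the first round only 500 is flagged; once it is removed, the second round flags 40, and a third round finds nothing further. Repeated trimming can also remove legitimate values, so each round is best reviewed against domain knowledge rather than applied automatically.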