Local Correlation Integral (LOCI) is a density-based approach to outlier analysis. It is local in nature, i.e., it uses only nearby data points, in terms of distance, to compute the density of a point. The algorithm has one tunable parameter, the sampling radius $\delta$. Personally, I believe that the outlier threshold $k$ (used in the last step) also needs to be tuned according to the data distribution. LOCI works in the following steps:
- Compute the density $M(X, \varepsilon)$ of a data point $X$ as the number of neighbours within distance $\varepsilon$. Here, this density is known as the counting neighbourhood of data point $X$.
- Compute the average density $AM(X, \varepsilon, \delta)$ of data point $X$ as the MEAN of the densities $M(Y, \varepsilon)$ of the neighbours $Y$ of $X$ within distance $\delta$. Here, $\delta$ is known as the sampling neighbourhood of $X$.
The value of $\varepsilon$ is always set to half of $\delta$ to enable fast approximation. Therefore, we tune $\delta$ for accuracy without setting $\varepsilon$ separately.
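- Compute the Multi-Granularity Deviation Factor (MDEF) at distance $\delta$ as
$$\mathrm{MDEF}(X, \varepsilon, \delta) = \frac{AM(X, \varepsilon, \delta) - M(X, \varepsilon)}{AM(X, \varepsilon, \delta)} = 1 - \frac{M(X, \varepsilon)}{AM(X, \varepsilon, \delta)}$$
This factor shows the deviation of the density $M(X, \varepsilon)$ from the average density $AM(X, \varepsilon, \delta)$ for $X$. Since this computation considers only local (neighbouring) points, LOCI is referred to as local in nature. The larger the MDEF value, the greater the outlier score. We compute MDEF for multiple values of $\delta$, typically ranging from a radius that contains 20 points up to a maximum radius that spans most of the data.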
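To make the formula concrete, here is a small worked calculation with made-up numbers (they are not taken from any dataset). Write $M$ for a point's own neighbour count and $AM$ for the average count over its sampling neighbourhood. A point with $M = 2$ whose neighbours average $AM = 10$ sits in a much sparser spot than they do, while a point with $M = 10$ and $AM = 10$ looks just like its neighbours:

$$\mathrm{MDEF} = 1 - \frac{2}{10} = 0.8 \qquad \text{versus} \qquad \mathrm{MDEF} = 1 - \frac{10}{10} = 0$$

A point that is denser than its neighbours, say $M = 12$ with $AM = 10$, gets a negative value, $1 - 12/10 = -0.2$.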
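- In this step, the deviation of the density $M(X, \varepsilon)$ from the average density $AM(X, \varepsilon, \delta)$ is converted into a binary label, i.e., whether $X$ is an outlier or not. For this, we use the metric
$$\sigma(X, \varepsilon, \delta) = \frac{\mathrm{STD}_{Y:\, \mathrm{dist}(X, Y) \le \delta}\, M(Y, \varepsilon)}{AM(X, \varepsilon, \delta)}$$
Here, STD refers to the standard deviation.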
- A data point $X$ is declared an outlier if its MDEF value is greater than $k \cdot \sigma(X, \varepsilon, \delta)$, where $k$ is chosen to be 3.
I learned this algorithm from the book Outlier Analysis by Charu Aggarwal.
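The steps above can be sketched in a few lines of NumPy. This is only a minimal brute-force illustration for a single sampling radius, not a reference implementation: the function name `loci_single_radius` and the toy data are hypothetical, and I am assuming that each point counts itself as one of its own neighbours and that STD is the population standard deviation.

```python
import numpy as np


def loci_single_radius(points, delta, k=3.0):
    """Return (mdef, sigma, is_outlier) for one sampling radius delta."""
    points = np.asarray(points, dtype=float)
    eps = delta / 2.0  # counting radius is fixed at half the sampling radius

    # Pairwise Euclidean distances, shape (n, n). Brute force, O(n^2) memory.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Density of every point: number of points within eps
    # (assumption: the point itself is included in the count).
    counts = (dist <= eps).sum(axis=1)

    # Row i marks the sampling neighbourhood of point i (points within delta).
    sampling = dist <= delta
    n_sampling = sampling.sum(axis=1)

    # Average density over each sampling neighbourhood.
    avg = (sampling * counts[None, :]).sum(axis=1) / n_sampling

    # Population standard deviation of the densities over the same neighbourhood.
    sq_dev = (counts[None, :] - avg[:, None]) ** 2
    std = np.sqrt((sampling * sq_dev).sum(axis=1) / n_sampling)

    mdef = 1.0 - counts / avg
    sigma = std / avg
    return mdef, sigma, mdef > k * sigma


# Toy data (hypothetical): 40 evenly spaced points on a unit circle
# plus one isolated point at the centre of the circle (index 40).
angles = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
data = np.vstack([ring, [[0.0, 0.0]]])

mdef, sigma, is_outlier = loci_single_radius(data, delta=1.2)
print(np.flatnonzero(is_outlier))  # indices of the flagged points
```

On this toy data only the centre point is flagged: its own count is 1, while every ring point in its sampling neighbourhood has a count of 7. To follow the multi-granularity idea, one option is to call the function for several values of `delta` and compare the results across radii; for anything beyond small datasets the pairwise distance matrix would need to be replaced by a spatial index.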