Local Correlation Integral (LOCI) is a density-based approach to outlier analysis. It is local in nature, i.e., it uses only nearby data points (in terms of distance) to compute the density of a point. The algorithm has one tunable parameter, the sampling radius $\delta$. Personally, I believe that the threshold $k$ used in the final step should also be tuned according to the data distribution. LOCI works in the following steps:
- Compute the density $M(X, \epsilon)$ of a data point $X$ as the number of neighbours within distance $\epsilon$ of $X$. Here, the density is known as the counting neighbourhood of data point $X$.
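- Compute the average density $AM(X, \epsilon, \delta)$ of data point $X$ as the mean of the densities $M(Y, \epsilon)$ of all neighbours $Y$ of $X$ within distance $\delta$. Here, $\delta$ defines what is known as the sampling neighbourhood of $X$.
  The value of $\epsilon$ is always set to half of $\delta$ in order to enable fast approximation. Therefore, we only need to tune $\delta$ for accuracy, without touching $\epsilon$ separately.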
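- Compute the Multi-Granularity Deviation Factor (MDEF) at distance $\delta$ as
  $$MDEF(X, \epsilon, \delta) = \frac{AM(X, \epsilon, \delta) - M(X, \epsilon)}{AM(X, \epsilon, \delta)} = 1 - \frac{M(X, \epsilon)}{AM(X, \epsilon, \delta)}$$
  This factor shows the deviation of the density $M(X, \epsilon)$ from the average density $AM(X, \epsilon, \delta)$ for $X$. Since this computation only considers local/neighbouring points, LOCI is referred to as local in nature. The larger the value of MDEF, the greater the outlier score. We use multiple values of $\delta$ to compute MDEF. Typically we start with a radius containing 20 points and go up to a maximum radius spanning most of the data.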
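- In this step, the deviation of $M(X, \epsilon)$ from $AM(X, \epsilon, \delta)$ is converted into a binary label, i.e., whether $X$ is an outlier or not. For this, we use the $\sigma(X, \epsilon, \delta)$ metric, defined as
  $$\sigma(X, \epsilon, \delta) = \frac{\mathrm{STD}_{Y:\, dist(X, Y) \le \delta}\; M(Y, \epsilon)}{AM(X, \epsilon, \delta)}$$
  Here, STD refers to the standard deviation.
- A data point $X$ is declared an outlier if its MDEF value is greater than $k \cdot \sigma(X, \epsilon, \delta)$, where $k$ is chosen to be 3.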
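
The steps above can be sketched in a few lines of plain Python. This is only a brute-force illustration under some assumptions of mine, not a reference implementation: the function name `loci_flags`, the `min_neighbours` parameter, the use of Euclidean distance, the population (divide-by-n) standard deviation, and counting a point as its own neighbour are all choices made for this sketch. A point is flagged if it exceeds the threshold for any of the supplied radii.

```python
import math

def loci_flags(points, deltas, k=3.0, min_neighbours=20):
    """Flag points whose MDEF exceeds k * sigma for at least one radius delta.

    Brute-force O(n^2) sketch; names and defaults are illustrative assumptions.
    """
    n = len(points)
    # Pairwise Euclidean distances (each point is at distance 0 from itself).
    dist = [[math.dist(p, q) for q in points] for p in points]
    flags = [False] * n
    for delta in deltas:
        eps = delta / 2.0  # counting radius is half the sampling radius
        # M[i]: size of the counting neighbourhood of point i (includes i).
        M = [sum(1 for d in row if d <= eps) for row in dist]
        for i in range(n):
            # Densities of all points in the sampling neighbourhood of i.
            sample = [M[j] for j in range(n) if dist[i][j] <= delta]
            if len(sample) < min_neighbours:
                continue  # too few points for a meaningful mean and STD
            am = sum(sample) / len(sample)  # average density AM
            std = math.sqrt(sum((m - am) ** 2 for m in sample) / len(sample))
            mdef = 1.0 - M[i] / am
            sigma = std / am
            if mdef > k * sigma:
                flags[i] = True
    return flags

# Toy data: 30 points on the unit circle plus one isolated point at (6, 0).
pts = [(math.cos(2 * math.pi * i / 30), math.sin(2 * math.pi * i / 30))
       for i in range(30)]
pts.append((6.0, 0.0))
flags = loci_flags(pts, deltas=[3.0, 8.0])
print([i for i, f in enumerate(flags) if f])  # only the isolated point, index 30
```

At $\delta = 8$ the isolated point has $M = 1$ while its 30 sampled neighbours each have $M = 30$, so its MDEF is about 0.97 against a threshold $3\sigma$ of about 0.53. The circle points get a slightly negative MDEF and are not flagged.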
Reference:
- I learned this algorithm from the book Outlier Analysis by Charu Aggarwal.