# Local Correlation Integral (LOCI) – Outlier Detection Algorithm

Local Correlation Integral (LOCI) is a density-based approach to outlier analysis. It is local in nature, i.e., it uses only nearby data points (in terms of distance) to compute the density of a point. The algorithm has one tunable parameter – $\delta$. Personally, I believe that $k$ (the threshold multiplier used in step 5) should also be tuned according to the data distribution. LOCI works in the following steps:

1.  Compute the density, $M(X,\epsilon)$, of data point $X$ as the number of neighbours within distance $\epsilon$. Here, the $\epsilon$-neighbourhood is known as the counting neighbourhood of data point $X$
$M(X,\epsilon) = COUNT_{(Y:dist(X,Y) \leq \epsilon; Y \in datapoints )} Y$
2.  Compute the average density, $AM(X,\epsilon,\delta)$, of data point $X$ as the mean of the densities $M(Y,\epsilon)$ of all points $Y$ within distance $\delta$ of $X$. Here, the $\delta$-neighbourhood is known as the sampling neighbourhood of $X$
$AM(X,\epsilon,\delta) = MEAN_{(Y:dist(X,Y) \leq \delta)} M(Y,\epsilon)$

The value of $\epsilon$ is always set to half of $\delta$ in order to enable fast approximation. Therefore, we only need to tune $\delta$ for accuracy; $\epsilon$ follows from it and is never tuned separately.
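Steps 1 and 2 can be sketched in a few lines of NumPy. This is only a minimal brute-force illustration, not a reference implementation: the function names and the toy data are my own assumptions, distances are Euclidean, and each point counts itself as a neighbour (its distance to itself is 0, which satisfies $dist(X,Y) \leq \epsilon$).

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of rows (brute force, O(n^2))."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def counting_density(dist, eps):
    """M(X, eps) for every point: number of points within eps (self included)."""
    return (dist <= eps).sum(axis=1)

def average_density(dist, eps, delta):
    """AM(X, eps, delta): mean of M(Y, eps) over all Y within delta of X."""
    m = counting_density(dist, eps)
    sampling = dist <= delta              # sampling[i, j]: is j within delta of i?
    return (sampling * m).sum(axis=1) / sampling.sum(axis=1)

# Hypothetical toy data: a tight square of four points plus one distant point.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
dist = pairwise_distances(points)

delta = 6.0
eps = delta / 2.0                         # eps is fixed to half of delta

M = counting_density(dist, eps)
AM = average_density(dist, eps, delta)
print("M :", M)
print("AM:", AM)
```

In this toy set the distant point has a much smaller count than the average count of the points in its sampling neighbourhood, which is exactly the gap that the next step measures.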

3.  Compute the Multi-Granularity Deviation Factor (MDEF) at distance $\delta$ as
${MDEF}(X,\epsilon,\delta) = \frac{AM(X,\epsilon,\delta) - M(X,\epsilon)}{AM(X,\epsilon,\delta)}$

This factor shows the relative deviation of $M$ from $AM$ for $X$. Since this computation considers only local/neighbouring points, LOCI is referred to as local in nature. The larger the value of MDEF, the greater the outlier score. We compute MDEF for multiple values of $\delta$. Typically, we start with a radius containing about 20 points and go up to a maximum radius spanning most of the data.
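To see why several values of $\delta$ are needed, here is a small sketch that evaluates MDEF at a few radii. The helper name `mdef`, the grid data, and the chosen radii are assumptions made only for illustration; the 20-point starting rule mentioned above is not implemented here.

```python
import numpy as np

def mdef(points, delta):
    """MDEF(X, eps, delta) for every point, with eps = delta / 2 (brute force)."""
    eps = delta / 2.0
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    m = (dist <= eps).sum(axis=1)                     # M(X, eps), self included
    sampling = dist <= delta                          # sampling neighbourhood
    am = (sampling * m).sum(axis=1) / sampling.sum(axis=1)
    return (am - m) / am

# Hypothetical data: a 5 x 5 unit grid plus one isolated point at (10, 10).
xs, ys = np.meshgrid(np.arange(5.0), np.arange(5.0))
grid = np.column_stack([xs.ravel(), ys.ravel()])
points = np.vstack([grid, [[10.0, 10.0]]])

deltas = [4.0, 8.0, 16.0]                             # assumed radii, small to large
scores = np.array([mdef(points, d) for d in deltas])  # shape: (len(deltas), n_points)

for d, row in zip(deltas, scores):
    print(f"delta={d:5.1f}  MDEF of isolated point = {row[-1]:.3f}")
```

At the two smaller radii the isolated point has only itself in its sampling neighbourhood, so $AM = M$ and its MDEF is 0. Only at the largest radius does the sampling neighbourhood reach the grid and the MDEF become large, so a single badly chosen $\delta$ could miss the point entirely.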

4.  In this step, the deviation of $M$ from $AM$ is converted into a binary label, i.e., whether $X$ is an outlier or not. For this, we use the metric $\sigma(X,\epsilon,\delta)$, defined as
${\sigma}(X,\epsilon,\delta) = \frac{STD_{(Y:dist(X,Y) \leq \delta)}M(Y,\epsilon)}{AM(X,\epsilon,\delta)}$
Here, STD refers to the standard deviation.
5.  A data point $X$ is declared an outlier if its MDEF value is greater than $k \cdot \sigma(X,\epsilon,\delta)$ at any of the tested values of $\delta$, where $k$ is chosen to be 3.
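Putting the five steps together, a brute-force version of the whole procedure might look like the sketch below. The function `loci_flags`, the example data, and the list of radii are my own assumptions for illustration. The sketch uses the population standard deviation and flags a point if the test succeeds at any radius. It does not include the fast approximation that motivates $\epsilon = \delta/2$.

```python
import numpy as np

def loci_flags(points, deltas, k=3.0):
    """Flag points whose MDEF exceeds k * sigma at any radius in `deltas`."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    flagged = np.zeros(len(points), dtype=bool)
    for delta in deltas:
        eps = delta / 2.0                                  # eps is half of delta
        m = (dist <= eps).sum(axis=1).astype(float)        # step 1: M(X, eps)
        sampling = dist <= delta
        n = sampling.sum(axis=1)
        am = (sampling * m).sum(axis=1) / n                # step 2: AM(X, eps, delta)
        mdef = (am - m) / am                               # step 3: MDEF
        # Step 4: population std of M(Y, eps) over each sampling neighbourhood.
        var = (sampling * (m[None, :] - am[:, None]) ** 2).sum(axis=1) / n
        sigma = np.sqrt(var) / am
        flagged |= mdef > k * sigma                        # step 5: k-sigma rule
    return flagged

# Hypothetical data: a 5 x 5 unit grid plus one isolated point at (10, 10).
xs, ys = np.meshgrid(np.arange(5.0), np.arange(5.0))
grid = np.column_stack([xs.ravel(), ys.ravel()])
points = np.vstack([grid, [[10.0, 10.0]]])

flags = loci_flags(points, deltas=[4.0, 8.0, 16.0], k=3.0)
print("outlier indices:", np.flatnonzero(flags))
```

Changing `k` here is an easy way to experiment with the idea from the introduction that $k$ may also need tuning: a smaller `k` flags more points, a larger one fewer.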

Reference:

1.  I learned this algorithm from the book *Outlier Analysis* by Charu Aggarwal.