Local Correlation Integral (LOCI) – Outlier Detection Algorithm

Local Correlation Integral (LOCI) is a density based approach for outlier analysis. It is local in nature, i.e., uses only nearby data points in terms of distance to compute density of a point. In this algorithm we have one tunable parameter – \delta . Personally, I believe that we need to tune k also according to data distribution. LOCI works with following steps

  1.  Compute density, M(X,\epsilon) of data point X as the number of neighbours within distance \epsilon . Here, density is known as counting neighbourhood of data point X
    M(X,\epsilon) = COUNT_{(Y:dist(X,Y) \leq \epsilon; Y \in datapoints )} Y
  2.  Compute average density, AM(X,\epsilon,\delta) of data point   X as the MEAN(density of neighbours of X within distance,   \delta). Here, \delta is known as sampling neighbourhood of X
    AM(X,\epsilon,\delta) = MEAN_{(Y:dist(X,Y) \leq \delta)} M(Y,\epsilon)

    The value of \epsilon is always set to be half of \delta in order to enable fast approximation. Therefore, we need to tune  \delta for accuracy without touching \epsilon

  3.  Compute Multi-Granularity Deviation Factor (MDEF) at distance,  \delta as                                                                                                                                                                    {MDEF}(X,\epsilon,\delta) = \frac{AM(X,\epsilon,\delta) - M(X,\epsilon)}{AM(X,\epsilon,\delta)}  

    This factor shows the deviation of  M from  AM for  X. Since this computation only considers local/neighbour points, therefore LOCI is referred as local in nature. The larger the value of MDEF, the greater is the outlier score. We use multiple values of  \delta to compute MDEF. Mostly we start with a radius containing 20 points to a maximum of radius spanning most of data.

  4. In this step, the deviation of  M from  AM is converted into binary label, i.e., whether  X is outlier or not. For this, we use  \sigma(X,\epsilon,\delta) metric as
    \begin{aligned}{\sigma}(X,\epsilon,\delta) = \frac{STD_{(Y:dist(X,Y) \leq \delta)}M(Y,\epsilon)}{AM(X,\epsilon,\delta)} \end{aligned}
    Here, STD refers to standard deviation.
  5.  A data point, X is declared as an outlier if its MDEF value is greater than  k. \sigma(X,\epsilon,\delta) , where  k is chosen to be 3.

Reference:

  1.  I have understood this algorithm from book: Outlier Analysis by Charu Aggarwal
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s