# Illustration of k value effect on outlier score

Continuing with the previous post, here, I will illustrate how outlier scores vary while considering different k values. The context of below figure is already explained in my previous post.

After running the LOF algorithm with following R code lines


library(Rlof) # for applying local outlier factor
library(HighDimOut) # for normalization of lof scores
set.seed(200)
df <- data.frame(x = c( 5, rnorm(2,20,1), rnorm(3,30,1), rnorm(5,40,1), rnorm(9,10,1), rnorm(10,37,1)))
df$y <- c(38, rnorm(2,30,1), rnorm(3,10,1), rnorm(5,40,1), rnorm(9,20,1), rnorm(10,25,1)) #pdf("understandK.pdf", width = 6, height = 6) plot(df$x, df$y, type = "p", ylim = c(min(df$y), max(df$y) + 5), xlab = "x", ylab = "y") text(df$x, df\$y, pos = 3, labels = 1:nrow(df), cex = 0.7)
dev.off()
lofResults <- lof(df, c(2:10), cores = 2)
apply(lofResults, 2, function(x) Func.trans(x,method = "FBOD"))



We get the outlier scores for 30 days on a range of k = [2:10] as follows:

Before explaining results further, I present the distance matrix as below, where each entry shows the distance between days X and Y. Here, X represents row entry and Y represents column entry.

Let us understand how outlier scores get assigned to day 1 on different k’s in the range of 2:10. The neighbours of point 1 in terms of increasing distance are:

Here the first row represents neighbour and the second row represents the distance between point 1 and the corresponding point. While noticing the outlier values of point 1, we find till k = 8, outlier score of point 1 are very high (near to 1). The reason for this is that the density of  k neighbours of point 1 till k = 8 is high as compared to point 1. This results in higher outlier score to point 1. But, when we set k = 9, outlier score of point 1 drops to 0. Let us dig it deep further. The 8th and 9th neighbours of point 1 are points 18 and 17 respectively. The neighbours of point 18 in increasing distance are:

and the neighbours of point 17 are:

Observe carefully, that 8th neighbour of point 1 is point 18, and the 8th neighbour of point 18 is point 19. While checking the neighbours of point 18 we find that all of its 8 neighbours are nearby (in cluster D). This results in higher density for all k neighbours of point 1 till 8th neighbour as all these points are densest as compared to point 1, and hence point 1 with lesser density gets high anomaly score. On the other hand, 9th neighbour of point 1 is point 17 that has 9th neighbour as point 3. On further checking, we find that for all the points which are in cluster D now find their 9th neighbour  either in cluster A or cluster B. This essentially decreases the density of all the considered neighbours of point 1. As a result, now all the points including point 1 and its 9 neighbours have densities in the similar range and hence point 1 gets low outlier score.

I believe that this small example explains how outlier scores vary with different k’s. Interested readers can use the provided R code to understand this example further.

# Intuitive​ meaning of k range in Local Outlier Factor (LOF)

The Local Outlier Factor (LOF) is a well-known outlier detection algorithm. In the previous post, I noted down the steps of LOF and here I will discuss its k parameter.  The k parameter often lands the users of LOF into difficulty, but while looking at the meaning of k parameter and the respective application domain, I find it is easy to select a k range. The authors of LOF suggest to use a range of k values instead of using a selective value. This is because we cannot generalise a particular value of k over various datasets following diverse underlying data distributions. Now, let us understand how to select lower (lwrval) and upper (uprval) values of the k range.

To explain it further, let us consider a simple scenario shown in below figure

This figure shows the energy consumption of some imaginary home for one month (30 days). Each small circle represents energy consumption of a particular day, where a number above the circle shows the corresponding day of the month.  Nearby circles marked within red clusters  (A, B, C, D, E) represent days that follow a similar pattern in energy consumption as compared to remaining days.

To use LOF on such a dataset, we need to set the range of k values instead of a single k value. Note that lwrval and uprval are domain dependent. According to LOF paper, lwrval and uprval are defined as:

• lwrval: This refers to the minimal cluster size which consists of similar behaving points, and we believe this similarity is not due to some random cause. This means that we assume a cluster with a size lower than lwrval represent outliers. For example, if I consider lwrval = 3, then clusters A and B represent outliers because none of the points within these clusters has three more similar points/neighbours. At the same time, points within clusters C, D, and E represent normal points because each of them has three more like neighbours.
• uprval: This suggests to the upper optimal number of points to be similar. In other words, we believe that uprval number of points must be similar in the considered application domain. For example, In the energy domain, I know that at least for 6 days (working days of a week) energy consumption is similar due to the occupancy behaviour. So, I set the uprval = 6. No doubt there can be a cluster with size greater than uprval, but our reasoning on a specific dataset motivates us for some optimal uprval. Consider an another example where we assume that occupants of a home change on a weekly basis – say there were 5,  10, 15, and 20 occupants on the first, second, third and fourth week of a month respectively. Consequently, the energy consumption on four different weeks should be similar intra-week and different inter-week. This example suggests that we should get four clusters corresponding to four weeks and the size of each cluster should be 7 (number of weekdays). So, our uprval is 7 in this example.

I believe now lwrval and uprval limits can be easily interpreted for any application domain. Therefore, according to original LOF paper now we can calculate LOF outlier values on a set of k values defined by lwrval and uprval. In the next post, I will explain the above figure further and show how a particular k value effects outlier score.

# Area Under the Curve (AUC): A performance metric

Consider a scenario where we have observations defined by the associated attributes. Further, assume that we know the actual class label of each observation as shown in below figure. [Note: this data is random and there is no relation between I/O]

Mostly, classifiers predict output in the form of categorical labels, but there are instances where a classifier outputs final result in the form of a score, say a score in the range of [0, 1]. The above snapshot figure shows such a instance, where our classifier predicts output in the form of a score (shown in predicted Label column).

Till this, everything is ok, but the question is how can we compute the performance of such classifiers. I mean, well-known metrics like precision and recall are impossible to compute for such scenarios! Although, humans can attach meaning to theses numbers/score, say we consider a threshold value on predicted label, and the values higher than the threshold are labelled as Z and the values below the threshold are labelled as Y.  On assuming a threshold of 0.8, we get something like this

Now, we have categorical labels both in the prediction column and in actual label column. Is it fair now to compute metrics like precision and recall ? Wait, we might get different results for precision and recall if we consider different threshold. Now, it is really troublesome to compute existing well-known metrics for these type of scenarios.

For such type of scenarios, we use another metric known as Area Under the Curve (AUC). The curve is known by the name of Receiver Operating Characteristics (ROC). The ROC curves corresponding to four (A, B, C, and D) different classifiers/methods are shown in below figure as

The true positive rate  (TPR) give information about correctly identified instances while as false positive rate (FPR) gives information about misclassified instances. The ideal ROC curve have a TPR of 1 and FPR of 0. To extract the meaningful information from these ROC curves, we use AUC value which represents the area under the considered ROC curve. The AUC value ranges between [0, 1]. A value of 1 represents ideal/perfect performance and value of 0.5 represents random (50/50) performance.

The AUC value is computed at various thresholds. So, we can say that the final AUC value is not biased by a single threshold.

# Local Outlier Factor

Local outlier Factor (LoF) is another density based approach to identify outliers in a dataset. The LoF is applicable to identify outliers in a dataset, which has a mixture of data distributions.

The above figure shows two different distributions, a dense cluster of points and a sparse distribution of points.  In such datasets, for each specific distribution within a dataset, we should perform outlier detection locally, i.e., points within one distribution should not affect outlier detection in another cluster. The LoF algorithm follows the same intuition and calculates anomaly score for  each point within a distribution as:

1. For each data point $X$, let $D^k(X)$ represent distance of point $X$ to its $k^{th}$ neighbor, and $L_{k}(X)$ represent set of points within $D^k(X)$
2. Compute reachability distance for each data point, $X$  as                                     $R_{k}(X, Y) = max(dist(X,Y), D^k(Y))$
3. Compute Average reachability distance $AR_{k}(X)$ of data point $X$ as                         $AR_{k}(X) = MEAN_{Y \in L_{k}(X)} R_{k}(X, Y)$
4. In the final step, LOF score for each point, $X$ is calculated as:                                                            $LOF_{k}(X) = MEAN_{Y \in L_{k}(X)} \frac{AR_{k}(X)}{AR_{k}(Y)}$

To find the best value of $k$, it is always good to follow ensemble approach, i.e., use a range of $k$ values to calculate LOF scores and then use a specific method to combine the outlier scores.

References:

1. Book: Outlier Analysis by Charu Aggarwal
2.  Wikipedia

# Local Correlation Integral (LOCI) – Outlier Detection Algorithm

Local Correlation Integral (LOCI) is a density based approach for outlier analysis. It is local in nature, i.e., uses only nearby data points in terms of distance to compute density of a point. In this algorithm we have one tunable parameter – $\delta$. Personally, I believe that we need to tune $k$ also according to data distribution. LOCI works with following steps

1.  Compute density, $M(X,\epsilon)$ of data point $X$ as the number of neighbours within distance $\epsilon$. Here, density is known as counting neighbourhood of data point $X$
$M(X,\epsilon) = COUNT_{(Y:dist(X,Y) \leq \epsilon; Y \in datapoints )} Y$
2.  Compute average density, $AM(X,\epsilon,\delta)$ of data point $X$ as the MEAN(density of neighbours of $X$ within distance, $\delta$). Here, $\delta$ is known as sampling neighbourhood of $X$
$AM(X,\epsilon,\delta) = MEAN_{(Y:dist(X,Y) \leq \delta)} M(Y,\epsilon)$

The value of $\epsilon$ is always set to be half of $\delta$ in order to enable fast approximation. Therefore, we need to tune $\delta$ for accuracy without touching $\epsilon$

3.  Compute Multi-Granularity Deviation Factor (MDEF) at distance, $\delta$ as                                                                                                                                                                    ${MDEF}(X,\epsilon,\delta) = \frac{AM(X,\epsilon,\delta) - M(X,\epsilon)}{AM(X,\epsilon,\delta)}$

This factor shows the deviation of $M$ from $AM$ for $X$. Since this computation only considers local/neighbour points, therefore LOCI is referred as local in nature. The larger the value of MDEF, the greater is the outlier score. We use multiple values of $\delta$ to compute MDEF. Mostly we start with a radius containing 20 points to a maximum of radius spanning most of data.

4. In this step, the deviation of $M$ from $AM$ is converted into binary label, i.e., whether $X$ is outlier or not. For this, we use $\sigma(X,\epsilon,\delta)$ metric as
\begin{aligned}{\sigma}(X,\epsilon,\delta) = \frac{STD_{(Y:dist(X,Y) \leq \delta)}M(Y,\epsilon)}{AM(X,\epsilon,\delta)} \end{aligned}
Here, STD refers to standard deviation.
5.  A data point,$X$ is declared as an outlier if its MDEF value is greater than $k. \sigma(X,\epsilon,\delta)$, where $k$ is chosen to be 3.

Reference:

1.  I have understood this algorithm from book: Outlier Analysis by Charu Aggarwal

# Box Cox Transformation

At most times, while dealing with data, I assume that the underlying distribution is normal. Also, I have found that most common statistical measures assume normal distribution of data. But we know that data distributions are not always normal. In simple words, it means that we need to plot the data always so as to confirm the underlying distribution. With plotting,  sometimes we also find that a small transformation ( like $x^2 , log(x)$) results in normal distribution. This means that data transformations can make our life simple and allow us to use statistical measures intended form normally distributed data.

Box Cox Transformation: George Box and Sir David Cox came out with a transformation formula which uses different values between -5  and 5  of a parameter ($\lambda$) to perform transformation. In other words, this formula finds the best value at which data can be represented normally.

$\displaystyle { x }_{ \lambda }^{ ' } = \frac { { x }^{ \lambda } - 1 }{ \lambda }$

$\lambda=0$ results in log transformation.  It is not guaranteed that data will always get transformed to normal distribution.

Reference:

# ARIMA, Time-Series Forecasting

We use ARIMA (Auto-Regressive Integrated Moving Average) to model time-series data for forecasting. ARIMA uses three basic concepts:

1. Auto-Regressive: The term itself points to regression, i.e., it predicts new value using regression over the previous lagged values of the same series. The lags used define its order
2. Integrated: This concept is used to remove trend (continuously increasing/decreasing time-series) from the time series. This is done by differencing consecutive values of time-series.
3. Moving Average: In this we perform regression by using the error terms at various lags. The lags used define its order

ARIMA works only on stationary data. If the input data is not stationary (detected via automated tests, i.e, different unit tests like famous Dickey-Fuller test), then stationary is achieved via differencing approach. The ARIMA forecasting equation for a stationary time-series is regression type equation in which predictors consist of previous response values at different lags. This also includes forecast errors at different lags.

$Predictor (y)\quad =\quad C\quad +\quad Weighted\quad sum\quad of\quad previous\quad y\quad and\quad previous\quad errors\quad at\quad various\quad lags$

Auto-regressive models and exponential smoothing are all special cases of ARIMA models

References: