The first two weeks were terrible for me; I felt like an alien on campus. There were various reasons for this feeling: i) everyone was engrossed in their red velvet cubicles, either tweaking smart meters, playing with sensors, soldering boards, or doing some incomprehensible stuff which only they could decode; ii) administrative issues; iii) and Mumbai residential flat rules. But all thanks to the weekly “group meetings” and “Smart ICT” classes, the ice broke and I felt as if I were at my home institute. The best part of the class was that almost every week an eminent figure would come, deliver his best, and keep us captivated for one and a half hours. I rate this as the best course I have ever attended; the instructor delivered lectures in the form of stories and left us awestruck. The remaining ten weeks were enjoyable, and work went smoothly.
Well, I stayed outside the campus, but I survived the first month on the food of Phoenix (H10), the second month on Woodland (H8), and the last month on the queen of the campus, the enlightened abode (H1).
Some NP questions which I was never able to solve include:
The things which I am going to miss include:
Finally, the three months' stay finished and it was my last working day on campus. On this day, I had a ten-minute wisdom sermon from my guide; then all the lab members assembled and had a homemade cake (thanks for the delicious cake). And to my surprise, I was requested to give a speech covering the above details, which I did. Oh, I forgot to thank the Eduroam facility, without which I might have suffered a lot on campus.
Our IIT Bombay team at one of the lab lunches
Concluding with the statement of my guide, “Once a SEILER, always a SEILER”.
<pre>
# org_xts represents the object with missing readings
timerange = seq(start(org_xts), end(org_xts), by = "hour") # assuming the original object is hourly sampled
temp = xts(rep(NA, length(timerange)), timerange)
# merge with the original series; NA rows mark the missing readings
complete_xts = merge(org_xts, temp)[,1]
</pre>
Removing Duplicate values: Here, we will identify duplicate entries on the basis of duplicate time-stamps.
<pre>
# dummy time-series data
timerange = seq(start(org_xts), end(org_xts), by = "hour") # assuming the original object is hourly sampled
temp = xts(rep(NA, length(timerange)), timerange)
# identify indexes of duplicate entries (duplicate time-stamps)
duplicate_entries = which(duplicated(index(temp)))
# data without duplicates
new_temp = temp[-duplicate_entries, ]
</pre>
Resample higher-frequency data to a lower frequency: In this function, we will resample high-frequency data to a lower frequency. Note that there are some tweaks done according to the timezone, currently set to “Asia/Kolkata”.
<pre>
resample_data <- function(xts_datap, xminutes) {
  # xts_datap: input time-series xts data; xminutes: required lower frequency rate
  library(xts)
  ds_data = period.apply(xts_datap,
                         INDEX = endpoints(index(xts_datap) - 3600 * 0.5, on = "minutes", k = xminutes),
                         FUN = mean) # subtracting half an hour to align hours (IST is UTC+5:30)
  # align data to the nearest time boundary
  align_data = align.time(ds_data, xminutes * 60 - 3600 * 0.5) # aligning to x minutes
  return(align_data)
}
</pre>
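For readers without the xts package installed, the same downsampling idea can be sketched in base R. This is a toy example with made-up hourly values; it buckets epoch seconds into fixed-length windows instead of using period.apply, so take it as an illustration of the idea rather than a drop-in replacement.

```r
# toy hourly series: 6 readings starting at midnight IST (made-up values)
times  <- seq(as.POSIXct("2017-01-01 00:00", tz = "Asia/Kolkata"),
              by = "hour", length.out = 6)
values <- c(10, 20, 30, 40, 50, 60)

# resample to 120-minute means by bucketing timestamps into fixed windows
xminutes <- 120
bucket   <- floor(as.numeric(times) / (xminutes * 60))
resampled <- tapply(values, bucket, mean)
print(resampled)  # three two-hour buckets: 15, 35, 55
```

Each bucket groups all readings falling inside one xminutes-long window, and tapply takes their mean, which is what period.apply with FUN = mean does on the xts side.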
After running the LOF algorithm with the following R code

<pre>
library(Rlof)       # for applying the local outlier factor
library(HighDimOut) # for normalisation of LOF scores
set.seed(200)
df <- data.frame(x = c(5, rnorm(2, 20, 1), rnorm(3, 30, 1), rnorm(5, 40, 1),
                       rnorm(9, 10, 1), rnorm(10, 37, 1)))
df$y <- c(38, rnorm(2, 30, 1), rnorm(3, 10, 1), rnorm(5, 40, 1),
          rnorm(9, 20, 1), rnorm(10, 25, 1))
# pdf("understandK.pdf", width = 6, height = 6)
plot(df$x, df$y, type = "p", ylim = c(min(df$y), max(df$y) + 5), xlab = "x", ylab = "y")
text(df$x, df$y, pos = 3, labels = 1:nrow(df), cex = 0.7)
# dev.off() # uncomment together with the pdf() call above
lofResults <- lof(df, c(2:10), cores = 2)
apply(lofResults, 2, function(x) Func.trans(x, method = "FBOD"))
</pre>
We get the outlier scores for 30 days on a range of k = [2:10] as follows:
Before explaining the results further, I present the distance matrix below, where each entry shows the distance between days X and Y. Here, X represents the row entry and Y represents the column entry.
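The distance matrix itself can be reproduced in base R, assuming the same toy df of x and y columns generated with set.seed(200) as in the Rlof code:

```r
# regenerate the toy data (same seed and order as in the LOF example)
set.seed(200)
df <- data.frame(x = c(5, rnorm(2, 20, 1), rnorm(3, 30, 1), rnorm(5, 40, 1),
                       rnorm(9, 10, 1), rnorm(10, 37, 1)))
df$y <- c(38, rnorm(2, 30, 1), rnorm(3, 10, 1), rnorm(5, 40, 1),
          rnorm(9, 20, 1), rnorm(10, 25, 1))

# pairwise Euclidean distances between the 30 days
dmat <- as.matrix(dist(df))
round(dmat[1:5, 1:5], 1)

# neighbours of day 1 in increasing order of distance
order(dmat[1, ])[-1]   # drop day 1 itself
sort(dmat[1, ])[2:10]  # distances to its nine nearest neighbours
```

The last two lines give exactly the neighbour ordering used in the discussion of point 1 below.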
Let us understand how outlier scores get assigned to day 1 on different k’s in the range of 2:10. The neighbours of point 1 in terms of increasing distance are:
Here the first row represents the neighbour and the second row the distance between point 1 and that neighbour. Looking at the outlier values of point 1, we find that up to k = 8 the outlier score of point 1 is very high (near 1). The reason is that the k neighbours of point 1 up to k = 8 lie in a region of much higher density than point 1 itself, which results in a higher outlier score for point 1. But when we set k = 9, the outlier score of point 1 drops to 0. Let us dig deeper. The 8th and 9th neighbours of point 1 are points 18 and 17 respectively. The neighbours of point 18 in increasing distance are:
and the neighbours of point 17 are:
Observe carefully that the 8th neighbour of point 1 is point 18, and the 8th neighbour of point 18 is point 19. Checking the neighbours of point 18, we find that all eight of them are nearby (in cluster D). Hence, up to the 8th neighbour, every neighbour of point 1 lies in a dense region compared to point 1, and point 1, with its lower local density, gets a high anomaly score. On the other hand, the 9th neighbour of point 1 is point 17, whose 9th neighbour is point 3. On further checking, we find that every point in cluster D now finds its 9th neighbour either in cluster A or cluster B. This essentially decreases the density of all the considered neighbours of point 1. As a result, point 1 and its 9 neighbours now have densities in a similar range, and point 1 gets a low outlier score.
I believe that this small example explains how outlier scores vary with different k’s. Interested readers can use the provided R code to understand this example further.
To explain it further, let us consider a simple scenario shown in the figure below.
This figure shows the energy consumption of an imaginary home for one month (30 days). Each small circle represents the energy consumption of a particular day, where the number above the circle shows the corresponding day of the month. Nearby circles grouped within the red clusters (A, B, C, D, E) represent days that follow a similar energy-consumption pattern compared to the remaining days.
To use LOF on such a dataset, we need to set a range of k values instead of a single k value. Note that lwrval and uprval are domain dependent. According to the LOF paper, lwrval and uprval are defined as:
I believe the lwrval and uprval limits can now be easily interpreted for any application domain. Therefore, following the original LOF paper, we can calculate LOF outlier values on a set of k values defined by lwrval and uprval. In the next post, I will explain the above figure further and show how a particular k value affects the outlier score.
Mostly, classifiers predict output in the form of categorical labels, but there are instances where a classifier outputs its final result as a score, say in the range [0, 1]. The snapshot above shows such an instance, where our classifier predicts output in the form of a score (shown in the Predicted Label column).
Up to this point everything is fine, but the question is: how do we compute the performance of such classifiers? Well-known metrics like precision and recall cannot be computed directly in such scenarios. Humans can, however, attach meaning to these scores: say we consider a threshold on the predicted label, and values higher than the threshold are labelled Z while values below it are labelled Y. Assuming a threshold of 0.8, we get something like this
Now we have categorical labels both in the prediction column and in the actual label column. Is it fair now to compute metrics like precision and recall? Wait, we might get different results for precision and recall if we choose a different threshold. So it is genuinely troublesome to compute the existing well-known metrics in these scenarios.
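The thresholding step and its sensitivity can be sketched in base R. The scores and actual labels below are hypothetical, and Y/Z are the class names used in the text:

```r
# hypothetical predicted scores and actual labels
scores <- c(0.95, 0.40, 0.85, 0.20, 0.75, 0.90)
actual <- c("Z",  "Y",  "Z",  "Y",  "Z",  "Z")

# turn scores into categorical labels at a chosen threshold
threshold <- 0.8
predicted <- ifelse(scores > threshold, "Z", "Y")

# precision and recall for class "Z" at this threshold
tp <- sum(predicted == "Z" & actual == "Z")
fp <- sum(predicted == "Z" & actual == "Y")
fn <- sum(predicted == "Y" & actual == "Z")
precision <- tp / (tp + fp)  # 1.0 here
recall    <- tp / (tp + fn)  # 0.75 here

# a different threshold flags a different set of points,
# so precision and recall change with the threshold choice
predicted2 <- ifelse(scores > 0.5, "Z", "Y")
```

With the 0.8 threshold, day 5's true Z (score 0.75) is missed; lowering the threshold to 0.5 recovers it, which is exactly why no single threshold gives a fair picture.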
For such scenarios, we use another metric known as the Area Under the Curve (AUC). The curve in question is the Receiver Operating Characteristic (ROC). The ROC curves corresponding to four different classifiers/methods (A, B, C, and D) are shown in the figure below.
The true positive rate (TPR) gives information about correctly identified instances, while the false positive rate (FPR) gives information about misclassified instances. The ideal ROC curve has a TPR of 1 and an FPR of 0. To extract a single meaningful number from these ROC curves, we use the AUC value, which represents the area under the considered ROC curve. The AUC value ranges between [0, 1]: a value of 1 represents ideal/perfect performance and a value of 0.5 represents random (50/50) performance.
The AUC value is computed by sweeping over all possible thresholds, so the final AUC value is not biased by any single threshold choice.
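As a sketch, AUC can be computed in base R without any package, using its rank-statistic equivalence: AUC equals the probability that a randomly chosen positive instance scores higher than a randomly chosen negative one. The data below is made up for illustration:

```r
# AUC via the Wilcoxon/Mann-Whitney rank statistic
auc <- function(scores, labels) {
  # labels: TRUE for the positive class
  r     <- rank(scores)                 # ranks handle ties by averaging
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
auc(scores, labels)  # 8/9: most positives outrank most negatives
```

Because the rank statistic considers every positive-negative pair, it implicitly covers every possible threshold, matching the "not biased by a single threshold" property above.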
Customise the width of any cell in a table: First, include the package pbox. Then write the cell contents as \pbox{5cm}{blah blah}. The contents of a cell can even be forced onto the next line by using a double backslash, i.e., \\
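A minimal sketch of this tip (the table content is hypothetical):

```latex
\usepackage{pbox} % in the preamble

\begin{tabular}{|l|l|}
\hline
Metric & Meaning \\ \hline
AUC & \pbox{5cm}{Area under the ROC curve; \\ this second line is forced with a double backslash.} \\ \hline
\end{tabular}
```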
Customise the width of an entire table column: In this case, instead of the l, c, r options of the tabular environment for that column, use p with a width. For example, \begin{tabular}{|r|c|p{4cm}|} [Reference]
Present a table in landscape style [Ref]: Add the necessary package, and the syntax to show a table in landscape mode is as follows:
\usepackage{lscape}
\begin{landscape}
\begin{table}
  ... table stuff ...
\end{table}
\end{landscape}
Place text and Figure side by side:
\usepackage{wrapfig}
% first option is placement (l, r); second is width
\begin{wrapfigure}{r}{4cm}
  \includegraphics[width=4cm]{abc.pdf}
  \caption{}
  \label{}
\end{wrapfigure}
Show complete paper reference (title, author name, etc) without citation:
\usepackage{bibentry} \nobibliography*
Write the above two lines in the document preamble in that order; then, in the main document, write \bibentry{paperkey} wherever the full reference should appear.
Shrink a table if it moves outside the text area:
Use \resizebox as explained in this Stackoverflow answer.
To find the best value of k, it is always good to follow an ensemble approach, i.e., use a range of k values to calculate LOF scores and then use a specific method to combine the resulting outlier scores.
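A minimal sketch of the ensemble step in base R, assuming a matrix of already-normalised scores with one column per k (like the output of the Rlof code earlier; the numbers below are made up):

```r
# toy normalised outlier scores: 5 points, k = 2..4 (made-up numbers)
lofResults <- cbind(k2 = c(0.9, 0.1, 0.2, 0.8, 0.1),
                    k3 = c(0.8, 0.2, 0.1, 0.9, 0.2),
                    k4 = c(0.7, 0.1, 0.1, 0.2, 0.1))

# two common combination rules: mean and max over the k range
score_mean <- rowMeans(lofResults)
score_max  <- apply(lofResults, 1, max)

# points flagged by the max rule at a 0.5 cutoff
which(score_max > 0.5)  # points 1 and 4
```

The max rule flags a point if it looks anomalous at any k (point 4 here), while the mean rule requires it to look anomalous across most of the k range (point 1).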
References:
At times we need to insert mathematical formulae in presentations (PowerPoint or Keynote), and both allow this by default. But for those more comfortable with LaTeX, a simple and time-saving way is to type the LaTeX equation into the LaTeXiT utility. Another advantage is that you can reopen the saved file for editing at any time. The two simple steps required are:
Here is a screenshot of the LaTeXiT utility
Please find the detailed steps at this page of the blog.
The value of the counting radius αr is always set to half of the sampling radius r (i.e., α = 0.5) in order to enable fast approximation. Therefore, we need to tune r for accuracy without touching α.
This factor (MDEF) shows the deviation of n(p, αr), the number of points in the counting neighbourhood of point p, from n̂(p, r, α), the average of that count over all points in the sampling neighbourhood of radius r. Since this computation only considers local/neighbour points, LOCI is referred to as local in nature. The larger the value of MDEF, the greater the outlier score. We use multiple values of r to compute MDEF, mostly starting with a radius containing 20 points and going up to a radius spanning most of the data.
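A toy base-R sketch of the MDEF computation for one point, written by me for illustration (not the optimised aLOCI algorithm from the paper), with alpha fixed at 0.5 as discussed above:

```r
# MDEF(p, r, alpha) = 1 - n(p, alpha*r) / average of n(q, alpha*r)
# over all q in the sampling neighbourhood of p (radius r)
mdef <- function(data, p, r, alpha = 0.5) {
  d <- abs(data - data[p])
  # sampling neighbourhood of p (radius r)
  samp <- which(d <= r)
  # counting-neighbourhood sizes n(q, alpha*r) for every q in it
  n_alpha <- sapply(samp, function(q) sum(abs(data - data[q]) <= alpha * r))
  # deviation of p's own count from the neighbourhood average
  1 - n_alpha[samp == p] / mean(n_alpha)
}

x <- c(1, 1.1, 1.2, 1.3, 10)  # 1-D toy data; point 5 is isolated
mdef(x, p = 5, r = 12)  # large (about 0.71): point 5 looks anomalous
mdef(x, p = 2, r = 12)  # slightly negative: denser than the average
```

A cluster point can get a small negative MDEF because it is denser than the neighbourhood average; only large positive values signal outliers.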
Reference:
The book concludes with the following lines: A wise man will be a master of his mind. A fool will be its slave.