Local Correlation Integral (LOCI) – Outlier Detection Algorithm

Local Correlation Integral (LOCI) is a density-based approach to outlier analysis. It is local in nature, i.e., it uses only nearby data points, in terms of distance, to compute the density of a point. The algorithm has one tunable parameter: \delta. Personally, I believe that k (the threshold multiplier in step 5) also needs tuning according to the data distribution. LOCI works in the following steps:

  1.  Compute the density, M(X,\epsilon), of data point X as the number of neighbours within distance \epsilon. The region within \epsilon of X is known as the counting neighbourhood of X
    M(X,\epsilon) = COUNT_{(Y:dist(X,Y) \leq \epsilon; Y \in datapoints )} Y
  2.  Compute the average density, AM(X,\epsilon,\delta), of data point X as the mean of the densities of all neighbours of X within distance \delta. The region within \delta of X is known as the sampling neighbourhood of X
    AM(X,\epsilon,\delta) = MEAN_{(Y:dist(X,Y) \leq \delta)} M(Y,\epsilon)

    The value of \epsilon is always set to half of \delta to enable fast approximation. Therefore, we only need to tune \delta for accuracy; \epsilon follows from it.

  3.  Compute the Multi-Granularity Deviation Factor (MDEF) at distance \delta as
    {MDEF}(X,\epsilon,\delta) = \frac{AM(X,\epsilon,\delta) - M(X,\epsilon)}{AM(X,\epsilon,\delta)}

    This factor shows the deviation of M from AM for X. Since this computation considers only local/neighbouring points, LOCI is referred to as local in nature. The larger the value of MDEF, the greater the outlier score. We compute MDEF for multiple values of \delta, typically starting from a radius that contains 20 points and going up to a radius that spans most of the data.

  4. In this step, the deviation of M from AM is converted into a binary label, i.e., whether X is an outlier or not. For this, we use the metric \sigma(X,\epsilon,\delta), defined as
    \begin{aligned}{\sigma}(X,\epsilon,\delta) = \frac{STD_{(Y:dist(X,Y) \leq \delta)}M(Y,\epsilon)}{AM(X,\epsilon,\delta)} \end{aligned}
    Here, STD refers to standard deviation.
  5.  A data point X is declared an outlier if its MDEF value is greater than k \cdot \sigma(X,\epsilon,\delta), where k is chosen to be 3.
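
The five steps can be sketched in base R for a single value of \delta. This is only a minimal illustration, not the reference implementation: the function name loci_single_delta, the toy data, the use of Euclidean distance, and the population (divide-by-n) standard deviation are all my assumptions. A fuller version would repeat the computation over several values of \delta, as described in step 3.

```r
# Sketch of LOCI for one sampling radius delta (hypothetical helper, not from the book).
loci_single_delta <- function(points, delta, k = 3) {
  d <- as.matrix(dist(as.matrix(points)))    # pairwise Euclidean distances (assumption)
  epsilon <- delta / 2                       # epsilon is fixed to half of delta
  m <- rowSums(d <= epsilon)                 # M(X, epsilon); each point counts itself
  n <- nrow(d)
  am <- numeric(n)
  std <- numeric(n)
  for (i in seq_len(n)) {
    nb <- which(d[i, ] <= delta)             # sampling neighbourhood of point i
    am[i] <- mean(m[nb])                     # AM(X, epsilon, delta)
    std[i] <- sqrt(mean((m[nb] - am[i])^2))  # population standard deviation (assumption)
  }
  mdef <- (am - m) / am                      # MDEF(X, epsilon, delta)
  sigma <- std / am                          # sigma(X, epsilon, delta)
  data.frame(M = m, AM = am, MDEF = mdef, sigma = sigma,
             outlier = mdef > k * sigma)
}

# Toy data: a tight 5 x 5 grid of points plus one isolated point (row 26).
grid <- expand.grid(x = seq(0, 0.4, by = 0.1), y = seq(0, 0.4, by = 0.1))
pts <- rbind(grid, data.frame(x = 2, y = 0.2))
res <- loci_single_delta(pts, delta = 3)
print(res[26, ])
```

In this toy example \delta is large enough that the isolated point's sampling neighbourhood reaches the grid. With a very small \delta the isolated point would see only itself and its MDEF would be 0, which is one reason to evaluate several radii.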

Reference:

  1.  I learned this algorithm from the book Outlier Analysis by Charu Aggarwal

The Magic of Thinking Big

I got a chance to read the book “The Magic of Thinking Big” by David J. Schwartz. I started reading it on 21 April 2016 and finished on 20 May 2016. I find this book helpful because it shows how we can develop our innate abilities to the fullest. The book supports each claim with real-world examples, and I believe it is impossible to contradict the author. While reading, I noted down some anecdotes, and I pen them down here for my own revision. Each bullet heading refers to a chapter title, and the sub-bullets are the anecdotes found in that chapter.

  • Belief
    • Believe in yourself that you can grow/improve with time if you put an effort
    • Believe you are going to make a difference in this world if you would like
  • Cure yourself of Excusitis
    • Thinking guides intelligence: think always. Take some time out of your busy schedule and think about what you are doing, how you can improve things, and whether you are doing something meaningful or wasting time
    • Avoid excusitis. Types: health, intelligence, age, luck. Don’t make excuses of any of these types. Each such excuse is like a disease that starts from a small issue but spreads over your entire body/life till death. Better to avoid making any excuse. Accept your fault at the outset.
  • Build Confidence and Destroy Fear
    • Action cures fear. Indecision and postponement fertilise fear
    • Hope is a start. It needs action to win victories
    • Destroy negative thoughts before they become mental monsters
    • Recall only best moments of your life
    • Don’t do things that result in guilt. Guilt has serious repercussions
    • Motions/actions are the precursors of emotions
  • How to think Big  
    • Know thyself/yourself
    • Never sell yourself short
    • Conquer the crime of self-deprecation. Concentrate on your assets
    • To think big, use words that produce big positive mental images
    • See images of brightness, hope, success, fun, and victory
    • Use big, positive, cheerful words and phrases to describe how you feel
    • Use bright, cheerful, favourable words and phrases to describe other people
    • Use positive language to encourage others
    • Use positive language to outline plans to others
    • Visualisation adds value to everything. A big thinker always visualises what can be done in the future. He isn’t stuck in the present
    • Practice adding value to things, to people, and to yourself
    • Keep your eyes focused on the big objective
    • Ask, “Is it really important?”
    • Don’t fall into the triviality trap
  • How to think and dream creatively
    • When you believe in something, your mind automatically finds ways to get it 
    • Where there is a will, there is a way
    • Develop a weekly improvement programme
    • Devote 10 minutes every day before work: What can I do better today? What is the best I can achieve today?
    • Capacity is a state of mind! How much we can do depends on how much we think we can do
    • Don’t let traditions paralyze your mind. Be experimental!
    • Ask yourself, “How can I do more?”
    • Practice asking and listening. 
    • Stretch your mind. Associate with people who can help you think of new ideas
  • You are what you think you are 
    • Dress properly. It defines you to others and, most importantly, to your inner self. It gives you respect and defines your identity. Your appearance talks to others
    • Look important and think your work is important
    • The way we think about our jobs determines how our subordinates think toward their jobs
    • Always give a pep-talk [confidence building]
    • Practice uplifting self-praise. Don’t practise belittling self-punishment
    • Sell yourself on yourself. Create a commercial of yourself and repeat it several times every day
    • Always ask yourself, “Is this the way an important person thinks?”
  • Manage your environment 
    • You are a product of your environment
    • Always associate with positive people, and mix with diverse groups. This gives first-hand experience and broadens your horizons
    • Never get involved in gossip. It is thought poison
  • Make your attitudes your allies 
    • Build enthusiasm. Possible means: (i) Dig Deeper, (ii) Broadcast good news
    • Practice appreciation: no matter what or to whom, always appreciate
    • Practice calling people by their names
    • Ask yourself every day, “What can I do today to make the day better and add to my scientific career?”
  • Get the Action Habit 
    • Don’t wait for perfect conditions. If you have an idea, work on it as soon as possible
    • To fight fear, act. To increase fear: wait, postpone, put off
    • Action cures fear
    • Take pencil and paper; and start writing, figuring your ideas on paper
    • Benjamin Franklin: Don’t put off until tomorrow what you can do today
  • How to turn defeat into action 
    • Persistence does not guarantee victory, but adding experimentation makes things happen. Possible approach: apply, fail, and re-learn
    • Study your setbacks to make your future bright
    • Stop blaming luck. Blaming luck will never help you to reach your full potential
    • Find the good side of each situation
  • Use goals to Help you Grow 
    • A goal should be clear and crisp about what you are going to do
    • Without setting goals, people work on hazy things without knowing where they spend time
    • Goals are as essential to success as air is to life
    • Success requires heart and soul effort, and you can put both of them only into what you really desire. So, always work on the things you like, no matter what other people think
    • Energy increases and even multiplies when you set a desired goal and resolve to work towards that goal.
    • Use goals to live longer
    • Achieve your goals one at a time. Build 30-day goals
    • Invest in yourself. Purchase things that build your mental power and efficiency.
  • How to think like a leader 
    • Think progress, believe in progress and push for progress
    • Over time, we learn different ways to do things from experiences and from colleagues. Copy only high standards in your daily routine. Be sure that the master carbon copy is worth duplicating
    • Take time from every day/every weekend to confer with yourself, to introspect and to tap your supreme thinking. Spend some time alone every day just for thinking.

The book concludes with the following lines: A wise man will be a master of his mind. A fool will be its slave.

Box Cox Transformation

Most of the time, while dealing with data, I assume that the underlying distribution is normal. I have also found that most common statistical measures assume normally distributed data. But we know that data distributions are not always normal. In simple words, we should always plot the data to confirm the underlying distribution. When plotting, we sometimes also find that a small transformation (like x^2 or log(x)) results in a normal distribution. This means that data transformations can make our life simpler and allow us to use statistical measures intended for normally distributed data.

Box Cox Transformation: George Box and Sir David Cox came up with a transformation formula that tries different values of a parameter (\lambda), between -5 and 5, to perform the transformation. In other words, this procedure finds the value of \lambda at which the transformed data is closest to normal.

\displaystyle x'_{\lambda} = \frac{x^{\lambda} - 1}{\lambda}

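\lambda=0 results in the log transformation, log(x), which is the limit of the formula as \lambda approaches 0. The transformation requires positive data, and it is not guaranteed that the data will always be transformed to a normal distribution.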
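
A small base R sketch of the search over \lambda follows. The function names (box_cox, box_cox_loglik, best_lambda) are made up for illustration, and scoring each \lambda by the profile log-likelihood under a normal model is my assumption about how the "best" value is chosen; the references above may use a different criterion.

```r
# Box-Cox transform for positive data; lambda = 0 is handled as log(x).
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda under a normal model (up to a constant).
box_cox_loglik <- function(x, lambda) {
  y <- box_cox(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum(log(x))
}

# Grid search over lambda in [-5, 5], as described in the text.
best_lambda <- function(x, grid = seq(-5, 5, by = 0.1)) {
  ll <- sapply(grid, function(l) box_cox_loglik(x, l))
  grid[which.max(ll)]
}

set.seed(1988)
x <- rlnorm(500)             # log-normal sample, so log(x) is exactly normal
lambda_hat <- best_lambda(x)
print(lambda_hat)            # expected to land close to 0 for this sample
hist(box_cox(x, lambda_hat), main = "Transformed data")
```

Plotting the transformed data, as in the last line, is still worthwhile because the best \lambda on the grid may not produce a convincingly normal shape.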

 

Reference:

  1. https://www.youtube.com/watch?v=EJ6EhfenqNs
  2. https://www.isixsigma.com/tools-templates/normality/making-data-normal-using-box-cox-power-transformation/

ARIMA, Time-Series Forecasting

We use ARIMA (Auto-Regressive Integrated Moving Average) to model time-series data for forecasting. ARIMA uses three basic concepts:

  1. Auto-Regressive: The term itself points to regression, i.e., it predicts a new value using regression over previous lagged values of the same series. The number of lags used defines its order
  2. Integrated: This concept is used to remove trend (a continuously increasing/decreasing pattern) from the time series. This is done by differencing consecutive values of the time series.
  3. Moving Average: Here we perform regression using the error terms at various lags. The number of lags used defines its order

ARIMA works only on stationary data. If the input data is not stationary (detected via automated unit root tests such as the well-known Dickey-Fuller test), then stationarity is achieved via differencing. The ARIMA forecasting equation for a stationary time series is a regression-type equation in which the predictors are previous response values at different lags, together with forecast errors at different lags.

\text{Predicted } y = C + \text{weighted sum of previous } y \text{ and previous errors at various lags}
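
Written out in symbols, one common form of this equation for a series that is already stationary looks as follows. The symbols p, q, \phi_i, \theta_j and e_t are notation I am introducing here for illustration: p and q are the auto-regressive and moving-average orders, and e_{t-j} = y_{t-j} - \hat{y}_{t-j} is the forecast error at lag j. Some references write the moving-average terms with the opposite sign.

```latex
\hat{y}_t = C + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \sum_{j=1}^{q} \theta_j \, e_{t-j}
```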

Auto-regressive models and simple exponential smoothing are special cases of ARIMA models.
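
The three concepts map onto the order = c(p, d, q) argument of arima() in base R's stats package: p auto-regressive lags, d rounds of differencing, and q moving-average lags. The sketch below fits a model to simulated data. The simulated series, the coefficient 0.6, and the choice of order are illustrative assumptions; on real data the order has to be chosen from the data.

```r
set.seed(1988)

# Simulate an ARIMA(1,1,0) series: AR coefficient 0.6 on the first differences.
y <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.6), n = 300)

# Fit the same structure: 1 AR lag, 1 round of differencing, 0 MA lags.
fit <- arima(y, order = c(1, 1, 0))
ar_hat <- unname(coef(fit)["ar1"])   # estimated AR coefficient

# Forecast the next five values together with their standard errors.
fc <- predict(fit, n.ahead = 5)
print(ar_hat)
print(fc$pred)
print(fc$se)
```

The forecast standard errors grow with the horizon, which reflects the accumulated uncertainty of a differenced series.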

 

References:

  1. http://ucanalytics.com/blogs/arima-models-manufacturing-case-study-example-part-3/
  2. http://people.duke.edu/~rnau/411arim.htm

Stationary Time-series Data

Before fitting any existing model to time-series data, we check the following:

  • Stationarity of the time series: Stationarity is ensured by three properties [1]. A pictorial depiction of these properties is shown at link
    • The mean of the time series x_t is the same for all t.
    • The variance of x_t is the same for all t.
    • The covariance (and also correlation) between x_t and x_{t-h} is the same for all t.

We care about stationarity because most time-series models assume it. A non-stationary time series can often be made stationary by detrending and differencing.
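
A rough way to see the first property, and the effect of detrending and differencing, is to compare the mean over the two halves of a series. The sketch below uses simulated data. The helper half_means, the slope 0.5, and detrending by a straight-line lm() fit are illustrative assumptions, and this is an informal check rather than a formal stationarity test.

```r
set.seed(1988)
n <- 500
t_idx <- seq_len(n)
noise <- rnorm(n)

# A series with a linear trend: its mean changes with t, so it is not stationary.
trended <- 0.5 * t_idx + noise

# Two ways to remove the trend.
detrended <- residuals(lm(trended ~ t_idx))   # subtract a fitted straight line
differenced <- diff(trended)                  # x_t - x_{t-1}

# Mean of the first half and of the second half of a series.
half_means <- function(x) {
  h <- length(x) %/% 2
  c(first = mean(x[1:h]), second = mean(x[(h + 1):length(x)]))
}

print(half_means(trended))       # very different halves
print(half_means(detrended))     # both close to 0
print(half_means(differenced))   # both close to the slope, 0.5
```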

 

 

References :

  1. https://onlinecourses.science.psu.edu/stat510/node/60
  2. http://www.analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling/

 

Non-parametric Regression/Modelling

In parametric regression/modelling we set different parameters to model our data. Say we know the underlying data distribution is normal; then we can try different combinations of mean and standard deviation until the model reflects the underlying data well. All right! But how will you model the data if you don’t know the underlying distribution? For example, in the figure below, the underlying data distribution is uniform, but on assuming that the data is normal we got the estimate shown, which is normal. This shows that we cannot rely on such parametric models if we are not sure of the underlying data distribution.

[Figure: normal estimate (dashed) versus the actual uniform density (solid)]

Non-Parametric Regression/Modelling or Kernel Density Estimation: Here we use the underlying data and a kernel function to model the data. The main idea is that, at the point of interest, we observe the neighboring points and estimate the value as a function of their positions/values. The neighboring points get weights according to the kernel function used. The most common kernel function is the Gaussian kernel; other kernels are listed on the wiki page in [2]. An important tuning parameter in this technique is the width of the neighboring region considered for estimation, commonly known as the bandwidth. This parameter controls the smoothness of the estimate. Lower values give a tighter fit but also pick up some noise, whereas higher values give a smoother estimate. Computing the optimal bandwidth is not our concern, as a number of approaches exist to calculate it. The effect of bandwidth with a Gaussian kernel is shown in the figure below.

[Figure: Gaussian kernel density estimates with bw=0.1, bw=0.3, bw=0.8 and bw=automatic, each plotted against the actual density]

This shows that automatic selection of the bandwidth parameter is good enough to represent the underlying data distribution.
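
To make the weighting idea concrete, here is a Gaussian kernel density estimate written out by hand in base R. The helper name kde_gaussian and the evaluation grid are my assumptions; in practice density() does the same job faster. The sketch checks two things described above: the estimate integrates to about 1, and a smaller bandwidth gives a more wiggly curve.

```r
# Estimate at each point a: average of Gaussian weights of all data points,
# scaled by the bandwidth bw (hypothetical helper, for illustration only).
kde_gaussian <- function(x, at, bw) {
  sapply(at, function(a) mean(dnorm((a - x) / bw)) / bw)
}

set.seed(1988)
x <- c(rnorm(500, 0, 1), rnorm(500, 3, 1))   # two-component normal mixture
step <- 0.01
at <- seq(-10, 13, by = step)                # grid wide enough to cover all the mass

f_narrow <- kde_gaussian(x, at, bw = 0.1)
f_wide <- kde_gaussian(x, at, bw = 0.8)

# Area under each estimate (Riemann sum); both should be close to 1.
area_narrow <- sum(f_narrow) * step
area_wide <- sum(f_wide) * step

# Total up-and-down movement of each curve; larger means more wiggly.
wiggle_narrow <- sum(abs(diff(f_narrow)))
wiggle_wide <- sum(abs(diff(f_wide)))
print(c(area_narrow = area_narrow, area_wide = area_wide))
print(c(wiggle_narrow = wiggle_narrow, wiggle_wide = wiggle_wide))
```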

R code

set.seed(1988)
# Parametric fit: assume normality for data that are actually uniform
x <- runif(700, min = 0, max = 1)
x.test <- seq(-2, 5, by = 0.01)
x.fit <- dnorm(x.test, mean(x), sd(x))       # normal density fitted to the sample
x.true <- dunif(x.test, min = 0, max = 1)    # actual uniform density
bias <- x.fit - x.true
# plot(bias ~ x.test)
plot(x.fit ~ x.test, type = "l", ylim = c(0, 1.4), lty = 2)
lines(x.true ~ x.test, lty = 1)
legend("topleft", legend = c("Estimate", "Actual"), lty = c(2, 1))

########### kernel density plots ######
set.seed(1988)
ser <- seq(-3, 5, by = 0.01)
x <- c(rnorm(500, 0, 1), rnorm(500, 3, 1))   # mixture distribution
x.true <- 0.5 * dnorm(ser, 0, 1) + 0.5 * dnorm(ser, 3, 1)
plot(x.true ~ ser, type = "l")
par(mfrow = c(1, 4))                         # four panels side by side
# Gaussian kernel (the default kernel of density()) at several bandwidths
plot(density(x, bw = 0.1), ylim = c(0, 0.3), lty = 2, main = "Gaussian Kernel, bw=0.1")
lines(x.true ~ ser, lty = 1)
legend("topleft", legend = c("Actual", "Estimated"), lty = c(1, 2))

plot(density(x, bw = 0.3), ylim = c(0, 0.3), lty = 2, main = "Gaussian Kernel, bw=0.3")
lines(x.true ~ ser, lty = 1)
legend("topleft", legend = c("Actual", "Estimated"), lty = c(1, 2))

plot(density(x, bw = 0.8), ylim = c(0, 0.3), lty = 2, main = "Gaussian Kernel, bw=0.8")
lines(x.true ~ ser, lty = 1)
legend("topleft", legend = c("Actual", "Estimated"), lty = c(1, 2))

# bw omitted: density() selects the bandwidth automatically
plot(density(x), ylim = c(0, 0.3), lty = 2, main = "Gaussian Kernel, bw=automatic")
lines(x.true ~ ser, lty = 1)
legend("topleft", legend = c("Actual", "Estimated"), lty = c(1, 2))

References:

  1. https://www.youtube.com/watch?v=QSNN0no4dSI
  2. https://en.wikipedia.org/wiki/Kernel_(statistics)