R scripts to handle dirty data

Filling missing values with NA:  I assume that org_xts in the following code represent a given time-series object in which we need to handle missing readings

#org_dfs represents object with missing readings
timerange = seq(start(org_xts),end(org_xts), by = "hour")# assuming original object is hourly sampled
temp = xts(rep(NA,length(timerange)),timerange)
complete_xts = merge(org_xts,temp)[,1]


Removing Duplicate values: Here, we will identify duplicate entries on the basis of duplicate time-stamps. 

# dummy time-series data
timerange = seq(start(org_xts),end(org_xts), by = "hour")# assuming original object is hourly sampled
temp = xts(rep(NA,length(timerange)),timerange)
# identify indexes of duplicate entries
duplicate_enties = which(duplicated(index(temp)))
# data without duplicates
new_temp = temp[-duplicate_entries,]


Resample Higher frequency data  to lower frequency: In this function, we will resample the high-frequency data to lower frequency data. Note  that there are some tweaks done according to timezone, currently set to “Asia/Kolkata”  

resample_data <- function(xts_datap,xminutes) {
#xts_datap: Input timeseries xts data, xminutes: required lower frueqeuncy rate
ds_data = period.apply(xts_datap,INDEX = endpoints(index(xts_datap)-3600*0.5, on = "minutes", k = xminutes ), FUN= mean) # subtracting half hour to align hours
# align data to nearest time boundary
align_data = align.time(ds_data,xminutes*60-3600*0.5) # aligning to x minutes


Illustration of k value effect on outlier score

Continuing with the previous post, here, I will illustrate how outlier scores vary while considering different k values. The context of below figure is already explained in my previous post.

Screen Shot 2016-08-22 at 11.02.08

After running the LOF algorithm with following R code lines

library(Rlof) # for applying local outlier factor
library(HighDimOut) # for normalization of lof scores
df <- data.frame(x = c( 5, rnorm(2,20,1), rnorm(3,30,1), rnorm(5,40,1), rnorm(9,10,1), rnorm(10,37,1)))
df$y <- c(38, rnorm(2,30,1), rnorm(3,10,1), rnorm(5,40,1), rnorm(9,20,1), rnorm(10,25,1))
#pdf("understandK.pdf", width = 6, height = 6)
plot(df$x, df$y, type = "p",  ylim = c(min(df$y), max(df$y) + 5), xlab = "x", ylab = "y")
text(df$x, df$y, pos = 3, labels = 1:nrow(df), cex = 0.7)
lofResults <- lof(df, c(2:10), cores = 2)
apply(lofResults, 2, function(x) Func.trans(x,method = "FBOD"))

We get the outlier scores for 30 days on a range of k = [2:10] as follows:

Screen Shot 2016-08-22 at 11.11.00

Before explaining results further, I present the distance matrix as below, where each entry shows the distance between days X and Y. Here, X represents row entry and Y represents column entry.

Screen Shot 2016-08-22 at 11.22.44

Let us understand how outlier scores get assigned to day 1 on different k’s in the range of 2:10. The neighbours of point 1 in terms of increasing distance are:

Screen Shot 2016-08-22 at 16.50.30

Here the first row represents neighbour and the second row represents the distance between point 1 and the corresponding point. While noticing the outlier values of point 1, we find till k = 8, outlier score of point 1 are very high (near to 1). The reason for this is that the density of  k neighbours of point 1 till k = 8 is high as compared to point 1. This results in higher outlier score to point 1. But, when we set k = 9, outlier score of point 1 drops to 0. Let us dig it deep further. The 8th and 9th neighbours of point 1 are points 18 and 17 respectively. The neighbours of point 18 in increasing distance are:

Screen Shot 2016-08-22 at 17.02.37

and the neighbours of point 17 are:

Screen Shot 2016-08-22 at 17.03.13

Observe carefully, that 8th neighbour of point 1 is point 18, and the 8th neighbour of point 18 is point 19. While checking the neighbours of point 18 we find that all of its 8 neighbours are nearby (in cluster D). This results in higher density for all k neighbours of point 1 till 8th neighbour as all these points are densest as compared to point 1, and hence point 1 with lesser density gets high anomaly score. On the other hand, 9th neighbour of point 1 is point 17 that has 9th neighbour as point 3. On further checking, we find that for all the points which are in cluster D now find their 9th neighbour  either in cluster A or cluster B. This essentially decreases the density of all the considered neighbours of point 1. As a result, now all the points including point 1 and its 9 neighbours have densities in the similar range and hence point 1 gets low outlier score.

I believe that this small example explains how outlier scores vary with different k’s. Interested readers can use the provided R code to understand this example further.

Unload your R packages

Very few times, we want to unload either some of the selected packages or all of the packages. Although there is a simple way around it,  i.e., to restart the R session, but this gives a lot of pain by re-running each functions on reload. While dealing with the same situation, I found a simple three line code  from Stack Overflow link. Here is the code

pkgs = names(sessionInfo()$otherPkgs) # finds all the loaded pkgs
pkgs_namspace = paste('package:',pkgs,sep = &quot;&quot;)
# please find ?detach for meanings of option used
lapply(pkgs_namespace,detach, character.only=TRUE,unload=TRUE,force=TRUE)

To remove a specific package say xyz, we use

detach("package:xyz") # please use ?detach for options



Please excuse, if any typos found


Parallel Programming In R

In R, often times we get stuck by the limited processing power of our machines.  This can be easily solved by using parallel processing. In R, there are various libraries which enable parallel processing, but here I will use only parallel library.

Example: Here, I will explain a simple scenario of parallel package usage. Consider I have a data frame with thousand rows and two columns. Now, I need to compute the sum of each of 100 subsequent rows, i.e, I want to compute sums of rows c(1:100), c(101:200) ,…, c(901:1000) in parallel. This means I won’t compute sums in serial manner.


# Create a dummy data frame with 1000 rows and two columns
df = data.frame(x=rnorm(1000),y=rnorm(1000))
no_cores = detectCores()-1# No. of cores in your system
cl = makeCluster(no_cores) # Make cluster
# Generate list of indexes for row summation of data frame
indexs = seq(from=1,to=1000, by =100)
clusterExport(cl,'df') # pass parameters on-fly to the threads
start_time = Sys.time() # start time of parallel computation
parallel_result = parLapply(cl,indexs,sumfunc)
total_time = Sys.time()-start_time # total time taken for computation
cat ('Total_parallel_time_taken',total_time)

sumfunc = function(ind) {
# Computs row sum of 100 rows starting with the index, ind
rowsum = colSums(df[ind:ind+99,])
return (rowsum)

# More than one parameter can be sent in the form of a list as
clusterExport(cl,list('xx','yy','zz') # parameters sent on-fly

Other Related Blogs:

  1. How-to go parallel in R – basics + tips
  2. A brief foray into parallel processing with R
  3. Parallel computing in R on Windows and Linux using doSNOW and foreach

RStudio Console Output Redirection

In RStudio Console, most times we need to execute some commands a number of times for testing purposes. All though it serves the purpose of tracing a bug or refining the output, but still printing the output on the same console window makes it congested. To keep the console window clean, a R package “rite” creates a separate window for output. Moreover, you can move this window to another monitor as well. Although it provides various options, but for a minimal purpose you can start with this as

  1. Install package –   install.packages(“rite”)
  2. Load the package
  3. Start a separate window as – sinkstart ()
  4. Stop  this window as – sinkstop()