Unload your R packages

Occasionally, we want to unload some of the loaded packages, or all of them. There is a simple way around it, i.e., restarting the R session, but that is painful because every function has to be re-run after the reload. While dealing with this situation, I found a simple three-line solution in a Stack Overflow answer. Here is the code:

pkgs = names(sessionInfo()$otherPkgs) # finds all the loaded non-base pkgs
pkgs_namespace = paste('package:', pkgs, sep = "")
# please see ?detach for the meaning of the options used
lapply(pkgs_namespace, detach, character.only = TRUE, unload = TRUE, force = TRUE)

To remove a specific package, say xyz, we use

detach("package:xyz") # please use ?detach for options
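The three-line snippet above can also be wrapped into a small reusable helper. Here is a minimal sketch (the function name unload_all_packages is my own, not from the Stack Overflow answer):

```r
# Detach (and unload) every non-base package currently attached.
unload_all_packages <- function() {
  pkgs <- names(sessionInfo()$otherPkgs)
  if (is.null(pkgs)) return(invisible(NULL))  # nothing to unload
  lapply(paste0("package:", pkgs), detach,
         character.only = TRUE, unload = TRUE, force = TRUE)
  invisible(NULL)
}
```

Calling unload_all_packages() then empties the non-base part of search().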


Identify Collinearity in Multiple Predictor Variables

Fitting a correct model to data is a laborious task – we need to try various models to find the relation between the predictor variables and the response variable. Before fitting a model, we should ensure that the data is standardized. Standardization ensures that each predictor variable is on the same scale. For example, human heights do not vary much in range, but weight varies a lot among individuals. This means the scale of the height predictor is small compared to that of weight, so we should not use them directly for modelling. Instead, both predictors should be brought onto the same scale, and this is what standardization ensures. Generally, in standardization, we center and scale our predictors, i.e.,

  • Center: In this step, we subtract the mean of a predictor variable from each observation of that variable. This ensures that the mean of the resulting variable is zero, i.e., the predictor gets centered at 0 (zero).
  • Scale: In this step, we divide each observation by the standard deviation of the predictor variable. This ensures that the standard deviation of the resulting variable becomes 1.
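In R, both steps are done in one call with the base function scale(). A quick sketch with made-up height and weight data:

```r
# Two predictors on very different scales: height (metres) and weight (kg)
set.seed(1)
height <- rnorm(100, mean = 1.7, sd = 0.1)  # small spread
weight <- rnorm(100, mean = 70,  sd = 15)   # large spread

X  <- cbind(height, weight)
Xs <- scale(X)  # center: subtract column means; scale: divide by column sds

# After standardization every column has mean ~0 and standard deviation 1
colMeans(Xs)
apply(Xs, 2, sd)
```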

Therefore, with standardization, all of our predictors have mean 0 and standard deviation 1. All right, now that all the predictors are on the same scale, we can apply any of our favourite machine learning algorithms. After standardization, we should check for collinearity – is there any correlation between predictor variables? If there is, both predictors are explaining the same thing. In other words, if two predictors are correlated, then one variable is not explaining anything extra about the response variable over the other.

This raises a simple question: if a predictor does not explain anything extra about the response variable over another predictor, why should we include it? Using correlated predictors unnecessarily makes our model complex and wastes computing cycles. Therefore, it is always encouraged to identify such correlated predictors and remove one predictor from each pair. A simple algorithm for removing highly correlated predictors is given in the book “Applied Predictive Modeling” as:

  1. Calculate the correlation matrix of the predictors.
  2. Determine the two predictors associated with the largest absolute pairwise
    correlation (call them predictors A and B).
  3. Determine the average correlation between A and the other variables. Do the same for predictor B.
  4. If A has a larger average correlation, remove it; otherwise, remove predictor B.
  5. Repeat Steps 2–4 until no absolute correlations are above the threshold.
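The caret package implements this algorithm as findCorrelation(). The base-R sketch below follows the five steps directly (the function and variable names are my own):

```r
# Sketch of the removal algorithm above: repeatedly find the largest
# absolute pairwise correlation and drop the member of the pair with
# the larger average correlation to the remaining variables.
drop_correlated <- function(X, cutoff = 0.75) {
  drop <- character(0)
  repeat {
    keep <- setdiff(colnames(X), drop)
    if (length(keep) < 2) break
    # Step 1: correlation matrix of the remaining predictors
    cm <- abs(cor(X[, keep, drop = FALSE]))
    diag(cm) <- 0
    # Step 5: stop when no absolute correlation is above the threshold
    if (max(cm) <= cutoff) break
    # Step 2: the pair (A, B) with the largest absolute correlation
    ij <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    a <- keep[ij[1]]; b <- keep[ij[2]]
    # Step 3: average correlation of A and of B with the other variables
    mean_a <- mean(cm[a, setdiff(keep, a)])
    mean_b <- mean(cm[b, setdiff(keep, b)])
    # Step 4: remove the predictor with the larger average correlation
    drop <- c(drop, if (mean_a > mean_b) a else b)
  }
  drop
}
```

For example, if x2 is a noisy copy of x1, drop_correlated() flags one of the two while an independent x3 survives.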

Another simple way is to draw scatter plots and see whether you can spot a linear relationship between any two variables. In R, the pairs() command is handy for spotting collinearity.
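For example, on the built-in mtcars data, a correlation matrix plus a scatter-plot matrix quickly shows which predictors move together:

```r
# Numeric check: pairwise correlations of a few mtcars predictors
vars <- c("mpg", "disp", "hp", "wt")
cors <- cor(mtcars[, vars])
round(cors, 2)  # disp, hp and wt are strongly correlated with one another

# Visual check: scatter-plot matrix of the same variables
pairs(mtcars[, vars])
```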

Reference:

  1. Book – Applied Predictive Modeling by Max Kuhn and Kjell Johnson
  2. Book – An Introduction to Statistical Learning by Gareth James et al.

 

What the Best College Students Do

The title of this post is the name of a book written by Ken Bain. I heard about this book while attending the workshop on “Effective Teaching” by Dr. Pankaj Jalote at IIT-Delhi. I read this book cover to cover, and here I note down the important points I found in it. Each bullet point mentions a quality of the best students:

  • Goal oriented: The best students know what they want to do. Do you?
  • Deliberate practice: To achieve mastery in any specialized field, the best students engage in deliberate practice sessions, and they practise with consistency.
  • Interconnections: Whenever you learn or see anything new, try to relate it to the knowledge you already have.
  • Growth mindset: Always believe that intelligence is not fixed but can be improved with effort.
  • Self-conversation: Always talk with yourself. Who am I? What am I doing here? Am I really making a difference, or am I living someone else’s life?
  • Reflecting on experiences: The best learners not only use their experience, they reflect on it, i.e., they ponder while observing the game.
  • Self-compassion: They are open and hence take responsibility for the good and the bad. They face failures, but they know these are part of the journey and keep moving towards their goals.
  • Mindfulness: A state of awareness. You see your problems clearly and accept mental and social phenomena.
  • Encouragement: Self-comfort, self-examination, and dedication.
  • Liberal arts: Always take up works/courses outside your focused field. This helps you create connections and broaden your horizons.
  • Self-control, responsibility, deadlines.

Reference:

  • What the Best College Students Do by Ken Bain

Parallel Programming In R

In R, oftentimes we get stuck by the limited processing power of our machines. This can often be mitigated with parallel processing. There are various R libraries that enable parallel processing, but here I will use only the parallel library.

Example: Here, I will explain a simple usage scenario for the parallel package. Consider a data frame with a thousand rows and two columns. I need to compute the column sums over each block of 100 consecutive rows, i.e., I want to compute the sums over rows c(1:100), c(101:200), …, c(901:1000) in parallel rather than in a serial manner.

 

library(parallel)

# Function run on each worker: computes the column sums of the
# 100 rows starting at index ind (must be defined before parLapply uses it)
sumfunc = function(ind) {
  colSums(df[ind:(ind + 99), ]) # note the parentheses: ind:(ind + 99)
}

# Create a dummy data frame with 1000 rows and two columns
set.seed(100688)
df = data.frame(x = rnorm(1000), y = rnorm(1000))
no_cores = detectCores() - 1 # no. of cores in your system, leaving one free
cl = makeCluster(no_cores)   # make the cluster
# Generate the starting indexes for the 100-row block sums
indexs = seq(from = 1, to = 1000, by = 100)
clusterExport(cl, 'df') # copy df to the worker processes
start_time = Sys.time() # start time of the parallel computation
parallel_result = parLapply(cl, indexs, sumfunc)
total_time = Sys.time() - start_time # total time taken for the computation
cat('Total parallel time taken:', total_time, '\n')
stopCluster(cl)

# More than one object can be exported by passing a vector of names, e.g.
clusterExport(cl, c('xx', 'yy', 'zz')) # objects sent to the workers

Other Related Blogs:

  1. How-to go parallel in R – basics + tips
  2. A brief foray into parallel processing with R
  3. Parallel computing in R on Windows and Linux using doSNOW and foreach