Clustering and classification with applications in R

This is a one-day intensive course on statistical learning techniques, structured as a set of lecture sessions and computer practicals. The main focus is to introduce cluster analysis (unsupervised learning) and classification (supervised learning). Using R, participants will analyse real and example data sets and estimate the misclassification rate, the area under the receiver operating characteristic (ROC) curve and other relevant performance measures. The course may benefit a wide range of participants, e.g. statisticians, bioinformaticians, data scientists, engineers and postgraduate students.
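As a rough illustration of the kind of computation involved, the sketch below fits a logistic regression to simulated data and estimates the misclassification rate and the area under the ROC curve on a held-out set. It is not taken from the course materials: the simulated data, the 70/30 split, the 0.5 threshold and the helper name `auc` are all assumptions made for this example, and only base R is used.

```r
# Illustrative sketch only: simulated data, split and threshold are assumptions.
set.seed(42)

# Simulate a binary outcome that depends on two predictors
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x1 - 1 * x2))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Hold out 30% of the observations for testing
train_idx <- sample(n, 140)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Logistic regression fitted on the training set
fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
p_hat <- predict(fit, newdata = test, type = "response")

# Misclassification rate at a 0.5 probability threshold
y_hat <- as.integer(p_hat > 0.5)
misclass_rate <- mean(y_hat != test$y)

# AUC via the rank-sum (Mann-Whitney) identity; `auc` is a hypothetical helper
auc <- function(score, label) {
  r  <- rank(score)               # ties receive average ranks
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
test_auc <- auc(p_hat, test$y)

print(c(misclassification = misclass_rate, AUC = test_auc))
```

A single train/test split gives a noisy estimate; resampling methods such as cross-validation or the bootstrap average over many such splits.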


This course assumes that all participants have a basic knowledge of R, as covered by the 'Introduction to R' course. Familiarity with basic concepts in statistics, in particular correlation and linear regression, is an advantage but not required.


  • Multivariate concepts: Introducing basic statistical notions of multivariate analysis, such as covariance, correlation and Euclidean distance.
  • Similarity measures: Various measures of similarity are discussed and applied in R.
  • k-means clustering: The concepts of k-means clustering and ways of visualising its solutions are covered.
  • Hierarchical clustering: Linkage methods, agglomerative hierarchical clustering and additive trees are presented and applied in R.
  • Logistic regression: Introducing the logistic regression technique for binary classification problems with R.
  • Bayes classifier and LDA: Other classification techniques, including the Bayes classifier and linear discriminant analysis, are introduced as further examples of statistical learning methods.
  • CART & Random forest: Describing tree-based classifiers and introducing the random forest technique.
  • Bootstrap & Cross-validation: Methods for assessing classifiers by estimating misclassification error rates are presented and applied in R.
  • ROC: Estimating the area under the receiver operating characteristic (ROC) curve using R.
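
As a rough illustration of the clustering topics listed above, the following base R sketch computes a correlation matrix and Euclidean distances, then runs k-means and agglomerative hierarchical clustering. It is not taken from the course practicals: the built-in `iris` data, the choice of three clusters and the use of average linkage are assumptions made for this example.

```r
# Illustrative sketch only: data set, k = 3 and average linkage are assumptions.
set.seed(1)

# Standardise the four numeric measurements so each contributes equally
X <- scale(iris[, 1:4])

# Multivariate summaries: correlation matrix and pairwise Euclidean distances
print(round(cor(iris[, 1:4]), 2))
d <- dist(X, method = "euclidean")

# k-means with several random starts to reduce the risk of a poor local optimum
km <- kmeans(X, centers = 3, nstart = 25)
print(table(cluster = km$cluster, species = iris$Species))

# Agglomerative hierarchical clustering with average linkage
hc <- hclust(d, method = "average")
hc_groups <- cutree(hc, k = 3)    # cut the tree into three groups
print(table(cluster = hc_groups, species = iris$Species))

# Dendrogram of the hierarchical solution
plot(hc, labels = FALSE, main = "Average-linkage dendrogram")
```

Cross-tabulating the cluster labels against the known species is only a sanity check here; in a genuinely unsupervised setting no such labels would be available.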

Delivered at:

Feedback from previous participants:

  • Dr. Prashant Joshi: I learnt a lot and enjoyed as well. Terrific clarity and excellent delivery, Superb!
  • A participant: A fantastic course with great teaching.


Dr. Osama Mahmoud, Senior Research Associate in Medical Statistics, School of Social and Community Medicine, University of Bristol, UK.

Prof. Berthold Lausen, Professor of Statistics and Head of Department of Mathematical Sciences, University of Essex, UK.


Each course is associated with an R package that combines the practical sheets with solutions, the course notes and the training data sets. The R package for this course, named 'essexBigdata', can be installed by running the following line in your R session.

install.packages("essexBigdata", type="source")

The package can then be loaded via: