This course is a brief introduction to data science: what to do with large amounts of data, how to model them, and how to extract important information that can later be used for analyses and predictions. The course also teaches us how to use GitHub (a version control system) to share results. The modeling of the data will be performed in R.
The first step is to learn basic GitHub operations such as fork, clone and commit. We also have to take our first steps in R. The goal of creating this document is to learn how to create markdown documents and share them on the web using GitHub.
Here is a link to the GitHub page related to my work on this course:
https://github.com/noykova/IODS-project
Here we use part of the sociological data described here. The data concern the learning process of students. The full data set consists of 183 observations of 60 variables. After wrangling the data into a format that is easy to analyze, we obtain 166 observations of 7 variables.
The wrangling includes the following changes:
The column Attitude in the original data is the sum of 10 questions related to the students' attitude towards statistics, each measured on the Likert scale (1-5). Here we have scaled this combination variable back to the 1-5 scale.
Next, multiple questions are combined into combination variables so that three groups of questions related to deep, surface and strategic learning are formed. The columns deep, surf and stra are created by averaging the corresponding groups of questions.
In the original data the variable ‘points’ denotes the student's exam points in a statistics course exam. If a student did not attend the exam, the value of ‘points’ is zero. These observations are removed from the data.
Thus the data set used in this analysis includes 166 observations of the following variables: gender, age, attitude, deep, stra, surf and points, where gender (F/M) and age are the gender and age of the students, and only observations with points > 0 are kept.
The R code for the wrangling described above is given here.
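Here is also a minimal sketch of those steps, where the raw survey data frame raw, the question-name vectors deep_q, surf_q and stra_q, and the capitalized raw column names Age, Attitude and Points are assumptions used only for illustration:
library(dplyr)
# scale the combination variable Attitude (a sum of 10 Likert items) back to the 1-5 scale
raw$attitude <- raw$Attitude / 10
# average the three groups of questions into combination variables
raw$deep <- rowMeans(raw[, deep_q])
raw$surf <- rowMeans(raw[, surf_q])
raw$stra <- rowMeans(raw[, stra_q])
# keep the analysis variables and only the students who attended the exam
learning2014 <- raw %>%
  select(gender, Age, attitude, deep, stra, surf, Points) %>%
  rename(age = Age, points = Points) %>%
  filter(points > 0)
write.table(learning2014, "learning2014.txt", row.names = FALSE)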
The data are read using the following command:
lrn14 <- read.table("learning2014.txt", header=TRUE)
dim(lrn14)
## [1] 166 7
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
Next we create a more advanced plot matrix with ggpairs(). This function displays the pairwise relationships of the (grouped) variables in a plot matrix. The plot is used to roughly determine whether there are linear correlations between the variables.
p <- ggpairs(lrn14, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
The arguments of this call are:
mapping - an aesthetic mapping created with aes(); it defines how the variables in the data are mapped to visual properties of the plots.
col - should be a categorical variable, for which different colors are used.
alpha - for large datasets with overlapping points the alpha aesthetic makes the points more transparent.
lower (or upper) - lists that may contain the elements continuous, combo, discrete or na. Each element may be a function or a string.
combo - this option is used when either X is continuous and Y categorical, or X is categorical and Y continuous. It can take the values box, box_no_facet, dot, dot_no_facet, facethist, facetdensity or blank.
If lower (or upper or diag) lists are supplied, the combo option should be given as a function(data, mapping). If a specific function fn needs its parameters set, the function and its parameters should be wrapped as wrap(fn, param1 = val1, param2 = val2).
facets - create a subplot for each group and draw the subplots side by side.
facet_wrap - subplots are laid out horizontally and wrap around into several rows.
stat_bin - bins the data into bars (controlled by bins or binwidth); by default the binwidth is 1/30 of the range of the data.
This code produces the following plot:
p
The graphs on the diagonal reflect the marginal distributions of the variables.
The first (top) row shows box plots of each of the other variables grouped by the categorical variable gender. The marginal distribution of gender shows that the group of female students is larger, while male students tend to be a little older, have a higher attitude and get slightly better points.
The histograms in the first column represent the distributions of points, surf, stra, deep, attitude and age for the two values (M/F) of the categorical variable gender separately.
The scatter plots under the diagonal show the dependence of each pair of variables. As the matrix is symmetric, the upper part shows instead the correlation coefficient for every pair, overall and also separated by gender (M/F).
From the diagonal plots we see that the marginal distributions of the variables stra and deep are close to normal.
Thus we can hypothesize that the variable points depends on attitude (Cor = 0.437), stra (Cor = 0.146) and surf (Cor = -0.144), because the absolute values of the correlation coefficients are highest for these variables.
In order to exclude the effect of multicollinearity in the next step (multiple regression), we check the correlations among the pairs of variables that are candidates for explanatory variables. If the explanatory variables are correlated, it becomes very difficult for the model to determine the true effect of each of them on the dependent variable.
From the matrix plot above we observe that Cor(stra, attitude) = 0.0617, Cor(surf, attitude) = -0.176 and Cor(stra, surf) = -0.161. Usually only a correlation above 0.8 (a subjective threshold) is considered high, so at this stage we do not consider removing any variable.
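The pairwise correlations of the candidate explanatory variables and points can also be checked numerically:
# correlation matrix of the candidate explanatory variables and the dependent variable
cor(lrn14[, c("attitude", "stra", "surf", "points")])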
Regression is a parametric technique used to predict a continuous (dependent) variable given a set of independent variables. Mathematically, linear regression approximates the dependent variable by a linear function of the independent variables. When there is one independent variable, we speak of simple regression. In this task we use 3 different independent variables, which makes the problem a multiple regression.
The function lm() in R is used for linear regression. We create a regression model with the multiple explanatory variables attitude, stra and surf. The choice of the independent variables is based on the analysis in section 2. Points is chosen as the dependent variable.
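Written out, the model to be fitted is
$$points_i = \beta_0 + \beta_1 \, attitude_i + \beta_2 \, stra_i + \beta_3 \, surf_i + \varepsilon_i,$$
where the errors $\varepsilon_i$ are assumed to be independent and normally distributed with constant variance.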
my_model <- lm(points ~ attitude + stra + surf, data = lrn14)
Print the summary of the model:
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From this summary we observe that the median of the residuals (0.52) is fairly close to zero. For normally distributed residuals this indicates that the mean difference between the predictions and the actual values is close to 0.
Next we analyse the parameter estimates and their standard errors. The intercept is the beta0 value; it is the prediction made by the model when all independent variables are set to 0. We observe that the standard error of the regression coefficient for surf is bigger than the parameter estimate itself. The standard error of the regression coefficient for stra is also relatively high.
The last column of the coefficient statistics shows the p-value of the corresponding regression coefficient. When the p-value is < 0.05 the coefficient is considered significant. We observe p > 0.05 for the regression coefficients of stra and surf.
From this analysis we conclude that the variables surf and stra could be excluded from the model.
Next we perform multiple regression with two explanatory variables: first (attitude, stra), and then (attitude, surf).
Finally we fit a simple regression excluding both stra and surf.
my_model21 <- lm(points ~ attitude + stra, data = lrn14)
summary(my_model21)
##
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
my_model22 <- lm(points ~ attitude + surf, data = lrn14)
summary(my_model22)
##
## Call:
## lm(formula = points ~ attitude + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.277 -3.236 0.386 3.977 10.642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.11957 3.12714 4.515 1.21e-05 ***
## attitude 0.34264 0.05764 5.944 1.63e-08 ***
## surf -0.77895 0.79556 -0.979 0.329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 163 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1854
## F-statistic: 19.78 on 2 and 163 DF, p-value: 2.041e-08
my_model1 <- lm(points ~ attitude, data = lrn14)
summary(my_model1)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
From the summaries of the last 3 models we conclude that the multiple regression points ~ attitude + stra could be chosen as a reasonably good model. The regression coefficient for stra has a p-value (0.089) only a little above the 0.05 boundary, so we can still include it in the multiple regression, although the standard error of its estimate is quite big. The F-statistic of this model is also larger (with a smaller p-value) than that of points ~ attitude + surf. We also remember that the correlation between attitude and stra in the pair plot matrix is low.
Therefore we decide to continue with both simple (points ~ attitude) and multiple (points ~ attitude + stra) regression models.
Interpretation of model parameters was provided above, in section 3.
R-squared (the coefficient of determination) is a metric used to assess the goodness of fit of the model. It gives the proportion of the variance of the dependent variable explained by the covariates in the model and ranges between 0 and 1. Usually higher values are desirable, but the achievable level depends on the data quality and the domain; for example, if the data are noisy, we may accept a model with a low R-squared.
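Formally, with observed values $y_i$, fitted values $\hat{y}_i$ and mean $\bar{y}$, it is defined as
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}.$$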
R-squared can be illustrated graphically by plotting the observed responses versus the fitted responses.
For the simple regression this plot is:
library(lattice)
xyplot(points ~ attitude, lrn14, grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
type = c("p", "r"), col.line="red")
Next we check residual plots and try to understand the pattern and derive actionable insights.
Set graphic output:
par(mfrow=c(2,2))
Create residual plots for simple and multivariate models
plot(my_model1)
plot(my_model21)
The four residual plots show potential problematic cases with the row numbers of the data in the dataset.
The Residuals versus Fitted values plot should not show any pattern, and this is the case for both investigated models. A few possible outliers can be identified in these plots: observations #35, #56 and #145 are flagged for both regression models. If we see any shape (a curve, a U-shape), it suggests non-linearity in the data. If a funnel-shaped pattern occurs, the data suffer from heteroscedasticity, i.e. the error terms have non-constant variance.
The Normal Q-Q plot is used to check the normality of the errors. It plots the standardized residuals against the theoretical normal quantiles. In the ideal case this plot shows a straight line; otherwise the residuals have a non-normal distribution. In both investigated models the plot is close to a straight line. Again, observations #35, #56 and #145 are flagged as possible outliers.
The Scale-Location plot is also useful for detecting heteroscedasticity. In the ideal case it should not show any pattern, and in both investigated models it shows none. Again, observations #35, #56 and #145 are flagged as possible outliers.
The Residuals versus Leverage plot helps us to find influential cases (i.e. subjects), if any. Not all outliers are influential in linear regression analysis. Unlike in the other plots, patterns are not relevant here; we watch out for outlying values in the upper right or lower right corner, since those are the places where cases can be influential against a regression line. We look for cases outside of the dashed line, i.e. outside Cook's distance. When cases have high Cook's distance scores, they are influential to the regression results: the results would be altered if we excluded those cases.
In both models there are such cases outside the Cook’s distance. For the simple regression model these are observations numbered #35, #56 and #71.
The situation is similar for the multivariate model. Here the observations #35, #141 and #71 are outside the Cook's distance; two of these points, #35 and #71, are the same for both models.
The analysis of the four residual plots thus suggests a few points as possible outliers: #35, #56 and #145 in the first 3 plots, and additionally #71 for the simple model and #141 for the multivariate model in the leverage plots. If there were additional information supporting linear regression for these data, we could simply exclude these observations and continue the model analysis.
Before making a decision about these points we have to check how influential they are by excluding them from the data set and performing the parameter estimation again. Then we compare the summary results and draw conclusions about the impact of these points on the parameter estimates.
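A minimal sketch of such a check, assuming that the flagged row numbers correspond directly to rows of lrn14:
# refit the chosen model without the flagged observations and compare the summaries
flagged <- c(35, 56, 145)
my_model21_clean <- lm(points ~ attitude + stra, data = lrn14[-flagged, ])
summary(my_model21_clean)
# compare with summary(my_model21) above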
As we see from the residual plots, the results are very similar for the simple regression points ~ attitude and the multiple regression points ~ attitude + stra. Since the model with the additional explanatory variable explains slightly more of the variance, the final modeling choice is the multiple regression points ~ attitude + stra.
[Meaning of all coefficients given in the summary for linear regression (lm)](http://blog.yhat.com/posts/r-lm-summary.html)
[Article about ggplot and ggpairs](http://vita.had.co.nz/papers/gpp.pdf)
[Beginners guide to regression](http://blog.hackerearth.com/beginners-guide-regression-analysis-plot-interpretations)
Understanding Diagnostic Plots for Linear Regression Analysis
Here we use modified student alcohol consumption data described here.
The joined data set used in the analysis exercise combines the two student alcohol consumption data sets. The following adjustments have been made:
The variables not used for joining the two data have been combined by averaging (including the grade variables)
‘alc_use’ is the average of ‘Dalc’ and ‘Walc’
‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
The R code for the wrangling described above is given here.
The data are read using the following command:
alc<- read.table("alc.txt", header=TRUE)
dim(alc)
## [1] 382 35
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
head(alc)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason nursery internet guardian traveltime studytime failures
## 1 course yes no mother 2 2 0
## 2 course no yes father 1 2 0
## 3 other yes yes mother 1 2 2
## 4 home yes yes mother 1 3 0
## 5 home yes no father 1 2 0
## 6 reputation yes yes mother 1 2 0
## schoolsup famsup paid activities higher romantic famrel freetime goout
## 1 yes no no no yes no 4 3 4
## 2 no yes no no yes no 5 3 3
## 3 yes no yes no yes no 4 3 2
## 4 no yes yes yes yes yes 3 2 2
## 5 no yes yes no yes no 4 3 2
## 6 no yes yes yes yes no 5 4 2
## Dalc Walc health absences G1 G2 G3 alc_use high_use
## 1 1 1 3 5 2 8 8 1.0 FALSE
## 2 1 1 3 3 7 8 8 1.0 FALSE
## 3 2 3 3 8 10 10 11 2.5 TRUE
## 4 1 1 5 1 14 14 14 1.0 FALSE
## 5 1 2 5 2 8 12 12 1.5 FALSE
## 6 1 2 5 8 14 14 14 1.5 FALSE
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The purpose of this analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data.
From physiology it is known that alcohol has a different influence on males and females. Therefore we take sex as an important explanatory variable to investigate further.
If a student has a problem with alcohol, he or she will likely tend to skip school classes more often, so the second explanatory variable we choose is absences.
Alcohol consumption is also likely to influence the student's grades, which quantify the success of the study process. Thus we choose the grades as the third explanatory variable.
Finally, we find it logical to assume that alcohol consumption is influenced by free time: if students have more free time, they have more opportunities to go to parties or to engage in other activities involving alcohol.
The box plot (box, whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. In the simplest box plot the central rectangle spans the first quartile to the third quartile (the interquartile range or IQR). A segment inside the rectangle shows the median and “whiskers” above and below the box show the locations of the minimum and maximum. Values outside the whiskers can be considered as outliers, unusually distant observations.
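For example, the five numbers summarized by a box plot of the absences can be computed directly:
# five-number summary (minimum, lower hinge, median, upper hinge, maximum) of absences
fivenum(alc$absences)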
Here we use the box plot analyses provided in the DataCamp exercise and add two additional plots aiming to clarify the influence of free time on alcohol consumption, grades and absences.
library(GGally)
library(ggplot2)
g1 <- ggplot(alc, aes(x = high_use, y = G3, col=sex))
g1 + geom_boxplot() + ylab("grade")
We see that the distributions of grades for male and female students with high alcohol consumption have different medians and spreads.
g2 <- ggplot(alc, aes(x = high_use, y = absences, col=sex)) + ggtitle("Student absences by alcohol consumption and sex")
g2 + geom_boxplot() + ylab("absences")
This plot shows that the distributions of absences for male and female students with high alcohol consumption also have different medians and spreads.
g3 <- ggplot(alc, aes(x = high_use, y = G3, col=freetime)) + ggtitle("Student grades by alcohol consumption and free time")
g3 + geom_boxplot() + ylab("grades")
g4 <- ggplot(alc, aes(x = high_use, y = absences, col=freetime)) + ggtitle("Student absences by alcohol consumption and free time")
g4 + geom_boxplot() + ylab("absence")
From the last two plots we can conclude that the grades and absences of students with different alcohol consumption are influenced by the amount of the students' free time.
From these plots it becomes clear that it is logical to treat sex, absences, grades and free time as variables that may have a high impact on alcohol consumption.
In the previous chapter 2 we used a plot matrix to roughly estimate the dependences between the different variables in the model. This was possible because the output variable was continuous.
Here the output variable high_use is binary and therefore we should explore other techniques.
Bar plots are one of the most commonly used kinds of data visualization. They display numeric values (on the y-axis) for different categories (on the x-axis).
Bar plots need not be based on counts or frequencies: we can also create bar plots that represent means, medians or standard deviations. In that case we have to use the aggregate() function and pass the results to the barplot() function.
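A minimal sketch of that approach, here showing the mean number of absences in the two alcohol consumption groups (base R only):
# mean absences by alcohol consumption group: aggregate() computes the means, barplot() draws them
abs_by_use <- aggregate(absences ~ high_use, data = alc, FUN = mean)
barplot(abs_by_use$absences, names.arg = as.character(abs_by_use$high_use), xlab = "high_use", ylab = "mean absences")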
Here, however, we examine counts on the y-axis. The grouped plots below present the counts of the four explanatory variables, with the alcohol consumption given as the fill:
bp <- ggplot(alc, aes(freetime, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Grouped Barplot of free time counts for high and low alcohol usage")
bp
bp2 <- ggplot(alc, aes(sex, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Grouped Barplot of free time counts for high and low alcohol usage")
bp2
bp3 <- ggplot(alc, aes(absences, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Grouped Barplot of absence counts for high and low alcohol usage")
bp3
bp4 <- ggplot(alc, aes(G3, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Grouped Barplot of grades counts for high and low alcohol usage")
bp4
The bar graph of free time counts shows that the use of alcohol is higher for higher values of free time. As expected, for a higher number of absences the use of alcohol is also higher.
The next bar plot shows the dependence of grades on high alcohol use and free time. It is clear that students who have more free time tend to use alcohol and get lower grades.
b2 <- ggplot(alc, aes(x=high_use, y=G3, fill=freetime)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Student grasdes by alcohol consumption and free time")
b2
A similar graph is drawn for the absences. Here it is clearer that the students who use alcohol have both more free time and a larger number of absences.
# dependence of absences on high_use and free time
b3 <- ggplot(alc, aes(x=high_use, y=absences, fill=freetime)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Student absences by alcohol consumption and free time")
# draw the plot
b3
Cross tabulations are tables in a matrix format that display the (multivariate) frequency distribution of the variables. They provide a basic picture of the interrelation between two variables and can help find interactions between them.
In the CrossTable() function we can control whether row percentages (prop.r), column percentages (prop.c) or table percentages (prop.t) show up by setting them to TRUE in the call. The chi-square contributions of the cells are also shown.
We construct cross tables for the chosen explanatory variables.
library(descr)
CrossTable(alc$freetime,alc$high_use, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## =====================================
## alc$high_use
## alc$freetime FALSE TRUE Total
## -------------------------------------
## 1 15 2 17
## 0.792 1.862
## 0.882 0.118 0.045
## 0.056 0.018
## 0.039 0.005
## -------------------------------------
## 2 49 16 65
## 0.253 0.595
## 0.754 0.246 0.170
## 0.183 0.140
## 0.128 0.042
## -------------------------------------
## 3 113 40 153
## 0.298 0.702
## 0.739 0.261 0.401
## 0.422 0.351
## 0.296 0.105
## -------------------------------------
## 4 70 41 111
## 0.796 1.872
## 0.631 0.369 0.291
## 0.261 0.360
## 0.183 0.107
## -------------------------------------
## 5 21 15 36
## 0.717 1.686
## 0.583 0.417 0.094
## 0.078 0.132
## 0.055 0.039
## -------------------------------------
## Total 268 114 382
## 0.702 0.298
## =====================================
We again see the association between a larger amount of free time and high alcohol consumption.
CrossTable(alc$sex,alc$high_use, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ================================
## alc$high_use
## alc$sex FALSE TRUE Total
## --------------------------------
## F 156 42 198
## 2.102 4.942
## 0.788 0.212 0.518
## 0.582 0.368
## 0.408 0.110
## --------------------------------
## M 112 72 184
## 2.262 5.318
## 0.609 0.391 0.482
## 0.418 0.632
## 0.293 0.188
## --------------------------------
## Total 268 114 382
## 0.702 0.298
## ================================
We see that high alcohol consumption is clearly more common among male students (about 39%) than among female students (about 21%).
Next we model the output variable high_use as a function of the chosen explanatory variables using logistic regression.
Logistic regression is a method for fitting a regression curve, y = f(x), when y consists of proportions or probabilities, or binary coded (0,1–failure, success) data. When the response is a binary (dichotomous) variable, and x is numeric, logistic regression fits a logistic curve to the relationship between x and y.
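With predictors $x_1, \dots, x_k$ and $p = P(y = 1 \mid x)$, the model is
$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,$$
so each coefficient $\beta_j$ is the change in the log odds for a one unit increase in $x_j$, and $e^{\beta_j}$ is the corresponding odds ratio.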
We define the students' free time as a factor variable (it takes the values 1-5). The model and the results (summary and coefficients) are obtained using the following functions:
alc$freetime <- factor(alc$freetime)
m <- glm(high_use ~ G3 + absences+sex+freetime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ G3 + absences + sex + freetime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6795 -0.8329 -0.6122 1.0647 2.1349
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.45690 1.00686 -2.440 0.01468 *
## G3 -0.07553 0.03653 -2.068 0.03866 *
## absences 0.09651 0.02276 4.240 2.24e-05 ***
## sexM 0.88658 0.24847 3.568 0.00036 ***
## freetime2 1.30818 0.94479 1.385 0.16616
## freetime3 1.41907 0.91790 1.546 0.12210
## freetime4 1.81731 0.92203 1.971 0.04873 *
## freetime5 1.78978 0.96758 1.850 0.06435 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 418.34 on 374 degrees of freedom
## AIC: 434.34
##
## Number of Fisher Scoring iterations: 5
coef(m)
## (Intercept) G3 absences sexM freetime2 freetime3
## -2.45689613 -0.07553214 0.09650664 0.88657847 1.30818455 1.41906976
## freetime4 freetime5
## 1.81730551 1.78977515
The deviance residuals are a measure of model fit. The explanatory variables G3, absences and sex are statistically significant, since their p-values are < 0.05; G3 is less significant than absences and sex. As free time is a factor variable and shows a significant level for only one of its categories, its overall significance should be investigated further.
The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
For every one unit increase in the grade G3, the log odds of high alcohol consumption (versus low consumption) decrease by 0.076.
For a one unit increase in absences, the log odds of high alcohol consumption increase by 0.097.
The categorical variable sex has a slightly different interpretation: being a male student instead of a female student increases the log odds of high alcohol consumption by 0.89.
The factor variable for free time is interpreted similarly. For example, for students with free time level 2, versus students at the reference level 1, the log odds of high alcohol consumption increase by 1.31.
We can use the confint function to obtain confidence intervals for the coefficient estimates. For logistic models, confidence intervals are based on the profiled log-likelihood function. We can also get confidence intervals based on just the standard errors by using the default method.
confint(m)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -4.78973005 -0.713705961
## G3 -0.14778245 -0.004113323
## absences 0.05366041 0.143382613
## sexM 0.40452432 1.380580560
## freetime2 -0.32049218 3.515322927
## freetime3 -0.13114527 3.597014983
## freetime4 0.25536845 4.000588764
## freetime5 0.10649632 4.032065978
Confidence intervals using standard errors:
confint.default(m)
## 2.5 % 97.5 %
## (Intercept) -4.43031196 -0.483480302
## G3 -0.14712585 -0.003938437
## absences 0.05189363 0.141119657
## sexM 0.39958049 1.373576441
## freetime2 -0.54356698 3.159936066
## freetime3 -0.37997251 3.218112035
## freetime4 0.01016471 3.624446311
## freetime5 -0.10664973 3.686200033
These tables show that the confidence intervals of G3, absences and sexM do not contain zero, while the intervals of the lower free time levels do, which again suggests testing the overall effect of free time.
In this model we use one factor variable (free time), whose coefficients denote the difference of each class from the reference class. To investigate the overall effect of this variable we perform a Wald test (package aod, function wald.test). The goal is to test whether the differences between the coefficient of the reference class and the other classes are jointly different from zero. The order in which the coefficients appear in the coefficient table is the same as the order of the terms in the model. The argument b of wald.test() supplies the coefficients, Sigma supplies the variance-covariance matrix of the estimates, and Terms tells R which terms in the model are to be tested; in this case terms 5 to 8 are the four terms for the levels of the students' free time.
library(aod)
wald.test(b = coef(m), Sigma = vcov(m), Terms = 5:8)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 19.8, df = 4, P(> X2) = 0.00056
The chi-squared test statistic of 19.8 with four degrees of freedom is associated with a p-value of 0.00056 < 0.05, indicating that the overall effect of the students' free time on alcohol consumption is statistically significant.
We can also exponentiate the coefficients and interpret them as odds-ratios. To get the exponentiated coefficients, we tell R that we want to exponentiate (exp), and that the object we want to exponentiate is called coefficients and it is part of model m (coef(m)). We use the same logic to get odds ratios and their confidence intervals, by exponentiating the confidence intervals from before. To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise.
exp(coef(m))
## (Intercept) G3 absences sexM freetime2 freetime3
## 0.08570054 0.92724993 1.10131690 2.42681201 3.69945143 4.13327372
## freetime4 freetime5
## 6.15525084 5.98810589
After adding confidence intervals:
exp(cbind(OR = coef(m), confint(m)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.08570054 0.008314702 0.4898256
## G3 0.92724993 0.862618756 0.9958951
## absences 1.10131690 1.055126232 1.1541713
## sexM 2.42681201 1.498589485 3.9772100
## freetime2 3.69945143 0.725791726 33.6267852
## freetime3 4.13327372 0.877090351 36.4891510
## freetime4 6.15525084 1.290937174 54.6303049
## freetime5 5.98810589 1.112373834 56.3772652
Now we can say that for a one unit increase in the grade G3, the odds of high alcohol consumption (versus low consumption) change by a factor of 0.93, i.e. they decrease by about 7%. It has to be noted that the odds ratio for the intercept is not generally interpreted.
From the analyses provided in this paragraph we conclude that all the involved independent variables (grades, absences, sex and free time of the students) are statistically significant, and therefore no variables should be excluded from the model of alcohol consumption.
Predicted probabilities can be computed for both categorical and continuous predictor variables. The R function predict() can be used to predict the alcohol consumption of the students. The type = “response” option tells R to output probabilities of the form P(Y=1|X), as opposed to other information such as the logit. Here we do not supply a separate test data set, so the probabilities are computed for the same training data that was used to fit the logistic regression model.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
probabilities <- predict(m, type = "response")
Following the DataCamp exercise, we add the predicted probabilities to ‘alc’:
alc <- mutate(alc, probability = probabilities)
Use the probabilities to make a prediction of binary variable high_use
alc <- mutate(alc, prediction = probability>0.5)
Tabulate the target variable versus the predictions. The function table() is used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 249 19
## TRUE 86 28
The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonal elements show the incorrect predictions. Thus our model correctly predicts high alcohol consumption for 28 students and no or low consumption for 249 students, so there are 28 + 249 = 277 correct predictions in total. The mean() function can be used to compute the fraction of students for which the predictions about alcohol consumption are correct: the share of correct predictions is (28 + 249)/382 = 0.72, i.e. the model produces about 72% correct predictions.
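The same proportion can be obtained directly with mean():
# share of observations where the prediction agrees with the true value of high_use
mean(alc$prediction == alc$high_use)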
Next we visualize these results:
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
g + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table()%>%addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65183246 0.04973822 0.70157068
## TRUE 0.22513089 0.07329843 0.29842932
## Sum 0.87696335 0.12303665 1.00000000
From the analyses above we conclude that the model has quite high accuracy (72%), but we should keep in mind that we have not used any additional data.
Next we define a loss function (mean prediction error) as:
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
and call it to compute the proportion of wrong predictions in the (training) data:
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2748691
The prediction error is fairly low, 27%, and thus we can consider this model a good approximation of the data on students' alcohol consumption. However, this is the training error rate, which tends to be too optimistic and to underestimate the test error rate. A better approach would be to use only part of the data for training the model and the other part as a test set for prediction; the error rate obtained in this way is more realistic. Finally, we note that the simple heuristic strategy for choosing the independent variables was successful in this case: the chosen independent variables turned out to be significant.
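A minimal sketch of that train/test approach with the same loss function (the 80/20 split proportion and the seed are my own choices):
# hold out 20% of the observations as a test set and estimate the test error rate
set.seed(123)
n <- nrow(alc)
train_ind <- sample(n, size = round(0.8 * n))
m_train <- glm(high_use ~ G3 + absences + sex + freetime, data = alc[train_ind, ], family = "binomial")
test_prob <- predict(m_train, newdata = alc[-train_ind, ], type = "response")
loss_func(class = alc$high_use[-train_ind], prob = test_prob)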
[Contingency tables (cross-tabulation), used for a preliminary assessment of the dependence of two variables](https://en.wikipedia.org/wiki/Contingency_table), see also [Cross-Tabulation Theory](https://www.qualtrics.com/wp-content/uploads/2013/05/Cross-Tabulation-Theory.pdf) and [Contingency tables with gmodels](https://qualityandinnovation.com/2015/02/08/contingency-tables-with-gmodels/)
[Logit regression](http://www.ats.ucla.edu/stat/r/dae/logit.htm)
[An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer, 2013](http://www-bcf.usc.edu/~gareth/ISL/)
Here we use the Boston data from the MASS package in R. The data are described here. They concern housing values in the suburbs of Boston. There are 506 observations of 14 housing characteristics, without missing data. Almost all variables are numerical; the exceptions are the categorical chas, a Charles River dummy variable with values 1 (if the tract bounds the river) or 0 (otherwise), and rad, an integer-valued index of accessibility to radial highways.
In this chapter we use the data about crime rate in the suburbs of Boston for developing classification and clustering models.
The summary of data and its structure are given as:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyr);
library(dplyr);
library(ggplot2)
library(corrplot)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
The preliminary analysis of the data is based on a plot matrix of the variables:
pairs(Boston)
library(GGally)
ggpairs(Boston, diag=list(continuous="density", discrete="bar"), axisLabels="show")
## Warning in check_and_set_ggpairs_defaults("diag", diag, continuous =
## "densityDiag", : Changing diag$continuous from 'density' to 'densityDiag'
## Warning in check_and_set_ggpairs_defaults("diag", diag, continuous =
## "densityDiag", : Changing diag$discrete from 'bar' to 'barDiag'
cor_matrix<-cor(Boston) %>% round(digits =2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
library(corrplot)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
In this correlogram the visualization method is “circle” and the layout “upper” displays the upper triangle of the correlation matrix.
From these plots and numerical calculations we see that the crime rate is clearly correlated with almost all variables except the dummy variable chas (corr = -0.06). The correlation is quite low (|corr| ≈ 0.2) between the crime rate crim and zn (proportion of residential land zoned) and rm (average number of rooms per dwelling). All other variables have a higher (in absolute value) correlation with the crime rate.
We also observe higher correlations between some of the other variables; for example, nox (nitrogen oxides concentration) is strongly correlated (|corr| > 0.6) with age, dis and rad. We do not describe all highly correlated variables here because we focus on the crime rate.
From the matrix plot we observe that only the variable rm has a distribution close to normal.
The summary of the data shows that the variables have very different ranges, so rescaling them is reasonable before further analysis. Therefore we continue by standardizing the data.
A column x is rescaled by subtracting the column mean mean(x) from the values and dividing the difference by the standard deviation sd(x):
scale(x) = (x - mean(x)) / sd(x)
Since the Boston data contain only numerical values (including the two categorical variables), we can use the function scale() to standardize the whole dataset.
R code for center and standardization of data:
boston_scaled <- scale(Boston, center = TRUE, scale = TRUE)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
We observe that after standardization the mean of every variable is 0 (and the standard deviation is 1), which allows us to compare the influence of the variables directly.
Linear discriminant analysis (LDA) is used here to find a linear combination of features which characterizes and separates two or more classes of objects. The resulting combination is used as a linear classifier.
The goal of the classification problem is to find a good predictor (expressed as a linear combination of the original variables) of the class y of any sample from the same distribution (not necessarily from the training set), given only an observation x. The model produces the best possible separation between the different classes.
As a first step we choose the class y to be the crime rate in the Boston dataset. We use the cut() function to convert this continuous variable into a categorical one. The breaks argument of cut() describes how ranges of numbers are converted to factor levels: we provide a vector whose values are used as the breakpoints. The number of levels of the resulting factor is one less than the number of values in the vector. We define this vector using the quantile() function applied to the scaled continuous variable crim.
boston_scaled <- scale(Boston, center = TRUE, scale = TRUE)
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(scaled_crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
We look at the table of the categorical variable to see how many observations there are in each class:
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
Next we drop the old crime rate variable from the dataset and add the new categorical value to scaled data:
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Since we want to use the model for prediction, we divide the dataset into train and test sets so that 80% of the data belongs to the training set. We randomly choose 80% of the rows for the training set and use the rest of the data for testing.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
We want to separate the Boston data by crime rate, where the number of classes is C = 4 and the number of variables is x = 13. The maximum number of useful discriminant functions that can separate the Boston data by crime rate is the minimum of C - 1 and x, which in our case is min(3, 13) = 3. Thus we can find at most 3 useful discriminant functions to separate the data by crime rate.
The linear discriminant model fitted on the training data produces the following results:
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2450495 0.2599010 0.2376238
##
## Group means:
## zn indus chas nox rm
## low 0.8712229 -0.8900991 -0.08304540 -0.8552027 0.44217953
## med_low -0.1071928 -0.3302405 -0.07348562 -0.5597954 -0.14097084
## med_high -0.3802515 0.1853830 0.25261763 0.4301760 0.08750128
## high -0.4872402 1.0149946 -0.02626030 1.0641856 -0.37022150
## age dis rad tax ptratio
## low -0.8299932 0.8489068 -0.6837115 -0.7564337 -0.36586235
## med_low -0.3548290 0.3026642 -0.5549663 -0.4929813 -0.09283751
## med_high 0.4268245 -0.4131735 -0.4087315 -0.3162735 -0.39253612
## high 0.8004355 -0.8386170 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.38472978 -0.76605797 0.52011370
## med_low 0.32637623 -0.15047880 -0.00521445
## med_high 0.08421343 0.03432644 0.17330024
## high -0.77129054 0.85490898 -0.74326997
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.077469812 0.585651714 -0.89363673
## indus 0.002582126 -0.230178350 0.16767551
## chas -0.079191514 -0.059496409 -0.01980242
## nox 0.249102086 -0.795136265 -1.27759569
## rm -0.159505691 -0.066226223 -0.24187652
## age 0.340712266 -0.242103870 -0.21623983
## dis -0.053705721 -0.127823780 -0.11315702
## rad 3.435356943 0.912992583 -0.26549986
## tax 0.038660187 -0.030529293 0.68313506
## ptratio 0.141982840 0.084729450 -0.28476531
## black -0.171790159 -0.003720521 0.10895939
## lstat 0.175201187 -0.283848117 0.35371745
## medv 0.202383158 -0.439733862 -0.17053610
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9541 0.0351 0.0108
We see the estimated coefficients of all 3 useful discriminant functions LD1, LD2 and LD3. For example, the first discriminant is **LD1 = 0.077*zn + 0.003*indus + … + 0.202*medv**. Some of these coefficients are very small, suggesting that the contribution of the corresponding variables could be negligible.
For convenience, the values of each discriminant function are scaled so that their mean is zero and their variance is one.
The “proportion of trace” is the percentage of separation achieved by each discriminant function. For our training data we get 95.4% for LD1, 3.5% for LD2 and 1.1% for LD3.
Next we visualize the results as a biplot. For this purpose we transform the classes to numeric and define the function lda.arrows for drawing the LDA biplot arrows. The argument dimen of the plot() function determines how many discriminants are used.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The argument myscale in lda.arrows determines the length of the arrows. By default, arrows are drawn for the first and second axes. This LDA biplot shows good separation only for the class marked as “high”; the 3 other classes overlap quite a lot.
Now the task is to classify new, previously unseen data and calculate the posterior probabilities with the LDA model. In R this is done with the predict() method for LDA objects. Here we use the 20% of the data that we separated as test data. In order to analyze the model performance we use cross tabulation and create a table of the correct and predicted classes.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = test$crime, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 17 5 1 0
## med_low 5 19 3 0
## med_high 0 8 12 1
## high 0 0 1 30
After applying the predict() method, we use the table() function to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified. From the table above we get (17 + 19 + 12 + 30)/102 ≈ 76% correct predictions of the crime class of a new observation.
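This accuracy can also be computed directly:
# proportion of test observations classified into the correct crime category
mean(lda.pred$class == test$crime)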
In clustering the number of classes is unknown and the data are grouped based on their similarity. Here we apply k-means, one of the oldest clustering methods. In order to make the variables comparable we use the standardized Boston data. As a measure of how close the observations are to each other we use the most popular choice, the Euclidean distance. The dist() function creates a distance matrix consisting of the pairwise distances of the observations; for large datasets its computation is time consuming and can need a lot of space.
dist_eu <- dist(boston_scaled, method = "euclidean")
## Warning in dist(boston_scaled, method = "euclidean"): NAs introduced by
## coercion
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1394 3.5270 4.9080 4.9390 6.2420 13.0000
K-means (kmeans()) is an unsupervised method that assigns observations to groups or clusters based on the similarity of the objects. It takes as an argument the number of clusters to look for:
km <-kmeans(dist_eu, centers = 4)
One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. A reasonable number of clusters is reached where the WCSS drops sharply and then levels off.
The total within sum of squares twcss is computed below. K-means can produce different results on every run because it randomly assigns the initial cluster centers; to make the results reproducible we fix the random seed with set.seed(). A common choice for the maximum number of clusters to try is 10.
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')
km <-kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)
The results show that the total within sum of squares drops significantly up to k = 6. In the last plot we run k-means with 2 clusters because the result is easier to display. It seems that for most pairs of variables the clusters are not clearly separated; well-separated clusters can be seen, for example, for the pair of variables (lstat, age).
Next we perform LDA after clustering, using 6 clusters as the target classes. The obtained biplot shows that clustering before LDA does not improve the results.
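Since the code for this step is not shown above, here is a minimal sketch of how it can be done; the standardized numeric data are rebuilt locally and the seed is my own choice:
# LDA after k-means clustering: the 6 cluster labels are used as target classes
boston_std <- as.data.frame(scale(Boston))
set.seed(123)
km6 <- kmeans(boston_std, centers = 6)
lda_km <- lda(factor(km6$cluster) ~ ., data = boston_std)
plot(lda_km, dimen = 2, col = km6$cluster, pch = km6$cluster)
lda.arrows(lda_km, myscale = 2)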
In this chapter we use human data from the United Nations Development Programme, originally published here. The data concern the so-called human development index (HDI), a criterion for the development of a country. HDI is a summary measure of three main groups of variables: a long and healthy life, being knowledgeable, and having a decent standard of living. The original data include 195 observations of 19 variables. The meaning of all variables is described here.
The wrangling includes the following changes:
Two new columns are added to the "Gender inequality" data: the ratio of females to males with secondary education in each country (i.e. edu2F / edu2M), and the ratio of female to male labour force participation in each country (i.e. labF / labM).
Next the "Gender inequality" and "Human development" data sets are joined using the variable Country as the identifier.
The variable GNI is transformed to numeric. Only the columns c("Country", "Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F") are selected for further work.
The rows with missing values are removed. The last 10 observations relate to regions instead of countries, so they are also removed.
Finally the country names are added as row names and the corresponding column is removed from the data frame.
The transformed data involve 155 observations of 8 variables; a sketch of these wrangling steps is given below.
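A hedged sketch of the wrangling in R, assuming raw "Human development" and "Gender inequality" files; the file names and the raw column names (Edu2.F, Edu2.M, Labo.F, Labo.M) are assumptions used only for illustration:
library(dplyr)
# read the raw data (assumed file names)
hd <- read.csv("human_development.csv", stringsAsFactors = FALSE)
gii <- read.csv("gender_inequality.csv", stringsAsFactors = FALSE, na.strings = "..")
# two new ratio columns in the gender inequality data
gii <- mutate(gii, Edu2.FM = Edu2.F / Edu2.M, Labo.FM = Labo.F / Labo.M)
# join the two data sets by country
human <- inner_join(hd, gii, by = "Country")
# GNI to numeric and keep only the selected columns
human$GNI <- as.numeric(gsub(",", "", human$GNI))
keep <- c("Country", "Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
human <- dplyr::select(human, one_of(keep))
# remove rows with missing values and the last 10 rows (regions, not countries)
human <- filter(human, complete.cases(human))
human <- human[1:(nrow(human) - 10), ]
# country names as row names, then drop the Country column
rownames(human) <- human$Country
human <- dplyr::select(human, -Country)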
The preprocessed data look like this:
human <- read.csv("human.csv")
str(human)
## 'data.frame': 155 obs. of 9 variables:
## $ X : Factor w/ 155 levels "Afghanistan",..: 105 6 134 41 101 54 67 149 28 102 ...
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
head(human)
## X Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth
## 1 Norway 1.0072389 0.8908297 17.5 81.6 64992 4 7.8
## 2 Australia 0.9968288 0.8189415 20.2 82.4 42261 6 12.1
## 3 Switzerland 0.9834369 0.8251001 15.8 83.0 56431 6 1.9
## 4 Denmark 0.9886128 0.8840361 18.7 80.2 44025 5 5.1
## 5 Netherlands 0.9690608 0.8286119 17.9 81.6 45435 6 6.2
## 6 Germany 0.9927835 0.8072289 16.5 80.9 43919 7 3.8
## Parli.F
## 1 39.6
## 2 30.5
## 3 28.5
## 4 38.0
## 5 36.9
## 6 36.9
Next we use ggpairs() to create an advanced plot matrix.
library(dplyr)
library(corrplot)
library(GGally)
human_ <- dplyr::select(human, -X)
ggpairs(human_, diag=list(continuous="density", discrete="bar"), axisLabels="show")
## Warning in check_and_set_ggpairs_defaults("diag", diag, continuous =
## "densityDiag", : Changing diag$continuous from 'density' to 'densityDiag'
## Warning in check_and_set_ggpairs_defaults("diag", diag, continuous =
## "densityDiag", : Changing diag$discrete from 'bar' to 'barDiag'
We see that the only variable with a nearly normal distribution is Edu.Exp. It is also highly correlated with almost all other variables. Edu2.FM also correlates with the other variables, but its correlation coefficients are smaller than the corresponding coefficients of Edu.Exp. Life.Exp is positively correlated with GNI, and negatively with Mat.Mor and Ado.Birth. GNI and Mat.Mor are also negatively correlated. Mat.Mor has a strong positive correlation with Ado.Birth.
These dependencies are shown more clearly using corrplot():
cor_matrix <- cor(human_) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
In this chapter we explore two methods for unsupervised learning - principal component analysis (PCA) and correspondence analysis. In unsupervised learning we have only a set of features X1, X2, . . ., Xp measured on n observations; there is no associated response variable Y. Therefore we are not interested in prediction here. The goal is to discover interesting relations among the measurements on X1, X2, . . ., Xp. PCA is a tool used for data visualization or for data pre-processing before applying supervised learning methods. Here we apply this technique for dimensionality reduction. It is performed by the singular value decomposition known from linear algebra. The data are transformed to a new space with an equal or smaller number of dimensions, called principal components. Principal components are expressed by relationships involving the original features X1, X2, . . ., Xp. They are uncorrelated, and each one captures less of the variance than the previous one.
First we perform PCA on the non-standardized human data.
pca_human <- prcomp(human_)
The results of PCA are visualized using biplots. These are scatter plots where the observations are placed on the x and y coordinates defined by the first two principal components, PC1 and PC2. Arrows and labels on these plots represent the connections between the original features and the principal components. The length of an arrow is proportional to the standard deviation of the corresponding feature. The angle between a feature arrow and a PC axis reflects their correlation; a small angle corresponds to a high positive correlation. Analogously, the angle between two arrows corresponds to the correlation between the two features.
The biplot of PCA on the non-standardized human data is obtained using the biplot() function.
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
The points marked in grey are the observations, which correspond to the different countries, while the arrows reflect the standard deviations of the corresponding features.
It is difficult to analyze this plot because the features are measured on very different scales. We observe that GNI has by far the highest impact on PC1, and this correlation is negative.
Next we apply standardization to X1,X2, . . .,Xp.
human_std <- scale(human_)
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
The function scale() computes (x - mean(x)) / sd(x) for each column. Therefore the mean of the standardized data is 0 and their standard deviation is 1.
In contrast with linear regression, in which scaling the variables has no effect on the fit, here scaling changes the results.
In linear regression, multiplying a variable by a factor c leads to multiplication of the corresponding coefficient estimate by a factor of 1/c.
In PCA the situation is different, because the scaling of the variables directly influences the calculation of each principal component. As observed in the biplot above, when the features have different scales, the absolute values of their variances also differ a lot. Then the first principal component loading vector has a very large loading for GNI, since this variable has the highest variance.
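A small sketch to check both claims on the human data; the helper object c_factor is only illustrative:
# scale() matches (x - mean(x)) / sd(x)
x <- human_$GNI
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))
# multiplying a predictor by c divides its regression coefficient by c
c_factor <- 1000
m1 <- lm(Life.Exp ~ GNI, data = human_)
m2 <- lm(Life.Exp ~ I(GNI * c_factor), data = human_)
coef(m1)["GNI"] / coef(m2)["I(GNI * c_factor)"]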
We perform PCA on the scaled data.
pca_human <- prcomp(human_std)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
After comparing the last two biplots we see that scaling has a substantial effect on the results. Because it is undesirable for the principal components to depend on an arbitrary choice of scaling, we typically scale each variable to have standard deviation one before performing PCA.
The last biplot shows that Mat.Mor is highly correlated with PC1, while Parli.F is highly correlated with PC2. We can assume that the correlation between these two original features is negligible, so PC1 places most of its weight on Mat.Mor while PC2 places most of its weight on Parli.F. We also observe high correlations between Mat.Mor, Ado.Birth, Edu.Exp, Edu2.FM, Life.Exp and GNI, which were also visible in the correlation matrix plot. The observations on the right-hand side of this biplot are countries where the maternal mortality ratio and the adolescent birth rate are high, while at the same time (because of the negative correlation) the gross national income per capita, the expected years of schooling, the ratio of females to males with at least secondary education, and the life expectancy at birth are low. On the left-hand side of the biplot we see the opposite situation: countries where the maternal mortality ratio and adolescent birth rate are low, and at the same time the gross national income per capita, expected years of schooling, female-to-male secondary education ratio, and life expectancy at birth are high. In the upper part of the plot are the countries where the ratio of female to male labour force participation is high and the percentage of female representation in the parliament is also high. The opposite is observed in the bottom part of the biplot. Therefore we can conclude that the countries where we expect a high human development index are situated in the upper left part of this biplot.
We take a look at the summary of the PCA:
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
and round the percentage of variance captured by each PC.
pca_pr <- round(100 * s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
Finally we add these percentages of variance to the axis labels of the biplot.
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
With this new information we can answer the question of how much of the information in the human data set is lost by projecting the observations onto the first two principal components: the amount of variance not captured by PC1 and PC2 is 100 - (53.6 + 16.2) = 30.2%.
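The same number can be obtained directly from the rounded percentages computed above:
# variance not captured by the first two principal components (30.2)
100 - sum(pca_pr[1:2])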
In the case of categorical variables, dimensionality reduction can be performed using correspondence analysis (CA, for two variables) or multiple correspondence analysis (MCA, for more than two). MCA is a generalization of PCA and an extension of CA. In this analysis, cross-tabulation matrices between all variables in the data set are used, and the information from the cross-tabulations is presented in a clear graphical form. MCA can also be used, for example, as pre-processing for clustering.
Here we use the tea data from the FactoMineR package and select 6 categorical variables from the full dataset.
library(FactoMineR)
library(tidyr)
library(ggplot2)
library(dplyr)
data("tea")
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
Only the variables sugar and lunch are binary; the others have 3 (Tea, how, where) or 4 (How) levels. Therefore the MCA approach is suitable for investigating these data.
Graphically the data are presented as bar plots using the ggplot() function.
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
The number of people who drink their tea with sugar is almost the same as the number who do not. The other variables have clearly unequal numbers of observations across their levels. There are no variable categories with a very low frequency, which could otherwise distort the MCA.
Next, MCA is performed on the chosen tea data.
mca <- MCA(tea_time, graph = FALSE)
In a correspondence analysis, a crosstabulation table of frequencies is first standardized, so that the relative frequencies across all cells sum to 1.0. The goal of (M)CA is to represent the entries in the table of relative frequencies in terms of the distances between individual rows and/or columns in a low-dimensional space. In the terminology of (M)CA, the row and column totals of the matrix of relative frequencies are called the row mass and column mass.
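As an illustration of these concepts, a small sketch computing the relative frequencies and the row and column masses for one cross tabulation of the tea data:
# relative frequencies of one crosstab sum to 1
tab <- table(tea_time$how, tea_time$where)
P <- tab / sum(tab)
rowSums(P)  # row masses
colSums(P)  # column masses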
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
The MCA results are interpreted in the same way as the results of a simple CA.
The first table contains the variances and the percentage of variance retained by each dimension. The eigenvalues correspond to the amount of information retained by each axis. The dimensions are ordered decreasingly by the amount of variance explained: dimension 1 explains the most variance in the solution, followed by dimension 2, and so on down to dimension 11.
The second table of the summary() output contains the coordinates, the contribution and the cos2 (quality of representation, in [0, 1]) of the first 10 active individuals on dimensions 1 and 2.
The third table contains the coordinates, the contribution and the cos2 (quality of representation, in [0, 1]) of the first 10 active variable categories on dimensions 1 and 2. This table also contains a column called v.test. For a given variable category, if the absolute value of the v.test is greater than 2, the coordinate is significantly different from 0. This is the case for almost all categories on Dim.1, except alone and other. The absolute value of the v.test is less than 2 for the category black on Dim.2, and for 5 categories on Dim.3: green, milk, tea bag, tea bag+unpackaged and unpackaged.
The fourth table, categorical variables (eta2), contains the squared correlation between each variable and the dimensions. A value close to 1 indicates a strong link between the variable and the dimension. On Dim.1 this is the case for the variables how and where. In other words, this table shows that how and where are strongly linked to the first dimension and a little less to the second, while Tea, How, sugar and lunch are more related to the third dimension.
We first see from the fourth table which variables are linked to which dimensions, and then from the third table we observe which categories are linked to those dimensions.
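For reference, the squared correlations discussed above can presumably be extracted directly from the MCA object; a sketch assuming FactoMineR stores them in mca$var$eta2:
# squared correlations (eta2) between the variables and the first three dimensions
round(mca$var$eta2[, 1:3], 3)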
To further clarify the MCA results we use a graphical representation. The MCA scatter plot draws a biplot of the individuals and the variable categories.
plot(mca, invisible=c("ind"), habillage = "quali")
The MCA factor map is a two-dimensional biplot of the active category levels. On this plot the distances between the levels and the barycentre, and the distances between the levels themselves, provide information about the similarity of the different categories.
Only the first two dimensions are shown on this biplot. We observe that the categories Earl Grey, sugar and milk are close to each other, so people who prefer Earl Grey most probably drink it with sugar and milk. There is similarity between the unpackaged and tea shop categories, and also between the tea bag+unpackaged and chain store+tea shop categories. The category green stands alone, quite far from the other categories.