This course is a brief introduction to data science: what to do with large amounts of data, how to model them, and how to extract important information that can later be used for analyses and predictions. The course also teaches us how to use GitHub (a version control system) to share results. The modeling of the data will be performed in R.
The first step is to learn basic GitHub operations such as fork, clone and commit. We also have to take our first steps in R. The goal of creating this document is to learn how to create markdown documents and share them on the web using GitHub.
Here is a link to the GitHub page related to my work on this course:
https://github.com/noykova/IODS-project
Here we use part of the sociological data described here. The data concern the learning process of students. The full data set consists of 183 observations of 60 variables. After wrangling the data into a format that is easy to analyze, we obtain 166 observations of 7 variables.
The wrangling includes the following changes:
The column Attitude in the original data is the sum of 10 questions related to the students' attitude towards statistics, each measured on the Likert scale (1-5). Here we have scaled this combination variable back to the 1-5 scale.
Next, multiple questions are combined into combination variables so that three groups of questions related to deep, surface and strategic learning are formed. The columns deep, surf and stra are created by averaging the corresponding groups of questions.
In the original data the variable ‘points’ denotes the student's exam points in a statistics course exam. If a student did not attend the exam, the value of ‘points’ is zero. These observations are removed from the data.
Thus the data set used in this analysis includes 166 observations of the following variables: gender, age, attitude, deep, stra, surf and points, where gender (F/M) and age are the gender and age of the students, and only observations with points > 0 are kept.
The R code for the wrangling described above is given here.
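Here is also a minimal sketch of those steps, where the raw survey data frame raw, the question-name vectors deep_q, surf_q and stra_q, and the capitalized raw column names Age, Attitude and Points are assumptions used only for illustration:
library(dplyr)
# scale the combination variable Attitude (a sum of 10 Likert items) back to the 1-5 scale
raw$attitude <- raw$Attitude / 10
# average the three groups of questions into combination variables
raw$deep <- rowMeans(raw[, deep_q])
raw$surf <- rowMeans(raw[, surf_q])
raw$stra <- rowMeans(raw[, stra_q])
# keep the analysis variables and only the students who attended the exam
learning2014 <- raw %>%
  select(gender, Age, attitude, deep, stra, surf, Points) %>%
  rename(age = Age, points = Points) %>%
  filter(points > 0)
write.table(learning2014, "learning2014.txt", row.names = FALSE)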
The data are read using the following command:
lrn14 <- read.table("learning2014.txt", header=TRUE)
dim(lrn14)
## [1] 166 7
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
Next we create a more advanced plot matrix with ggpairs(). This function displays the pairwise relationships of the (grouped) variables in a plot matrix. The plot is used to roughly determine whether there are linear correlations between the variables.
p <- ggpairs(lrn14, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
The arguments of this call are:
mapping - an aesthetic mapping created with aes(); it defines how the variables in the data are mapped to visual properties of the plots.
col - should be a categorical variable, for which different colors are used.
alpha - for large datasets with overlapping points the alpha aesthetic makes the points more transparent.
lower (or upper) - lists that may contain the elements continuous, combo, discrete or na. Each element may be a function or a string.
combo - this option is used when either X is continuous and Y categorical, or X is categorical and Y continuous. It can take the values box, box_no_facet, dot, dot_no_facet, facethist, facetdensity or blank.
If lower (or upper or diag) lists are supplied, the combo option should be given as a function(data, mapping). If a specific function fn needs its parameters set, the function and its parameters should be wrapped as wrap(fn, param1 = val1, param2 = val2).
facets - create a subplot for each group and draw the subplots side by side.
facet_wrap - subplots are laid out horizontally and wrap around into several rows.
stat_bin - bins the data into bars (controlled by bins or binwidth); by default the binwidth is 1/30 of the range of the data.
This code produces the following plot:
p
The graphs on the diagonal reflect the marginal distributions of the variables.
The first (top) row shows box plots of each of the other variables grouped by the categorical variable gender. The marginal distribution of gender shows that the group of female students is larger, while male students tend to be a little older, have a higher attitude and get slightly better points.
The histograms in the first column represent the distributions of points, surf, stra, deep, attitude and age for the two values (M/F) of the categorical variable gender separately.
The scatter plots under the diagonal show the dependence of each pair of variables. As the matrix is symmetric, the upper part shows instead the correlation coefficient for every pair, overall and also separated by gender (M/F).
From the diagonal plots we see that the marginal distributions of the variables stra and deep are close to normal.
Thus we can hypothesize that the variable points depends on attitude (Cor = 0.437), stra (Cor = 0.146) and surf (Cor = -0.144), because the absolute values of the correlation coefficients are highest for these variables.
In order to exclude the effect of multicollinearity in the next step (multiple regression), we check the correlations among the pairs of variables that are candidates for explanatory variables. If the explanatory variables are correlated, it becomes very difficult for the model to determine the true effect of each of them on the dependent variable.
From the matrix plot above we observe that Cor(stra, attitude) = 0.0617, Cor(surf, attitude) = -0.176 and Cor(stra, surf) = -0.161. Usually only a correlation above 0.8 (a subjective threshold) is considered high, so at this stage we do not consider removing any variable.
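The pairwise correlations of the candidate explanatory variables and points can also be checked numerically:
# correlation matrix of the candidate explanatory variables and the dependent variable
cor(lrn14[, c("attitude", "stra", "surf", "points")])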
Regression is a parametric technique used to predict a continuous (dependent) variable given a set of independent variables. Mathematically, linear regression approximates the dependent variable by a linear function of the independent variables. When there is one independent variable, we speak of simple regression. In this task we use 3 different independent variables, which makes the problem a multiple regression.
The function lm() in R is used for linear regression. We create a regression model with the multiple explanatory variables attitude, stra and surf. The choice of the independent variables is based on the analysis in section 2. Points is chosen as the dependent variable.
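Written out, the model to be fitted is
$$points_i = \beta_0 + \beta_1 \, attitude_i + \beta_2 \, stra_i + \beta_3 \, surf_i + \varepsilon_i,$$
where the errors $\varepsilon_i$ are assumed to be independent and normally distributed with constant variance.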
my_model <- lm(points ~ attitude + stra + surf, data = lrn14)
Print the summary of the model:
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From this summary we observe that the median of the residuals (0.52) is fairly close to zero. For normally distributed residuals this indicates that the mean difference between the predictions and the actual values is close to 0.
Next we analyse the parameter estimates and their standard errors. The intercept is the beta0 value; it is the prediction made by the model when all independent variables are set to 0. We observe that the standard error of the regression coefficient for surf is bigger than the parameter estimate itself. The standard error of the regression coefficient for stra is also relatively high.
The last column of the coefficient statistics shows the p-value of the corresponding regression coefficient. When the p-value is < 0.05 the coefficient is considered significant. We observe p > 0.05 for the regression coefficients of stra and surf.
From this analysis we conclude that the variables surf and stra could be excluded from the model.
Next we perform multiple regression with two explanatory variables: first (attitude, stra), and then (attitude, surf).
Finally we fit a simple regression excluding both stra and surf.
my_model21 <- lm(points ~ attitude + stra, data = lrn14)
summary(my_model21)
##
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
my_model22 <- lm(points ~ attitude + surf, data = lrn14)
summary(my_model22)
##
## Call:
## lm(formula = points ~ attitude + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.277 -3.236 0.386 3.977 10.642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.11957 3.12714 4.515 1.21e-05 ***
## attitude 0.34264 0.05764 5.944 1.63e-08 ***
## surf -0.77895 0.79556 -0.979 0.329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 163 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1854
## F-statistic: 19.78 on 2 and 163 DF, p-value: 2.041e-08
my_model1 <- lm(points ~ attitude, data = lrn14)
summary(my_model1)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
From the summaries of the last 3 models we conclude that the multiple regression points ~ attitude + stra could be chosen as a reasonably good model. The regression coefficient for stra has a p-value (0.089) only a little above the 0.05 boundary, so we can still include it in the multiple regression, although the standard error of its estimate is quite big. The F-statistic of this model is also larger (with a smaller p-value) than that of points ~ attitude + surf. We also remember that the correlation between attitude and stra in the pair plot matrix is low.
Therefore we decide to continue with both simple (points ~ attitude) and multiple (points ~ attitude + stra) regression models.
Interpretation of model parameters was provided above, in section 3.
R-squared (the coefficient of determination) is a metric used to assess the goodness of fit of the model. It gives the proportion of the variance of the dependent variable explained by the covariates in the model and ranges between 0 and 1. Usually higher values are desirable, but the achievable level depends on the data quality and the domain; for example, if the data are noisy, we may accept a model with a low R-squared.
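Formally, with observed values $y_i$, fitted values $\hat{y}_i$ and mean $\bar{y}$, it is defined as
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}.$$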
R-squared can be illustrated graphically by plotting the observed responses versus the fitted responses.
For the simple regression this plot is:
library(lattice)
xyplot(points ~ attitude, lrn14, grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
type = c("p", "r"), col.line="red")
Next we check residual plots and try to understand the pattern and derive actionable insights.
Set graphic output:
par(mfrow=c(2,2))
Create residual plots for simple and multivariate models
plot(my_model1)
plot(my_model21)
The four residual plots show potential problematic cases with the row numbers of the data in the dataset.
The Residuals versus Fitted values plot should not show any pattern, and this is the case for both investigated models. A few possible outliers can be identified in these plots: observations #35, #56 and #145 are flagged for both regression models. If we see any shape (a curve, a U-shape), it suggests non-linearity in the data. If a funnel-shaped pattern occurs, the data suffer from heteroscedasticity, i.e. the error terms have non-constant variance.
The Normal Q-Q plot is used to check the normality of the errors. It plots the standardized residuals against the theoretical normal quantiles. In the ideal case this plot shows a straight line; otherwise the residuals have a non-normal distribution. In both investigated models the plot is close to a straight line. Again, observations #35, #56 and #145 are flagged as possible outliers.
The Scale-Location plot is also useful for detecting heteroscedasticity. In the ideal case it should not show any pattern, and in both investigated models it shows none. Again, observations #35, #56 and #145 are flagged as possible outliers.
The Residuals versus Leverage plot helps us to find influential cases (i.e. subjects), if any. Not all outliers are influential in linear regression analysis. Unlike in the other plots, patterns are not relevant here; we watch out for outlying values in the upper right or lower right corner, since those are the places where cases can be influential against a regression line. We look for cases outside of the dashed line, i.e. outside Cook's distance. When cases have high Cook's distance scores, they are influential to the regression results: the results would be altered if we excluded those cases.
In both models there are such cases outside the Cook’s distance. For the simple regression model these are observations numbered #35, #56 and #71.
The situation is similar for the multivariate model. Here the observations #35, #141 and #71 are outside the Cook's distance; two of these points, #35 and #71, are the same for both models.
The analysis of the four residual plots thus suggests a few points as possible outliers: #35, #56 and #145 in the first 3 plots, and additionally #71 for the simple model and #141 for the multivariate model in the leverage plots. If there were additional information supporting linear regression for these data, we could simply exclude these observations and continue the model analysis.
Before making a decision about these points we have to check how influential they are by excluding them from the data set and performing the parameter estimation again. Then we compare the summary results and draw conclusions about the impact of these points on the parameter estimates.
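A minimal sketch of such a check, assuming that the flagged row numbers correspond directly to rows of lrn14:
# refit the chosen model without the flagged observations and compare the summaries
flagged <- c(35, 56, 145)
my_model21_clean <- lm(points ~ attitude + stra, data = lrn14[-flagged, ])
summary(my_model21_clean)
# compare with summary(my_model21) above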
As we see from the residual plots, the results are very similar for the simple regression points ~ attitude and the multiple regression points ~ attitude + stra. Since the model with the additional explanatory variable explains slightly more of the variance, the final modeling choice is the multiple regression points ~ attitude + stra.
[Meaning of all coefficients given in the summary for linear regression (lm)](http://blog.yhat.com/posts/r-lm-summary.html)
[Article about ggplot and ggpairs](http://vita.had.co.nz/papers/gpp.pdf)
[Beginners guide to regression](http://blog.hackerearth.com/beginners-guide-regression-analysis-plot-interpretations)
Understanding Diagnostic Plots for Linear Regression Analysis
Here we use modified student alcohol consumption data described here.
The joined data set used in the analysis exercise combines the two student alcohol consumption data sets. The following adjustments have been made:
The variables not used for joining the two data have been combined by averaging (including the grade variables)
‘alc_use’ is the average of ‘Dalc’ and ‘Walc’
‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
The R code for the wrangling described above is given here.
The data are read using the following command:
alc<- read.table("alc.txt", header=TRUE)
dim(alc)
## [1] 382 35
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
head(alc)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason nursery internet guardian traveltime studytime failures
## 1 course yes no mother 2 2 0
## 2 course no yes father 1 2 0
## 3 other yes yes mother 1 2 2
## 4 home yes yes mother 1 3 0
## 5 home yes no father 1 2 0
## 6 reputation yes yes mother 1 2 0
## schoolsup famsup paid activities higher romantic famrel freetime goout
## 1 yes no no no yes no 4 3 4
## 2 no yes no no yes no 5 3 3
## 3 yes no yes no yes no 4 3 2
## 4 no yes yes yes yes yes 3 2 2
## 5 no yes yes no yes no 4 3 2
## 6 no yes yes yes yes no 5 4 2
## Dalc Walc health absences G1 G2 G3 alc_use high_use
## 1 1 1 3 5 2 8 8 1.0 FALSE
## 2 1 1 3 3 7 8 8 1.0 FALSE
## 3 2 3 3 8 10 10 11 2.5 TRUE
## 4 1 1 5 1 14 14 14 1.0 FALSE
## 5 1 2 5 2 8 12 12 1.5 FALSE
## 6 1 2 5 8 14 14 14 1.5 FALSE
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The purpose of this analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data.
From physiology it is known that alcohol has a different influence on males and females. Therefore we take sex as an important explanatory variable to investigate further.
If a student has a problem with alcohol, he or she will likely tend to skip school classes more often, so the second explanatory variable we choose is absences.
Alcohol consumption is also likely to influence the student's grades, which quantify the success of the study process. Thus we choose the grades as the third explanatory variable.
Finally, we find it logical to assume that alcohol consumption is influenced by free time: if students have more free time, they have more opportunities to go to parties or to engage in other activities involving alcohol.
The box plot (box, whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. In the simplest box plot the central rectangle spans the first quartile to the third quartile (the interquartile range or IQR). A segment inside the rectangle shows the median and “whiskers” above and below the box show the locations of the minimum and maximum. Values outside the whiskers can be considered as outliers, unusually distant observations.
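For example, the five numbers summarized by a box plot of the absences can be computed directly:
# five-number summary (minimum, lower hinge, median, upper hinge, maximum) of absences
fivenum(alc$absences)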
Here we use the box plot analyses provided in the DataCamp exercise and add two additional plots aiming to clarify the influence of free time on alcohol consumption, grades and absences.
library(GGally)
library(ggplot2)
g1 <- ggplot(alc, aes(x = high_use, y = G3, col=sex))
g1 + geom_boxplot() + ylab("grade")
We see that the distributions of grades for male and female students with high alcohol consumption have different medians and spreads.
g2 <- ggplot(alc, aes(x = high_use, y = absences, col=sex)) + ggtitle("Student absences by alcohol consumption and sex")
g2 + geom_boxplot() + ylab("absences")
This plot shows that the distributions of absences for male and female students with high alcohol consumption also have different medians and spreads.
g3 <- ggplot(alc, aes(x = high_use, y = G3, col=freetime)) + ggtitle("Student grades by alcohol consumption and free time")
g3 + geom_boxplot() + ylab("grades")
g4 <- ggplot(alc, aes(x = high_use, y = absences, col=freetime)) + ggtitle("Student absences by alcohol consumption and free time")
g4 + geom_boxplot() + ylab("absence")
From the last two plots we can conclude that the grades and absences of students with different alcohol consumption are influenced by the amount of the students' free time.
From these plots it becomes clear that it is logical to treat sex, absences, grades and free time as variables that may have a high impact on alcohol consumption.
In the previous chapter 2 we used a plot matrix to roughly estimate the dependences between the different variables in the model. This was possible because the output variable was continuous.
Here the output variable high_use is binary and therefore we should explore other techniques.
Bar plots are one of the most commonly used kinds of data visualization. They display numeric values (on the y-axis) for different categories (on the x-axis).
Bar plots need not be based on counts or frequencies: we can also create bar plots that represent means, medians or standard deviations. In that case we have to use the aggregate() function and pass the results to the barplot() function.
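A minimal sketch of that approach, here showing the mean number of absences in the two alcohol consumption groups (base R only):
# mean absences by alcohol consumption group: aggregate() computes the means, barplot() draws them
abs_by_use <- aggregate(absences ~ high_use, data = alc, FUN = mean)
barplot(abs_by_use$absences, names.arg = as.character(abs_by_use$high_use), xlab = "high_use", ylab = "mean absences")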
Here, however, we examine counts on the y-axis. The grouped plots below present the counts of the four explanatory variables, with the alcohol consumption given as the fill:
bp <- ggplot(alc, aes(freetime, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Grouped Barplot of free time counts for high and low alcohol usage")
bp
bp2 <- ggplot(alc, aes(sex, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Grouped Barplot of free time counts for high and low alcohol usage")
bp2
bp3 <- ggplot(alc, aes(absences, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Grouped Barplot of absence counts for high and low alcohol usage")
bp3
bp4 <- ggplot(alc, aes(G3, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Grouped Barplot of grades counts for high and low alcohol usage")
bp4
The bar graph of free time counts shows that the use of alcohol is higher for higher values of free time. As expected, for a higher number of absences the use of alcohol is also higher.
The next bar plot shows the dependence of grades on high alcohol use and free time. It is clear that students who have more free time tend to use alcohol and get lower grades.
b2 <- ggplot(alc, aes(x=high_use, y=G3, fill=freetime)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Student grasdes by alcohol consumption and free time")
b2
A similar graph is drawn for the absences. Here it is clearer that the students who use alcohol have both more free time and a larger number of absences.
# dependence of absences on high_use and free time
b3 <- ggplot(alc, aes(x=high_use, y=absences, fill=freetime)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Student absences by alcohol consumption and free time")
# draw the plot
b3
Cross tabulations are tables in a matrix format that display the (multivariate) frequency distribution of the variables. They provide a basic picture of the interrelation between two variables and can help find interactions between them.
In the CrossTable() function we can control whether row percentages (prop.r), column percentages (prop.c) or table percentages (prop.t) show up by setting them to TRUE in the call. The chi-square contributions of the cells are also shown.
We construct cross tables for the chosen explanatory variables.
library(descr)
CrossTable(alc$freetime,alc$high_use, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## =====================================
## alc$high_use
## alc$freetime FALSE TRUE Total
## -------------------------------------
## 1 15 2 17
## 0.792 1.862
## 0.882 0.118 0.045
## 0.056 0.018
## 0.039 0.005
## -------------------------------------
## 2 49 16 65
## 0.253 0.595
## 0.754 0.246 0.170
## 0.183 0.140
## 0.128 0.042
## -------------------------------------
## 3 113 40 153
## 0.298 0.702
## 0.739 0.261 0.401
## 0.422 0.351
## 0.296 0.105
## -------------------------------------
## 4 70 41 111
## 0.796 1.872
## 0.631 0.369 0.291
## 0.261 0.360
## 0.183 0.107
## -------------------------------------
## 5 21 15 36
## 0.717 1.686
## 0.583 0.417 0.094
## 0.078 0.132
## 0.055 0.039
## -------------------------------------
## Total 268 114 382
## 0.702 0.298
## =====================================
We again see the association between a larger amount of free time and high alcohol consumption.
CrossTable(alc$sex,alc$high_use, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ================================
## alc$high_use
## alc$sex FALSE TRUE Total
## --------------------------------
## F 156 42 198
## 2.102 4.942
## 0.788 0.212 0.518
## 0.582 0.368
## 0.408 0.110
## --------------------------------
## M 112 72 184
## 2.262 5.318
## 0.609 0.391 0.482
## 0.418 0.632
## 0.293 0.188
## --------------------------------
## Total 268 114 382
## 0.702 0.298
## ================================
We see that high alcohol consumption is clearly more common among male students (about 39%) than among female students (about 21%).
Next we model the output variable high_use as a function of the chosen explanatory variables using logistic regression.
Logistic regression is a method for fitting a regression curve, y = f(x), when y consists of proportions or probabilities, or binary coded (0,1–failure, success) data. When the response is a binary (dichotomous) variable, and x is numeric, logistic regression fits a logistic curve to the relationship between x and y.
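With predictors $x_1, \dots, x_k$ and $p = P(y = 1 \mid x)$, the model is
$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,$$
so each coefficient $\beta_j$ is the change in the log odds for a one unit increase in $x_j$, and $e^{\beta_j}$ is the corresponding odds ratio.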
We define the students' free time as a factor variable (it takes the values 1-5). The model and the results (summary and coefficients) are obtained using the following functions:
alc$freetime <- factor(alc$freetime)
m <- glm(high_use ~ G3 + absences+sex+freetime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ G3 + absences + sex + freetime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6795 -0.8329 -0.6122 1.0647 2.1349
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.45690 1.00686 -2.440 0.01468 *
## G3 -0.07553 0.03653 -2.068 0.03866 *
## absences 0.09651 0.02276 4.240 2.24e-05 ***
## sexM 0.88658 0.24847 3.568 0.00036 ***
## freetime2 1.30818 0.94479 1.385 0.16616
## freetime3 1.41907 0.91790 1.546 0.12210
## freetime4 1.81731 0.92203 1.971 0.04873 *
## freetime5 1.78978 0.96758 1.850 0.06435 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 418.34 on 374 degrees of freedom
## AIC: 434.34
##
## Number of Fisher Scoring iterations: 5
coef(m)
## (Intercept) G3 absences sexM freetime2 freetime3
## -2.45689613 -0.07553214 0.09650664 0.88657847 1.30818455 1.41906976
## freetime4 freetime5
## 1.81730551 1.78977515
The deviance residuals are a measure of model fit. The explanatory variables G3, absences and sex are statistically significant, since their p-values are < 0.05; G3 is less significant than absences and sex. As free time is a factor variable and shows a significant level for only one of its categories, its overall significance should be investigated further.
The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
For every one unit increase in the grade G3, the log odds of high alcohol consumption (versus low consumption) decrease by 0.076.
For a one unit increase in absences, the log odds of high alcohol consumption increase by 0.097.
The categorical variable sex has a slightly different interpretation: being a male student instead of a female student increases the log odds of high alcohol consumption by 0.89.
The factor variable for free time is interpreted similarly. For example, for students with free time level 2, versus students at the reference level 1, the log odds of high alcohol consumption increase by 1.31.
We can use the confint function to obtain confidence intervals for the coefficient estimates. For logistic models, confidence intervals are based on the profiled log-likelihood function. We can also get confidence intervals based on just the standard errors by using the default method.
confint(m)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -4.78973005 -0.713705961
## G3 -0.14778245 -0.004113323
## absences 0.05366041 0.143382613
## sexM 0.40452432 1.380580560
## freetime2 -0.32049218 3.515322927
## freetime3 -0.13114527 3.597014983
## freetime4 0.25536845 4.000588764
## freetime5 0.10649632 4.032065978
Confidence intervals using standard errors:
confint.default(m)
## 2.5 % 97.5 %
## (Intercept) -4.43031196 -0.483480302
## G3 -0.14712585 -0.003938437
## absences 0.05189363 0.141119657
## sexM 0.39958049 1.373576441
## freetime2 -0.54356698 3.159936066
## freetime3 -0.37997251 3.218112035
## freetime4 0.01016471 3.624446311
## freetime5 -0.10664973 3.686200033
These tables show that the confidence intervals of G3, absences and sexM do not contain zero, while the intervals of the lower free time levels do, which again suggests testing the overall effect of free time.
In this model we use one factor variable (free time), whose coefficients denote the difference of each class from the reference class. To investigate the overall effect of this variable we perform a Wald test (package aod, function wald.test). The goal is to test whether the differences between the coefficient of the reference class and the other classes are jointly different from zero. The order in which the coefficients appear in the coefficient table is the same as the order of the terms in the model. The argument b of wald.test() supplies the coefficients, Sigma supplies the variance-covariance matrix of the estimates, and Terms tells R which terms in the model are to be tested; in this case terms 5 to 8 are the four terms for the levels of the students' free time.
library(aod)
wald.test(b = coef(m), Sigma = vcov(m), Terms = 5:8)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 19.8, df = 4, P(> X2) = 0.00056
The chi-squared test statistic of 19.8 with four degrees of freedom is associated with a p-value of 0.00056 < 0.05, indicating that the overall effect of the students' free time on alcohol consumption is statistically significant.
We can also exponentiate the coefficients and interpret them as odds-ratios. To get the exponentiated coefficients, we tell R that we want to exponentiate (exp), and that the object we want to exponentiate is called coefficients and it is part of model m (coef(m)). We use the same logic to get odds ratios and their confidence intervals, by exponentiating the confidence intervals from before. To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise.
exp(coef(m))
## (Intercept) G3 absences sexM freetime2 freetime3
## 0.08570054 0.92724993 1.10131690 2.42681201 3.69945143 4.13327372
## freetime4 freetime5
## 6.15525084 5.98810589
After adding confidence intervals:
exp(cbind(OR = coef(m), confint(m)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.08570054 0.008314702 0.4898256
## G3 0.92724993 0.862618756 0.9958951
## absences 1.10131690 1.055126232 1.1541713
## sexM 2.42681201 1.498589485 3.9772100
## freetime2 3.69945143 0.725791726 33.6267852
## freetime3 4.13327372 0.877090351 36.4891510
## freetime4 6.15525084 1.290937174 54.6303049
## freetime5 5.98810589 1.112373834 56.3772652
Now we can say that for a one unit increase in the grade G3, the odds of high alcohol consumption (versus low consumption) change by a factor of 0.93, i.e. they decrease by about 7%. It has to be noted that the odds ratio for the intercept is not generally interpreted.
From the analyses provided in this paragraph we conclude that all the involved independent variables (grades, absences, sex and free time of the students) are statistically significant, and therefore no variables should be excluded from the model of alcohol consumption.
Predicted probabilities can be computed for both categorical and continuous predictor variables. The R function predict() can be used to predict the alcohol consumption of the students. The type = “response” option tells R to output probabilities of the form P(Y=1|X), as opposed to other information such as the logit. Here we do not supply a separate test data set, so the probabilities are computed for the same training data that was used to fit the logistic regression model.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
probabilities <- predict(m, type = "response")
Following the DataCamp exercise, we add the predicted probabilities to ‘alc’:
alc <- mutate(alc, probability = probabilities)
Use the probabilities to make a prediction of binary variable high_use
alc <- mutate(alc, prediction = probability>0.5)
Tabulate the target variable versus the predictions. The function table() is used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 249 19
## TRUE 86 28
The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonal elements show the incorrect predictions. Thus our model correctly predicts high alcohol consumption for 28 students and no or low consumption for 249 students, so there are 28 + 249 = 277 correct predictions in total. The mean() function can be used to compute the fraction of students for which the predictions about alcohol consumption are correct: the share of correct predictions is (28 + 249)/382 = 0.72, i.e. the model produces about 72% correct predictions.
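The same proportion can be obtained directly with mean():
# share of observations where the prediction agrees with the true value of high_use
mean(alc$prediction == alc$high_use)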
Next we visualize these results:
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
g + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table()%>%addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65183246 0.04973822 0.70157068
## TRUE 0.22513089 0.07329843 0.29842932
## Sum 0.87696335 0.12303665 1.00000000
From the analyses above we conclude that the model has quite high accuracy (72%), but we should keep in mind that we have not used any additional data.
Next we define a loss function (mean prediction error) as:
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
and call it to compute the proportion of wrong predictions in the (training) data:
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2748691
The prediction error is fairly low, 27%, and thus we can consider this model a good approximation of the data on students' alcohol consumption. However, this is the training error rate, which tends to be too optimistic and to underestimate the test error rate. A better approach would be to use only part of the data for training the model and the other part as a test set for prediction; the error rate obtained in this way is more realistic. Finally, we note that the simple heuristic strategy for choosing the independent variables was successful in this case: the chosen independent variables turned out to be significant.
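A minimal sketch of that train/test approach with the same loss function (the 80/20 split proportion and the seed are my own choices):
# hold out 20% of the observations as a test set and estimate the test error rate
set.seed(123)
n <- nrow(alc)
train_ind <- sample(n, size = round(0.8 * n))
m_train <- glm(high_use ~ G3 + absences + sex + freetime, data = alc[train_ind, ], family = "binomial")
test_prob <- predict(m_train, newdata = alc[-train_ind, ], type = "response")
loss_func(class = alc$high_use[-train_ind], prob = test_prob)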
[Contingency tables (cross-tabulation), used for a preliminary assessment of the dependence of two variables](https://en.wikipedia.org/wiki/Contingency_table), see also [Cross-Tabulation Theory](https://www.qualtrics.com/wp-content/uploads/2013/05/Cross-Tabulation-Theory.pdf) and [Contingency tables with gmodels](https://qualityandinnovation.com/2015/02/08/contingency-tables-with-gmodels/)
[Logit regression](http://www.ats.ucla.edu/stat/r/dae/logit.htm)
[An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer, 2013](http://www-bcf.usc.edu/~gareth/ISL/)
Here we use the Boston data from the MASS package in R. The data are described here. They concern housing values in the suburbs of Boston. There are 506 observations of 14 housing characteristics, without missing data. Almost all variables are numerical; the exceptions are the categorical chas, a Charles River dummy variable with values 1 (if the tract bounds the river) or 0 (otherwise), and rad, an integer-valued index of accessibility to radial highways.
In this chapter we use the data about crime rate in the suburbs of Boston for developing classification and clustering models.
The summary of data and its structure are given as:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyr);
library(dplyr);
library(ggplot2)
library(corrplot)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
The preliminary analysis of the data is based on a plot matrix of the variables:
pairs(Boston)
library(GGally)
ggpairs(Boston, diag=list(continuous="density", discrete="bar"), axisLabels="show")
## Warning in check_and_set_ggpairs_defaults("diag", diag, continuous =
## "densityDiag", : Changing diag$continuous from 'density' to 'densityDiag'
## Warning in check_and_set_ggpairs_defaults("diag", diag, continuous =
## "densityDiag", : Changing diag$discrete from 'bar' to 'barDiag'
cor_matrix<-cor(Boston) %>% round(digits =2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
library(corrplot)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
In this correlogram the visualization method is “circle” and the layout “upper” displays the upper triangle of the correlation matrix.
From these plots and numerical calculations we see that the crime rate is clearly correlated with almost all variables except the dummy variable chas (corr = -0.06). The correlation is quite low (|corr| ≈ 0.2) between the crime rate crim and zn (proportion of residential land zoned) and rm (average number of rooms per dwelling). All other variables have a higher (in absolute value) correlation with the crime rate.
We also observe higher correlations between some of the other variables; for example, nox (nitrogen oxides concentration) is strongly correlated (|corr| > 0.6) with age, dis and rad. We do not describe all highly correlated variables here because we focus on the crime rate.
From the matrix plot we observe that only the variable rm has a distribution close to normal.
The summary of the data shows that the variables have very different ranges, so rescaling them is reasonable before further analysis. Therefore we continue by standardizing the data.
A column x is rescaled by subtracting the column mean mean(x) from the values and dividing the difference by the standard deviation sd(x):
scale(x) = (x - mean(x)) / sd(x)
Since the Boston data contain only numerical values (including the two categorical variables), we can use the function scale() to standardize the whole dataset.
R code for center and standardization of data:
boston_scaled <- scale(Boston, center = TRUE, scale = TRUE)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
We observe that after standardization the mean of every variable is 0 (and the standard deviation is 1), which allows us to compare the influence of the variables directly.
Linear discriminant analysis (LDA) is used here to find a linear combination of features which characterizes and separates two or more classes of objects. The resulting combination is used as a linear classifier.
The goal of the classification problem is to find a good predictor (expressed as a linear combination of the original variables) of the class y of any sample from the same distribution (not necessarily from the training set), given only an observation x. The model produces the best possible separation between the different classes.
As a first step we choose the class y to be the crime rate in the Boston dataset. We use the cut() function to convert this continuous variable into a categorical one. The breaks argument of cut() describes how ranges of numbers are converted to factor levels: we provide a vector whose values are used as the breakpoints. The number of levels of the resulting factor is one less than the number of values in the vector. We define this vector using the quantile() function applied to the scaled continuous variable crim.
boston_scaled <- scale(Boston, center = TRUE, scale = TRUE)
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(scaled_crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
We look at the table of the categorical variable to see how many observations there are in each class:
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
Next we drop the old crime rate variable from the dataset and add the new categorical value to scaled data:
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Since we want to use the model for prediction, we divide the dataset into train and test sets so that 80% of the data belongs to the training set. We randomly choose 80% of the rows for the training set and use the rest of the data for testing.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
We want to separate the Boston data by crime rate, where the number of classes is C = 4 and the number of variables is x = 13. The maximum number of useful discriminant functions that can separate the Boston data by crime rate is the minimum of C - 1 and x, which in our case is min(3, 13) = 3. Thus we can find at most 3 useful discriminant functions to separate the data by crime rate.
The linear discriminant model fitted on the training data produces the following results:
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2450495 0.2599010 0.2376238
##
## Group means:
## zn indus chas nox rm
## low 0.8712229 -0.8900991 -0.08304540 -0.8552027 0.44217953
## med_low -0.1071928 -0.3302405 -0.07348562 -0.5597954 -0.14097084
## med_high -0.3802515 0.1853830 0.25261763 0.4301760 0.08750128
## high -0.4872402 1.0149946 -0.02626030 1.0641856 -0.37022150
## age dis rad tax ptratio
## low -0.8299932 0.8489068 -0.6837115 -0.7564337 -0.36586235
## med_low -0.3548290 0.3026642 -0.5549663 -0.4929813 -0.09283751
## med_high 0.4268245 -0.4131735 -0.4087315 -0.3162735 -0.39253612
## high 0.8004355 -0.8386170 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.38472978 -0.76605797 0.52011370
## med_low 0.32637623 -0.15047880 -0.00521445
## med_high 0.08421343 0.03432644 0.17330024
## high -0.77129054 0.85490898 -0.74326997
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.077469812 0.585651714 -0.89363673
## indus 0.002582126 -0.230178350 0.16767551
## chas -0.079191514 -0.059496409 -0.01980242
## nox 0.249102086 -0.795136265 -1.27759569
## rm -0.159505691 -0.066226223 -0.24187652
## age 0.340712266 -0.242103870 -0.21623983
## dis -0.053705721 -0.127823780 -0.11315702
## rad 3.435356943 0.912992583 -0.26549986
## tax 0.038660187 -0.030529293 0.68313506
## ptratio 0.141982840 0.084729450 -0.28476531
## black -0.171790159 -0.003720521 0.10895939
## lstat 0.175201187 -0.283848117 0.35371745
## medv 0.202383158 -0.439733862 -0.17053610
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9541 0.0351 0.0108
We see the estimated coefficients of all 3 useful discriminant functions LD1, LD2 and LD3. For example, the first discriminant is **LD1 = 0.077*zn + 0.003*indus + … + 0.202*medv**. Some of these coefficients are very small, suggesting that the contribution of the corresponding variables could be negligible.
For convenience, the values of each discriminant function are scaled so that their mean is zero and their variance is one.
The “proportion of trace” is the percentage of separation achieved by each discriminant function. For our training data we get 95.4% for LD1, 3.5% for LD2 and 1.1% for LD3.
Next we visualize the results as a biplot. For this purpose we transform the classes to numeric and define the function lda.arrows for drawing the LDA biplot arrows. The argument dimen of the plot() function determines how many discriminants are used.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The argument myscale in lda.arrows determines the length of the arrows. By default, arrows are drawn for the first and second axes. This LDA biplot shows good separation only for the class marked as “high”; the 3 other classes overlap quite a lot.
Now the task is to classify new, previously unseen data and calculate the posterior probabilities with the LDA model. In R this is done with the predict() method for LDA objects. Here we use the 20% of the data that we separated as test data. In order to analyze the model performance we use cross tabulation and create a table of the correct and predicted classes.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = test$crime, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 17 5 1 0
## med_low 5 19 3 0
## med_high 0 8 12 1
## high 0 0 1 30
After applying the predict() method, we use the table() function to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified. From the table above we get (17 + 19 + 12 + 30)/102 ≈ 76% correct predictions of the crime class of a new observation.
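This accuracy can also be computed directly:
# proportion of test observations classified into the correct crime category
mean(lda.pred$class == test$crime)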
In clustering the number of classes is unknown and the data are grouped based on their similarity. Here we apply k-means, one of the oldest clustering methods. In order to make the variables comparable we use the standardized Boston data. As a measure of how close the observations are to each other we use the most popular choice, the Euclidean distance. The dist() function creates a distance matrix consisting of the pairwise distances of the observations; for large datasets its computation is time consuming and can need a lot of space.
dist_eu <- dist(boston_scaled, method = "euclidean")
## Warning in dist(boston_scaled, method = "euclidean"): NAs introduced by
## coercion
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1394 3.5270 4.9080 4.9390 6.2420 13.0000
K-means (kmeans()) is an unsupervised method that assigns observations to groups or clusters based on the similarity of the objects. It takes as an argument the number of clusters to look for:
km <-kmeans(dist_eu, centers = 4)
One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. A reasonable number of clusters is reached where the WCSS drops sharply and then levels off.
The total within sum of squares twcss is computed below. K-means can produce different results on every run because it randomly assigns the initial cluster centers; to make the results reproducible we fix the random seed with set.seed(). A common choice for the maximum number of clusters to try is 10.
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')
km <-kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)
The results show that the total within sum of squares drops significantly up to k = 6. In the last plot we run k-means with 2 clusters because the result is easier to display. It seems that for most pairs of variables the clusters are not clearly separated; well-separated clusters can be seen, for example, for the pair of variables (lstat, age).
Next we perform LDA after clustering, using 6 clusters as the target classes. The obtained biplot shows that clustering before LDA does not improve the results.
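Since the code for this step is not shown above, here is a minimal sketch of how it can be done; the standardized numeric data are rebuilt locally and the seed is my own choice:
# LDA after k-means clustering: the 6 cluster labels are used as target classes
boston_std <- as.data.frame(scale(Boston))
set.seed(123)
km6 <- kmeans(boston_std, centers = 6)
lda_km <- lda(factor(km6$cluster) ~ ., data = boston_std)
plot(lda_km, dimen = 2, col = km6$cluster, pch = km6$cluster)
lda.arrows(lda_km, myscale = 2)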
In this chapter we use human data from the United Nations Development Programme, originally published here. The data concern the so-called human development index (HDI), a criterion for the development of a country. HDI is a summary measure of three main groups of variables: a long and healthy life, being knowledgeable, and having a decent standard of living. The original data include 195 observations of 19 variables. The meaning of all variables is described here.
The wrangling includes the following changes:
Two new columns are added to the "Gender inequality" data: the ratio of females to males with secondary education in each country (i.e. edu2F / edu2M), and the ratio of female to male labour force participation in each country (i.e. labF / labM).
Next the "Gender inequality" and "Human development" data sets are joined using the variable Country as the identifier.
The variable GNI is transformed to numeric. Only the columns c("Country", "Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F") are selected for further work.
The rows with missing values are removed. The last 10 observations relate to regions instead of countries, so they are also removed.
Finally the country names are added as row names and the corresponding column is removed from the data frame.
The transformed data involve 155 observations of 8 variables; a sketch of these wrangling steps is given below.
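A hedged sketch of the wrangling in R, assuming raw "Human development" and "Gender inequality" files; the file names and the raw column names (Edu2.F, Edu2.M, Labo.F, Labo.M) are assumptions used only for illustration:
library(dplyr)
# read the raw data (assumed file names)
hd <- read.csv("human_development.csv", stringsAsFactors = FALSE)
gii <- read.csv("gender_inequality.csv", stringsAsFactors = FALSE, na.strings = "..")
# two new ratio columns in the gender inequality data
gii <- mutate(gii, Edu2.FM = Edu2.F / Edu2.M, Labo.FM = Labo.F / Labo.M)
# join the two data sets by country
human <- inner_join(hd, gii, by = "Country")
# GNI to numeric and keep only the selected columns
human$GNI <- as.numeric(gsub(",", "", human$GNI))
keep <- c("Country", "Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
human <- dplyr::select(human, one_of(keep))
# remove rows with missing values and the last 10 rows (regions, not countries)
human <- filter(human, complete.cases(human))
human <- human[1:(nrow(human) - 10), ]
# country names as row names, then drop the Country column
rownames(human) <- human$Country
human <- dplyr::select(human, -Country)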
The preprocessed data look like this:
human <- read.csv("human.csv")
str(human)
## 'data.frame': 155 obs. of 9 variables:
## $ X : Factor w/ 155 levels "Afghanistan",..: 105 6 134 41 101 54 67 149 28 102 ...
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
head(human)
## X Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth
## 1 Norway 1.0072389 0.8908297 17.5 81.6 64992 4 7.8
## 2 Australia 0.9968288 0.8189415 20.2 82.4 42261 6 12.1
## 3 Switzerland 0.9834369 0.8251001 15.8 83.0 56431 6 1.9
## 4 Denmark 0.9886128 0.8840361 18.7 80.2 44025 5 5.1
## 5 Netherlands 0.9690608 0.8286119 17.9 81.6 45435 6 6.2
## 6 Germany 0.9927835 0.8072289 16.5 80.9 43919 7 3.8
## Parli.F
## 1 39.6
## 2 30.5
## 3 28.5
## 4 38.0
## 5 36.9
## 6 36.9
Next we use ggpairs() to create an advanced plot matrix.
library(dplyr)
library(corrplot)
library(GGally)
human_ <- dplyr::select(human, -X)
ggpairs(human_, diag=list(continuous="density", discrete="bar"), axisLabels="show")
## Warning in check_and_set_ggpairs_defaults("diag", diag, continuous =
## "densityDiag", : Changing diag$continuous from 'density' to 'densityDiag'
## Warning in check_and_set_ggpairs_defaults("diag", diag, continuous =
## "densityDiag", : Changing diag$discrete from 'bar' to 'barDiag'
We see that the only variable with a nearly normal distribution is Edu.Exp. It is also highly correlated with almost all other variables. Edu2.FM also correlates with the other variables, but its correlation coefficients are smaller than the corresponding coefficients of Edu.Exp. Life.Exp is positively correlated with GNI, and negatively with Mat.Mor and Ado.Birth. GNI and Mat.Mor are also negatively correlated. Mat.Mor has a strong positive correlation with Ado.Birth.
These dependencies are shown more clearly using corrplot():
cor_matrix <- cor(human_) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
In this chapter we explore two methods for unsupervised learning - principal component analysis (PCA) and correspondence analysis. In unsupervised learning we have only a set of features X1, X2, . . ., Xp measured on n observations; there is no associated response variable Y. Therefore we are not interested in prediction here. The goal is to discover interesting relations among the measurements on X1, X2, . . ., Xp. PCA is a tool used for data visualization or for data pre-processing before applying supervised learning methods. Here we apply this technique for dimensionality reduction. It is performed by the singular value decomposition known from linear algebra. The data are transformed to a new space with an equal or smaller number of dimensions, called principal components. Principal components are expressed by relationships involving the original features X1, X2, . . ., Xp. They are uncorrelated, and each one captures less of the variance than the previous one.
First we perform PCA on the non-standardized human data.
pca_human <- prcomp(human_)
The results of PCA are visualized using biplots. These are scatter plots where the observations are placed on the x and y coordinates defined by the first two principal components, PC1 and PC2. Arrows and labels on these plots represent the connections between the original features and the principal components. The length of an arrow is proportional to the standard deviation of the corresponding feature. The angle between a feature arrow and a PC axis reflects their correlation; a small angle corresponds to a high positive correlation. Analogously, the angle between two arrows corresponds to the correlation between the two features.
The biplot of PCA on the non-standardized human data is obtained using the biplot() function.
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
The points marked in grey are the observations, which correspond to the different countries, while the arrows reflect the standard deviations of the corresponding features.
It is difficult to analyze this plot because the features are measured on very different scales. We observe that GNI has by far the highest impact on PC1, and this correlation is negative.
Next we apply standardization to X1,X2, . . .,Xp.
human_std <- scale(human_)
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
The function scale() computes (x - mean(x)) / sd(x) for each column. Therefore the mean of the standardized data is 0 and their standard deviation is 1.
In contrast with linear regression, in which scaling the variables has no effect on the fit, here scaling changes the results.
In linear regression, multiplying a variable by a factor c leads to multiplication of the corresponding coefficient estimate by a factor of 1/c.
In PCA the situation is different, because the scaling of the variables directly influences the calculation of each principal component. As observed in the biplot above, when the features have different scales, the absolute values of their variances also differ a lot. Then the first principal component loading vector has a very large loading for GNI, since this variable has the highest variance.
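A small sketch to check both claims on the human data; the helper object c_factor is only illustrative:
# scale() matches (x - mean(x)) / sd(x)
x <- human_$GNI
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))
# multiplying a predictor by c divides its regression coefficient by c
c_factor <- 1000
m1 <- lm(Life.Exp ~ GNI, data = human_)
m2 <- lm(Life.Exp ~ I(GNI * c_factor), data = human_)
coef(m1)["GNI"] / coef(m2)["I(GNI * c_factor)"]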
We perform PCA on the scaled data.
pca_human <- prcomp(human_std)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
After comparing the last two biplots we see that scaling has a substantial effect on the results. Because it is undesirable for the principal components to depend on an arbitrary choice of scaling, we typically scale each variable to have standard deviation one before performing PCA.
The last biplot shows that Mat.Mor is highly correlated with PC1, while Parli.F is highly correlated with PC2. We can assume that the correlation between these two original features is negligible, so PC1 places most of its weight on Mat.Mor while PC2 places most of its weight on Parli.F. We also observe high correlations between Mat.Mor, Ado.Birth, Edu.Exp, Edu2.FM, Life.Exp and GNI, which were also visible in the correlation matrix plot. The observations on the right-hand side of this biplot are countries where the maternal mortality ratio and the adolescent birth rate are high, while at the same time (because of the negative correlation) the gross national income per capita, the expected years of schooling, the ratio of females to males with at least secondary education, and the life expectancy at birth are low. On the left-hand side of the biplot we see the opposite situation: countries where the maternal mortality ratio and adolescent birth rate are low, and at the same time the gross national income per capita, expected years of schooling, female-to-male secondary education ratio, and life expectancy at birth are high. In the upper part of the plot are the countries where the ratio of female to male labour force participation is high and the percentage of female representation in the parliament is also high. The opposite is observed in the bottom part of the biplot. Therefore we can conclude that the countries where we expect a high human development index are situated in the upper left part of this biplot.
We take a look at the summary of the PCA:
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
and round the percentage of variance captured by each PC.
pca_pr <- round(100 * s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
Finally we add these percentages of variance to the axis labels of the biplot.
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
With this new information we can answer the question of how much of the information in the human data set is lost by projecting the observations onto the first two principal components: the amount of variance not captured by PC1 and PC2 is 100 - (53.6 + 16.2) = 30.2%.
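The same number can be obtained directly from the rounded percentages computed above:
# variance not captured by the first two principal components (30.2)
100 - sum(pca_pr[1:2])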
In the case of categorical variables, dimensionality reduction can be performed using correspondence analysis (CA, for two variables) or multiple correspondence analysis (MCA, for more than two). MCA is a generalization of PCA and an extension of CA. In this analysis, cross-tabulation matrices between all variables in the data set are used, and the information from the cross-tabulations is presented in a clear graphical form. MCA can also be used, for example, as pre-processing for clustering.
Here we use the tea data from the FactoMineR package and select 6 categorical variables from the full dataset.
library(FactoMineR)
library(tidyr)
library(ggplot2)
library(dplyr)
data("tea")
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
Only the variables sugar and lunch are binary; the others have 3 (Tea, how, where) or 4 (How) levels. Therefore the MCA approach is suitable for investigating these data.
Graphically the data are presented as bar plots using the ggplot() function.
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
The number of people who drink their tea with sugar is almost the same as the number who do not. The other variables have clearly unequal numbers of observations across their levels. There are no variable categories with a very low frequency, which could otherwise distort the MCA.
Next, MCA is performed on the chosen tea data.
mca <- MCA(tea_time, graph = FALSE)
In a correspondence analysis, a crosstabulation table of frequencies is first standardized, so that the relative frequencies across all cells sum to 1.0. The goal of (M)CA is to represent the entries in the table of relative frequencies in terms of the distances between individual rows and/or columns in a low-dimensional space. In the terminology of (M)CA, the row and column totals of the matrix of relative frequencies are called the row mass and column mass.
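As an illustration of these concepts, a small sketch computing the relative frequencies and the row and column masses for one cross tabulation of the tea data:
# relative frequencies of one crosstab sum to 1
tab <- table(tea_time$how, tea_time$where)
P <- tab / sum(tab)
rowSums(P)  # row masses
colSums(P)  # column masses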
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
The MCA results are interpreted in the same way as the results of a simple CA.
The first table contains the variances and the percentage of variance retained by each dimension. The eigenvalues correspond to the amount of information retained by each axis. The dimensions are ordered decreasingly by the amount of variance explained: dimension 1 explains the most variance in the solution, followed by dimension 2, and so on down to dimension 11.
The second table of the summary() output contains the coordinates, the contribution and the cos2 (quality of representation, in [0, 1]) of the first 10 active individuals on dimensions 1 and 2.
The third table contains the coordinates, the contribution and the cos2 (quality of representation, in [0, 1]) of the first 10 active variable categories on dimensions 1 and 2. This table also contains a column called v.test. For a given variable category, if the absolute value of the v.test is greater than 2, the coordinate is significantly different from 0. This is the case for almost all categories on Dim.1, except alone and other. The absolute value of the v.test is less than 2 for the category black on Dim.2, and for 5 categories on Dim.3: green, milk, tea bag, tea bag+unpackaged and unpackaged.
The fourth table, categorical variables (eta2), contains the squared correlation between each variable and the dimensions. A value close to 1 indicates a strong link between the variable and the dimension. On Dim.1 this is the case for the variables how and where. In other words, this table shows that how and where are strongly linked to the first dimension and a little less to the second, while Tea, How, sugar and lunch are more related to the third dimension.
We first see from the fourth table which variables are linked to which dimensions, and then from the third table we observe which categories are linked to those dimensions.
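For reference, the squared correlations discussed above can presumably be extracted directly from the MCA object; a sketch assuming FactoMineR stores them in mca$var$eta2:
# squared correlations (eta2) between the variables and the first three dimensions
round(mca$var$eta2[, 1:3], 3)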
To further clarify the MCA results we use a graphical representation. The MCA scatter plot draws a biplot of the individuals and the variable categories.
plot(mca, invisible=c("ind"), habillage = "quali")
The MCA factor map is a two-dimensional biplot of the active category levels. On this plot the distances between the levels and the barycentre, and the distances between the levels themselves, provide information about the similarity of the different categories.
Only the first two dimensions are shown on this biplot. We observe that the categories Earl Grey, sugar and milk are close to each other, so people who prefer Earl Grey most probably drink it with sugar and milk. There is similarity between the unpackaged and tea shop categories, and also between the tea bag+unpackaged and chain store+tea shop categories. The category green stands alone, quite far from the other categories.