conducted by Neli Noykova.
In this exercise set we use the Finnish sample from the ISSP 2012 survey “Family and Changing Gender Roles IV”. The original data comprise 1171 observations of 8 variables (4 substantive and 4 demographic). All variables are categorical.
The 4 substantive variables, whose values are measured on a 1-5 scale, are:
A: Married people are generally happier than unmarried people.
B: People who want children ought to get married.
C: It is all right for a couple to live together without intending to get married.
D: Divorce is usually the best solution when a couple can’t seem to work out their marriage problems.
The demographic variables are:
g: gender (1=male, 2=female)
a: age group (1=16-25, 2=26-35, 3=36-45, 4=46-55, 5=56-65, 6=66+)
e: education (1=Primary, 2=Comprehensive, primary and lower secondary, 3=Post-comprehensive, vocational school or course, 4=General upper secondary education or certificate, 5=Vocational post-secondary non-tertiary education, 6=Polytechnics, 7=University, lower academic degree, BA, 8=University, higher academic degree, MA)
p: Living in steady partnership (1=Yes, have partner; live in same household, 2=Yes, have partner; don’t live in same household, 3=No partner)
The data wrangling includes the following changes:
The missing data are removed (this has already been provided). The number of observations without missing data is N=924.
For Task 2 a combined variable ga, describing the interaction of the gender and age categories, is formed as: **ga = 6*(g-1) + a** (for example, a female, g=2, in age group a=3 gets ga = 6·1 + 3 = 9).
The preliminarily treated data look as follows:
Finland <- read.table("Finland.txt")
head(Finland)
## A B C D g a e p
## 1 3 3 1 2 1 2 4 3
## 2 3 2 3 2 1 4 2 3
## 3 3 3 1 3 1 3 8 1
## 4 3 2 2 3 2 2 6 1
## 5 2 2 2 3 2 4 5 1
## 6 3 3 1 3 1 3 7 3
dim(Finland)
## [1] 924 8
str(Finland)
## 'data.frame': 924 obs. of 8 variables:
## $ A: int 3 3 3 3 2 3 3 2 2 3 ...
## $ B: int 3 2 3 2 2 3 3 3 3 3 ...
## $ C: int 1 3 1 2 2 1 1 1 2 3 ...
## $ D: int 2 2 3 3 3 3 2 2 3 3 ...
## $ g: int 1 1 1 2 2 1 2 2 2 2 ...
## $ a: int 2 4 3 2 4 3 1 2 5 4 ...
## $ e: int 4 2 8 6 5 7 4 8 7 5 ...
## $ p: int 3 3 1 1 1 3 3 3 1 3 ...
The goal is to discover interesting relations among the measurements on A, …, D, g, …, p. Correspondence analysis (CA) is a descriptive/exploratory technique designed to analyse simple two-way and multi-way tables containing some measure of correspondence between the rows and columns. CA may also be defined as a special case of principal components analysis (PCA) applied to the rows and columns of a table, especially a cross-tabulation. PCA is used for tables consisting of continuous measurements, whereas CA is applied to contingency tables (i.e. cross-tabulations). The goal of CA is to transform a table of numerical information into a graphical display in which each row and each column is depicted as a point. Thus CA can be used for data visualization or for data pre-processing before applying supervised learning methods, and it also serves as a dimensionality reduction technique.
First we provide a cross-tabulation between the age variable a and statement A: Married people are generally happier than unmarried people. The variable a consists of 6 age groups, corresponding to the row names of the cross-table: a1 = 16-25, a2 = 26-35, a3 = 36-45, a4 = 46-55, a5 = 56-65, a6 = 66+. The categorical variable A assumes 5 possible values, which correspond to the column names of the cross-table: SA - strongly agree, A - agree, NN - neither agree nor disagree, D - disagree, SD - strongly disagree.
tab <-table(Finland[,"a"], Finland[,"A"])
rownames(tab) <- c("a1","a2","a3","a4","a5","a6")
colnames(tab) <- c("SA","A","NN","D","SD")
tab
##
## SA A NN D SD
## a1 5 18 42 29 29
## a2 6 25 44 34 34
## a3 6 30 55 30 17
## a4 9 37 60 46 34
## a5 15 67 58 49 15
## a6 5 37 56 26 6
From this table we see that the largest part of age group a5 agrees with statement A. Since these values are absolute counts, more precise information can be obtained from the row and column profiles calculated from this cross-table. We also need to calculate the chi-square statistic, as was shown during the second exercise session.
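As a minimal sketch (these calls are an addition, not part of the original output), the row and column profiles and the chi-square statistic can be obtained directly from the cross-table:
# Row profiles: each row of the cross-table divided by its row total
round(prop.table(tab, margin = 1), 2)
# Column profiles: each column divided by its column total
round(prop.table(tab, margin = 2), 2)
# Chi-square test of independence between age group and statement A
chisq.test(tab)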
An alternative means of extracting the nature of the dependency between the rows and columns of the contingency table is to represent the row or column profiles graphically.
require(ca)
## Loading required package: ca
plot(ca(tab))
This plot is asymmetric because the row profiles are plotted simultaneously with the apexes representing the columns. a3 is strongly “associated” with the answer NN, which is clearly the case from the profile presented in the cross-table. Likewise we observe some proximity of a5 to the apex representing the answer A to statement A. The fact that a1, a2 and a4 are positioned relatively far away from the answers SA and A means that these younger people rather disagree than agree with statement A. The age group a6 is far from all answers to A, which probably means that people from this age group do not care so much about statement A.
As a second example we make a cross-tabulation between the variable p (whether there is a partner and whether they live together) and statement C: It is all right for a couple to live together without intending to get married. The obtained cross-table and the resulting CA plot are:
#Cross-tabulations
#between p and C
tab2 <-table(Finland[,"p"], Finland[,"C"])
rownames(tab2) <- c("Live together","Not live together", "No partner")
colnames(tab2) <- c("SA","A","NN","D","SD")
tab2
##
## SA A NN D SD
## Live together 291 247 44 32 29
## Not live together 26 10 5 1 0
## No partner 119 79 15 11 15
require(ca)
plot(ca(tab2))
From the cross-tabulation and the CA plot it becomes clear that people who live together tend to give a positive answer to statement C.
As a third example we investigate the relationship between education e and statement A:
tab1 <-table(Finland[,"e"], Finland[,"A"])
rownames(tab1) <- c("Primary","Comprehensive", "Post-comprehensive", "General", "Vocational", "Polytechnics", "University, B", "University, MSc")
colnames(tab1) <- c("SA","A","NN","D","SD")
tab1
##
## SA A NN D SD
## Primary 3 28 34 14 3
## Comprehensive 6 9 18 16 10
## Post-comprehensive 16 37 71 54 26
## General 1 20 39 18 18
## Vocational 4 49 63 53 39
## Polytechnics 7 25 28 22 16
## University, B 2 11 23 16 5
## University, MSc 7 35 39 21 18
require(ca)
plot(ca(tab1))
From this plot it becomes clear that none of the educational groups strongly agrees with statement A. Highly educated people with university degrees tend to answer using the middle alternative.
The variable ga, which is the interaction of gender and age categories, is created as:
Finland$ga <- 6*(Finland$g-1) + Finland$a
head(Finland)
## A B C D g a e p ga
## 1 3 3 1 2 1 2 4 3 2
## 2 3 2 3 2 1 4 2 3 4
## 3 3 3 1 3 1 3 8 1 3
## 4 3 2 2 3 2 2 6 1 8
## 5 2 2 2 3 2 4 5 1 10
## 6 3 3 1 3 1 3 7 3 3
str(Finland)
## 'data.frame': 924 obs. of 9 variables:
## $ A : int 3 3 3 3 2 3 3 2 2 3 ...
## $ B : int 3 2 3 2 2 3 3 3 3 3 ...
## $ C : int 1 3 1 2 2 1 1 1 2 3 ...
## $ D : int 2 2 3 3 3 3 2 2 3 3 ...
## $ g : int 1 1 1 2 2 1 2 2 2 2 ...
## $ a : int 2 4 3 2 4 3 1 2 5 4 ...
## $ e : int 4 2 8 6 5 7 4 8 7 5 ...
## $ p : int 3 3 1 1 1 3 3 3 1 3 ...
## $ ga: num 2 4 3 8 10 3 7 8 11 10 ...
The cross-tabulation between ga and C is computed as:
tab3 <-table(Finland[,"ga"], Finland[,"C"])
rownames(tab3) <- c("Ma1","Ma2", "Ma3", "Ma4", "Ma5", "Ma6", "Fa1","Fa2", "Fa3", "Fa4", "Fa5", "Fa6")
colnames(tab3) <- c("SA","A","NN","D","SD")
tab3
##
## SA A NN D SD
## Ma1 30 11 3 0 4
## Ma2 46 15 0 1 0
## Ma3 26 31 0 4 1
## Ma4 38 38 7 2 2
## Ma5 29 36 9 7 3
## Ma6 8 37 6 8 2
## Fa1 38 21 1 6 9
## Fa2 59 16 1 0 5
## Fa3 43 25 4 2 2
## Fa4 63 24 7 2 3
## Fa5 42 45 18 8 7
## Fa6 14 37 8 4 6
The CA between ga and C:
require(ca)
plot(ca(tab3))
From this plot it becomes clear that females of age group 4 most often strongly agree with statement C, while males in age group 5 most often agree with C.
The most classical and standard approach to MCA is to apply a simple CA to the indicator matrix Z. The indicator matrix Z = {z_ij} corresponds to a binary coding of the factors: instead of using a factor with J_q levels, one uses J_q columns containing binary values, also called dummy variables. The Burt matrix C is obtained directly from the indicator matrix Z as C = Z^T Z. The Burt matrix concatenates all two-way cross-tabulations between pairs of variables.
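As a minimal sketch of this construction (an addition using only the four substantive variables, assuming the Finland data frame loaded above), the indicator matrix and the Burt matrix can be built by hand and compared with the mjca output:
# Dummy-code each substantive variable and bind the dummies into Z
vars <- c("A", "B", "C", "D")
Z <- do.call(cbind, lapply(vars, function(q) {
  f  <- factor(Finland[[q]])
  Zq <- model.matrix(~ f - 1)                  # one 0/1 column per category
  colnames(Zq) <- paste(q, levels(f), sep = ":")
  Zq
}))
C.burt <- t(Z) %*% Z                           # Burt matrix: C = Z^T Z
dim(Z)                                         # 924 x 20
dim(C.burt)                                    # 20 x 20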
The Burt matrix is computed as:
require(ca)
Finland.B <- mjca(Finland)$Burt
head(Finland.B)
## A:1 A:2 A:3 A:4 A:5 B:1 B:2 B:3 B:4 B:5 C:1 C:2 C:3 C:4 C:5 D:1 D:2
## A:1 46 0 0 0 0 28 12 4 2 0 15 13 6 6 6 13 15
## A:2 0 214 0 0 0 50 92 38 25 9 64 101 17 15 17 28 91
## A:3 0 0 315 0 0 24 92 95 83 21 145 120 30 14 6 55 115
## A:4 0 0 0 214 0 15 42 30 97 30 104 88 10 9 3 41 84
## A:5 0 0 0 0 135 9 10 6 24 86 108 14 1 0 12 40 43
## B:1 28 50 24 15 9 126 0 0 0 0 26 34 17 15 34 29 24
## D:3 D:4 D:5 g:1 g:2 a:1 a:2 a:3 a:4 a:5 a:6 e:1 e:2 e:3 e:4 e:5 e:6
## A:1 5 7 6 28 18 5 6 6 9 15 5 3 6 16 1 4 7
## A:2 46 36 13 113 101 18 25 30 37 67 37 28 9 37 20 49 25
## A:3 88 43 14 136 179 42 44 55 60 58 56 34 18 71 39 63 28
## A:4 41 43 5 85 129 29 34 30 46 49 26 14 16 54 18 53 22
## A:5 27 12 13 42 93 29 34 17 34 15 6 3 10 26 18 39 16
## B:1 20 29 24 59 67 15 8 16 16 41 30 15 10 27 12 32 7
## e:7 e:8 p:1 p:2 p:3 ga:1 ga:2 ga:3 ga:4 ga:5 ga:6 ga:7 ga:8 ga:9 ga:10
## A:1 2 7 35 1 10 3 3 4 6 10 2 2 3 2 3
## A:2 11 35 171 8 35 10 12 16 22 29 24 8 13 14 15
## A:3 23 39 224 12 79 16 21 21 32 21 25 26 23 34 28
## A:4 16 21 141 14 59 8 14 13 20 20 10 21 20 17 26
## A:5 5 18 72 7 56 11 12 8 7 4 0 18 22 9 27
## B:1 4 19 97 2 27 7 3 8 8 19 14 8 5 8 8
## ga:11 ga:12
## A:1 5 3
## A:2 38 13
## A:3 37 31
## A:4 29 16
## A:5 11 6
## B:1 22 16
summary(Finland.B)
## A:1 A:2 A:3 A:4
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 3.000 1st Qu.: 13.00 1st Qu.: 21.00 1st Qu.: 14.00
## Median : 6.000 Median : 25.00 Median : 34.00 Median : 22.00
## Mean : 8.118 Mean : 37.76 Mean : 55.59 Mean : 37.76
## 3rd Qu.:10.000 3rd Qu.: 38.00 3rd Qu.: 67.00 3rd Qu.: 44.50
## Max. :46.000 Max. :214.00 Max. :315.00 Max. :214.00
## A:5 B:1 B:2 B:3
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 6.50 1st Qu.: 8.00 1st Qu.: 14.00 1st Qu.: 10.00
## Median : 13.00 Median : 16.00 Median : 27.00 Median : 20.00
## Mean : 23.82 Mean : 22.24 Mean : 43.76 Mean : 30.53
## 3rd Qu.: 28.00 3rd Qu.: 27.50 3rd Qu.: 52.00 3rd Qu.: 37.50
## Max. :135.00 Max. :126.00 Max. :248.00 Max. :173.00
## B:4 B:5 C:1 C:2
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 13.50 1st Qu.: 5.00 1st Qu.: 26.00 1st Qu.: 19.50
## Median : 25.00 Median : 16.00 Median : 59.00 Median : 36.00
## Mean : 40.76 Mean : 25.76 Mean : 76.94 Mean : 59.29
## 3rd Qu.: 50.50 3rd Qu.: 31.50 3rd Qu.:102.50 3rd Qu.: 77.00
## Max. :231.00 Max. :146.00 Max. :436.00 Max. :336.00
## C:3 C:4 C:5 D:1
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3.00 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 12.00
## Median : 7.00 Median : 5.000 Median : 5.000 Median : 22.00
## Mean :11.29 Mean : 7.765 Mean : 7.765 Mean : 31.24
## 3rd Qu.:17.00 3rd Qu.:12.000 3rd Qu.: 9.500 3rd Qu.: 38.00
## Max. :64.00 Max. :44.000 Max. :44.000 Max. :177.00
## D:2 D:3 D:4 D:5
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 22.50 1st Qu.: 15.00 1st Qu.: 10.00 1st Qu.: 3.5
## Median : 37.00 Median : 22.00 Median : 15.00 Median : 6.0
## Mean : 61.41 Mean : 36.53 Mean : 24.88 Mean : 9.0
## 3rd Qu.: 85.00 3rd Qu.: 45.50 3rd Qu.: 30.50 3rd Qu.:11.0
## Max. :348.00 Max. :207.00 Max. :141.00 Max. :51.0
## g:1 g:2 a:1 a:2
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 26.50 1st Qu.: 33.50 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 61.00 Median : 76.00 Median : 13.00 Median : 17.00
## Mean : 71.29 Mean : 91.76 Mean : 21.71 Mean : 25.24
## 3rd Qu.: 87.00 3rd Qu.:117.00 3rd Qu.: 30.00 3rd Qu.: 35.50
## Max. :404.00 Max. :520.00 Max. :123.00 Max. :143.00
## a:3 a:4 a:5 a:6
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00
## Median : 16.00 Median : 16.00 Median : 16.0 Median : 11.00
## Mean : 24.35 Mean : 32.82 Mean : 36.0 Mean : 22.94
## 3rd Qu.: 33.00 3rd Qu.: 46.50 3rd Qu.: 53.5 3rd Qu.: 31.00
## Max. :138.00 Max. :186.00 Max. :204.0 Max. :130.00
## e:1 e:2 e:3 e:4
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 1.00 1st Qu.: 2.50 1st Qu.: 12.5 1st Qu.: 4.00
## Median : 8.00 Median : 7.00 Median : 25.0 Median :11.00
## Mean :14.47 Mean :10.41 Mean : 36.0 Mean :16.94
## 3rd Qu.:22.50 3rd Qu.:15.50 3rd Qu.: 48.0 3rd Qu.:22.00
## Max. :82.00 Max. :59.00 Max. :204.0 Max. :96.00
## e:5 e:6 e:7 e:8
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 3.50 1st Qu.: 2.00 1st Qu.: 5.00
## Median : 24.00 Median :13.00 Median : 6.00 Median : 16.00
## Mean : 36.71 Mean :17.29 Mean :10.06 Mean : 21.18
## 3rd Qu.: 51.00 3rd Qu.:22.50 3rd Qu.:13.00 3rd Qu.: 28.00
## Max. :208.00 Max. :98.00 Max. :57.00 Max. :120.00
## p:1 p:2 p:3 ga:1
## Min. : 0.0 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 46.0 1st Qu.: 3.000 1st Qu.: 16.00 1st Qu.: 0.000
## Median : 88.0 Median : 5.000 Median : 34.00 Median : 3.000
## Mean :113.5 Mean : 7.412 Mean : 42.18 Mean : 8.471
## 3rd Qu.:142.0 3rd Qu.: 9.500 3rd Qu.: 53.50 3rd Qu.:11.000
## Max. :643.0 Max. :42.000 Max. :239.00 Max. :48.000
## ga:2 ga:3 ga:4 ga:5
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 3.00 Median : 5.00 Median : 7.00 Median : 7.00
## Mean :10.94 Mean :10.94 Mean :15.35 Mean :14.82
## 3rd Qu.:14.50 3rd Qu.:14.50 3rd Qu.:21.00 3rd Qu.:19.00
## Max. :62.00 Max. :62.00 Max. :87.00 Max. :84.00
## ga:6 ga:7 ga:8 ga:9
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 4.00 Median : 6.00 Median : 5.00 Median : 4.00
## Mean :10.76 Mean :13.24 Mean :14.29 Mean :13.41
## 3rd Qu.:12.50 3rd Qu.:19.50 3rd Qu.:19.50 3rd Qu.:17.50
## Max. :61.00 Max. :75.00 Max. :81.00 Max. :76.00
## ga:10 ga:11 ga:12
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 6.00 Median : 10.00 Median : 5.00
## Mean :17.47 Mean : 21.18 Mean :12.18
## 3rd Qu.:24.00 3rd Qu.: 27.00 3rd Qu.:15.50
## Max. :99.00 Max. :120.00 Max. :69.00
We take the stacked tables of A and D substantive variables from the whole Burt matrix:
Finland.AD = Finland.B[c(1:5,16:20), c(1:5,16:20)]
A numerical summary of this submatrix is:
summary(Finland.AD)
## A:1 A:2 A:3 A:4
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.0
## Median : 5.5 Median : 20.5 Median : 28.50 Median : 23.0
## Mean : 9.2 Mean : 42.8 Mean : 63.00 Mean : 42.8
## 3rd Qu.:11.5 3rd Qu.: 43.5 3rd Qu.: 79.75 3rd Qu.: 42.5
## Max. :46.0 Max. :214.0 Max. :315.00 Max. :214.0
## A:5 D:1 D:2 D:3
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 12.50 Median : 20.50 Median : 29.00 Median : 16.00
## Mean : 27.00 Mean : 35.40 Mean : 69.60 Mean : 41.40
## 3rd Qu.: 36.75 3rd Qu.: 40.75 3rd Qu.: 89.25 3rd Qu.: 44.75
## Max. :135.00 Max. :177.00 Max. :348.00 Max. :207.00
## D:4 D:5
## Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.0
## Median : 9.50 Median : 5.5
## Mean : 28.20 Mean :10.2
## 3rd Qu.: 41.25 3rd Qu.:13.0
## Max. :141.00 Max. :51.0
The computation of MCA is the application of the (simple) CA algorithm to the Burt matrix C.
The following CA is conducted:
plot(ca(Finland.AD), main="MCA = CA of Burt")
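To also inspect the principal inertias (eigenvalues) of this CA explicitly, one could additionally run, for example (an addition, not shown in the original output):
summary(ca(Finland.AD))   # principal inertias plus row/column contributions
ca(Finland.AD)$sv^2       # squared singular values = principal inertias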
Here I use the script from Exercise set 2 about joint correspondence analysis and its fit to two-way tables:
Finland.mca <- mjca(Finland[,1:4], ps="", lambda="JCA")
par(mar=c(4.2,4,3,1), mfrow=c(1,1), font.lab=2)
plot(Finland.mca, main="Joint correspondence analysis")
The example code from Exercise set 3 is used; only the data are different.
MCA as the CA of the indicator matrix:
require(ca)
Finland.mca1 <- mjca(Finland[,1:4], lambda="indicator", ps="", reti=T)
sum(Finland.mca1$sv^2)
## [1] 4
summary(Finland.mca1)
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.506305 12.7 12.7 ***
## 2 0.475751 11.9 24.6 ***
## 3 0.322936 8.1 32.6 **
## 4 0.309594 7.7 40.4 **
## 5 0.274095 6.9 47.2 **
## 6 0.268731 6.7 53.9 **
## 7 0.250166 6.3 60.2 **
## 8 0.236639 5.9 66.1 *
## 9 0.225387 5.6 71.7 *
## 10 0.217098 5.4 77.2 *
## 11 0.209046 5.2 82.4 *
## 12 0.186252 4.7 87.0 *
## 13 0.162414 4.1 91.1 *
## 14 0.142888 3.6 94.7 *
## 15 0.125436 3.1 97.8 *
## 16 0.087262 2.2 100.0 *
## -------- -----
## Total: 4.000000 100.0
##
##
## Columns:
## name mass qlt inr k=1 cor ctr k=2 cor ctr
## 1 | A1 | 12 198 57 | -779 32 15 | 1783 167 83 |
## 2 | A2 | 58 181 46 | -730 161 61 | 257 20 8 |
## 3 | A3 | 85 95 38 | -187 18 6 | -384 76 26 |
## 4 | A4 | 58 60 44 | 142 6 2 | -421 54 22 |
## 5 | A5 | 37 509 63 | 1635 458 193 | 550 52 23 |
## 6 | B1 | 34 616 65 | -694 76 32 | 1850 540 245 |
## 7 | B2 | 67 244 47 | -767 216 78 | -276 28 11 |
## 8 | B3 | 47 84 46 | -239 13 5 | -555 71 30 |
## 9 | B4 | 62 110 45 | 281 26 10 | -501 84 33 |
## 10 | B5 | 40 588 66 | 1740 568 236 | 322 19 9 |
## 11 | C1 | 118 559 39 | 789 556 145 | -57 3 1 |
## 12 | C2 | 91 379 42 | -650 241 76 | -491 138 46 |
## 13 | C3 | 17 65 52 | -894 59 27 | 279 6 3 |
## 14 | C4 | 12 101 54 | -1130 64 30 | 863 37 19 |
## 15 | C5 | 12 472 66 | -430 9 4 | 3041 462 231 |
## 16 | D1 | 48 159 47 | 795 150 60 | 196 9 4 |
## 17 | D2 | 94 155 36 | -247 37 11 | -443 118 39 |
## 18 | D3 | 56 12 42 | -23 0 0 | -203 12 5 |
## 19 | D4 | 38 35 46 | -310 17 7 | 310 17 8 |
## 20 | D5 | 14 313 59 | -123 1 0 | 2310 312 155 |
We calculate the number of categories:
J <- 4*5 -1
The number of substantive questions is Q=4. Then:
Q <- 4
(J-Q)/Q
## [1] 3.75
We check the contents of the following components of the mjca object (eigenvalues (squared singular values), standard coordinates of the columns, standard coordinates of the first 10 rows, variable names, category names, the first 10 rows of the 924 x 20 indicator matrix, and the first 20 rows and columns of the 20 x 20 Burt matrix):
names(Finland.mca1)
## [1] "sv" "lambda" "inertia.e" "inertia.t" "inertia.et"
## [6] "levelnames" "factors" "levels.n" "nd" "nd.max"
## [11] "rownames" "rowmass" "rowdist" "rowinertia" "rowcoord"
## [16] "rowpcoord" "rowctr" "rowcor" "colnames" "colmass"
## [21] "coldist" "colinertia" "colcoord" "colpcoord" "colctr"
## [26] "colcor" "colsup" "subsetcol" "Burt" "Burt.upd"
## [31] "subinertia" "JCA.iter" "indmat" "call"
Finland.mca1$sv^2 # eigenvalues (squared singular values) on all 16 dimensions
## [1] 0.50630492 0.47575147 0.32293568 0.30959403 0.27409529 0.26873127
## [7] 0.25016609 0.23663902 0.22538662 0.21709794 0.20904570 0.18625197
## [13] 0.16241364 0.14288849 0.12543559 0.08726229
Finland.mca1$colcoord[,1:4] # standard coordinates of columns
## [,1] [,2] [,3] [,4]
## [1,] -1.09485075 2.58466228 0.38117469 -1.0611142
## [2,] -1.02632480 0.37232989 -0.47157445 1.4201562
## [3,] -0.26345610 -0.55714283 -0.80286444 -1.3121928
## [4,] 0.19957060 -0.61095108 2.14671089 -0.1271033
## [5,] 2.29834966 0.79755898 -0.91193285 1.3736196
## [6,] -0.97470923 2.68220803 0.48209843 -0.6463525
## [7,] -1.07792715 -0.39988664 -0.66328834 1.5112705
## [8,] -0.33602699 -0.80484225 -1.22794762 -1.6797485
## [9,] 0.39518935 -0.72610435 1.92930121 -0.7637834
## [10,] 2.44509059 0.46699649 -0.88686667 1.1895602
## [11,] 1.10913902 -0.08237802 0.03901019 -0.2765203
## [12,] -0.91287622 -0.71121198 0.25165618 0.6290356
## [13,] -1.25648834 0.40439859 -2.40903951 -1.9962482
## [14,] -1.58745288 1.25079576 0.30637839 1.2613963
## [15,] -0.60443232 4.40835272 0.88938548 -0.4212425
## [16,] 1.11702020 0.28392642 -0.07957831 -0.3599156
## [17,] -0.34677975 -0.64190277 -0.07750072 0.8506859
## [18,] -0.03261598 -0.29495380 -0.87687783 -0.9136392
## [19,] -0.43594396 0.44926861 1.62187492 -0.1096170
## [20,] -0.17281598 3.34972059 -0.11990279 -0.5442025
Finland.mca1$rowcoord[1:10,1:4] # standard coordinates of first 10 rows
## [,1] [,2] [,3] [,4]
## [1,] 0.05722575 -0.7561703 -0.9103450 -1.08632489
## [2,] -1.03458890 -0.4329605 -1.7389019 -0.42526271
## [3,] 0.16760566 -0.6304181 -1.2620137 -1.87904947
## [4,] -0.80348254 -0.7115631 -0.9200550 -0.03842743
## [5,] -1.07151274 -0.3746743 -0.7743111 1.18923762
## [6,] 0.16760566 -0.6304181 -1.2620137 -1.87904947
## [7,] 0.05722575 -0.7561703 -0.9103450 -1.08632489
## [8,] -0.21080445 -0.4192814 -0.7646011 0.14134015
## [9,] -0.81084974 -0.5214511 -1.0227208 -0.24451133
## [10,] -0.66354598 -0.4539851 -2.3389802 -2.65173623
Finland.mca1$colnames # variable names
## [1] "A" "B" "C" "D"
Finland.mca1$levelnames # category names
## [1] "A1" "A2" "A3" "A4" "A5" "B1" "B2" "B3" "B4" "B5" "C1" "C2" "C3" "C4"
## [15] "C5" "D1" "D2" "D3" "D4" "D5"
Finland.mca1$indmat[1:10,] # first 10 rows of 924 x 20 indicator matrix
## A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 C1 C2 C3 C4 C5 D1 D2 D3 D4 D5
## 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0
## 2 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0
## 3 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0
## 4 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0
## 5 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0
## 6 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0
## 7 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0
## 8 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0
## 9 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0
## 10 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0
Finland.mca1$Burt[1:20, 1:20] # first 20 rows and columns of 20 x 20 Burt matrix
## A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 C1 C2 C3 C4 C5 D1 D2 D3 D4
## A1 46 0 0 0 0 28 12 4 2 0 15 13 6 6 6 13 15 5 7
## A2 0 214 0 0 0 50 92 38 25 9 64 101 17 15 17 28 91 46 36
## A3 0 0 315 0 0 24 92 95 83 21 145 120 30 14 6 55 115 88 43
## A4 0 0 0 214 0 15 42 30 97 30 104 88 10 9 3 41 84 41 43
## A5 0 0 0 0 135 9 10 6 24 86 108 14 1 0 12 40 43 27 12
## B1 28 50 24 15 9 126 0 0 0 0 26 34 17 15 34 29 24 20 29
## B2 12 92 92 42 10 0 248 0 0 0 52 141 27 25 3 30 119 53 37
## B3 4 38 95 30 6 0 0 173 0 0 79 75 17 1 1 28 75 45 23
## B4 2 25 83 97 24 0 0 0 231 0 142 82 3 3 1 41 90 55 37
## B5 0 9 21 30 86 0 0 0 0 146 137 4 0 0 5 49 40 34 15
## C1 15 64 145 104 108 26 52 79 142 137 436 0 0 0 0 128 137 93 57
## C2 13 101 120 88 14 34 141 75 82 4 0 336 0 0 0 35 175 73 51
## C3 6 17 30 10 1 17 27 17 3 0 0 0 64 0 0 7 22 22 7
## C4 6 15 14 9 0 15 25 1 3 0 0 0 0 44 0 3 12 12 11
## C5 6 17 6 3 12 34 3 1 1 5 0 0 0 0 44 4 2 7 15
## D1 13 28 55 41 40 29 30 28 41 49 128 35 7 3 4 177 0 0 0
## D2 15 91 115 84 43 24 119 75 90 40 137 175 22 12 2 0 348 0 0
## D3 5 46 88 41 27 20 53 45 55 34 93 73 22 12 7 0 0 207 0
## D4 7 36 43 43 12 29 37 23 37 15 57 51 7 11 15 0 0 0 141
## D5 6 13 14 5 13 24 9 2 8 8 21 2 6 6 16 0 0 0 0
## D5
## A1 6
## A2 13
## A3 14
## A4 5
## A5 13
## B1 24
## B2 9
## B3 2
## B4 8
## B5 8
## C1 21
## C2 2
## C3 6
## C4 6
## C5 16
## D1 0
## D2 0
## D3 0
## D4 0
## D5 51
Next we compute the positions of the individual cases (respondents) and plot them. Each case is placed at the average of the standard coordinates of its Q = 4 response categories, which is why the product of the indicator matrix and the column coordinates is divided by 4.
Finland.rpc <- Finland.mca1$indmat %*% Finland.mca1$colcoord[,1:2] / 4
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), mfrow=c(1,1), font.lab=2)
plot(Finland.mca1, labels=c(0,0), map="rowprincipal", col=c("black","white"))
points(Finland.rpc, pch=19, cex=0.5, col="lightblue")
text(Finland.mca1$colcoord, labels=Finland.mca1$levelnames, col="red", font=4, cex=0.8)
The MCA is calculated as a CA of the Burt matrix:
Finland.mca2 <- mjca(Finland[,1:4], lambda="Burt", ps="")
sum(Finland.mca2$sv^2)
## [1] 1.195211
summary(Finland.mca2)
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.256345 21.4 21.4 *****
## 2 0.226339 18.9 40.4 *****
## 3 0.104287 8.7 49.1 **
## 4 0.095848 8.0 57.1 **
## 5 0.075128 6.3 63.4 **
## 6 0.072216 6.0 69.5 **
## 7 0.062583 5.2 74.7 *
## 8 0.055998 4.7 79.4 *
## 9 0.050799 4.3 83.6 *
## 10 0.047132 3.9 87.6 *
## 11 0.043700 3.7 91.2 *
## 12 0.034690 2.9 94.1 *
## 13 0.026378 2.2 96.3 *
## 14 0.020417 1.7 98.0
## 15 0.015734 1.3 99.4
## 16 0.007615 0.6 100.0
## -------- -----
## Total: 1.195211 100.0
##
##
## Columns:
## name mass qlt inr k=1 cor ctr k=2 cor ctr
## 1 | A1 | 12 335 57 | -554 57 15 | 1230 278 83 |
## 2 | A2 | 58 317 46 | -520 284 61 | 177 33 8 |
## 3 | A3 | 85 165 38 | -133 33 6 | -265 132 26 |
## 4 | A4 | 58 104 44 | 101 11 2 | -291 93 22 |
## 5 | A5 | 37 724 63 | 1164 654 193 | 379 70 23 |
## 6 | B1 | 34 825 65 | -494 107 32 | 1276 718 245 |
## 7 | B2 | 67 400 47 | -546 357 78 | -190 43 11 |
## 8 | B3 | 47 148 46 | -170 24 5 | -383 124 30 |
## 9 | B4 | 62 186 45 | 200 47 10 | -345 139 33 |
## 10 | B5 | 40 790 66 | 1238 765 236 | 222 25 9 |
## 11 | C1 | 118 793 39 | 562 789 145 | -39 4 1 |
## 12 | C2 | 91 589 42 | -462 384 76 | -338 206 46 |
## 13 | C3 | 17 123 52 | -636 113 27 | 192 10 3 |
## 14 | C4 | 12 184 54 | -804 119 30 | 595 65 19 |
## 15 | C5 | 12 681 66 | -306 14 4 | 2097 666 231 |
## 16 | D1 | 48 288 47 | 566 273 60 | 135 16 4 |
## 17 | D2 | 94 270 36 | -176 67 11 | -305 203 39 |
## 18 | D3 | 56 22 42 | -17 0 0 | -140 22 5 |
## 19 | D4 | 38 65 46 | -221 34 7 | 214 32 8 |
## 20 | D5 | 14 502 59 | -87 2 0 | 1594 500 155 |
The default plot shows principal coordinates:
plot(Finland.mca2)
The adjusted MCA is the default of the mjca function:
Finland.mca3 <- mjca(Finland[,1:4], ps="")
summary(Finland.mca3)
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.116786 44.9 44.9 *************
## 2 0.090602 34.8 79.7 **********
## 3 0.009457 3.6 83.3 *
## 4 0.006314 2.4 85.7 *
## 5 0.001032 0.4 86.1
## 6 0.000624 0.2 86.4
## 7 0.000000 0.0 86.4
## -------- -----
## Total: 0.260281
##
##
## Columns:
## name mass qlt inr k=1 cor ctr k=2 cor ctr
## 1 | A1 | 12 848 57 | -374 159 15 | 778 689 83 |
## 2 | A2 | 58 847 46 | -351 769 61 | 112 78 8 |
## 3 | A3 | 85 545 38 | -90 122 6 | -168 423 26 |
## 4 | A4 | 58 375 44 | 68 45 2 | -184 329 22 |
## 5 | A5 | 37 830 63 | 785 759 193 | 240 71 23 |
## 6 | B1 | 34 834 65 | -333 121 32 | 807 713 245 |
## 7 | B2 | 67 738 47 | -368 666 78 | -120 71 11 |
## 8 | B3 | 47 543 46 | -115 100 5 | -242 443 30 |
## 9 | B4 | 62 460 45 | 135 127 10 | -219 333 33 |
## 10 | B5 | 40 804 66 | 836 782 236 | 141 22 9 |
## 11 | C1 | 118 902 39 | 379 899 145 | -25 4 1 |
## 12 | C2 | 91 898 42 | -312 611 76 | -214 288 46 |
## 13 | C3 | 17 636 52 | -429 588 27 | 122 47 3 |
## 14 | C4 | 12 756 54 | -542 510 30 | 376 246 19 |
## 15 | C5 | 12 845 66 | -207 20 4 | 1327 825 231 |
## 16 | D1 | 48 970 47 | 382 924 60 | 85 46 4 |
## 17 | D2 | 94 829 36 | -119 227 11 | -193 602 39 |
## 18 | D3 | 56 277 42 | -11 4 0 | -89 273 5 |
## 19 | D4 | 38 557 46 | -149 305 7 | 135 252 8 |
## 20 | D5 | 14 960 59 | -59 3 0 | 1008 956 155 |
A joint correspondence analysis is used for optimal fit to two-way tables:
Finland.mca4 <- mjca(Finland[,1:4], ps="", lambda="JCA")
par(mar=c(4.2,4,3,1), mfrow=c(1,1), font.lab=2)
plot(Finland.mca4, main="Joint correspondence analysis")
summary(Finland.mca4)
##
## Principal inertias (eigenvalues):
##
## dim value
## 1 0.156300
## 2 0.101794
## 3 0.006669
## 4 0.005138
## 5 0.003304
## 6 0.000545
## 7 0.000228
## --------
## Total: 0.285989
##
## Diagonal inertia discounted from eigenvalues: 0.090779
## Percentage explained by JCA in 2 dimensions: 85.7%
## (Eigenvalues are not nested)
## [Iterations in JCA: 50 , epsilon = 0.0008615]
##
##
## Columns:
## name mass inr k=1 k=2 cor ctr
## 1 | A1 | 12 57 | -522 -606 | 856 42 |
## 2 | A2 | 58 46 | -364 -25 | 879 37 |
## 3 | A3 | 85 38 | -74 170 | 564 14 |
## 4 | A4 | 58 44 | 103 145 | 362 10 |
## 5 | A5 | 37 63 | 765 -381 | 912 121 |
## 6 | B1 | 34 65 | -722 -880 | 970 136 |
## 7 | B2 | 67 47 | -457 209 | 848 52 |
## 8 | B3 | 47 46 | -87 300 | 600 17 |
## 9 | B4 | 62 45 | 255 225 | 528 21 |
## 10 | B5 | 40 66 | 1100 -309 | 954 151 |
## 11 | C1 | 118 39 | 377 -41 | 943 80 |
## 12 | C2 | 91 42 | -259 280 | 918 60 |
## 13 | C3 | 17 52 | -460 -53 | 711 17 |
## 14 | C4 | 12 54 | -606 -293 | 793 24 |
## 15 | C5 | 12 66 | -482 -1367 | 917 104 |
## 16 | D1 | 48 47 | 270 -145 | 745 25 |
## 17 | D2 | 94 36 | -59 201 | 795 21 |
## 18 | D3 | 56 42 | 10 81 | 257 2 |
## 19 | D4 | 38 46 | -135 -103 | 487 6 |
## 20 | D5 | 14 59 | -206 -911 | 917 60 |
We source the previously used code for confidence plots, load the ellipse package, and compute the individual case points:
source("confidenceplots.R")
require(ellipse)
## Loading required package: ellipse
Finland.rpc <- Finland.mca1$indmat %*% Finland.mca1$colcoord[,1:2] / 4
The demographic groups are: education e, partnership p and the ga groups. Their confidence plots are:
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), mfrow=c(1,1), font.lab=2)
plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"))
points(Finland.rpc, pch=19, cex=0.5, col="lightblue")
text(Finland.mca1$colcoord, labels=Finland.mca1$levelnames, col="red", font=4, cex=0.8)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"e"]), groupcols=rep("forestgreen",6), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"p"]), groupcols=rep("purple",6), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=rep("chocolate",6), shownames=T, add=T)
lines(matrix(c(-0.5,-0.6, -0.5,0.4, 0.5,0.4, 0.5,-0.6, -0.5,-0.6), ncol=2, byrow=T), lty=2, col="gray30", lwd=2)
We zoom in to the group means and confidence ellipses:
plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"), xlim=c(-0.5,0.5), ylim=c(-0.6,0.4))
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"e"]), groupcols=rep("forestgreen",6), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"p"]), groupcols=rep("purple",6), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=rep("chocolate",6), shownames=T, add=T)
plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"), xlim=c(-0.5,0.5), ylim=c(-0.6,0.4))
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=c(rep("blue",6),rep("red",6)),
groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=T)
And add shading:
plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"), xlim=c(-0.5,0.5), ylim=c(-0.6,0.4))
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=c(rep("blue",6),rep("red",6)), shade=T,
groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=T)
Next we go back to the original plot, with category labels and slightly smaller gender-age labels (cex=0.8):
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), mfrow=c(1,1), font.lab=2)
plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"))
points(Finland.rpc, pch=19, cex=0.5, col="lightblue")
text(Finland.mca1$colcoord, labels=Finland.mca1$levelnames, col="red", font=4, cex=0.8)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=c(rep("blue",6),rep("red",6)), shade=T,
groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=T, cex=0.8)
Back to another demographic variable, education:
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), mfrow=c(1,1), font.lab=2)
plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"))
points(Finland.rpc, pch=19, cex=0.5, col="lightblue")
text(Finland.mca1$colcoord, labels=Finland.mca1$levelnames, col="red", font=4, cex=0.8)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"e"]), groupcols=rep("forestgreen",6), shade=T,
shownames=T, add=T, cex=0.8)
Plot confidence ellipses for all demographic groups together in a zoom:
plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"), xlim=c(-0.5,0.5), ylim=c(-0.6,0.4))
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"p"]), groupcols=rep("purple",6), shade=T,
groupnames=c("liveTog","liveSep","Single"), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"e"]), groupcols=rep("forestgreen",6), shade=T,
groupnames=c("e0","e1","e2","e3","e4","e5"), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=c(rep("blue",6),rep("red",6)), shade=T,
groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=T)
As in the previous Assignment 1, here we again use the Finnish sample from the ISSP 2012 survey “Family and Changing Gender Roles IV”. The original data comprise 1171 observations of 8 variables (4 substantive and 4 demographic). All variables are categorical.
The 4 substantive variables, whose values are measured on a 1-5 scale, are:
A: Married people are generally happier than unmarried people.
B: People who want children ought to get married.
C: It is all right for a couple to live together without intending to get married.
D: Divorce is usually the best solution when a couple can’t seem to work out their marriage problems.
The demographic variables are:
g: gender (1=male, 2=female)
a: age group (1=16-25, 2=26-35, 3=36-45, 4=46-55, 5=56-65, 6=66+)
e: education (1=Primary, 2=Comprehensive, primary and lower secondary, 3=Post-comprehensive, vocational school or course, 4=General upper secondary education or certificate, 5=Vocational post-secondary non-tertiary education, 6=Polytechnics, 7=University, lower academic degree, BA, 8=University, higher academic degree, MA)
p: Living in steady partnership (1=Yes, have partner; live in same household, 2=Yes, have partner; don’t live in same household, 3=No partner)
Here we do not perform any preliminary data treatment. We include all original observations and variables, and also all missing data, which are denoted by the number 9. During the data analyses we simply treat the missing responses as an additional category, 9.
The original data, including missing values denoted by the number 9, look as follows:
Finland2 <- read.table("FinlandWithMissing.txt")
head(Finland2)
## A B C D g a e p
## 1 3 3 1 2 1 2 4 3
## 2 3 2 3 2 1 4 2 3
## 3 3 3 1 3 1 3 8 1
## 4 9 4 1 1 2 2 4 2
## 5 3 2 2 3 2 2 6 1
## 6 9 4 2 1 2 5 6 3
dim(Finland2)
## [1] 1171 8
str(Finland2)
## 'data.frame': 1171 obs. of 8 variables:
## $ A: int 3 3 3 9 3 9 2 1 3 3 ...
## $ B: int 3 2 3 4 2 4 2 1 3 3 ...
## $ C: int 1 3 1 1 2 2 2 1 1 1 ...
## $ D: int 2 2 3 1 3 1 3 9 3 2 ...
## $ g: int 1 1 1 2 2 2 2 2 1 2 ...
## $ a: int 2 4 3 2 2 5 4 2 3 1 ...
## $ e: int 4 2 8 4 6 6 5 5 7 4 ...
## $ p: int 3 3 1 2 1 3 1 3 3 3 ...
As noted during the lectures, since we analyze the data at the nominal level, a missing value is treated as an additional category (here category 9), which can then be handled explicitly in the subset analysis.
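As a small illustration (a sketch only; it is not needed below, because mjca already treats the numeric code 9 as a category of its own), the missing category could also be made explicit by recoding the substantive variables as factors:
# Recode A-D as factors with the missing code 9 as an explicit level
Finland2f <- Finland2
Finland2f[c("A","B","C","D")] <- lapply(Finland2f[c("A","B","C","D")],
                                        function(v) factor(v, levels = c(1:5, 9)))
str(Finland2f[c("A","B","C","D")])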
We show the results for both data sets (without and with missing data) using the default symmetric plot, where both sets of coordinates are principal. We recall the main formulas of MCA (from the lecture slides) in order to show the mathematical expression of these coordinates.
If we denote the data table (Finland.txt or FinlandWithMissing.txt) as N, then the correspondence matrix P is obtained as:
\[\mathbf{P} = \left( \frac{1}{n} \right)\mathbf{N} \]
If we denote the row and column marginal totals (masses) of P as r and c respectively, and \[\mathbf{D_r}\] and \[\mathbf{D_c}\] are the diagonal matrices of these masses, then after applying singular value decomposition (SVD) we obtain:
\[\mathbf{S} = \mathbf{D^{\left(-1/2\right)}_r} \mathbf{\left(P-rc^T \right)}\mathbf{D^{\left(-1/2\right)}_c} \] which is equivalent to
\[\mathbf{S} = \mathbf{D^{\left(1/2\right)}_r} \mathbf{\left(D^{\left(-1\right)}_rPD^{\left(-1\right)}_c-11^T\right)} \mathbf{D^{\left(1/2\right)}_c} \]
According to the SVD, \[\mathbf{S} = \mathbf{UD_\alpha V^T}\] Then the principal coordinates are:
- for rows: \[\mathbf{F} = \mathbf{D^{\left(-1/2\right)}_rUD_{\alpha}}\]
- for columns: \[\mathbf{G} = \mathbf{D^{\left(-1/2\right)}_c V D_{\alpha}}\]
The total variance (inertia) is the sum of squares of the elements of \[\mathbf{S}\]: \[\mathrm{inertia} = \mathrm{trace}\left(\mathbf{SS^T}\right) = {\chi}^2/n\]
The chi-square statistic is \[{\chi}^2 = \sum_{i=1}^r\sum_{j=1}^c\frac{\left(fo_{ij}-fe_{ij}\right)^2}{fe_{ij}}\]
In this expression the observed and expected frequencies of the cell in row i and column j are denoted by \[fo_{ij}\] and \[fe_{ij}\] respectively.
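As a minimal sketch of these formulas in R (an illustration only, assuming the cross-tabulation tab from the first example is still available; this is essentially what ca() computes internally):
N      <- as.matrix(tab)
n      <- sum(N)
P      <- N / n                                  # correspondence matrix
r.mass <- rowSums(P)                             # row masses r
c.mass <- colSums(P)                             # column masses c
S      <- diag(1/sqrt(r.mass)) %*% (P - outer(r.mass, c.mass)) %*% diag(1/sqrt(c.mass))
dec    <- svd(S)                                 # S = U D_alpha V^T
F.row  <- diag(1/sqrt(r.mass)) %*% dec$u %*% diag(dec$d)   # row principal coordinates
G.col  <- diag(1/sqrt(c.mass)) %*% dec$v %*% diag(dec$d)   # column principal coordinates
sum(S^2)                                         # total inertia = chi-square / n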
We first perform MCA on the data Finland.txt, where the missing data are excluded, and draw the default symmetric plot.
require(ca)
Finland <- read.table("Finland.txt")
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), cex.axis=0.8, font.lab=2, mfrow=c(1,1))
plot(mjca(Finland[,1:4], ps=""), mass=c(F,T))
Next we perform the same MCA analysis and plot, but now also using the missing data as a separate category (9):
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), cex.axis=0.8, font.lab=2, mfrow=c(1,1))
plot(mjca(Finland2[,1:4], ps=""), mass=c(F,T))
We observe that the missing category 9 of all 4 substantive questions is situated in the lower right part of the biplot, where both principal coordinates take negative values.
The symmetric plot represents the row and column profiles simultaneously in a common space. According to Mike Bendixen (Marketing Bulletin, 2003, 14), this plot may lead to misinterpretation if examined in isolation or only visually, because principal coordinates are presented for both rows and columns. These coordinates represent the row and column profiles and not the apexes, for which the standard coordinates are required. Therefore in such cases asymmetric plots are recommended.
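As a hedged illustration (an addition, not part of the original output), the same MCA could be drawn as an asymmetric map with the ca package, for example with the column categories in principal and the rows in standard coordinates:
plot(mjca(Finland[,1:4], ps=""), map="colprincipal", mass=c(FALSE, TRUE))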
The part of the Burt matrix for the first 4 substantive questions can be obtained as:
require(ca)
Finland2.B <- mjca(Finland2[,1:4])$Burt
Since the missing values are included in this matrix as a separate category, we need to take the following part of the Burt matrix:
Finland2.ABCD = Finland2.B[c(1:24), c(1:24)]
Finland2.ABCD
## A:1 A:2 A:3 A:4 A:5 A:9 B:1 B:2 B:3 B:4 B:5 B:9 C:1 C:2 C:3 C:4 C:5
## A:1 59 0 0 0 0 0 32 14 5 4 1 3 21 15 7 7 6
## A:2 0 233 0 0 0 0 53 99 38 27 12 4 69 109 20 17 18
## A:3 0 0 344 0 0 0 28 100 105 87 22 2 153 134 32 14 7
## A:4 0 0 0 231 0 0 16 47 30 102 33 3 112 93 11 10 3
## A:5 0 0 0 0 152 0 9 12 7 27 96 1 121 17 1 0 12
## A:9 0 0 0 0 0 152 12 41 23 23 28 25 69 49 8 3 4
## B:1 32 53 28 16 9 12 150 0 0 0 0 0 29 41 21 18 38
## B:2 14 99 100 47 12 41 0 313 0 0 0 0 66 175 34 29 4
## B:3 5 38 105 30 7 23 0 0 208 0 0 0 98 88 19 1 1
## B:4 4 27 87 102 27 23 0 0 0 270 0 0 161 99 4 3 1
## B:5 1 12 22 33 96 28 0 0 0 0 192 0 178 6 0 0 6
## B:9 3 4 2 3 1 25 0 0 0 0 0 38 13 8 1 0 0
## C:1 21 69 153 112 121 69 29 66 98 161 178 13 545 0 0 0 0
## C:2 15 109 134 93 17 49 41 175 88 99 6 8 0 417 0 0 0
## C:3 7 20 32 11 1 8 21 34 19 4 0 1 0 0 79 0 0
## C:4 7 17 14 10 0 3 18 29 1 3 0 0 0 0 0 51 0
## C:5 6 18 7 3 12 4 38 4 1 1 6 0 0 0 0 0 50
## C:9 3 0 4 2 1 19 3 5 1 2 2 16 0 0 0 0 0
## D:1 15 31 58 45 41 26 30 39 34 46 58 9 154 43 8 4 4
## D:2 17 97 121 86 44 47 32 146 82 104 45 3 158 208 27 13 3
## D:3 6 49 96 41 27 25 24 61 55 59 41 4 110 83 28 13 7
## D:4 7 37 43 44 12 16 33 39 26 41 19 1 65 55 9 13 16
## D:5 7 13 14 5 13 6 25 11 3 9 9 1 23 4 6 7 17
## D:9 7 6 12 10 15 32 6 17 8 11 20 20 35 24 1 1 3
## C:9 D:1 D:2 D:3 D:4 D:5 D:9
## A:1 3 15 17 6 7 7 7
## A:2 0 31 97 49 37 13 6
## A:3 4 58 121 96 43 14 12
## A:4 2 45 86 41 44 5 10
## A:5 1 41 44 27 12 13 15
## A:9 19 26 47 25 16 6 32
## B:1 3 30 32 24 33 25 6
## B:2 5 39 146 61 39 11 17
## B:3 1 34 82 55 26 3 8
## B:4 2 46 104 59 41 9 11
## B:5 2 58 45 41 19 9 20
## B:9 16 9 3 4 1 1 20
## C:1 0 154 158 110 65 23 35
## C:2 0 43 208 83 55 4 24
## C:3 0 8 27 28 9 6 1
## C:4 0 4 13 13 13 7 1
## C:5 0 4 3 7 16 17 3
## C:9 29 3 3 3 1 1 18
## D:1 3 216 0 0 0 0 0
## D:2 3 0 412 0 0 0 0
## D:3 3 0 0 244 0 0 0
## D:4 1 0 0 0 159 0 0
## D:5 1 0 0 0 0 58 0
## D:9 18 0 0 0 0 0 82
summary(Finland2.ABCD)
## A:1 A:2 A:3 A:4
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.500 1st Qu.: 3.00 1st Qu.: 3.50 1st Qu.: 2.75
## Median : 6.500 Median : 19.00 Median : 25.00 Median : 13.50
## Mean : 9.833 Mean : 38.83 Mean : 57.33 Mean : 38.50
## 3rd Qu.:14.250 3rd Qu.: 50.00 3rd Qu.: 97.00 3rd Qu.: 45.50
## Max. :59.000 Max. :233.00 Max. :344.00 Max. :231.00
## A:5 A:9 B:1 B:2
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.75 1st Qu.: 3.75 1st Qu.: 5.25 1st Qu.: 4.75
## Median : 12.00 Median : 21.00 Median : 22.50 Median : 31.50
## Mean : 25.33 Mean : 25.33 Mean : 25.00 Mean : 52.17
## 3rd Qu.: 27.00 3rd Qu.: 29.00 3rd Qu.: 32.00 3rd Qu.: 62.25
## Max. :152.00 Max. :152.00 Max. :150.00 Max. :313.00
## B:3 B:4 B:5 B:9
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 1.00 1st Qu.: 1.75 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 13.50 Median : 17.00 Median : 10.5 Median : 2.500
## Mean : 34.67 Mean : 45.00 Mean : 32.0 Mean : 6.333
## 3rd Qu.: 42.25 3rd Qu.: 66.00 3rd Qu.: 35.0 3rd Qu.: 8.250
## Max. :208.00 Max. :270.00 Max. :192.0 Max. :38.000
## C:1 C:2 C:3 C:4
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 19.00 1st Qu.: 5.5 1st Qu.: 0.75 1st Qu.: 0.0
## Median : 67.50 Median : 42.0 Median : 7.50 Median : 3.5
## Mean : 90.83 Mean : 69.5 Mean :13.17 Mean : 8.5
## 3rd Qu.:129.00 3rd Qu.: 94.5 3rd Qu.:20.25 3rd Qu.:13.0
## Max. :545.00 Max. :417.0 Max. :79.00 Max. :51.0
## C:5 C:9 D:1 D:2
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.750 1st Qu.: 0.750 1st Qu.: 3.75 1st Qu.: 3.00
## Median : 4.000 Median : 2.000 Median : 28.00 Median : 38.00
## Mean : 8.333 Mean : 4.833 Mean : 36.00 Mean : 68.67
## 3rd Qu.: 8.250 3rd Qu.: 3.250 3rd Qu.: 43.50 3rd Qu.: 98.75
## Max. :50.000 Max. :29.000 Max. :216.00 Max. :412.00
## D:3 D:4 D:5 D:9
## Min. : 0.00 Min. : 0.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3.75 1st Qu.: 1.0 1st Qu.: 1.000 1st Qu.: 1.00
## Median : 26.00 Median : 16.0 Median : 6.500 Median : 9.00
## Mean : 40.67 Mean : 26.5 Mean : 9.667 Mean :13.67
## 3rd Qu.: 56.00 3rd Qu.: 39.5 3rd Qu.:13.000 3rd Qu.:18.50
## Max. :244.00 Max. :159.0 Max. :58.000 Max. :82.00
For our work we also need to calculate the indicator matrix, which is likewise obtained via the MCA function mjca:
Finland2.Z <- mjca(Finland2[,1:4], ps="", reti=T)$indmat
head(Finland2.Z)
## A1 A2 A3 A4 A5 A9 B1 B2 B3 B4 B5 B9 C1 C2 C3 C4 C5 C9 D1 D2 D3 D4 D5 D9
## 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0
## 2 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0
## 3 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
## 4 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0
## 5 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
## 6 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0
dim(Finland2.Z)
## [1] 1171 24
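As a quick consistency check (a sketch using the two objects computed above), the Burt matrix should equal the cross-product of the indicator matrix, which is exactly the relationship recalled next:
# The dimnames differ slightly between the two objects, so compare values only
all.equal(unname(t(Finland2.Z) %*% Finland2.Z), unname(Finland2.B))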
The relationship between the two matrices is: \[\mathbf{B} = \mathbf{Z^TZ}\] The non-missing categories are obtained by excluding the missing ones. Further analysis is performed using subset CA:
Finland2.nonmissing <- -c(6,12,18,24)
Finland2.mca1 <- ca(Finland2.B, subsetrow=c(Finland2.nonmissing), subsetcol=c(Finland2.nonmissing))
plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2)
summary(Finland2.mca1)
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.243700 20.7 20.7 *****
## 2 0.210498 17.9 38.6 ****
## 3 0.103194 8.8 47.4 **
## 4 0.095390 8.1 55.5 **
## 5 0.075342 6.4 61.9 **
## 6 0.070659 6.0 67.9 **
## 7 0.062970 5.4 73.3 *
## 8 0.057778 4.9 78.2 *
## 9 0.049824 4.2 82.4 *
## 10 0.047380 4.0 86.5 *
## 11 0.043140 3.7 90.1 *
## 12 0.036612 3.1 93.2 *
## 13 0.029151 2.5 95.7 *
## 14 0.022491 1.9 97.6
## 15 0.017302 1.5 99.1
## 16 0.008924 0.8 99.9
## 17 0.001268 0.1 100.0
## 18 0.000285 0.0 100.0
## 19 5.3e-05 0.0 100.0
## 20 1.3e-05 0.0 100.0
## -------- -----
## Total: 1.175975 100.0
##
##
## Rows:
## name mass qlt inr k=1 cor ctr k=2 cor ctr
## 1 | A1 | 13 293 56 | -522 52 14 | 1121 241 75 |
## 2 | A2 | 50 299 46 | -545 271 61 | 172 27 7 |
## 3 | A3 | 73 166 39 | -149 36 7 | -285 130 28 |
## 4 | A4 | 49 88 45 | 113 12 3 | -284 76 19 |
## 5 | A5 | 32 680 61 | 1152 602 177 | 417 79 27 |
## 6 | B1 | 32 805 64 | -584 145 45 | 1243 660 235 |
## 7 | B2 | 67 397 47 | -528 340 76 | -216 57 15 |
## 8 | B3 | 44 136 47 | -132 14 3 | -390 122 32 |
## 9 | B4 | 58 163 45 | 196 42 9 | -336 122 31 |
## 10 | B5 | 41 782 64 | 1169 742 230 | 271 40 14 |
## 11 | C1 | 116 797 39 | 563 796 151 | -8 0 0 |
## 12 | C2 | 89 588 42 | -448 358 73 | -360 231 55 |
## 13 | C3 | 17 130 53 | -674 123 31 | 152 6 2 |
## 14 | C4 | 11 197 55 | -865 126 33 | 650 71 22 |
## 15 | C5 | 11 658 65 | -437 27 8 | 2130 631 230 |
## 16 | D1 | 46 262 47 | 544 250 56 | 122 13 3 |
## 17 | D2 | 88 277 37 | -197 79 14 | -312 198 41 |
## 18 | D3 | 52 21 43 | -24 1 0 | -140 20 5 |
## 19 | D4 | 34 65 47 | -216 28 6 | 243 36 10 |
## 20 | D5 | 12 467 58 | -194 7 2 | 1595 461 150 |
##
## Columns:
## name mass qlt inr k=1 cor ctr k=2 cor ctr
## 1 | A1 | 13 293 56 | -522 52 14 | 1121 241 75 |
## 2 | A2 | 50 299 46 | -545 271 61 | 172 27 7 |
## 3 | A3 | 73 166 39 | -149 36 7 | -285 130 28 |
## 4 | A4 | 49 88 45 | 113 12 3 | -284 76 19 |
## 5 | A5 | 32 680 61 | 1152 602 177 | 417 79 27 |
## 6 | B1 | 32 805 64 | -584 145 45 | 1243 660 235 |
## 7 | B2 | 67 397 47 | -528 340 76 | -216 57 15 |
## 8 | B3 | 44 136 47 | -132 14 3 | -390 122 32 |
## 9 | B4 | 58 163 45 | 196 42 9 | -336 122 31 |
## 10 | B5 | 41 782 64 | 1169 742 230 | 271 40 14 |
## 11 | C1 | 116 797 39 | 563 796 151 | -8 0 0 |
## 12 | C2 | 89 588 42 | -448 358 73 | -360 231 55 |
## 13 | C3 | 17 130 53 | -674 123 31 | 152 6 2 |
## 14 | C4 | 11 197 55 | -865 126 33 | 650 71 22 |
## 15 | C5 | 11 658 65 | -437 27 8 | 2130 631 230 |
## 16 | D1 | 46 262 47 | 544 250 56 | 122 13 3 |
## 17 | D2 | 88 277 37 | -197 79 14 | -312 198 41 |
## 18 | D3 | 52 21 43 | -24 1 0 | -140 20 5 |
## 19 | D4 | 34 65 47 | -216 28 6 | 243 36 10 |
## 20 | D5 | 12 467 58 | -194 7 2 | 1595 461 150 |
This plot is quite similar to the plot obtained when the missing data are removed from the data set.
Next we add the respondent points to the plot. For this purpose we compute each respondent's coordinates from the indicator matrix (averaging the standard coordinates of the non-missing categories chosen by that respondent) and then draw the points at these coordinates:
Finland2.sum <- apply(Finland2.Z[,c(Finland2.nonmissing)], 1, sum)
Finland2.sum[Finland2.sum==0] <- 1
Finland2.rpc <- Finland2.Z[,c(Finland2.nonmissing)] %*% Finland2.mca1$colcoord / Finland2.sum
plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc, pch=19, col="lightblue", cex=0.8)
Next we compute confidence regions for the demographic groups. For this purpose we add group ellipses using Prof. Greenacre’s program “confidenceplots”, which was provided separately.
require(ellipse)
source("confidenceplots.R")
First we draw confidence plots for the demographic variable partnership: we generate the asymmetric plot, add the respondent points, and finally draw the confidence ellipses.
plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc[Finland2$p<4,1], Finland2.rpc[Finland2$p<4,2], group=Finland2$p[Finland2$p<4], groupcols=rep("purple",5), shade=T,
groupnames=c("ph1","p2","nop3"), shownames=T, add=T)
Next we draw the confidence plot for the demographic variable education.
plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc[Finland2$e<9,1], Finland2.rpc[Finland2$e<9,2], group=Finland2$e[Finland2$e<9], groupcols=rep("forestgreen",8), shade=T,
groupnames=c("e1","e2","e3","e4","e5","e6","e7","e8"), shownames=T, add=T)
The confidence plot for the demographic group gender is:
plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc[Finland2$g<3,1], Finland2.rpc[Finland2$g<3,2], group=Finland2$g[Finland2$g<3], groupcols=rep("blue",6), shade=T,
groupnames=c("m","f"), shownames=T, add=T)
The confidence plot for the demographic group age is:
plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc[Finland2$a<7,1], Finland2.rpc[Finland2$a<7,2], group=Finland2$a[Finland2$a<7], groupcols=rep("blue",6), shade=T,
groupnames=c("a1","a2","a3","a4","a5","a6"), shownames=T, add=T)
If we plot only the confidence regions (just the ellipses) for every demographic group, the plots look as follows:
plot(Finland2.rpc, type="n", asp=1, xlab="Subset CA dim1 (20.7%)", ylab="Subset CA dim2 (17.9%)", xlim=c(-0.5,0.5), ylim=c(-0.5,0.5))
abline(v=0, h=0, lty=2, col="grey")
confidenceplots(Finland2.rpc[Finland2$p<4,1], Finland2.rpc[Finland2$p<4,2], group=Finland2$p[Finland2$p<4], groupcols=rep("purple",5), shade=T,
groupnames=c("ph1","p2","nop3"), shownames=T, add=T)
plot(Finland2.rpc, type="n", asp=1, xlab="Subset CA dim1 (20.7%)", ylab="Subset CA dim2 (17.9%)", xlim=c(-1,1), ylim=c(-0.5,0.5))
abline(v=0, h=0, lty=2, col="grey")
confidenceplots(Finland2.rpc[Finland2$e<9,1], Finland2.rpc[Finland2$e<9,2], group=Finland2$e[Finland2$e<9], groupcols=rep("forestgreen",8), shade=T,
groupnames=c("e1","e2","e3","e4","e5","e6","e7","e8"), shownames=T, add=T)
plot(Finland2.rpc, type="n", asp=1, xlab="Subset CA dim1 (20.7%)", ylab="Subset CA dim2 (17.9%)", xlim=c(-0.5,0.5), ylim=c(-0.2,0.2))
abline(v=0, h=0, lty=2, col="grey")
confidenceplots(Finland2.rpc[Finland2$g<3,1], Finland2.rpc[Finland2$g<3,2], group=Finland2$g[Finland2$g<3], groupcols=rep("blue",6), shade=T,
groupnames=c("m","f"), shownames=T, add=T)
plot(Finland2.rpc, type="n", asp=1, xlab="Subset CA dim1 (20.7%)", ylab="Subset CA dim2 (17.9%)", xlim=c(-1,1), ylim=c(-0.5,0.5))
abline(v=0, h=0, lty=2, col="grey")
confidenceplots(Finland2.rpc[Finland2$a<7,1], Finland2.rpc[Finland2$a<7,2], group=Finland2$a[Finland2$a<7], groupcols=rep("blue",6), shade=T,
groupnames=c("a1","a2","a3","a4","a5","a6"), shownames=T, add=T)
Analysis of the “middle” category can be carried out via a CA of the Burt matrix.
# investigation of "middle" category, via the Burt matrix
#seq() - generates a sequence of numbers
#seq(from=3, to=22, by=6): 3, 9, 15, 21 (the middle category of each of A, B, C, D)
Finland2.middle <- seq(3,22,6)
Finland2.mca2 <- ca(Finland2.B, subsetrow=Finland2.middle, subsetcol=seq(3,22,6))
#what - vector of two character strings specifying the contents of the plot.
#The first entry sets the rows and the second entry the columns. Allowed values
#are "all" (all available points, the default),
#"active" (only active points are displayed),
#"passive" (only supplementary points are displayed),
#"none" (no points are displayed).
#The status (active or supplementary) of columns is set in mjca using the option supcol.
#mass - show mass by point size; first entry for rows, second for columns
#font.lab - font for the axis labels (2 = bold), passed to par()
plot(Finland2.mca2, what=c("none", "all"), mass=c(F,T), font.lab=2)
summary(Finland2.mca2)
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.069858 40.4 40.4 **********
## 2 0.047719 27.6 68.0 *******
## 3 0.034199 19.8 87.8 *****
## 4 0.021173 12.2 100.0 ***
## -------- -----
## Total: 0.172949 100.0
##
##
## Rows:
## name mass qlt inr k=1 cor ctr k=2 cor ctr
## 1 | A3 | 73 563 194 | -418 384 184 | 286 179 126 |
## 2 | B3 | 44 770 256 | -623 389 246 | 616 381 353 |
## 3 | C3 | 17 943 318 | -1267 491 387 | -1215 452 522 |
## 4 | D3 | 52 317 232 | -495 317 182 | 6 0 0 |
##
## Columns:
## name mass qlt inr k=1 cor ctr k=2 cor ctr
## 1 | A3 | 73 563 194 | -418 384 184 | 286 179 126 |
## 2 | B3 | 44 770 256 | -623 389 246 | 616 381 353 |
## 3 | C3 | 17 943 318 | -1267 491 387 | -1215 452 522 |
## 4 | D3 | 52 317 232 | -495 317 182 | 6 0 0 |
Adding respondent points only for middle categories:
#Finland2.Z - indicator matrix
# apply(variable, margin, function). - Returns a vector or array or list of values obtained by applying
#a function to margins of an array or matrix
#margin specifies if you want to apply by row (margin = 1),
#by column (margin = 2), or for each element (margin = 1:2).
Finland2.sum4 <- apply(Finland2.Z[,c(Finland2.middle)], 1, sum)
Finland2.sum4[Finland2.sum4==0] <- 1
Finland2.rpc5 <- Finland2.Z[,c(Finland2.middle)] %*% Finland2.mca2$colcoord / Finland2.sum4
plot(Finland2.mca2, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc5, pch=19, col="lightblue", cex=0.8)
The missing categories alone are obtained as:
Finland2.missing <- c(6,12,18,24)
Finland2.mca4 <- ca(Finland2.B, subsetrow=c(Finland2.missing), subsetcol=c(Finland2.missing))
plot(Finland2.mca4, what=c("none", "all"), mass=c(F,T), font.lab=2)
Adding respondent points only for missing categories:
Finland2.sum2 <- apply(Finland2.Z[,c(Finland2.missing)], 1, sum)
Finland2.sum2[Finland2.sum2==0] <- 1
Finland2.rpc3 <- Finland2.Z[,c(Finland2.missing)] %*% Finland2.mca4$colcoord / Finland2.sum2
plot(Finland2.mca4, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc3 , pch=19, col="lightblue", cex=0.8)
Both the “middle” and missing categories are investigated again using a CA of the Burt matrix.
Finland2.missing <- c(6,12,18,24)
Finland2.mca3 <- ca(Finland2.B, subsetrow=c(Finland2.middle,Finland2.missing), subsetcol=c(Finland2.middle,Finland2.missing))
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2)
Adding respondent points for both the “middle” and missing categories:
Finland2.sum1 <- apply(Finland2.Z[,c(Finland2.middle,Finland2.missing)], 1, sum)
Finland2.sum1[Finland2.sum1==0] <- 1
Finland2.rpc2 <- Finland2.Z[,c(Finland2.middle,Finland2.missing)] %*% Finland2.mca3$colcoord / Finland2.sum1
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc2 , pch=19, col="lightblue", cex=0.8)
Next we compute confidence regions of the demographic groups for both the missing and “middle” categories.
#generate asymmetric plot
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
#add points
points(Finland2.rpc2, pch=19, col="lightblue", cex=0.8)
#draw confidence plots
confidenceplots(Finland2.rpc2[Finland2$p<4,1], Finland2.rpc2[Finland2$p<4,2], group=Finland2$p[Finland2$p<4], groupcols=rep("purple",5), shade=T,
groupnames=c("ph1","p2","nop3"), shownames=T, add=T)
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc2, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc2[Finland2$e<9,1], Finland2.rpc2[Finland2$e<9,2], group=Finland2$e[Finland2$e<9], groupcols=rep("forestgreen",8), shade=T,
groupnames=c("e1","e2","e3","e4","e5","e6","e7","e8"), shownames=T, add=T)
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc2, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc2[Finland2$g<3,1], Finland2.rpc2[Finland2$g<3,2], group=Finland2$g[Finland2$g<3], groupcols=rep("blue",6), shade=T,
groupnames=c("m","f"), shownames=T, add=T)
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc2, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc2[Finland2$a<7,1], Finland2.rpc2[Finland2$a<7,2], group=Finland2$a[Finland2$a<7], groupcols=rep("blue",6), shade=T,
groupnames=c("a1","a2","a3","a4","a5","a6"), shownames=T, add=T)
Since the confidence regions are not very clear from these plots, we next plot “just the ellipses”, similarly as in Task 2.
plot(Finland2.rpc2, type="n", asp=1, xlab="Subset CA dim1 (48.6%)", ylab="Subset CA dim2 (13.5%)", xlim=c(-1,1), ylim=c(0,2))
abline(v=0, h=0, lty=2, col="grey")
#draw confidence plots
confidenceplots(Finland2.rpc2[Finland2$p<4,1], Finland2.rpc2[Finland2$p<4,2], group=Finland2$p[Finland2$p<4], groupcols=rep("purple",5), shade=T,
groupnames=c("ph1","p2","nop3"), shownames=T, add=T)
plot(Finland2.rpc2, type="n", asp=1, xlab="Subset CA dim1 (48.6%)", ylab="Subset CA dim2 (13.5%)", xlim=c(-1,1), ylim=c(0.5,1.5))
abline(v=0, h=0, lty=2, col="grey")
confidenceplots(Finland2.rpc2[Finland2$e<9,1], Finland2.rpc2[Finland2$e<9,2], group=Finland2$e[Finland2$e<9], groupcols=rep("forestgreen",8), shade=T,
groupnames=c("e1","e2","e3","e4","e5","e6","e7","e8"), shownames=T, add=T)
plot(Finland2.rpc2, type="n", asp=1, xlab="Subset CA dim1 (48.6%)", ylab="Subset CA dim2 (13.5%)", xlim=c(0,0.5), ylim=c(0.8,1.4))
abline(v=0, h=0, lty=2, col="grey")
confidenceplots(Finland2.rpc2[Finland2$g<3,1], Finland2.rpc2[Finland2$g<3,2], group=Finland2$g[Finland2$g<3], groupcols=rep("blue",6), shade=T,
groupnames=c("m","f"), shownames=T, add=T)
plot(Finland2.rpc2, type="n", asp=1, xlab="Subset CA dim1 (48.6%)", ylab="Subset CA dim2 (13.5%)", xlim=c(-0.5,1), ylim=c(0.5,1.5))
abline(v=0, h=0, lty=2, col="grey")
confidenceplots(Finland2.rpc2[Finland2$a<7,1], Finland2.rpc2[Finland2$a<7,2], group=Finland2$a[Finland2$a<7], groupcols=rep("blue",6), shade=T,
groupnames=c("a1","a2","a3","a4","a5","a6"), shownames=T, add=T)
From all the analyses provided in Task 3 we can conclude that MCA is a powerful tool for investigating missing and “middle” categories. Both categories carry quite a large amount of uncertainty, which is why most other methods simply ignore these cases. Here we have seen how the information from both categories can be utilized with the MCA approach.
Oleg Nenadic and Michael Greenacre, Computation of Multiple Correspondence Analysis, with code in R
Michael Greenacre, Biplots in Practice
Mike Bendixen, A Practical Guide to the Use of Correspondence Analysis in Marketing Research, Marketing Bulletin, 2003, 14, Technical Note 2.
[An Example R Markdown](http://www.statpower.net/Content/310/R%20Stuff/SampleMarkdown.html)
Writing Mathematical Formulas in Markdown
As in Assignment set 1, we again use the Finnish sample from the ISSP 2012 survey “Family and Changing Gender Roles IV”. The original data involve 1171 observations of 8 variables (4 substantive and 4 demographic). All variables are categorical.
The 4 substantive variables, whose values are measured on a 1-5 scale, are:
A: Married people are generally happier than unmarried people.
B: People who want children ought to get married.
C: It is all right for a couple to live together without intending to get married.
D: Divorce is usually the best solution when a couple can’t seem to work out their marriage problems.
The demographic variables are:
g: gender (1=male, 2=female)
a: age group (1=16-25, 2=26-35, 3=36-45, 4=46-55, 5=56-65, 6= 66+)
e: education (1=Primary, 2=Comprehensive, primary and lower secondary, 3=Post-comprehensive, vocational school or course, 4=General upper secondary education or certificate, 5=Vocational post-secondary non-tertiary education, 6=Polytechnics, 7=University, lower academic degree, BA, 8=University, higher academic degree, MA)
p: Living in steady partnership (1=Yes, have partner; live in same household, 2=Yes, have partner; don’t live in same household, 3=No partner)
The data wrangling includes the following changes:
The missing data are removed (this has already been done). The number of observations without missing data is N=924.
We again use the combined variable **ga = 6*(g-1) + a**, which describes the interaction of the gender and age categories.
The preprocessed data look as follows:
Finland <- read.table("Finland.txt")
Finland$ga <- 6*(Finland$g-1) + Finland$a
head(Finland)
## A B C D g a e p ga
## 1 3 3 1 2 1 2 4 3 2
## 2 3 2 3 2 1 4 2 3 4
## 3 3 3 1 3 1 3 8 1 3
## 4 3 2 2 3 2 2 6 1 8
## 5 2 2 2 3 2 4 5 1 10
## 6 3 3 1 3 1 3 7 3 3
dim(Finland)
## [1] 924 9
str(Finland)
## 'data.frame': 924 obs. of 9 variables:
## $ A : int 3 3 3 3 2 3 3 2 2 3 ...
## $ B : int 3 2 3 2 2 3 3 3 3 3 ...
## $ C : int 1 3 1 2 2 1 1 1 2 3 ...
## $ D : int 2 2 3 3 3 3 2 2 3 3 ...
## $ g : int 1 1 1 2 2 1 2 2 2 2 ...
## $ a : int 2 4 3 2 4 3 1 2 5 4 ...
## $ e : int 4 2 8 6 5 7 4 8 7 5 ...
## $ p : int 3 3 1 1 1 3 3 3 1 3 ...
## $ ga: num 2 4 3 8 10 3 7 8 11 10 ...
library(FactoMineR)
library(tidyr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
keep_columns <- c("A", "B", "C", "D", "g", "a", "e", "p", "ga")
Finland <- dplyr::select(Finland, one_of(keep_columns))
summary(Finland)
## A B C D
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :2.000 Median :2.000
## Mean :3.193 Mean :3.025 Mean :1.835 Mean :2.503
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## g a e p
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:1.000
## Median :2.000 Median :4.000 Median :5.000 Median :1.000
## Mean :1.563 Mean :3.644 Mean :4.527 Mean :1.563
## 3rd Qu.:2.000 3rd Qu.:5.000 3rd Qu.:6.000 3rd Qu.:3.000
## Max. :2.000 Max. :6.000 Max. :8.000 Max. :3.000
## ga
## Min. : 1.000
## 1st Qu.: 4.000
## Median : 7.000
## Mean : 7.021
## 3rd Qu.:10.000
## Max. :12.000
We present the data graphically as barplots using the ggplot() function.
gather(Finland) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
We observe that for question A the most frequent answer is the middle category (3, neither agree nor disagree), while for questions B and D the most frequent answer is 2 (agree). Most respondents strongly agree with statement C. The remaining barplots show the distributions of the demographic variables.
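As a quick numeric check of this reading of the barplots, the modal (most frequent) category of each substantive variable can be computed directly:
# most frequent answer category for each of the substantive variables A-D
sapply(Finland[, 1:4], function(x) as.integer(names(which.max(table(x)))))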
There are two main constrained forms of MCA: canonical MCA and nonlinear MCA. Canonical MCA uses explanatory variables to explain a set of response variables, whereas in classical MCA there is no restriction imposed by explanatory variables, since all variables are treated as response variables.
Nonlinear MCA is also considered a constrained form of MCA, because it is obtained from common MCA (also known as homogeneity analysis) by imposing restrictions on the variable ranks and levels, as well as on defined sets of variables.
The most important advantages of nonlinear over linear MCA are that, in addition to truly numeric variables, nonlinear MCA can handle nominal and ordinal variables, and that it can discover nonlinear relationships between variables.
In nonlinear MCA the variables can be nominal, ordinal, or truly numeric. Nominal variables consist of unordered categories. Since principal components are weighted sums of the original variables, nominal variables cannot be analyzed with standard linear PCA. Ordinal variables consist of ordered categories, for example a Likert-type scale as used here. Such scale values are not truly numeric, because the intervals between consecutive categories are not equal. Although ordinal variables display more structure than nominal ones, they still do not possess traditional numeric properties. Truly numeric variables can be viewed as categorical with c categories, where c is the number of distinct observed values. Both ratio and interval variables are considered numeric in nonlinear PCA.
Linear MCA is suitable when all variables are measured as numeric. However, nonlinear relationships may exist between some of these variables, in which case nonlinear MCA is the more appropriate approach.
Nonlinear PCA converts categories into numeric values, because variance is defined only for numeric values. This process is called quantification. Correlations are therefore computed not between the observed variables but between the quantified variables, so the correlation matrix in nonlinear PCA is not fixed and depends on the type of quantification chosen for each variable.
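A tiny illustration of why the correlation matrix is not fixed, using hypothetical category codes and a hypothetical monotone quantification of the 1-5 scale:
# hypothetical category codes of two ordinal variables (illustration only)
x <- c(1, 2, 3, 4, 5, 5, 4, 2)
y <- c(1, 1, 2, 3, 5, 4, 4, 2)
cor(x, y)            # correlation of the raw category codes
# a hypothetical monotone quantification of categories 1..5
q <- c(0, 0.2, 0.3, 1.5, 3)
cor(q[x], q[y])      # correlation of the quantified values differs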
The type of quantification is called analysis level. Different analysis levels imply different requirements.
Nonlinear MCA in R can be carried out with the homals package. This package performs simple homogeneity analysis, which corresponds to MCA, and also provides some extensions, including nonlinear MCA. Nonlinear PCA can be obtained with the function homals() by setting the appropriate options.
In the example below we use the “ordinal” analysis level, because the substantive variables are ordinal with categories on a Likert-type scale.
require(homals)
## Loading required package: homals
#algebraically, the geometric concept of dimensionality is related to
#the rank of the matrix, which has to be reduced.
#rank - Which quantification ranks. Default is number of dimensions ndim
#level - Which quantification levels. Possible values are "nominal", "ordinal",
#"numerical", and "polynomial" which can be defined as
#single character (if all variables are of the same level) or
#as a vector whose length corresponds to the number of variables
Finland.nlpca <- homals(Finland[,1:4], rank=1, level="ordinal")
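The level argument can also be supplied per variable. For instance, if a nominal demographic variable were included, a hypothetical call (not used further in this analysis) could look like:
# hypothetical mixed-level fit: ordinal for the Likert items A-D, nominal for gender g
Finland.nlpca.mix <- homals(Finland[, c("A","B","C","D","g")], rank=1,
                            level=c(rep("ordinal",4), "nominal"))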
Next the transformation plot is drawn. The horizontal (x) axis displays the categories (1 to 5) of the ordinal substantive variables. The vertical (y) axis shows the category quantifications for each substantive variable (A-D).
#Transformation plot: Plots the original (categorical) scale against
#the transformed (metric) scale on each dimension over the categories
#of each variable separately.
plot(Finland.nlpca, plot.type="trfplot") # relationship of transformed scale with 1-to-5 scale
Figures 1A-1D: transformation plots for the substantive variables A-D.
From the transformation plot for variable D we see that the ordinal category quantifications are non-strictly increasing with the original category labels. The original spacing between the categories is not maintained in the quantifications. Between categories 2 and 4 we see something like a plateau, meaning that some consecutive categories obtain almost the same quantification; such categories are called ties. There are two possible reasons for such ties:
1. The persons scoring in the tied categories do not structurally differ from each other in their scores on the other variables, so the categories cannot be distinguished.
2. The ordinal quantifications are obtained by placing an order restriction on nominal quantifications.
The category plot represents a quantified variable by displaying its category points in principal component space, where the axes are given by the principal components. A variable is represented by a vector going through the origin (0,0), which is also the mean of the quantified variable, and through the point whose coordinates are the component loadings of that variable. The component loadings are the correlations between the quantified variables and the principal components, and the sum of squared component loadings gives the variance accounted for (VAF) by the principal component. The category points also lie on the variable vector; their coordinates are found by multiplying the category quantifications by the corresponding loadings on the first (x-coordinate) and second (y-coordinate) component.
Categories with quantifications above the mean lie on the side of the origin where the component loading point is positioned, and categories with quantifications below the mean lie in the opposite direction. The total length of the variable vector does not indicate the importance of the variable.
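As a small numerical illustration of this rule, with hypothetical numbers that are not taken from the fitted model:
# hypothetical illustration of how a category point is positioned
quant <- 0.8             # hypothetical category quantification
loads <- c(0.6, -0.2)    # hypothetical component loadings of the variable
quant * loads            # category point coordinates: 0.48 -0.16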
#2. Category plot: Plots the rank-restricted category quantifications for
#each variable separately.
plot(Finland.nlpca, plot.type="catplot")
Figures 2A-2D: category plots for the substantive variables A-D.
In these plots the loading point is not shown, and it is not clear how to interpret the nonlinear transformations that lead to these four category plots.
In these plots only the loading vectors (origin and loading point) are displayed. Variables with relatively long vectors fit well into the solution, and variables with relatively short vectors fit badly. When the vectors are long, the cosines of the angles between them indicate the correlations between the quantified variables. The length of a variable vector from the origin up to the component loading point indicates the variable's total variance accounted for (VAF). Thus VAF can be interpreted as the amount of information retained when the variables are represented in a lower-dimensional space. Nonlinear transformations reduce the dimensionality necessary to represent the variables satisfactorily.
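For example, the cosine of the angle between two loading vectors is their normalized inner product (hypothetical vectors, for illustration only):
# hypothetical loading vectors of two variables (illustration only)
u <- c(0.7, 0.3)
v <- c(0.6, -0.1)
sum(u*v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))   # cosine of the angle between u and v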
#3. Loadings plot: Plots the loadings of the variables and connects them with the origin.
plot(Finland.nlpca, plot.type="loadplot")
Figure 3
In this example the variable vector for D has the longest length. The other variable vectors are of very similar length and not much shorter than the loading vector of D.
In the vector plot all cases (object scores) are projected onto the straight line defined by each rank-restricted quantified variable. In addition to the straight line already shown in the category plots, all object scores are displayed.
#4. Vector plot: cases projected onto straight line defined by each variable
plot(Finland.nlpca, plot.type="vecplot")
Figures 4A-4C: vector plots.
#5. Loss plot
The loss plot shows the rank-restricted category quantifications against the unrestricted ones for each variable separately; it effectively compares MCA and NLPCA.
plot(Finland.nlpca, plot.type="lossplot")
Figures 5A-5C: loss plots.
From these plots we see that the linear and nonlinear MCA results are very close for variables A and D, while the two model outputs differ considerably for variables B and C. This suggests that a linear model assumption is appropriate for variables A and D.
The object plot shows the scores of the objects (the rows of the data set) on two dimensions, so the plot consists only of points.
plot(Finland.nlpca, plot.type="objplot")
Figure 6
The label plot is similar to the previous object plot: the object scores are plotted again, but now for each variable separately and labelled with the corresponding category labels. The result looks like contour lines forced to be straight and parallel.
plot(Finland.nlpca, plot.type="labplot")
Figures 7A-7C: label plots.
#8. Graph plot.
The graph plot is a joint plot with connections between scores and quantifications: in addition to the object scores shown in the object plot, the variable quantifications are also displayed. Note that this plot is useful only for small data sets; even in this example it is already quite difficult to see the plotted relationships.
plot(Finland.nlpca, plot.type="graphplot")
Figure 8
The Voronoi plot produces a category plot with Voronoi regions; here, too, the boundaries look like contour lines forced to be straight and parallel.
plot(Finland.nlpca, plot.type="vorplot")
Figures 9A-9D: Voronoi plots.
For each single variable the object scores are mapped onto two dimensions and the convex hull for each response category is drawn.
# would be better to show confidence ellipses
plot(Finland.nlpca, plot.type="hullplot")
Figures 10A-10D: hull plots.
The CA method can be applied to data that have been preprocessed in different ways. One kind of preprocessing is the so-called doubling, applied here to rating data. We assume a 1-to-5 Likert scale starting at one. The first column then consists of the ratings minus 1. Since rating 1 = “strongly agree”, this transformed column measures the strength of disagreement and is therefore called the negative pole. The second column is 4 minus the first column and measures the strength of agreement; it is called the positive pole. This kind of data transformation is called doubling. The CA of the doubled data is then performed exactly as for the untransformed data.
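A worked example of the transformation for a single rating of 2 on the 1-to-5 scale:
# worked example of doubling one rating (1 = strongly agree, 5 = strongly disagree)
r <- 2               # an observed rating
neg <- r - 1         # negative pole: strength of disagreement
pos <- 4 - neg       # positive pole: strength of agreement
c(negative = neg, positive = pos)
## negative positive
##        1        3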
The next plot shows the asymmetric map.
require(ca)
Finland.doubled <- cbind( (Finland[,1:4]-1), (5-Finland[,1:4]) )
colnames(Finland.doubled) <- paste( rep(LETTERS[1:4],2), rep(c("-","+"), each=4), sep="" )
head(Finland.doubled)
Finland.doubled.ca <- ca(Finland.doubled)
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(Finland.doubled.ca, col=c("lightblue","red"), labels=c(0,2), font=2, map="rowprincipal")
Figure 11
Next we connect opposite categories in column standard coordinates and draw the contribution biplot.
# connect opposite categories in column standard coordinates
Finland.doubled.csc <- Finland.doubled.ca$colcoord
for(j in 1:4) segments(Finland.doubled.csc[j,1],Finland.doubled.csc[j,2],
Finland.doubled.csc[4+j,1],Finland.doubled.csc[4+j,2], col="red", lty=2)
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(Finland.doubled.ca, col=c("lightblue","red"), labels=c(0,2), font=2, map="rowgreen")
Finland.doubled.ccc <- Finland.doubled.ca$colcoord * sqrt(Finland.doubled.ca$colmass)
for(j in 1:4) segments(Finland.doubled.ccc[j,1],Finland.doubled.ccc[j,2],
Finland.doubled.ccc[4+j,1],Finland.doubled.ccc[4+j,2], col="red", lty=2)
Figure 12
The polar positions in contribution coordinates are shown on Figure 13.
# just the polar positions in contribution coordinates
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(Finland.doubled.ca, what=c("none","all"), labels=c(0,2), font=2, map="rowgreen")
for(j in 1:4) segments(Finland.doubled.ccc[j,1],Finland.doubled.ccc[j,2],
Finland.doubled.ccc[4+j,1],Finland.doubled.ccc[4+j,2], col="red", lty=2)
Figure 13
The demographic group averages are shown in Figure 14.
Finland.doubled.rpc <- Finland.doubled.ca$rowcoord %*% diag(Finland.doubled.ca$sv)
Figure 14
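The plotting call used for Figure 14 is not reproduced here; a minimal sketch of how the group averages could be computed and added to the map (an assumed reconstruction based on the ga groups) is:
# sketch (assumed): centroids of the 12 gender-age (ga) groups in the doubled-data map
Finland.doubled.cen <- apply(Finland.doubled.rpc[, 1:2], 2,
                             function(x) tapply(x, Finland$ga, mean))
plot(Finland.doubled.ca, labels=c(0,2), what=c("none","all"))
text(Finland.doubled.cen, labels=c(paste0("ma",1:6), paste0("fa",1:6)),
     col=c(rep("blue",6), rep("red",6)), font=2)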
The demographic group averages and confidence regions are shown in Figure 15.
# add centroids of demographic categories
Finland.doubled.rpc <- Finland.doubled.ca$rowcoord %*% diag(Finland.doubled.ca$sv)
# with confidence ellipses
source("confidenceplots.R")
require(ellipse)
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(Finland.doubled.ca, labels=c(0,2), what=c("none","all"))
Finland.doubled.cpc <- Finland.doubled.ca$colcoord %*% diag(Finland.doubled.ca$sv)
for(j in 1:4) segments(Finland.doubled.cpc[j,1],Finland.doubled.cpc[j,2],
Finland.doubled.cpc[4+j,1],Finland.doubled.cpc[4+j,2], col="red", lty=2)
confidenceplots(Finland.doubled.rpc[,1], Finland.doubled.rpc[,2], group=Finland$ga, groupcols=c(rep("blue",6),rep("red",6)), shade=T,
groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=T)
Figure 15
Next we draw the confidence ellipses by themselves; the result is shown in Figure 16.
#ellipses just by themselves (watch out for aspect ratio distortion! I have to correct this in next version of my function)
confidenceplots(Finland.doubled.rpc[,1], Finland.doubled.rpc[,2], group=Finland$ga, groupcols=c(rep("blue",6),rep("red",6)), shade=T,
groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=F)
Figure 16
The regular PCA plot is shown next in Figure 17:
Finland.pca <- prcomp(Finland[,1:4])
names(Finland.pca)
## [1] "sdev" "rotation" "center" "scale" "x"
# [1] "sdev" "rotation" "center" "scale" "x"
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(rbind(Finland.pca$x, 10.5*Finland.pca$rotation), type="n", asp=1, xlab="PCA dim 1 (37.5%)", ylab="PCA dim 2 (17.1%)")
abline(h=0, col="gray", lty=2)
abline(v=0, col="gray", lty=2)
points(Finland.pca$x, pch=19, col="lightblue", cex=0.9)
arrows(0, 0, 10*Finland.pca$rotation[,1], 10*Finland.pca$rotation[,2], length=0.1, angle=10, col="pink", lwd=2)
text(10.3*Finland.pca$rotation, labels=colnames(Finland[,1:4]), col="red", font=4)
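The variance percentages quoted in the axis labels can be verified directly from the prcomp output:
# percentage of total variance explained by each principal component
round(100 * Finland.pca$sdev^2 / sum(Finland.pca$sdev^2), 1)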
The R function factanal() performs the analysis on standardized variables.
The correlation matrix is:
round(cor(Finland[,1:4]),3)
## A B C D
## A 1.000 0.527 -0.236 -0.057
## B 0.527 1.000 -0.508 -0.138
## C -0.236 -0.508 1.000 0.291
## D -0.057 -0.138 0.291 1.000
Since 2 factors are too many for 4 variables (the likelihood-ratio test would have negative degrees of freedom), we have to set factors = 1.
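For p variables and k factors, the test has ((p - k)^2 - (p + k))/2 degrees of freedom, which must be non-negative. A quick check for p = 4:
# degrees of freedom of the k-factor model for p = 4 variables
p <- 4
sapply(1:2, function(k) ((p - k)^2 - (p + k)) / 2)
## [1]  2 -1
With one factor there are 2 degrees of freedom (as reported in the output below), while a 2-factor model would have -1 and is therefore not identified.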
Finland.fa <- factanal(Finland[,1:4], factors=1, rotation="none", scores="regression")
names(Finland.fa)
## [1] "converged" "loadings" "uniquenesses" "correlation"
## [5] "criteria" "factors" "dof" "method"
## [9] "scores" "STATISTIC" "PVAL" "n.obs"
## [13] "call"
print(Finland.fa)
##
## Call:
## factanal(x = Finland[, 1:4], factors = 1, scores = "regression", rotation = "none")
##
## Uniquenesses:
## A B C D
## 0.715 0.028 0.735 0.979
##
## Loadings:
## Factor1
## A 0.534
## B 0.986
## C -0.515
## D -0.144
##
## Factor1
## SS loadings 1.543
## Proportion Var 0.386
##
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 65.38 on 2 degrees of freedom.
## The p-value is 6.35e-15
Factor analysis results:
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(rbind(-Finland.fa$loadings, -0.5*Finland.fa$scores), asp=1, type="n", xlab="FA factor 1 (26.6%)", ylab="FA factor 2 (7.3%)")
abline(h=0, col="gray", lty=2)
abline(v=0, col="gray", lty=2)
points(-0.5*Finland.fa$scores, pch=19, col="lightblue", cex=0.9)
arrows(0, 0, -Finland.fa$loadings[,1], -Finland.fa$loadings[,1], length=0.1, angle=10, col="pink", lwd=2)
text(-1.05*Finland.fa$loadings, labels=colnames(Finland[,1:4]), col="red", font=4)
FA with varimax rotation (with a single factor the rotation has no effect, so the solution is identical):
Finland.fa <- factanal(Finland[,1:4], factors=1, rotation="varimax", scores="regression")
print(Finland.fa)
##
## Call:
## factanal(x = Finland[, 1:4], factors = 1, scores = "regression", rotation = "varimax")
##
## Uniquenesses:
## A B C D
## 0.715 0.028 0.735 0.979
##
## Loadings:
## Factor1
## A 0.534
## B 0.986
## C -0.515
## D -0.144
##
## Factor1
## SS loadings 1.543
## Proportion Var 0.386
##
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 65.38 on 2 degrees of freedom.
## The p-value is 6.35e-15
Plot FA results:
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(rbind(-Finland.fa$loadings, -0.5*Finland.fa$scores), asp=1, type="n", xlab="FA factor 1 (18.6%)", ylab="FA factor 2 (15.3%)")
abline(h=0, col="gray", lty=2)
abline(v=0, col="gray", lty=2)
points(-0.5*Finland.fa$scores, pch=19, col="lightblue", cex=0.9)
arrows(0, 0, -Finland.fa$loadings[,1], -Finland.fa$loadings[,1], length=0.1, angle=10, col="pink", lwd=2)
text(-1.05*Finland.fa$loadings, labels=colnames(Finland[,1:4]), col="red", font=4)
Finland is the set of complete cases
require(ca)
Finland.B <- mjca(Finland[,1:4], ps="")$Burt
Finland.Z <- mjca(Finland[,1:4], ps="", reti=T)$indmat
Finland.csc <- mjca(Finland[,1:4], ps="")$colcoord
rownames(Finland.Z) <- 1:nrow(Finland.Z)
colnames(Finland.Z) <- colnames(Finland.B)
# the row principal coordinates are obtained by dividing by the number of
# variables used in the mjca() call -- here the 4 substantive variables A-D
Finland.rpc <- Finland.Z %*% Finland.csc/4
summary(mjca(Finland[,1:4]))
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.116786 44.9 44.9 *************
## 2 0.090602 34.8 79.7 **********
## 3 0.009457 3.6 83.3 *
## 4 0.006314 2.4 85.7 *
## 5 0.001032 0.4 86.1
## 6 0.000624 0.2 86.4
## 7 0.000000 0.0 86.4
## -------- -----
## Total: 0.260281
##
##
## Columns:
## name mass qlt inr k=1 cor ctr k=2 cor ctr
## 1 | A:1 | 12 848 57 | -374 159 15 | 778 689 83 |
## 2 | A:2 | 58 847 46 | -351 769 61 | 112 78 8 |
## 3 | A:3 | 85 545 38 | -90 122 6 | -168 423 26 |
## 4 | A:4 | 58 375 44 | 68 45 2 | -184 329 22 |
## 5 | A:5 | 37 830 63 | 785 759 193 | 240 71 23 |
## 6 | B:1 | 34 834 65 | -333 121 32 | 807 713 245 |
## 7 | B:2 | 67 738 47 | -368 666 78 | -120 71 11 |
## 8 | B:3 | 47 543 46 | -115 100 5 | -242 443 30 |
## 9 | B:4 | 62 460 45 | 135 127 10 | -219 333 33 |
## 10 | B:5 | 40 804 66 | 836 782 236 | 141 22 9 |
## 11 | C:1 | 118 902 39 | 379 899 145 | -25 4 1 |
## 12 | C:2 | 91 898 42 | -312 611 76 | -214 288 46 |
## 13 | C:3 | 17 636 52 | -429 588 27 | 122 47 3 |
## 14 | C:4 | 12 756 54 | -542 510 30 | 376 246 19 |
## 15 | C:5 | 12 845 66 | -207 20 4 | 1327 825 231 |
## 16 | D:1 | 48 970 47 | 382 924 60 | 85 46 4 |
## 17 | D:2 | 94 829 36 | -119 227 11 | -193 602 39 |
## 18 | D:3 | 56 277 42 | -11 4 0 | -89 273 5 |
## 19 | D:4 | 38 557 46 | -149 305 7 | 135 252 8 |
## 20 | D:5 | 14 960 59 | -59 3 0 | 1008 956 155 |
We use 4 dimensions (which is also the maximum that factor analysis would allow) and loop over the k-means algorithm to decide how many clusters to use.
Finland.BW <- rep(0, 10)
for(nc in 2:10) {
Finland.km <- kmeans(Finland.rpc[,1:4], centers=nc, nstart=20, iter.max=200)
Finland.BW[nc] <- Finland.km$betweenss/Finland.km$totss
}
Finland.BW
## [1] 0.0000000 0.2336672 0.4374201 0.5789785 0.6871295 0.7217150 0.7460767
## [8] 0.7698848 0.7823934 0.8072644
Plot the proportion of between-cluster variance:
par(mar=c(4.2,4,1,2), cex.axis=0.8, mfrow=c(1,2))
plot(Finland.BW, xlab="Nr of clusters", ylab="BSS/TSS")
Plot the increments in between-cluster variance:
Finland.BWinc <- Finland.BW[2:10]-Finland.BW[1:9]
plot(1:10, c(NA,NA, Finland.BWinc[2:9]), xlab="Nr of clusters", ylab="improvement")
lines(3:10, Finland.BWinc[2:9], col="red", lwd=2)
The plots suggest that a 5-cluster solution is a good choice.
Finland.km5 <- kmeans(Finland.rpc[,1:4], centers=5, nstart=20, iter.max=200)
Finland.km5$betweenss/Finland.km5$totss
## [1] 0.6871295
Cluster sizes:
Finland.km5$size
## [1] 212 105 230 210 167
We relate the clusters to the substantive variables A-D in Finland by averaging the original responses within each cluster:
Finland.means <- matrix(0,nrow=5,ncol=4)
rownames(Finland.means) <- c("clus1","clus2","clus3","clus4","clus5")
colnames(Finland.means) <- colnames(Finland)[1:4]
for(j in 1:4) Finland.means[,j] <- tapply(Finland[,j], Finland.km5$cluster, mean)
round(Finland.means,1)
## A B C D
## clus1 2.8 2.8 1.8 2.5
## clus2 2.2 1.2 3.4 3.3
## clus3 3.6 3.7 1.5 2.5
## clus4 2.6 2.1 2.1 2.4
## clus5 4.4 4.8 1.1 2.2
Homogeneity Analysis in R: The Package homals
Package ‘homals’
Oleg Nenadic and Michael Greenacre, Computation of Multiple Correspondence Analysis, with code in R
Michael Greenacre, Biplots in Practice
Mike Bendixen, A Practical Guide to the Use of Correspondence Analysis in Marketing Research, Marketing Bulletin, 2003, 14, Technical Note 2.
Michael Greenacre, Correspondence Analysis in Practice, Third Edition
Writing Mathematical Formulas in Markdown