MULTIPLE CORRESPONDANCE ANALYSIS - Assignment 1

conducted by Neli Noykova.

The data

In this exercise set we use Finnish sample from ISSP 2012 survey “Family and Changing Gender Roles IV”. Original data involve 1171 observations of 8 variables (4 substantive and 4 demographic). All variables are categorical.

The 4 substantive variables, which values are masured in 1-5 scale, are:

A: Married people are generally happier than unmarried people. B: People who want children ought to get married. C: It is all right for a couple to live together without intending to get married. D: Divorce is usually the best solution when a couple can’t seem to work out their marriage problems.

The demographic variables are: g: gender (1=male, 2=female) a: age group (1=16-25, 2=26-35, 3=36-45, 4=46-55, 5=56-65, 6= 66+) e: education (1=Primary, 2=Comprehensive, primary and lower secondary, 3= Post-comprehensive, vocational school or course, 4=General upper secondary education or certificate, 5= Vocational post-secondary non-tertiary education, 6=Polytechnics , 7= University, lower academic degree, BA, 8=University, higher academic degree, MA p: Living in steady partnership (1=Yes, have partner; live in same household, 2=Yes, have partner; don’t live in same household, 3=No partner)

The data wrangling includes the following changes:

  1. The missing data ae removed (this has already been provided). The number of observations without missing data is N=924.

  2. For Task 2 a combined variable ga, describing the intraction of gender and age categories, is formed as: **ga = 6*(g-1) + a**

Graphical overview of the data and summaries of the variables

The preliminary treated data look as:

Finland <- read.table("Finland.txt")
head(Finland)
##   A B C D g a e p
## 1 3 3 1 2 1 2 4 3
## 2 3 2 3 2 1 4 2 3
## 3 3 3 1 3 1 3 8 1
## 4 3 2 2 3 2 2 6 1
## 5 2 2 2 3 2 4 5 1
## 6 3 3 1 3 1 3 7 3
dim(Finland)
## [1] 924   8
str(Finland)
## 'data.frame':    924 obs. of  8 variables:
##  $ A: int  3 3 3 3 2 3 3 2 2 3 ...
##  $ B: int  3 2 3 2 2 3 3 3 3 3 ...
##  $ C: int  1 3 1 2 2 1 1 1 2 3 ...
##  $ D: int  2 2 3 3 3 3 2 2 3 3 ...
##  $ g: int  1 1 1 2 2 1 2 2 2 2 ...
##  $ a: int  2 4 3 2 4 3 1 2 5 4 ...
##  $ e: int  4 2 8 6 5 7 4 8 7 5 ...
##  $ p: int  3 3 1 1 1 3 3 3 1 3 ...

Task 1: Cross-tabulations and correspondance analysis (CA).

The goal is to discover interesting relations among the measurements on A, …, D, g, …, p. Correspondence analysis is a descriptive/exploratory technique designed to analyse simple two-way and multi-way tables containing some measure of correspondence between the rows and columns. CA may also be defined as a special case of Principal Components Analysis of the rows and columns of a table, especially applicable to a cross-tabulation. Principal components analysis is used for tables consisting of continuous measurement, whereas CA is applied to contingency tables (i.e. cross-tabulations). The goal of CA is to transform a table of numerical information into a graphical display, in which each row and each column is depicted as a point. Thus CA can be used for data visualization or data pre-processing before applying methods for supervised learning. We can apply this technique for dimensionality reduction.

First we provide a cross-tabulation between the age variable a and answer A: Married people are generally happier than unmarried people. The variable a consists of 6 age groups, corresponding to rownames of crosstable: a1 = 16-25, a2=26-35, a3 = 36-45, a4 = 46-55, a5 = 56-65, a6 = 66+ The categorical variable A assume 5 possible values, which correspond to column names of cross-table: SA - strongly agree; A - agree, NN - neither agree or disagree, D - disagree, SD - strongly disagree.

tab <-table(Finland[,"a"], Finland[,"A"])

rownames(tab) <- c("a1","a2","a3","a4","a5","a6") 
colnames(tab) <- c("SA","A","NN","D","SD")

tab
##     
##      SA  A NN  D SD
##   a1  5 18 42 29 29
##   a2  6 25 44 34 34
##   a3  6 30 55 30 17
##   a4  9 37 60 46 34
##   a5 15 67 58 49 15
##   a6  5 37 56 26  6

From this table we see that the biggest part of age group a5 agree with question A. Since these values are absolute, we need more precise information wich could be obtained row and column profiles, calculated from this cross-table. We need also to calculate the chi-square statistic as it has been shown durint the second exercise session.

An alternative means of extracting the nature of the dependency between the rows and columns of the contingency table is to represent the row or column profiles graphically.

require(ca)
## Loading required package: ca
plot(ca(tab))

This plot is assymetric because the row profiles are plotted simultaneously with apexes representing the columns. a3 is strongly “associated” with answer NN, which is clearly the case from the profile presented in the cross-table. Likewise we observe some proximity of a5 to the apex representing A answer to question A. The fact that a1, a2 and a4 are positioned relatively far away from answers SA and A means that these younger people rather disagree then agree with the statement A. The age group a6 is far from all answers of A, which probably means that people from this age group do not care so much about statement A.

As a second example we conduct cross-tabulation between the variable p (if there is a partner and they live together) and answer **C*: C: It is all right for a couple to live together without intending to get married. The obtained cross-table and resulting plot from CA are:

#Cross-tabulations
#between g and D
tab2 <-table(Finland[,"p"], Finland[,"C"])

rownames(tab2) <- c("Live together","Not live together", "No partner") 
colnames(tab2) <- c("SA","A","NN","D","SD")
tab2
##                    
##                      SA   A  NN   D  SD
##   Live together     291 247  44  32  29
##   Not live together  26  10   5   1   0
##   No partner        119  79  15  11  15
require(ca)
plot(ca(tab2))

From the cross-tabulation and CA plot becomes clear that people, who live together, tend to give positive answer to the statement C.

As a third example we investigate relationship between education e and statement A:

tab1 <-table(Finland[,"e"], Finland[,"A"])
rownames(tab1) <- c("Primary","Comprehensive", "Post-comprehensive", "General", "Vocational", "Polytechnics", "University, B", "University, MSc") 
colnames(tab1) <- c("SA","A","NN","D","SD")
tab1
##                     
##                      SA  A NN  D SD
##   Primary             3 28 34 14  3
##   Comprehensive       6  9 18 16 10
##   Post-comprehensive 16 37 71 54 26
##   General             1 20 39 18 18
##   Vocational          4 49 63 53 39
##   Polytechnics        7 25 28 22 16
##   University, B       2 11 23 16  5
##   University, MSc     7 35 39 21 18
require(ca)
plot(ca(tab1))

From this plot becomes clear that no one of the educational gruops strongly afrees with the statement A. The highly educated people with university degrees tend to answer using middle alternative.

Task 2: Cross-tabulation of ga demographic variable against one of the substantive variables. Performing CA.

The variable ga, which is the interaction of gender and age categories, is created as:

Finland$ga <- 6*(Finland$g-1) + Finland$a
head(Finland)
##   A B C D g a e p ga
## 1 3 3 1 2 1 2 4 3  2
## 2 3 2 3 2 1 4 2 3  4
## 3 3 3 1 3 1 3 8 1  3
## 4 3 2 2 3 2 2 6 1  8
## 5 2 2 2 3 2 4 5 1 10
## 6 3 3 1 3 1 3 7 3  3
str(Finland)
## 'data.frame':    924 obs. of  9 variables:
##  $ A : int  3 3 3 3 2 3 3 2 2 3 ...
##  $ B : int  3 2 3 2 2 3 3 3 3 3 ...
##  $ C : int  1 3 1 2 2 1 1 1 2 3 ...
##  $ D : int  2 2 3 3 3 3 2 2 3 3 ...
##  $ g : int  1 1 1 2 2 1 2 2 2 2 ...
##  $ a : int  2 4 3 2 4 3 1 2 5 4 ...
##  $ e : int  4 2 8 6 5 7 4 8 7 5 ...
##  $ p : int  3 3 1 1 1 3 3 3 1 3 ...
##  $ ga: num  2 4 3 8 10 3 7 8 11 10 ...

The cross-tabulation between ga and A is computed as:

tab3 <-table(Finland[,"ga"], Finland[,"C"])
rownames(tab3) <- c("Ma1","Ma2", "Ma3", "Ma4", "Ma5", "Ma6", "Fa1","Fa2", "Fa3", "Fa4", "Fa5", "Fa6") 
colnames(tab3) <- c("SA","A","NN","D","SD")
tab3
##      
##       SA  A NN  D SD
##   Ma1 30 11  3  0  4
##   Ma2 46 15  0  1  0
##   Ma3 26 31  0  4  1
##   Ma4 38 38  7  2  2
##   Ma5 29 36  9  7  3
##   Ma6  8 37  6  8  2
##   Fa1 38 21  1  6  9
##   Fa2 59 16  1  0  5
##   Fa3 43 25  4  2  2
##   Fa4 63 24  7  2  3
##   Fa5 42 45 18  8  7
##   Fa6 14 37  8  4  6

The CA between ga and A:

require(ca)
plot(ca(tab3))

From this plot becomes clear that female of age group 4 most often strongly agree with the statement A, while males in age group 5 most often agree with A.

Task 3:Computation of the Burt matrix on the questions and demographics together.

The most classical and standard approach to MCA is to apply a simple CA to the indicatormatrix Z. The indicator matrix Z = {zij} corresponds to a binary coding of the factors - instead of using a factor with Jq levels one uses Jq columns containing binary values, also called dummy variables. The Burt matrix C is obtained directly from the indicator matrix Z: C = ZTZ. Burt matrix concatenates all two-way cross-tabulations between pairs of variables

The Burt matrix is computed as:

require(ca)
Finland.B <- mjca(Finland)$Burt
head(Finland.B)
##     A:1 A:2 A:3 A:4 A:5 B:1 B:2 B:3 B:4 B:5 C:1 C:2 C:3 C:4 C:5 D:1 D:2
## A:1  46   0   0   0   0  28  12   4   2   0  15  13   6   6   6  13  15
## A:2   0 214   0   0   0  50  92  38  25   9  64 101  17  15  17  28  91
## A:3   0   0 315   0   0  24  92  95  83  21 145 120  30  14   6  55 115
## A:4   0   0   0 214   0  15  42  30  97  30 104  88  10   9   3  41  84
## A:5   0   0   0   0 135   9  10   6  24  86 108  14   1   0  12  40  43
## B:1  28  50  24  15   9 126   0   0   0   0  26  34  17  15  34  29  24
##     D:3 D:4 D:5 g:1 g:2 a:1 a:2 a:3 a:4 a:5 a:6 e:1 e:2 e:3 e:4 e:5 e:6
## A:1   5   7   6  28  18   5   6   6   9  15   5   3   6  16   1   4   7
## A:2  46  36  13 113 101  18  25  30  37  67  37  28   9  37  20  49  25
## A:3  88  43  14 136 179  42  44  55  60  58  56  34  18  71  39  63  28
## A:4  41  43   5  85 129  29  34  30  46  49  26  14  16  54  18  53  22
## A:5  27  12  13  42  93  29  34  17  34  15   6   3  10  26  18  39  16
## B:1  20  29  24  59  67  15   8  16  16  41  30  15  10  27  12  32   7
##     e:7 e:8 p:1 p:2 p:3 ga:1 ga:2 ga:3 ga:4 ga:5 ga:6 ga:7 ga:8 ga:9 ga:10
## A:1   2   7  35   1  10    3    3    4    6   10    2    2    3    2     3
## A:2  11  35 171   8  35   10   12   16   22   29   24    8   13   14    15
## A:3  23  39 224  12  79   16   21   21   32   21   25   26   23   34    28
## A:4  16  21 141  14  59    8   14   13   20   20   10   21   20   17    26
## A:5   5  18  72   7  56   11   12    8    7    4    0   18   22    9    27
## B:1   4  19  97   2  27    7    3    8    8   19   14    8    5    8     8
##     ga:11 ga:12
## A:1     5     3
## A:2    38    13
## A:3    37    31
## A:4    29    16
## A:5    11     6
## B:1    22    16
summary(Finland.B)
##       A:1              A:2              A:3              A:4        
##  Min.   : 0.000   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 3.000   1st Qu.: 13.00   1st Qu.: 21.00   1st Qu.: 14.00  
##  Median : 6.000   Median : 25.00   Median : 34.00   Median : 22.00  
##  Mean   : 8.118   Mean   : 37.76   Mean   : 55.59   Mean   : 37.76  
##  3rd Qu.:10.000   3rd Qu.: 38.00   3rd Qu.: 67.00   3rd Qu.: 44.50  
##  Max.   :46.000   Max.   :214.00   Max.   :315.00   Max.   :214.00  
##       A:5              B:1              B:2              B:3        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  6.50   1st Qu.:  8.00   1st Qu.: 14.00   1st Qu.: 10.00  
##  Median : 13.00   Median : 16.00   Median : 27.00   Median : 20.00  
##  Mean   : 23.82   Mean   : 22.24   Mean   : 43.76   Mean   : 30.53  
##  3rd Qu.: 28.00   3rd Qu.: 27.50   3rd Qu.: 52.00   3rd Qu.: 37.50  
##  Max.   :135.00   Max.   :126.00   Max.   :248.00   Max.   :173.00  
##       B:4              B:5              C:1              C:2        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 13.50   1st Qu.:  5.00   1st Qu.: 26.00   1st Qu.: 19.50  
##  Median : 25.00   Median : 16.00   Median : 59.00   Median : 36.00  
##  Mean   : 40.76   Mean   : 25.76   Mean   : 76.94   Mean   : 59.29  
##  3rd Qu.: 50.50   3rd Qu.: 31.50   3rd Qu.:102.50   3rd Qu.: 77.00  
##  Max.   :231.00   Max.   :146.00   Max.   :436.00   Max.   :336.00  
##       C:3             C:4              C:5              D:1        
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.: 3.00   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 12.00  
##  Median : 7.00   Median : 5.000   Median : 5.000   Median : 22.00  
##  Mean   :11.29   Mean   : 7.765   Mean   : 7.765   Mean   : 31.24  
##  3rd Qu.:17.00   3rd Qu.:12.000   3rd Qu.: 9.500   3rd Qu.: 38.00  
##  Max.   :64.00   Max.   :44.000   Max.   :44.000   Max.   :177.00  
##       D:2              D:3              D:4              D:5      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0  
##  1st Qu.: 22.50   1st Qu.: 15.00   1st Qu.: 10.00   1st Qu.: 3.5  
##  Median : 37.00   Median : 22.00   Median : 15.00   Median : 6.0  
##  Mean   : 61.41   Mean   : 36.53   Mean   : 24.88   Mean   : 9.0  
##  3rd Qu.: 85.00   3rd Qu.: 45.50   3rd Qu.: 30.50   3rd Qu.:11.0  
##  Max.   :348.00   Max.   :207.00   Max.   :141.00   Max.   :51.0  
##       g:1              g:2              a:1              a:2        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 26.50   1st Qu.: 33.50   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median : 61.00   Median : 76.00   Median : 13.00   Median : 17.00  
##  Mean   : 71.29   Mean   : 91.76   Mean   : 21.71   Mean   : 25.24  
##  3rd Qu.: 87.00   3rd Qu.:117.00   3rd Qu.: 30.00   3rd Qu.: 35.50  
##  Max.   :404.00   Max.   :520.00   Max.   :123.00   Max.   :143.00  
##       a:3              a:4              a:5             a:6        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.0   1st Qu.:  0.00  
##  Median : 16.00   Median : 16.00   Median : 16.0   Median : 11.00  
##  Mean   : 24.35   Mean   : 32.82   Mean   : 36.0   Mean   : 22.94  
##  3rd Qu.: 33.00   3rd Qu.: 46.50   3rd Qu.: 53.5   3rd Qu.: 31.00  
##  Max.   :138.00   Max.   :186.00   Max.   :204.0   Max.   :130.00  
##       e:1             e:2             e:3             e:4       
##  Min.   : 0.00   Min.   : 0.00   Min.   :  0.0   Min.   : 0.00  
##  1st Qu.: 1.00   1st Qu.: 2.50   1st Qu.: 12.5   1st Qu.: 4.00  
##  Median : 8.00   Median : 7.00   Median : 25.0   Median :11.00  
##  Mean   :14.47   Mean   :10.41   Mean   : 36.0   Mean   :16.94  
##  3rd Qu.:22.50   3rd Qu.:15.50   3rd Qu.: 48.0   3rd Qu.:22.00  
##  Max.   :82.00   Max.   :59.00   Max.   :204.0   Max.   :96.00  
##       e:5              e:6             e:7             e:8        
##  Min.   :  0.00   Min.   : 0.00   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:  9.00   1st Qu.: 3.50   1st Qu.: 2.00   1st Qu.:  5.00  
##  Median : 24.00   Median :13.00   Median : 6.00   Median : 16.00  
##  Mean   : 36.71   Mean   :17.29   Mean   :10.06   Mean   : 21.18  
##  3rd Qu.: 51.00   3rd Qu.:22.50   3rd Qu.:13.00   3rd Qu.: 28.00  
##  Max.   :208.00   Max.   :98.00   Max.   :57.00   Max.   :120.00  
##       p:1             p:2              p:3              ga:1       
##  Min.   :  0.0   Min.   : 0.000   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.: 46.0   1st Qu.: 3.000   1st Qu.: 16.00   1st Qu.: 0.000  
##  Median : 88.0   Median : 5.000   Median : 34.00   Median : 3.000  
##  Mean   :113.5   Mean   : 7.412   Mean   : 42.18   Mean   : 8.471  
##  3rd Qu.:142.0   3rd Qu.: 9.500   3rd Qu.: 53.50   3rd Qu.:11.000  
##  Max.   :643.0   Max.   :42.000   Max.   :239.00   Max.   :48.000  
##       ga:2            ga:3            ga:4            ga:5      
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00  
##  Median : 3.00   Median : 5.00   Median : 7.00   Median : 7.00  
##  Mean   :10.94   Mean   :10.94   Mean   :15.35   Mean   :14.82  
##  3rd Qu.:14.50   3rd Qu.:14.50   3rd Qu.:21.00   3rd Qu.:19.00  
##  Max.   :62.00   Max.   :62.00   Max.   :87.00   Max.   :84.00  
##       ga:6            ga:7            ga:8            ga:9      
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00  
##  Median : 4.00   Median : 6.00   Median : 5.00   Median : 4.00  
##  Mean   :10.76   Mean   :13.24   Mean   :14.29   Mean   :13.41  
##  3rd Qu.:12.50   3rd Qu.:19.50   3rd Qu.:19.50   3rd Qu.:17.50  
##  Max.   :61.00   Max.   :75.00   Max.   :81.00   Max.   :76.00  
##      ga:10           ga:11            ga:12      
##  Min.   : 0.00   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 0.00   1st Qu.:  0.00   1st Qu.: 0.00  
##  Median : 6.00   Median : 10.00   Median : 5.00  
##  Mean   :17.47   Mean   : 21.18   Mean   :12.18  
##  3rd Qu.:24.00   3rd Qu.: 27.00   3rd Qu.:15.50  
##  Max.   :99.00   Max.   :120.00   Max.   :69.00

We take the stacked tables of A and D substantive variables from the whole Burt matrix:

Finland.AD = Finland.B[c(1:5,16:20), c(1:5,16:20)]

The principal intervals (eigenvalues) are:

summary(Finland.AD)
##       A:1            A:2             A:3              A:4       
##  Min.   : 0.0   Min.   :  0.0   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.: 0.0   1st Qu.:  0.0   1st Qu.:  0.00   1st Qu.:  0.0  
##  Median : 5.5   Median : 20.5   Median : 28.50   Median : 23.0  
##  Mean   : 9.2   Mean   : 42.8   Mean   : 63.00   Mean   : 42.8  
##  3rd Qu.:11.5   3rd Qu.: 43.5   3rd Qu.: 79.75   3rd Qu.: 42.5  
##  Max.   :46.0   Max.   :214.0   Max.   :315.00   Max.   :214.0  
##       A:5              D:1              D:2              D:3        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median : 12.50   Median : 20.50   Median : 29.00   Median : 16.00  
##  Mean   : 27.00   Mean   : 35.40   Mean   : 69.60   Mean   : 41.40  
##  3rd Qu.: 36.75   3rd Qu.: 40.75   3rd Qu.: 89.25   3rd Qu.: 44.75  
##  Max.   :135.00   Max.   :177.00   Max.   :348.00   Max.   :207.00  
##       D:4              D:5      
##  Min.   :  0.00   Min.   : 0.0  
##  1st Qu.:  0.00   1st Qu.: 0.0  
##  Median :  9.50   Median : 5.5  
##  Mean   : 28.20   Mean   :10.2  
##  3rd Qu.: 41.25   3rd Qu.:13.0  
##  Max.   :141.00   Max.   :51.0

The computation of MCA is the application of the (simple) CA algorithm to the Burt matrix C.

The following CA is conducted:

plot(Finland.AD, main="MCA=CA of Burt")

Task 4: Multiple correspondence analysis (MCA) on the four substantive questions.

Here I use the script from Exercise set 2 about joint correspondance analysis and its fit to two-way tables:

Finland.mca <- mjca(Finland[,1:4], ps="", lambda="JCA")
par(mar=c(4.2,4,3,1), mfrow=c(1,1), font.lab=2)
plot(Finland.mca, main="Joint correspondence analysis")

Task 5: MCA on the four substantive questions: computing the case points, and adding confidence ellipses for some demographic groups.

The example code from Exercise set 3 is used, only data are different.

5.1 MCA on the four substantive questions

MCA is the CA of indicator matrix:

require(ca)
Finland.mca1 <- mjca(Finland[,1:4], lambda="indicator", ps="", reti=T)
sum(Finland.mca1$sv^2)
## [1] 4
summary(Finland.mca1)
## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.506305  12.7  12.7  ***                      
##  2      0.475751  11.9  24.6  ***                      
##  3      0.322936   8.1  32.6  **                       
##  4      0.309594   7.7  40.4  **                       
##  5      0.274095   6.9  47.2  **                       
##  6      0.268731   6.7  53.9  **                       
##  7      0.250166   6.3  60.2  **                       
##  8      0.236639   5.9  66.1  *                        
##  9      0.225387   5.6  71.7  *                        
##  10     0.217098   5.4  77.2  *                        
##  11     0.209046   5.2  82.4  *                        
##  12     0.186252   4.7  87.0  *                        
##  13     0.162414   4.1  91.1  *                        
##  14     0.142888   3.6  94.7  *                        
##  15     0.125436   3.1  97.8  *                        
##  16     0.087262   2.2 100.0  *                        
##         -------- -----                                 
##  Total: 4.000000 100.0                                 
## 
## 
## Columns:
##      name   mass  qlt  inr     k=1 cor ctr    k=2 cor ctr  
## 1  |   A1 |   12  198   57 |  -779  32  15 | 1783 167  83 |
## 2  |   A2 |   58  181   46 |  -730 161  61 |  257  20   8 |
## 3  |   A3 |   85   95   38 |  -187  18   6 | -384  76  26 |
## 4  |   A4 |   58   60   44 |   142   6   2 | -421  54  22 |
## 5  |   A5 |   37  509   63 |  1635 458 193 |  550  52  23 |
## 6  |   B1 |   34  616   65 |  -694  76  32 | 1850 540 245 |
## 7  |   B2 |   67  244   47 |  -767 216  78 | -276  28  11 |
## 8  |   B3 |   47   84   46 |  -239  13   5 | -555  71  30 |
## 9  |   B4 |   62  110   45 |   281  26  10 | -501  84  33 |
## 10 |   B5 |   40  588   66 |  1740 568 236 |  322  19   9 |
## 11 |   C1 |  118  559   39 |   789 556 145 |  -57   3   1 |
## 12 |   C2 |   91  379   42 |  -650 241  76 | -491 138  46 |
## 13 |   C3 |   17   65   52 |  -894  59  27 |  279   6   3 |
## 14 |   C4 |   12  101   54 | -1130  64  30 |  863  37  19 |
## 15 |   C5 |   12  472   66 |  -430   9   4 | 3041 462 231 |
## 16 |   D1 |   48  159   47 |   795 150  60 |  196   9   4 |
## 17 |   D2 |   94  155   36 |  -247  37  11 | -443 118  39 |
## 18 |   D3 |   56   12   42 |   -23   0   0 | -203  12   5 |
## 19 |   D4 |   38   35   46 |  -310  17   7 |  310  17   8 |
## 20 |   D5 |   14  313   59 |  -123   1   0 | 2310 312 155 |

We calculate: - the number of categories:

J <- 4*5 -1

The number of substantive questions is Q=4. Then:

Q <- 4 
(J-Q)/Q
## [1] 3.75

We check the contents of these components of the mjca object (eigenvalues (squared singular values),standard coordinates of columns, standard coordinates of first 10 rows, variable names, category names, first 10 rows of 924 x 20 indicator matrix, first 20 rows and columns of 20 x 20 Burt matrix) :

names(Finland.mca1)
##  [1] "sv"         "lambda"     "inertia.e"  "inertia.t"  "inertia.et"
##  [6] "levelnames" "factors"    "levels.n"   "nd"         "nd.max"    
## [11] "rownames"   "rowmass"    "rowdist"    "rowinertia" "rowcoord"  
## [16] "rowpcoord"  "rowctr"     "rowcor"     "colnames"   "colmass"   
## [21] "coldist"    "colinertia" "colcoord"   "colpcoord"  "colctr"    
## [26] "colcor"     "colsup"     "subsetcol"  "Burt"       "Burt.upd"  
## [31] "subinertia" "JCA.iter"   "indmat"     "call"
Finland.mca1$sv^2                  # eigenvalues (squared singular values) on all 31 dimensions
##  [1] 0.50630492 0.47575147 0.32293568 0.30959403 0.27409529 0.26873127
##  [7] 0.25016609 0.23663902 0.22538662 0.21709794 0.20904570 0.18625197
## [13] 0.16241364 0.14288849 0.12543559 0.08726229
Finland.mca1$colcoord[,1:4]        # standard coordinates of columns
##              [,1]        [,2]        [,3]       [,4]
##  [1,] -1.09485075  2.58466228  0.38117469 -1.0611142
##  [2,] -1.02632480  0.37232989 -0.47157445  1.4201562
##  [3,] -0.26345610 -0.55714283 -0.80286444 -1.3121928
##  [4,]  0.19957060 -0.61095108  2.14671089 -0.1271033
##  [5,]  2.29834966  0.79755898 -0.91193285  1.3736196
##  [6,] -0.97470923  2.68220803  0.48209843 -0.6463525
##  [7,] -1.07792715 -0.39988664 -0.66328834  1.5112705
##  [8,] -0.33602699 -0.80484225 -1.22794762 -1.6797485
##  [9,]  0.39518935 -0.72610435  1.92930121 -0.7637834
## [10,]  2.44509059  0.46699649 -0.88686667  1.1895602
## [11,]  1.10913902 -0.08237802  0.03901019 -0.2765203
## [12,] -0.91287622 -0.71121198  0.25165618  0.6290356
## [13,] -1.25648834  0.40439859 -2.40903951 -1.9962482
## [14,] -1.58745288  1.25079576  0.30637839  1.2613963
## [15,] -0.60443232  4.40835272  0.88938548 -0.4212425
## [16,]  1.11702020  0.28392642 -0.07957831 -0.3599156
## [17,] -0.34677975 -0.64190277 -0.07750072  0.8506859
## [18,] -0.03261598 -0.29495380 -0.87687783 -0.9136392
## [19,] -0.43594396  0.44926861  1.62187492 -0.1096170
## [20,] -0.17281598  3.34972059 -0.11990279 -0.5442025
Finland.mca1$rowcoord[1:10,1:4]    # standard coordinates of first 10 rows
##              [,1]       [,2]       [,3]        [,4]
##  [1,]  0.05722575 -0.7561703 -0.9103450 -1.08632489
##  [2,] -1.03458890 -0.4329605 -1.7389019 -0.42526271
##  [3,]  0.16760566 -0.6304181 -1.2620137 -1.87904947
##  [4,] -0.80348254 -0.7115631 -0.9200550 -0.03842743
##  [5,] -1.07151274 -0.3746743 -0.7743111  1.18923762
##  [6,]  0.16760566 -0.6304181 -1.2620137 -1.87904947
##  [7,]  0.05722575 -0.7561703 -0.9103450 -1.08632489
##  [8,] -0.21080445 -0.4192814 -0.7646011  0.14134015
##  [9,] -0.81084974 -0.5214511 -1.0227208 -0.24451133
## [10,] -0.66354598 -0.4539851 -2.3389802 -2.65173623
Finland.mca1$colnames              # variable names
## [1] "A" "B" "C" "D"
Finland.mca1$levelnames            # category names
##  [1] "A1" "A2" "A3" "A4" "A5" "B1" "B2" "B3" "B4" "B5" "C1" "C2" "C3" "C4"
## [15] "C5" "D1" "D2" "D3" "D4" "D5"
Finland.mca1$indmat[1:10,]         # first 10 rows of 924 x 20 indicator matrix
##    A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 C1 C2 C3 C4 C5 D1 D2 D3 D4 D5
## 1   0  0  1  0  0  0  0  1  0  0  1  0  0  0  0  0  1  0  0  0
## 2   0  0  1  0  0  0  1  0  0  0  0  0  1  0  0  0  1  0  0  0
## 3   0  0  1  0  0  0  0  1  0  0  1  0  0  0  0  0  0  1  0  0
## 4   0  0  1  0  0  0  1  0  0  0  0  1  0  0  0  0  0  1  0  0
## 5   0  1  0  0  0  0  1  0  0  0  0  1  0  0  0  0  0  1  0  0
## 6   0  0  1  0  0  0  0  1  0  0  1  0  0  0  0  0  0  1  0  0
## 7   0  0  1  0  0  0  0  1  0  0  1  0  0  0  0  0  1  0  0  0
## 8   0  1  0  0  0  0  0  1  0  0  1  0  0  0  0  0  1  0  0  0
## 9   0  1  0  0  0  0  0  1  0  0  0  1  0  0  0  0  0  1  0  0
## 10  0  0  1  0  0  0  0  1  0  0  0  0  1  0  0  0  0  1  0  0
Finland.mca1$Burt[1:20, 1:20]      # first 20 rows and columns of 20 x 20 Burt matrix
##    A1  A2  A3  A4  A5  B1  B2  B3  B4  B5  C1  C2 C3 C4 C5  D1  D2  D3  D4
## A1 46   0   0   0   0  28  12   4   2   0  15  13  6  6  6  13  15   5   7
## A2  0 214   0   0   0  50  92  38  25   9  64 101 17 15 17  28  91  46  36
## A3  0   0 315   0   0  24  92  95  83  21 145 120 30 14  6  55 115  88  43
## A4  0   0   0 214   0  15  42  30  97  30 104  88 10  9  3  41  84  41  43
## A5  0   0   0   0 135   9  10   6  24  86 108  14  1  0 12  40  43  27  12
## B1 28  50  24  15   9 126   0   0   0   0  26  34 17 15 34  29  24  20  29
## B2 12  92  92  42  10   0 248   0   0   0  52 141 27 25  3  30 119  53  37
## B3  4  38  95  30   6   0   0 173   0   0  79  75 17  1  1  28  75  45  23
## B4  2  25  83  97  24   0   0   0 231   0 142  82  3  3  1  41  90  55  37
## B5  0   9  21  30  86   0   0   0   0 146 137   4  0  0  5  49  40  34  15
## C1 15  64 145 104 108  26  52  79 142 137 436   0  0  0  0 128 137  93  57
## C2 13 101 120  88  14  34 141  75  82   4   0 336  0  0  0  35 175  73  51
## C3  6  17  30  10   1  17  27  17   3   0   0   0 64  0  0   7  22  22   7
## C4  6  15  14   9   0  15  25   1   3   0   0   0  0 44  0   3  12  12  11
## C5  6  17   6   3  12  34   3   1   1   5   0   0  0  0 44   4   2   7  15
## D1 13  28  55  41  40  29  30  28  41  49 128  35  7  3  4 177   0   0   0
## D2 15  91 115  84  43  24 119  75  90  40 137 175 22 12  2   0 348   0   0
## D3  5  46  88  41  27  20  53  45  55  34  93  73 22 12  7   0   0 207   0
## D4  7  36  43  43  12  29  37  23  37  15  57  51  7 11 15   0   0   0 141
## D5  6  13  14   5  13  24   9   2   8   8  21   2  6  6 16   0   0   0   0
##    D5
## A1  6
## A2 13
## A3 14
## A4  5
## A5 13
## B1 24
## B2  9
## B3  2
## B4  8
## B5  8
## C1 21
## C2  2
## C3  6
## C4  6
## C5 16
## D1  0
## D2  0
## D3  0
## D4  0
## D5 51

Next we compute positions of cases and plot them.

Finland.rpc <- Finland.mca1$indmat %*% Finland.mca1$colcoord[,1:2] / 4
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), mfrow=c(1,1), font.lab=2)
plot(Finland.mca1, labels=c(0,0), map="rowprincipal", col=c("black","white"))
points(Finland.rpc, pch=19, cex=0.5, col="lightblue")
text(Finland.mca1$colcoord, labels=Finland.mca1$levelnames, col="red", font=4, cex=0.8)

The MCA is calculated as a CA of Burt matrix:

Finland.mca2 <- mjca(Finland[,1:4], lambda="Burt", ps="")
sum(Finland.mca2$sv^2)
## [1] 1.195211
summary(Finland.mca2)
## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.256345  21.4  21.4  *****                    
##  2      0.226339  18.9  40.4  *****                    
##  3      0.104287   8.7  49.1  **                       
##  4      0.095848   8.0  57.1  **                       
##  5      0.075128   6.3  63.4  **                       
##  6      0.072216   6.0  69.5  **                       
##  7      0.062583   5.2  74.7  *                        
##  8      0.055998   4.7  79.4  *                        
##  9      0.050799   4.3  83.6  *                        
##  10     0.047132   3.9  87.6  *                        
##  11     0.043700   3.7  91.2  *                        
##  12     0.034690   2.9  94.1  *                        
##  13     0.026378   2.2  96.3  *                        
##  14     0.020417   1.7  98.0                           
##  15     0.015734   1.3  99.4                           
##  16     0.007615   0.6 100.0                           
##         -------- -----                                 
##  Total: 1.195211 100.0                                 
## 
## 
## Columns:
##      name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
## 1  |   A1 |   12  335   57 | -554  57  15 | 1230 278  83 |
## 2  |   A2 |   58  317   46 | -520 284  61 |  177  33   8 |
## 3  |   A3 |   85  165   38 | -133  33   6 | -265 132  26 |
## 4  |   A4 |   58  104   44 |  101  11   2 | -291  93  22 |
## 5  |   A5 |   37  724   63 | 1164 654 193 |  379  70  23 |
## 6  |   B1 |   34  825   65 | -494 107  32 | 1276 718 245 |
## 7  |   B2 |   67  400   47 | -546 357  78 | -190  43  11 |
## 8  |   B3 |   47  148   46 | -170  24   5 | -383 124  30 |
## 9  |   B4 |   62  186   45 |  200  47  10 | -345 139  33 |
## 10 |   B5 |   40  790   66 | 1238 765 236 |  222  25   9 |
## 11 |   C1 |  118  793   39 |  562 789 145 |  -39   4   1 |
## 12 |   C2 |   91  589   42 | -462 384  76 | -338 206  46 |
## 13 |   C3 |   17  123   52 | -636 113  27 |  192  10   3 |
## 14 |   C4 |   12  184   54 | -804 119  30 |  595  65  19 |
## 15 |   C5 |   12  681   66 | -306  14   4 | 2097 666 231 |
## 16 |   D1 |   48  288   47 |  566 273  60 |  135  16   4 |
## 17 |   D2 |   94  270   36 | -176  67  11 | -305 203  39 |
## 18 |   D3 |   56   22   42 |  -17   0   0 | -140  22   5 |
## 19 |   D4 |   38   65   46 | -221  34   7 |  214  32   8 |
## 20 |   D5 |   14  502   59 |  -87   2   0 | 1594 500 155 |

The default plot shows principal coordinates:

plot(Finland.mca2) 

The adjusted MCA is the default of mjca function:

Finland.mca3 <- mjca(Finland[,1:4], ps="")
summary(Finland.mca3)
## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.116786  44.9  44.9  *************            
##  2      0.090602  34.8  79.7  **********               
##  3      0.009457   3.6  83.3  *                        
##  4      0.006314   2.4  85.7  *                        
##  5      0.001032   0.4  86.1                           
##  6      0.000624   0.2  86.4                           
##  7      00000000   0.0  86.4                           
##         -------- -----                                 
##  Total: 0.260281                                       
## 
## 
## Columns:
##      name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
## 1  |   A1 |   12  848   57 | -374 159  15 |  778 689  83 |
## 2  |   A2 |   58  847   46 | -351 769  61 |  112  78   8 |
## 3  |   A3 |   85  545   38 |  -90 122   6 | -168 423  26 |
## 4  |   A4 |   58  375   44 |   68  45   2 | -184 329  22 |
## 5  |   A5 |   37  830   63 |  785 759 193 |  240  71  23 |
## 6  |   B1 |   34  834   65 | -333 121  32 |  807 713 245 |
## 7  |   B2 |   67  738   47 | -368 666  78 | -120  71  11 |
## 8  |   B3 |   47  543   46 | -115 100   5 | -242 443  30 |
## 9  |   B4 |   62  460   45 |  135 127  10 | -219 333  33 |
## 10 |   B5 |   40  804   66 |  836 782 236 |  141  22   9 |
## 11 |   C1 |  118  902   39 |  379 899 145 |  -25   4   1 |
## 12 |   C2 |   91  898   42 | -312 611  76 | -214 288  46 |
## 13 |   C3 |   17  636   52 | -429 588  27 |  122  47   3 |
## 14 |   C4 |   12  756   54 | -542 510  30 |  376 246  19 |
## 15 |   C5 |   12  845   66 | -207  20   4 | 1327 825 231 |
## 16 |   D1 |   48  970   47 |  382 924  60 |   85  46   4 |
## 17 |   D2 |   94  829   36 | -119 227  11 | -193 602  39 |
## 18 |   D3 |   56  277   42 |  -11   4   0 |  -89 273   5 |
## 19 |   D4 |   38  557   46 | -149 305   7 |  135 252   8 |
## 20 |   D5 |   14  960   59 |  -59   3   0 | 1008 956 155 |

A joint correspondence analysis is used for optimal fit to two-way tables:

Finland.mca4 <- mjca(Finland[,1:4], ps="", lambda="JCA")
par(mar=c(4.2,4,3,1), mfrow=c(1,1), font.lab=2)
plot(Finland.mca4, main="Joint correspondence analysis")

summary(Finland.mca4)
## 
## Principal inertias (eigenvalues):
## 
##  dim    value   
##  1      0.156300
##  2      0.101794
##  3      0.006669
##  4      0.005138
##  5      0.003304
##  6      0.000545
##  7      0.000228
##         --------
##  Total: 0.285989
## 
##  Diagonal inertia discounted from eigenvalues: 0.090779
##  Percentage explained by JCA in 2 dimensions: 85.7%
##  (Eigenvalues are not nested)
##  [Iterations in JCA: 50 , epsilon = 0.0008615]
## 
## 
## Columns:
##      name   mass  inr    k=1   k=2   cor ctr  
## 1  |   A1 |   12   57 | -522  -606 | 856  42 |
## 2  |   A2 |   58   46 | -364   -25 | 879  37 |
## 3  |   A3 |   85   38 |  -74   170 | 564  14 |
## 4  |   A4 |   58   44 |  103   145 | 362  10 |
## 5  |   A5 |   37   63 |  765  -381 | 912 121 |
## 6  |   B1 |   34   65 | -722  -880 | 970 136 |
## 7  |   B2 |   67   47 | -457   209 | 848  52 |
## 8  |   B3 |   47   46 |  -87   300 | 600  17 |
## 9  |   B4 |   62   45 |  255   225 | 528  21 |
## 10 |   B5 |   40   66 | 1100  -309 | 954 151 |
## 11 |   C1 |  118   39 |  377   -41 | 943  80 |
## 12 |   C2 |   91   42 | -259   280 | 918  60 |
## 13 |   C3 |   17   52 | -460   -53 | 711  17 |
## 14 |   C4 |   12   54 | -606  -293 | 793  24 |
## 15 |   C5 |   12   66 | -482 -1367 | 917 104 |
## 16 |   D1 |   48   47 |  270  -145 | 745  25 |
## 17 |   D2 |   94   36 |  -59   201 | 795  21 |
## 18 |   D3 |   56   42 |   10    81 | 257   2 |
## 19 |   D4 |   38   46 | -135  -103 | 487   6 |
## 20 |   D5 |   14   59 | -206  -911 | 917  60 |

5.2 Demographics is superimposed on an MCA

Call the used code about confidence plots, package ellipse, and calculated individual points:

source("confidenceplots.R")
require(ellipse)
## Loading required package: ellipse
Finland.rpc <- Finland.mca1$indmat %*% Finland.mca1$colcoord[,1:2] / 4

The demographic groups are: education e, partnership p and ga groups.Their confidence plots are :

par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), mfrow=c(1,1), font.lab=2)
plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"))
points(Finland.rpc, pch=19, cex=0.5, col="lightblue")
text(Finland.mca1$colcoord, labels=Finland.mca1$levelnames, col="red", font=4, cex=0.8)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"e"]), groupcols=rep("forestgreen",6), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"p"]), groupcols=rep("purple",6), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=rep("chocolate",6), shownames=T, add=T)

lines(matrix(c(-0.5,-0.6, -0.5,0.4, 0.5,0.4, 0.5,-0.6, -0.5,-0.6), ncol=2, byrow=T), lty=2, col="gray30", lwd=2)

We zoom in to the group means and confidence ellipses:

plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"), xlim=c(-0.5,0.5), ylim=c(-0.6,0.4))
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"e"]), groupcols=rep("forestgreen",6), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"p"]), groupcols=rep("purple",6), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=rep("chocolate",6), shownames=T, add=T)

plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"), xlim=c(-0.5,0.5), ylim=c(-0.6,0.4))
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=c(rep("blue",6),rep("red",6)), 
                groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=T)

And add shading:

plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"), xlim=c(-0.5,0.5), ylim=c(-0.6,0.4))
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=c(rep("blue",6),rep("red",6)), shade=T,
                groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=T)

Next we go back to original plot, with category labels, gender-age labels slightly smaller (cex=0.8)

par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), mfrow=c(1,1), font.lab=2)
plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"))
points(Finland.rpc, pch=19, cex=0.5, col="lightblue")
text(Finland.mca1$colcoord, labels=Finland.mca1$levelnames, col="red", font=4, cex=0.8)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=c(rep("blue",6),rep("red",6)), shade=T,
                groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=T, cex=0.8)

Back to other demographic variable, education:

par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), mfrow=c(1,1), font.lab=2)
plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"))
points(Finland.rpc, pch=19, cex=0.5, col="lightblue")
text(Finland.mca1$colcoord, labels=Finland.mca1$levelnames, col="red", font=4, cex=0.8)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"e"]), groupcols=rep("forestgreen",6), shade=T,
                shownames=T, add=T, cex=0.8)

Plot confidence ellipses for all demographic groups together in a zoom:

plot(Finland.mca3, labels=c(0,0), map="rowprincipal", col=c("black","white"), xlim=c(-0.5,0.5), ylim=c(-0.6,0.4))
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"p"]), groupcols=rep("purple",6), shade=T,
                groupnames=c("liveTog","liveSep","Single"), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"e"]), groupcols=rep("forestgreen",6), shade=T,
                groupnames=c("e0","e1","e2","e3","e4","e5"), shownames=T, add=T)
confidenceplots(Finland.rpc[,1], Finland.rpc[,2], group=factor(Finland[,"ga"]), groupcols=c(rep("blue",6),rep("red",6)), shade=T,
                groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=T)

MULTIPLE CORRESPONDANCE ANALYSIS - Assignment 2

The data

As in the previous Assignment 1, here we again use Finnish sample from ISSP 2012 survey “Family and Changing Gender Roles IV”. Original data involve 1171 observations of 8 variables (4 substantive and 4 demographic). All variables are categorical.

The 4 substantive variables, which values are measured in 1-5 scale, are:

A: Married people are generally happier than unmarried people. B: People who want children ought to get married. C: It is all right for a couple to live together without intending to get married. D: Divorce is usually the best solution when a couple can’t seem to work out their marriage problems.

The demographic variables are: g: gender (1=male, 2=female) a: age group (1=16-25, 2=26-35, 3=36-45, 4=46-55, 5=56-65, 6= 66+) e: education (1=Primary, 2=Comprehensive, primary and lower secondary, 3= Post-comprehensive, vocational school or course, 4=General upper secondary education or certificate, 5= Vocational post-secondary non-tertiary education, 6=Polytechnics , 7= University, lower academic degree, BA, 8=University, higher academic degree, MA p: Living in steady partnership (1=Yes, have partner; live in same household, 2=Yes, have partner; don’t live in same household, 3=No partner)

Here we do not provide preliminary data treatment. We include all original variables, and also all missing data, denoted with number 9. During the data analyses we simply assign a new category, 9, to the missing data.

Graphical overview of the data and summaries of the variables

The original data, including missing values denoted by number 9, look as:

Finland2 <- read.table("FinlandWithMissing.txt")
head(Finland2)
##   A B C D g a e p
## 1 3 3 1 2 1 2 4 3
## 2 3 2 3 2 1 4 2 3
## 3 3 3 1 3 1 3 8 1
## 4 9 4 1 1 2 2 4 2
## 5 3 2 2 3 2 2 6 1
## 6 9 4 2 1 2 5 6 3
dim(Finland2)
## [1] 1171    8
str(Finland2)
## 'data.frame':    1171 obs. of  8 variables:
##  $ A: int  3 3 3 9 3 9 2 1 3 3 ...
##  $ B: int  3 2 3 4 2 4 2 1 3 3 ...
##  $ C: int  1 3 1 1 2 2 2 1 1 1 ...
##  $ D: int  2 2 3 1 3 1 3 9 3 2 ...
##  $ g: int  1 1 1 2 2 2 2 2 1 2 ...
##  $ a: int  2 4 3 2 2 5 4 2 3 1 ...
##  $ e: int  4 2 8 4 6 6 5 5 7 4 ...
##  $ p: int  3 3 1 2 1 3 1 3 3 3 ...

Task 1: Multiple correspondence analysis (MCA) on the four substantive questions. Comparing the results with the case of excluded missing data.

As it was noted during the lectures, since we analyze the data at the nominal level, a missing value is treated as an additional category (in the example here category 9), which is included during the subset analysis.

We show the results for both data sets (without and with missing data) using the default symmetric plot, where both coordinates are principle. We recall the main formulas from MCA (from lecture slides) in order to show the mathematical expression of these coordinates.

If we denote the data table (Finland.txt or FinlandWithMissing.txt) as N, then the correspondence matrix P is obtained as:

\[\mathbf{P} = \left( \frac{1}{n} \right)\mathbf{N} \]

If we denote the row and column marginal totals (masses) of P as r and c respectively, and \[\mathbf{D_r}\] and \[\mathbf{D_c}\] are the diagonal matrices of these masses, then after applying singular value decomposition (SVD) we obtain:

\[\mathbf{S} = \mathbf{D^{\left(-1/2\right)}_r} \mathbf{\left(P-rc^T \right)}\mathbf{D^{\left(-1/2\right)}_c} \] which is equivalent to

\[\mathbf{S} = \mathbf{D^{\left(1/2\right)}_r} \mathbf{\left(D^{\left(-1\right)}_rPD^{\left(-1\right)}_c-11^T\right)} \mathbf{D^{\left(1/2\right)}_c} \]

According SVD \[\mathbf{S} = \mathbf{UD_\alpha V^T}\] Then the principal coordinates are presented as: - for rows:

\[\mathbf{F} = \mathbf{D^{\left(-1/2\right)}_rUD_{\alpha}}\] - for columns:

\[\mathbf{G} = \mathbf{D^{\left(-1/2\right)}_c V D_{\alpha}}\]

The total variance (inertia) is the sum of squares of the elements of \[\mathbf{S} = \mathbf{trace\left(SS^T\right) = {\chi}^2/n}\]

The Chi-square distance \[{\chi}^2 = \sum_{i=1}^r\sum_{j=1}^c\frac{\left(fo_{ij}-fe_{ij}\right)^2}{fe_{ij}}\]

In this expression the observed and expected frequencies of the cell in row i and column j are denoted by \[fo_{ij}\] and \[fe_{ij}\] respectively.

We first provide MCA on data Finland.txt, where missed data are excluded and draw the default symmetric plot.

require(ca)
Finland <- read.table("Finland.txt")
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), cex.axis=0.8, font.lab=2, mfrow=c(1,1))
plot(mjca(Finland[,1:4], ps=""), mass=c(F,T))

Next perform the same MCA analysis and plot, but using also missed data as separate category = 9:

par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), cex.axis=0.8, font.lab=2, mfrow=c(1,1))
plot(mjca(Finland2[,1:4], ps=""), mass=c(F,T))

We observe that the missing category 9 for all 4 substantive questions is situated on the right down part of the biplot, where both principal coordinates take negative values.

The symmetric plot represents the row and column profiles simultaneously in a common space. According Mike Bendixen (Marketing Bulletin, 2003, 14) this plot may lead to misinterpretation if examined in isolation or only visually because principal coordinates are presented for both rows and columns. These coordinates represent the row and column profiles and not the apexes for which the standard coordinates are required. Therefore in such cases asymmetric plots are recommended.

Task 2: Burt matrix. Performing CA using a subset of non-missing rows and columns. Confidence regions of the demographic groups.

The part of Burt matrix on first 4 substantive questions can be obtained as:

require(ca)
Finland2.B <- mjca(Finland2[,1:4])$Burt

Since the missing values are involved in this matrix as separate category, we need to take the following part of Burt matrix:

Finland2.ABCD = Finland2.B[c(1:24), c(1:24)]
Finland2.ABCD
##     A:1 A:2 A:3 A:4 A:5 A:9 B:1 B:2 B:3 B:4 B:5 B:9 C:1 C:2 C:3 C:4 C:5
## A:1  59   0   0   0   0   0  32  14   5   4   1   3  21  15   7   7   6
## A:2   0 233   0   0   0   0  53  99  38  27  12   4  69 109  20  17  18
## A:3   0   0 344   0   0   0  28 100 105  87  22   2 153 134  32  14   7
## A:4   0   0   0 231   0   0  16  47  30 102  33   3 112  93  11  10   3
## A:5   0   0   0   0 152   0   9  12   7  27  96   1 121  17   1   0  12
## A:9   0   0   0   0   0 152  12  41  23  23  28  25  69  49   8   3   4
## B:1  32  53  28  16   9  12 150   0   0   0   0   0  29  41  21  18  38
## B:2  14  99 100  47  12  41   0 313   0   0   0   0  66 175  34  29   4
## B:3   5  38 105  30   7  23   0   0 208   0   0   0  98  88  19   1   1
## B:4   4  27  87 102  27  23   0   0   0 270   0   0 161  99   4   3   1
## B:5   1  12  22  33  96  28   0   0   0   0 192   0 178   6   0   0   6
## B:9   3   4   2   3   1  25   0   0   0   0   0  38  13   8   1   0   0
## C:1  21  69 153 112 121  69  29  66  98 161 178  13 545   0   0   0   0
## C:2  15 109 134  93  17  49  41 175  88  99   6   8   0 417   0   0   0
## C:3   7  20  32  11   1   8  21  34  19   4   0   1   0   0  79   0   0
## C:4   7  17  14  10   0   3  18  29   1   3   0   0   0   0   0  51   0
## C:5   6  18   7   3  12   4  38   4   1   1   6   0   0   0   0   0  50
## C:9   3   0   4   2   1  19   3   5   1   2   2  16   0   0   0   0   0
## D:1  15  31  58  45  41  26  30  39  34  46  58   9 154  43   8   4   4
## D:2  17  97 121  86  44  47  32 146  82 104  45   3 158 208  27  13   3
## D:3   6  49  96  41  27  25  24  61  55  59  41   4 110  83  28  13   7
## D:4   7  37  43  44  12  16  33  39  26  41  19   1  65  55   9  13  16
## D:5   7  13  14   5  13   6  25  11   3   9   9   1  23   4   6   7  17
## D:9   7   6  12  10  15  32   6  17   8  11  20  20  35  24   1   1   3
##     C:9 D:1 D:2 D:3 D:4 D:5 D:9
## A:1   3  15  17   6   7   7   7
## A:2   0  31  97  49  37  13   6
## A:3   4  58 121  96  43  14  12
## A:4   2  45  86  41  44   5  10
## A:5   1  41  44  27  12  13  15
## A:9  19  26  47  25  16   6  32
## B:1   3  30  32  24  33  25   6
## B:2   5  39 146  61  39  11  17
## B:3   1  34  82  55  26   3   8
## B:4   2  46 104  59  41   9  11
## B:5   2  58  45  41  19   9  20
## B:9  16   9   3   4   1   1  20
## C:1   0 154 158 110  65  23  35
## C:2   0  43 208  83  55   4  24
## C:3   0   8  27  28   9   6   1
## C:4   0   4  13  13  13   7   1
## C:5   0   4   3   7  16  17   3
## C:9  29   3   3   3   1   1  18
## D:1   3 216   0   0   0   0   0
## D:2   3   0 412   0   0   0   0
## D:3   3   0   0 244   0   0   0
## D:4   1   0   0   0 159   0   0
## D:5   1   0   0   0   0  58   0
## D:9  18   0   0   0   0   0  82
summary(Finland2.ABCD)
##       A:1              A:2              A:3              A:4        
##  Min.   : 0.000   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 2.500   1st Qu.:  3.00   1st Qu.:  3.50   1st Qu.:  2.75  
##  Median : 6.500   Median : 19.00   Median : 25.00   Median : 13.50  
##  Mean   : 9.833   Mean   : 38.83   Mean   : 57.33   Mean   : 38.50  
##  3rd Qu.:14.250   3rd Qu.: 50.00   3rd Qu.: 97.00   3rd Qu.: 45.50  
##  Max.   :59.000   Max.   :233.00   Max.   :344.00   Max.   :231.00  
##       A:5              A:9              B:1              B:2        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.75   1st Qu.:  3.75   1st Qu.:  5.25   1st Qu.:  4.75  
##  Median : 12.00   Median : 21.00   Median : 22.50   Median : 31.50  
##  Mean   : 25.33   Mean   : 25.33   Mean   : 25.00   Mean   : 52.17  
##  3rd Qu.: 27.00   3rd Qu.: 29.00   3rd Qu.: 32.00   3rd Qu.: 62.25  
##  Max.   :152.00   Max.   :152.00   Max.   :150.00   Max.   :313.00  
##       B:3              B:4              B:5             B:9        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.:  1.00   1st Qu.:  1.75   1st Qu.:  0.0   1st Qu.: 0.000  
##  Median : 13.50   Median : 17.00   Median : 10.5   Median : 2.500  
##  Mean   : 34.67   Mean   : 45.00   Mean   : 32.0   Mean   : 6.333  
##  3rd Qu.: 42.25   3rd Qu.: 66.00   3rd Qu.: 35.0   3rd Qu.: 8.250  
##  Max.   :208.00   Max.   :270.00   Max.   :192.0   Max.   :38.000  
##       C:1              C:2             C:3             C:4      
##  Min.   :  0.00   Min.   :  0.0   Min.   : 0.00   Min.   : 0.0  
##  1st Qu.: 19.00   1st Qu.:  5.5   1st Qu.: 0.75   1st Qu.: 0.0  
##  Median : 67.50   Median : 42.0   Median : 7.50   Median : 3.5  
##  Mean   : 90.83   Mean   : 69.5   Mean   :13.17   Mean   : 8.5  
##  3rd Qu.:129.00   3rd Qu.: 94.5   3rd Qu.:20.25   3rd Qu.:13.0  
##  Max.   :545.00   Max.   :417.0   Max.   :79.00   Max.   :51.0  
##       C:5              C:9              D:1              D:2        
##  Min.   : 0.000   Min.   : 0.000   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 0.750   1st Qu.: 0.750   1st Qu.:  3.75   1st Qu.:  3.00  
##  Median : 4.000   Median : 2.000   Median : 28.00   Median : 38.00  
##  Mean   : 8.333   Mean   : 4.833   Mean   : 36.00   Mean   : 68.67  
##  3rd Qu.: 8.250   3rd Qu.: 3.250   3rd Qu.: 43.50   3rd Qu.: 98.75  
##  Max.   :50.000   Max.   :29.000   Max.   :216.00   Max.   :412.00  
##       D:3              D:4             D:5              D:9       
##  Min.   :  0.00   Min.   :  0.0   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.:  3.75   1st Qu.:  1.0   1st Qu.: 1.000   1st Qu.: 1.00  
##  Median : 26.00   Median : 16.0   Median : 6.500   Median : 9.00  
##  Mean   : 40.67   Mean   : 26.5   Mean   : 9.667   Mean   :13.67  
##  3rd Qu.: 56.00   3rd Qu.: 39.5   3rd Qu.:13.000   3rd Qu.:18.50  
##  Max.   :244.00   Max.   :159.0   Max.   :58.000   Max.   :82.00

For our work we need also to calculate the indicator matrix, obtained also via the MCA function mjca:

Finland2.Z <- mjca(Finland2[,1:4], ps="", reti=T)$indmat
head(Finland2.Z)
##   A1 A2 A3 A4 A5 A9 B1 B2 B3 B4 B5 B9 C1 C2 C3 C4 C5 C9 D1 D2 D3 D4 D5 D9
## 1  0  0  1  0  0  0  0  0  1  0  0  0  1  0  0  0  0  0  0  1  0  0  0  0
## 2  0  0  1  0  0  0  0  1  0  0  0  0  0  0  1  0  0  0  0  1  0  0  0  0
## 3  0  0  1  0  0  0  0  0  1  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0
## 4  0  0  0  0  0  1  0  0  0  1  0  0  1  0  0  0  0  0  1  0  0  0  0  0
## 5  0  0  1  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0  0  0  1  0  0  0
## 6  0  0  0  0  0  1  0  0  0  1  0  0  0  1  0  0  0  0  1  0  0  0  0  0
dim(Finland2.Z)
## [1] 1171   24

The relationship between both matrices is: \[\mathbf{B} = \mathbf{Z^TZ}\] The non-missing categories are obtained by excluding the missing ones. Further analysis is provided using subset CA:

Finland2.nonmissing <- -c(6,12,18,24)
Finland2.mca1 <- ca(Finland2.B, subsetrow=c(Finland2.nonmissing), subsetcol=c(Finland2.nonmissing))
plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2)

summary(Finland2.mca1)
## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.243700  20.7  20.7  *****                    
##  2      0.210498  17.9  38.6  ****                     
##  3      0.103194   8.8  47.4  **                       
##  4      0.095390   8.1  55.5  **                       
##  5      0.075342   6.4  61.9  **                       
##  6      0.070659   6.0  67.9  **                       
##  7      0.062970   5.4  73.3  *                        
##  8      0.057778   4.9  78.2  *                        
##  9      0.049824   4.2  82.4  *                        
##  10     0.047380   4.0  86.5  *                        
##  11     0.043140   3.7  90.1  *                        
##  12     0.036612   3.1  93.2  *                        
##  13     0.029151   2.5  95.7  *                        
##  14     0.022491   1.9  97.6                           
##  15     0.017302   1.5  99.1                           
##  16     0.008924   0.8  99.9                           
##  17     0.001268   0.1 100.0                           
##  18     0.000285   0.0 100.0                           
##  19     5.3e-050   0.0 100.0                           
##  20     1.3e-050   0.0 100.0                           
##         -------- -----                                 
##  Total: 1.175975 100.0                                 
## 
## 
## Rows:
##      name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
## 1  |   A1 |   13  293   56 | -522  52  14 | 1121 241  75 |
## 2  |   A2 |   50  299   46 | -545 271  61 |  172  27   7 |
## 3  |   A3 |   73  166   39 | -149  36   7 | -285 130  28 |
## 4  |   A4 |   49   88   45 |  113  12   3 | -284  76  19 |
## 5  |   A5 |   32  680   61 | 1152 602 177 |  417  79  27 |
## 6  |   B1 |   32  805   64 | -584 145  45 | 1243 660 235 |
## 7  |   B2 |   67  397   47 | -528 340  76 | -216  57  15 |
## 8  |   B3 |   44  136   47 | -132  14   3 | -390 122  32 |
## 9  |   B4 |   58  163   45 |  196  42   9 | -336 122  31 |
## 10 |   B5 |   41  782   64 | 1169 742 230 |  271  40  14 |
## 11 |   C1 |  116  797   39 |  563 796 151 |   -8   0   0 |
## 12 |   C2 |   89  588   42 | -448 358  73 | -360 231  55 |
## 13 |   C3 |   17  130   53 | -674 123  31 |  152   6   2 |
## 14 |   C4 |   11  197   55 | -865 126  33 |  650  71  22 |
## 15 |   C5 |   11  658   65 | -437  27   8 | 2130 631 230 |
## 16 |   D1 |   46  262   47 |  544 250  56 |  122  13   3 |
## 17 |   D2 |   88  277   37 | -197  79  14 | -312 198  41 |
## 18 |   D3 |   52   21   43 |  -24   1   0 | -140  20   5 |
## 19 |   D4 |   34   65   47 | -216  28   6 |  243  36  10 |
## 20 |   D5 |   12  467   58 | -194   7   2 | 1595 461 150 |
## 
## Columns:
##      name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
## 1  |   A1 |   13  293   56 | -522  52  14 | 1121 241  75 |
## 2  |   A2 |   50  299   46 | -545 271  61 |  172  27   7 |
## 3  |   A3 |   73  166   39 | -149  36   7 | -285 130  28 |
## 4  |   A4 |   49   88   45 |  113  12   3 | -284  76  19 |
## 5  |   A5 |   32  680   61 | 1152 602 177 |  417  79  27 |
## 6  |   B1 |   32  805   64 | -584 145  45 | 1243 660 235 |
## 7  |   B2 |   67  397   47 | -528 340  76 | -216  57  15 |
## 8  |   B3 |   44  136   47 | -132  14   3 | -390 122  32 |
## 9  |   B4 |   58  163   45 |  196  42   9 | -336 122  31 |
## 10 |   B5 |   41  782   64 | 1169 742 230 |  271  40  14 |
## 11 |   C1 |  116  797   39 |  563 796 151 |   -8   0   0 |
## 12 |   C2 |   89  588   42 | -448 358  73 | -360 231  55 |
## 13 |   C3 |   17  130   53 | -674 123  31 |  152   6   2 |
## 14 |   C4 |   11  197   55 | -865 126  33 |  650  71  22 |
## 15 |   C5 |   11  658   65 | -437  27   8 | 2130 631 230 |
## 16 |   D1 |   46  262   47 |  544 250  56 |  122  13   3 |
## 17 |   D2 |   88  277   37 | -197  79  14 | -312 198  41 |
## 18 |   D3 |   52   21   43 |  -24   1   0 | -140  20   5 |
## 19 |   D4 |   34   65   47 | -216  28   6 |  243  36  10 |
## 20 |   D5 |   12  467   58 | -194   7   2 | 1595 461 150 |

This plot is quite similar to the plot, obtained when missing data are removed from the data set.

Next we add responding points to the plot. For this purpose we first transform the corresponding frequencies in the table and draw a sequence of points at the specified coordinates:

Finland2.sum <- apply(Finland2.Z[,c(Finland2.nonmissing)], 1, sum) 
Finland2.sum[Finland2.sum==0] <- 1
Finland2.rpc <- Finland2.Z[,c(Finland2.nonmissing)] %*% Finland2.mca1$colcoord / Finland2.sum
plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc, pch=19, col="lightblue", cex=0.8)

Next we compute confidence regions of demographic groups. For this purpose we have to add group ellipses and use Prof. Greenacre’s program “confidenceplots”, which was given separately.

require(ellipse)
source("confidenceplots.R")

First we draw confidence plots for the demographic group partnership by generation asymmetric plot, adding points and finally draw confidence plots.

plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc[Finland2$p<4,1], Finland2.rpc[Finland2$p<4,2], group=Finland2$p[Finland2$e<4], groupcols=rep("purple",5), shade=T,
                groupnames=c("ph1","p2","nop3"), shownames=T, add=T)

Next we draw confidence plot for the demographic group education.

plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc[Finland2$e<9,1], Finland2.rpc[Finland2$e<9,2], group=Finland2$e[Finland2$e<9], groupcols=rep("forestgreen",8), shade=T,
                groupnames=c("e1","e2","e3","e4","e5","e6","e7","e8"), shownames=T, add=T)

The confidence plot for the demographic group gender is:

plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc[Finland2$g<3,1], Finland2.rpc[Finland2$g<3,2], group=Finland2$g[Finland2$g<3], groupcols=rep("blue",6), shade=T,
                groupnames=c("m","f"), shownames=T, add=T)

The confidence plot for the demographic group age is:

plot(Finland2.mca1, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc[Finland2$a<7,1], Finland2.rpc[Finland2$a<7,2], group=Finland2$a[Finland2$a<7], groupcols=rep("blue",6), shade=T,
                groupnames=c("a1","a2","a3","a4","a5","a6"), shownames=T, add=T)

If we plot only the confidence plots (just the ellipses) for every demographic group, the plots will look as follow:

plot(Finland2.rpc, type="n", asp=1, xlab="Subset CA dim1 (20.7%)", ylab="Subset CA dim1 (17.9%)", xlim=c(-0.5,0.5), ylim=c(-0.5,0.5))
abline(v=0, h=0, lty=2, col="grey") 
confidenceplots(Finland2.rpc[Finland2$p<4,1], Finland2.rpc[Finland2$p<4,2], group=Finland2$p[Finland2$p<4], groupcols=rep("purple",5), shade=T,
                groupnames=c("ph1","p2","nop3"), shownames=T, add=T)

plot(Finland2.rpc, type="n", asp=1, xlab="Subset CA dim1 (20.7%)", ylab="Subset CA dim1 (17.9%)", xlim=c(-1,1), ylim=c(-0.5,0.5))
abline(v=0, h=0, lty=2, col="grey") 
confidenceplots(Finland2.rpc[Finland2$e<9,1], Finland2.rpc[Finland2$e<9,2], group=Finland2$e[Finland2$e<9], groupcols=rep("forestgreen",8), shade=T,
                groupnames=c("e1","e2","e3","e4","e5","e6","e7","e8"), shownames=T, add=T)

plot(Finland2.rpc, type="n", asp=1, xlab="Subset CA dim1 (20.7%)", ylab="Subset CA dim1 (17.9%)", xlim=c(-0.5,0.5), ylim=c(-0.2,0.2))
abline(v=0, h=0, lty=2, col="grey") 
confidenceplots(Finland2.rpc[Finland2$g<3,1], Finland2.rpc[Finland2$g<3,2], group=Finland2$g[Finland2$g<3], groupcols=rep("blue",6), shade=T,
                groupnames=c("m","f"), shownames=T, add=T)

plot(Finland2.rpc, type="n", asp=1, xlab="Subset CA dim1 (20.7%)", ylab="Subset CA dim1 (17.9%)", xlim=c(-1,1), ylim=c(-0.5,0.5))
abline(v=0, h=0, lty=2, col="grey") 
confidenceplots(Finland2.rpc[Finland2$a<7,1], Finland2.rpc[Finland2$a<7,2], group=Finland2$a[Finland2$a<7], groupcols=rep("blue",6), shade=T,
                groupnames=c("a1","a2","a3","a4","a5","a6"), shownames=T, add=T)

Task 3: Analysis of the “middle” and “missing” values of the substantive variables. Confidence intervals of the demographic groups.

Analysis of “middle” category could be provided via CA of Burt matrix.

# investigation of "middle" category, via the Burt matrix
#seq() - generates a sequence of numbers
#seq(from=3,  to=46, by=6): 3,9,15,21,27,33,39,45
Finland2.middle <- seq(3,22,6) 
Finland2.mca2 <- ca(Finland2.B, subsetrow=Finland2.middle, subsetcol=seq(3,22,6))
#what - Vector of two character strings specifying the contents of the plot.
#First entry sets the rows and the second entry the columns. Allowed values # are "all" (all available points, default)
#"active" (only active points are displayed)
#"passive" (only supplementary points are displayed)
#"none" (no points are displayed)
#The status (active or supplementary) of columns is set in mjca using the option
#supcol.
# mass - area, first - rows, second - columns
#font.lab - labels for font? 
plot(Finland2.mca2, what=c("none", "all"), mass=c(F,T), font.lab=2)

summary(Finland2.mca2)
## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.069858  40.4  40.4  **********               
##  2      0.047719  27.6  68.0  *******                  
##  3      0.034199  19.8  87.8  *****                    
##  4      0.021173  12.2 100.0  ***                      
##         -------- -----                                 
##  Total: 0.172949 100.0                                 
## 
## 
## Rows:
##     name   mass  qlt  inr     k=1 cor ctr     k=2 cor ctr  
## 1 |   A3 |   73  563  194 |  -418 384 184 |   286 179 126 |
## 2 |   B3 |   44  770  256 |  -623 389 246 |   616 381 353 |
## 3 |   C3 |   17  943  318 | -1267 491 387 | -1215 452 522 |
## 4 |   D3 |   52  317  232 |  -495 317 182 |     6   0   0 |
## 
## Columns:
##     name   mass  qlt  inr     k=1 cor ctr     k=2 cor ctr  
## 1 |   A3 |   73  563  194 |  -418 384 184 |   286 179 126 |
## 2 |   B3 |   44  770  256 |  -623 389 246 |   616 381 353 |
## 3 |   C3 |   17  943  318 | -1267 491 387 | -1215 452 522 |
## 4 |   D3 |   52  317  232 |  -495 317 182 |     6   0   0 |

Adding respondent points only for middle categories:

#Finland2.Z - indicator matrix
# apply(variable, margin, function). - Returns a vector or array or list of values obtained by applying 
#a function to margins of an array or matrix
#margin specifies if you want to apply by row (margin = 1), 
#by column (margin = 2), or for each element (margin = 1:2).
Finland2.sum4 <- apply(Finland2.Z[,c(Finland2.middle)], 1, sum) 
Finland2.sum4[Finland2.sum4==0] <- 1
Finland2.rpc5 <- Finland2.Z[,c(Finland2.middle)] %*% Finland2.mca2$colcoord / Finland2.sum4
plot(Finland2.mca2, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc5, pch=19, col="lightblue", cex=0.8)

Only missing categories are obtained as:

Finland2.missing <- c(6,12,18,24)
Finland2.mca4 <- ca(Finland2.B, subsetrow=c(Finland2.missing), subsetcol=c(Finland2.missing))
plot(Finland2.mca4, what=c("none", "all"), mass=c(F,T), font.lab=2)

Adding respondent points only for missing categories:

Finland2.sum2 <- apply(Finland2.Z[,c(Finland2.missing)], 1, sum) 
Finland2.sum2[Finland2.sum2==0] <- 1
Finland2.rpc3 <- Finland2.Z[,c(Finland2.missing)] %*% Finland2.mca4$colcoord / Finland2.sum2
plot(Finland2.mca4, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc3 , pch=19, col="lightblue", cex=0.8)

Both “middle” and missing categories are investigated again using CA of Burt matrix.

Finland2.missing <- c(6,12,18,24)
Finland2.mca3 <- ca(Finland2.B, subsetrow=c(Finland2.middle,Finland2.missing), subsetcol=c(Finland2.middle,Finland2.missing))
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2)

Adding respondent points only for both “middle” and missing categories:

Finland2.sum1 <- apply(Finland2.Z[,c(Finland2.middle,Finland2.missing)], 1, sum) 
Finland2.sum1[Finland2.sum1==0] <- 1
Finland2.rpc2 <- Finland2.Z[,c(Finland2.middle,Finland2.missing)] %*% Finland2.mca3$colcoord / Finland2.sum1
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc2 , pch=19, col="lightblue", cex=0.8)

Next we compute confidence regions of demographic groups for both missing and “middle” groups.

  • confidence plot for the demographic group partnership
#generated assymmetric plot
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
#add points
points(Finland2.rpc2, pch=19, col="lightblue", cex=0.8)
#draw confidence plots
confidenceplots(Finland2.rpc2[Finland2$p<4,1], Finland2.rpc2[Finland2$p<4,2], group=Finland2$p[Finland2$p<4], groupcols=rep("purple",5), shade=T,
                groupnames=c("ph1","p2","nop3"), shownames=T, add=T)

  • confidence plot for the demographic group education
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc2, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc2[Finland2$e<9,1], Finland2.rpc2[Finland2$e<9,2], group=Finland2$e[Finland2$e<9], groupcols=rep("forestgreen",8), shade=T,
                groupnames=c("e1","e2","e3","e4","e5","e6","e7","e8"), shownames=T, add=T)

  • confidence plot for the demographic group gender
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc2, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc2[Finland2$g<3,1], Finland2.rpc2[Finland2$g<3,2], group=Finland2$g[Finland2$g<3], groupcols=rep("blue",6), shade=T,
                groupnames=c("m","f"), shownames=T, add=T)

  • confidence plot for the demographic group age
plot(Finland2.mca3, what=c("none", "all"), mass=c(F,T), font.lab=2, map="rowprincipal")
points(Finland2.rpc2, pch=19, col="lightblue", cex=0.8)
confidenceplots(Finland2.rpc2[Finland2$a<7,1], Finland2.rpc2[Finland2$a<7,2], group=Finland2$a[Finland2$a<7], groupcols=rep("blue",6), shade=T,
                groupnames=c("a1","a2","a3","a4","a5","a6"), shownames=T, add=T)

Since from these plots the confidence regions are not so clear, similarly as in Task 2, we next plot “just the ellipses”.

  • for the demographic group partnership:
plot(Finland2.rpc2, type="n", asp=1, xlab="Subset CA dim1 (48.6%)", ylab="Subset CA dim1 (13.5%)", xlim=c(-1,1), ylim=c(0,2))
abline(v=0, h=0, lty=2, col="grey") 
#draw confidence plots
confidenceplots(Finland2.rpc2[Finland2$p<4,1], Finland2.rpc2[Finland2$p<4,2], group=Finland2$p[Finland2$e<4], groupcols=rep("purple",5), shade=T,
                groupnames=c("ph1","p2","nop3"), shownames=T, add=T)

  • for the demographic group education:
plot(Finland2.rpc2, type="n", asp=1, xlab="Subset CA dim1 (48.6%)", ylab="Subset CA dim1 (13.5%)", xlim=c(-1,1), ylim=c(0.5,1.5))
abline(v=0, h=0, lty=2, col="grey") 
confidenceplots(Finland2.rpc2[Finland2$e<9,1], Finland2.rpc2[Finland2$e<9,2], group=Finland2$e[Finland2$e<9], groupcols=rep("forestgreen",8), shade=T,
                groupnames=c("e1","e2","e3","e4","e5","e6","e7","e8"), shownames=T, add=T)

  • for the demographic group gender:
plot(Finland2.rpc2, type="n", asp=1, xlab="Subset CA dim1 (48.6%)", ylab="Subset CA dim1 (13.5%)", xlim=c(0,0.5), ylim=c(0.8,1.4))
abline(v=0, h=0, lty=2, col="grey") 
confidenceplots(Finland2.rpc2[Finland2$g<3,1], Finland2.rpc2[Finland2$g<3,2], group=Finland2$g[Finland2$g<3], groupcols=rep("blue",6), shade=T,
                groupnames=c("m","f"), shownames=T, add=T)

  • for the demographic group age:
plot(Finland2.rpc2, type="n", asp=1, xlab="Subset CA dim1 (48.6%)", ylab="Subset CA dim1 (13.5%)", xlim=c(-0.5,1), ylim=c(0.5,1.5))
abline(v=0, h=0, lty=2, col="grey") 
confidenceplots(Finland2.rpc2[Finland2$a<7,1], Finland2.rpc2[Finland2$a<7,2], group=Finland2$a[Finland2$a<7], groupcols=rep("blue",6), shade=T,
                groupnames=c("a1","a2","a3","a4","a5","a6"), shownames=T, add=T)

From all analyses, provided in Task 3, we can conclude that MCA is a powerful tool for investigating missing and “middle” categories. Both categories involve quite big amount of uncertainty, because of which most of the other methods just ignore these cases. Here we have seen how the information from both categories could be utilized applying MCA approach.

MULTIPLE CORRESPONDANCE ANALYSIS - Assignment 3

The data

As during the Assignment set 1. here we again use Finnish sample from ISSP 2012 survey “Family and Changing Gender Roles IV”. Original data involve 1171 observations of 8 variables (4 substantive and 4 demographic). All variables are categorical.

The 4 substantive variables, which values are measured in 1-5 scale, are:

A: Married people are generally happier than unmarried people.
B: People who want children ought to get married.
C: It is all right for a couple to live together without intending to get married.
D: Divorce is usually the best solution when a couple can’t seem to work out their marriage problems.

The demographic variables are:
g: gender (1=male, 2=female)
a: age group (1=16-25, 2=26-35, 3=36-45, 4=46-55, 5=56-65, 6= 66+)
e: education (1=Primary, 2=Comprehensive, primary and lower secondary, 3= Post-comprehensive, vocational school or course, 4=General upper secondary education or certificate, 5= Vocational post-secondary non-tertiary education, 6=Polytechnics, 7= University, lower academic degree, BA, 8=University, higher academic degree, MA
p: Living in steady partnership (1=Yes, have partner; live in same household, 2=Yes, have partner; don’t live in same household, 3=No partner)

The data wrangling includes the following changes:

  1. The missing data ae removed (this has already been provided). The number of observations without missing data is N=924.

  2. We again use a combined variable **ga = 6*(g-1) + a. The combined variable ga** describes the interaction of gender and age categories.

Graphical overview of the data and summaries of the variables

The preliminary treated data look as:

Finland <- read.table("Finland.txt")
Finland$ga <- 6*(Finland$g-1) + Finland$a
head(Finland)
##   A B C D g a e p ga
## 1 3 3 1 2 1 2 4 3  2
## 2 3 2 3 2 1 4 2 3  4
## 3 3 3 1 3 1 3 8 1  3
## 4 3 2 2 3 2 2 6 1  8
## 5 2 2 2 3 2 4 5 1 10
## 6 3 3 1 3 1 3 7 3  3
dim(Finland)
## [1] 924   9
str(Finland)
## 'data.frame':    924 obs. of  9 variables:
##  $ A : int  3 3 3 3 2 3 3 2 2 3 ...
##  $ B : int  3 2 3 2 2 3 3 3 3 3 ...
##  $ C : int  1 3 1 2 2 1 1 1 2 3 ...
##  $ D : int  2 2 3 3 3 3 2 2 3 3 ...
##  $ g : int  1 1 1 2 2 1 2 2 2 2 ...
##  $ a : int  2 4 3 2 4 3 1 2 5 4 ...
##  $ e : int  4 2 8 6 5 7 4 8 7 5 ...
##  $ p : int  3 3 1 1 1 3 3 3 1 3 ...
##  $ ga: num  2 4 3 8 10 3 7 8 11 10 ...
library(FactoMineR)
library(tidyr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
keep_columns <- c("A", "B", "C", "D", "g", "a", "e", "p", "ga")
Finland <- dplyr::select(Finland, one_of(keep_columns))
summary(Finland)
##        A               B               C               D        
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :2.000   Median :2.000  
##  Mean   :3.193   Mean   :3.025   Mean   :1.835   Mean   :2.503  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##        g               a               e               p        
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:1.000  
##  Median :2.000   Median :4.000   Median :5.000   Median :1.000  
##  Mean   :1.563   Mean   :3.644   Mean   :4.527   Mean   :1.563  
##  3rd Qu.:2.000   3rd Qu.:5.000   3rd Qu.:6.000   3rd Qu.:3.000  
##  Max.   :2.000   Max.   :6.000   Max.   :8.000   Max.   :3.000  
##        ga        
##  Min.   : 1.000  
##  1st Qu.: 4.000  
##  Median : 7.000  
##  Mean   : 7.021  
##  3rd Qu.:10.000  
##  Max.   :12.000

We present data graphically as barplots using ggplot() function.

gather(Finland) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) 

We observe that the highest frequency of the answers to question A takes a middle hypothesis, while for questions B and D the highest frequency has the answer 2. Most respondents strongly agree with the hypothesis C. The rest of the barplots show the distribution of demographic variables.

Nonlinear MCA: The package homals in R

There are two main constrained forms of MCA: canonical MCA and nonlinear MCA. Canonical MCA use explanatory variables to explain set of response variables, while in classical MCA there is no restriction imposed by explanatory variables since all variables are assumed to be response variables.

Nonlinear MCA is also assumed to be a constrained form of MCA because it is obtained from common MCA (known also as homogeneity analysis) by imposing restrictions on the variable ranks and levels as well as defined set of variables.

The most important advantages of nonlinear over linear MCA are that except truly numeric nonlinear MCA incorporates nominal and ordinal variables, and additionally can discover nonlinear relationships between variables.

In nonlinear MCA the variables could be nominal, ordinal or true numeric. Nominal variables consist of unordered categories. Since principal components are weighted sums of the original variables, nominal variables could not be analyzed using standard linear PCA. Ordinal variables consist of ordered categories, for example as we have also used, a Likert-type scale. Such scale values are not truly numeric because intervals between consecutive categories are not equal. Although ordinal variables display more structure than nominal ones, these variables still do not possess traditional numeric properties. The true numeric variables can be viewed as categorical with c categories, where c indicates the number of different observed values. Both ratio and interval variables are considered numeric in nonlinear PCA.

Linear MCA is suitable for variables, all measured as numeric. It is possible to exist nonlinear relationships between some of these variables. Then nonlinear MCA will be more appropriate approach.

Nonlinear PCA converts categories into numeric values because the variance could be established only for numeric values. This process is called quantification. Thus correlations are not computed between observed variables, but between quantified variables. Therefore the correlation matrix in nonlinear PCA is not fixed and depends on the type of quantification chosen for every variable.

The type of quantification is called analysis level. Different analysis levels imply different requirements.

Nonlinear MCA in R could be provided by using the homals package. This package performs simple homogeneity analysis, which corresponds to MCA. The package homals also provide some extensions to MCA, including nonlinear MCA. Nonlinear PCA could be provided using the function homals and setting appropriate options.

In the example below we use “ordinal” analysis level because the substantive variables are ordinal and the categories are given in Likert-type scale.

require(homals)
## Loading required package: homals
#algebraically, the geometric concept of dimensionality is related to 
#the rank of the matrix, which has to be reduced.
#rank - Which quantification ranks. Default is number of dimensions ndim
#level - Which quantification levels. Possible values are "nominal", "ordinal", 
#"numerical", and "polynomial" which can be defined as 
#single character (if all variable are of the same level) or 
#as vector which length corresponds to the number of variables

Finland.nlpca <- homals(Finland[,1:4], rank=1, level="ordinal")

1. Transformation plots

Next the transformation plot is drawn. The horizontal axis (x) of the plot displays the categories (1 to 5) of the ordinal substantive variables. On the vertical axis (y) the category quantification for every substantive variable (A-D) are shown.

#Transformation plot: Plots the original (categorical) scale against 
#the transformed (metric) scale on each dimension over the categories 
#of each variable separately. 
plot(Finland.nlpca, plot.type="trfplot")    # relationship of transformed scale with 1-to-5 scale
Figure 1A

Figure 1A

Figure 1B

Figure 1B

Figure 1C

Figure 1C

Figure 1D

Figure 1D

From the transformation plot for the variable D we see that the ordinal category quantification are non-strictly increasing with the original category labels. The original spacing between categories is not maintained in the quantifications. Between categories 2 and 4 we see something similar to plateau, which means that some consecutive categories obtain almost the same quantification, called ties. There are two possible reasons for such ties: 1. The persons scoring in the tied categories do not structurally differ from each other considering their scores on the other variables, and therefore categories cannot be distinguished from each other. 2. The ordinal quantifications are obtained by placing an order restriction on nominal quantifications.

2.Category plots

This plot represent a quantified variable by displaying its category points in principal component space, where the axes are given by the principal components. Here a variable is represented by a vector going through the origin (0,0) (which is also the mean of the quantified variable) and the point with as coordinates the component loadings for the variable. The component loadings are correlations between the quantified variables and the principal components, and the sum of squared component loadings indicates the variance accounted for (VAF) by the principal component. The category points are also positioned on the variable vector, and their coordinates are found by multiplying the category quantifications by the corresponding loadings on the first (for the x-coordinate) and the second (for the y-coordinate) component.
Categories with quantifications above the mean lie on the side of the origin on which the component loadings point is positioned, and categories with quantifications below the mean lie in the opposite direction. The total length of the variable vector does not indicate the importance of the variable.

#2. Category plot: Plots the rank-restricted category quantifications for 
#each variable separately. 
plot(Finland.nlpca, plot.type="catplot")
Figure 2A

Figure 2A

Figure 2B

Figure 2B

Figure 2C

Figure 2C

Figure 2D

Figure 2D

In these plots the loading point is not shown. I do not know how to interpret the nonlinear transformations, which lead to these 4 category plots.

3.Component loadings plot

In these plots only the loading vectors (origin and loading point) are displayed. Here the variables with relatively long vectors fit well into the solution and variables with relatively short vectors fit badly. When vectors are long, the cosines of the angles between the vectors indicate the correlation between the quantified variables. The length of the variable vector from the origin up to the component loading point is an indication of the variable’s total variance accounted for (VAF). Thus VAF can be interpreted as the amount of information retained when variables are represented in a lower dimensional space. Nonlinear transformations reduce the dimensionality necessary to represent the variables satisfactory.

#3. Loadings plot: Plots the loadings of the variables and connects them with the origin.
plot(Finland.nlpca, plot.type="loadplot")
Figure 3

Figure 3

In this example the longest length has a variable vector for D. Other variable vector lengths are very similar and not much shorter than D loading vector.

4.Vector plot

In this plot all cases (object scores) are projected onto a straight line defined by each rank restricted category quantified variable. Here except the straight line, plotted in the category plots, also all object scores are shown.

#4. Vector plot: cases projected onto straight line defined by each variable
plot(Finland.nlpca, plot.type="vecplot")     
Figure 4A

Figure 4A

Figure 4B

Figure 4B

Figure 4C

Figure 4C

Figure 4D ##5.Loss plot

The loss plot shows the rank-restricted category quantifications against the unrestricted for each variable separately. It actually compares MCA and NLPCA.

plot(Finland.nlpca, plot.type="lossplot")   
Figure 5A

Figure 5A

Figure 5B

Figure 5B

Figure 5C

Figure 5C

Figure 5D From these plots we see that the linear and nonlinear MCA results are very near for variables A and D, while both model outputs differ quite a lot for the variables B and C. This could mean that linear model assumption for variables A and D is appropriate.

6. Object plot

It plots the scores (cases) of the objects (rows in data set) on two dimensions. Therefore this plot is presented only by dots.

plot(Finland.nlpca, plot.type="objplot")   
Figure 6

Figure 6

7. Label plot

It is similar to the previous object plot. In the label plot the object scores are also plotted, but now for each variable separately with the corresponding category labels. It looks like contour lines forced to be straight & parallel

plot(Finland.nlpca, plot.type="labplot")    
Figure 7A

Figure 7A

Figure 7B

Figure 7B

Figure 7C

Figure 7C

Figure 7D #8. Graph plot.

It is a joint plot with connections between scores/quantifications. So, except the object scores, plotted on object plot, the variable quantifications are also shown. It should be noted that this plot works only for small data sets. Even in the provided example ot is already quite difficult to observe the plotted relationships.

plot(Finland.nlpca, plot.type="graphplot")
Figure 8

Figure 8

9. Voronoi plot.

It produces a category plot with Voronoi regions. Looks like contour lines forced to be straight and parallel.

plot(Finland.nlpca, plot.type="vorplot") 
Figure 9A

Figure 9A

Figure 9B

Figure 9B

Figure 9C

Figure 9C

Figure 9D

Figure 9D

10. Hull plot

For each single variable the object scores are mapped onto two dimensions and the convex hull for each response category is drawn.

# would be better to show confidence ellipses
plot(Finland.nlpca, plot.type="hullplot")
Figure 10A

Figure 10A

Figure 10B

Figure 10B

Figure 10C

Figure 10C

Figure 10D

Figure 10D

CA of doubled data

The CA method could be applied on data, which have been preprocessed in different ways. One kind of preprocessing is so called doubling, which in this example is provided on data type ratings. We assume 1-to-5 Linkert scale, which starts at one. Then the first column consists of all ratings minus 1. Since rating 1 = “strongly agree”, the transformed first column measure the strength of disagreements and therefore is called negative pole. The second column is 4 minus the first column and measure the strength of disagreements. The second column is called positive pole. This kind of data transformation is called doubling. The CA of doubled data is performed as before when we have used nontransformed data.
Next plot shows the asymmetric map.

require(ca)
Finland.doubled <- cbind( (Finland[,1:4]-1), (5-Finland[,1:4]) )
colnames(Finland.doubled) <- paste( rep(LETTERS[1:4],2), rep(c("-","+"), each=4), sep="" )
head(Finland.doubled)
Finland.doubled.ca <- ca(Finland.doubled)
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(Finland.doubled.ca, col=c("lightblue","red"), labels=c(0,2), font=2, map="rowprincipal")
Figure 11

Figure 11

Next we connect opposite categories in column standard coordinates and draw the contribution biplot.

# connect opposite categories in column standard coordinates
Finland.doubled.csc <- Finland.doubled.ca$colcoord
for(j in 1:4) segments(Finland.doubled.csc[j,1],Finland.doubled.csc[j,2], 
                       Finland.doubled.csc[4+j,1],Finland.doubled.csc[4+j,2], col="red", lty=2)
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(Finland.doubled.ca, col=c("lightblue","red"), labels=c(0,2), font=2, map="rowgreen")
Finland.doubled.ccc <- Finland.doubled.ca$colcoord * sqrt(Finland.doubled.ca$colmass)
for(j in 1:4) segments(Finland.doubled.ccc[j,1],Finland.doubled.ccc[j,2], 
                       Finland.doubled.ccc[4+j,1],Finland.doubled.ccc[4+j,2], col="red", lty=2)
Figure 12

Figure 12

The polar positions in contribution coordinates are shown on Figure 13.

# just the polar positions in contribution coordinates
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(Finland.doubled.ca, what=c("none","all"), labels=c(0,2), font=2, map="rowgreen")
for(j in 1:4) segments(Finland.doubled.ccc[j,1],Finland.doubled.ccc[j,2], 
                       Finland.doubled.ccc[4+j,1],Finland.doubled.ccc[4+j,2], col="red", lty=2)
Figure 13

Figure 13

The demographic group averages are shown on Figure 14

Finland.doubled.rpc <- Finland.doubled.ca$rowcoord %*% diag(Finland.doubled.ca$sv)
Figure 14

Figure 14

The demographic group averages and confidence regions are shown on Figure 15

# add centroids of demographic categories
Finland.doubled.rpc <- Finland.doubled.ca$rowcoord %*% diag(Finland.doubled.ca$sv)
# with confidence ellipses
source("confidenceplots.R")
require(ellipse)
par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(Finland.doubled.ca, labels=c(0,2), what=c("none","all"))
Finland.doubled.cpc <- Finland.doubled.ca$colcoord %*% diag(Finland.doubled.ca$sv)
for(j in 1:4) segments(Finland.doubled.cpc[j,1],Finland.doubled.cpc[j,2], 
                       Finland.doubled.cpc[4+j,1],Finland.doubled.cpc[4+j,2], col="red", lty=2)
confidenceplots(Finland.doubled.rpc[,1], Finland.doubled.rpc[,2], group=Finland$ga, groupcols=c(rep("blue",6),rep("red",6)), shade=T,
                groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=T)
Figure 15

Figure 15

Next we add confidence ellipses to the plot and show the resultt onFigure 14. Only the confidence ellipses are shown on Figure 16.

#ellipses just by themselves (watch out for aspect ratio distortion! I have to correct this in next version of my function)
confidenceplots(Finland.doubled.rpc[,1], Finland.doubled.rpc[,2], group=Finland$ga, groupcols=c(rep("blue",6),rep("red",6)), shade=T,
                groupnames=c("ma1","ma2","ma3","ma4","ma5","ma6","fa1","fa2","fa3","fa4","fa5","fa6"),shownames=T, add=F)
Figure 16

Figure 16

Regular PCA on non-missing data

The regular PCA plot is shown on the next Figure 17:

Finland.pca <- prcomp(Finland[,1:4])
names(Finland.pca)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
# [1] "sdev"     "rotation" "center"   "scale"    "x"  

par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(rbind(Finland.pca$x, 10.5*Finland.pca$rotation), type="n", asp=1, xlab="PCA dim 1 (37.5%)", ylab="PCA dim 2 (17.1%)")
abline(h=0, col="gray", lty=2)
abline(v=0, col="gray", lty=2)
points(Finland.pca$x, pch=19, col="lightblue", cex=0.9)
arrows(0, 0, 10*Finland.pca$rotation[,1], 10*Finland.pca$rotation[,2], length=0.1, angle=10, col="pink", lwd=2)
text(10.3*Finland.pca$rotation, labels=colnames(Finland[,1:4]), col="red", font=4)

Factor analysis

The R function factanal() performs the analysis on standardized variables

The correlation matrix is:

round(cor(Finland[,1:4]),3)
##        A      B      C      D
## A  1.000  0.527 -0.236 -0.057
## B  0.527  1.000 -0.508 -0.138
## C -0.236 -0.508  1.000  0.291
## D -0.057 -0.138  0.291  1.000

Since 2 factors are too many for 4 variables, we have to set factors = 1.

Finland.fa <- factanal(Finland[,1:4], factors=1, rotation="none", scores="regression")
names(Finland.fa)
##  [1] "converged"    "loadings"     "uniquenesses" "correlation" 
##  [5] "criteria"     "factors"      "dof"          "method"      
##  [9] "scores"       "STATISTIC"    "PVAL"         "n.obs"       
## [13] "call"
print(Finland.fa)
## 
## Call:
## factanal(x = Finland[, 1:4], factors = 1, scores = "regression",     rotation = "none")
## 
## Uniquenesses:
##     A     B     C     D 
## 0.715 0.028 0.735 0.979 
## 
## Loadings:
##   Factor1
## A  0.534 
## B  0.986 
## C -0.515 
## D -0.144 
## 
##                Factor1
## SS loadings      1.543
## Proportion Var   0.386
## 
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 65.38 on 2 degrees of freedom.
## The p-value is 6.35e-15

Factor analysis results:

par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(rbind(-Finland.fa$loadings, -0.5*Finland.fa$scores), asp=1, type="n", xlab="FA factor 1 (26.6%)", ylab="FA factor 2 (7.3%")
abline(h=0, col="gray", lty=2)
abline(v=0, col="gray", lty=2)
points(-0.5*Finland.fa$scores, pch=19, col="lightblue", cex=0.9)
arrows(0, 0, -Finland.fa$loadings[,1], -Finland.fa$loadings[,1], length=0.1, angle=10, col="pink", lwd=2)
text(-1.05*Finland.fa$loadings, labels=colnames(Finland[,1:4]), col="red", font=4)

FA with rotation:

Finland.fa <- factanal(Finland[,1:4], factors=1, rotation="varimax", scores="regression")

print(Finland.fa)
## 
## Call:
## factanal(x = Finland[, 1:4], factors = 1, scores = "regression",     rotation = "varimax")
## 
## Uniquenesses:
##     A     B     C     D 
## 0.715 0.028 0.735 0.979 
## 
## Loadings:
##   Factor1
## A  0.534 
## B  0.986 
## C -0.515 
## D -0.144 
## 
##                Factor1
## SS loadings      1.543
## Proportion Var   0.386
## 
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 65.38 on 2 degrees of freedom.
## The p-value is 6.35e-15

Plot FA results:

par(mar=c(4.2,4,1,1), mgp=c(2,0.7,0), font.lab=2, cex.axis=0.8)
plot(rbind(-Finland.fa$loadings, -0.5*Finland.fa$scores), asp=1, type="n", xlab="FA factor 1 (18.6%)", ylab="FA factor 2 (15.3%")
abline(h=0, col="gray", lty=2)
abline(v=0, col="gray", lty=2)
points(-0.5*Finland.fa$scores, pch=19, col="lightblue", cex=0.9)
arrows(0, 0, -Finland.fa$loadings[,1], -Finland.fa$loadings[,1], length=0.1, angle=10, col="pink", lwd=2)
text(-1.05*Finland.fa$loadings, labels=colnames(Finland[,1:4]), col="red", font=4)

K-means clustering of respondents using MCA coordinates

Finland is the set of complete cases

require(ca)
Finland.B <- mjca(Finland[,1:4], ps="")$Burt
Finland.Z <- mjca(Finland[,1:4], ps="", reti=T)$indmat
Finland.csc <- mjca(Finland[,1:4], ps="")$colcoord
rownames(Finland.Z) <- 1:nrow(Finland.Z)
colnames(Finland.Z) <- colnames(Finland.B)
# notice that the division by 8 in the next line is becasue of the 9 variables in Finland
Finland.rpc <- Finland.Z %*% Finland.csc/9
summary(mjca(Finland[,1:4]))
## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.116786  44.9  44.9  *************            
##  2      0.090602  34.8  79.7  **********               
##  3      0.009457   3.6  83.3  *                        
##  4      0.006314   2.4  85.7  *                        
##  5      0.001032   0.4  86.1                           
##  6      0.000624   0.2  86.4                           
##  7      00000000   0.0  86.4                           
##         -------- -----                                 
##  Total: 0.260281                                       
## 
## 
## Columns:
##      name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
## 1  |  A:1 |   12  848   57 | -374 159  15 |  778 689  83 |
## 2  |  A:2 |   58  847   46 | -351 769  61 |  112  78   8 |
## 3  |  A:3 |   85  545   38 |  -90 122   6 | -168 423  26 |
## 4  |  A:4 |   58  375   44 |   68  45   2 | -184 329  22 |
## 5  |  A:5 |   37  830   63 |  785 759 193 |  240  71  23 |
## 6  |  B:1 |   34  834   65 | -333 121  32 |  807 713 245 |
## 7  |  B:2 |   67  738   47 | -368 666  78 | -120  71  11 |
## 8  |  B:3 |   47  543   46 | -115 100   5 | -242 443  30 |
## 9  |  B:4 |   62  460   45 |  135 127  10 | -219 333  33 |
## 10 |  B:5 |   40  804   66 |  836 782 236 |  141  22   9 |
## 11 |  C:1 |  118  902   39 |  379 899 145 |  -25   4   1 |
## 12 |  C:2 |   91  898   42 | -312 611  76 | -214 288  46 |
## 13 |  C:3 |   17  636   52 | -429 588  27 |  122  47   3 |
## 14 |  C:4 |   12  756   54 | -542 510  30 |  376 246  19 |
## 15 |  C:5 |   12  845   66 | -207  20   4 | 1327 825 231 |
## 16 |  D:1 |   48  970   47 |  382 924  60 |   85  46   4 |
## 17 |  D:2 |   94  829   36 | -119 227  11 | -193 602  39 |
## 18 |  D:3 |   56  277   42 |  -11   4   0 |  -89 273   5 |
## 19 |  D:4 |   38  557   46 | -149 305   7 |  135 252   8 |
## 20 |  D:5 |   14  960   59 |  -59   3   0 | 1008 956 155 |

We use 4 dimensions (which is also the maximum factor analysis will allow) and loop on k-means algorithm to decide how many clusters

Finland.BW <- rep(0, 10)
for(nc in 2:10) {
  Finland.km <- kmeans(Finland.rpc[,1:4], centers=nc, nstart=20, iter.max=200)
  Finland.BW[nc] <- Finland.km$betweenss/Finland.km$totss
}
Finland.BW
##  [1] 0.0000000 0.2336672 0.4374201 0.5789785 0.6871295 0.7217150 0.7460767
##  [8] 0.7698848 0.7823934 0.8072644

Plot the proportion of between-cluster variance:

par(mar=c(4.2,4,1,2), cex.axis=0.8, mfrow=c(1,2))
plot(Finland.BW, xlab="Nr of clusters", ylab="BSS/TSS")

Plot the increments in between-cluster variance:

Finland.BWinc <- Finland.BW[2:10]-Finland.BW[1:9]  
plot(1:10, c(NA,NA, Finland.BWinc[2:9]), xlab="Nr of clusters", ylab="improvement")
lines(3:10, Finland.BWinc[2:9], col="red", lwd=2)

If it looks like 5-cluster solution, than it is a good choice

Finland.km5 <- kmeans(Finland.rpc[,1:4], centers=5, nstart=20, iter.max=200)
Finland.km5$betweenss/Finland.km5$totss
## [1] 0.6871295

Cluster sizes:

Finland.km5$size
## [1] 212 105 230 210 167

Relate clusters to 9 variables in Finland and average those in a cluster by the original Finland counts:

Finland.means <- matrix(0,nrow=5,ncol=4)
rownames(Finland.means) <- c("clus1","clus2","clus3","clus4","clus5")
colnames(Finland.means) <- colnames(Finland)[1:4]
for(j in 1:4) Finland.means[,j] <- tapply(Finland[,j], Finland.km5$cluster, mean)
round(Finland.means,1)
##         A   B   C   D
## clus1 2.8 2.8 1.8 2.5
## clus2 2.2 1.2 3.4 3.3
## clus3 3.6 3.7 1.5 2.5
## clus4 2.6 2.1 2.1 2.4
## clus5 4.4 4.8 1.1 2.2