conducted by Neli Noykova
The data for this project are taken from [Statistics Finland] (https://tilastokeskus.fi/tup/mikroaineistot/aineistot_en.html). These are individual level combined employer-employee data, or so-called FLEED (Finnish Longitudinal Employer-Employee Data). The original data contain information on population of working age, which can be combined with enterprise and establishment level data. Here we use two adapted for studying purposes data sets in Excel format (files fleed_tyo.csv and fleed_yritys.csv).
The file fleed_tyo.csv consists of data about 15-70 years old employees in Finland for the period 1990-2010. Because of data protection reasons only 15 years period is taken into account, and the years are numbered from 1 to 15. The file is fully described (in Finnish) here. The sample data involve information about 89312 persons and 18 variables, describing person’s basic characteristics, family, living, employment relationships, periods of unemployment, income and education. The variables, listed in this file, are:
vuosi - Year, given in integer values.
shtun - Encrypted personal identity code, given in integer values.
syrtun - Encrypted enterprise code, related to the employment relationship during the last week of the year), integer values, with missing values.
sukup - Gender, 2 different integer values.
syntyv - Year of Birth, integer values.
kieli - Native language, defined as factor variable with 3 levels.
peas - Family status, 7 different integer values: 1 -head, 2 - spouse, 3 - child, 4 - head of cohabiting family, 5 - spouse of cohabiting family, 9 - unknown, 0 - not belonging to a family.
a7lkm - Number of children aged under 7 in family, integer values.
a18lkm - Number of children aged under 18 in family, integer values.
ktutk - Education, integer values.
sose - Socio-economic group, 9 different integer values.
ptoim1 - Main activity (TVM=employment relationship during the last week of the year), 7 different integer values.
tyokk - Months in employment, 13 different integer values.
tyke - Number of unemployment months, 13 different integer values.
toimiala - Industry (TVM=employment relationship during the last week of the year). It is defined as factor variable with 23 levels.
svatva - Earned income total in state taxation, integer values.
tyotu - Earned income, integer values.
suuralue12 - Major region based on the 2012 regional division, 5 different integer values.
data_workers <- read.csv(file="fleed_tyo.csv", header=TRUE, sep=",")
str(data_workers)
## 'data.frame': 89312 obs. of 18 variables:
## $ vuosi : int 1 1 1 1 1 1 1 1 1 1 ...
## $ shtun : int 2 3 5 7 8 9 11 12 13 14 ...
## $ syrtun : int 887 NA NA 4963 6639 8749 2506 7777 NA 6932 ...
## $ sukup : int 2 2 1 2 2 1 2 2 2 1 ...
## $ syntyv : int 1945 1927 1930 1952 1947 1950 1949 1957 1928 1946 ...
## $ kieli : Factor w/ 3 levels "9","fi","sv": 3 2 2 2 2 2 2 2 2 2 ...
## $ peas : int 2 2 1 0 1 1 5 2 0 1 ...
## $ a7lkm : int 0 0 0 NA 1 0 0 0 NA 0 ...
## $ a18lkm : int 0 0 0 NA 1 0 0 2 NA 1 ...
## $ ktutk : int 38 52 43 63 40 NA 34 32 NA 62 ...
## $ sose : int NA NA NA NA NA NA NA NA NA NA ...
## $ ptoim1 : int 11 24 24 11 11 11 11 11 24 11 ...
## $ tyokk : int NA NA NA NA NA NA NA NA NA NA ...
## $ tyke : int NA NA NA NA NA NA NA 7 NA NA ...
## $ toimiala : Factor w/ 23 levels "","A","B","C",..: 17 1 1 14 14 5 5 17 1 16 ...
## $ svatva : int 19000 16000 18000 54000 20000 22000 13000 8000 14000 37000 ...
## $ tyotu : int 19000 NA 9000 53000 20000 22000 12000 6000 NA 37000 ...
## $ suuralue12: int 1 4 1 1 1 2 2 3 4 2 ...
The other file, fleed_yritys.csv, involves data about the corresponding employers during the same period. It is described in Finnish here.
The sample data involve information about 66878 companies and 6 variables, describing the different companies where employees have been working during the observed period. The variables, listed in this file, are:
vuosi - Year, given in integer values.
syrtun - Encrypted enterprise code, related to the employment relationship during the last week of the year), integer values, with missing values.
oty - Type of owner, integer values.
toimiala - Industry (TVM=employment relationship during the last week of the year). It is defined as factor variable with 22 levels.
SLHKY - Group of the company according the number of employees working there, 9 different integer values corresponding to 9 different groups - 1 with the smallest number of employees (< 4,5), and 9 with the biggest number of employees (>=9 999,5).
sllvy - Group of the company according its turnover, 9 different integer values, corresponding to 9 different groups - 1 with the smallest turnover (< 1000), and 9 with the biggest turnover (>=9 200 000 000).
data_firms <- read.csv(file="fleed_yritys.csv", header=TRUE, sep=",")
str(data_firms)
## 'data.frame': 66878 obs. of 6 variables:
## $ vuosi : int 1 1 1 1 1 1 1 1 1 1 ...
## $ syrtun : int 1 3 6 14 16 17 20 24 27 28 ...
## $ oty : int 1 1 1 2 1 5 1 1 3 1 ...
## $ toimiala: Factor w/ 22 levels "","A","B","C",..: 10 7 8 16 7 8 8 5 16 10 ...
## $ SLHKY : int 3 6 4 7 1 2 1 5 5 1 ...
## $ sllvy : int 5 7 6 1 2 6 5 6 1 3 ...
Both tables are connected via the variable syrtun, which is an encrypted enterprise code describing the employment relationship during the last week of the year.
The goal of this project is to find out how the employee’s family status and the size of the employing company (measured by its turnover) are related to the size of the earned income.
The data should be restricted to the year vuosi=2 (the last number of my student number).
We assume that the whole earned income of each employee has come from a linked company during the investigated year.
We assume that the different employees are “independent of each other”.
The missing values can be omited.
data_workers2 <- subset(data_workers, vuosi==2)
dim(data_workers2)
## [1] 5855 18
data_firms2 <- subset(data_firms, vuosi== 2)
dim(data_firms2)
## [1] 3699 6
data1<-merge(data_workers2, data_firms2, by="syrtun")
dim(data1)
## [1] 1925 23
data1[,'peas']<-factor(data1[,'peas'])
levels(data1$peas)
## [1] "0" "1" "2" "3" "4" "5" "9"
data1[,'a7lkm']<-factor(data1[,'a7lkm'])
levels(data1$a7lkm)
## [1] "0" "1" "2" "3" "4" "5"
data1[,'a18lkm']<-factor(data1[,'a18lkm'])
levels(data1$a18lkm)
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8"
data1[,'SLHKY']<-factor(data1[,'SLHKY'])
levels(data1$SLHKY)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
data1[,'sllvy']<-factor(data1[,'sllvy'])
levels(data1$sllvy)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
data2<-na.omit(data1)
dim(data2)
## [1] 0 23
We see that since there are some missing data for all employees corresponding to year 2, if we remove all missing values, we end up with empty subset.
Therefore at this stage I remove only the missing values, involved directly in task description.
data<-data1[!is.na(data1$svatva),]
dim(data)
## [1] 1916 23
To get a glimpse of the three targeted data variables we plot all investigated variables against each other.
library(ggplot2)
library(GGally)
var<-data[,c("svatva", "peas", "sllvy")]
assignInNamespace("ggally_cor", ggally_cor, "GGally")
ggpairs(var, upper = list(continuous = wrap("cor", size = 10)), lower = list(continuous = "smooth"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We observe that the total earned income in almost all groups in both categoracal variables - family status and company turnover - deviates from normal distribution.
The aim at this first analysis step is to achieve some preliminary assessments about the population distribution of the variables svatva, SLLVY and peas using the data of the observed sample for year=2. The variable svatva is numerical, while svatva and SLLVY are categorical.
First we analyse svatva.
attach(data)
summary(svatva)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 14000 18000 19434 23000 99000
We see that the mean and median are quite near, which suggests normal distribution.
The distribution of svatva can be visualized using so called steam-and-leaf plot. This plot is a special textual graph (table) where each data value is split into a “stem” (the first digit or digits) and a “leaf” (usually the last digit).
stem(svatva)
##
## The decimal point is 4 digit(s) to the right of the |
##
## 0 | 0000111111111222222222222333333333333334444444444444
## 0 | 55555555555555555555556666666666666666666666667777777777777777777777+61
## 1 | 00000000000000000000000000000000000000011111111111111111111111111111+283
## 1 | 55555555555555555555555555555555555555555555555555555555555555555555+525
## 2 | 00000000000000000000000000000000000000000000000000000000000000000000+295
## 2 | 55555555555555555555555555555555555555555555666666666666666666666666+97
## 3 | 00000000000000000000000011111111111111112222222222222222223333333333+5
## 3 | 555555555666677777777777888889999999
## 4 | 0000000011122233444444
## 4 | 5555566667778889999
## 5 | 0011112222234
## 5 | 56666778
## 6 | 11134
## 6 | 55678
## 7 | 044
## 7 | 7
## 8 | 124
## 8 | 5
## 9 |
## 9 | 59
The stem-and-leaf plot suggests that comparing to normal distribution this one is skewed towards positive values. Next the distribution is shown in more details using a histogram plot.
h<-hist(svatva, breaks=seq(0, 101000, by=2500), plot=F)
#str(h)
plot(h, col = heat.colors(length(h$mids))[length(h$count)-rank(h$count)+1],
ylim = c(0, max(h$count)+5),
main="Earned income total in state taxation, weight %",
sub="Counts shown above bar, actual values shown with rug plot")
rug(svatva)
#cex - size of the text on the figure
text(h$mids, h$count, h$count, cex=0.7, cex.main =0.7, cex.sub=0.7, pos=3)
rm(h)
We see that there are some unusually high values on the right tail of the histogram. Next we investigate this part of the histogram, the employees who have earned total income svatva> 48500.
data[svatva>48500,]
## syrtun vuosi.x shtun sukup syntyv kieli peas a7lkm a18lkm ktutk sose
## 20 117 2 5371 1 1940 fi 1 0 0 74 NA
## 22 135 2 4018 1 1944 fi 1 0 0 54 NA
## 79 386 2 836 1 1926 fi 1 0 0 NA NA
## 98 505 2 3020 1 1954 fi 1 1 5 63 NA
## 109 553 2 1269 1 1941 fi 1 0 1 43 NA
## 134 655 2 8248 1 1946 fi 1 1 2 74 NA
## 170 819 2 5176 1 1946 fi 1 0 1 44 NA
## 194 966 2 2112 2 1953 fi 0 <NA> <NA> 76 NA
## 283 1405 2 4721 1 1946 fi 1 0 2 40 NA
## 327 1538 2 5202 1 1938 fi 4 0 0 84 NA
## 417 1960 2 3345 2 1945 fi 1 0 0 NA NA
## 438 2084 2 4533 1 1932 fi 1 0 0 NA NA
## 484 2373 2 217 1 1945 fi 1 0 1 73 NA
## 519 2558 2 8343 1 1954 fi 0 <NA> <NA> 43 NA
## 612 2975 2 1338 1 1930 fi 1 0 0 74 NA
## 753 3653 2 5132 1 1941 sv 0 <NA> <NA> 74 NA
## 807 3872 2 5699 1 1925 fi 1 0 0 NA NA
## 811 3893 2 2083 1 1947 fi 1 0 2 43 NA
## 841 4062 2 1500 2 1942 fi 2 0 0 63 NA
## 880 4247 2 7485 1 1945 fi 1 0 2 44 NA
## 944 4626 2 130 1 1935 fi 1 0 0 54 NA
## 977 4850 2 5574 1 1951 fi 1 0 2 74 NA
## 988 4931 2 1604 2 1945 fi 0 <NA> <NA> 43 NA
## 990 4940 2 7784 1 1944 fi 1 1 2 63 NA
## 996 4963 2 7 2 1952 fi 0 <NA> <NA> 63 NA
## 1236 5850 2 282 1 1951 fi 1 0 1 34 NA
## 1241 5879 2 7405 1 1939 fi 1 0 0 63 NA
## 1360 6226 2 865 1 1944 fi 1 0 3 63 NA
## 1365 6226 2 1231 1 1944 fi 1 0 1 76 NA
## 1379 6328 2 7281 1 1941 fi 1 0 0 63 NA
## 1381 6332 2 8010 1 1940 fi 1 0 1 74 NA
## 1412 6505 2 7929 1 1946 fi 1 0 2 NA NA
## 1426 6537 2 3260 1 1949 fi 4 0 2 43 NA
## 1443 6618 2 1307 1 1947 sv 1 0 1 43 NA
## 1450 6637 2 1296 2 1936 fi 2 0 0 76 NA
## 1497 6894 2 768 1 1953 fi 1 0 0 43 NA
## 1599 7400 2 2852 1 1931 fi 0 <NA> <NA> 73 NA
## 1616 7485 2 2350 1 1936 sv 1 0 0 NA NA
## 1625 7533 2 6662 1 1947 fi 1 0 1 34 NA
## 1633 7581 2 6999 1 1941 fi 1 0 3 44 NA
## 1738 8135 2 7438 1 1950 fi 1 0 3 74 NA
## 1746 8135 2 3246 1 1946 fi 1 0 0 74 NA
## 1801 8486 2 5679 1 1943 fi 1 0 0 76 NA
## 1822 8591 2 1876 2 1940 fi 2 0 1 76 NA
## 1853 8733 2 3080 1 1948 fi 1 0 2 86 NA
## ptoim1 tyokk tyke toimiala.x svatva tyotu suuralue12 vuosi.y oty
## 20 11 NA NA M 74000 69000 1 2 1
## 22 11 NA NA M 56000 56000 1 2 1
## 79 11 NA NA G 99000 NA 4 2 1
## 98 11 NA NA F 70000 69000 4 2 1
## 109 11 NA NA D 57000 56000 2 2 1
## 134 11 NA NA D 52000 50000 1 2 2
## 170 11 NA NA G 52000 NA 4 2 1
## 194 11 NA NA P 52000 52000 2 2 1
## 283 11 NA NA G 64000 64000 1 2 5
## 327 11 NA NA M 51000 49000 1 2 1
## 417 11 NA NA F 51000 41000 2 2 1
## 438 11 NA NA G 49000 48000 2 2 1
## 484 11 NA NA K 49000 49000 2 2 1
## 519 11 NA NA M 74000 56000 1 2 1
## 612 11 NA NA D 50000 50000 1 2 1
## 753 11 NA NA D 56000 56000 2 2 1
## 807 11 NA NA A 63000 NA 2 2 1
## 811 11 NA NA L 61000 31000 2 2 1
## 841 11 NA NA D 51000 50000 1 2 1
## 880 11 NA NA F 50000 NA 1 2 1
## 944 11 NA NA D 53000 52000 3 2 1
## 977 11 NA NA M 56000 55000 1 2 1
## 988 11 NA NA G 56000 56000 1 2 5
## 990 11 NA NA M 55000 54000 1 2 1
## 996 11 NA NA M 67000 67000 1 2 5
## 1236 11 NA NA F 52000 NA 1 2 1
## 1241 11 NA NA G 57000 57000 1 2 5
## 1360 11 NA NA D 66000 65000 1 2 2
## 1365 11 NA NA E 85000 61000 4 2 2
## 1379 11 NA NA M 61000 61000 1 2 1
## 1381 11 NA NA M 77000 40000 1 2 1
## 1412 11 NA NA K 52000 46000 1 2 1
## 1426 11 NA NA G 49000 40000 2 2 1
## 1443 11 NA NA G 95000 69000 1 2 1
## 1450 11 NA NA P 51000 NA 2 2 1
## 1497 11 NA NA D 49000 48000 4 2 1
## 1599 11 NA NA M 68000 58000 1 2 1
## 1616 11 NA NA I 65000 NA 1 2 1
## 1625 11 NA NA G 54000 52000 1 2 5
## 1633 11 NA NA F 58000 58000 1 2 1
## 1738 11 NA NA D 84000 84000 1 2 2
## 1746 11 NA NA D 61000 61000 1 2 2
## 1801 11 NA NA G 82000 NA 4 2 1
## 1822 11 NA NA G 81000 NA 1 2 1
## 1853 11 NA NA D 65000 62000 2 2 1
## toimiala.y SLHKY sllvy
## 20 M 4 6
## 22 M 5 6
## 79 G 2 5
## 98 F 4 5
## 109 D 4 5
## 134 D 9 9
## 170 G 3 6
## 194 P 5 5
## 283 G 5 7
## 327 M 1 1
## 417 F 5 6
## 438 G 7 9
## 484 K 5 6
## 519 M 4 5
## 612 D 9 9
## 753 D 9 9
## 807 A 1 2
## 811 L 9 7
## 841 D 9 8
## 880 F 1 3
## 944 D 9 8
## 977 M 3 5
## 988 G 4 7
## 990 M 2 5
## 996 M 4 6
## 1236 F 1 3
## 1241 G 5 7
## 1360 D 9 9
## 1365 D 9 9
## 1379 M 7 7
## 1381 M 3 5
## 1412 K 9 8
## 1426 G 4 6
## 1443 G 2 5
## 1450 P 1 3
## 1497 D 6 6
## 1599 M 2 4
## 1616 I 1 4
## 1625 G 3 6
## 1633 F 9 9
## 1738 D 9 9
## 1746 D 9 9
## 1801 G 2 5
## 1822 G 3 5
## 1853 D 9 9
We observe that these are 45 observations, which are consistent with the total earned income of other employees. There is no evidence for a data entry error. The interesting observation is that 34 of these employees work in private domestic companies, and 42 are native Finnish speakers (other 3 are Swedish speaking). Only 7 of them are women.
We continue to investigate this distribution using other exploratory graphics.
boxplot(svatva, notch=T, horizontal=T,main="Boxplot of total earned income")
This boxplot does not show clear evidence that the distribution is right skewed. The employees with total earned income higher than 10 000 are shown as outliers (more than the estimated maximum of the distribution (1.5 times IQR larger than third quartile)). Also there is one outlier from the left (less than estimated minimum of the distribution (1.5 times IQR smaller than first quartile)). This does not means that these aoutliers are not part from the investigated population.
In order to be able to compare the histograms with other histograms, using different number of observations, we use probability density instead frequencies when plot the histogram. These densities are computed using 3 different kernel density estimators, which are smoothed continuous approximation to the histogram.
hist(svatva, freq=F, breaks=seq(0, 101000, by=2500), cex.main =0.9, main="Probability density for total earned income")
#density - computes kernel density estimates.
#These are smoothed continuous approximation to the histogram.
lines(density(svatva),lwd=2)
lines(density(svatva, adj=.5),lwd=1)
lines(density(svatva, adj=2),lwd=1.5)
We again see that this distribution has quite long right tail.
Next we compare the actual distribution to the theoretical one using quantile-quantile plot.
qqnorm(svatva, main="QQ plot for total earned income vs Normal distribution",ylab="Total earned income")
qqline(svatva, col=4)
This plot shows clearly that the distribution is not normal, especially at the right tail where the values are too high.
Point estimation. Inference of the mean. We have used summary() function to compute the descriptive statistics, including also the median and mean of this sample. Here we can try to make some inferences about the population mean (point estimate). For such small sample we assume t-distribution. For the null hyphotesis H0 here we assume the mean mu= 18000. We set confidence interval 99%.
t.test(svatva, mu=18000, conf.level = 0.99)
##
## One Sample t-test
##
## data: svatva
## t = 6.059, df = 1915, p-value = 1.645e-09
## alternative hypothesis: true mean is not equal to 18000
## 99 percent confidence interval:
## 18823.60 20043.84
## sample estimates:
## mean of x
## 19433.72
The best estimate of the mean is 19433.72 for total earned income of the company. With only 1% to be wrong we state that the true mean is between 18823.60 and 20043.84 total earned income. Since the p-value is very small (almost 0), we cannot reject H0.
Conclusion: We observe clear deviation from normal distribution of total earned income. Therefore we expect problems in the further multivariate analysis.
We are interested to find the relative frequencies of different categories in both variables. For this purpose we first count the number of occurrences of each type using the table function, and next compute the proportions of the different classes.
counts_sllvy <- table(sllvy)
proportions_sllvy<-counts_sllvy / sum(counts_sllvy)
proportions_sllvy
## sllvy
## 1 2 3 4 5 6
## 0.05741127 0.05010438 0.03288100 0.08402923 0.14874739 0.15448852
## 7 8 9
## 0.12473904 0.13778706 0.20981211
The most common category corresponds to the sample mode, which in this case is the category 9, companies with turnover >= 200 000 000. Since the values are nominal, is is not very correct to compute the median directly from them. Next we display the counts on bar plot:
library(ggplot2)
library(GGally)
bar_sllvy <- ggplot(data, aes(x = sllvy))
bar_sllvy <- bar_sllvy + geom_bar()
#summary(bar_sllvy)
ggplot(data, aes(x = sllvy)) +
geom_bar(fill = "orange", width = 0.7) +
xlab("Company group according its turnover") + ylab("Number of Observations")
###Analysis of family status peas.
counts_peas <- table(peas)
proportions_peas<-counts_peas / sum(counts_peas)
proportions_peas
## peas
## 0 1 2 3 4 5
## 0.160229645 0.393006263 0.222860125 0.080897704 0.075678497 0.063674322
## 9
## 0.003653445
We see that the most common category (and also a sample mode) is group 1, which codes the man in the family who is also a had of the family.
The bar plot is visualized as:
bar_peas <- ggplot(data, aes(x = peas))
bar_peas <- bar_peas + geom_bar()
#summary(bar_peas)
ggplot(data, aes(x = peas)) +
geom_bar(fill = "orange", width = 0.7) +
xlab("Family group") + ylab("Number of Observations")
Before to summarize and graph these data we first look them carefully and try to clarify which total earned incomes are associated with different groups of company turnover. For this purpose we first sort the observations and use order function to keep the indexes of sorted vector.
svatva[order(svatva)][1:100]
## [1] 0 0 0 0 1000 1000 1000 1000 1000 1000 1000 1000 1000 2000
## [15] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 3000 3000 3000
## [29] 3000 3000 3000 3000 3000 3000 3000 3000 3000 3000 3000 4000 4000 4000
## [43] 4000 4000 4000 4000 4000 4000 4000 4000 4000 4000 5000 5000 5000 5000
## [57] 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000
## [71] 5000 5000 5000 5000 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000
## [85] 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000
## [99] 7000 7000
sllvy[order(svatva)] [1:100]
## [1] 3 3 4 2 6 1 7 4 3 2 2 7 5 6 8 1 8 9 2 3 3 5 2 5 1 5 4 5 9 6 6 2 6 4 4
## [36] 1 1 8 1 4 2 4 4 5 9 9 9 2 4 2 5 4 4 9 8 2 4 2 6 1 1 4 9 2 4 1 7 2 2 2
## [71] 3 2 7 5 4 4 8 6 7 4 2 8 4 6 2 9 4 1 2 4 4 2 3 1 2 3 5 9 6 9
## Levels: 1 2 3 4 5 6 7 8 9
It is difficult to make some conclusions only by looking these values. Next we use by method to compute some statistics for every level of the categorical variable. We compute also length of the range of income in every company turnover group.
by(svatva,sllvy,range)
## sllvy: 1
## [1] 1000 51000
## --------------------------------------------------------
## sllvy: 2
## [1] 0 63000
## --------------------------------------------------------
## sllvy: 3
## [1] 0 52000
## --------------------------------------------------------
## sllvy: 4
## [1] 0 68000
## --------------------------------------------------------
## sllvy: 5
## [1] 1000 99000
## --------------------------------------------------------
## sllvy: 6
## [1] 1000 74000
## --------------------------------------------------------
## sllvy: 7
## [1] 1000 64000
## --------------------------------------------------------
## sllvy: 8
## [1] 2000 53000
## --------------------------------------------------------
## sllvy: 9
## [1] 2000 85000
by(svatva,sllvy,function(x) max(x)-min(x))
## sllvy: 1
## [1] 50000
## --------------------------------------------------------
## sllvy: 2
## [1] 63000
## --------------------------------------------------------
## sllvy: 3
## [1] 52000
## --------------------------------------------------------
## sllvy: 4
## [1] 68000
## --------------------------------------------------------
## sllvy: 5
## [1] 98000
## --------------------------------------------------------
## sllvy: 6
## [1] 73000
## --------------------------------------------------------
## sllvy: 7
## [1] 63000
## --------------------------------------------------------
## sllvy: 8
## [1] 51000
## --------------------------------------------------------
## sllvy: 9
## [1] 83000
The number of observations in every group also can be obtained using by method:
by(svatva,sllvy,length)
## sllvy: 1
## [1] 110
## --------------------------------------------------------
## sllvy: 2
## [1] 96
## --------------------------------------------------------
## sllvy: 3
## [1] 63
## --------------------------------------------------------
## sllvy: 4
## [1] 161
## --------------------------------------------------------
## sllvy: 5
## [1] 285
## --------------------------------------------------------
## sllvy: 6
## [1] 296
## --------------------------------------------------------
## sllvy: 7
## [1] 239
## --------------------------------------------------------
## sllvy: 8
## [1] 264
## --------------------------------------------------------
## sllvy: 9
## [1] 402
This corresponds to the bar graph visualization during univariate analysis of sllvy.
Next we use boxplot visualization, where the incomes are divided by the different factors of companies turnover. The attribute notch=T on this plot shows if the class medians are significantly different.
boxplot(svatva~sllvy, notch=T, horizontal=T, xlab="Total earned income", ylab="turnover group")
The results shown on this box plot are somehow surprising. It seems that there is not too big differences in the total earned income of different employees and the size of turnover of the company where they work. There are also not too large differences in the average income of employees in different groups. The strange thing is that in the lowest turnover group 1 the mean income is higher comparing to groups 2, 3 and 4. The mean income increases in groups 5, 6, 7, 8 and 9. When increase the group number (especially in group 5), the differences reflect in more outliers from right. When we compare this plot with the number of observations in every group shown above, we see that groups 1, 3 and 8 are significantly under-represented (about twice less then group 5). The groups 3, 6 and 9 have the widest ranges, while the groups 1 and 2 have the lowest ranges. In all groups the distributions are somehow skewed, negatively or positively. Only in group 3 the distribution seems to be symmetric.
A numerical conformation of the results shown on box plots could be obtained using summaries about different groups in by method:
by(svatva,sllvy,summary)
## sllvy: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 14000 18000 18209 21000 51000
## --------------------------------------------------------
## sllvy: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 7000 10000 11677 14000 63000
## --------------------------------------------------------
## sllvy: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 9000 15000 17016 21000 52000
## --------------------------------------------------------
## sllvy: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 11000 15000 16453 20000 68000
## --------------------------------------------------------
## sllvy: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 14000 17000 19765 21000 99000
## --------------------------------------------------------
## sllvy: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 14000 18000 20101 24000 74000
## --------------------------------------------------------
## sllvy: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 15000 18000 20234 24000 64000
## --------------------------------------------------------
## sllvy: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 15000 19000 20462 23000 53000
## --------------------------------------------------------
## sllvy: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 16000 20000 21316 25000 85000
The simplest ANOVA is one-way, where the total variance of the data is compared to the residual variance after each observation’s value is adjusted for the mean for the one factor. Here the question is how big proportion of employees with their corresponding total income varies among the 9 companies group (divided by their turnover). In R the method, use for ANOVA, is lm. It is used also for linear regression. ANOVA is just another form of the same linear modeling, as it is shown in [4].
lm_an<-lm(svatva~sllvy)
summary(lm_an)
##
## Call:
## lm(formula = svatva ~ sllvy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19316 -5462 -2101 2899 79235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18209.1 965.8 18.853 < 2e-16 ***
## sllvy2 -6532.0 1414.8 -4.617 4.16e-06 ***
## sllvy3 -1193.2 1600.5 -0.746 0.45605
## sllvy4 -1755.7 1253.1 -1.401 0.16135
## sllvy5 1555.8 1137.1 1.368 0.17138
## sllvy6 1892.3 1131.2 1.673 0.09452 .
## sllvy7 2025.2 1167.1 1.735 0.08287 .
## sllvy8 2253.0 1149.6 1.960 0.05016 .
## sllvy9 3106.8 1090.0 2.850 0.00442 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10130 on 1907 degrees of freedom
## Multiple R-squared: 0.0475, Adjusted R-squared: 0.04351
## F-statistic: 11.89 on 8 and 1907 DF, p-value: < 2.2e-16
The Intersept corresponds to the mean for Group 1 in sllvy. The mean estimates for other groups are shown in corresponding coefficients in column Estimate as difference the corresponding group and intercept (mean for group 1). This summary shows only first two groups as important, and group 9 as quite significant. The value of Adjusted R-squared is 0.043, which means that only 4.3% of total variation is explained by sllvy. Since p-value< 2.2e-16 this means that the probability this amount of variability to be explained by chance is practically 0.
We repeat the same pipeline to investigate bivariate relationship between total earned income svatva and family status peas.
svatva[order(svatva)][1:100]
## [1] 0 0 0 0 1000 1000 1000 1000 1000 1000 1000 1000 1000 2000
## [15] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 3000 3000 3000
## [29] 3000 3000 3000 3000 3000 3000 3000 3000 3000 3000 3000 4000 4000 4000
## [43] 4000 4000 4000 4000 4000 4000 4000 4000 4000 4000 5000 5000 5000 5000
## [57] 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000
## [71] 5000 5000 5000 5000 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000
## [85] 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000
## [99] 7000 7000
peas[order(svatva)] [1:100]
## [1] 4 1 4 1 3 1 3 0 4 1 1 3 3 3 3 2 2 3 1 3 2 3 2 3 3 3 1 2 3 3 3 2 3 3 3
## [36] 2 3 3 3 3 2 3 5 1 2 2 3 1 1 2 3 3 2 1 3 1 0 3 3 5 2 3 2 4 2 2 3 3 4 2
## [71] 1 5 3 2 3 2 2 9 3 2 1 3 2 4 2 5 3 3 1 2 0 3 3 2 3 5 3 5 3 3
## Levels: 0 1 2 3 4 5 9
As before, here is also difficult to make some conclusions only by looking these values. Next apply by method.
by(svatva,peas,range)
## peas: 0
## [1] 1000 74000
## --------------------------------------------------------
## peas: 1
## [1] 0 99000
## --------------------------------------------------------
## peas: 2
## [1] 2000 81000
## --------------------------------------------------------
## peas: 3
## [1] 1000 27000
## --------------------------------------------------------
## peas: 4
## [1] 0 51000
## --------------------------------------------------------
## peas: 5
## [1] 4000 37000
## --------------------------------------------------------
## peas: 9
## [1] 6000 22000
by(svatva,peas,function(x) max(x)-min(x))
## peas: 0
## [1] 73000
## --------------------------------------------------------
## peas: 1
## [1] 99000
## --------------------------------------------------------
## peas: 2
## [1] 79000
## --------------------------------------------------------
## peas: 3
## [1] 26000
## --------------------------------------------------------
## peas: 4
## [1] 51000
## --------------------------------------------------------
## peas: 5
## [1] 33000
## --------------------------------------------------------
## peas: 9
## [1] 16000
Find out the number of observations, which fall in different family groups:
by(svatva,peas,length)
## peas: 0
## [1] 307
## --------------------------------------------------------
## peas: 1
## [1] 753
## --------------------------------------------------------
## peas: 2
## [1] 427
## --------------------------------------------------------
## peas: 3
## [1] 155
## --------------------------------------------------------
## peas: 4
## [1] 145
## --------------------------------------------------------
## peas: 5
## [1] 122
## --------------------------------------------------------
## peas: 9
## [1] 7
This corresponds to the bar graph visualization during univariate analysis of peas. Different groups are not equally represented. The biggest part fall in Group 1, 753 observations, followed by Group 2 (427 observations) and Group 0 (307 observations). The groups 3, 4 and 5 have comparable number of observations (155, 145 and 122), while the group 9 with unknown family status is presented only by 7 observations.
The box plot is shown below. Since some notches went outside hinges, the attribute notch=T is now set as notch=F.
boxplot(svatva~peas, notch=F, horizontal=T, xlab="Total earned income", ylab="Family group")
As it was expected, the highest average income in Group 1, for the heads of the families. It has the highest range, and the biggest number of outliers from right. The smallest range shows group 9. The range of group 5 is also small. All distributions are not symmetric.
Numerically the same box plot results are shown below:
by(svatva,peas,summary)
## peas: 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 14000 18000 18847 22000 74000
## --------------------------------------------------------
## peas: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 17000 21000 23578 27000 99000
## --------------------------------------------------------
## peas: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 12000 16000 16415 19000 81000
## --------------------------------------------------------
## peas: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 6000 12000 11871 16000 27000
## --------------------------------------------------------
## peas: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 15000 19000 19628 24000 51000
## --------------------------------------------------------
## peas: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4000 13000 15000 15615 18000 37000
## --------------------------------------------------------
## peas: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6000 10500 14000 13571 16000 22000
It seems that the family status has some influence on earned income. This statement is further examined using ANOVA.
lm_an_2<-lm(svatva~peas)
summary(lm_an_2)
##
## Call:
## lm(formula = svatva ~ peas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23578 -5578 -1415 3385 75422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18846.9 550.5 34.234 < 2e-16 ***
## peas1 4730.8 653.2 7.243 6.35e-13 ***
## peas2 -2432.4 721.8 -3.370 0.000767 ***
## peas3 -6975.9 950.5 -7.340 3.16e-13 ***
## peas4 780.7 972.0 0.803 0.421976
## peas5 -3232.2 1032.4 -3.131 0.001769 **
## peas9 -5275.5 3687.2 -1.431 0.152663
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9646 on 1909 degrees of freedom
## Multiple R-squared: 0.1354, Adjusted R-squared: 0.1327
## F-statistic: 49.83 on 6 and 1909 DF, p-value: < 2.2e-16
In this case the first 5 family groups (from 0 to 4) are shown to be significant, and also the family group 5 could be included in the model. The value of Adjusted R-squared now is 0.1327, which means that 13.27% of total variation is explained by peas. Since p-value< 2.2e-16 this means that the probability this amount of variability to be explained by chance is practically 0.
The ggplot2 package in R allows visualization of two categorical variables using bar plots. We use this capability to visualize the company turnover and family status.
g0<-ggplot(data, aes(x = sllvy, fill = peas)) + geom_bar(position = "dodge")
g0 + xlab("Company turnover")
We see that except the company turnover group 1, in all other groups the biggest part of employees is form of family heads (family status 1). Next biggest proportion in most of the groups is the spouse (status 2). For the lowest turnover group 1 the proportion of spouses is highest. This most important family group is 3 = child.
The ggplot2 package in R offers also visualization of the numerical predctor and two categorical variables on one plot. The total income (on y axes) for every company turnover group (on x axes) is shown as side-by-side box plots using different colors for different family statuses.
First generate frequency tables:
table(sllvy, peas)
## peas
## sllvy 0 1 2 3 4 5 9
## 1 16 26 47 6 5 10 0
## 2 7 50 21 9 5 3 1
## 3 9 31 7 4 10 2 0
## 4 33 47 31 27 11 9 3
## 5 42 105 58 29 28 23 0
## 6 50 122 53 24 21 25 1
## 7 41 86 50 16 22 23 1
## 8 42 107 67 18 14 15 1
## 9 67 179 93 22 29 12 0
Since the number of observations in every group is not equal, this is not balanced design.
At this stage, before to start actual ANOVA modeling, we compute some descriptive statiustics. Means of all groups:
tapply(svatva,list(sllvy,peas),mean)
## 0 1 2 3 4 5 9
## 1 19687.50 21307.69 16595.74 6666.667 33200.00 14800.00 NA
## 2 11857.14 13160.00 10047.62 9333.333 10000.00 9000.00 8000
## 3 13444.44 20387.10 19571.43 7250.000 13000.00 11500.00 NA
## 4 17575.76 21255.32 12548.39 12444.444 18000.00 11666.67 14000
## 5 18976.19 24742.86 16637.93 13103.448 20142.86 14304.35 NA
## 6 18780.00 24418.03 17094.34 10291.667 20190.48 17960.00 6000
## 7 19975.61 24720.93 17280.00 12000.000 20000.00 16434.78 17000
## 8 18214.29 24785.05 17955.22 13944.444 18428.57 16733.33 22000
## 9 20402.99 25452.51 16709.68 12772.727 21241.38 16250.00 NA
Standard deviations:
tapply(svatva,list(sllvy,peas),sd)
## 0 1 2 3 4 5 9
## 1 6139.693 8531.210 7033.026 5240.865 15139.353 6033.241 NA
## 2 5814.596 9792.459 6272.768 4663.690 5567.764 3605.551 NA
## 3 4126.473 12867.731 17232.306 4272.002 8666.667 7778.175 NA
## 4 12811.202 11333.805 5702.857 6710.115 8532.292 4387.482 1000
## 5 11595.811 16553.543 9809.945 6597.320 7998.677 3495.904 NA
## 6 9109.963 10589.906 6708.961 6300.305 8003.868 6711.185 NA
## 7 7900.911 11680.704 5671.411 7554.248 5209.881 5061.675 NA
## 8 3904.567 9659.318 6832.260 6999.767 3837.353 5535.169 NA
## 9 7563.989 11185.451 6233.815 6179.319 6550.110 6397.798 NA
g1 <- ggplot(data, aes(x = sllvy, y = svatva, col=peas))
g1 + geom_boxplot() + xlab("Company turnover group")+ ylab("Earned total income")
We see that in the lowest turnover group 1 the highest mean income have employees with family status 4, heads of cohabiting family. It is interesting to observe that this is the hihgest mean income among all other turnover groups with different family statuses. ###Line plots with multiple groups
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggpubr)
## Loading required package: magrittr
ggline(data, x = "sllvy", y = "svatva",
xlab = "Company turnover group", ylab="Earned total income",
color = "peas",
add = "mean_se", palette = 1:7)
Here we see how the mean total income values are connected in different company turnover groups for different family statuses.
Since for these data we have the case of unbalanced design, there are 3 methods to apply two way ANOVA test in these data, namely Type I, Type II and Type III sum of squares [7].
We first examine the case without interactions, so called additive model. Here we also have to assume that the two categorical variables sllvy and peas are independent. For such additive models the Type II two-sided ANOVA method is recommended [7].
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
anova_ind <- aov(svatva ~ sllvy + peas, data = data)
Anova(anova_ind, type = "II")
## Anova Table (Type II tests)
##
## Response: svatva
## Sum Sq Df F value Pr(>F)
## sllvy 9.4266e+09 8 13.318 < 2.2e-16 ***
## peas 2.7487e+10 6 51.777 < 2.2e-16 ***
## Residuals 1.6820e+11 1901
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From these results we can conclude that both company turnover group sllvy and family status peas are statistically significant with very small p-value (less than 2.2e-16). We expect that if there is no interaction between these variables, changing one of them will impact significantly the mean total income of employee.
Next we assume that there is a simultaneous effect of company turnover groups and family statuses and involve the interaction term in the model:
#anova_inter <- aov(svatva ~ sllvy + peas + sllvy:peas, data = data)
#Anova(anova_inter, type = "III")
This code chunk produces the following error message:
“Error in Anova.III.lm(mod, error, singular.ok = singular.ok, …) : there are aliased coefficients in the model”
This seems to be a warning about multicollinearity. Therefore we continue to analyze only the first model and compute some summary statistics.
We compute the grand mean and the mean by groups:
model.tables(anova_ind, type="means")
## Tables of means
## Grand mean
##
## 19433.72
##
## sllvy
## 1 2 3 4 5 6 7 8 9
## 18209 11677 17016 16453 19765 20101 20234 20462 21316
## rep 110 96 63 161 285 296 239 264 402
##
## peas
## 0 1 2 3 4 5 9
## 18666 23581 16394 12316 19572 15504 15600
## rep 307 753 427 155 145 122 7
These numbers are not much different from analogous ones during uni- ja bivariate analyses.
A significant p-value in ANOVA two side tests means that some of the group means are different. Since we do not know which pairs of groups are different, we can clarify it by performing multiple pairwise comparison. In R this can be provided using the method TukeyHSD().
TukeyHSD(anova_ind)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = svatva ~ sllvy + peas, data = data)
##
## $sllvy
## diff lwr upr p adj
## 2-1 -6532.0076 -10611.71796 -2452.297 0.0000255
## 3-1 -1193.2179 -5808.34979 3421.914 0.9968070
## 4-1 -1755.6748 -5368.96666 1857.617 0.8516866
## 5-1 1555.8214 -1722.92225 4834.565 0.8678828
## 6-1 1892.2604 -1369.47326 5153.994 0.6812937
## 7-1 2025.2187 -1340.24648 5390.684 0.6355932
## 8-1 2253.0303 -1061.82960 5567.890 0.4662184
## 9-1 3106.8295 -36.23502 6249.894 0.0555466
## 3-2 5338.7897 602.69524 10074.884 0.0140182
## 4-2 4776.3328 1009.76588 8542.900 0.0027494
## 5-2 8087.8289 4640.90164 11534.756 0.0000000
## 6-2 8424.2680 4993.51666 11855.019 0.0000000
## 7-2 8557.2263 5027.70733 12086.745 0.0000000
## 8-2 8785.0379 5303.73865 12266.337 0.0000000
## 9-2 9638.8371 6320.70466 12956.969 0.0000000
## 4-3 -562.4569 -4903.24390 3778.330 0.9999813
## 5-3 2749.0393 -1317.49654 6815.575 0.4740587
## 6-3 3085.4783 -967.35530 7138.312 0.3045260
## 7-3 3218.4366 -918.33867 7355.212 0.2756386
## 8-3 3446.2482 -649.46296 7541.959 0.1817258
## 9-3 4300.0474 342.09233 8258.002 0.0215390
## 5-4 3311.4961 431.71326 6191.279 0.0109519
## 6-4 3647.9352 787.53375 6508.337 0.0025137
## 7-4 3780.8935 802.74881 6759.038 0.0027003
## 8-4 4008.7051 1087.86857 6929.542 0.0007152
## 9-4 4862.5042 2138.19807 7586.810 0.0000012
## 6-5 336.4391 -2087.64380 2760.522 0.9999683
## 7-5 469.3973 -2092.56030 3031.355 0.9997381
## 8-5 697.2089 -1797.89972 3192.318 0.9945378
## 9-5 1551.0081 -710.87727 3812.894 0.4533606
## 7-6 132.9583 -2407.19410 2673.111 1.0000000
## 8-6 360.7699 -2111.94408 2833.484 0.9999534
## 9-6 1214.5690 -1022.58821 3451.726 0.7553497
## 8-7 227.8116 -2380.20747 2835.831 0.9999991
## 9-7 1081.6108 -1304.24793 3467.469 0.8951240
## 9-8 853.7992 -1460.12863 3167.727 0.9670641
##
## $peas
## diff lwr upr p adj
## 1-0 4914.83416 3034.8788 6794.7895 0.0000000
## 2-0 -2272.51174 -4349.9407 -195.0828 0.0215216
## 3-0 -6350.65701 -9086.2232 -3615.0908 0.0000000
## 4-0 905.87564 -1891.6708 3703.4221 0.9631384
## 5-0 -3162.79075 -6134.0519 -191.5296 0.0283201
## 9-0 -3066.26861 -13678.5289 7545.9917 0.9791609
## 2-1 -7187.34590 -8869.2098 -5505.4820 0.0000000
## 3-1 -11265.49117 -13714.2210 -8816.7614 0.0000000
## 4-1 -4008.95852 -6526.7396 -1491.1775 0.0000573
## 5-1 -8077.62491 -10787.1171 -5368.1327 0.0000000
## 9-1 -7981.10278 -18523.0677 2560.8621 0.2774203
## 3-2 -4078.14527 -6681.5578 -1474.7328 0.0000823
## 4-2 3178.38738 509.9233 5846.8514 0.0081596
## 5-2 -890.27901 -3740.3363 1959.7783 0.9691287
## 9-2 -793.75687 -11372.7220 9785.2083 0.9999902
## 4-3 7256.53265 4048.9928 10464.0725 0.0000000
## 5-3 3187.86626 -172.2585 6547.9910 0.0760456
## 9-3 3284.38840 -7443.2448 14012.0216 0.9720702
## 5-4 -4068.66639 -7479.4408 -657.8920 0.0080076
## 9-4 -3972.14425 -14715.7497 6771.4612 0.9308300
## 9-5 96.52213 -10693.6208 10886.6651 1.0000000
These results show that the biggest part of pairs in company turnover groups show high p-values (3-1, 4-1, 5-1, 6-1, 7-1, 8-1, 4-3, 5-3, 6-3, 7-3, 8-3, 6-5, 7-5, 8-5, 9-5, 7-6, 8-6, 9-6, 8-7, 9-7, 9-8). Also quite many pairs in peas express higher p-values (4-0, 9-0, 9-1, 5-2, 9-2, 9-3, 9-4, 9-5).
op <- par(mfrow = c(2, 2))
plot(anova_ind)
par(op)
On the first plot the points 1822, 1443 and 79 are detected as outliers, which can affect normality and homogeneity of variance. It can be useful to remove them in order to match the test assumptions.
The second plot shows that the normality assumption is violated on the right upper part of the plot. As we have seen in the previous univariate analyses and plots, the total income data have long tail from right.
Because of the unbalanced design the leverages are not constant. On the fourth plot they are drawn in x-axis.
Next we observe that some of the levels of the categorical variables could be combined since they do not differ too much. For example the family status of people, who are officially married (family statuses 1 and 2) and others, who live together and probably have children, but are not officially married (family statuses 4 and 5), should not differ too much in the sense of their expenses and way of living. Since the children, who work, should not be too small, for me they look somehow similar (in the sense that financially they do not take care for the other members of family) to the singles. Also the group of unknowns is quite small and could be joined to the group of singles. Thus the new larger groups are: 0 - single/child/unknown; 1 - head, 2 - spouse. I also decided to group every 3 consequent (by their turnover) company groups in one bigger, so company group 1 has turnover LV <100 000, for group 2 the turnover LV is 100 000 <= LV < 10 000 000 , for group 3 the turnover 10 000 000 <= LV.
#5.1. Group the levels of peas and sllvy
#Group the levels in family status)
#0 - single/child/unknown
data$peas[data$peas=='3']<-'0'
data$peas[data$peas=='9']<-'0'
#1 - head
data$peas[data$peas=='4']<-'1'
#2 - spouse
data$peas[data$peas=='5']<-'2'
#drop levels
data$peas<-droplevels(data$peas)
#Group the levels in company turnover
#1 - new group formed from 1,2,3.
data$sllvy[data$sllvy=='2']<-'1'
data$sllvy[data$sllvy=='3']<-'1'
#4 - new group formed from 4,5,6.
data$sllvy[data$sllvy=='5']<-'4'
data$sllvy[data$sllvy=='6']<-'4'
#7 - new group formed from 7,8,9.
data$sllvy[data$sllvy=='8']<-'7'
data$sllvy[data$sllvy=='9']<-'7'
data$sllvy<-droplevels(data$sllvy)
levels(data$sllvy)<-c("1", "2", "3")
To get a glimpse of these combined data we plot all investigated variables against each other.
var<-data[,c("svatva", "peas", "sllvy")]
assignInNamespace("ggally_cor", ggally_cor, "GGally")
ggpairs(var, upper = list(continuous = wrap("cor", size = 10)), lower = list(continuous = "smooth"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We see that the distribution of total earned income for the highest group of heads of families deviates from normal distribution in similar way as the wholedistribution of the earned income. The distributions of earned income for highest two turnover company groups also have long right tails.
Next we repeat the bivariate and multivariate analyses following the same pipeline as before. ##Bivariate plots The corresponding bivariate box plots in this case are:
boxplot(data$svatva~data$peas, notch=F, horizontal=T, xlab="Total earned income", ylab="Family status")
We observe that the mean values f both groups of singles and spouses are very similar, while heads of family have higher mean value of their total income in wider range.
boxplot(data$svatva~data$sllvy, notch=F, horizontal=T, xlab="Total earned income", ylab="Company turnover")
Here we see clear tendency for increasing the mean total income in different company turnover groups. In this case there is clear difference between 3 different groups.
g0<-ggplot(data, aes(x = data$sllvy, fill = data$peas)) + geom_bar(position = "dodge")
g0 + xlab("Company turnover")
The observed dependences on the box plots are even better visualized on this bar plot. Only the family group 0 of singles/children/unknown does not show clear increasing tendency with increasing the company turnover.
We follow the same steps as before. First generate frequency tables and make the same observation as before - the number of observations, which fall in different groups, is different, so again we observe that the design is not balanced.
We show some descriptive statistics anout the means
tapply(data$svatva,list(data$sllvy,data$peas),mean)
## 0 1 2
## 1 13076.92 17244.09 14733.33
## 2 15918.66 23239.52 15793.97
## 3 17855.77 24366.13 17096.15
and standard deviations:
tapply(data$svatva,list(data$sllvy,data$peas),sd)
## 0 1 2
## 1 7023.340 11439.08 8323.245
## 2 9916.385 12624.26 7482.151
## 3 7428.784 10399.96 6138.868
Box plot with multiple groups.
g1 <- ggplot(data, aes(x = data$sllvy, y = data$svatva, col=data$peas))
g1 + geom_boxplot() + xlab("Company turnover group")+ ylab("Earned total income")
Line plots with multiple groups
library(dplyr)
library(ggpubr)
ggline(data, x = "sllvy", y = "svatva",
xlab = "Company turnover group", ylab="Earned total income",
color = "peas",
add = "mean_se", palette = 1:7)
Both graphs suggest that there is clear dependency between the earned total income and both categorical variables - company turnover and family status. ###Two way ANOVA test As before for the case of unbalanced design we apply type II sum of squares method to run ANOVA.
First we assume that both categorical variables are independent.
anova_ind <- aov(data$svatva ~ data$sllvy + data$peas, data = data)
Anova(anova_ind, type = "II")
## Anova Table (Type II tests)
##
## Response: data$svatva
## Sum Sq Df F value Pr(>F)
## data$sllvy 5.4278e+09 2 28.937 4.162e-13 ***
## data$peas 2.0570e+10 2 109.666 < 2.2e-16 ***
## Residuals 1.7923e+11 1911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again both independent variables appear to be very significant because the very low p-values. We also check the model which involves the simultaneous effect of both predicates. Here we apply type III sum of squares method.
anova_inter <- aov(data$svatva ~ data$sllvy + data$peas + data$sllvy:data$peas, data = data)
Anova(anova_inter, type = "III")
## Anova Table (Type III tests)
##
## Response: data$svatva
## Sum Sq Df F value Pr(>F)
## (Intercept) 8.8923e+09 1 95.2037 < 2.2e-16 ***
## data$sllvy 1.0616e+09 2 5.6830 0.003461 **
## data$peas 7.4191e+08 2 3.9716 0.019000 *
## data$sllvy:data$peas 1.1073e+09 4 2.9638 0.018704 *
## Residuals 1.7812e+11 1907
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we observe that all predicates are significant. The intercept is the estimate of the predicate when all the independent variables are 0. By a rule the significance of the intercept is not of interest because its value can be changed by recoding the predictor, and this will not affect the meaning of the model. Since this model shows significant simultaneous effect of both predicates, we cannot assume that these are independent and continue its investigation.
We compute some summary statistics:
model.tables(anova_inter, type="means")
## Tables of means
## Grand mean
##
## 19433.72
##
## data$sllvy
## 1 2 3
## 15599 19181 20781
## rep 269 742 905
##
## data$peas
## 0 1 2
## 16403 22921 16319
## rep 469 898 549
##
## data$sllvy:data$peas
## data$peas
## data$sllvy 0 1 2
## 1 13077 17244 14733
## rep 52 127 90
## 2 15919 23240 15794
## rep 209 334 199
## 3 17856 24366 17096
## rep 208 437 260
Next we obtain the multiple pairwise-comparison between the means of the groups by computing Tukey Honest Significant Differences (THS).
TukeyHSD(anova_inter)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = data$svatva ~ data$sllvy + data$peas + data$sllvy:data$peas, data = data)
##
## $`data$sllvy`
## diff lwr upr p adj
## 2-1 3582.080 1968.7626 5195.397 0.0000006
## 3-1 5182.702 3608.5169 6756.888 0.0000000
## 3-2 1600.622 477.9764 2723.269 0.0024234
##
## $`data$peas`
## diff lwr upr p adj
## 1-0 6517.56548 5226.102 7809.029 0.0000000
## 2-0 -83.99162 -1509.348 1341.365 0.9895235
## 2-1 -6601.55710 -7829.653 -5373.461 0.0000000
##
## $`data$sllvy:data$peas`
## diff lwr upr p adj
## 2:0-1:0 2841.7372 -1809.1074 7492.5818 0.6156561
## 3:0-1:0 4778.8462 125.7747 9431.9176 0.0388103
## 1:1-1:0 4167.1714 -773.7662 9108.1090 0.1792236
## 2:1-1:0 10162.5979 5688.5036 14636.6922 0.0000000
## 3:1-1:0 11289.2096 6886.7191 15691.7001 0.0000000
## 1:2-1:0 1656.4103 -3571.2534 6884.0739 0.9873559
## 2:2-1:0 2717.0468 -1957.0214 7391.1149 0.6788812
## 3:2-1:0 4019.2308 -539.8296 8578.2911 0.1354280
## 3:0-2:0 1937.1089 -1002.2296 4876.4475 0.5108029
## 1:1-2:0 1325.4342 -2051.1767 4702.0451 0.9524928
## 2:1-2:0 7320.8607 4673.9461 9967.7753 0.0000000
## 3:1-2:0 8447.4724 5923.4757 10971.4692 0.0000000
## 1:2-2:0 -1185.3270 -4969.1218 2598.4679 0.9882630
## 2:2-2:0 -124.6904 -3097.1557 2847.7748 1.0000000
## 3:2-2:0 1177.4936 -1610.6363 3965.6234 0.9281598
## 1:1-3:0 -611.6747 -3991.3523 2768.0028 0.9997611
## 2:1-3:0 5383.7517 2732.9263 8034.5772 0.0000000
## 3:1-3:0 6510.3635 3982.2658 9038.4612 0.0000000
## 1:2-3:0 -3122.4359 -6908.9676 664.0958 0.2041206
## 2:2-3:0 -2061.7994 -5037.7477 914.1490 0.4383684
## 3:2-3:0 -759.6154 -3551.4583 2032.2275 0.9954501
## 2:1-1:1 5995.4265 2866.7421 9124.1109 0.0000001
## 3:1-1:1 7122.0382 4096.6343 10147.4421 0.0000000
## 1:2-1:1 -2510.7612 -6645.9290 1624.4067 0.6239620
## 2:2-1:1 -1450.1246 -4858.6520 1958.4028 0.9252049
## 3:2-1:1 -147.9406 -3396.9678 3101.0865 1.0000000
## 3:1-2:1 1126.6118 -1054.6090 3307.8325 0.8029284
## 1:2-2:1 -8506.1876 -12070.4924 -4941.8829 0.0000000
## 2:2-2:1 -7445.5511 -10133.0619 -4758.0403 0.0000000
## 3:2-2:1 -6143.3671 -8625.4696 -3661.2647 0.0000000
## 1:2-3:1 -9632.7994 -13106.7987 -6158.8001 0.0000000
## 2:2-3:1 -8572.1629 -11138.7008 -6005.6249 0.0000000
## 3:2-3:1 -7269.9789 -9620.5608 -4919.3970 0.0000000
## 2:2-1:2 1060.6365 -2751.6674 4872.9404 0.9946984
## 3:2-1:2 2362.8205 -1307.5718 6033.2128 0.5441775
## 3:2-2:2 1302.1840 -1524.5147 4128.8827 0.8861276
Again we observe some high p-values, indicating that there is not significant efect between corresponding pairs of variables. I am not very sure how strong the conclusions from this test for the case of unbalanced design are. Residual analysis: diagnostic plots to check the assumptions about normally distributed data and variance.
op1 <- par(mfrow = c(2, 2))
plot(anova_inter)
par(op1)
On the first plot we observe that the assumption about homogeneity of variances is probably valid. Again the points 1822, 1443 and 79 are detected as outliers and can affect the assumptions about normality and homogeneity of variance. In this model we are able to compute Levene’s test to check the homogeneity of variances (leveneTest() in car package), but as it is stated in [10], it is not recommended for the case of unbalanced design because the significance level could be under- or overestimated. On the second plot we again observe that the normality assumption is violated on the right upper part of the plot. On the fourth plot the leverages are drawn in x-axis.
After wrangling around all columns, I choose some of them which look interesting and plot them against each other. I added to the three already investigated variables the other numerical variable tyotu (income from salary), and syntyv (year of birth), and also one more categorical variable - SLHKY about the size of the company measured by number of employees. Since we have already observed that it is difficult to make some inference when the number of categories is too big, I have applied similar grouping of all 3 consequent categories to 1 bigger in SLHKY as well. There are two different colors on the plot, separating males (pink) and females (blue).
#6.2 plot chosen variables using the combined groups in peas, sllvy and SLHKY.
var1<-data_p[,c("svatva", "tyotu", "syntyv", "peas", "sllvy", "SLHKY")]
str(var1)
## 'data.frame': 1783 obs. of 6 variables:
## $ svatva: int 24000 14000 25000 24000 21000 26000 42000 13000 23000 14000 ...
## $ tyotu : int 24000 14000 25000 23000 21000 26000 41000 12000 21000 14000 ...
## $ syntyv: int 1942 1951 1956 1947 1944 1952 1942 1950 1961 1954 ...
## $ peas : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 1 3 2 2 ...
## $ sllvy : Factor w/ 3 levels "1","2","3": 2 2 3 3 3 2 2 2 2 2 ...
## $ SLHKY : Factor w/ 3 levels "1","2","3": 2 1 3 3 3 2 1 1 1 1 ...
p <- ggpairs(var1, mapping = aes(col=data_p$sukup, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)), cardinality_threshold=23)
p
We observe that there is not so big difference between age distribution of males and females. This variable will probably become more informative if we transform it to age (current year - year of birth), and separate different age groups in different categories. We see that the companies with biggest turnover (group 3) are also the companies with bigger number of employees. The pairwise plot also show strong correlation (0.954) between the total earned income svatva and income from salary tyotu. All these findings should be investigated further.
In this work we found that grouping some levels into bigger groups in two categorical variables significantly improved the modeling results and helped to clarify some important tendencies. This is valid only for this particular data subset and the modeling task defined as it is. The situation could be opposite in other situations, when dividing the existing groups will lead to discovering new dependences between the variables. It seems that we cannot say in advance what approach to apply; just have to explore different possibilities.
Here we have analyzed only a data subset for year=2, which became very small after omitting only a part of the missing values. Therefore we cannot be sure that the results are representative for the whole populations. Furthermore, we could just by chance to be able to observe dependences, which do not exist in other subsets. Therefore it is logical to analyze other samples for different years, providing some statistical analyses for comparing these samples. I have also checked some statistics for year vuosi=10. The data points there are a bit more, but the difference is not very significant.
summary(data$svatva)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 14000 18000 19434 23000 99000
summary(data_10$svatva)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 17000 25000 26490 33000 100000
Here we observe that all values (not only mean and median) for latter year=10 are higher, which is logical because there are 8 years difference and we expect increasing total earned income in time (at least because the inflation).
summary(sllvy)
## 1 2 3 4 5 6 7 8 9
## 110 96 63 161 285 296 239 264 402
summary(data_10$sllvy)
## 1 2 3 4 5 6 7 8 9
## 238 56 72 177 320 310 229 296 392
Here we observe slight decreases in the frequencies of company turnover in higher groups and increasing frequencies in group one, which also cannot be explained only by chance, but it is rather because the economical changes in Finland during this time period.
summary(peas)
## 0 1 2 3 4 5 9
## 307 753 427 155 145 122 7
summary(data_10$peas)
## 0 1 2 3 4 5 9
## 432 681 388 165 240 174 10
Here we observe decreasing the amount of officially married couples and increase of the proportion of couples who live together without official marriage. Also the proportion of singles is increased. This also is rather regular tendency in time than random event. It seems that if we would like to compare the modeling results, we have to choose some subsets quite near in time.
One of the simplified assumptions here is to remove the missing data. For this particular task we did not removed any values from both categorical variables concerning family status and company turnover because these data were complete. But if we were interested in some other categorical variables (like for example toimiala or suuralue), we have to probably remove some valuable information, which is involved in these missing values. For the categorical variable one of the possibilities is to form an additional category from the missing values, and then apply the multiple correspondance analysis to investigate the associations between different categories [2, 3].