Chapter 3 Data Introduction
Data preprocessing is one of the most important steps in our research: the quality of the preprocessing directly determines the quality of the data analysis. Most of the time, raw data are messy and disorganized, so my task is to organize them and split them into different data tables for different analyses. For a single data table, life is easy: all I need to do is put the variables of interest into one large matrix. For multiple data tables, I need to split the data based on their basic features (variables measuring similar constructs, such as number of close friends and social network size). Most importantly, I need to build a design matrix for each data table, i.e., a grouping strategy. Different grouping strategies can lead to completely different plots and conclusions. However, the choice of grouping variables is entirely up to you: in my current data set, I can group participants by social intelligence, general intelligence, age, sex, GPA, and so on. I highly recommend reading the related article before choosing the grouping variables for your analysis.
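As a toy illustration of what a grouping strategy looks like in practice, the sketch below (on synthetic data that I made up, not on the real data set) splits a continuous score into three groups by tertiles and then expands the group into a design (dummy-coded) matrix:

```r
# Toy sketch on synthetic data: build a grouping variable from a
# continuous score, then expand it into a design matrix.
set.seed(42)
gpa  <- round(runif(12, 2.0, 4.0), 2)           # fake GPA scores
cuts <- quantile(gpa, seq(0, 1, 1/3))           # tertile boundaries
grp  <- cut(gpa, breaks = cuts, include.lowest = TRUE,
            labels = c("low", "medium", "high"))
design <- model.matrix(~ grp - 1)               # one dummy column per group
colnames(design) <- levels(grp)
head(design)
```

Each participant gets exactly one 1 per row; changing the cut points (or the grouping variable) changes the design matrix, and with it every group-level plot downstream.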
In the rest of this chapter, I introduce the three data sets used in this book: the Collective Action data set, the Low Income Sausage data set, and the Wines data set. The main data set is used for PCA, BADA, PLS, MCA, and MFA; the second is for CA; and the last is for DiSTATIS.
3.1 Main data set: Collective Action Data Set
The main data set we are using comes from a research article published in PNAS last year. It consists of several parts, each stored in its own data table:
- The first part is experimental data:
- The game performance for each round
- The time usage for each round
- Emotional & Social Intelligence test
- State of reasoning
- Comprehension
- Spontaneous
- Academic results
- SAT&ACT overall
- SAT&ACT Math
- SAT&ACT Verbal
- GPA
- Attitude
- Global_warming
- Government
- Environmental ecology
- Social relationship
- Close friends number
- Social network size
- Personality
- Helpful
- Trust
- Hard working
- Volunteer
- Take advantage
- Demographic
- Age
- Sex
- Parent Education
- Parents’ job1
- Parents’ job2
- Race
- Income
- Language
- Major
- Religion
- Emotional processing ability
- Eyetest scores
- Quizzes scores
All these variables are easy to understand except those from the experiment and the emotional-processing tests. The experiment is a token-collection game: the aim is to collect as many tokens as possible. Tokens are generated automatically according to a rule. If you and your teammates follow this rule while playing, tokens are generated at a higher rate during the game; if you do not, the game ends sooner and you finish with fewer tokens in your pocket.
The emotional processing ability measure uses a classic paradigm in psychology: participants see only the eye region of many face photographs and must decide which emotion each face is expressing. The social intelligence measure is similar, but participants read short passages and respond to questions about them; this Short Story Test (SST) measures how well participants can empathize with others.
All these variables are described in detail in the supplementary information document on the PNAS website. In my data preprocessing, you will notice that I use only the Negative condition as the example in this book. The reason I did not include both the Positive and Negative conditions is that I ran out of time before the final-project deadline; I will certainly finish it after the class.
- Negative Condition: R1-R3, HIGH resource, LOW group members; R4-R6, LOW resource, HIGH group members
- Positive Condition: R1-R3, LOW resource, HIGH group members; R4-R6, HIGH resource, LOW group members
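For reference, the session design listed above can be restated as a small lookup table (this is only an illustration of the list, not part of the actual preprocessing):

```r
# The two session orders, encoded as a lookup table (illustration only).
cond <- data.frame(
  condition = rep(c("Negative", "Positive"), each = 2),
  rounds    = rep(c("R1-R3", "R4-R6"), 2),
  resource  = c("HIGH", "LOW", "LOW", "HIGH"),
  members   = c("LOW", "HIGH", "HIGH", "LOW")
)
cond
```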
The following chunk shows in detail how I cleaned and preprocessed the raw data (the raw data can be downloaded here 1.3).
# Data preprocessing----
# Required packages (loaded once for the whole chunk)
library(RCurl)           # getURL()
library(dplyr)           # select()
library(prettyGraphs)    # prettyGraphsColorSelection()
library(psych)           # multi.hist()
library(knitr)           # kable()
load(url("https://github.com/yilewang/MSA/blob/main/Dataset/CogCollect.RData?raw=true"))
# Data Cleaning: Missing values
rownames(data.tab3) <- 1:360
data.tab3$q2[212] <- 3.08
data.tab3$q2[300] <- 1.7
# Binding all the data sets
all.cog <- cbind(data.frame(data.info_demo), data.frame(data.pca), data.frame(data.tab2[3:8]),
    data.frame(data.tab3[3:19]))
all.cog <- all.cog[order(all.cog[, "ANumber"]), ]
rownames(all.cog) <- c(1:360)
# Give Colnames
colnames(all.cog)[46:59] <- c("age", "GPA", "close_friends", "social_network", "instruction",
"take_advantage", "helpful", "trust", "volunteer", "global_warming", "hard_work",
"government", "environment_eco", "parent_edu")
colnames(all.cog)[18:25] <- c("religions", "language", "races", "incomes", "majors",
"sex", "parent_job_1", "parent_job_2")
# create sub data tables
sub.token.collection <- cbind(all.cog[, 26:27], all.cog[, 30:36])
sub.token.collection.time <- cbind(all.cog[, 26:27], all.cog[, 37:42])
sub.empathy <- cbind(all.cog[, 26:27], all.cog[, 43:45], all.cog[, "Spontaneous"])
colnames(sub.empathy)[6] <- "Spontaneous"
sub.emotion <- cbind(all.cog[, 26:27], all.cog[, 28:29])
sub.personality <- cbind(all.cog[, 26:27], all.cog[, 51:54], all.cog[, 56])
colnames(sub.personality)[7] <- "hard_work"
sub.attitude <- cbind(all.cog[, 26:27], all.cog[, 55], all.cog[, 57:58])
colnames(sub.attitude)[3] <- "global_warming"
sub.friends <- cbind(all.cog[, 26:27], all.cog[, 48:49])
sub.demographic <- cbind(all.cog[, 26:27], all.cog[, 46], all.cog[, 18:25],
    all.cog[, "parent_edu"])
colnames(sub.demographic)[12] <- "parent_edu"
colnames(sub.demographic)[3] <- "age"
## ****ACT converts to SAT**** ACT DATA
ACT.math.table <- all.cog[which(all.cog$ACT.MATH.SCORE < 40), ]
ACT.reading.table <- all.cog[which(all.cog$ACT.READING.SCORE < 40), ]
ACT.english.table <- all.cog[which(all.cog$ACT.ENGLISH.SCORE < 40), ]
# ACT_MATH
ACT.math <- cbind(as.matrix(ACT.math.table$ANumber), as.matrix(ACT.math.table$ACT.MATH.SCORE))
# ACT_Verbal
ACT.verbal <- matrix(0, nrow = length(ACT.english.table$ACT.ENGLISH.SCORE))
for (i in 1:length(ACT.english.table$ACT.ENGLISH.SCORE)) {
    ACT.verbal[i] <- sum(ACT.reading.table$ACT.READING.SCORE[i] + ACT.english.table$ACT.ENGLISH.SCORE[i])
}
ACT.verbal <- cbind(as.matrix(ACT.reading.table$ANumber), ACT.verbal)
# SAT DATA
SAT.math.table <- all.cog[which(all.cog$SAT.MATH.SCORE < 801), ]
SAT.verbal.table <- all.cog[which(all.cog$SAT.VERBAL.SCORE < 801), ]
# SAT_MATH
SAT.math <- cbind(as.matrix(SAT.math.table$ANumber), as.matrix(SAT.math.table$SAT.MATH.SCORE))
# SAT_VERBAL
SAT.verbal <- cbind(as.matrix(SAT.verbal.table$ANumber), as.matrix(SAT.verbal.table$SAT.VERBAL.SCORE))
# Create a calculator to convert ACT to SAT: read conversion tables from GitHub
x.math <- getURL("https://raw.githubusercontent.com/yilewang/MSA/main/Dataset/ACT_MATH.csv")
math_con <- read.csv(text = x.math)
colnames(math_con)[1] <- "ACT_MATH"  ## Avoid UTF-8 error
x.verbal <- getURL("https://raw.githubusercontent.com/yilewang/MSA/main/Dataset/ACT_VERBAL.csv")
verbal_con <- read.csv(text = x.verbal)
colnames(verbal_con)[1] <- "ACT_VERBAL"  ## Avoid UTF-8 error
# Create a workspace to store the new MATH and VERBAL info
act.math.con <- ACT.math
act.verbal.con <- ACT.verbal
act.comp.con <- cbind(all.cog[, 26], all.cog[, 7])
colnames(act.comp.con) <- c("ANumber", "Composite")
# Make a calculator to convert ACT scores to SAT scores
calculator.math <- function(score) {
    index <- which(math_con == score)
    return(math_con[index, 2])
}
calculator.verbal <- function(score) {
    index <- which(verbal_con == score)
    return(verbal_con[index, 2])
}
# For loop to convert the ACT score to SAT score
for (i in 1:length(ACT.math[, 1])) {
    act.math.con[i, 2] <- calculator.math(ACT.math[i, 2])
}
for (i in 1:length(ACT.verbal[, 1])) {
    act.verbal.con[i, 2] <- calculator.verbal(ACT.verbal[i, 2])
}
for (i in 1:length(all.cog[, 1])) {
    act.comp.con[i, 2] <- calculator.math(all.cog$ACT.COMPOSITE.SCORE[i])
}
# Merge data
MATH <- rbind(act.math.con, SAT.math)
VERBAL <- rbind(act.verbal.con, SAT.verbal)
COMP <- act.comp.con
# Get the RANK
MATH <- MATH[order(MATH[, 1]), ]
MATH <- MATH[-54, ]
VERBAL <- VERBAL[order(VERBAL[, 1]), ]
VERBAL <- VERBAL[-54, ]
COMP <- COMP[order(COMP[, 1]), ]
sub.grades <- cbind(all.cog[, 26:27], COMP[, 2], MATH[, 2], VERBAL[, 2], all.cog$GPA)
colnames(sub.grades)[6] <- "GPA"
colnames(sub.grades)[3:5] <- c("SAT.COMP", "SAT.MATH", "SAT.VERBAL")
# the data we can use for our analysis
data.to.use <- cbind(sub.token.collection[, 1:8], sub.token.collection.time[, 3:8],
    sub.friends[, 3:4], sub.grades[, 3:5], sub.empathy[, 3:6], sub.personality[, 3:7],
    sub.attitude[, 3:5])
supplement.token <- cbind(all.cog[, 26], all.cog[, 36])
colnames(supplement.token)[1:2] <- c("ANumber", "total_earning")
supplement.token <- as.data.frame(supplement.token)

sub.group <- matrix(0, ncol = 4, nrow = 360)
colnames(sub.group) <- c("gpa", "emo", "group", "col")
rownames(sub.group) <- 1:360
sub.group <- cbind(all.cog[, 26:27], sub.group)
# Grouping variable: GPA
sub.gpa <- cbind(all.cog[, 26:27], all.cog$GPA)
colnames(sub.gpa)[3] <- "general"
quan.gpa <- quantile(as.numeric(all.cog$GPA), seq(0, 1, 1/3))
under <- which(sub.gpa[, 3] <= quan.gpa[2])
between <- which(sub.gpa[, 3] > quan.gpa[2] & sub.gpa[, 3] <= quan.gpa[3])
over <- which(sub.gpa[, 3] > quan.gpa[3])
sub.group[, 3][under] <- 1
sub.group[, 3][between] <- 2
sub.group[, 3][over] <- 3
# Grouping variable: Emo
sub.emo <- cbind(all.cog[, 26:27], all.cog$EyesTestScore)
colnames(sub.emo)[3] <- "general"
quan.emo <- quantile(as.numeric(all.cog$EyesTestScore), seq(0, 1, 1/3))
under <- which(sub.emo[, 3] <= quan.emo[2])
between <- which(sub.emo[, 3] > quan.emo[2] & sub.emo[, 3] <= quan.emo[3])
over <- which(sub.emo[, 3] > quan.emo[3])
sub.group[, 4][under] <- 100
sub.group[, 4][between] <- 200
sub.group[, 4][over] <- 300
# spearman correlation results > 0.90
spearman.gpa <- cor(sub.gpa[, 3], sub.group[, 3], method = "spearman")
spearman.emo <- cor(sub.emo[, 3], sub.group[, 4], method = "spearman")
print(paste("corr gpa:", spearman.gpa))
## [1] "corr gpa: 0.931355248926045"
print(paste("corr emotional", spearman.emo))
## [1] "corr emotional 0.945673602462476"
sub.group[, 5] <- sub.group[, 3] + sub.group[, 4]
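The grouping trick here sums the two coded columns (GPA coded 1-3, emotion coded 100-300), so each of the nine GPA-by-emotion cells gets a unique value between 101 and 303. A quick toy check of that claim:

```r
# Toy check: gpa codes 1-3 plus emo codes 100-300 give nine unique sums,
# one per GPA-by-emotion cell.
cells <- outer(c(100, 200, 300), c(1, 2, 3), "+")
cells
length(unique(as.vector(cells)))   # 9 distinct cell codes
```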
# for each condition low-median-high
condition1 <- which(sub.group[, 5] == 101)
condition2 <- which(sub.group[, 5] == 102)
condition3 <- which(sub.group[, 5] == 103)
condition4 <- which(sub.group[, 5] == 201)
condition5 <- which(sub.group[, 5] == 202)
condition6 <- which(sub.group[, 5] == 203)
condition7 <- which(sub.group[, 5] == 301)
condition8 <- which(sub.group[, 5] == 302)
condition9 <- which(sub.group[, 5] == 303)
# give names
sub.group[, 5][condition1] <- "LS_LG"
sub.group[, 5][condition2] <- "LS_MG"
sub.group[, 5][condition3] <- "LS_HG"
sub.group[, 5][condition4] <- "MS_LG"
sub.group[, 5][condition5] <- "MS_MG"
sub.group[, 5][condition6] <- "MS_HG"
sub.group[, 5][condition7] <- "HS_LG"
sub.group[, 5][condition8] <- "HS_MG"
sub.group[, 5][condition9] <- "HS_HG"
# choose 4 conditions
sub.conditions.1 <- sub.group[which(sub.group[, 5] == "LS_LG"), ]
sub.conditions.2 <- sub.group[which(sub.group[, 5] == "HS_LG"), ]
sub.conditions.3 <- sub.group[which(sub.group[, 5] == "HS_HG"), ]
sub.conditions.4 <- sub.group[which(sub.group[, 5] == "LS_HG"), ]
# put them into a group
sub.conditions.1$group <- as.factor(sub.conditions.1$group)
sub.conditions.2$group <- as.factor(sub.conditions.2$group)
sub.conditions.3$group <- as.factor(sub.conditions.3$group)
sub.conditions.4$group <- as.factor(sub.conditions.4$group)
# make sub condition and organize them
sub.condition <- rbind(sub.conditions.1, sub.conditions.2, sub.conditions.3, sub.conditions.4)
data.to.use.sup <- cbind(data.to.use, supplement.token$total_earning)
colnames(data.to.use.sup)[32] <- "total_earning"
final.table <- merge(sub.condition, data.to.use.sup, by = "ANumber")
final.table <- select(final.table, -c("SessionType.y"))
names(final.table)[names(final.table) == "SessionType.x"] <- "SessionType"
### ________________________________________###
# the two conditions, positive and negative, we can use for analysis
exp.neg <- final.table[which(final.table$SessionType == "Low_High" | final.table$SessionType ==
    "8.4"), ]
exp.pos <- final.table[which(final.table$SessionType == "High_Low" | final.table$SessionType ==
    "4.8"), ]
## color
index <- prettyGraphsColorSelection(n.colors = 4, starting.color = sample((1:300), 4))
# manually setting color design
m.color.design <- as.matrix(colnames(exp.neg[8:36]))
m.color.design[1:3] <- prettyGraphsColorSelection(starting.color = sample(1:170, 1))
m.color.design[4:6] <- prettyGraphsColorSelection(starting.color = sample(1:170, 1))
m.color.design[7:9] <- m.color.design[1:3]
m.color.design[10:12] <- m.color.design[4:6]
m.color.design[13:14] <- prettyGraphsColorSelection(starting.color = sample(1:170, 1))
m.color.design[15:17] <- prettyGraphsColorSelection(starting.color = sample(1:170, 1))
m.color.design[18:21] <- prettyGraphsColorSelection(starting.color = sample(1:170, 1))
m.color.design[22:26] <- prettyGraphsColorSelection(starting.color = sample(1:170, 1))
m.color.design[27:29] <- prettyGraphsColorSelection(starting.color = sample(1:170, 1))
### ________________________________________###
## **** ending of the data preprocessing ****
It is worth mentioning that the chunk above handles the preprocessing of the quantitative data. We used a similar strategy for the qualitative data: the essential step of qualitative coding is converting all the quantitative variables into a categorical format (1, 2, 3) using the quantile function. I have already compiled a function for this bin coding; please see Function Chapter 2.8 in this book.
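The real `bins_helper` is defined in Chapter 2.8; the sketch below is only a hypothetical stand-in (`bin_by_quantile` is a name I made up) showing the idea: recode a numeric vector into tertile bins 1/2/3 and report how well the bins preserve the original rank order, which is what the Spearman r values printed below measure.

```r
# Hypothetical stand-in for bins_helper (the real one is in Chapter 2.8):
# recode a numeric vector into tertile bins 1/2/3 and report the Spearman
# correlation between the raw values and the binned values.
bin_by_quantile <- function(x, label = "variable") {
  q <- quantile(x, seq(0, 1, 1/3), na.rm = TRUE)
  bins <- cut(x, breaks = q, include.lowest = TRUE, labels = FALSE)
  r <- cor(x, bins, method = "spearman", use = "complete.obs")
  message(label, " spearman r: ", round(r, 7))
  bins
}

set.seed(1)
x <- rnorm(100)
b <- bin_by_quantile(x, "demo")
table(b)   # roughly equal-sized bins 1, 2, 3
```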
bins.exp.neg <- as.data.frame(exp.neg)
for (i in 7:26) {
    bins.exp.neg[, i] <- bins_helper(bins.exp.neg[, i], colnames(bins.exp.neg)[i])
}
## R1 spearman r: 0.9425675
## R2 spearman r: 0.9433265
## R3 spearman r: 0.9408037
## R4 spearman r: 0.9436019
## R5 spearman r: 0.9418503
## R6 spearman r: 0.9436669
## R1.TimeLeft spearman r: 0.9446858
## R2.TimeLeft spearman r: 0.9502324
## R3.TimeLeft spearman r: 0.9486652
## R4.TimeLeft spearman r: 0.9481581
## R5.TimeLeft spearman r: 0.945816
## R6.TimeLeft spearman r: 0.9473766
## close_friends spearman r: 0.9432373
## social_network spearman r: 0.942117
## SAT.COMP spearman r: 0.9438146
## SAT.MATH spearman r: 0.9404738
## SAT.VERBAL spearman r: 0.9433591
## Overall spearman r: 0.9451715
## Comprehension spearman r: 0.9781535
## State.Reasoning spearman r: 0.9233701
Let’s check the histograms:
hist.afterbin <- multi.hist(bins.exp.neg[, 7:length(bins.exp.neg[1, ])])

Figure 3.1: Histogram of Categorical Variables
3.2 Low Income Sausage
This sausage data set contains five products and 27 emotion words. Twenty judges were asked to evaluate the products by answering the question: How do you feel when you taste this product? The emotions were presented in a list, and the consumers could choose as many emotions as they wanted.
# import data
low.income.sausage <- getURL("https://raw.githubusercontent.com/yilewang/MSA/main/Dataset/6Lowincomesausages.csv")
table.sausage <- read.csv(text = low.income.sausage)
colnames(table.sausage)[1] <- "Product"
kable(head(table.sausage[, 1:5]), align = "c")
Product | Happy | Pleasantly.surprised | Unpleasantly.surprised | Salivating |
---|---|---|---|---|
Alpino | 27 | 21 | 11 | 19 |
Bafar | 28 | 20 | 6 | 19 |
Chimex | 26 | 20 | 7 | 12 |
Capistrano | 32 | 16 | 6 | 15 |
Duby | 33 | 17 | 9 | 12 |
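Because this check-all-that-apply table feeds into CA, it helps to see the quantity CA actually analyzes: the row profiles, i.e., each product's counts divided by its row total. A minimal sketch using only the five rows and four emotion columns shown above:

```r
# Sketch: row profiles of the count table, the quantity CA analyzes.
# Numbers are the ones shown in the table above (first four emotion words only).
counts <- matrix(c(27, 21, 11, 19,
                   28, 20,  6, 19,
                   26, 20,  7, 12,
                   32, 16,  6, 15,
                   33, 17,  9, 12),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(c("Alpino", "Bafar", "Chimex", "Capistrano", "Duby"),
                                 c("Happy", "Pleasantly.surprised",
                                   "Unpleasantly.surprised", "Salivating")))
row.profiles <- counts / rowSums(counts)
round(row.profiles, 3)   # each row now sums to 1
```

CA then compares these profiles to the average profile, so products differ only in the *pattern* of emotions chosen, not in how many boxes their judges ticked overall.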
3.3 Wines
There are several wines from two countries: South Africa and France. The wines table includes all the ratings from the judges; the wines.names table is a dictionary of the descriptive words in English and French; and the wines.flavors table gives the words used to describe each wine.
# import data
wines <- getURL("https://raw.githubusercontent.com/yilewang/MSA/main/Dataset/Data_sorting_tri_fr_no_info_translate.csv")
wines.names <- getURL("https://raw.githubusercontent.com/yilewang/MSA/main/Dataset/Dictionary.csv")
wines.flavors <- getURL("https://raw.githubusercontent.com/yilewang/MSA/main/Dataset/Voc.csv")
table.wines <- read.csv(text = wines)
table.wines.names <- read.csv(text = wines.names)
table.wines.flavors <- read.csv(text = wines.flavors)
### Avoid UTF-8 error
colnames(table.wines)[1] <- "X"
colnames(table.wines.names)[1] <- "French"
colnames(table.wines.flavors)[1] <- "X"
kable(head(table.wines[, 1:5]), align = "c")
X | J1 | J2 | J3 | J4 |
---|---|---|---|---|
FCAR | 4 | 2 | 4 | 1 |
SRUD | 3 | 4 | 3 | 1 |
FBAU | 5 | 4 | 2 | 2 |
FROC | 4 | 1 | 4 | 1 |
SFED | 5 | 4 | 2 | 1 |
SREY | 1 | 1 | 4 | 3 |
kable(head(table.wines.names), align = "c")
French | English |
---|---|
acide | acidic |
peu acide | low acidity |
agressif | aggressive |
agrumes | citrus |
agréable | pleasant |
alcool | alcohol |
kable(head(table.wines.flavors[, 1:5]), align = "c")
X | acide | peu.acide | agressif | agrumes |
---|---|---|---|---|
FCAR | 5 | 2 | 0 | 2 |
SRUD | 4 | 3 | 0 | 2 |
FBAU | 2 | 5 | 0 | 0 |
FROC | 7 | 1 | 0 | 0 |
SFED | 6 | 2 | 1 | 2 |
SREY | 7 | 2 | 1 | 1 |
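For DiSTATIS, each judge's column in the wines table is a sorting: wines sharing the same number were put in the same pile by that judge. The sketch below (using only the six wines and four judge columns shown in the first table above) turns such sortings into a co-occurrence matrix, the starting point for the distance matrices that DiSTATIS analyzes:

```r
# Sketch: from sorting data to a co-occurrence matrix (DiSTATIS input).
# Values are the J1-J4 columns shown in the wines table above.
sorts <- matrix(c(4, 2, 4, 1,
                  3, 4, 3, 1,
                  5, 4, 2, 2,
                  4, 1, 4, 1,
                  5, 4, 2, 1,
                  1, 1, 4, 3),
                nrow = 6, byrow = TRUE,
                dimnames = list(c("FCAR", "SRUD", "FBAU", "FROC", "SFED", "SREY"),
                                paste0("J", 1:4)))
# Co-occurrence: for each pair of wines, count the judges who put them
# in the same pile.
cooc <- matrix(0, 6, 6, dimnames = list(rownames(sorts), rownames(sorts)))
for (j in 1:ncol(sorts)) {
  cooc <- cooc + outer(sorts[, j], sorts[, j], "==")
}
cooc        # diagonal equals the number of judges (4)
4 - cooc    # a simple distance: judges who separated the pair
```

Each judge contributes one such matrix; DiSTATIS then finds a compromise across judges.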