1. Introduction
In this entry, through the analysis of a data set, we examine how to make histograms, tables, transitive graphs, and scatter plots and the way to calculate medians and interquartile ranges (IQR).
In health preference research (HPR), it is naturally difficult to determine patients’ real necessities. This is because it is often the case that even patients themselves are not necessarily able to tell specifically what medical procedure they hope to receive, facing complex choices and variable expected consequences. A discrete choice experiment (DCE), which presents multiple health descriptions and infers the patients’ weighting factors from the responses, is one of the effective ways to address this problem.
In this entry, before entering the details of DCEs, we are going to analyze the data set accessorily generated in the process of DCE; our focus here is the time spent on choosing a choice for each question, which is called the first response time. Respondents may change their answer to the other before proceeding to the next question, but the first response time only measures the time from starting the question to selecting the first choice. In the DCE held in 2016, 4088 US participants responded to 20 questions regarding hypothetical health conditions. The data set that we are going to analyze contains each individual’s first response time for 20 questions.
2. Load the data set and packages
library(knitr) #this is for the function "kable," which makes well-organized tables.
library(tidyverse) #this is for the function "read_csv."
data1 <- read_csv("C:\\Users\\aaa\\OneDrive\\USF\\Dr. Craig\\resp1wave1_220723.csv")
Please replace the inside of quotation marks with the address of the file location.
3. Make histograms
In general;
Let \(x_i\) be the value of each individual observation (\(1 \le i \le n\)) \[n = \sum_{k=1}^{K}m_k\] where
\[\begin{align}
n &= \text{the total number of observation } \\
m_k &= \text{the number of}\ x_i\ \text{within the}\ k\ \text{th bin}\ (1\le k \le K)\\
&= \text{the number of}\ x_i\ s.t.\ r_{k-1}\le x_i < r_k\\
&= \text{frequency}\\
r_j &= \text{the upper boundary of the}\ j\ \text{th bin}\ (min(k)-1\le j \le max(k)\ \text{or}\ 0\le j \le K)\\
&= \text{the lower boundary of the}\ j+1\ \text{th bin}\\
r_j-r_{j-1} &= \text{the width of each bin}\\
&=[max(r_j)-min(r_j)]/K = [r_{K}-r_0]/K\\
\end{align}\]
It is necessary to decide the range, \([min(r_j), max(r_j)) = [r_0, r_K)\), such that \(r_0=min(r_j) \le min(x_i)\ \wedge\ max(x_i) \le r_{K+1}=max(r_j)\), to create an understandable histogram.
Now, let \(K=50\) and \(x_i = frt_i\ (milliseconds)\) (frt: first response time) and we get the following histogram.
hist(data1$time,
breaks = 50,
main = "Histogram 1: Histogram of the first response time",
xlab = "the first response time (milliseconds)",
col = "lightgreen")
Now, let \(K=50\) and \(x_i = log(frt_i)\) and we get the following histogram.
hist(log(data1$time),
breaks = 50,
main = "Histogram 2: Log of the first response time",
xlab = "Log of the first response time",
col = "lightblue")
Comparing Histograms 1 and 2, we observe that by taking a logarithmic scale of the variable, the first response time (milliseconds), the originally right-skewed histogram, Histogram 1, was converted into the more bellshaped histogram, Histogram 2.
4. Make a table that contains the median and interquartile range for each question
Let’s suppose we hope to calculate the median and interquartile range (IQR) for Question 1. Then, we firstly need to gather the individual values of the first response time for Question 1 from the data set and secondly, calculate the median and IQR for that question. When we want to gain the median and IQR for each question, we are going to simply repeat that process 20 times since there are 20 questions in the experiment.
m2 =NULL
ir2 = NULL
iqr2 = NULL
for(i in 1:20) {d1 <- data1[data1$task == i, "time"]
m0 = median(d1$time)
m2 = rbind(m2, m0)
ir0 = quantile(d1$time)
ir2 = rbind(ir2, ir0)
iqr0 = IQR(d1$time)
iqr2 = rbind(iqr2, iqr0)
table1 = cbind(m2,iqr2, ir2)
}
rownames(table1) = paste0("Question ", 1:20)
colnames(table1) <- c("Median", "IQR", "0% percentile", "25% percentile",
"50% percentile", "75% percentile ","100% percentile")
table2 <- data.frame(cbind(c(1:20),table1))
colnames(table2) <- c("Question", "Median", "IQR", "0% percentile", "25% percentile",
"50% percentile", "75% percentile ","100% percentile")
kable(table1)
| Median | IQR | 0% percentile | 25% percentile | 50% percentile | 75% percentile | 100% percentile | |
|---|---|---|---|---|---|---|---|
| Question 1 | 29645.0 | 23640.25 | 2023 | 20152.00 | 29645.0 | 43792.25 | 3e+05 |
| Question 2 | 20553.0 | 18588.75 | 2050 | 12991.25 | 20553.0 | 31580.00 | 3e+05 |
| Question 3 | 19209.5 | 17642.50 | 1475 | 11748.75 | 19209.5 | 29391.25 | 3e+05 |
| Question 4 | 17394.5 | 16838.00 | 1590 | 10983.00 | 17394.5 | 27821.00 | 3e+05 |
| Question 5 | 16897.5 | 16081.25 | 1998 | 10359.75 | 16897.5 | 26441.00 | 3e+05 |
| Question 6 | 16023.5 | 15845.00 | 1732 | 9773.25 | 16023.5 | 25618.25 | 3e+05 |
| Question 7 | 15416.5 | 14887.75 | 1787 | 9423.50 | 15416.5 | 24311.25 | 3e+05 |
| Question 8 | 15081.0 | 15779.50 | 1700 | 9047.00 | 15081.0 | 24826.50 | 3e+05 |
| Question 9 | 14317.0 | 14732.75 | 1786 | 8566.75 | 14317.0 | 23299.50 | 3e+05 |
| Question 10 | 14210.5 | 14608.50 | 1669 | 8305.75 | 14210.5 | 22914.25 | 3e+05 |
| Question 11 | 13901.0 | 13757.50 | 1723 | 8545.75 | 13901.0 | 22303.25 | 3e+05 |
| Question 12 | 9884.0 | 10562.50 | 1938 | 5845.75 | 9884.0 | 16408.25 | 3e+05 |
| Question 13 | 8982.0 | 9519.50 | 1818 | 5109.25 | 8982.0 | 14628.75 | 3e+05 |
| Question 14 | 8580.0 | 9704.75 | 1944 | 4813.25 | 8580.0 | 14518.00 | 3e+05 |
| Question 15 | 8158.5 | 9037.50 | 1829 | 4461.75 | 8158.5 | 13499.25 | 3e+05 |
| Question 16 | 7905.5 | 8648.75 | 1801 | 4377.00 | 7905.5 | 13025.75 | 3e+05 |
| Question 17 | 7386.0 | 8255.75 | 1847 | 4082.25 | 7386.0 | 12338.00 | 3e+05 |
| Question 18 | 7250.5 | 8094.75 | 1538 | 4111.00 | 7250.5 | 12205.75 | 3e+05 |
| Question 19 | 7133.5 | 7991.25 | 1290 | 4044.25 | 7133.5 | 12035.50 | 3e+05 |
| Question 20 | 6908.5 | 7638.25 | 1731 | 3905.50 | 6908.5 | 11543.75 | 3e+05 |
5. Plot the results of the table
Now, we try to plot the results of the table above and connect plots to reveal the transition of median and IQR as the number of questions increases. The black points represent the median and red circles denote the 25% percentile and the 75% percentile for each question.
xmax <- 20
xmin <- 1
ymax <- 60000
ymin <- 0
plot(table2$Question, table2$Median, bty = "l", pch = 16, type ="o",
xlim = c(xmin, xmax), ylim = c(ymin, ymax),
xlab = NA, ylab =NA, )
par(new=T)
plot(table2$Question, table2$`25% percentile`, bty = "l", pch = 1, col = "red", type ="o",
xlim = c(xmin, xmax), ylim = c(ymin, ymax), xlab = NA, ylab =NA,)
par(new=T)
plot(table2$Question, table2$`75% percentile`, bty = "l", pch = 1, col = "red", type ="o",
xlim = c(xmin, xmax), ylim = c(ymin, ymax),
xlab = "n th Question", ylab ="The first response time (milliseconds)",
main = "Graph 1: The median and interquartile range of the first response time")
6. Make a scatter plot
In Graph 1, we can observe the tendency where the median and IQR decline as the experiment proceeds and the number of questions increases. Now, then, we focus on each individual. Since each participant answered 20 questions, let us classify the questions into two groups: the first ten questions and the second ten questions. Focusing on a specific individual, We are able to calculate his or her medians of the first response time for the first group of questions as well as the second group of questions. Let us call them median1 and median2. Preparing the plane with median1 in the x-axis and median2 in the y-axis, we can plot that individual in the plane. We repeat that process for all participants, and at the end, we can create the scatter plot with all individuals (Graph 2).
inv2 = NULL
for (i in 1:max(data1$survey_id)) {
ften <- subset(data1, subset = survey_id == i & task <10.5)
sten <- subset(data1, subset = survey_id == i & task >10.5)
inv0 = c(median(ften$time), median(sten$time))
inv2 = rbind(inv2, inv0)
}
table3 <- data.frame(cbind(c(1:max(data1$survey_id)),inv2))
colnames(table3) <- c("ID", "First_ten", "Second_ten" )
rownames(table3) = paste0("ID ", 1:max(data1$survey_id))
plot(table3$First_ten, table3$Second_ten, bty = "l", pch = 1, cex = 0.5,
xlab = "Median1",
ylab ="Median2",
main ="Graph 2: Scatter plots of the first median and second median"
)
par(new=T)
x1 <- seq(0:160000)
y1 <- x1
plot(x=x1, y=y1, type ="l", xlab = NA, ylab =NA, xaxt="n", yaxt="n",
bty ="n", col = "red")
Median1: Median of the first response time for the first ten questions
Median2: Median of the first response time for the second ten questions
The red line is the line where Median1 = Median2. We can observe that majority of plots are under the red line, and thus, for the majority of individuals, Median1 < Median2. This indicates that even when focusing on individual respondents, we can assume the first response time declines as the number of questions becomes larger.
7. Reference
Jakubczyk, M., Craig, B. M., Barra, M., Groothuis-Oudshoorn, C. G. M., Hartman, J. D., Huynh, E., Ramos-Goñi, J. M., Stolk, E. A., & Rand, K. (2017). Choice defines value: A predictive modeling competition in Health Preference Research. Value in Health, 21(2), 229–238. https://doi.org/10.1016/j.jval.2017.09.016
8. How to cite this entry
Okubo, S. (yyyy, month dd). Analysis of initial response time across 20 pair comparisons. R4HPR. https://r4hpr.org/visor/?e=analysis-of-initial-response-time-across-20-pair-comparisons