This entry examines the distributions of time to the initial response in milliseconds across a series of choice tasks. The R code shows how to illustrate these times as histograms, tables, and line plots as well as how to calculate their medians and interquartile ranges (IQR).
For this worked example, data on initial response times were extracted from the 2016 predictive modeling competition in health preference research (HPR) (Jakubczyk, Craig, et al. 2017). In 2016, 4088 US participants responded to 20 paired comparisons, each a choice between two alternative health outcomes. After the initial response, respondents may change their answers before proceeding to the next task (time to last response) or may spend additional time on the page before proceeding (i.e., page time); nevertheless, the time to initial response is a common behavioral measure of task difficulty in HPR. For this analysis, initial response times were truncated at five minutes (300,000 milliseconds).
We will continue to examine this topic. Please see the entry Correlations between initial response times, where we examine the variance, covariance, and correlations of time to the initial response.
Notes: Change the working directory to the location of the source files on your computer. To do so, replace the path inside setwd() with the location of the data file on your computer.
setwd("C:\\Users\\aaa\\OneDrive\\USF\\Dr. Craig")
library(knitr) #this is for the function "kable," which makes well-organized tables.
library(tidyverse) #this is for the function "read_csv."
library(tinytex)
library(gt)
data1 <- read_csv("resp1wave1_220723.csv")
Let \(x_{it}\) be the initial response time for respondent \(i\) and task \(t\) (\(1 \le i \le N\); \(1 \le t \le T\)). A histogram needs to satisfy Equation (1).
\[N\times T = \sum_{k=1}^{K}m_k
\tag{1}\] where
\[\begin{aligned}
N &= \text{the total number of respondents } \\
T &= \text{the total number of tasks per respondent } \\
K &= \text{the total number of bins } (1 \le K \le N \times T) \\
m_k &= \text{the number of}\ x_{it}\ \text{within the}\ k^{th}\
\text{bin}\ (1\le k \le K)\\
&= \text{the number of}\ x_{it}\ (r_{k-1}\le x_{it} < r_k)\\
&= \text{the frequency of the}\ k^{th}\ \text{bin}\\
r_j &= \text{the upper boundary of the}\ j^{th}\ \text{bin}\
(min(k)-1\le j \le max(k),\ \text{that is,}\ 0\le j \le K)\\
&= \text{the lower boundary of the}\ (j+1)^{th}\ \text{bin}\\
r_j-r_{j-1} &= \text{the width of each bin}\\
&=\frac{max(r_j)-min(r_j)}{K}=\frac{r_{K}-r_0}{K}
\end{aligned}\]
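The definitions above can be checked numerically. The following is a minimal sketch on simulated data, not the competition data: the lognormal times and the names `x`, `breaks`, `m`, and `T_tasks` are assumptions made for illustration. It builds \(K\) equal-width bins by hand, counts observations in left-closed bins \(r_{k-1}\le x_{it} < r_k\), and confirms that the frequencies sum to \(N\times T\) as in Equation (1).

```r
set.seed(1)
N <- 100; T_tasks <- 20; K <- 50
# Simulated response times in ms (hypothetical), truncated at 300,000 ms
x <- pmin(rlnorm(N * T_tasks, meanlog = 9.5, sdlog = 0.8), 300000)

# K + 1 equally spaced boundaries r_0, ..., r_K (bin width = 300000 / K)
breaks <- seq(0, 300000, length.out = K + 1)

# right = FALSE gives left-closed bins [r_{k-1}, r_k);
# include.lowest = TRUE places a value equal to r_K in the last bin
h <- hist(x, breaks = breaks, right = FALSE, include.lowest = TRUE, plot = FALSE)
m <- h$counts              # m_k, the frequency of the k-th bin

length(m)                  # number of bins, K
sum(m) == N * T_tasks      # Equation (1)
```

Note that when `breaks` is a single number, as in `hist(x, breaks = 50)`, R treats it as a suggestion and chooses "pretty" boundaries, so the number of bins drawn may differ from 50. By default R also uses right-closed bins; `right = FALSE` is set here to match the definition above.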
Now, let \(K=50\) (50 bins or breaks). This histogram represents the distribution of the initial response times \(x_{it}\) for all tasks (\(N=4088\); \(T=20\)).
hist(data1$time,
     breaks = 50,
     main = "Histogram 2.2: Initial response time for all tasks",
     xlab = "Initial response time (milliseconds)",
     col = "lightgreen")
Now, replace \(x_{it}\) with \(log(x_{it})\). This histogram represents the distribution of log initial response times for all tasks (\(N=4088\); \(T=20\)).
hist(log(data1$time),
     breaks = 50,
     main = "Histogram 2.3: Log initial response time for all tasks",
     xlab = "Log initial response time (log milliseconds)",
     col = "lightblue")
The log-scale histogram shows more clearly that the distribution is bimodal (i.e., it has two local maxima).
This code produces overlapping histograms to compare the distribution of initial response times between the first and last tasks (\(N=4088\); \(t=1,20\)).
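# Reshape the long data (one row per respondent and task) into a wide table
# with one column of initial response times per task.
d2 = matrix()
for(i in 1:20) {
  d1 <- data1[data1$task == i, "time"]  # times for task i
  d2 = cbind(d2, d1)
}
d3 <- as.data.frame(d2[,-1])            # drop the placeholder first column
colnames(d3) = paste0("Task", 1:max(data1$task))
rownames(d3) = paste0("ID ", 1:max(data1$survey_id))
xh <- sort(d3$Task1) %>% data.frame()   # task 1 times sorted in increasing order
colnames(xh) = c("task1")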
# Draw task 20 first (semi-transparent red), then overlay task 1 (semi-transparent blue)
hist(d3[,20], breaks = 50, col="#FF00007F",
     main="Histogram 2.4: Initial response time for the first and last tasks",
     xlab="Initial response time (milliseconds)")
hist(d3[,1], breaks = 50, col="#0000FF7F", add=TRUE)
legend( "topright",
legend=c("Task 1","Task 20"),
pch=15,
col=c("#0000FF7F", "#FF00007F"),
bty="n",
ncol=2,
pt.cex=3
)
The overlay histogram shows distributional differences between the first and last tasks.
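In the overlay above, each call to hist() picks its own bin boundaries from its own data, so the bars of the two tasks need not line up. The following is a minimal sketch of one way to put both tasks on a common set of bins; the simulated times and the names `task1`, `task20`, and `shared_breaks` are assumptions for illustration, not the competition data.

```r
set.seed(2)
# Simulated times in ms (hypothetical stand-ins for tasks 1 and 20)
task1  <- pmin(rlnorm(500, meanlog = 10.3, sdlog = 0.6), 300000)
task20 <- pmin(rlnorm(500, meanlog = 8.9,  sdlog = 0.7), 300000)

# One set of 50 equal-width bins covering the full truncated range
shared_breaks <- seq(0, 300000, length.out = 51)

# Compute both histograms on the same bins without drawing them
h1  <- hist(task1,  breaks = shared_breaks, plot = FALSE)
h20 <- hist(task20, breaks = shared_breaks, plot = FALSE)

# Use a common y-axis so neither histogram is clipped
ymax <- max(h1$counts, h20$counts)
plot(h20, col = "#FF00007F", ylim = c(0, ymax),
     main = "Overlaid histograms with shared bins",
     xlab = "Initial response time (milliseconds)")
plot(h1, col = "#0000FF7F", add = TRUE)
legend("topright", legend = c("Task 1", "Task 20"),
       fill = c("#0000FF7F", "#FF00007F"), bty = "n")
```

With shared bins, a difference in bar height at a given position reflects a difference in frequency rather than a difference in bin width.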
Definition: The \(k^{th}\) percentile is a number such that \(k\) percent of observations have an equal or smaller value than that number. For example, if the \(25^{th}\) percentile is 1256, then 25 percent of observations are equal to or smaller than 1256 and 75 percent are larger.
Definition: The median is the \(50^{th}\) percentile (Q2).
Assume the initial response times for task \(t\), \(x_{t}\), are sorted from smallest to largest, \(x_{t}[1]\le x_{t}[2]\le \dots\le x_{t}[N]\).
When \(N\) is even, that is, \(\exists\ a \in\mathbb{N}\ s.t.\ N = 2a\), \(median\ (Q2)= \frac{1}{2}(x_{t}[\frac{N}{2}]+x_{t}[\frac{N}{2}+1])\).
When \(N\) is odd, that is, \(\exists\ b \in\mathbb{N}\ s.t.\ N = 2b+1\), \(median\ (Q2)= x_{t}[\frac{N+1}{2}]\).
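The even and odd cases can be written as a short function. This is a minimal sketch; the function name `median_sorted` and the example times are hypothetical and are not part of the original analysis.

```r
# Median from the sorted values, following the even/odd rule above
median_sorted <- function(x) {
  xs <- sort(x)
  n <- length(xs)
  if (n %% 2 == 0) {
    0.5 * (xs[n / 2] + xs[n / 2 + 1])   # N even: mean of the two middle values
  } else {
    xs[(n + 1) / 2]                     # N odd: the middle value
  }
}

even_x <- c(1200, 800, 1500, 950)        # hypothetical times, N = 4
odd_x  <- c(1200, 800, 1500, 950, 2000)  # hypothetical times, N = 5

median_sorted(even_x)  # 1075, the mean of 950 and 1200
median_sorted(odd_x)   # 1200, the third sorted value
```

Both results agree with R's built-in median().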
Definition: The interquartile range (IQR) is the distance between the \(25^{th}\) percentile (Q1) and the \(75^{th}\) percentile (Q3). That is, \(IQR = Q3-Q1\).
Q1 and Q3 are given by Equations (2) and (3), respectively.
However, depending on the value of \(N\), \(\frac{N+1}{4}\) and \(\frac{3(N+1)}{4}\) may be non-integers.
In those cases, Q1 is obtained by linear interpolation between \(x_{t}[\lfloor\frac{N+1}{4}\rfloor]\) and \(x_{t}[\lceil\frac{N+1}{4}\rceil]\), and Q3 by linear interpolation between \(x_{t}[\lfloor\frac{3(N+1)}{4}\rfloor]\) and \(x_{t}[\lceil\frac{3(N+1)}{4}\rceil]\), as shown in cases (i)-(iv).
\[
Q1=x_{t}[\frac{N+1}{4}] \tag{2}
\]\[
Q3=x_{t}[\frac{3(N+1)}{4}]\tag{3}
\]
The linear interpolation falls into four cases (i)-(iv), according to the form of \(N\) in Equation (4).
\[ \ \forall N \in\ \mathbb{N}, \exists\ h \in\ \mathbb{N}\ s.t.\ N = \begin{cases} {4h}\\ {4h-1}\\ {4h+1}\\ {4h+2} \end{cases} \tag{4}\]
\[\begin{aligned}
\text{(i)}\ N=4h \ \ \ \\
Q1&=\frac{3}{4}x_{t}[h]+\frac{1}{4}x_{t}[h+1]\\
Q3&=\frac{1}{4}x_{t}[3h]+\frac{3}{4}x_{t}[3h+1]\\
IQR&=Q3-Q1\\
&=(\frac{1}{4}x_{t}[3h]+\frac{3}{4}x_{t}[3h+1])-(\frac{3}{4}x_{t}[h]+\frac{1}{4}x_{t}[h+1]) \\
\text{(ii)}\ N=4h-1\\
Q1&=x_{t}[h] \\
Q3&=x_{t}[3h] \\
IQR&=Q3-Q1\\
&=x_{t}[3h]-x_{t}[h] \\
\text{(iii)}\ N=4h+1 \\
Q1&=\frac{1}{2}x_{t}[h]+\frac{1}{2}x_{t}[h+1] \\
Q3&=\frac{1}{2}x_{t}[3h+1]+\frac{1}{2}x_{t}[3h+2] \\
IQR&=Q3-Q1\\
&=(\frac{1}{2}x_{t}[3h+1]+\frac{1}{2}x_{t}[3h+2])-(\frac{1}{2}x_{t}[h]+\frac{1}{2}x_{t}[h+1]) \\
\text{(iv)}\ N=4h+2 \\
Q1&=\frac{1}{4}x_{t}[h]+\frac{3}{4}x_{t}[h+1] \\
Q3&=\frac{3}{4}x_{t}[3h+2]+\frac{1}{4}x_{t}[3h+3] \\
IQR&=Q3-Q1\\
&=(\frac{3}{4}x_{t}[3h+2]+\frac{1}{4}x_{t}[3h+3])-(\frac{1}{4}x_{t}[h]+\frac{3}{4}x_{t}[h+1]) \\
\end{aligned}\]
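All four cases follow from a single rule: take the position \(p(N+1)\) in the sorted data (with \(p=\frac{1}{4}\) for Q1 and \(p=\frac{3}{4}\) for Q3) and interpolate linearly between the two neighbouring order statistics. The following is a minimal sketch of that rule; the function name `quantile_sorted` and the test data are hypothetical. It is compared against quantile() with type = 6 for one sample size of each form in Equation (4).

```r
# Quantile at probability p using position p * (N + 1) and linear interpolation
quantile_sorted <- function(x, p) {
  xs <- sort(x)
  n <- length(xs)
  pos <- p * (n + 1)              # possibly fractional position in the sorted data
  lo <- floor(pos)
  frac <- pos - lo                # weight on the upper neighbour
  lo_c <- min(max(lo, 1), n)      # clamp indices to 1..N at the edges
  hi_c <- min(max(lo + 1, 1), n)
  (1 - frac) * xs[lo_c] + frac * xs[hi_c]
}

# N = 7, 8, 9, 10 correspond to 4h-1, 4h, 4h+1, 4h+2 with h = 2
for (n in 7:10) {
  x <- (1:n)^2                    # hypothetical data
  ours <- c(quantile_sorted(x, 0.25), quantile_sorted(x, 0.75))
  ref  <- unname(quantile(x, c(0.25, 0.75), type = 6))
  print(isTRUE(all.equal(ours, ref)))
}
```

Writing the rule once in terms of the fractional position avoids having to decide which of the four cases applies to a given \(N\).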
This code calculates percentiles of initial response times for the first task (\(t=1\)) using the sorted data \(x_{1}[h]\), and then with the command quantile() for comparison. Because \(N=4088=4\times 1022\), case (i) applies with \(h=1022\). Table 3.2.1 shows Q1, Q2, and Q3 for the first task (\(t=1\)) calculated by the formulas listed above. The command quantile() offers 9 different algorithms (the default is type = 7). Table 3.2.2 shows percentiles of initial response times for the first task (\(t=1\)) under each of the 9 algorithms. The calculation explained above corresponds to “type = 6”, and the “type = 6” row of Table 3.2.2 matches Table 3.2.1 up to rounding.
# N = 4088 = 4h with h = 1022, so the case (i) formulas are used here
h <- floor(nrow(xh)/4)
Q1 <- 0.75*xh[h,1]+0.25*xh[h+1,1]
Q3 <- 0.25*xh[3*h,1]+0.75*xh[3*h+1,1]
Q2 <- 0.5*xh[2*h,1]+0.5*xh[2*h+1,1]   # N is even: mean of the two middle values
IQRf <- data.frame(Q1=Q1,Q2=Q2,Q3=Q3)
kable(IQRf, caption="Table 3.2.1: Percentiles by formula for the first task (t=1)")
| Q1 | Q2 | Q3 |
|---|---|---|
| 20148 | 29645 | 43792.75 |
IQR1 <- NULL
for(i in 1:9){
  IQR0 <- quantile(xh[,1], type=i)   # min, Q1, Q2, Q3, max under algorithm i
  IQR1 <- rbind(IQR1, IQR0)
} # IQR1 stores the quantiles of task 1 under each of the 9 types. Type = 6 matches the formulas above.
rownames(IQR1) = paste0("IQR type=", 1:9)
# By default quantile() returns the 0th (minimum), 25th, 50th, 75th, and 100th (maximum) percentiles
colnames(IQR1) = c("0th percentile", "25th percentile", "50th percentile",
                   "75th percentile", "100th percentile")
IQR1 <- round(IQR1, digits = 2)
IQR2 <- format(IQR1, digits = 5)
kable(IQR2, caption="Table 3.2.2: 9 types of percentiles for the first task (t=1)", digits=2)
| | 0th percentile | 25th percentile | 50th percentile | 75th percentile | 100th percentile |
|---|---|---|---|---|---|
| IQR type=1 | 2023 | 20146 | 29644 | 43792 | 300000 |
| IQR type=2 | 2023 | 20150 | 29645 | 43793 | 300000 |
| IQR type=3 | 2023 | 20146 | 29644 | 43792 | 300000 |
| IQR type=4 | 2023 | 20146 | 29644 | 43792 | 300000 |
| IQR type=5 | 2023 | 20150 | 29645 | 43793 | 300000 |
| IQR type=6 | 2023 | 20148 | 29645 | 43793 | 300000 |
| IQR type=7 | 2023 | 20152 | 29645 | 43792 | 300000 |
| IQR type=8 | 2023 | 20149 | 29645 | 43793 | 300000 |
| IQR type=9 | 2023 | 20150 | 29645 | 43793 | 300000 |
This example confirms that quantile() with type = 6 reproduces the results of the sorting method; the other types give slightly different values.
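The choice of algorithm also carries over to the IQR. The following is a minimal sketch on simulated data (the vector `x` and the names `q6` and `iqr_from_q6` are assumptions for illustration): the base R function IQR() accepts the same `type` argument as quantile() and, like quantile(), defaults to type = 7.

```r
set.seed(3)
# Simulated times in ms (hypothetical), same sample size as one task
x <- round(rlnorm(4088, meanlog = 10.3, sdlog = 0.6))

# Q1 and Q3 under type 6, and their difference
q6 <- quantile(x, c(0.25, 0.75), type = 6)
iqr_from_q6 <- unname(q6[2] - q6[1])

IQR(x, type = 6)   # equals iqr_from_q6
IQR(x)             # default type = 7; may differ slightly from type 6
```

To keep an IQR consistent with quartiles computed under type = 6, pass `type = 6` to IQR() as well.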
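This code calculates the same statistics for each of the twenty tasks and presents the results in a table. Note that IQR() is called with its default algorithm (type = 7), whereas quantile() is called with type = 6, so the IQR column can differ slightly from the difference between the \(75^{th}\) and \(25^{th}\) percentile columns (e.g., 23640 versus \(43793-20148=23645\) for the first task).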
m2 = NULL
ir2 = NULL
iqr2 = NULL
for(i in 1:20) {
  d1 <- data1[data1$task == i, "time"]   # times for task i
  m0 = median(d1$time)
  m2 = rbind(m2, m0)
  ir0 = quantile(d1$time, type=6)        # min, Q1, Q2, Q3, max (type 6)
  ir2 = rbind(ir2, ir0)
  iqr0 = IQR(d1$time)                    # IQR() uses its default, type = 7
  iqr2 = rbind(iqr2, iqr0)
}
table1 = cbind(m2, iqr2, ir2)            # one row per task
rownames(table1) = paste0("task ", 1:20)
colnames(table1) <- c("Median", "IQR", "0th percentile", "25th percentile",
                      "50th percentile", "75th percentile", "100th percentile")
table3.3 <- data.frame(cbind(c(1:20),table1))
colnames(table3.3) <- c("t th task", "Median", "IQR", "0th percentile", "25th percentile",
                        "50th percentile", "75th percentile", "100th percentile")
table3.4 <- round(table3.3, digits = 0)
table3.5 <- format(table3.4, digits = 5, scientific = FALSE)  # avoid "3e+05" for 300000
kable(table3.5, align = "c", caption = "**Table 3.3: Median and IQR**",
      row.names = FALSE, escape = FALSE, centering = T)
| t th task | Median | IQR | 0th percentile | 25th percentile | 50th percentile | 75th percentile | 100th percentile |
|---|---|---|---|---|---|---|---|
| 1 | 29645 | 23640 | 2023 | 20148 | 29645 | 43793 | 300000 |
| 2 | 20553 | 18589 | 2050 | 12990 | 20553 | 31580 | 300000 |
| 3 | 19210 | 17642 | 1475 | 11748 | 19210 | 29392 | 300000 |
| 4 | 17394 | 16838 | 1590 | 10981 | 17394 | 27827 | 300000 |
| 5 | 16898 | 16081 | 1998 | 10357 | 16898 | 26443 | 300000 |
| 6 | 16024 | 15845 | 1732 | 9772 | 16024 | 25619 | 300000 |
| 7 | 15416 | 14888 | 1787 | 9422 | 15416 | 24326 | 300000 |
| 8 | 15081 | 15780 | 1700 | 9045 | 15081 | 24828 | 300000 |
| 9 | 14317 | 14733 | 1786 | 8564 | 14317 | 23308 | 300000 |
| 10 | 14210 | 14608 | 1669 | 8305 | 14210 | 22915 | 300000 |
| 11 | 13901 | 13758 | 1723 | 8545 | 13901 | 22322 | 300000 |
| 12 | 9884 | 10562 | 1938 | 5845 | 9884 | 16413 | 300000 |
| 13 | 8982 | 9520 | 1818 | 5106 | 8982 | 14630 | 300000 |
| 14 | 8580 | 9705 | 1944 | 4812 | 8580 | 14526 | 300000 |
| 15 | 8158 | 9038 | 1829 | 4459 | 8158 | 13506 | 300000 |
| 16 | 7906 | 8649 | 1801 | 4377 | 7906 | 13033 | 300000 |
| 17 | 7386 | 8256 | 1847 | 4081 | 7386 | 12338 | 300000 |
| 18 | 7250 | 8095 | 1538 | 4111 | 7250 | 12211 | 300000 |
| 19 | 7134 | 7991 | 1290 | 4043 | 7134 | 12036 | 300000 |
| 20 | 6908 | 7638 | 1731 | 3904 | 6908 | 11545 | 300000 |
The largest changes in median initial response time occur between the first and second tasks and between the eleventh and twelfth tasks.
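The task-to-task changes can be read off with diff(). The following is a minimal sketch using the medians copied from Table 3.3; the names `med` and `change` are hypothetical.

```r
# Medians by task, copied from Table 3.3 (milliseconds)
med <- c(29645, 20553, 19210, 17394, 16898, 16024, 15416, 15081, 14317, 14210,
         13901, 9884, 8982, 8580, 8158, 7906, 7386, 7250, 7134, 6908)

# change[t] = median of task t+1 minus median of task t
change <- diff(med)
names(change) <- paste0(1:19, "->", 2:20)

# The two largest decreases: 1->2 (-9092) and 11->12 (-4017)
sort(change)[1:2]
```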
This code visualizes the results taken from Table 3.3 using a line plot, where the black dots represent the median (Q2) and the red dots represent the \(25^{th}\) and \(75^{th}\) percentiles (Q1 and Q3).
xmax <- 20
xmin <- 1
ymax <- 60000
ymin <- 0
# Median (Q2): filled black points joined by lines
plot(table3.3$'t th task', table3.3$Median, bty = "l", pch = 16, type = "o",
     xlim = c(xmin, xmax), ylim = c(ymin, ymax),
     xlab = "task number", ylab = "The initial response time (milliseconds)",
     main = "Graph 3.4: The median and interquartile range of the initial response time")
# 25th and 75th percentiles (Q1 and Q3): open red points added to the same axes
lines(table3.3$'t th task', table3.3$'25th percentile', pch = 1, col = "red", type = "o")
lines(table3.3$'t th task', table3.3$'75th percentile', pch = 1, col = "red", type = "o")
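The median (\(50^{th}\) percentile) decreases monotonically by task. The \(25^{th}\) and \(75^{th}\) percentiles and the IQR also trend downward by task, with a few small exceptions (e.g., the \(75^{th}\) percentile and the IQR rise slightly at task 8).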
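Whether a series decreases at every step can be checked directly. The following is a minimal sketch using values copied from Table 3.3; the helper name `increases` and the vector names are hypothetical. It lists the tasks at which each statistic is larger than at the preceding task.

```r
# Values by task, copied from Table 3.3 (milliseconds)
q1  <- c(20148, 12990, 11748, 10981, 10357, 9772, 9422, 9045, 8564, 8305,
         8545, 5845, 5106, 4812, 4459, 4377, 4081, 4111, 4043, 3904)
q3  <- c(43793, 31580, 29392, 27827, 26443, 25619, 24326, 24828, 23308, 22915,
         22322, 16413, 14630, 14526, 13506, 13033, 12338, 12211, 12036, 11545)
iqr <- c(23640, 18589, 17642, 16838, 16081, 15845, 14888, 15780, 14733, 14608,
         13758, 10562, 9520, 9705, 9038, 8649, 8256, 8095, 7991, 7638)

# Tasks whose value exceeds that of the previous task
increases <- function(v) which(diff(v) > 0) + 1

increases(q1)    # 11 18
increases(q3)    # 8
increases(iqr)   # 8 14
```

An empty result would indicate a strictly decreasing series.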
Jakubczyk, M., Craig, B. M., Barra, M., Groothuis-Oudshoorn, C. G. M., Hartman, J. D., Huynh, E., Ramos-Goñi, J. M., Stolk, E. A., & Rand, K. (2017). Choice defines value: A predictive modeling competition in Health Preference Research. Value in Health, 21(2), 229–238. https://doi.org/10.1016/j.jval.2017.09.016
Okubo, S., & Craig, B. (2023, February 18). Correlations between initial response times. R4HPR. https://r4hpr.org/visor/?e=correlations-between-initial-response-times%e3%80%80
Rizzo, M. L. (2019). Statistical computing with R, second edition. Chapman and Hall.
Okubo, S., & Craig, B. (2023, February 18). Illustrating time to initial response in choice tasks. R4HPR. https://r4hpr.org/visor/?e=analysis-of-initial-response-time-across-20-pair-comparisons