1 Introduction

1.1 Background

This entry examines the distributions of time to the initial response in milliseconds across a series of choice tasks. The R code shows how to illustrate these times as histograms, tables, and line plots as well as how to calculate their medians and interquartile ranges (IQR).

For this worked example, data on initial response times were extracted from the 2016 predictive modeling competition in health preference research (HPR) (Jakubczyk, Craig, et al. 2017). In 2016, 4088 US participants each responded to 20 paired comparisons, choosing between two alternative health outcomes. After the initial response, respondents may change their answers before proceeding to the next task (time to last response) or may spend additional time on the page before proceeding (i.e., page time); nevertheless, the time to initial response is a common behavioral measure of task difficulty in HPR. For this analysis, initial response times were truncated at five minutes (300,000 milliseconds).

A companion entry, Correlations between initial response times (Okubo and Craig, 2023), continues this topic by examining the variance, covariance, and correlations of time to the initial response.

1.2 Load libraries and source files

Note: Change the working directory to the folder that contains the data file on your computer. To do so, replace the path inside setwd() with the location of the data file.

setwd("C:\\Users\\aaa\\OneDrive\\USF\\Dr. Craig")
library(knitr) #this is for the function "kable," which makes well-organized tables.
library(tidyverse) #this is for the function "read_csv."
library(tinytex)
library(gt)
data1 <- read_csv("resp1wave1_220723.csv")
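
Readers without access to the data file can still follow the later examples with a simulated stand-in. The sketch below is only a placeholder, not the competition data: it mimics the three columns this entry uses (survey_id, task, time) and the five-minute truncation. The object name sim1, the number of respondents, and the log-normal parameters are all assumptions made for illustration.

```r
# Simulated stand-in for data1 (hypothetical values, NOT the competition data).
set.seed(1)
n_resp <- 200   # assumed number of respondents (the real data has 4088)
n_task <- 20    # tasks per respondent, as in the worked example

sim1 <- data.frame(
  survey_id = rep(seq_len(n_resp), times = n_task),
  task      = rep(seq_len(n_task), each = n_resp)
)
# Log-normal times in milliseconds; the parameters are arbitrary assumptions.
sim1$time <- round(rlnorm(nrow(sim1), meanlog = 9.5, sdlog = 0.8))
# Truncate at five minutes (300,000 ms), as described in Section 1.1.
sim1$time <- pmin(sim1$time, 300000)

str(sim1)
```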

2 Visualize distributions of initial response times

2.1 Background

  • Definition: A histogram is a diagram of adjacent rectangles (bins) whose areas are proportional to the frequency of a variable; with equal-width bins, the width of each bin equals the range divided by the number of bins. To create a histogram of initial response times, you must choose the number of bins, \(K\), and the range of the histogram, \([r_0,r_K)\) (with \(N\), \(T\), and \(x_{it}\) as defined below). For the number of bins, “Sturges’ rule” offers a rough starting point (Rizzo, 2019, p. 339), although this entry does not follow it: \[\small{ \text{Sturges' rule:}\ K= \log_2 {(N \times T)}+1 }\] The range must cover every observation: \[\small{ r_0, r_K\ \text{s.t.}\ \min(x_{it}), \max(x_{it}) \in[r_0,r_K) }\]

Let \(x_{it}\) be the initial response time for respondent \(i\) and task \(t\) (\(1 \le i \le N\); \(1 \le t \le T\)). Because every observation falls in exactly one bin, the bin counts of a histogram must satisfy Equation (1).

\[N\times T = \sum_{k=1}^{K}m_k \tag{1}\] where
\[\begin{aligned} N &= \text{the total number of respondents} \\ T &= \text{the total number of tasks per respondent} \\ K &= \text{the total number of bins}\ (1 \le K \le N \times T) \\ m_k &= \text{the number of}\ x_{it}\ \text{within the}\ k^{th}\ \text{bin}\ (1\le k \le K)\\ &= \text{the number of}\ x_{it}\ \text{with}\ r_{k-1}\le x_{it} < r_k\\ &= \text{the frequency of the}\ k^{th}\ \text{bin}\\ r_j &= \text{the upper boundary of the}\ j^{th}\ \text{bin}\ (1\le j \le K)\\ &= \text{the lower boundary of the}\ (j+1)^{th}\ \text{bin}\ (0\le j \le K-1)\\ r_j-r_{j-1} &= \text{the width of each bin}\ (1\le j \le K)\\ &=\frac{\max_j(r_j)-\min_j(r_j)}{K}=\frac{r_{K}-r_0}{K} \end{aligned}\]
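
As a rough numerical check of the definitions above, the following sketch evaluates Sturges' rule for the worked example and verifies Equation (1) and the equal-width property on simulated times. The simulated vector x, the choice K = 10, and the name T_tasks (used instead of T, which R treats as shorthand for TRUE) are assumptions for illustration only.

```r
# Sturges' rule for the worked example (N = 4088 respondents, T = 20 tasks).
N <- 4088
T_tasks <- 20
K_sturges <- log2(N * T_tasks) + 1
K_sturges            # roughly 17.3
ceiling(K_sturges)   # 18 bins after rounding up

# Check Equation (1) and the equal-width property on simulated times.
set.seed(1)
x <- pmin(round(rlnorm(1000, meanlog = 9.5, sdlog = 0.8)), 300000)  # hypothetical data
K <- 10
r <- seq(0, 300000, length.out = K + 1)         # boundaries r_0, ..., r_K
h <- hist(x, breaks = r, right = FALSE, plot = FALSE)
sum(h$counts) == length(x)                      # Equation (1): counts sum to the sample size
all.equal(diff(h$breaks), rep(300000 / K, K))   # every bin has width (r_K - r_0) / K
```

With right = FALSE, hist() uses bins of the form \([r_{k-1}, r_k)\), except that the last bin also includes its upper boundary, so times truncated at exactly 300,000 are still counted.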

2.2 Example I

Now, let \(K=50\) (breaks = 50). This histogram represents the distribution of the initial response times \(x_{it}\) for all tasks (\(N=4088\); \(T=20\)).

hist(data1$time,
     breaks = 50,
     main = "Histogram 2.2: Initial response time for all tasks",
     xlab = "Initial response time (milliseconds)",
     col = "lightgreen")
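
When a single number is passed to breaks, hist() treats it as a suggestion and chooses "pretty" break points, so the plotted number of bins may differ from 50. If exactly \(K\) bins are wanted, one option is to pass the \(K+1\) boundaries explicitly. The sketch below illustrates this on simulated times; the vector x and its parameters are hypothetical.

```r
set.seed(1)
x <- pmin(round(rlnorm(5000, meanlog = 9.5, sdlog = 0.8)), 300000)  # hypothetical times (ms)

K <- 50
h_suggested <- hist(x, breaks = K, plot = FALSE)   # K is only a suggestion here
h_exact <- hist(x, breaks = seq(0, 300000, length.out = K + 1),
                right = FALSE, plot = FALSE)        # bins of the form [r_{k-1}, r_k)

length(h_suggested$counts)   # number of bins R actually chose
length(h_exact$counts)       # exactly K bins
```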


2.3 Example II

Now, replace \(x_{it}\) with \(\log(x_{it})\). This histogram represents the distribution of log initial response times for all tasks (\(N=4088\); \(T=20\)).

hist(log(data1$time),
     breaks = 50,
     main = "Histogram 2.3: Log initial response time for all tasks",
     xlab = "Log Initial response time (milliseconds)",
     col = "lightblue")    

The histogram better illustrates that the distribution is bimodal (i.e., two local maxima).

2.4 Example III

This code produces overlapping histograms to compare the distributions of initial response times between the first and last tasks (\(N=4088\); \(t=1,20\)).

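# Reshape the long data into a wide data frame with one column per task.
# This assumes the rows of every task are stored in the same respondent order.
d2 <- matrix() # placeholder first column, dropped below
for(i in 1:20) {
  d1 <- data1[data1$task == i, "time"] # initial response times for task i
  d2 <- cbind(d2, d1)
}
d3 <- as.data.frame(d2[,-1]) # drop the placeholder column
colnames(d3) <- paste0("Task", 1:max(data1$task))
rownames(d3) <- paste0("ID ", 1:max(data1$survey_id))
xh <- sort(d3$Task1) %>% data.frame() # task 1 times sorted in increasing order
colnames(xh) <- c("task1")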

hist(d3[,20], breaks = 50, col="#FF00007F",
     main="Histogram 2.4: Initial response time for the first and last tasks",
     xlab="Initial response time (milliseconds)")
hist(d3[,1], breaks = 50, col="#0000FF7F", add=TRUE)

legend( "topright", 
        legend=c("Task 1","Task 20"),
        pch=15,
        col=c("#0000FF7F", "#FF00007F"),
        bty="n",
        ncol=2,
        pt.cex=3
        )

The overlay histogram shows distributional differences between the first and last tasks.
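
The reshaping and overlay can also be sketched with base R on simulated data. This is an alternative under stated assumptions, not the code used for Histogram 2.4: the data frame sim, its log-normal parameters, and the object names wide and brk are hypothetical. Passing one common vector of breaks to both histograms guarantees that their bins line up.

```r
set.seed(1)
# Hypothetical long-format data with the same columns as data1.
sim <- data.frame(survey_id = rep(1:300, times = 20),
                  task      = rep(1:20, each = 300))
sim$time <- pmin(round(rlnorm(nrow(sim), meanlog = 10 - 0.05 * sim$task, sdlog = 0.8)),
                 300000)

# One column per task; sorting by task and survey_id keeps respondents aligned.
sim <- sim[order(sim$task, sim$survey_id), ]
wide <- as.data.frame(split(sim$time, sim$task))
colnames(wide) <- paste0("Task", 1:20)

# Common breaks so both histograms use identical bins.
brk <- seq(0, 300000, length.out = 51)
h1  <- hist(wide$Task1,  breaks = brk, plot = FALSE)
h20 <- hist(wide$Task20, breaks = brk, plot = FALSE)

plot(h20, col = "#FF00007F", main = "Tasks 1 and 20 (simulated)",
     xlab = "Initial response time (milliseconds)")
plot(h1, col = "#0000FF7F", add = TRUE)
```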

3 Visualize the percentiles of initial response times

3.1 Background

  • Definition: The \(k^{th}\) percentile is a number such that \(k\) percent of observations are equal to or smaller than that number. For example, if the \(25^{th}\) percentile is 1256, then 25 percent of observations are equal to or smaller than 1256 and 75 percent are larger.

  • Definition: The median is the \(50^{th}\) percentile (Q2).
    Assume initial response times for task \(t\), \(x_{t}\) were sorted from smallest to largest, \(x_{t}[1]\le x_{t}[2]\le ...\le x_{t}[N]\).
    When \(N\) is even, that is, \(\exists\ a \in\mathbb{N}\ s.t.\ N = 2a\), \(median\ (Q2)= \frac{1}{2}(x_{t}[\frac{N}{2}]+x_{t}[\frac{N}{2}+1])\).
    When \(N\) is odd, that is, \(\exists\ b \in\mathbb{N}\ s.t.\ N = 2b+1\), \(median\ (Q2)= x_{t}[\frac{N+1}{2}]\).

  • Definition: The interquartile range (IQR) is the distance between the \(25^{th}\) percentile (Q1) and the \(75^{th}\) percentile (Q3). That is, \(IQR = Q3-Q1\).

Q1 and Q3 are given by Equations (2) and (3), respectively. However, depending on the value of \(N\), \(\frac{N+1}{4}\) and \(\frac{3(N+1)}{4}\) may be non-integers. In those cases, Q1 is obtained by linear interpolation between \(x_{t}[\lfloor\frac{N+1}{4}\rfloor]\) and \(x_{t}[\lceil\frac{N+1}{4}\rceil]\), and Q3 by linear interpolation between \(x_{t}[\lfloor\frac{3(N+1)}{4}\rfloor]\) and \(x_{t}[\lceil\frac{3(N+1)}{4}\rceil]\) (shown in cases (i)-(iv)).
\[ Q1=x_{t}[\frac{N+1}{4}] \tag{2} \]\[ Q3=x_{t}[\frac{3(N+1)}{4}]\tag{3} \]

The linear interpolation falls into four cases, (i)-(iv), according to the remainder of \(N\) divided by 4, as in Equation (4).

\[ \ \forall N \in\ \mathbb{N}, \exists\ h \in\ \mathbb{N}\ s.t.\ N = \begin{cases} {4h}\\ {4h-1}\\ {4h+1}\\ {4h+2} \end{cases} \tag{4}\]

\[\begin{aligned} \text{(i)}\ N=4h \ \ \ \\
Q1&=\frac{3}{4}x_{t}[h]+\frac{1}{4}x_{t}[h+1]\\
Q3&=\frac{1}{4}x_{t}[3h]+\frac{3}{4}x_{t}[3h+1]\\
IQR&=Q3-Q1\\
&=(\frac{1}{4}x_{t}[3h]+\frac{3}{4}x_{t}[3h+1])-(\frac{3}{4}x_{t}[h]+\frac{1}{4}x_{t}[h+1]) \\
\text{(ii)}\ N=4h-1\\
Q1&=x_{t}[h] \\
Q3&=x_{t}[3h] \\
IQR&=Q3-Q1\\
&=x_{t}[3h]-x_{t}[h] \\
\text{(iii)}\ N=4h+1 \\
Q1&=\frac{1}{2}x_{t}[h]+\frac{1}{2}x_{t}[h+1] \\
Q3&=\frac{1}{2}x_{t}[3h+1]+\frac{1}{2}x_{t}[3h+2] \\
IQR&=Q3-Q1\\
&=(\frac{1}{2}x_{t}[3h+1]+\frac{1}{2}x_{t}[3h+2])-(\frac{1}{2}x_{t}[h]+\frac{1}{2}x_{t}[h+1]) \\
\text{(iv)}\ N=4h+2 \\
Q1&=\frac{1}{4}x_{t}[h]+\frac{3}{4}x_{t}[h+1] \\
Q3&=\frac{3}{4}x_{t}[3h+2]+\frac{1}{4}x_{t}[3h+3] \\
IQR&=Q3-Q1\\
&=(\frac{3}{4}x_{t}[3h+2]+\frac{1}{4}x_{t}[3h+3])-(\frac{1}{4}x_{t}[h]+\frac{3}{4}x_{t}[h+1]) \\
\end{aligned}\]
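
Equations (2) and (3) with linear interpolation can be written as one small function and compared with quantile(type = 6) for each remainder of \(N\) modulo 4. This is a minimal sketch for checking the formulas; the function name q_sorted and the random test vectors are hypothetical.

```r
# Quantile at proportion p using position p * (N + 1) and linear interpolation,
# as in Equations (2) and (3). The function name q_sorted is hypothetical.
q_sorted <- function(x, p) {
  x <- sort(x)
  N <- length(x)
  pos <- p * (N + 1)                 # e.g. (N + 1) / 4 for Q1
  lo <- min(max(floor(pos), 1), N)   # clamp to valid indices 1..N
  hi <- min(max(ceiling(pos), 1), N)
  w <- pos - floor(pos)              # weight on the upper neighbour
  (1 - w) * x[lo] + w * x[hi]
}

set.seed(1)
for (N in c(8, 7, 9, 10)) {          # N = 4h, 4h - 1, 4h + 1, 4h + 2 with h = 2
  x <- rnorm(N)
  ours <- c(q_sorted(x, 0.25), q_sorted(x, 0.75))
  ref  <- unname(quantile(x, c(0.25, 0.75), type = 6))
  print(all.equal(ours, ref))
}
```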

3.2 Example I

This code calculates percentiles of initial response times for the first task (\(t=1\)) from the sorted data \(x_{1}[\cdot]\) and, for comparison, with the command quantile(). Table 3.2.1 shows Q1, Q2, and Q3 for the first task (\(t=1\)), calculated with the formulas listed above; because \(N=4088\) is a multiple of 4, case (i) applies. The command quantile() offers 9 different algorithms (the default is type = 7), and Table 3.2.2 shows the percentiles of initial response times for the first task (\(t=1\)) under each of them. The calculation explained above corresponds to “type = 6”, so the “type = 6” row of Table 3.2.2 agrees with Table 3.2.1.
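
With \(N = 4088 = 4 \times 1022\), the positions in Equations (2) and (3) and in the definition of the median work out as follows (a worked check of the arithmetic, with \(h = 1022\)):

\[\begin{aligned}
\frac{N+1}{4} &= \frac{4089}{4} = 1022.25 &&\Rightarrow\ Q1 = \tfrac{3}{4}x_{1}[1022]+\tfrac{1}{4}x_{1}[1023] \\
\frac{3(N+1)}{4} &= \frac{12267}{4} = 3066.75 &&\Rightarrow\ Q3 = \tfrac{1}{4}x_{1}[3066]+\tfrac{3}{4}x_{1}[3067] \\
\frac{N}{2} &= 2044 &&\Rightarrow\ Q2 = \tfrac{1}{2}\left(x_{1}[2044]+x_{1}[2045]\right)
\end{aligned}\]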

h <- floor(nrow(xh)/4) # N = 4088 = 4h, so case (i) applies
Q1 <- 0.75*xh[h,1]+0.25*xh[h+1,1]
Q3 <- 0.25*xh[3*h,1]+0.75*xh[3*h+1,1]
Q2 <- 0.5*xh[2*h,1]+0.5*xh[2*h+1,1] # N is even, so Q2 averages the two middle values
IQRf <- data.frame(Q1=Q1,Q2=Q2,Q3=Q3)
kable(IQRf, caption="Table 3.2.1: Percentiles by formula for the first task (t=1)")

Table 3.2.1: Percentiles by formula for the first task (t=1)

| Q1 | Q2 | Q3 |
|---|---|---|
| 20148 | 29645 | 43792.75 |

IQR1 <- NULL
for(i in 1:9){
  IQR0 <- quantile(xh[,1], type=i) # returns the 0%, 25%, 50%, 75%, and 100% quantiles by default
  IQR1 <- rbind(IQR1, IQR0)
} # IQR1 stores the quantiles of task 1 under each of the 9 types. Type = 6 is what the formulas above describe.
rownames(IQR1) <- paste0("type = ", 1:9)
colnames(IQR1) <- c("0th percentile", "25th percentile", "50th percentile",
                    "75th percentile", "100th percentile")
IQR1 <- round(IQR1, digits = 2)
IQR2 <- format(IQR1, digits = 5)
kable(IQR2, caption="Table 3.2.2: 9 types of percentiles for the first task (t=1)", digits=2)

Table 3.2.2: 9 types of percentiles for the first task (t=1)

| | 0th percentile | 25th percentile | 50th percentile | 75th percentile | 100th percentile |
|---|---|---|---|---|---|
| type = 1 | 2023 | 20146 | 29644 | 43792 | 300000 |
| type = 2 | 2023 | 20150 | 29645 | 43793 | 300000 |
| type = 3 | 2023 | 20146 | 29644 | 43792 | 300000 |
| type = 4 | 2023 | 20146 | 29644 | 43792 | 300000 |
| type = 5 | 2023 | 20150 | 29645 | 43793 | 300000 |
| type = 6 | 2023 | 20148 | 29645 | 43793 | 300000 |
| type = 7 | 2023 | 20152 | 29645 | 43792 | 300000 |
| type = 8 | 2023 | 20149 | 29645 | 43793 | 300000 |
| type = 9 | 2023 | 20150 | 29645 | 43793 | 300000 |

This example shows that quantile() with type = 6 reproduces the results of the sorting method; for these data, the other types differ from it by at most a few milliseconds.
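
The small differences between types come from where each algorithm places the \(p^{th}\) quantile among the sorted values. As a minimal illustration on a hypothetical four-value vector, type 6 uses position \(p(N+1)\), while the default type 7 uses position \(1+(N-1)p\); both then interpolate linearly.

```r
x <- c(10, 20, 30, 40)   # small illustrative vector (hypothetical values)
N <- length(x)
p <- 0.25

pos6 <- p * (N + 1)        # type 6 position: 1.25
pos7 <- 1 + (N - 1) * p    # type 7 position: 1.75

quantile(x, p, type = 6)   # 12.5 = x[1] + 0.25 * (x[2] - x[1])
quantile(x, p, type = 7)   # 17.5 = x[1] + 0.75 * (x[2] - x[1])
```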

3.3 Example II

This code calculates the same results for each of the twenty tasks and visualizes the results using a table.

m2 <- NULL
ir2 <- NULL
iqr2 <- NULL
for(i in 1:20) {
  d1 <- data1[data1$task == i, "time"] # initial response times for task i
  m0 <- median(d1$time)
  m2 <- rbind(m2, m0)
  ir0 <- quantile(d1$time, type=6)
  ir2 <- rbind(ir2, ir0)
  iqr0 <- IQR(d1$time) # IQR() defaults to type = 7, so it can differ slightly from Q3 - Q1 under type = 6
  iqr2 <- rbind(iqr2, iqr0)
}
table1 <- cbind(m2, iqr2, ir2) # combine once, after the loop
rownames(table1) <- paste0("task ", 1:20)
colnames(table1) <- c("Median", "IQR", "0th percentile", "25th percentile",
                      "50th percentile", "75th percentile", "100th percentile")
table3.3 <- data.frame(cbind(c(1:20),table1))
colnames(table3.3) <- c("t th task", "Median", "IQR", "0th percentile", "25th percentile",
                        "50th percentile", "75th percentile", "100th percentile")

table3.4 <- round(table3.3, digits = 0)
table3.5 <- format(table3.4, digits = 5, scientific = FALSE) # print 300000 rather than 3e+05

kable(table3.5, align = "c", caption = "**Table 3.3: Median and IQR**",
      row.names = FALSE, escape = FALSE, centering = TRUE)

Table 3.3: Median and IQR (milliseconds)

| t th task | Median | IQR | 0th percentile | 25th percentile | 50th percentile | 75th percentile | 100th percentile |
|---|---|---|---|---|---|---|---|
| 1 | 29645 | 23640 | 2023 | 20148 | 29645 | 43793 | 300000 |
| 2 | 20553 | 18589 | 2050 | 12990 | 20553 | 31580 | 300000 |
| 3 | 19210 | 17642 | 1475 | 11748 | 19210 | 29392 | 300000 |
| 4 | 17394 | 16838 | 1590 | 10981 | 17394 | 27827 | 300000 |
| 5 | 16898 | 16081 | 1998 | 10357 | 16898 | 26443 | 300000 |
| 6 | 16024 | 15845 | 1732 | 9772 | 16024 | 25619 | 300000 |
| 7 | 15416 | 14888 | 1787 | 9422 | 15416 | 24326 | 300000 |
| 8 | 15081 | 15780 | 1700 | 9045 | 15081 | 24828 | 300000 |
| 9 | 14317 | 14733 | 1786 | 8564 | 14317 | 23308 | 300000 |
| 10 | 14210 | 14608 | 1669 | 8305 | 14210 | 22915 | 300000 |
| 11 | 13901 | 13758 | 1723 | 8545 | 13901 | 22322 | 300000 |
| 12 | 9884 | 10562 | 1938 | 5845 | 9884 | 16413 | 300000 |
| 13 | 8982 | 9520 | 1818 | 5106 | 8982 | 14630 | 300000 |
| 14 | 8580 | 9705 | 1944 | 4812 | 8580 | 14526 | 300000 |
| 15 | 8158 | 9038 | 1829 | 4459 | 8158 | 13506 | 300000 |
| 16 | 7906 | 8649 | 1801 | 4377 | 7906 | 13033 | 300000 |
| 17 | 7386 | 8256 | 1847 | 4081 | 7386 | 12338 | 300000 |
| 18 | 7250 | 8095 | 1538 | 4111 | 7250 | 12211 | 300000 |
| 19 | 7134 | 7991 | 1290 | 4043 | 7134 | 12036 | 300000 |
| 20 | 6908 | 7638 | 1731 | 3904 | 6908 | 11545 | 300000 |

The largest changes in initial response times occur between the first and second tasks (the median falls from 29,645 to 20,553 milliseconds) and between the eleventh and twelfth tasks (from 13,901 to 9,884 milliseconds).
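
The task-to-task changes can be checked directly from the medians in Table 3.3. The sketch below copies those medians into a vector (the object names med and change are hypothetical) and uses diff() to find the largest drops.

```r
# Medians by task, copied from Table 3.3 (milliseconds).
med <- c(29645, 20553, 19210, 17394, 16898, 16024, 15416, 15081, 14317, 14210,
         13901, 9884, 8982, 8580, 8158, 7906, 7386, 7250, 7134, 6908)

change <- diff(med)                       # median of task t + 1 minus median of task t
names(change) <- paste0(1:19, "->", 2:20)

sort(change)[1:2]   # two largest drops: 1->2 (-9092) and 11->12 (-4017)
all(change < 0)     # TRUE: the median falls at every step
```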

3.4 Example III

This code visualizes the results taken from Table 3.3 using a line plot, where the black dots represent the median (Q2) and the red dots represent the \(25^{th}\) and \(75^{th}\) percentiles (Q1 and Q3).

xmax <- 20
xmin <- 1
ymax <- 60000
ymin <- 0
# Median (black, filled points)
plot(table3.3$'t th task', table3.3$Median, bty = "l", pch = 16, type = "o",
     xlim = c(xmin, xmax), ylim = c(ymin, ymax),
     xlab = "task number", ylab = "The initial response time (milliseconds)",
     main = "Graph 3.4: The median and interquartile range of the initial response time")
# 25th and 75th percentiles (red, open points), added to the same axes
lines(table3.3$'t th task', table3.3$'25th percentile', pch = 1,
      col = "red", type = "o")
lines(table3.3$'t th task', table3.3$'75th percentile', pch = 1,
      col = "red", type = "o")

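The median (\(50^{th}\) percentile) decreases monotonically by task. The \(25^{th}\) and \(75^{th}\) percentiles and the IQR also decrease by task in general, although not strictly: in Table 3.3, the \(25^{th}\) percentile rises slightly at tasks 11 and 18, the \(75^{th}\) percentile at task 8, and the IQR at tasks 8 and 14.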
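
Whether each series falls at every step can be checked from Table 3.3. The sketch below copies the \(25^{th}\) percentile, \(75^{th}\) percentile, and IQR columns into vectors (the object names are hypothetical) and lists the tasks at which a series is higher than at the preceding task.

```r
# 25th percentile, 75th percentile, and IQR by task, copied from Table 3.3.
q1  <- c(20148, 12990, 11748, 10981, 10357, 9772, 9422, 9045, 8564, 8305,
         8545, 5845, 5106, 4812, 4459, 4377, 4081, 4111, 4043, 3904)
q3  <- c(43793, 31580, 29392, 27827, 26443, 25619, 24326, 24828, 23308, 22915,
         22322, 16413, 14630, 14526, 13506, 13033, 12338, 12211, 12036, 11545)
iqr <- c(23640, 18589, 17642, 16838, 16081, 15845, 14888, 15780, 14733, 14608,
         13758, 10562, 9520, 9705, 9038, 8649, 8256, 8095, 7991, 7638)

# Tasks t at which each series is higher than at task t - 1.
rises <- lapply(list(Q1 = q1, Q3 = q3, IQR = iqr),
                function(v) which(diff(v) > 0) + 1)
rises   # Q1: tasks 11 and 18; Q3: task 8; IQR: tasks 8 and 14
```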

4 References

Jakubczyk, M., Craig, B. M., Barra, M., Groothuis-Oudshoorn, C. G. M., Hartman, J. D., Huynh, E., Ramos-Goñi, J. M., Stolk, E. A., & Rand, K. (2017). Choice defines value: A predictive modeling competition in Health Preference Research. Value in Health, 21(2), 229–238. https://doi.org/10.1016/j.jval.2017.09.016

Okubo, S., & Craig, B. (2023, February 18). Correlations between initial response times. R4HPR. https://r4hpr.org/visor/?e=correlations-between-initial-response-times%e3%80%80

Rizzo, M. L. (2019). Statistical computing with R, second edition. Chapman and Hall.

5 How to cite this entry

Okubo, S and Craig, B. (2023, February 18). Illustrating time to initial response in choice tasks. R4HPR. https://r4hpr.org/visor/?e=analysis-of-initial-response-time-across-20-pair-comparisons