1 Introduction

1.1 Background

This entry examines the distributions of time to the initial response in milliseconds across a series of choice tasks. The R code shows how to illustrate these times as histograms, tables, and line plots as well as how to calculate their medians and interquartile ranges (IQR).

For this worked example, data on initial response times were extracted from the 2016 predictive modeling competition in Health Preference Research (HPR) (Jakubczyk, Craig, et al. 2017). In 2016, 4088 US participants responded to 20 paired comparisons, each a choice between two alternative health outcomes. After the initial response, respondents may change their answers before proceeding to the next task (time to last response) or may spend additional time on the page before proceeding (i.e., page time); nevertheless, the time to initial response is a common behavioral measure of task difficulty in HPR. For this analysis, initial response times were truncated at five minutes (300,000 milliseconds).
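The five-minute truncation can be sketched as follows. This is only an illustration under the assumption that longer times were replaced by the cap (the source does not show this step); `raw_time` and `cap_ms` are hypothetical names and the values are made up.

```r
# Hypothetical illustration: cap response times at five minutes.
cap_ms <- 5 * 60 * 1000                      # five minutes in milliseconds
raw_time <- c(1850, 29645, 412000, 300000)   # made-up times, not the survey data
time <- pmin(raw_time, cap_ms)               # any value above the cap becomes the cap
time
```

A cap of this kind would be consistent with the maximum of 300000 reported for every task later in this entry.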

This topic continues in the entry Correlations between initial response times (Okubo and Craig, 2023), which examines the variance, covariance, and correlations of the time to initial response.

1.2 Load libraries and source files

Note: Change the working directory to the location of the source files on your computer. To do so, replace the path inside setwd() with the location of the data file on your computer.

setwd("C:\\Users\\aaa\\OneDrive\\USF\\Dr. Craig")
library(knitr) #this is for the function "kable," which makes well-organized tables.
library(tidyverse) #this is for the function "read_csv."
library(tinytex)
library(gt)
data1 <- read_csv("resp1wave1_220723.csv")

2 Visualize distributions of initial response times

2.1 Background

Definition: A histogram is a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the range divided by the number of bins. To create a histogram of initial response times, you must decide on the number of bins, \(K\), and the range of the histogram, \([r_0,r_K)\). For the number of bins, “Sturges’ rule” offers a rough starting point (Rizzo, 2019, p.339), where \(N \times T\) is the total number of observations and the result is rounded up to a whole number in practice. (This entry does not follow this rule.) \[\small{ \text{Sturges' rule:}\ K= \log_2 {(N \times T)}+1 }\] You must also choose \(r_0\) and \(r_K\) so that every observation \(x_{it}\) falls within the range: \[\small{ r_0, r_K\ \text{s.t.}\ \min(x_{it}), \max(x_{it}) \in[r_0,r_K) }\]
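As a rough check of Sturges' rule for this data set, the sketch below plugs in the sample size stated in the Introduction and rounds up. The variable names are illustrative; `nclass.Sturges()` is the base R helper (package grDevices) that applies the same rule to a vector.

```r
# Sturges' rule for N = 4088 respondents and T = 20 tasks.
n_respondents <- 4088
n_tasks <- 20
n_obs <- n_respondents * n_tasks          # 81760 observations in total

k_sturges <- ceiling(log2(n_obs) + 1)     # rounded up to a whole number of bins

# The base R helper only uses the length of its argument,
# so a placeholder vector of the same length is enough here.
k_builtin <- nclass.Sturges(numeric(n_obs))

c(by_hand = k_sturges, builtin = k_builtin)
```

Both give 18 bins, well below the 50 used in the examples that follow.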

Let \(x_{it}\) be the initial response time for respondent \(i\) and task \(t\) (\(1 \le i \le N\); \(1 \le t \le T\)). A histogram needs to satisfy Equation (1).

\[N\times T = \sum_{k=1}^{K}m_k \tag{1}\] where
\[\begin{aligned} N &= \text{the total number of respondents } \\ T &= \text{the total number of tasks per respondent } \\ K &= \text{the total number of bins } (1 \le K \le N \times T) \\ m_k &= \text{the number of}\ x_{it}\ \text{within the}\ k^{th}\ \text{bin}\ (1\le k \le K)\\ &= \text{the number of}\ x_{it}\ \text{such that}\ r_{k-1}\le x_{it} < r_k\\ &= \text{the frequency of the}\ k^{th}\ \text{bin}\\ r_j &= \text{the upper boundary of the}\ j^{th}\ \text{bin}\ (1\le j \le K)\\ &= \text{the lower boundary of the}\ (j+1)^{th}\ \text{bin}\ (0\le j \le K-1)\\ r_j-r_{j-1} &= \text{the width of each bin}\ (1\le j \le K)\\ &=\frac{\max_j(r_j)-\min_j(r_j)}{K}=\frac{r_{K}-r_0}{K} \end{aligned}\]
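The definitions above can be checked numerically. The following is a minimal sketch on simulated data (not the competition data): it builds \(K\) equal-width bins over \([r_0,r_K)\), counts \(m_k\) with left-closed intervals as in the definition, and confirms Equation (1). The simulated distribution and all variable names are assumptions made for illustration.

```r
# Simulated response times (log-normal, capped at 300000 ms); illustration only.
set.seed(1)
x <- pmin(rlnorm(1000, meanlog = log(15000), sdlog = 0.8), 300000)

K  <- 50
r0 <- 0
rK <- 300050                              # strictly above max(x), so x lies in [r0, rK)
r  <- seq(r0, rK, length.out = K + 1)     # r[j + 1] holds the boundary r_j, j = 0..K

# findInterval() returns k such that r_{k-1} <= x < r_k (left-closed bins).
bin_index <- findInterval(x, r)
m <- tabulate(bin_index, nbins = K)       # m[k] = frequency of the k-th bin

sum(m) == length(x)                       # Equation (1): the counts add up to the sample size
bin_width <- (rK - r0) / K                # common width of every bin
bin_width
```

Note that `hist()` uses right-closed intervals by default; passing `right = FALSE` makes it follow the left-closed convention used in the definition.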

2.2 Example I

Now, let \(K=50\) (50 bins or breaks). This histogram represents the distribution of the initial response times \(x_{it}\) for all tasks (\(N=4088\); \(T=20\)).

hist(data1$time,
     breaks = 50,
     main = "Histogram 2.2: Initial response time for all tasks",
     xlab = "Initial response time (milliseconds)",
     col = "lightgreen")
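One detail worth knowing about the call above: when `breaks` is a single number, `hist()` treats it as a suggestion and picks "pretty" break points, so the number of bins drawn may not be exactly 50. The sketch below, on simulated data with illustrative names, contrasts this with passing an explicit vector of \(K+1\) boundaries.

```r
# Simulated response times (capped at 300000 ms); illustration only.
set.seed(2)
x <- pmin(rlnorm(2000, meanlog = log(15000), sdlog = 0.8), 300000)

# A single number is only a suggestion; hist() chooses pretty break points.
h_suggested <- hist(x, breaks = 50, plot = FALSE)
length(h_suggested$counts)                # may differ from 50

# An explicit vector of K + 1 boundaries gives exactly K bins.
K <- 50
h_exact <- hist(x, breaks = seq(0, 300000, length.out = K + 1), plot = FALSE)
length(h_exact$counts)                    # exactly 50
```

Either way, the bin counts still sum to the number of observations, as required by Equation (1).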

2.3 Example II

Now, replace \(x_{it}\) with \(log(x_{it})\). This histogram represents the distribution of log initial response times for all tasks (\(N=4088\); \(T=20\)).

hist(log(data1$time),
     breaks = 50,
     main = "Histogram 2.3: Log initial response time for all tasks",
     xlab = "Log initial response time (log milliseconds)",
     col = "lightblue")

On the log scale, the histogram shows more clearly that the distribution is bimodal (i.e., it has two local maxima).

2.4 Example III

This code produces overlapping histograms to compare the distribution of initial response times between the first and last tasks (\(N=4088\); \(t=1,20\)).

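# Reshape the long data (one row per respondent and task) into a wide table
# with one row per respondent and one column per task. This assumes the rows
# appear in the same respondent order within every task.
d2 <- matrix() # placeholder column, dropped below
for(i in 1:20) {
  d1 <- data1[data1$task == i, "time"] # times for task i
  d2 <- cbind(d2, d1)
}
d3 <- as.data.frame(d2[,-1]) # drop the placeholder column
colnames(d3) <- paste0("Task", 1:max(data1$task))
rownames(d3) <- paste0("ID ", 1:max(data1$survey_id))
xh <- sort(d3$Task1) %>% data.frame() # task 1 times sorted in increasing order (used in Section 3.2)
colnames(xh) <- c("task1")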

# "#FF00007F" is red and "#0000FF7F" is blue, both about 50% transparent.
hist(d3[,20], breaks = 50, col="#FF00007F",
     main="Histogram 2.4: Initial response time for the first and last tasks",
     xlab="Initial response time (milliseconds)")
hist(d3[,1], breaks = 50, col="#0000FF7F", add=TRUE)

legend( "topright", 
        legend=c("Task 1","Task 20"),
        pch=15,
        col=c("#0000FF7F", "#FF00007F"),
        bty="n",
        ncol=2,
        pt.cex=3
        )

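The overlaid histograms show that initial response times for the last task are concentrated at smaller values than those for the first task (the median falls from 29,645 to 6,908 milliseconds; see Table 3.3).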
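When two histograms are overlaid, it can help to give both the same bin boundaries so that the bars line up exactly. The sketch below is one way to do that on simulated data; the two simulated samples, their parameters, and the variable names are assumptions for illustration and do not reproduce the competition data.

```r
# Simulated times for a slower first task and a faster last task; illustration only.
set.seed(3)
task1  <- pmin(rlnorm(500, meanlog = log(30000), sdlog = 0.6), 300000)
task20 <- pmin(rlnorm(500, meanlog = log(7000),  sdlog = 0.6), 300000)

# One set of boundaries shared by both histograms: 50 bins of 6000 ms.
common_breaks <- seq(0, 300000, by = 6000)
h1  <- hist(task1,  breaks = common_breaks, plot = FALSE)
h20 <- hist(task20, breaks = common_breaks, plot = FALSE)
identical(h1$breaks, h20$breaks)          # TRUE: the bins are aligned

# Draw the overlay only in an interactive session.
if (interactive()) {
  plot(h20, col = "#FF00007F",
       main = "Overlay with shared bins (simulated data)",
       xlab = "Initial response time (milliseconds)")
  plot(h1, col = "#0000FF7F", add = TRUE)
}
```

With shared boundaries, differences in bar heights reflect differences in the data rather than differences in how each sample was binned.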

3 Visualize the percentiles of initial response times

3.1 Background

Definition: The \(k^{th}\) percentile is a number such that \(k\) percent of observations have a value equal to or smaller than that number. For example, if the \(25^{th}\) percentile is 1256, then 25 percent of observations are equal to or smaller than 1256 and 75 percent are larger.
Definition: The median is the \(50^{th}\) percentile (Q2).
Assume the initial response times for task \(t\), \(x_{t}\), are sorted from smallest to largest, \(x_{t}[1]\le x_{t}[2]\le \dots\le x_{t}[N]\).
When \(N\) is even, that is, \(\exists\ a \in\mathbb{N}\ s.t.\ N = 2a\), \(median\ (Q2)= \frac{1}{2}(x_{t}[\frac{N}{2}]+x_{t}[\frac{N}{2}+1])\).
When \(N\) is odd, that is, \(\exists\ b \in\mathbb{N}\ s.t.\ N = 2b+1\), \(median\ (Q2)= x_{t}[\frac{N+1}{2}]\).
Definition: The interquartile range (IQR) is the distance between the \(25^{th}\) percentile (Q1) and the \(75^{th}\) percentile (Q3). That is, \(IQR = Q3-Q1\).

Q1 and Q3 are given by Equations (2) and (3), respectively. However, depending on the value of \(N\), \(\frac{N+1}{4}\) and \(\frac{3(N+1)}{4}\) may not be integers. In those cases, Q1 is obtained by linear interpolation between \(x_{t}[\lfloor\frac{N+1}{4}\rfloor]\) and \(x_{t}[\lceil\frac{N+1}{4}\rceil]\), and Q3 by linear interpolation between \(x_{t}[\lfloor\frac{3(N+1)}{4}\rfloor]\) and \(x_{t}[\lceil\frac{3(N+1)}{4}\rceil]\), as shown in cases (i)-(iv).
\[ Q1=x_{t}[\frac{N+1}{4}] \tag{2} \]\[ Q3=x_{t}[\frac{3(N+1)}{4}]\tag{3} \]

The linear interpolation falls into one of four cases, (i)-(iv), according to the form of \(N\) in Equation (4).

\[ \ \forall N \in\ \mathbb{N}, \exists\ h \in\ \mathbb{N}\ s.t.\ N = \begin{cases} {4h}\\ {4h-1}\\ {4h+1}\\ {4h+2} \end{cases} \tag{4}\]

\[\begin{aligned} \text{(i)}\ N=4h \ \ \ \\ Q1&=\frac{3}{4}x_{t}[h]+\frac{1}{4}x_{t}[h+1]\\ Q3&=\frac{1}{4}x_{t}[3h]+\frac{3}{4}x_{t}[3h+1]\\ IQR&=Q3-Q1\\ &=(\frac{1}{4}x_{t}[3h]+\frac{3}{4}x_{t}[3h+1])-(\frac{3}{4}x_{t}[h]+\frac{1}{4}x_{t}[h+1]) \\ \text{(ii)}\ N=4h-1\\ Q1&=x_{t}[h] \\ Q3&=x_{t}[3h] \\ IQR&=Q3-Q1\\ &=x_{t}[3h]-x_{t}[h] \\ \text{(iii)}\ N=4h+1 \\ Q1&=\frac{1}{2}x_{t}[h]+\frac{1}{2}x_{t}[h+1] \\ Q3&=\frac{1}{2}x_{t}[3h+1]+\frac{1}{2}x_{t}[3h+2] \\ IQR&=Q3-Q1\\ &=(\frac{1}{2}x_{t}[3h+1]+\frac{1}{2}x_{t}[3h+2])-(\frac{1}{2}x_{t}[h]+\frac{1}{2}x_{t}[h+1]) \\ \text{(iv)}\ N=4h+2 \\ Q1&=\frac{1}{4}x_{t}[h]+\frac{3}{4}x_{t}[h+1] \\ Q3&=\frac{3}{4}x_{t}[3h+2]+\frac{1}{4}x_{t}[3h+3] \\ IQR&=Q3-Q1\\ &=(\frac{3}{4}x_{t}[3h+2]+\frac{1}{4}x_{t}[3h+3])-(\frac{1}{4}x_{t}[h]+\frac{3}{4}x_{t}[h+1]) \\ \end{aligned}\]
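The four cases can be covered by one small function that interpolates at position \(p(N+1)\) in the sorted data, which is the rule behind Equations (2) and (3). The sketch below is illustrative rather than the code used in this entry: `quantile_by_position` is a hypothetical helper, and the test data are simulated. It is compared with `quantile(..., type = 6)` for one sample size from each case.

```r
# Hypothetical helper: percentile by linear interpolation at position p * (N + 1).
quantile_by_position <- function(x, p) {
  xs <- sort(x)
  n <- length(xs)
  pos <- p * (n + 1)                 # e.g. (N + 1) / 4 for Q1
  lo <- floor(pos)                   # index of the lower neighbour
  frac <- pos - lo                   # weight given to the upper neighbour
  lo_idx <- min(max(lo, 1), n)       # keep both indices inside 1..N
  hi_idx <- min(max(lo + 1, 1), n)
  (1 - frac) * xs[lo_idx] + frac * xs[hi_idx]
}

# One sample size per case: N = 8 (4h), 7 (4h-1), 9 (4h+1), 10 (4h+2), with h = 2.
set.seed(4)
for (n in c(8, 7, 9, 10)) {
  x <- round(runif(n, 1000, 60000))
  by_formula  <- sapply(c(0.25, 0.5, 0.75), function(p) quantile_by_position(x, p))
  by_quantile <- unname(quantile(x, c(0.25, 0.5, 0.75), type = 6))
  print(all.equal(by_formula, by_quantile))
}
```

For example, with the values 1 to 8 (case (i), \(h=2\)), Q1 is \(\frac{3}{4}\times 2+\frac{1}{4}\times 3=2.25\).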

3.2 Example I

This code calculates percentiles of initial response times for the first task (\(t=1\)) from the sorted data \(x_{1}[\cdot]\) and, for comparison, with the function quantile(). Table 3.2.1 shows Q1, Q2, and Q3 for the first task calculated with the formulas above; because \(N=4088=4\times 1022\), case (i) applies with \(h=1022\). The function quantile() offers nine algorithms (type = 1 to 9; the default is type = 7), and Table 3.2.2 shows the percentiles of initial response times for the first task under each of them. The calculation explained above corresponds to “type = 6”, so the “type = 6” row of Table 3.2.2 agrees with Table 3.2.1.

# N = 4088 = 4h with h = 1022, so case (i) applies.
h <- floor(nrow(xh)/4)
Q1 <- 0.75*xh[h,1]+0.25*xh[h+1,1]
Q3 <- 0.25*xh[3*h,1]+0.75*xh[3*h+1,1]
# N is even, so the median is the average of the two middle values.
Q2 <- 0.5*xh[2*h,1]+0.5*xh[2*h+1,1]
IQRf <- data.frame(Q1=Q1,Q2=Q2,Q3=Q3)

kable(IQRf, caption="Table 3.2.1: Percentiles by formula for the first task (t=1)")

Table 3.2.1: Percentiles by formula for the first task (t=1)
Q1	Q2	Q3
20148	29645	43792.75

IQR1 <- NULL
for(i in 1:9){
IQR0 <- quantile(xh[,1], type=i) #default probs are 0, 0.25, 0.5, 0.75, and 1
IQR1 <- rbind(IQR1, IQR0)
}#IQR1 stores the quantiles of task 1 for each of the 9 types. Type = 6 is what the author showed in formulas.
rownames(IQR1) = paste0("type=", 1:9)
colnames(IQR1) = c( "0th percentile", "25th percentile", "50th percentile", 
                    "75th percentile","100th percentile")
IQR1 <- round(IQR1, digits = 2)
IQR2 <- format(IQR1, digits = 5)
kable(IQR2, caption="Table 3.2.2: 9 types of percentiles for the first task (t=1)", digits=2)

Table 3.2.2: 9 types of percentiles for the first task (t=1)
	0th percentile	25th percentile	50th percentile	75th percentile	100th percentile
type=1	2023	20146	29644	43792	300000
type=2	2023	20150	29645	43793	300000
type=3	2023	20146	29644	43792	300000
type=4	2023	20146	29644	43792	300000
type=5	2023	20150	29645	43793	300000
type=6	2023	20148	29645	43793	300000
type=7	2023	20152	29645	43792	300000
type=8	2023	20149	29645	43793	300000
type=9	2023	20150	29645	43793	300000

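This example confirms that quantile() with type = 6 reproduces the results of the sorting method (Q3 = 43792.75 in Table 3.2.1 appears as 43793 in Table 3.2.2 because of rounding). The other types differ from it by a few milliseconds at most.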
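The small differences between types come from where each algorithm places the percentile in the sorted data. As a hedged illustration on a toy vector (not the survey data), type 6 interpolates at position \(p(N+1)\), while the default type 7 interpolates at position \(1+p(N-1)\):

```r
# Toy data: four sorted values.
x <- c(10, 20, 30, 40)

# Type 6: position 0.25 * (4 + 1) = 1.25, i.e. a quarter of the way from 10 to 20.
q1_type6 <- unname(quantile(x, 0.25, type = 6))

# Type 7 (default): position 1 + 0.25 * (4 - 1) = 1.75, three quarters of the way.
q1_type7 <- unname(quantile(x, 0.25, type = 7))

c(type6 = q1_type6, type7 = q1_type7)   # 12.5 and 17.5
```

The gap is large here only because the vector is tiny; with 4088 observations the two positions are close together, which is why the rows of Table 3.2.2 nearly coincide.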

3.3 Example II

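This code calculates the same results for each of the twenty tasks and presents them in a table. Note that IQR() uses type = 7 by default, whereas the percentiles are calculated with type = 6, so the IQR column can differ slightly from the difference between the \(75^{th}\) and \(25^{th}\) percentile columns (for the first task, 23640 versus 43793 - 20148 = 23645).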

m2 <- NULL   #medians
ir2 <- NULL  #quantiles (type = 6)
iqr2 <- NULL #interquartile ranges
for(i in 1:20) {d1 <- data1[data1$task == i, "time"]
m0 <- median(d1$time)
m2 <- rbind(m2, m0)
ir0 <- quantile(d1$time, type=6) #default probs are 0, 0.25, 0.5, 0.75, and 1
ir2 <- rbind(ir2, ir0)
iqr0 <- IQR(d1$time) #IQR() uses type = 7 by default
iqr2 <- rbind(iqr2, iqr0)
}
table1 <- cbind(m2, iqr2, ir2) #combine once, after the loop
rownames(table1) = paste0("task ", 1:20)
colnames(table1) <- c("Median", "IQR", "0th percentile", "25th percentile", 
                      "50th percentile", "75th percentile","100th percentile")
table3.3 <- data.frame(cbind(c(1:20),table1))
colnames(table3.3) <- c("t th task", "Median", "IQR", "0th percentile", "25th percentile", 
                        "50th percentile", "75th percentile","100th percentile")

table3.4 <- round(table3.3, digits = 0)
table3.5 <- format(table3.4, digits = 5, scientific = FALSE) #avoid "3e+05" for 300000

kable(table3.5, align = "c", caption = "**Table 3.3: Median and IQR**", 
      row.names = FALSE, escape = FALSE, centering = T)

**Table 3.3: Median and IQR**
t th task	Median	IQR	0th percentile	25th percentile	50th percentile	75th percentile	100th percentile
1	29645	23640	2023	20148	29645	43793	300000
2	20553	18589	2050	12990	20553	31580	300000
3	19210	17642	1475	11748	19210	29392	300000
4	17394	16838	1590	10981	17394	27827	300000
5	16898	16081	1998	10357	16898	26443	300000
6	16024	15845	1732	9772	16024	25619	300000
7	15416	14888	1787	9422	15416	24326	300000
8	15081	15780	1700	9045	15081	24828	300000
9	14317	14733	1786	8564	14317	23308	300000
10	14210	14608	1669	8305	14210	22915	300000
11	13901	13758	1723	8545	13901	22322	300000
12	9884	10562	1938	5845	9884	16413	300000
13	8982	9520	1818	5106	8982	14630	300000
14	8580	9705	1944	4812	8580	14526	300000
15	8158	9038	1829	4459	8158	13506	300000
16	7906	8649	1801	4377	7906	13033	300000
17	7386	8256	1847	4081	7386	12338	300000
18	7250	8095	1538	4111	7250	12211	300000
19	7134	7991	1290	4043	7134	12036	300000
20	6908	7638	1731	3904	6908	11545	300000

The largest changes in initial response times occur between the first and second tasks (the median falls from 29,645 to 20,553 milliseconds) and between the eleventh and twelfth tasks (from 13,901 to 9,884 milliseconds).
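This observation can be checked directly from the Median column of Table 3.3. The sketch below copies those medians by hand (the variable names are illustrative) and ranks the task-to-task changes.

```r
# Medians by task, copied from Table 3.3 (milliseconds).
median_ms <- c(29645, 20553, 19210, 17394, 16898, 16024, 15416, 15081, 14317, 14210,
               13901,  9884,  8982,  8580,  8158,  7906,  7386,  7250,  7134,  6908)

# Change from each task to the next; negative values are decreases.
change <- diff(median_ms)
names(change) <- paste0(1:19, "->", 2:20)

# The two largest drops.
sort(change)[1:2]
```

The two largest drops are 9092 milliseconds (task 1 to 2) and 4017 milliseconds (task 11 to 12); every other change is smaller than 2000 milliseconds.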

3.4 Example III

This code visualizes the results taken from Table 3.3 using a line plot, where the black dots represent the median (Q2) and the red dots represent the \(25^{th}\) and \(75^{th}\) percentiles (Q1 and Q3).

xmax <- 20
xmin <- 1
ymax <- 60000
ymin <- 0
# Median (Q2): black filled dots.
plot(table3.3$'t th task', table3.3$Median, bty = "l", pch = 16, type ="o", 
     xlim = c(xmin, xmax), ylim = c(ymin, ymax),
     xlab = "task number", ylab ="The initial response time (milliseconds)",
     main = "Graph 3.4: The median and interquartile range of the initial response time")
# 25th and 75th percentiles (Q1 and Q3): red open circles on the same axes.
lines(table3.3$'t th task', table3.3$'25th percentile', pch = 1, 
      col = "red", type ="o")
lines(table3.3$'t th task', table3.3$'75th percentile', pch = 1, 
      col = "red", type ="o")

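The median decreases monotonically by task. The \(25^{th}\) and \(75^{th}\) percentiles and the IQR also trend downward, with a few small exceptions (for example, the \(75^{th}\) percentile and the IQR rise slightly from the seventh to the eighth task).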
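The trend can be verified against Table 3.3. The sketch below copies the relevant columns by hand and tests whether each one is strictly decreasing; `strictly_decreasing` is a hypothetical helper written for this check.

```r
# Columns copied from Table 3.3 (milliseconds), tasks 1 to 20.
q1 <- c(20148, 12990, 11748, 10981, 10357, 9772, 9422, 9045, 8564, 8305,
         8545,  5845,  5106,  4812,  4459, 4377, 4081, 4111, 4043, 3904)
q2 <- c(29645, 20553, 19210, 17394, 16898, 16024, 15416, 15081, 14317, 14210,
        13901,  9884,  8982,  8580,  8158,  7906,  7386,  7250,  7134,  6908)
q3 <- c(43793, 31580, 29392, 27827, 26443, 25619, 24326, 24828, 23308, 22915,
        22322, 16413, 14630, 14526, 13506, 13033, 12338, 12211, 12036, 11545)
iqr <- c(23640, 18589, 17642, 16838, 16081, 15845, 14888, 15780, 14733, 14608,
         13758, 10562,  9520,  9705,  9038,  8649,  8256,  8095,  7991,  7638)

# Hypothetical helper: TRUE if every value is smaller than the one before it.
strictly_decreasing <- function(v) all(diff(v) < 0)

sapply(list(Q1 = q1, Q2 = q2, Q3 = q3, IQR = iqr), strictly_decreasing)

# Tasks after which a column increases instead of decreasing.
which(diff(q1) > 0)    # after tasks 10 and 17
which(diff(q3) > 0)    # after task 7
which(diff(iqr) > 0)   # after tasks 7 and 13
```

Only the median passes the strict test; the other three columns each have one or two small increases.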

4 References

Jakubczyk, M., Craig, B. M., Barra, M., Groothuis-Oudshoorn, C. G. M., Hartman, J. D., Huynh, E., Ramos-Goñi, J. M., Stolk, E. A., & Rand, K. (2017). Choice defines value: A predictive modeling competition in Health Preference Research. Value in Health, 21(2), 229–238. https://doi.org/10.1016/j.jval.2017.09.016

Okubo, S and Craig, B. (2023, February 18). Correlations between initial response times. R4HPR. https://r4hpr.org/visor/?e=correlations-between-initial-response-times%e3%80%80

Rizzo, M. L. (2019). Statistical computing with R, second edition. Chapman and Hall.

5 How to cite this entry

Okubo, S and Craig, B. (2023, February 18). Illustrating time to initial response in choice tasks. R4HPR. https://r4hpr.org/visor/?e=analysis-of-initial-response-time-across-20-pair-comparisons