Paper 16: Validity of six consumer-level activity monitors for measuring steps in patients with chronic heart failure
References
Vetrovsky T, Siranec M, Marencakova J, Tufano JJ, Capek V, Bunc V, et al. (2019) Validity of six consumer-level activity monitors for measuring steps in patients with chronic heart failure. PLoS ONE 14(9): e0222569. https://doi.org/10.1371/journal.pone.0222569
Disclosure
This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.
Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.
Methods used in our reproducibility analyses
There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.
The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.
To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.
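The classification of a single statistic can be sketched as follows. This is a minimal illustration only: the exact thresholds used in the project are not stated here, so the rule below (agreement within half a unit of the last reported decimal place counts as reproduced, within one full unit as incorrect rounding) and the function name `classify_statistic` are assumptions.

```python
def classify_statistic(original: float, reproduced: float, decimals: int) -> str:
    """Compare a reproduced statistic with the published value.

    `decimals` is the number of decimal places reported in the paper.
    The cut-offs are assumed for illustration, not taken from the project.
    """
    unit = 10 ** (-decimals)              # one unit in the last reported decimal place
    discrepancy = abs(reproduced - original)
    if discrepancy <= unit / 2 + 1e-12:   # agrees once rounded to the reported precision
        return "Reproduced"
    if discrepancy <= unit + 1e-12:       # off by at most one unit in the last place
        return "Incorrect Rounding"
    return "Not Reproduced"


# Values reported later in this document (OMR and WGO mean differences).
print(classify_statistic(-475, -474.8039, decimals=0))
print(classify_statistic(-204, -604, decimals=0))
```

Under these assumed cut-offs, the OMR mean difference is classed as reproduced and the WGO mean difference as not reproduced.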
For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.
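The three thresholds for a meaningful difference can be expressed as a small helper. This is a sketch; the function name `flag_change` and the choice of the reference value as the denominator for the relative change are assumptions. The example inputs are the standardized intercept and CI range from the Model 1 sensitivity tables.

```python
def flag_change(reference: float, comparison: float) -> dict:
    """Flag relative (>= 10%) and absolute (>= 0.1, >= 0.2) changes."""
    diff = abs(reference - comparison)
    pct = 100 * diff / abs(reference)     # relative change, assumed relative to the reference
    return {
        "abs_diff": diff,
        "pct_diff": pct,
        "diff_10pct": pct >= 10,
        "diff_0.1": diff >= 0.1,
        "diff_0.2": diff >= 0.2,
    }


# Standardized intercept: linear model vs bootstrapped mixed model.
print(flag_change(-0.4400, -0.4511))
# Width of the 95% confidence interval: linear model vs bootstrapped mixed model.
print(flag_change(0.5039, 0.6766))
```

The coefficient itself changes by under 3%, while the CI width changes by roughly 34% and crosses the 0.1 threshold but not the 0.2 threshold.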
Results from the reproduction of the Vetrovsky et al. (2019) paper are presented below. An overall summary of results is presented first, followed by model-specific results organised within tab panels. Within each panel, the Original results tab displays the linear regression outputs extracted from the published paper. The Reproduced results tab presents estimates derived from the authors’ shared data, along with a comprehensive assessment of linear regression assumptions. The Differences tab compares the original and reproduced models to assess computational reproducibility. Finally, the Sensitivity analysis tab evaluates inferential reproducibility by examining whether identified assumption violations meaningfully affected the results.
Summary from statistical review
This paper examined the validity of six consumer-level activity monitors for measuring steps in patients with chronic heart failure, using the Actigraph accelerometer as the criterion measure. The authors stated that linear regression models were used to test whether the true regression line between each device and the criterion corresponded to the line of concordance, with post-hoc tests applied to this comparison. However, no regression coefficients, standard errors, confidence intervals, or model summaries were reported in either the text or tables. Normality was the only regression assumption explicitly mentioned. Although the authors indicated that p-values were adjusted using Holm’s method, no p-values were reported.
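For readers unfamiliar with Holm’s method, the step-down adjustment can be sketched as below. The p-values are hypothetical, since none were reported in the paper, and `holm_adjust` is an illustrative name rather than the authors’ code.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices from smallest to largest p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])  # multiply by the number of remaining tests
        running_max = max(running_max, candidate)         # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical unadjusted p-values for three device comparisons.
print(holm_adjust([0.01, 0.04, 0.03]))
```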
This study was included in the review because Bland–Altman plots and related summary statistics were reported. The authors stated that “the Bland–Altman plots revealed no systematic differences between the activity monitors and Actigraph”; however, this conclusion was not consistently supported by the visual evidence, with several plots displaying patterns suggestive of systematic bias. Given the repeated measurements from participants, linear mixed-effects models are required to formally assess both mean bias and proportional bias. Although the authors reported using linear mixed models to compare devices, they did not report repeated-measures Bland–Altman analyses or concordance correlation coefficients. In the absence of these analyses, it is assumed that simple linear regression models were used to assess mean bias only (intercept-only models), with non-significant p-values (≥ 0.05) interpreted as evidence of no bias.
Data availability and software used
Data were available directly from the supporting information as an Excel file containing three spreadsheets in long format. No data dictionary was provided. The file included a person ID but no variable indicating the day sequence on which each device was measured (e.g., 1 to 4). All statistical analyses were performed using the statistical package R.
Regression sample
The authors reported that the Bland–Altman plots revealed no systematic differences between the activity monitors and Actigraph for heart failure patients; this assumes they checked mean and proportional bias with linear regression. As they only reported mean differences, the intercept-only model was checked for computational reproducibility. Six devices were used on the heart failure patients and compared with the Actigraph. A random sample of three of these devices was chosen: Omron (OMR), SmartLAB walk+ (SLW) and Withings Go (WGO).
Computational reproducibility results
Of the models assessed in this paper, two were partially reproducible, while the third was not reproducible. The mean difference between each device and the Actigraph (intercept model) and the corresponding p-value were assessed to determine whether the difference was significantly different from zero. The p-values were inferred from the authors’ statement that there were no systematic differences in the Bland–Altman plots. However, when computationally reproduced, all three models displayed a significant mean bias (p < 0.001), indicating that the mean differences between the devices and the Actigraph were not zero.
The reported mean differences for OMR and SLW compared with the Actigraph were computationally reproduced. However, the reported mean difference between WGO and the Actigraph was not reproduced: the reported value for HF WGO was −204, which matches neither Figure 3A nor the reproduced result of −604, so it may be a typographical error. The authors stated that they cleaned the data based on days of full wear (though no variable indicating day was provided) and removed outliers, but they did not provide sufficient detail to replicate this process. The analysis was based on mean differences reported in Table 4; however, the number of observations was not reported, making it unclear whether discrepancies were due to the use of cleaned or unprocessed data. Overall, although some models were partially reproduced, this paper was considered not computationally reproducible because the results did not support the authors’ conclusions.
Inferential reproducibility results
For the two models (OMR and SLW) that were partially computationally reproducible, neither was considered inferentially reproducible. The difference in the range of the standardized confidence interval exceeded 0.1 for OMR (0.17) and was slightly lower for SLW, reflecting artificially narrow confidence intervals due to unaccounted repeated measures. This was supported by substantial intraclass correlation coefficients (>0.25), indicating that a linear model was inappropriate. After adjusting for the mean of the two devices, there was no evidence of mean bias; however, proportional bias was present, suggesting that differences between devices varied with the magnitude of the measurement. Overall, the paper was not considered inferentially reproducible.
Recommended changes
- Use mixed models to calculate Bland-Altman statistics.
- Assess mean and proportional bias using linear mixed models to determine whether there are systematic differences between the activity monitors and the Actigraph, and update the paper’s conclusions accordingly.
- Evaluate the assumptions of the linear regression models by examining residuals, identifying influential outliers, and assessing multicollinearity among predictors. If any assumptions are violated, address them using appropriate methods.
- Present all statistics mentioned in the methods section in either the results or the supporting information.
- Update the shared data so that it matches the results presented.
- Consider creating a reproducible analysis workflow and sharing the code.
- Include a data dictionary.
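As an illustration of the first recommendation, limits of agreement for repeated measurements can be derived from the variance components of a random-intercept model. The sketch below uses one common formulation (total SD as the square root of the between-person plus within-person variance) and is an assumption about how this might be done, not the procedure used in this report. The inputs are the standardized estimates from the Model 1 random-intercept model reported later in this document.

```python
import math


def limits_of_agreement(bias, sd_between, sd_within, z=1.96):
    """Approximate 95% limits of agreement from mixed-model variance components."""
    total_sd = math.sqrt(sd_between ** 2 + sd_within ** 2)  # SD of a single difference
    return bias - z * total_sd, bias + z * total_sd


# Model 1 (OMR - Actigraph), standardized units:
# fixed intercept, random-intercept SD, residual SD.
lower, upper = limits_of_agreement(-0.4525, 0.5413, 0.7309)
print(round(lower, 3), round(upper, 3))
```

A model that ignores clustering would still estimate a similar total SD here, but it would understate the uncertainty of the bias itself, which is the concern raised in this report.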
Model 1
Model results for Difference in OMR - Actigraph (original)

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −475 | | | | | 0.05 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Difference in OMR - Actigraph (original)

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|

R2Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Difference in OMR - Actigraph (original)

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Residuals | | | | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.
Model results for Difference in OMR - Actigraph (reproduced)

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −474.804 | 135.368 | −746.698 | −202.910 | −3.508 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Difference in OMR - Actigraph (reproduced)

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.000 | 0.000 | 0.000 | 848.860 | 957.193 | | | | |

R2Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Difference in OMR - Actigraph (reproduced)

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Residuals | 46,727,120.039 | 50 | 934,542.401 | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.
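An intercept-only linear regression is equivalent to a one-sample t-test on the differences, so the reproduced coefficients can be checked from summary statistics alone. The sketch below uses N = 51, the mean and the SD from the model descriptives; the critical value 2.0086 (two-sided 5%, 50 degrees of freedom) is supplied by hand because the Python standard library has no t quantile function, and the function name is illustrative.

```python
import math


def intercept_only_fit(n, mean_diff, sd_diff, t_crit):
    """Estimate, SE, t and CI for an intercept-only model from summary statistics."""
    se = sd_diff / math.sqrt(n)          # standard error of the mean difference
    t = mean_diff / se                   # test of H0: mean difference = 0
    return {
        "B": mean_diff,
        "SE": se,
        "t": t,
        "lower": mean_diff - t_crit * se,
        "upper": mean_diff + t_crit * se,
    }


fit = intercept_only_fit(n=51, mean_diff=-474.804, sd_diff=966.717, t_crit=2.0086)
print({k: round(v, 3) for k, v in fit.items()})
```

The results agree with the reproduced table to within rounding of the critical value. Note that this calculation treats all 51 observations as independent, which is the assumption questioned in the sensitivity analysis.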
Checking residuals plots for patterns
Blue line showing quadratic fit for residuals
Checking univariate relationships with the dependent variable using scatterplots
Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling
Model descriptives including Cook’s distance and leverage to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| Difference in OMR - Actigraph | 51 | −474.804 | 966.717 | −246.000 | −4,230.000 | 1,165.000 | −1.368 | 2.865 |
| .fitted | 51 | −474.804 | 0.000 | −474.804 | −474.804 | −474.804 | | |
| .resid | 51 | −0.000 | 966.717 | 228.804 | −3,755.196 | 1,639.804 | −1.368 | 2.865 |
| .leverage | 51 | 0.020 | 0.000 | 0.020 | 0.020 | 0.020 | −6.932 | 46.020 |
| .sigma | 51 | 966.423 | 24.094 | 973.554 | 812.449 | 976.529 | −5.200 | 29.852 |
| .cooksd | 51 | 0.020 | 0.046 | 0.006 | 0.000 | 0.308 | 5.025 | 28.245 |
| .std.resid | 51 | 0.000 | 1.010 | 0.239 | −3.923 | 1.713 | −1.368 | 2.865 |
| dfb.1_ | 51 | −0.003 | 0.153 | 0.033 | −0.660 | 0.247 | −1.745 | 4.923 |
| dffit | 51 | −0.003 | 0.153 | 0.033 | −0.660 | 0.247 | −1.745 | 4.923 |

\* categorical variable
Cook’s distance threshold
Cook’s distance measures the overall change in fit if an observation is removed. Potentially influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 is often used to identify influential observations.
DFFIT threshold
DFFIT measures how many standard deviations the fitted values change when an observation is removed. Potentially influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. Because this can flag a large number of points, a cut-off of \(\pm 1\) is often used in practice to identify highly influential observations.
DFBETA threshold
DFBETA measures the change in a regression coefficient, in units of its standard error, when a particular observation is removed from the model. There is a DFBETA for each parameter in the model. Potentially influential observations are identified by \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. In practice, a cut-off of \(\pm 1\) is often used to identify outliers.
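The three size-adjusted cut-offs above can be evaluated for this model (n = 51, one parameter). For an intercept-only model every observation has leverage 1/n, so Cook’s distance has a simple closed form. The sketch below is illustrative; the function name is hypothetical, and the inputs are the largest residual and the residual SD from the descriptives table.

```python
import math

n, p = 51, 1                                   # observations; parameters (intercept only)

cook_cutoff = 4 / n                            # size-adjusted Cook's distance cut-off
dffits_cutoff = 2 * math.sqrt(p) / math.sqrt(n)
dfbetas_cutoff = 2 / math.sqrt(n)


def cooks_distance_intercept_only(residual, sigma, n):
    """Cook's distance when every observation has leverage h = 1/n and p = 1."""
    h = 1 / n
    return (residual ** 2 / sigma ** 2) * h / (1 - h) ** 2


# Largest residual in the descriptives table (-3,755.196) with residual SD 966.717.
d_max = cooks_distance_intercept_only(-3755.196, 966.717, n)

print(round(cook_cutoff, 3), round(dffits_cutoff, 3), round(dfbetas_cutoff, 3))
print(round(d_max, 3), d_max > cook_cutoff, d_max > 0.5)
```

This reproduces the maximum Cook’s distance in the table (0.308), which exceeds 4/n but not the practical threshold of 0.5. Likewise, the largest DFFITS (−0.660) exceeds the size-adjusted cut-off in absolute value but stays within ±1.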
Influence plot
Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.
COVRATIO plot
COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. In this assessment, observations with a COVRATIO outside the range 0.9 to 1.1 were flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.
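For an intercept-only model, COVRATIO can be computed directly from a residual and the residual SD, using the standard identity COVRATIO = (s²₍ᵢ₎ / s²)ᵖ / (1 − h) with leverage h = 1/n. The sketch below applies it to the largest residual in the descriptives table; the function name is illustrative.

```python
import math


def covratio_intercept_only(residual, sigma, n):
    """COVRATIO and leave-one-out sigma for an intercept-only model (p = 1, h = 1/n)."""
    p = 1
    h = 1 / n
    df = n - p
    # Residual variance after deleting the observation.
    loo_var = (df * sigma ** 2 - residual ** 2 / (1 - h)) / (df - 1)
    covratio = (loo_var / sigma ** 2) ** p / (1 - h)
    return covratio, math.sqrt(loo_var)


covratio, loo_sigma = covratio_intercept_only(-3755.196, 966.717, n=51)
print(round(covratio, 3), round(loo_sigma, 3))
print(abs(covratio - 1) > 3 * 1 / 51, not (0.9 <= covratio <= 1.1))
```

The leave-one-out sigma matches the minimum of `.sigma` in the descriptives table (812.449), and the COVRATIO of about 0.72 falls outside both the 3p/n guideline and the 0.9 to 1.1 range.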
Observations of interest identified by the influence plot
| ID | StudRes | Leverage | CookD | dfb.1_ | dffit |
|---|---|---|---|---|---|
| 3 | −0.956 | 0.020 | 0.018 | −0.135 | −0.135 |
| 2 | −1.151 | 0.020 | 0.026 | −0.163 | −0.163 |
| 9 | −2.237 | 0.020 | 0.093 | −0.316 | −0.316 |
| 48 | −4.668 | 0.020 | 0.308 | −0.660 | −0.660 |

StudRes = studentized residuals; CookD = Cook's Distance, a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.
Results for outliers and influential points
While one observation had higher Cook’s distance values than the remaining observations, all values were below 0.5, and both DFBETAS and DFFITS were within conventional thresholds. The COVRATIO suggested that this observation may affect the precision of the parameter estimates if removed.
Checking for normality of the residuals using a Q–Q plot
Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests
| Statistic | p-value | Method |
|---|---|---|
| 0.134 | 0.2925 | Exact one-sample Kolmogorov-Smirnov test |
| 0.905 | <0.001 | Shapiro-Wilk normality test |

Normality results
- The Kolmogorov-Smirnov test supports residuals being normally distributed.
- The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
- QQ-plot looks roughly normal.
Assessing independence with the Durbin–Watson test for autocorrelation
| AutoCorrelation | Statistic | p-value |
|---|---|---|
| 0.198 | 1.566 | 0.1160 |
Independence results
- The Durbin–Watson test suggests there are no auto-correlation issues.
- The observations are not independent by design, because participants were measured on multiple days; the data should be analysed using linear mixed models or generalized estimating equations.
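The Durbin–Watson statistic only compares each residual with the one in the preceding row, which is why it can miss clustering within participants. A minimal sketch of the statistic follows; the residual vectors are hypothetical and `durbin_watson` is an illustrative name.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the residual sum of squares."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals: perfectly alternating signs (negative autocorrelation)
# and constant residuals (extreme positive autocorrelation).
print(durbin_watson([1, -1, 1, -1]))
print(durbin_watson([1, 1, 1, 1]))
```

Values near 2 indicate no first-order autocorrelation, and the statistic is roughly 2(1 − r), where r is the lag-1 autocorrelation. With r = 0.198 this approximation gives about 1.60, close to the reported 1.566. Because the result depends on row order, it is not a substitute for modelling the repeated-measures structure.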
Assumption conclusions
Given that this is an intercept-only model, the assumptions assessed were normality and independence, with additional checks for outliers. The Q-Q plot suggested the residuals were approximately normally distributed. One outlier was identified; it fell within acceptable limits for Cook’s distance, DFBETA, and DFFITS, but may affect precision. The Durbin–Watson test showed no significant autocorrelation. However, the assumption of independence was violated, as participants were measured on multiple days.
Forest plot showing original and reproduced coefficients and 95% confidence intervals for Difference in OMR - Actigraph
Change in regression coefficients
| Term | O_B | R_B | Change.B | Reproduce.B |
|---|---|---|---|---|
| Intercept | −475 | −474.8039 | 0.1961 | Reproduced |

O_B = original B; R_B = reproduced B; Change.B = change in B (R_B − O_B); Reproduce.B = B reproduced.
Change in p-values
| Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
|---|---|---|---|---|---|
| Intercept | 0.05 | <0.001 | −0.0490 | Not Reproduced | Non-sig to sig, B same direction |

O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p − O_p); Reproduce.p = p-values reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note: p-values that were >0.05 were set to 0.05 for the purposes of comparison.
Results for p-values
The p-value was not explicitly reported. Non-significance was instead inferred from the authors’ statement that the Bland–Altman plots showed no systematic differences, and a p-value of 0.05 was therefore imputed for comparison. In contrast, the reproduced analysis produced a p-value of <0.001; therefore, the p-value was not reproduced.
Conclusion computational reproducibility
This model was found to be partially reproducible: the regression coefficient was reproduced, but the p-value was not.
Methods
This model was found to be partially reproducible. The mean difference between the devices and the Actigraph (intercept model) was successfully reproduced; however, the p-values, which the authors implied were not significant, could not be replicated. As the data appeared to be correct, an inferential reproducibility assessment was conducted to determine whether the authors should have used linear mixed models to calculate Bland–Altman statistics and plots, given that each participant was measured over multiple days. First, the data were scaled but not centered, and the initial analysis used linear regression to assess whether the intercept was equal to zero, which would indicate no mean bias. The same analysis was then performed using a linear mixed model with a random intercept. To further assess bias, a second linear mixed model was fit to examine proportional bias.
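One reading of "scaled but not centered" is division by the root mean square, sqrt(Σx² / (n − 1)), which is what R's `scale(x, center = FALSE)` does by default. That this is the function used here is an assumption, but the sketch below shows that it recovers the standardized intercept and standard error reported in the results from the raw-scale summary statistics. The function name is illustrative.

```python
import math


def rms_scale(n, mean, sd):
    """Root mean square with an n - 1 denominator, rebuilt from n, mean and SD."""
    sum_sq = (n - 1) * sd ** 2 + n * mean ** 2   # sum of x^2 from the summary statistics
    return math.sqrt(sum_sq / (n - 1))


scale_factor = rms_scale(n=51, mean=-474.804, sd=966.717)
b_std = -474.804 / scale_factor                  # standardized intercept
se_std = 135.368 / scale_factor                  # standardized standard error

print(round(scale_factor, 1), round(b_std, 4), round(se_std, 4))
```

Dividing by the ordinary SD instead would give an intercept of about −0.49 rather than the reported −0.44, so the unit here is the root mean square rather than a standard deviation in the usual sense.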
Results
The linear regression found that the OMR device, on average, recorded measurements 0.44 standard deviations lower than the Actigraph (95% CI: −0.69 to −0.19, p<0.001).
Linear regression coefficients for mean bias
| Term | B | SE¹ | t | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|---|---|
| (Intercept) | −0.4400 | 0.1254 | −3.5075 | −0.6920 | −0.1880 | <0.001 |

R² = 0.000; Sigma = 0.896; AIC = 137; Residual df = 50; No. Obs. = 51

¹ SE = Standard Error
A linear mixed model with a random intercept was first used to estimate the mean bias between the two devices. The outcome was the standardized difference between devices, and the model included only a fixed intercept and a random intercept. The random intercept allows each individual to start at a different baseline level of device difference, accounting for the clustering of repeated measurements within individuals. The fixed intercept estimates whether, on average, one device gives higher or lower values than the other across the sample. The analysis found that the OMR device, on average, recorded measurements 0.45 standard deviations lower than the Actigraph (95% CI: −0.81 to −0.10, p=0.014). Substantial correlation was observed between measurements within individuals (ICC = 0.354).
Linear mixed model regression coefficients for mean bias
| Term | B | SE¹ | t | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|---|---|
| (Intercept) | −0.4525 | 0.1778 | −2.5455 | −0.8099 | −0.0951 | 0.014 |
| id.SD (Intercept) | 0.5413 | | | | | |
| Residual.SD (Observations) | 0.7309 | | | | | |

Sigma = 0.731; AIC = 135; Residual df = 48; No. Obs. = 51

¹ SE = Standard Error
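The reported ICC follows directly from the two standard deviations in the table above: it is the share of the total variance attributable to differences between individuals. A minimal sketch (the function name is illustrative):

```python
def intraclass_correlation(sd_between, sd_within):
    """ICC for a random-intercept model: between-person variance over total variance."""
    var_between = sd_between ** 2
    var_within = sd_within ** 2
    return var_between / (var_between + var_within)


# id.SD (Intercept) and Residual.SD (Observations) from the mixed model table.
icc = intraclass_correlation(0.5413, 0.7309)
print(round(icc, 3))
```

An ICC of about 0.35 means roughly a third of the variation in device differences is between participants, so repeated days from the same person carry less information than independent observations would.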
Reproduced linear model compared to the bootstrapped linear mixed model
Change in regression coefficients
| Term | B | boot.B | B_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | −0.4400 | −0.4511 | 0.0111 | 2.5300 | No | No | No |

B = standardized reproduced regression coefficient; boot.B = bootstrapped standardized reproduced B; B_diff = change in B (B − boot.B); %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

| Term | Lower | boot.Lower | Lower_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | −0.6920 | −0.7885 | 0.0966 | 13.9600 | Yes | No | No |

Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized reproduced lower CI; Lower_diff = change in lower CI (Lower − boot.Lower); %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

| Term | Upper | boot.Upper | Upper_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | −0.1880 | −0.1119 | −0.0761 | −40.4900 | Yes | No | No |

Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized reproduced upper CI; Upper_diff = change in upper CI (Upper − boot.Upper); %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in range of 95% confidence interval

| Term | Range | boot.Range | Range_Diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.5039 | 0.6766 | 0.1727 | 34.2700 | Yes | Yes | No |

Range = standardized reproduced CI range; boot.Range = bootstrapped standardized reproduced CI range; Range_Diff = change in CI range; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

| Term | p-value | boot.p-value | changep | SigChangeDirection |
|---|---|---|---|---|
| Intercept | <0.001 | 0.0108 | −0.0098 | Remains sig, B same direction |

p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized reproduced p-value; changep = change in p-value (p-value − boot.p-value); SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.
Check the distribution of bootstrap estimates
The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.
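The details of the bootstrap procedure are not given here. One way to respect the repeated-measures structure is a cluster bootstrap, in which whole participants are resampled with replacement. The sketch below applies this to a simple mean difference rather than a mixed model, which is a deliberate simplification; the data, participant IDs and function name are hypothetical.

```python
import random
import statistics


def cluster_bootstrap_mean(data_by_id, n_boot=2000, seed=1):
    """Percentile bootstrap of the mean, resampling participants rather than rows."""
    rng = random.Random(seed)
    ids = list(data_by_id)
    estimates = []
    for _ in range(n_boot):
        sampled = [rng.choice(ids) for _ in ids]              # draw participants with replacement
        values = [v for pid in sampled for v in data_by_id[pid]]
        estimates.append(statistics.fmean(values))
    estimates.sort()
    lower = estimates[int(0.025 * n_boot)]                    # 2.5th percentile
    upper = estimates[int(0.975 * n_boot) - 1]                # 97.5th percentile
    return statistics.fmean(estimates), lower, upper


# Hypothetical device differences (steps), several days per participant.
differences = {
    "p1": [-300, -250, -410],
    "p2": [-900, -1100],
    "p3": [50, -20, 10, 80],
    "p4": [-600, -480, -520],
    "p5": [-150, -90],
    "p6": [-1300, -1250, -1400],
}
boot_mean, ci_lower, ci_upper = cluster_bootstrap_mean(differences)
print(round(boot_mean), round(ci_lower), round(ci_upper))
```

Resampling rows instead of participants would treat days from the same person as independent and would tend to give intervals that are too narrow.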
A linear mixed model was used to assess mean and proportional bias between two devices. The outcome was the standardized difference between devices, with the standardized mean as the predictor. Random intercepts and slopes were included to allow individuals to start at different baseline levels of device difference and to have individual-specific slopes. The fixed intercept estimates overall mean bias, while the fixed slope captures proportional bias, that is, whether the difference between devices changes with measurement magnitude.
After adjusting for the mean of the two devices, no evidence of mean bias remained; the OMR device recorded measurements 0.05 standard deviations higher than the Actigraph on average (95% CI: −0.26 to 0.36, p = 0.746). However, there was evidence of proportional bias: for every one standard deviation increase in the mean of the two devices, the difference between the OMR and Actigraph decreased by 0.60 standard deviations (95% CI: −1.09 to −0.10, p = 0.020). This suggests that the OMR tends to underestimate higher values relative to the Actigraph.
Linear mixed model regression coefficients for mean and proportional bias
| Term | B | SE¹ | t | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|---|---|
| (Intercept) | 0.0501 | 0.1537 | 0.3263 | −0.2594 | 0.3597 | 0.746 |
| z_meanomr | −0.5966 | 0.2466 | −2.4190 | −1.0933 | −0.0999 | 0.020 |
| id.SD (Intercept) | 0.0105 | | | | | |
| id.SD (z_meanomr) | 0.6319 | | | | | |
| id.Cor (Intercept~z_meanomr) | −0.9452 | | | | | |
| Residual.SD (Observations) | 0.4680 | | | | | |

Sigma = 0.468; AIC = 105; Residual df = 45; No. Obs. = 51

¹ SE = Standard Error
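To see what the fixed effects imply, the expected standardized difference can be evaluated at a few values of the standardized mean. This is simple arithmetic on the fixed-effect estimates in the table above, ignoring the random effects; the function name is illustrative.

```python
def predicted_difference(z_mean, intercept=0.0501, slope=-0.5966):
    """Expected standardized OMR - Actigraph difference at a given standardized mean."""
    return intercept + slope * z_mean


for z in (-1, 0, 1):
    print(z, round(predicted_difference(z), 4))
```

At a standardized mean of −1 the OMR is expected to read about 0.65 units above the Actigraph, while at +1 it is expected to read about 0.55 units below, which is the pattern described as proportional bias.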
Scatterplot showing proportional bias between two devices
Assumption check
The performance package produces diagnostic plots to assess model assumptions, including residual normality, homoscedasticity, and potential non-linearity. It also provides component+residual plots for continuous predictors and an outlier plot with Mahalanobis distance contours.
The residuals were approximately normal, and although some outliers were present, Cook’s distance indicated that they were not influential. Linearity was assessed by adding a quadratic term to the model; the term was not significant, indicating that a linear specification was adequate. No funnelling was observed in the residuals, suggesting that the assumption of homoscedasticity was reasonable.
Cook’s distance
Linearity check
Inferential reproducibility conclusion based on linear mixed models
Although the mean bias changed by less than 3%, the bootstrapped confidence interval (CI) was 34% wider than the linear model’s, indicating that the linear model underestimated the standard error relative to the linear mixed model. Because the difference in CI range (0.17) exceeded both the 0.1 absolute threshold and the 10% relative threshold, the model was not considered reproducible. There was substantial within-individual correlation (intraclass correlation coefficient [ICC] = 0.354), indicating that a linear model was not appropriate. After adjusting for the mean of the two devices, there was no evidence of mean bias; however, proportional bias was present, suggesting that the difference between devices varied with the magnitude of the measurement. Overall, the model was found not to be inferentially reproducible.
Model 2
Model results for Difference in SLW - Actigraph (original)

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −474 | | | | | 0.05 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Difference in SLW - Actigraph (original)

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|

R2Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Difference in SLW - Actigraph (original)

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Residuals | | | | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.
Model results for Difference in SLW - Actigraph (reproduced)

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −473.529 | 123.814 | −722.218 | −224.841 | −3.825 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Model fit for Difference in SLW - Actigraph (reproduced)

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.000 | 0.000 | 0.000 | 839.761 | 875.500 | | | | |

R2Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Difference in SLW - Actigraph (reproduced)

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Residuals | 39,091,552.706 | 50 | 781,831.054 | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.
Checking residuals plots for patterns
Blue line showing quadratic fit for residuals
Checking univariate relationships with the dependent variable using scatterplots
Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling
Model descriptives including cook’s distance and leverage to understand outliers
Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
Diffence in SLW - Actigraph | 51 | −473.529 | 884.212 | −394.000 | −2,872.000 | 1,243.000 | −0.298 | −0.473 |
.fitted | 51 | −473.529 | 0.000 | −473.529 | −473.529 | −473.529 | ||
.resid | 51 | −0.000 | 884.212 | 79.529 | −2,398.471 | 1,716.529 | −0.298 | −0.473 |
.leverage | 51 | 0.020 | 0.000 | 0.020 | 0.020 | 0.020 | −6.932 | 46.020 |
.sigma | 51 | 884.134 | 11.866 | 888.351 | 823.430 | 893.177 | −2.877 | 11.135 |
.cooksd | 51 | 0.020 | 0.026 | 0.011 | 0.000 | 0.150 | 2.780 | 10.446 |
.std.resid | 51 | 0.000 | 1.010 | 0.091 | −2.740 | 1.961 | −0.298 | −0.473 |
dfb.1_ | 51 | −0.001 | 0.145 | 0.013 | −0.416 | 0.286 | −0.347 | −0.250 |
dffit | 51 | −0.001 | 0.145 | 0.013 | −0.416 | 0.286 | −0.347 | −0.250 |
* categorical variable | ||||||||
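In an intercept-only model the estimated intercept is the sample mean of the differences, so the coefficient table can be cross-checked from the descriptives alone. The sketch below does this with the mean, SD, and N reported above. The critical value `T_CRIT_50` is hard-coded from standard t tables (an assumption made to keep the example dependency-free), and the variable names are illustrative.

```python
import math

# Summary statistics copied from the descriptives table above.
n = 51
mean_diff = -473.529
sd_diff = 884.212

# Two-sided 95% critical value of Student's t with n - 1 = 50 degrees of
# freedom, taken from standard t tables (assumption: hard-coded, not computed).
T_CRIT_50 = 2.00856

se = sd_diff / math.sqrt(n)          # standard error of the mean difference
t_stat = mean_diff / se              # test of H0: mean difference = 0
lower = mean_diff - T_CRIT_50 * se   # lower 95% confidence limit
upper = mean_diff + T_CRIT_50 * se   # upper 95% confidence limit

print(f"SE = {se:.3f}, t = {t_stat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Up to rounding, these values agree with the SE, t, and confidence limits in the model results table for this model.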
Cooks threshold
Cook’s distance measures the overall change in fit, if an observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 is often used to identify influential observations.
DFFIT threshold
DFFITS measures how many standard deviations the fitted value changes when an observation is removed. Potential influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of parameters (including the intercept) and n is the number of observations. In practice, this can flag a large number of points, so a threshold of \(\pm 1\) is often used to identify highly influential observations.
DFBETA threshold
DFBETA measures the change in a regression coefficient, in units of its standard error, when a particular observation is removed from the model. There is a DFBETA for each parameter in the model. Potential influential observations are identified by \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. In practice, a threshold of \(\pm 1\) is often used to identify influential observations.
Influence plot
Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.
COVRATIO plot
COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline flags observations with \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. As a practical alternative, observations with COVRATIO outside the range 0.9 to 1.1 were flagged as having a meaningful impact on precision, although there is no universally agreed cut-off.
Observations of interest identified by the influence plot
ID | StudRes | Leverage | CookD | dfb.1_ | dffit |
|---|---|---|---|---|---|
3 | −1.287 | 0.020 | 0.033 | −0.182 | −0.182 |
2 | −1.440 | 0.020 | 0.041 | −0.204 | −0.204 |
23 | 2.020 | 0.020 | 0.077 | 0.286 | 0.286 |
44 | −2.942 | 0.020 | 0.150 | −0.416 | −0.416 |
StudRes = studentized residuals; CookD = Cook's Distance, a combined measure of leverage and influence; DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed. | |||||
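The size-adjusted cut-offs described above can be evaluated for this model (n = 51, p = 1) and compared with the practical thresholds of 0.5 and ±1. The sketch below is illustrative only: the observation values are copied from the table above, and the variable names are hypothetical.

```python
import math

n, p = 51, 1  # observations and parameters (intercept only)

cook_cut = 4 / n                    # size-adjusted Cook's distance cut-off
dffits_cut = 2 * math.sqrt(p / n)   # size-adjusted DFFITS cut-off
dfbeta_cut = 2 / math.sqrt(n)       # size-adjusted DFBETAS cut-off
covratio_band = 3 * p / n           # guideline for |COVRATIO - 1|

# (ID, Cook's distance, DFFITS) copied from the table of observations of interest.
observations = [(3, 0.033, -0.182), (2, 0.041, -0.204),
                (23, 0.077, 0.286), (44, 0.150, -0.416)]

flags = {}
for obs_id, cook_d, dffit in observations:
    flags[obs_id] = {
        "cook_gt_4_over_n": cook_d > cook_cut,
        "cook_gt_0.5": cook_d > 0.5,
        "dffits_gt_size_adjusted": abs(dffit) > dffits_cut,
        "dffits_gt_1": abs(dffit) > 1,
    }
    print(obs_id, flags[obs_id])
```

With these inputs, observation 44 exceeds the size-adjusted Cook's distance cut-off, and observations 23 and 44 exceed the size-adjusted DFFITS cut-off, while no observation exceeds the practical thresholds of 0.5 or ±1.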
Results for outliers and influential points
- While one observation had higher Cook’s distance values than the remaining observations, all values were below 0.5, and both DFBETAS and DFFITS were within conventional thresholds. The COVRATIO suggested that this observation may affect the precision of the parameter estimates if removed.
Checking for normality of the residuals using a Q–Q plot
Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests
Statistic | p-value | Method |
|---|---|---|
0.100 | 0.6891 | Asymptotic one-sample Kolmogorov-Smirnov test |
Statistic | p-value | Method |
|---|---|---|
0.968 | 0.1751 | Shapiro-Wilk normality test |
Normality results
- The Kolmogorov-Smirnov test (p = 0.689) provides no evidence against normally distributed residuals.
- The Shapiro-Wilk test (p = 0.175) likewise provides no evidence against normally distributed residuals.
- QQ-plot looks roughly normal.
Assessing independence with the Durbin–Watson test for autocorrelation
AutoCorrelation | Statistic | p-value |
|---|---|---|
0.063 | 1.846 | 0.5620 |
Independence results
- The Durbin–Watson test suggests there are no auto-correlation issues.
- The observations are not independent because participants were measured on multiple days; the data should be analysed using linear mixed models or generalized estimating equations.
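For reference, the Durbin–Watson statistic is the sum of squared successive differences of the residuals divided by their sum of squares; values near 2 indicate little first-order autocorrelation. It depends on the order of the rows and does not account for clustering by participant, which is why the design-based concern above still applies. A minimal sketch using hypothetical residuals (not the study data):

```python
def durbin_watson(residuals):
    """Return the Durbin-Watson statistic for residuals in their recorded order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Numerator: squared differences between consecutive residuals.
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    # Denominator: residual sum of squares.
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical, perfectly alternating residuals (strong negative autocorrelation).
example = [1.0, -1.0, 1.0, -1.0]
print(durbin_watson(example))  # 3.0
```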
Assumption conclusions
Given that this is an intercept-only model, the assumptions assessed were normality and independence, with additional checks for outliers. The Q-Q plot suggested the residuals were approximately normally distributed. One mild outlier was identified; it fell within acceptable limits for Cook’s distance, DFBETA, and DFFITS, although it may affect precision. The Durbin–Watson test showed no significant autocorrelation. However, the assumption of independence was violated, as participants were measured on multiple days.
Forest plot showing original and reproduced coefficients and 95% confidence intervals for Difference in SLW - Actigraph
Change in regression coefficients
term | O_B | R_B | Change.B | reproduce.B |
|---|---|---|---|---|
Intercept | −474 | −473.5294 | 0.4706 | Reproduced |
O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced. | ||||
Change in p-values
Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
|---|---|---|---|---|---|
Intercept | 0.05 | <0.001 | −0.0490 | Not Reproduced | Non-sig to sig, B same direction |
O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p − O_p); Reproduce.p = p-value reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note: p-values that were >0.05 were set to 0.05 for the purposes of comparison. | |||||
Results for p-values
The p-value was not explicitly reported. Instead, the authors inferred non-significance, stating that Bland–Altman plots showed no systematic differences, and a p-value of 0.05 was therefore imputed. In contrast, the reproduced analysis produced a p-value of <0.001; therefore, the p-value was not reproduced.
Conclusion computational reproducibility
This model was found to be partially reproducible: the regression coefficient was reproduced, but the p-value was not.
The mean difference between the SLW and the Actigraph (intercept-only model) was successfully reproduced; however, the p-value, which the authors implied was not significant, could not be reproduced. As the data appeared to be correct, an inferential reproducibility assessment was conducted to determine whether the authors should have used linear mixed models to calculate Bland–Altman statistics and plots, given that each participant was measured over multiple days. First, the data were scaled but not centred, and the initial analysis used linear regression to assess whether the intercept was equal to zero, which would indicate no mean bias. The same analysis was then performed using a linear mixed model with a random intercept. To further assess bias, a second linear mixed model was fitted to examine proportional bias.
Linear regression coefficients for mean bias
Term | B | SE | t | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|---|---|
(Intercept) | −0.4711 | 0.1232 | −3.8245 | −0.7184 | −0.2237 | <0.001 |
R² = 0.000; Sigma = 0.880; AIC = 135; Residual df = 50; No. Obs. = 51 | ||||||
SE = Standard error. | ||||||
The linear regression found that the SLW device, on average, recorded measurements 0.47 standard deviations lower than the Actigraph (95% CI: −0.72 to −0.22, p < 0.001).
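The scaled coefficients can be related back to the unscaled model. The sketch below assumes that "scaled but not centred" means dividing by the root mean square with an n − 1 divisor, which is the default behaviour of R's `scale()` when `center = FALSE`; the report does not state the exact call, so this is an assumption. Under that assumption the divisor is the root mean square rather than the centred standard deviation. Inputs are the unscaled mean, SD, and SE reported earlier for this model.

```python
import math

n = 51
mean_diff = -473.529   # unscaled intercept (mean difference)
sd_diff = 884.212      # unscaled SD of the differences
se_diff = 123.814      # unscaled standard error of the intercept

# Sum of squares about zero, rebuilt from the mean and SD:
#   sum(x^2) = (n - 1) * SD^2 + n * mean^2
sum_sq = (n - 1) * sd_diff ** 2 + n * mean_diff ** 2

# Root mean square with an n - 1 divisor (assumed scaling factor).
rms = math.sqrt(sum_sq / (n - 1))

scaled_b = mean_diff / rms      # scaled intercept
scaled_se = se_diff / rms       # scaled standard error
scaled_sigma = sd_diff / rms    # scaled residual SD

print(f"B = {scaled_b:.4f}, SE = {scaled_se:.4f}, Sigma = {scaled_sigma:.3f}")
```

Under this assumption the results agree, up to rounding, with the B, SE, and Sigma values in the linear regression table for mean bias.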
A linear mixed model with a random intercept was first used to estimate the mean bias between the two devices. The outcome was the standardized difference between devices, and the model included only a fixed intercept and a random intercept. The random intercept allows each individual to have a different baseline level of device difference, accounting for the clustering of repeated measurements within individuals. The fixed intercept estimates whether, on average, one device gives higher or lower values than the other across the sample. The analysis found that the SLW device, on average, recorded measurements 0.49 standard deviations lower than the Actigraph (95% CI: −0.81 to −0.16, p = 0.004). Substantial correlation was observed between measurements within individuals (ICC = 0.26).
Linear mixed model regression coefficients for mean bias
Term | B | SE | t | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|---|---|
(Intercept) | −0.4852 | 0.1613 | −3.0081 | −0.8095 | −0.1609 | 0.004 |
id.SD (Intercept) | 0.4508 | |||||
Residual.SD (Observations) | 0.7600 | |||||
Sigma = 0.760; AIC = 135; Residual df = 48; No. Obs. = 51 | ||||||
SE = Standard error. | ||||||
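The intraclass correlation coefficient for a random-intercept model is the between-individual variance divided by the total variance. The sketch below computes it from the two standard deviations reported in the table above; the function name is illustrative.

```python
def icc_from_sds(between_sd, residual_sd):
    """ICC = between-individual variance / (between + residual variance)."""
    between_var = between_sd ** 2
    residual_var = residual_sd ** 2
    return between_var / (between_var + residual_var)


# id.SD (Intercept) and Residual.SD from the mixed model for mean bias.
icc = icc_from_sds(0.4508, 0.7600)
print(f"ICC = {icc:.3f}")
```

This gives an ICC of approximately 0.26, matching the value reported in the text.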
Reproduced linear model compared to bootstrapped linear mixed model
Change in regression coefficients for intercept only model
Term | B | boot.B | B_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
Intercept | −0.4400 | −0.4837 | 0.0437 | 9.9300 | No | No | No |
B = standardized reproduced regression coefficient; boot.B = bootstrapped standardized regression coefficient; B_diff = B − boot.B; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively. | |||||||
Change in lower 95% confidence interval for intercept only model
Term | Lower | boot.Lower | Lower_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
Intercept | −0.6920 | −0.7905 | 0.0986 | 14.2400 | Yes | No | No |
Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized lower CI; Lower_diff = Lower − boot.Lower; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively. | |||||||
Change in upper 95% confidence interval for intercept only model
Term | Upper | boot.Upper | Upper_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
Intercept | −0.1880 | −0.1754 | −0.0126 | −6.7100 | No | No | No |
Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized upper CI; Upper_diff = Upper − boot.Upper; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively. | |||||||
Change in Range of 95% confidence interval for intercept only model
Term | Range | boot.Range | Range_Diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
Intercept | 0.5039 | 0.6151 | 0.1112 | 22.0600 | Yes | Yes | No |
Range = standardized reproduced CI range; boot.Range = bootstrapped standardized CI range; Range_Diff = change in CI range; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively. | |||||||
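The comparison flags in these tables can be sketched as follows. The formula is an assumption inferred from the footnotes and the tabulated values: the percentage difference is taken as the absolute difference relative to the absolute reproduced value, and the flags compare absolute differences with the stated cut-offs. The function name and dictionary keys are hypothetical.

```python
def compare_estimates(reproduced, bootstrapped):
    """Compare a reproduced estimate with its bootstrapped counterpart."""
    abs_diff = abs(reproduced - bootstrapped)
    # Assumed definition: difference relative to the reproduced value.
    pct_diff = 100 * abs_diff / abs(reproduced)
    return {
        "abs_diff": abs_diff,
        "pct_diff": pct_diff,
        "diff_10pct": pct_diff >= 10,
        "diff_0.1": abs_diff >= 0.1,
        "diff_0.2": abs_diff >= 0.2,
    }


# Values copied from the intercept and CI range rows above.
coef = compare_estimates(-0.4400, -0.4837)
ci_range = compare_estimates(0.5039, 0.6151)

for label, result in (("B", coef), ("Range", ci_range)):
    print(label, {k: (round(v, 4) if isinstance(v, float) else v)
                  for k, v in result.items()})
```

With these inputs the coefficient is flagged on none of the three criteria, whereas the CI range is flagged on the 10% and 0.1 criteria but not on 0.2, consistent with the tables.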
Change in p-value significance and regression coefficient direction for intercept only model
Term | p-value | boot.p-value | changep | SigChangeDirection |
|---|---|---|---|---|
Intercept | <0.001 | 0.0024 | −0.0014 | Remains sig, B same direction |
p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized p-value; changep = p-value − boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model. | ||||
Check the distribution of bootstrap estimates
The bootstrap distribution of each coefficient appeared approximately normal and centred near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multi-modality was observed.
A linear mixed model was used to assess mean and proportional bias between the two devices. The outcome was the standardized difference between devices, with the standardized mean of the two devices as the predictor. Random intercepts and slopes were initially specified to allow individuals to have different baseline levels of device difference and individual-specific slopes. However, there was not enough variance to fit a random slope model, so a less complex random intercept model was used. The fixed intercept estimates overall mean bias, while the fixed slope captures proportional bias, that is, whether the difference between devices changes with measurement magnitude.
After adjusting for the mean of the two devices, no evidence of mean bias remained; the SLW device recorded measurements 0.08 standard deviations higher than the Actigraph on average (95% CI: −0.38 to 0.53, p = 0.742). However, there was evidence of proportional bias: for every one standard deviation increase in the mean of the two devices, the difference between the SLW and Actigraph decreased by 0.65 standard deviations (95% CI: −1.08 to −0.22, p = 0.004), indicating that the SLW tends to underestimate at higher values compared to the Actigraph.
Linear mixed model regression coefficients for mean and proportional bias
Term | B | SE | t | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|---|---|
(Intercept) | 0.0754 | 0.2278 | 0.3309 | −0.3829 | 0.5337 | 0.742 |
z_meanslw | −0.6521 | 0.2141 | −3.0453 | −1.0829 | −0.2213 | 0.004 |
id.SD (Intercept) | 0.3388 | |||||
Residual.SD (Observations) | 0.7147 | |||||
Sigma = 0.715; AIC = 127; Residual df = 47; No. Obs. = 51 | ||||||
SE = Standard error. | ||||||
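The idea behind the proportional bias check can be illustrated with a simplified sketch: regress the between-device difference on the between-device mean and inspect the slope. The example below uses ordinary least squares on hypothetical, unstandardized step counts and ignores the random intercept for participants, so it is not the model reported above; the function name and data are invented for illustration.

```python
def bland_altman_regression(test_device, reference):
    """Least-squares fit of (test - reference) on the mean of the two devices."""
    diffs = [t - r for t, r in zip(test_device, reference)]
    means = [(t + r) / 2 for t, r in zip(test_device, reference)]
    n = len(diffs)
    mean_x = sum(means) / n
    mean_y = sum(diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in means)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(means, diffs))
    slope = sxy / sxx                      # proportional bias
    intercept = mean_y - slope * mean_x    # bias when the device mean is zero
    return intercept, slope


# Hypothetical data: the test device under-reads the reference by 10%.
reference = [2000, 4000, 6000, 8000, 10000]
test_device = [0.9 * r for r in reference]

intercept, slope = bland_altman_regression(test_device, reference)
print(f"intercept = {intercept:.3f}, slope = {slope:.4f}")
```

A negative slope, as in this hypothetical case, means the test device under-reads more as the step count increases, which is the pattern described for the SLW above.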
Scatterplot showing proportional bias between two devices
Assumption check
The residuals were approximately normal, and although some outliers were present, Cook’s distance indicated that they were not influential. Linearity was assessed by adding quadratic and cubic terms to the model; neither term improved the model, indicating that a linear specification was adequate. Residuals showed only mild variation in spread across fitted values; however, this was small and not considered to materially affect inference. Therefore, the assumption of homoscedasticity was judged to be reasonable.
Residual plots
Cooks Distance
Linearity check
Inferential reproducibility conclusion based on linear mixed models
Although the difference in mean bias was slightly less than 10%, the bootstrapped confidence interval (CI) was 22% wider than the linear model’s, indicating that the linear model underestimated the standard error relative to the linear mixed model. Because the difference in CI range exceeded 0.1, the model was not considered reproducible. There was substantial within-individual correlation (intraclass correlation coefficient [ICC] = 0.260), indicating that a linear model was not appropriate. After adjusting for the mean of the two devices, there was no evidence of mean bias; however, proportional bias was present, suggesting that the difference between devices varied with the magnitude of the measurement. Overall, the model was found not to be inferentially reproducible.
Model 3
Model results for Difference in WGO - Actigraph
Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
Intercept | −204 | | | | | 0.05 |
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval. | ||||||
Fit statistics for Difference in WGO - Actigraph
R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals. | ||||||||
ANOVA table for Difference in WGO - Actigraph
Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
Residuals | |||||
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square. | |||||
Model results for Difference in WGO - Actigraph
Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
Intercept | −604.167 | 121.068 | −846.999 | −361.334 | −4.990 | <0.001 |
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval. | ||||||
Fit statistics for Difference in WGO - Actigraph
R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
0.000 | 0.000 | 0.000 | 889.648 | 881.392 | ||||
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals. | ||||||||
ANOVA table for Difference in WGO - Actigraph
Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
Residuals | 41,949,965.500 | 53 | 791,508.783 | ||
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS. | |||||
Forest plot showing original and reproduced coefficients and 95% confidence intervals for Difference in WGO - Actigraph
Change in regression coefficients
term | O_B | R_B | Change.B | reproduce.B |
|---|---|---|---|---|
Intercept | −204 | −604.1667 | −400.1667 | Not Reproduced |
O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced. | ||||
Change in p-values
Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
|---|---|---|---|---|---|
Intercept | 0.05 | <0.001 | −0.0490 | Not Reproduced | Non-sig to sig, B same direction |
O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were >0.05 were set to 0.05 for the purposes of comparison. | |||||
Results for p-values
The p-value was not explicitly reported. Instead, the authors inferred non-significance, stating that Bland–Altman plots showed no systematic differences, and a p-value of 0.05 was therefore imputed. In contrast, the reproduced analysis yielded a p-value of <0.001; accordingly, the p-value was not reproduced.
Conclusion computational reproducibility
This model was not reproducible, as the regression coefficient and p-value could not be reproduced.
Because this model was not computationally reproducible, inferential reproducibility was not assessed, and the statistical assumptions could not be meaningfully compared or interpreted.