Paper 16: Validity of six consumer-level activity monitors for measuring steps in patients with chronic heart failure

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

March 15, 2026

References

Vetrovsky T, Siranec M, Marencakova J, Tufano JJ, Capek V, Bunc V, et al. (2019) Validity of six consumer-level activity monitors for measuring steps in patients with chronic heart failure. PLoS ONE 14(9): e0222569. https://doi.org/10.1371/journal.pone.0222569

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.
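The categorisation step can be sketched as below. The exact thresholds are not stated in this report, so the cut-offs used here (half a unit in the last reported decimal place for Reproduced, one full unit for Incorrect Rounding) are assumptions for illustration, as is the function name `classify_statistic`.

```python
def classify_statistic(original, reproduced, decimals):
    """Classify one reported statistic against its reproduced value.

    The thresholds are illustrative assumptions, not the report's exact rule.
    """
    unit = 10 ** (-decimals)            # size of the last reported decimal place
    diff = abs(reproduced - original)   # absolute discrepancy
    tolerance = 1e-12                   # guards against floating-point noise
    if diff <= 0.5 * unit + tolerance:  # consistent with correct rounding
        return "Reproduced"
    if diff <= 1.0 * unit + tolerance:  # off by at most one unit in the last place
        return "Incorrect Rounding"
    return "Not Reproduced"


# Mean differences quoted in this report (reported to 0 decimal places)
print(classify_statistic(-475, -474.8039, decimals=0))
print(classify_statistic(-474, -473.5294, decimals=0))
print(classify_statistic(-204, -604.0, decimals=0))
```

Under these assumed cut-offs the OMR and SLW mean differences fall in the Reproduced category and the WGO mean difference does not, which agrees with the classifications given later in the report.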

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.
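The thresholds for a meaningful difference translate into a simple set of flags. The sketch below is illustrative only: the function name `flag_change` is hypothetical, and the inputs are rounded standardized values taken from the Model 1 sensitivity tables in this report, so the last digit of a percentage can differ slightly from the tabulated one.

```python
def flag_change(reference, comparison):
    """Flag relative (>=10%) and absolute (>=0.1, >=0.2) changes."""
    abs_diff = abs(comparison - reference)
    pct_diff = abs_diff / abs(reference) * 100  # percent of the reference value
    return {
        "abs_diff": round(abs_diff, 4),
        "pct_diff": round(pct_diff, 2),
        "diff_10pct": pct_diff >= 10,
        "diff_0.1": abs_diff >= 0.1,   # moderate absolute difference
        "diff_0.2": abs_diff >= 0.2,   # substantial absolute difference
    }


# Reproduced linear model versus bootstrapped linear mixed model (Model 1)
coefficient = flag_change(-0.4400, -0.4511)
ci_range = flag_change(0.5039, 0.6766)
print(coefficient)
print(ci_range)
```

With these inputs the coefficient triggers none of the flags, while the confidence interval range triggers the 10% and 0.1 flags but not the 0.2 flag.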

Results from the reproduction of the Vetrovsky et al. (2019) paper are presented below. An overall summary of results is presented first, followed by model-specific results organised within tab panels. Within each panel, the Original results tab displays the linear regression outputs extracted from the published paper. The Reproduced results tab presents estimates derived from the authors’ shared data, along with a comprehensive assessment of linear regression assumptions. The Differences tab compares the original and reproduced models to assess computational reproducibility. Finally, the Sensitivity analysis tab evaluates inferential reproducibility by examining whether identified assumption violations meaningfully affected the results.

Summary from statistical review

This paper examined the validity of six activity monitors in patients with chronic heart failure, using the Actigraph accelerometer as the criterion measure. The authors stated that linear regression models were used to test whether the true regression line between each device and the criterion corresponded to the line of concordance, with post-hoc tests applied to this comparison. However, no regression coefficients, standard errors, confidence intervals, or model summaries were reported in either the text or tables. Normality was the only regression assumption explicitly mentioned. Although the authors indicated that p-values were adjusted using Holm’s method, no p-values were reported.

This study was included in the review because Bland–Altman plots and related summary statistics were reported. The authors stated that “the Bland–Altman plots revealed no systematic differences between the activity monitors and Actigraph”; however, this conclusion was not consistently supported by the visual evidence, with several plots displaying patterns suggestive of systematic bias. Given the repeated measurements from participants, linear mixed-effects models are required to formally assess both mean bias and proportional bias. Although the authors reported using linear mixed models to compare devices, they did not report repeated-measures Bland–Altman analyses or concordance correlation coefficients. In the absence of these analyses, it is assumed that simple linear regression models were used to assess mean bias only (intercept-only models), with non-significant p-values (≥ 0.05) interpreted as evidence of no bias.

Data availability and software used

Data were available directly from the supporting information as an Excel file containing three spreadsheets in long format. No data dictionary was provided. The file included a person ID but no variable indicating the day sequence in which each device was measured (e.g., 1 to 4). All statistical analyses were performed using the statistical package R.

Regression sample

The authors reported that the Bland–Altman plots revealed no systematic differences between the activity monitors and Actigraph for heart failure patients; this assumes they checked mean and proportional bias with linear regression. As they only reported mean differences, the intercept model will be checked for computational reproducibility. Six devices were used on the heart failure patients and compared to the Actigraph. A random sample of three of these devices was chosen: Omron (OMR), SmartLAB walk+ (SLW) and Withings Go (WGO).

Computational reproducibility results

Of the models assessed in this paper, two were partially reproducible, while the third was not reproducible. The mean difference between the devices and the Actigraph (intercept model) and corresponding p-values were assessed to determine whether the differences were significantly different from zero. The p-values were inferred from the authors’ statement that there were no systematic differences in the Bland–Altman plots. However, when computationally reproduced, all three models displayed a significant mean bias (p < 0.001), indicating that the mean differences between the devices and the Actigraph differed from zero.

The reported mean differences for OMR and SLW compared with the Actigraph were computationally reproduced. However, the reported mean difference between WGO and the Actigraph was not reproduced. The reported mean difference for HF WGO was -204, which matches neither Figure 3A nor the reproduced result of -604; this may have been a typo. The authors stated that they cleaned the data based on days of full wear (though no variable indicating day was provided) and removed outliers, but they did not provide sufficient detail to replicate this process. The analysis was based on mean differences reported in Table 4; however, the number of observations was not reported, making it unclear whether discrepancies were due to the use of cleaned or unprocessed data. Overall, although some models were partially reproduced, this paper was considered not computationally reproducible because the results did not support the authors’ conclusions.

Inferential reproducibility results

For the two models (OMR and SLW) that were only partially computationally reproducible, neither was considered inferentially reproducible. The difference in the range of the standardized confidence interval was 0.17 for OMR, above the 0.1 threshold for a moderate difference, and was slightly lower for SLW, reflecting artificially narrow confidence intervals due to unaccounted repeated measures. This was supported by substantial intraclass correlation coefficients (>0.25), indicating that a linear model was inappropriate. After adjusting for the mean of the two devices, there was no evidence of mean bias; however, proportional bias was present, suggesting that differences between devices varied with the magnitude of the measurement. Overall, the paper was not considered inferentially reproducible.

Recommended changes

Use mixed models to calculate Bland-Altman statistics.
Assess mean and proportional bias for the Bland-Altman plots using linear mixed models to determine whether there are systematic differences between the activity monitors and Actigraph, and update the paper's conclusions accordingly.
Evaluate the assumptions of the linear regression models by examining residuals, identifying influential outliers, and assessing multicollinearity among predictors. If any assumptions are violated, address them using appropriate methods.
Present all statistics mentioned in the methods section in either the results or supporting information.
Update the shared data so that it matches the results presented.
Consider creating a reproducible analysis workflow and sharing the code.
Include a data dictionary.

Model 1

Model results for Difference in OMR - Actigraph

Term	B	SE	Lower	Upper	t	p-value
Intercept	−475					0.05
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Difference in OMR - Actigraph

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Difference in OMR - Actigraph

Term	SS	DF	MS	F	p-value
Residuals
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Difference in OMR - Actigraph

Term	B	SE	Lower	Upper	t	p-value
Intercept	−474.804	135.368	−746.698	−202.910	−3.508	<0.001
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Difference in OMR - Actigraph

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
0.000	0.000	0.000	848.860	957.193
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Difference in OMR - Actigraph

Term	SS	DF	MS	F	p-value
Residuals	46,727,120.039	50	934,542.401
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.
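Because an intercept-only linear model is equivalent to a one-sample t-test on the differences, the reproduced intercept row can be cross-checked from the ANOVA table alone. The sketch below is a plausibility check under that equivalence, not the code used for this report; the critical value of t for 50 degrees of freedom is hard-coded as an approximation.

```python
import math

n = 51                     # observations (residual DF of 50 plus the intercept)
mean_diff = -474.804       # reproduced intercept: mean OMR - Actigraph difference
residual_ms = 934542.401   # residual mean square from the ANOVA table
t_crit = 2.0086            # approx. 97.5th percentile of t with 50 df (assumed constant)

sd = math.sqrt(residual_ms)        # residual SD equals the sample SD in this model
se = sd / math.sqrt(n)             # standard error of the mean difference
t_stat = mean_diff / se            # test of mean difference = 0
lower = mean_diff - t_crit * se    # 95% confidence limits
upper = mean_diff + t_crit * se

print(f"SD={sd:.3f} SE={se:.3f} t={t_stat:.3f} CI=({lower:.3f}, {upper:.3f})")
```

The values agree with the reproduced model table to within rounding of the hard-coded critical value.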

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Model descriptives including Cook’s distance and leverage to understand outliers

Term	N	Mean	SD	Median	Min	Max	Skewness	Kurtosis
Difference in OMR - Actigraph	51	−474.804	966.717	−246.000	−4,230.000	1,165.000	−1.368	2.865
.fitted	51	−474.804	0.000	−474.804	−474.804	−474.804
.resid	51	−0.000	966.717	228.804	−3,755.196	1,639.804	−1.368	2.865
.leverage	51	0.020	0.000	0.020	0.020	0.020	−6.932	46.020
.sigma	51	966.423	24.094	973.554	812.449	976.529	−5.200	29.852
.cooksd	51	0.020	0.046	0.006	0.000	0.308	5.025	28.245
.std.resid	51	0.000	1.010	0.239	−3.923	1.713	−1.368	2.865
dfb.1_	51	−0.003	0.153	0.033	−0.660	0.247	−1.745	4.923
dffit	51	−0.003	0.153	0.033	−0.660	0.247	−1.745	4.923
* categorical variable

Cook’s threshold

Cook’s distance measures the overall change in fit, if an observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted values will change when an observation is removed. Potential influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of parameters (including the intercept) and n is the number of observations. In practice this can flag a large number of points, so DFFIT \(\pm 1\) is often used to identify highly influential observations.

DFBETA threshold

DFBETA measures the change in a regression coefficient, in units of its standard error, when a particular observation is removed from the model. There is a DFBETA for each parameter in the model. Potential influential observations \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. In practice, DFBETA \(\pm 1\) is often used to identify outliers.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical range of 0.9 to 1.1 was used, with observations outside it flagged as having a meaningful impact on precision, although there is no agreed universal alternative cut-off.
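For this model (n = 51 observations, p = 1 parameter) the size-adjusted thresholds above can be evaluated directly. The sketch below is illustrative: the function name `influence_thresholds` is hypothetical, and the values for observation 48 are taken from the influence table reported for this model.

```python
import math


def influence_thresholds(n, p):
    """Size-adjusted cut-offs for common influence diagnostics."""
    return {
        "cooks_d": 4 / n,
        "dffits": 2 * math.sqrt(p) / math.sqrt(n),
        "dfbetas": 2 / math.sqrt(n),
        "covratio_margin": 3 * p / n,   # flag if |COVRATIO - 1| exceeds this
    }


thresholds = influence_thresholds(n=51, p=1)

# Observation 48: the most extreme point in the influence table
obs_48 = {"cooks_d": 0.308, "dffits": -0.660, "dfbetas": -0.660}

# Flags under the size-adjusted thresholds
size_adjusted = {k: abs(v) > thresholds[k] for k, v in obs_48.items()}

# Flags under the practical cut-offs quoted in the text (0.5 and +/-1)
practical_cutoffs = {"cooks_d": 0.5, "dffits": 1.0, "dfbetas": 1.0}
practical = {k: abs(v) > practical_cutoffs[k] for k, v in obs_48.items()}

print(thresholds)
print(size_adjusted)
print(practical)
```

The comparison shows why the choice of cut-off matters: observation 48 exceeds every size-adjusted threshold but none of the practical ones.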

Observations of interest identified by the influence plot

ID	StudRes	Leverage	CookD	dfb.1_	dffit
3	−0.956	0.020	0.018	−0.135	−0.135
2	−1.151	0.020	0.026	−0.163	−0.163
9	−2.237	0.020	0.093	−0.316	−0.316
48	−4.668	0.020	0.308	−0.660	−0.660
StudRes = studentized residuals; CookD = Cook's Distance, a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

While one observation had higher Cook’s distance values than the remaining observations, all values were below 0.5, and both DFBETAS and DFFITS were within conventional thresholds. The COVRATIO suggested that this observation may affect the precision of the parameter estimates if removed.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic	p-value	Method
0.134	0.2925	Exact one-sample Kolmogorov-Smirnov test

Statistic	p-value	Method
0.905	<0.001	Shapiro-Wilk normality test

Normality results

The Kolmogorov-Smirnov test supports residuals being normally distributed.
The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
QQ-plot looks roughly normal.

Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation	Statistic	p-value
0.198	1.566	0.1160

Independence results

The Durbin–Watson test suggests there are no auto-correlation issues.
The study design is not independent and should be assessed using linear mixed models or generalized estimating equations.
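The Durbin–Watson statistic depends only on the order in which residuals appear in the data file, which is why a non-significant result does not establish independence when observations are clustered within participants. A minimal sketch of the statistic follows; the residuals are made-up values for illustration, not the study data.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the residual sum of squares."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: values near 2 suggest no first-order autocorrelation,
# values near 0 positive autocorrelation, values near 4 negative autocorrelation.
alternating = [1.0, -1.0, 1.0, -1.0]
constant = [1.0, 1.0, 1.0, 1.0]
print(durbin_watson(alternating))
print(durbin_watson(constant))
```

Reordering the rows changes the statistic, whereas the clustering of days within participants is a property of the design and is better captured by the intraclass correlation from a mixed model.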

Assumption conclusions

Given that this is an intercept-only model, the assumptions assessed were normality and independence, with additional checks for outliers. The Q-Q plot suggested the data were approximately normally distributed. One small outlier was identified; it fell within acceptable limits for Cook’s distance, DFBETA, and DFFITS, but it may affect precision. The Durbin–Watson test showed no significant autocorrelation. However, the assumption of independence was violated, as participants were measured on multiple days.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Difference in OMR - Actigraph

Change in regression coefficients

term	O_B	R_B	Change.B	reproduce.B
Intercept	−475	−474.8039	0.1961	Reproduced
O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in p-values

Term	O_p	R_p	Change.p	Reproduce.p	SigChangeDirection
Intercept	0.05	<0.001	−0.0490	Not Reproduced	Non-sig to sig, B same direction
O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were >0.05 were set to 0.05 for the purposes of comparison.

Results for p-values

The p-value was not explicitly reported. Instead, non-significance was inferred from the authors’ statement that the Bland–Altman plots showed no systematic differences, and a p-value of 0.05 was therefore imputed. In contrast, the reproduced analysis produced a p-value of <0.001; therefore, the p-value was not reproduced.

Conclusion computational reproducibility

This model was found to be partially reproducible: the regression coefficient was reproduced, but the p-value was not.

Methods

This model was found to be partially reproducible. The mean difference between the devices and the Actigraph (intercept model) was successfully reproduced; however, the p-values, which the authors implied were not significant, could not be replicated. As the data appeared to be correct, an inferential reproducibility assessment was conducted to determine whether the authors should have used linear mixed models to calculate Bland–Altman statistics and plots, given that each participant was measured over multiple days. First, the data were scaled but not centered, and the initial analysis used linear regression to assess whether the intercept was equal to zero, which would indicate no mean bias. The same analysis was then performed using a linear mixed model with a random intercept. To further assess bias, a second linear mixed model was fit to examine proportional bias.

Results

The linear regression found that the OMR device, on average, recorded measurements 0.44 standard deviations lower than the Actigraph (95% CI: −0.69 to −0.19, p<0.001).

Linear regression coefficients for mean bias

Term	B	SE¹	t	95% CI Lower	95% CI Upper	p-value
(Intercept)	−0.4400	0.1254	−3.5075	−0.6920	−0.1880	<0.001
R² = 0.000; Sigma = 0.896; AIC = 137; Residual df = 50; No. Obs. = 51
¹ SE = Standard Error
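One reading of "scaled but not centered" is division by the root mean square of the differences with an n − 1 divisor, which is what R's `scale()` does when `center = FALSE`. Whether that function was used here is an assumption, but the sketch below shows that the reported standardized intercept and standard error are consistent with it, using only figures from the reproduced results for this model.

```python
import math

n = 51
mean_diff = -474.804          # unstandardized intercept
se_raw = 135.368              # unstandardized standard error
residual_ss = 46727120.039    # residual sum of squares from the ANOVA table

# Sum of squared raw differences = residual SS + n * mean^2
sum_sq = residual_ss + n * mean_diff ** 2

# Root mean square with an n - 1 divisor (assumed scaling factor)
rms = math.sqrt(sum_sq / (n - 1))

b_std = mean_diff / rms       # standardized intercept
se_std = se_raw / rms         # standardized standard error

print(f"scale={rms:.2f} B={b_std:.4f} SE={se_std:.4f}")
```

Because the data are not centered, the standardized intercept remains a test of whether the mean difference is zero; only the units change.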

A linear mixed model with a random intercept was first used to estimate the mean bias between the two devices. The outcome was the standardized difference between devices, and the model included only a fixed intercept and a random intercept, which allows each individual to start at a different baseline level of device difference, accounting for the clustering of repeated measurements within individuals. The fixed intercept estimates whether, on average, one device gives higher or lower values than the other across the sample. The analysis found that the OMR device, on average, recorded measurements 0.45 standard deviations lower than the Actigraph (95% CI: −0.81 to −0.10, p=0.014). Substantial correlation was observed for measurements within individuals, with ICC=0.354.

Linear mixed model regression coefficients for mean bias

Term	B	SE¹	t	95% CI Lower	95% CI Upper	p-value
(Intercept)	−0.4525	0.1778	−2.5455	−0.8099	−0.0951	0.014
id.SD (Intercept)	0.5413
Residual.SD (Observations)	0.7309
Sigma = 0.731; AIC = 135; Residual df = 48; No. Obs. = 51
¹ SE = Standard Error
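The reported intraclass correlation follows from the two standard deviations in the table above: the between-individual variance divided by the total variance. A short check, using the tabulated values:

```python
id_sd = 0.5413         # SD of the random intercepts (between individuals)
residual_sd = 0.7309   # residual SD (within individuals)

between_var = id_sd ** 2
within_var = residual_sd ** 2

# Share of total variance attributable to differences between individuals
icc = between_var / (between_var + within_var)

print(f"ICC = {icc:.3f}")
```

An ICC of this size means roughly a third of the variation in device differences lies between participants, which is the variation an ordinary linear model treats as independent noise.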

Reproduced linear model compared to the bootstrapped linear mixed model

Change in regression coefficients

Term	B	boot.B	B_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	−0.4400	−0.4511	0.0111	2.5300	No	No	No
B = standardized reproduced regression coefficient; boot.B = bootstrapped standardized reproduced B; B_diff = change in B - boot.B; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

Term	Lower	boot.Lower	Lower_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	−0.6920	−0.7885	0.0966	13.9600	Yes	No	No
Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized reproduced lower CI; Lower_diff = change in Lower - boot.Lower; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

Term	Upper	boot.Upper	Upper_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	−0.1880	−0.1119	−0.0761	−40.4900	Yes	No	No
Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized reproduced upper CI; Upper_diff = change in Upper - boot.Upper; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

Term	Range	boot.Range	Range_Diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.5039	0.6766	0.1727	34.2700	Yes	Yes	No
Range = standardized reproduced CI range; boot.Range = bootstrapped standardized reproduced CI range; Range_Diff = change in CI range; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

Term	p-value	boot.p-value	changep	SigChangeDirection
Intercept	<0.001	0.0108	−0.0098	Remains sig, B same direction
p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized reproduced p-value; changep = change in p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check the distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

A linear mixed model was used to assess mean and proportional bias between two devices. The outcome was the standardized difference between devices, with the standardized mean as the predictor. Random intercepts and slopes were included to allow individuals to start at different baseline levels of device difference and to have individual-specific slopes. The fixed intercept estimates overall mean bias, while the fixed slope captures proportional bias, whether the difference between devices changes with measurement magnitude.

After adjusting for the mean of the two devices, no evidence of mean bias remained; the OMR device recorded measurements 0.05 standard deviations higher than the Actigraph on average (95% CI: −0.27 to 0.37, p = 0.746). However, there was evidence of proportional bias: for every one standard deviation increase in the mean of the two devices, the difference between the OMR and Actigraph decreased by 0.60 standard deviations (95% CI: −1.09 to −0.10, p = 0.020). This suggests that the OMR tends to underestimate higher values relative to the Actigraph.

Linear mixed model regression coefficients for mean and proportional bias

Term	B	SE¹	t	95% CI Lower	95% CI Upper	p-value
(Intercept)	0.0501	0.1537	0.3263	−0.2594	0.3597	0.746
z_meanomr	−0.5966	0.2466	−2.4190	−1.0933	−0.0999	0.020
id.SD (Intercept)	0.0105
id.SD (z_meanomr)	0.6319
id.Cor (Intercept~z_meanomr)	−0.9452
Residual.SD (Observations)	0.4680
Sigma = 0.468; AIC = 105; Residual df = 45; No. Obs. = 51
¹ SE = Standard Error
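The fixed-effect estimates in the table above define a line for the expected device difference as a function of the standardized mean of the two devices. The sketch below evaluates that line at a few arbitrary, illustrative values of the predictor; it ignores the random effects and the uncertainty in the estimates, and the function name `predicted_difference` is hypothetical.

```python
intercept = 0.0501   # fixed intercept: mean bias at a standardized mean of zero
slope = -0.5966      # fixed slope: proportional bias per 1 SD increase in the mean


def predicted_difference(z_mean):
    """Expected standardized OMR - Actigraph difference (fixed effects only)."""
    return intercept + slope * z_mean


# Arbitrary predictor values chosen for illustration only
for z in (0.5, 1.0, 1.5, 2.0):
    print(f"z_mean={z:.1f} predicted difference={predicted_difference(z):.4f}")

# Predictor value at which the fitted line crosses zero difference
zero_crossing = -intercept / slope
print(f"zero crossing at z_mean={zero_crossing:.3f}")
```

The negative slope means the expected difference becomes more negative as the average step count increases, which is the pattern described as proportional bias.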

Scatterplot showing proportional bias between two devices

Assumption check

The performance package produces diagnostic plots to assess model assumptions, including residual normality, homoscedasticity, and potential non-linearity. It also provides component+residual plots for continuous predictors and an outlier plot with Mahalanobis distance contours.

The residuals were approximately normal, and although some outliers were present, Cook’s distance indicated that they were not influential. Linearity was assessed by adding a quadratic term to the model; the term was not significant, indicating that a linear specification was adequate. No funnelling was observed in the residuals, suggesting that the assumption of homoscedasticity was reasonable.

Cook’s distance

Linearity check

Inferential reproducibility conclusion based on linear mixed models

Although the mean bias difference was less than 3%, the bootstrapped confidence interval (CI) was 34% wider than the linear model’s, indicating that the linear model underestimated the standard error relative to the linear mixed model. Because the CI range differed by more than 10% in relative terms and by more than 0.1 in absolute terms (0.17), the model was not considered reproducible. There was substantial within-individual correlation (intraclass correlation coefficient [ICC] = 0.354), indicating that a linear model was not appropriate. After adjusting for the mean of the two devices, there was no evidence of mean bias; however, proportional bias was present, suggesting that the difference between devices varied with the magnitude of the measurement. Overall, the model was found not to be inferentially reproducible.

Model 2

Model results for Difference in SLW - Actigraph

Term	B	SE	Lower	Upper	t	p-value
Intercept	−474					0.05
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Difference in SLW - Actigraph

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Difference in SLW - Actigraph

Term	SS	DF	MS	F	p-value
Residuals
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Difference in SLW - Actigraph

Term	B	SE	Lower	Upper	t	p-value
Intercept	−473.529	123.814	−722.218	−224.841	−3.825	<0.001
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Model fit for Difference in SLW - Actigraph

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
0.000	0.000	0.000	839.761	875.500
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Difference in SLW - Actigraph

Term	SS	DF	MS	F	p-value
Residuals	39,091,552.706	50	781,831.054
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Model descriptives including Cook’s distance and leverage to understand outliers

Term	N	Mean	SD	Median	Min	Max	Skewness	Kurtosis
Difference in SLW - Actigraph	51	−473.529	884.212	−394.000	−2,872.000	1,243.000	−0.298	−0.473
.fitted	51	−473.529	0.000	−473.529	−473.529	−473.529
.resid	51	−0.000	884.212	79.529	−2,398.471	1,716.529	−0.298	−0.473
.leverage	51	0.020	0.000	0.020	0.020	0.020	−6.932	46.020
.sigma	51	884.134	11.866	888.351	823.430	893.177	−2.877	11.135
.cooksd	51	0.020	0.026	0.011	0.000	0.150	2.780	10.446
.std.resid	51	0.000	1.010	0.091	−2.740	1.961	−0.298	−0.473
dfb.1_	51	−0.001	0.145	0.013	−0.416	0.286	−0.347	−0.250
dffit	51	−0.001	0.145	0.013	−0.416	0.286	−0.347	−0.250
* categorical variable

Cook’s threshold

DFFIT threshold

DFFIT measures how many standard deviations the fitted values will change when an observation is removed. Potential influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of parameters (including the intercept) and n is the number of observations. In practice this can flag a large number of points, so DFFIT \(\pm 1\) is often used to identify highly influential observations.

DFBETA threshold

Influence plot

COVRATIO plot

Observations of interest identified by the influence plot

ID	StudRes	Leverage	CookD	dfb.1_	dffit
3	−1.287	0.020	0.033	−0.182	−0.182
2	−1.440	0.020	0.041	−0.204	−0.204
23	2.020	0.020	0.077	0.286	0.286
44	−2.942	0.020	0.150	−0.416	−0.416
StudRes = studentized residuals; CookD = Cook's Distance, a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

While one observation had higher Cook’s distance values than the remaining observations, all values were below 0.5, and both DFBETAS and DFFITS were within conventional thresholds. The COVRATIO suggested that this observation may affect the precision of the parameter estimates if removed.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic	p-value	Method
0.100	0.6891	Asymptotic one-sample Kolmogorov-Smirnov test

Statistic	p-value	Method
0.968	0.1751	Shapiro-Wilk normality test

Normality results

The Kolmogorov-Smirnov test (p = 0.689) supports the residuals being normally distributed.
The Shapiro-Wilk test (p = 0.175) also provides no evidence against normality.
The Q–Q plot looks roughly normal.

Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation	Statistic	p-value
0.063	1.846	0.5620

Independence results

The Durbin–Watson test suggests there are no autocorrelation issues.
However, the observations are not independent by design, because each participant was measured over multiple days; the data should be analysed using linear mixed models or generalized estimating equations.

Assumption conclusions

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Difference in SLW - Actigraph

Change in regression coefficients

term	O_B	R_B	Change.B	reproduce.B
Intercept	−474	−473.5294	0.4706	Reproduced
O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in p-values

Term	O_p	R_p	Change.p	Reproduce.p	SigChangeDirection
Intercept	0.05	<0.001	−0.0490	Not Reproduced	Non-sig to sig, B same direction
O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value R_p - O_p; Reproduce.p = p-values reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were >0.05 were set to 0.05 for the purposes of comparison.

Results for p-values

Conclusion computational reproducibility

This model was found to be partially reproducible: the regression coefficient was reproduced, but the p-value was not.

The mean difference between the SLW and the Actigraph (intercept-only model) was successfully reproduced; however, the p-value, which the authors implied was not significant, could not be reproduced. As the data appeared to be correct, an inferential reproducibility assessment was conducted to determine whether the authors should have used linear mixed models to calculate Bland–Altman statistics and plots, given that each participant was measured over multiple days. First, the data were scaled but not centred, and linear regression was used to assess whether the intercept was equal to zero, which would indicate no mean bias. The same analysis was then performed using a linear mixed model with a random intercept. Finally, a second linear mixed model was fitted to examine proportional bias.

Linear regression coefficients for mean bias

Term	B	SE¹	t	95% CI Lower	95% CI Upper	p-value
(Intercept)	−0.4711	0.1232	−3.8245	−0.7184	−0.2237	<0.001
R² = 0.000; Sigma = 0.880; AIC = 135; Residual df = 50; No. Obs. = 51
¹ SE = Standard Error

The linear regression found that the SLW device, on average, recorded measurements 0.47 standard deviations lower than the Actigraph (95% CI: –0.72 to –0.22, p < 0.001).

A linear mixed model with a random intercept was first used to estimate the mean bias between the two devices. The outcome was the standardized difference between devices, and the model included only a fixed intercept and a random intercept. The random intercept allows each individual to have a different baseline level of device difference, accounting for the clustering of repeated measurements within individuals. The fixed intercept estimates whether, on average, one device gives higher or lower values than the other across the sample. The analysis found that the SLW device, on average, recorded measurements 0.49 standard deviations lower than the Actigraph (95% CI: –0.81 to –0.16, p = 0.004). Measurements within individuals were substantially correlated (ICC = 0.26).

Linear mixed model regression coefficients for mean bias

Term	B	SE¹	t	95% CI Lower	95% CI Upper	p-value
(Intercept)	−0.4852	0.1613	−3.0081	−0.8095	−0.1609	0.004
id.SD (Intercept)	0.4508
Residual.SD (Observations)	0.7600
Sigma = 0.760; AIC = 135; Residual df = 48; No. Obs. = 51
¹ SE = Standard Error
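The reported ICC can be recovered from the two standard deviations in the table above. The following is a minimal sketch, assuming the ICC is defined as the between-individual variance divided by the total variance of a random intercept model; the function name is hypothetical.

```python
def intraclass_correlation(sd_between: float, sd_residual: float) -> float:
    """ICC for a random intercept model: between variance / total variance."""
    var_between = sd_between ** 2
    var_residual = sd_residual ** 2
    return var_between / (var_between + var_residual)


# Standard deviations copied from the linear mixed model table.
ID_SD = 0.4508        # id.SD (Intercept)
RESIDUAL_SD = 0.7600  # Residual.SD (Observations)

icc_value = intraclass_correlation(ID_SD, RESIDUAL_SD)
print(f"ICC = {icc_value:.2f}")
```

An ICC of this size means that roughly a quarter of the variance in the device difference is attributable to differences between individuals, which the ordinary linear regression ignores.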

Reproduced linear model compared to bootstrapped linear mixed model

Change in regression coefficients for intercept only model

Term	B	boot.B	B_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	−0.4400	−0.4837	0.0437	9.9300	No	No	No
B = standardized reproduced regression coefficient; boot.B = bootstrapped standardized reproduced B; B_diff = change in B - boot.B; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval for intercept only model

Term	Lower	boot.Lower	Lower_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	−0.6920	−0.7905	0.0986	14.2400	Yes	No	No
Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized reproduced lower CI; Lower_diff = change in Lower - boot.Lower; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval for intercept only model

Term	Upper	boot.Upper	Upper_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	−0.1880	−0.1754	−0.0126	−6.7100	No	No	No
Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized reproduced upper CI; Upper_diff = change in Upper - boot.Upper; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval for intercept only model

Term	Range	boot.Range	Range_Diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.5039	0.6151	0.1112	22.0600	Yes	Yes	No
Range = standardized reproduced CI range; boot.Range = bootstrapped standardized reproduced CI range; Range_Diff = change in CI range; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.
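The Diff_10%, Diff_0.1 and Diff_0.2 flags in the four tables above can be re-derived from the tabulated estimates. The sketch below is an illustration under stated assumptions, not the original analysis code: it assumes the percentage difference is the absolute difference divided by the absolute value of the linear model estimate, and it ignores the sign conventions used in the tables. The function name and dictionary keys are hypothetical.

```python
def compare_estimates(reference: float, bootstrap: float) -> dict:
    """Compare a linear model estimate with its bootstrapped mixed model counterpart."""
    abs_diff = abs(reference - bootstrap)
    # Assumption: percentage difference is relative to the linear model estimate.
    pct_diff = abs_diff / abs(reference) * 100.0
    return {
        "abs_diff": abs_diff,
        "pct_diff": pct_diff,
        "diff_10pct": pct_diff >= 10.0,
        "diff_0.1": abs_diff >= 0.1,
        "diff_0.2": abs_diff >= 0.2,
    }


# (linear model estimate, bootstrapped estimate) copied from the tables.
estimates = {
    "B": (-0.4400, -0.4837),
    "Lower": (-0.6920, -0.7905),
    "Upper": (-0.1880, -0.1754),
    "Range": (0.5039, 0.6151),
}

results = {name: compare_estimates(ref, boot) for name, (ref, boot) in estimates.items()}

for name, res in results.items():
    print(
        f"{name}: diff={res['abs_diff']:.4f}, pct={res['pct_diff']:.2f}%, "
        f">=10%: {res['diff_10pct']}, >=0.1: {res['diff_0.1']}, >=0.2: {res['diff_0.2']}"
    )
```

Because the inputs are rounded to four decimals, the recomputed differences may differ from the tabulated values in the last digit, but the flags agree with those shown in the tables.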

Change in p-value significance and regression coefficient direction for intercept only model

Term	p-value	boot.p-value	changep	SigChangeDirection
Intercept	<0.001	0.0024	−0.0014	Remains sig, B same direction
p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized reproduced p-value; changep = change in p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check the distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centred near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multi-modality was observed.

A linear mixed model was used to assess mean and proportional bias between the two devices. The outcome was the standardized difference between devices, with the standardized mean of the two devices as the predictor. Random intercepts and slopes were initially included to allow individuals to start at different baseline levels of device difference and to have individual-specific slopes. However, there was not enough variance to fit a random slope model, so a simpler random intercept model was used. The fixed intercept estimates overall mean bias, while the fixed slope captures proportional bias, that is, whether the difference between devices changes with measurement magnitude.

After adjusting for the mean of the two devices, no evidence of mean bias remained; the SLW device recorded measurements 0.07 standard deviations higher than the Actigraph on average (95% CI: –0.40 to 0.54, p = 0.770). However, there was evidence of proportional bias: for every one standard deviation increase in the mean of the two devices, the difference between the SLW and Actigraph decreased by 0.65 standard deviations, indicating that the SLW tends to underestimate at higher values compared to the Actigraph.

Linear mixed model regression coefficients for mean and proportional bias

Term	B	SE¹	t	95% CI Lower	95% CI Upper	p-value
(Intercept)	0.0754	0.2278	0.3309	−0.3829	0.5337	0.742
z_meanslw	−0.6521	0.2141	−3.0453	−1.0829	−0.2213	0.004
id.SD (Intercept)	0.3388
Residual.SD (Observations)	0.7147
Sigma = 0.715; AIC = 127; Residual df = 47; No. Obs. = 51
¹ SE = Standard Error
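To make the proportional bias concrete, the fixed effects in the table above can be used to predict the standardized device difference at different values of the standardized mean. This is a minimal sketch: the coefficients are copied from the table, while the z values are hypothetical and chosen purely for illustration.

```python
# Fixed effects copied from the linear mixed model table.
INTERCEPT = 0.0754  # (Intercept)
SLOPE = -0.6521     # z_meanslw


def predicted_difference(z_mean: float) -> float:
    """Predicted standardized difference (SLW - Actigraph) at a given standardized mean."""
    return INTERCEPT + SLOPE * z_mean


# Hypothetical values of the standardized mean of the two devices.
for z in (0.0, 0.5, 1.0, 1.5):
    print(f"z_mean = {z:.1f} -> predicted difference = {predicted_difference(z):+.3f}")
```

The predicted difference becomes more negative as the standardized mean increases, which is the pattern described as proportional bias.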

Scatterplot showing proportional bias between two devices

Assumption check

The residuals were approximately normal, and although some outliers were present, Cook’s distance indicated that they were not influential. Linearity was assessed by adding quadratic and cubic terms to the model; neither term improved the model, indicating that a linear specification was adequate. Residuals showed only mild variation in spread across fitted values; however, this was small and not considered to materially affect inference. Therefore, the assumption of homoscedasticity was judged to be reasonable.

Residual plots

Cook's distance

Linearity check

Inferential reproducibility conclusion based on linear mixed models

Although the difference in mean bias was slightly less than 10%, the bootstrapped confidence interval (CI) was 22% wider than the linear model's, indicating that the linear model underestimated the standard error relative to the linear mixed model. Because the difference in CI range exceeded 0.1, the model was not considered reproducible. There was substantial within-individual correlation (intraclass correlation coefficient [ICC] = 0.26), indicating that a linear model was not appropriate. After adjusting for the mean of the two devices, there was no evidence of mean bias; however, proportional bias was present, suggesting that the difference between devices varied with the magnitude of the measurement. Overall, the model was found not to be inferentially reproducible.

Model 3

Model results for Difference in WGO - Actigraph

Term	B	SE	Lower	Upper	t	p-value
Intercept	−204					0.05
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Difference in WGO - Actigraph

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Difference in WGO - Actigraph

Term	SS	DF	MS	F	p-value
Residuals
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Difference in WGO - Actigraph

Term	B	SE	Lower	Upper	t	p-value
Intercept	−604.167	121.068	−846.999	−361.334	−4.990	<0.001
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Difference in WGO - Actigraph

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
0.000	0.000	0.000	889.648	881.392
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Difference in WGO - Actigraph

Term	SS	DF	MS	F	p-value
Residuals	41,949,965.500	53	791,508.783
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.
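The reproduced standard error, t statistic and RMSE for this intercept-only model can be checked against the ANOVA table above. The sketch below assumes that the number of observations equals the residual degrees of freedom plus one (the intercept) and that the RMSE divides the residual sum of squares by n rather than by the residual degrees of freedom; the variable names are hypothetical.

```python
import math

# Values copied from the reproduced model and ANOVA tables.
SS_RESIDUAL = 41_949_965.5
DF_RESIDUAL = 53
INTERCEPT = -604.167

# Assumption: intercept-only model, so n = residual df + 1.
n_obs = DF_RESIDUAL + 1

mean_square = SS_RESIDUAL / DF_RESIDUAL        # MS = SS / DF
residual_sd = math.sqrt(mean_square)           # residual standard deviation
standard_error = residual_sd / math.sqrt(n_obs)  # SE of the mean difference
t_statistic = INTERCEPT / standard_error
rmse = math.sqrt(SS_RESIDUAL / n_obs)          # assumption: RMSE uses n, not df

print(f"MS = {mean_square:.3f}")
print(f"SE = {standard_error:.3f}")
print(f"t = {t_statistic:.3f}")
print(f"RMSE = {rmse:.3f}")
```

Under these assumptions the recomputed values agree with the tabulated SE, t and RMSE to the reported precision, so the reproduced model is internally consistent.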

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Difference in WGO - Actigraph

Change in regression coefficients

term	O_B	R_B	Change.B	reproduce.B
Intercept	−204	−604.1667	−400.1667	Not Reproduced
O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in p-values

Term	O_p	R_p	Change.p	Reproduce.p	SigChangeDirection
Intercept	0.05	<0.001	−0.0490	Not Reproduced	Non-sig to sig, B same direction
O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value R_p - O_p; Reproduce.p = p-values reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were >0.05 were set to 0.05 for the purposes of comparison.

Results for p-values

The p-value was not explicitly reported. Instead, the authors implied non-significance by stating that Bland–Altman plots showed no systematic differences, so a p-value of 0.05 was imputed for comparison. In contrast, the reproduced analysis yielded a p-value of <0.001; accordingly, the p-value was not reproduced.

Conclusion computational reproducibility

This model was not reproducible, as the regression coefficient and p-value could not be reproduced.

As this model was not computationally reproducible, inferential reproducibility was not assessed, and the statistical assumptions could not be meaningfully compared or interpreted.