Paper 41: Weight loss is associated with improved quality of life among rural women completers of a web-based lifestyle intervention

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

March 28, 2026

References

Hageman PA, Mroz JE, Yoerger MA, Pullen CH (2019) Weight loss is associated with improved quality of life among rural women completers of a web-based lifestyle intervention. PLoS ONE 14(11): e0225446. https://doi.org/10.1371/journal.pone.0225446

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability, and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.
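The categorisation described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify` and the one-unit tolerance separating Incorrect Rounding from Not Reproduced are assumptions, because the exact thresholds are not spelled out in this report. The example values are the Model 1 coefficient for percentage weight change and the Model 2 p-value reported later in this document.

```python
from decimal import Decimal, ROUND_HALF_UP


def classify(original: str, reproduced: float) -> str:
    """Compare a published value (as printed) with a reproduced value."""
    orig = Decimal(original)
    # One unit in the last reported decimal place, e.g. 0.01 for "0.13".
    unit = Decimal(1).scaleb(orig.as_tuple().exponent)
    rounded = Decimal(str(reproduced)).quantize(unit, rounding=ROUND_HALF_UP)
    if rounded == orig:
        return "Reproduced"
    # Assumed tolerance (not from the report): off by one unit in the last place.
    if abs(rounded - orig) <= unit:
        return "Incorrect Rounding"
    return "Not Reproduced"


print(classify("0.13", 0.1326))  # Reproduced
print(classify("0.67", 0.5734))  # Not Reproduced
```

Passing the published value as a string keeps the number of reported decimal places, which a float would lose.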

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.
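As a rough sketch of how these thresholds could be applied, the function below flags a change between a reproduced standardized estimate and its sensitivity-analysis counterpart. The name `flag_change` and the choice of the reproduced estimate as the denominator for the relative change are assumptions; the report does not state which denominator was used. The example values are taken from the Model 1 bootstrap tables.

```python
def flag_change(b: float, boot_b: float) -> dict:
    """Flag relative (>=10%) and absolute (>=0.1, >=0.2) changes."""
    diff = b - boot_b
    # Assumption: relative change is expressed against |b|.
    pct = 100 * diff / abs(b) if b != 0 else float("inf")
    return {
        "diff": diff,
        "pct": pct,
        "diff_10pct": abs(pct) >= 10,
        "diff_0.1": abs(diff) >= 0.1,
        "diff_0.2": abs(diff) >= 0.2,
    }


# Coefficient for z_agPerWtChange: small change on both scales.
coef = flag_change(0.1470, 0.1507)
# Lower confidence limit for z_agmodup_act: more than 10% in relative
# terms, but far below 0.1 in absolute terms.
lower = flag_change(-0.0829, -0.0743)
print(coef["diff_10pct"], lower["diff_10pct"], lower["diff_0.1"])
```

The second example shows why both scales are reported: estimates close to zero can change by more than 10% while the absolute change remains negligible.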

Results from the reproduction of the Hageman et al. (2019) paper are presented below. An overall summary of results is presented first, followed by model-specific results organised within tab panels. Within each panel, the Original results tab displays the linear regression outputs extracted from the published paper. The Reproduced results tab presents estimates derived from the authors’ shared data, along with a comprehensive assessment of linear regression assumptions. The Differences tab compares the original and reproduced models to assess computational reproducibility. Finally, the Sensitivity analysis tab evaluates inferential reproducibility by examining whether identified assumption violations meaningfully affected the results.

Summary from statistical review

This paper examined weight change and health quality of life (HQOL) in a population after a health intervention. Multivariable linear regression was used to determine whether weight loss was associated with improvements in HQOL across seven health domains. The authors checked the raw data for normality but did not mention other assumptions, outliers or collinearity. The direction, but not the size, of the regression coefficients was interpreted.

Data availability and software used

The authors provide a wide-format SPSS dataset in the supporting information, which has an in-built data dictionary. The analyses were conducted using SPSS.

Regression sample

Seven multivariable linear regression models were reported. Three outcomes (depression, sleep disturbance, and fatigue) were selected at random for assessment. The primary predictor was percentage change in weight, with models adjusted for age, number of comorbidities, change in physical activity from baseline, intervention group, and baseline outcome scores. The authors did not report results for all covariates included in the models, presenting estimates only for the primary variable of interest, percentage weight loss.

Computational reproducibility results

This paper was computationally reproducible. All reported statistics for the depression model were successfully reproduced, and the other two models were mostly reproducible. The p-value for percentage weight loss in the sleep-disturbance model was not reproduced. However, this discrepancy was considered a typographical error, as the reported value of 0.67 was reproduced as 0.57 and did not alter the statistical significance of the result. A minor rounding discrepancy was observed in one coefficient in the fatigue model, and R² instead of adjusted R² was mistakenly reported for this model.

Inferential reproducibility results

All three models were inferentially reproducible. Residual diagnostics did not indicate major violations of linear model assumptions; however, structured residual patterns were observed, consistent with an ordinal measure being treated as continuous and the limited variation in baseline scores. A small number of larger standardized residuals were also present, which may contribute to wider confidence intervals but did not materially affect point estimates or inference. Although some statistics differed by 10% or more between models, these differences were not considered meaningful, as changes in standardized regression coefficients were less than 0.10. The direction of effects and statistical significance remained consistent between the reproduced models and the bootstrapped sensitivity analyses.

Recommended changes

Provide tables in the Supporting Information that present all analyses conducted in the paper, including full model outputs such as regression coefficients and all variables used for adjustment.
Correct the error in the p-value in the sleep-disturbance model.
Report adjusted R² in the fatigue model.
Evaluate the assumptions of the linear regression models by examining residuals, identifying influential outliers, and assessing multicollinearity among predictors. If any assumptions are violated, address them using appropriate methods.

Model 1

Model results for Depression change

Term	B	Lower	Upper	p-value
Intercept
agPerWtChange	0.13	0.03	0.24	0.01
Age
comorbcatsum
agmodup_act
group:
Discussion – Internet Only
E-Mail – Internet Only
adepsc
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Depression change

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
		0.30			14.09
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Depression change

Term	SS	DF	MS	F	p-value
agPerWtChange
Age
comorbcatsum
agmodup_act
group
adepsc
Residuals
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Depression change

Term	B	SE	Lower	Upper	t	p-value
Intercept	35.127	4.400	26.452	43.803	7.983	<0.001
agPerWtChange	0.133	0.054	0.027	0.238	2.473	0.0142
Age	−0.233	0.062	−0.355	−0.112	−3.785	<0.001
comorbcatsum	1.259	0.329	0.610	1.908	3.824	<0.001
agmodup_act	0.013	0.022	−0.031	0.056	0.580	0.5624
group:
Discussion – Internet Only	−0.352	0.976	−2.275	1.571	−0.361	0.7185
E-Mail – Internet Only	0.771	0.964	−1.130	2.672	0.800	0.4248
adepsc	−0.501	0.060	−0.619	−0.383	−8.352	<0.001
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.
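The confidence limits in the table above are consistent with the usual Wald interval, B ± t × SE. The sketch below checks this for three rows. The critical value 1.9715 is an approximate 97.5th percentile of the t distribution with 207 residual degrees of freedom; it is an assumption supplied for illustration and is not reported in the source. Because B and SE are rounded to three decimals, agreement is only expected to within about 0.002.

```python
T_CRIT = 1.9715  # approximate t quantile for 207 df (assumed, not from the report)


def wald_ci(b: float, se: float, t_crit: float = T_CRIT) -> tuple:
    """Return the lower and upper limits of b +/- t_crit * se."""
    half_width = t_crit * se
    return b - half_width, b + half_width


# term: (B, SE, reported lower, reported upper), copied from the table above
rows = {
    "agPerWtChange": (0.133, 0.054, 0.027, 0.238),
    "Age": (-0.233, 0.062, -0.355, -0.112),
    "adepsc": (-0.501, 0.060, -0.619, -0.383),
}

for term, (b, se, lo_rep, up_rep) in rows.items():
    lo, up = wald_ci(b, se)
    print(f"{term}: computed ({lo:.3f}, {up:.3f}), reported ({lo_rep}, {up_rep})")
```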

Fit statistics for Depression change

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
0.568	0.323	0.300	1,373.339	5.658	14.087	7	207	<0.001
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Depression change

Term	SS	DF	MS	F	p-value
agPerWtChange	203.349	1	203.349	6.116	0.0142
Age	476.181	1	476.181	14.323	<0.001
comorbcatsum	486.085	1	486.085	14.621	<0.001
agmodup_act	11.191	1	11.191	0.337	0.5624
group	44.457	2	22.229	0.669	0.5135
adepsc	2,318.876	1	2,318.876	69.748	<0.001
Residuals	6,881.993	207	33.246
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.
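The reproduced fit statistics and ANOVA table for the depression model are linked by standard identities, which gives a quick internal consistency check. The sketch below recomputes several of them from the rounded values shown above, so small discrepancies are expected. The observation that RMSE matches a divisor of n rather than the residual degrees of freedom is an inference from the numbers, not something stated in the report.

```python
import math

n, k = 215, 7        # observations; model degrees of freedom (DF1)
df2 = n - k - 1      # residual degrees of freedom (DF2) -> 207
r2 = 0.323           # reproduced R2
ss_resid = 6881.993  # residual sum of squares from the ANOVA table

r = math.sqrt(r2)                         # multiple correlation R
r2_adj = 1 - (1 - r2) * (n - 1) / df2     # adjusted R2
f_global = (r2 / k) / ((1 - r2) / df2)    # global F from R2
ms_resid = ss_resid / df2                 # residual mean square
rmse = math.sqrt(ss_resid / n)            # matches the reported RMSE
f_from_t = 2.473 ** 2                     # 1-df type III F equals t squared

print(round(r, 3), round(r2_adj, 3), round(f_global, 2))
print(round(ms_resid, 3), round(rmse, 3), round(f_from_t, 3))
```

The global F computed from the rounded R² differs slightly from the reported 14.087, which is expected given that R² is only available to three decimals.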

Visualisation of regression model

The blue line shows the line of best fit, with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non-linear relationships

Term	Statistic	p-value	Results
agPerWtChange	0.449	0.6539	No linearity violation
Age	−0.277	0.7820	No linearity violation
comorbcatsum	0.381	0.7033	No linearity violation
agmodup_act	−0.789	0.4309	No linearity violation
group
adepsc	1.293	0.1973	No linearity violation
Tukey test	0.211	0.8329	No linearity violation
Specification tests for predictors use quadratic terms; curvature in the fitted values is tested using Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

No linearity violation was observed in either plots or tests.

Testing for homoscedasticity

Statistic	p-value	Parameter	Method
9.747	0.2034	7	studentized Breusch-Pagan test

Homoscedasticity results

The studentized Breusch-Pagan test supports homoscedasticity.
There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.

Model descriptives including Cook’s distance and leverage to understand outliers

Term	N	Mean	SD	Median	Min	Max	Skewness	Kurtosis
Depression change	215	0.020	6.890	0.000	−16.300	32.300	0.707	2.309
agPerWtChange	215	−4.471	7.648	−2.879	−35.702	12.220	−0.996	1.650
Age	215	54.660	6.771	55.000	40.000	69.000	0.001	−0.669
comorbcatsum	215	1.474	1.271	1.000	0.000	5.000	0.778	−0.073
agmodup_act	215	−2.098	18.624	−0.952	−71.016	47.405	−0.404	1.480
group*	215	1.963	0.825	2.000	1.000	3.000	0.069	−1.535
adepsc	215	47.368	6.722	49.000	41.000	62.200	0.394	−1.337
.fitted	215	0.020	3.914	0.896	−9.724	7.592	−0.261	−0.884
.resid	215	−0.000	5.671	−0.974	−11.056	25.929	0.843	1.769
.leverage	215	0.037	0.016	0.034	0.015	0.138	2.200	8.297
.sigma	215	5.766	0.028	5.774	5.479	5.780	−6.208	53.826
.cooksd	215	0.005	0.010	0.002	0.000	0.098	5.621	44.016
.std.resid	215	0.000	1.002	−0.172	−1.949	4.580	0.841	1.752
dfb.1_	215	0.000	0.088	−0.003	−0.245	0.666	2.547	16.953
dfb.aPWC	215	−0.000	0.069	0.001	−0.526	0.196	−2.301	15.355
dfb.Age	215	−0.000	0.083	−0.002	−0.617	0.304	−1.833	15.932
dfb.cmrb	215	0.000	0.068	0.001	−0.294	0.240	−0.333	3.575
dfb.agm_	215	−0.000	0.061	0.001	−0.315	0.268	−1.033	7.011
dfb.grpD	215	0.000	0.066	−0.001	−0.223	0.317	0.433	3.726
dfb.gE.M	215	0.000	0.074	0.000	−0.235	0.388	0.920	4.978
dfb.adps	215	−0.000	0.068	0.012	−0.309	0.212	−0.851	3.045
dffit	215	0.001	0.199	−0.032	−0.440	0.931	0.911	2.163
cov.r	215	1.041	0.068	1.058	0.459	1.134	−4.218	27.075
* categorical variable

Cook’s threshold

Cook’s distance measures the overall change in fit, if the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted values will change when the ith observation is removed. Potential influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can flag a large number of points, so a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each parameter in the model. Potential influential observations are identified by \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model. A practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. As a practical alternative, observations with COVRATIO outside the range 0.9 to 1.1 were flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.
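To make the cut-offs concrete, the sketch below evaluates the four conventional thresholds for Model 1 (n = 215 observations and p = 8 parameters: the intercept, five single-coefficient terms and two group contrasts) and compares them with the largest absolute values reported in the descriptives table. The dictionary names are illustrative only.

```python
import math

n, p = 215, 8  # Model 1: observations; parameters including the intercept

conventional = {
    "cooks_d": 4 / n,
    "dffits": 2 * math.sqrt(p) / math.sqrt(n),
    "dfbetas": 2 / math.sqrt(n),
    "covratio_dev": 3 * p / n,  # applies to |COVRATIO - 1|
}
practical = {"cooks_d": 0.5, "dffits": 1.0, "dfbetas": 1.0, "covratio_dev": 0.1}

# Largest absolute values in the Model 1 descriptives table.
observed_max = {
    "cooks_d": 0.098,
    "dffits": 0.931,
    "dfbetas": 0.666,
    "covratio_dev": abs(0.459 - 1),
}

for name in conventional:
    print(
        f"{name}: conventional {conventional[name]:.3f}, "
        f"exceeds conventional: {observed_max[name] > conventional[name]}, "
        f"exceeds practical: {observed_max[name] > practical[name]}"
    )
```

With these inputs every measure exceeds its conventional cut-off, but only COVRATIO exceeds the practical cut-off, which is in line with the conclusion that influential points are more likely to affect precision than point estimates.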

Observations of interest identified by the influence plot

ID	StudRes	Leverage	CookD	dfb.1_	dfb.aPWC	dfb.Age	dfb.cmrb	dfb.agm_	dfb.grpD	dfb.gE.M	dfb.adps	dffit	cov.r
20	−0.269	0.086	0.001	−0.020	0.018	0.004	−0.001	0.074	0.016	0.019	0.024	−0.082	1.134
97	3.133	0.029	0.035	−0.245	0.035	0.279	−0.141	−0.216	0.033	0.315	0.063	0.543	0.738
157	1.661	0.138	0.055	−0.139	−0.526	−0.040	0.108	−0.315	0.005	0.123	0.197	0.666	1.085
131	4.820	0.036	0.098	0.666	0.068	−0.617	0.115	−0.182	−0.038	0.388	−0.309	0.931	0.459
StudRes = studentized residuals; CookD = Cook's Distance a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

Two observations had studentized residuals > 3. Both had low leverage and small Cook’s distance, with DFBETAS and DFFITS within conventional ranges. The COVRATIO indicated observations that may affect confidence interval widths.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic	p-value	Method
0.082	0.1083	Asymptotic one-sample Kolmogorov-Smirnov test

Statistic	p-value	Method
0.960	<0.001	Shapiro-Wilk normality test

Normality results

The Kolmogorov-Smirnov test supports residuals being normally distributed.
The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
The Q–Q plot looks roughly normal.

Assessing collinearity with VIF

Term	VIF	Tolerance
agPerWtChange	1.040	0.961
Age	1.058	0.945
comorbcatsum	1.061	0.942
agmodup_act	1.038	0.963
group	1.015	0.985
adepsc	1.023	0.977
VIF = Variance Inflation Factor.

Collinearity results

All VIF values are under three, indicating no collinearity issues.
Overall, when taking into account VIF and SE, the model does not have collinearity issues.
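Tolerance is the reciprocal of VIF, and 1 minus tolerance is the share of a predictor's variance explained by the other predictors. The short sketch below recomputes tolerance from the VIF values in the table; because the VIFs are rounded to three decimals, agreement with the tabulated tolerance is only expected to within about 0.001.

```python
# VIF and tolerance values copied from the collinearity table above.
vif = {
    "agPerWtChange": 1.040,
    "Age": 1.058,
    "comorbcatsum": 1.061,
    "agmodup_act": 1.038,
    "group": 1.015,
    "adepsc": 1.023,
}
reported_tolerance = {
    "agPerWtChange": 0.961,
    "Age": 0.945,
    "comorbcatsum": 0.942,
    "agmodup_act": 0.963,
    "group": 0.985,
    "adepsc": 0.977,
}

tolerance = {term: 1 / value for term, value in vif.items()}
shared_variance = {term: 1 - tol for term, tol in tolerance.items()}

for term in vif:
    print(f"{term}: tolerance {tolerance[term]:.3f}, shared variance {shared_variance[term]:.3f}")
```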

Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation	Statistic	p-value
−0.063	2.123	0.3800

Independence results

The Durbin–Watson test suggests there are no auto-correlation issues.
Although the study design was longitudinal, change scores were used, so each participant contributes a single observation; therefore, there is no violation of independence.
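The Durbin–Watson statistic is approximately 2(1 − r), where r is the lag-1 autocorrelation of the residuals, so values near 2 correspond to little autocorrelation. The sketch below shows the definition and checks the approximation against the values in the table. The function name `durbin_watson` is illustrative, and the short residual vector is made up purely to exercise the function.

```python
def durbin_watson(resid: list) -> float:
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(resid, resid[1:]))
    denominator = sum(e * e for e in resid)
    return numerator / denominator


# Approximation from the reported lag-1 autocorrelation of -0.063.
approx_dw = 2 * (1 - (-0.063))
print(round(approx_dw, 3))  # close to the reported statistic of 2.123

# Made-up residuals that alternate in sign (strong negative autocorrelation).
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))
```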

Assumption conclusions

Residual diagnostics did not indicate major violations of linear model assumptions; however, structured residual patterns were observed, consistent with an ordinal measure being treated as continuous and the limited variation in baseline depression scores. Outlier diagnostics indicated that point estimates were unlikely to be substantially affected by influential points, but confidence-interval width could be affected and should be further investigated.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Depression change

Change in regression coefficients

term	O_B	R_B	Change.B	reproduce.B
Intercept		35.1272
agPerWtChange	0.13	0.1326	0.0026	Reproduced
Age		−0.2332
comorbcatsum		1.2586
agmodup_act		0.0127
group:
Discussion – Internet Only		−0.3521
E-Mail – Internet Only		0.7711
adepsc		−0.5010
O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

term	O_lower	R_lower	Change.lci	Reproduce.lower
Intercept		26.4519
agPerWtChange	0.03	0.0269	−0.0031	Reproduced
Age		−0.3546
comorbcatsum		0.6097
agmodup_act		−0.0306
group:
Discussion – Internet Only		−2.2754
E-Mail – Internet Only		−1.1297
adepsc		−0.6193
O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; Change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

term	O_upper	R_upper	Change.uci	Reproduce.upper
Intercept		43.8025
agPerWtChange	0.24	0.2382	−0.0018	Reproduced
Age		−0.1117
comorbcatsum		1.9076
agmodup_act		0.0561
group:
Discussion – Internet Only		1.5711
E-Mail – Internet Only		2.6718
adepsc		−0.3827
O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; Change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in Adjusted R²

O_R2Adj	R_R2Adj	Change.R2Adj	Reproduce.R2Adj
0.300	0.2998	−0.0002	Reproduced
O_R2Adj = original R2 Adjusted; R_R2Adj = reproduced R2 Adjusted; Change.R2Adj = change in R2Adj (R_R2Adj - O_R2Adj); Reproduce.R2Adj = R2 Adjusted reproduced.

Change in global F

Term	O_F	R_F	Change.F	Reproduce.F
Intercept	14.09	14.0872	−0.0028	Reproduced
O_F = original global F; R_F = reproduced global F; Change.F = change in R_F - O_F; Reproduce.F = Global F reproduced.

Change in p-values

Term	O_p	R_p	Change.p	Reproduce.p	SigChangeDirection
Intercept		<0.001
agPerWtChange	0.01	0.0142	0.0042	Reproduced	Remains sig, B same direction
Age		<0.001
comorbcatsum		<0.001
agmodup_act		0.5624
group:
Discussion – Internet Only		0.7185
E-Mail – Internet Only		0.4248
adepsc		<0.001
O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p - O_p); Reproduce.p = p-values reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note: p-values that were <0.001 were set to 0.00099 for the purposes of comparison.
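The comparison columns in this table can be sketched as follows. The substitution of 0.00099 for "<0.001" follows the table note; the 0.05 significance level, the function names, and the labels for cases where significance or direction changes are assumptions, since only the "Remains ..., B same direction" labels appear in this report.

```python
def parse_p(text: str) -> float:
    """Convert a printed p-value to a number, using 0.00099 for '<0.001'."""
    text = text.strip()
    return 0.00099 if text == "<0.001" else float(text)


def sig_change(p_old: float, p_new: float, b_old: float, b_new: float,
               alpha: float = 0.05) -> str:
    """Describe whether significance and coefficient sign are preserved."""
    old_sig, new_sig = p_old < alpha, p_new < alpha
    if old_sig == new_sig:
        status = "Remains sig" if old_sig else "Remains non-sig"
    else:
        # Hypothetical labels: these cases do not occur in this report.
        status = "Becomes sig" if new_sig else "Becomes non-sig"
    same_sign = (b_old >= 0) == (b_new >= 0)
    direction = "B same direction" if same_sign else "B changes direction"
    return f"{status}, {direction}"


# agPerWtChange in the depression model: original versus reproduced.
o_p, r_p = parse_p("0.01"), parse_p("0.0142")
print(round(r_p - o_p, 4))                  # Change.p
print(sig_change(o_p, r_p, 0.13, 0.1326))   # SigChangeDirection
```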

Results for p-values

The p-value was reproduced.

Conclusion computational reproducibility

This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, residual diagnostics indicated a small number of observations that may contribute to wider confidence intervals. All continuous variables in the model were standardized, and inference was assessed using bootstrapped standardized regression coefficients and their corresponding 95% confidence intervals. Percentage and absolute changes in estimates and confidence-interval ranges relative to the original linear model were summarised using thresholds of a 10% relative change and absolute differences in standardized coefficients of 0.10 and 0.20. Consistency of coefficient direction and statistical significance was also evaluated.

Bootstrapped results

A non-parametric bootstrap with bias-corrected and accelerated (BCa) confidence intervals was performed using 10,000 resamples.
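The resampling idea can be illustrated with a deliberately simplified sketch. It is not the analysis used in this report: it bootstraps the slope of a single-predictor regression on synthetic data, uses 2,000 rather than 10,000 resamples, and reports a plain percentile interval in place of the BCa interval, which additionally corrects for bias and skewness. All names and data below are hypothetical.

```python
import random


def slope(xs: list, ys: list) -> float:
    """Least-squares slope of y on x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def bootstrap_slope_ci(xs, ys, n_boot=2000, alpha=0.05, seed=1):
    """Percentile interval from case resampling (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:  # skip degenerate resamples with no spread in x
            continue
        estimates.append(slope(bx, by))
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Synthetic standardized data with a modest true slope (hypothetical values).
data_rng = random.Random(0)
x = [data_rng.gauss(0, 1) for _ in range(200)]
y = [0.15 * xi + data_rng.gauss(0, 1) for xi in x]

estimate = slope(x, y)
ci_lower, ci_upper = bootstrap_slope_ci(x, y)
print(f"slope {estimate:.3f}, percentile 95% CI ({ci_lower:.3f}, {ci_upper:.3f})")
```

Resampling whole rows keeps each outcome paired with its predictors and does not rely on normally distributed residuals, which is why this kind of interval is a useful check when a few large residuals are present.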

Change in regression coefficients

Term	B	boot.B	B_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	−0.0252	−0.0263	0.0011	4.2500	No	No	No
z_agPerWtChange	0.1470	0.1507	−0.0038	−2.5600	No	No	No
z_Age	−0.2294	−0.2296	0.0002	0.0900	No	No	No
z_comorbcatsum	0.2317	0.2309	0.0008	0.3600	No	No	No
z_agmodup_act	0.0346	0.0362	−0.0017	−4.8300	No	No	No
groupDiscussion	−0.0511	−0.0491	−0.0020	−3.9800	No	No	No
groupE-Mail	0.1119	0.1123	−0.0004	−0.3300	No	No	No
z_adepsc	−0.4944	−0.4965	0.0021	0.4300	No	No	No
B = standardized reproduced regression coefficient; boot.B = bootstrapped standardized reproduced B; B_diff = change in B - boot.B; %_Diff = percentage difference, percentage changes were truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

Term	Lower	boot.Lower	Lower_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	−0.2149	−0.1854	−0.0295	−13.7300	Yes	No	No
z_agPerWtChange	0.0298	0.0315	−0.0017	−5.6200	No	No	No
z_Age	−0.3489	−0.3750	0.0261	7.4700	No	No	No
z_comorbcatsum	0.1122	0.1172	−0.0050	−4.4100	No	No	No
z_agmodup_act	−0.0829	−0.0743	−0.0087	−10.4400	Yes	No	No
groupDiscussion	−0.3302	−0.3154	−0.0148	−4.4700	No	No	No
groupE-Mail	−0.1640	−0.1546	−0.0093	−5.6900	No	No	No
z_adepsc	−0.6111	−0.6111	0.0000	0.0000	No	No	No
Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized reproduced lower CI; Lower_diff = change in Lower - boot.Lower; %_Diff = percentage difference, percentage changes were truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

Term	Upper	boot.Upper	Upper_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.1646	0.1463	0.0183	11.0900	Yes	No	No
z_agPerWtChange	0.2641	0.2551	0.0090	3.4200	No	No	No
z_Age	−0.1099	−0.0997	−0.0102	−9.2600	No	No	No
z_comorbcatsum	0.3511	0.3491	0.0021	0.5800	No	No	No
z_agmodup_act	0.1521	0.1284	0.0237	15.5700	Yes	No	No
groupDiscussion	0.2280	0.2098	0.0182	7.9700	No	No	No
groupE-Mail	0.3878	0.4137	−0.0259	−6.6800	No	No	No
z_adepsc	−0.3777	−0.3897	0.0120	3.1800	No	No	No
Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized reproduced upper CI; Upper_diff = change in Upper - boot.Upper; %_Diff = percentage difference, percentage changes were truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

Term	Range	boot.Range	Range_Diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.3795	0.3317	−0.0478	−12.5900	Yes	No	No
z_agPerWtChange	0.2343	0.2236	−0.0107	−4.5700	No	No	No
z_Age	0.2390	0.2752	0.0362	15.1600	Yes	No	No
z_comorbcatsum	0.2389	0.2319	−0.0070	−2.9300	No	No	No
z_agmodup_act	0.2350	0.2026	−0.0323	−13.7600	Yes	No	No
groupDiscussion	0.5582	0.5253	−0.0329	−5.9000	No	No	No
groupE-Mail	0.5517	0.5683	0.0166	3.0000	No	No	No
z_adepsc	0.2334	0.2214	−0.0120	−5.1400	No	No	No
Range = standardized reproduced CI range; boot.Range = bootstrapped standardized reproduced CI range; Range_Diff = change in CI range (boot.Range - Range); %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

Term	p-value	boot.p-value	changep	SigChangeDirection
Intercept	0.7937	0.7563	0.0375	Remains non-sig, B same direction
z_agPerWtChange	0.0142	0.0085	0.0057	Remains sig, B same direction
z_Age	<0.001	0.0011	−0.0009	Remains sig, B same direction
z_comorbcatsum	<0.001	<0.001	0.0001	Remains sig, B same direction
z_agmodup_act	0.5624	0.4808	0.0816	Remains non-sig, B same direction
groupDiscussion	0.7185	0.7126	0.0058	Remains non-sig, B same direction
groupE-Mail	0.4248	0.4397	−0.0149	Remains non-sig, B same direction
z_adepsc	<0.001	<0.001	0.0000	Remains sig, B same direction
p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized reproduced p-value; changep = change in p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check the distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Conclusions based on the bootstrapped model

This model was inferentially reproducible. While some statistics changed by 10% or more, these differences were not meaningful, with a change in standardized regression coefficients of less than 0.1. The direction of effects and statistical significance remained consistent between the reproduced and bootstrapped models.

Model 2

Model results for Sleep disturbance change

Term	B	Lower	Upper	p-value
Intercept
agPerWtChange	0.03	−0.07	0.13	0.67
Age
comorbcatsum
agmodup_act
group:
Discussion – Internet Only
E-Mail – Internet Only
asldsc
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Sleep disturbance change

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
		0.16			6.86
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Sleep disturbance change

Term	SS	DF	MS	F	p-value
agPerWtChange
Age
comorbcatsum
agmodup_act
group
asldsc
Residuals
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Sleep disturbance change

Term	B	SE	Lower	Upper	t	p-value
Intercept	22.625	4.195	14.355	30.895	5.394	<0.001
agPerWtChange	0.030	0.053	−0.074	0.134	0.564	0.5734
Age	−0.109	0.060	−0.228	0.010	−1.812	0.0714
comorbcatsum	0.479	0.324	−0.160	1.117	1.478	0.1408
agmodup_act	0.006	0.021	−0.036	0.048	0.292	0.7707
group:
Discussion – Internet Only	−0.465	0.952	−2.342	1.412	−0.488	0.6258
E-Mail – Internet Only	0.417	0.950	−1.455	2.290	0.439	0.6608
asldsc	−0.349	0.053	−0.454	−0.243	−6.521	<0.001
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Model fit for Sleep disturbance change

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
0.433	0.188	0.160	1,373.052	5.572	6.864	7	208	<0.001
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Sleep disturbance change

Term	SS	DF	MS	F	p-value
agPerWtChange	10.256	1	10.256	0.318	0.5734
Age	105.847	1	105.847	3.283	0.0714
comorbcatsum	70.472	1	70.472	2.186	0.1408
agmodup_act	2.747	1	2.747	0.085	0.7707
group	26.478	2	13.239	0.411	0.6638
asldsc	1,370.835	1	1,370.835	42.518	<0.001
Residuals	6,706.214	208	32.241
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non-linear relationships

Term	Statistic	p-value	Results
agPerWtChange	1.731	0.0850	No linearity violation
Age	1.479	0.1406	No linearity violation
comorbcatsum	1.315	0.1899	No linearity violation
agmodup_act	−0.186	0.8525	No linearity violation
group
asldsc	−0.555	0.5798	No linearity violation
Tukey test	−0.713	0.4758	No linearity violation
Specification tests for predictors use quadratic terms; curvature in the fitted values is tested using Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

No linearity violation was observed in either plots or tests.

Testing for homoscedasticity

Statistic	p-value	Parameter	Method
10.752	0.1498	7	studentized Breusch-Pagan test

Homoscedasticity results

The studentized Breusch-Pagan test supports homoscedasticity.
There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.

Model descriptives including Cook’s distance and leverage to understand outliers

Term	N	Mean	SD	Median	Min	Max	Skewness	Kurtosis
Sleep disturbance change	216	0.262	6.197	0.000	−20.400	31.300	0.568	3.132
agPerWtChange	216	−4.446	7.640	−2.875	−35.702	12.220	−1.003	1.664
Age	216	54.699	6.779	55.000	40.000	69.000	−0.007	−0.680
comorbcatsum	216	1.477	1.268	1.000	0.000	5.000	0.773	−0.070
agmodup_act	216	−1.961	18.689	−0.940	−71.016	47.405	−0.402	1.437
group*	216	1.963	0.823	2.000	1.000	3.000	0.068	−1.528
asldsc	216	48.531	7.327	48.400	32.000	68.800	−0.430	0.280
.fitted	216	0.262	2.684	0.110	−6.993	7.648	0.207	0.173
.resid	216	0.000	5.585	−0.158	−18.553	27.011	0.510	3.042
.leverage	216	0.037	0.016	0.033	0.014	0.127	1.908	5.714
.sigma	216	5.678	0.032	5.687	5.355	5.692	−6.442	53.018
.cooksd	216	0.005	0.015	0.001	0.000	0.168	7.363	65.581
.std.resid	216	0.001	1.004	−0.028	−3.335	4.889	0.530	3.130
dfb.1_	216	0.000	0.071	0.000	−0.297	0.316	0.043	4.002
dfb.aPWC	216	−0.000	0.073	0.000	−0.481	0.411	−0.986	14.448
dfb.Age	216	0.000	0.067	−0.001	−0.329	0.232	−0.511	5.335
dfb.cmrb	216	0.000	0.074	−0.001	−0.241	0.619	3.078	25.828
dfb.agm_	216	0.000	0.084	0.000	−0.492	0.562	−0.110	17.911
dfb.grpD	216	−0.000	0.070	−0.001	−0.326	0.245	−0.299	4.099
dfb.gE.M	216	−0.000	0.075	0.000	−0.421	0.312	−0.339	6.114
dfb.asld	216	−0.000	0.079	0.000	−0.572	0.263	−1.403	13.985
dffit	216	0.003	0.211	−0.005	−0.699	1.230	1.171	6.941
cov.r	216	1.042	0.075	1.059	0.413	1.164	−4.481	28.054
* categorical variable

Cook’s threshold

DFFIT threshold

DFFIT measures how many standard deviations the fitted value changes when the ith observation is removed. Potentially influential observations are those with \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. Because this threshold can flag a large number of points, a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETAS value for each model parameter. Potentially influential observations are those with \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model, so a practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

COVRATIO plot

Observations of interest identified by the influence plot

ID	StudRes	Leverage	CookD	dfb.1_	dfb.aPWC	dfb.Age	dfb.cmrb	dfb.agm_	dfb.grpD	dfb.gE.M	dfb.asld	dffit	cov.r
134	−0.156	0.108	0.000	−0.018	0.040	0.005	−0.014	0.012	−0.005	−0.012	0.029	−0.054	1.164
157	1.528	0.127	0.042	−0.061	−0.481	−0.035	0.111	−0.300	0.028	0.112	0.074	0.583	1.089
89	3.866	0.056	0.103	0.176	0.411	−0.310	−0.241	0.562	−0.108	0.312	0.219	0.938	0.630
71	5.184	0.053	0.168	0.223	−0.331	0.168	0.619	0.302	−0.326	−0.421	−0.572	1.230	0.413
StudRes = studentized residuals; CookD = Cook's distance, a combined measure of leverage and influence; DFBETAS (dfb.*) measure how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

Two observations had studentized residuals > 3. DFBETAS and Cook’s distance were within conventional ranges, but the DFFIT for the largest outlier may be of concern. The COVRATIO indicated observations that may affect confidence interval widths.
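The flags described above can be reproduced from the four rows of the observations-of-interest table. This is a minimal sketch, not the code used for this report. The cut-offs of 3 for studentized residuals and 1 for DFFITS and DFBETAS follow the text; the Cook's distance cut-off of 0.5 is an assumption (the lower end of the commonly used 0.5 to 1 range), and the names `observations` and `flag` are hypothetical.

```python
# Rows copied from the observations-of-interest table for this model.
# dfbetas are listed in table order (dfb.1_ through dfb.asld).
observations = {
    134: {"stud_res": -0.156, "cook_d": 0.000, "dffit": -0.054,
          "dfbetas": [-0.018, 0.040, 0.005, -0.014, 0.012, -0.005, -0.012, 0.029]},
    157: {"stud_res": 1.528, "cook_d": 0.042, "dffit": 0.583,
          "dfbetas": [-0.061, -0.481, -0.035, 0.111, -0.300, 0.028, 0.112, 0.074]},
    89: {"stud_res": 3.866, "cook_d": 0.103, "dffit": 0.938,
         "dfbetas": [0.176, 0.411, -0.310, -0.241, 0.562, -0.108, 0.312, 0.219]},
    71: {"stud_res": 5.184, "cook_d": 0.168, "dffit": 1.230,
         "dfbetas": [0.223, -0.331, 0.168, 0.619, 0.302, -0.326, -0.421, -0.572]},
}


def flag(obs):
    """Return the list of practical cut-offs an observation exceeds."""
    reasons = []
    if abs(obs["stud_res"]) > 3:
        reasons.append("studentized residual")
    if obs["cook_d"] > 0.5:  # assumed cut-off, see lead-in
        reasons.append("Cook's distance")
    if abs(obs["dffit"]) > 1:
        reasons.append("DFFITS")
    if max(abs(d) for d in obs["dfbetas"]) > 1:
        reasons.append("DFBETAS")
    return reasons


flags = {obs_id: flag(obs) for obs_id, obs in observations.items()}
for obs_id, reasons in flags.items():
    print(obs_id, reasons or "no flags")
```

Observations 89 and 71 are flagged on the studentized residual, and only observation 71 exceeds the DFFITS cut-off, matching the summary above.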

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic	p-value	Method
0.055	0.5349	Asymptotic one-sample Kolmogorov-Smirnov test

Statistic	p-value	Method
0.961	<0.001	Shapiro-Wilk normality test

Normality results

The Kolmogorov-Smirnov test supports residuals being normally distributed.
The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
QQ-plot looks roughly normal.

Assessing collinearity with VIF

Term	VIF	Tolerance
agPerWtChange	1.039	0.962
Age	1.057	0.946
comorbcatsum	1.061	0.943
agmodup_act	1.032	0.969
group	1.012	0.988
asldsc	1.012	0.988
VIF = Variance Inflation Factor.

Collinearity results

All VIF values are under three, indicating no collinearity issues.
Overall, when taking into account VIF and SE, the model does not have collinearity issues.
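The Tolerance column is the reciprocal of the VIF, and both are functions of the R² obtained when a predictor is regressed on the remaining predictors (VIF = 1 / (1 − R²)). The short sketch below recomputes tolerance and the implied R² from the rounded VIF values in the table; small discrepancies in the third decimal place are expected because the inputs are rounded. The variable names are illustrative only.

```python
# VIF values copied from the table above.
vif = {
    "agPerWtChange": 1.039,
    "Age": 1.057,
    "comorbcatsum": 1.061,
    "agmodup_act": 1.032,
    "group": 1.012,
    "asldsc": 1.012,
}

# Tolerance = 1 / VIF; implied R-squared of each predictor on the others = 1 - tolerance.
tolerance = {term: 1.0 / value for term, value in vif.items()}
implied_r2 = {term: 1.0 - tol for term, tol in tolerance.items()}

for term in vif:
    print(f"{term:15s} tolerance={tolerance[term]:.3f} implied R2={implied_r2[term]:.3f}")
```

The largest implied R² is below 0.06, so each predictor shares very little variance with the others.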

Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation	Statistic	p-value
−0.088	2.174	0.1460

Independence results

The Durbin–Watson test suggests there are no auto-correlation issues.
Although the study design was longitudinal, change scores were used; therefore, there was no violation of independence.
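For reference, the Durbin–Watson statistic is the sum of squared successive differences of the residuals divided by their sum of squares, and it is approximately 2(1 − r), where r is the lag-1 autocorrelation. The sketch below shows the definition on a toy residual vector (illustrative values, not study data) and applies the approximation to the reported autocorrelation. The function name `durbin_watson` is hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences / sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Toy residuals (illustrative only): perfectly alternating signs push the statistic above 2.
toy_dw = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Values copied from the table above.
reported_autocorrelation, reported_dw = -0.088, 2.174
approx_dw = 2 * (1 - reported_autocorrelation)
print(f"toy DW = {toy_dw}, approximation = {approx_dw:.3f}, reported = {reported_dw}")
```

Values close to 2 indicate little first-order autocorrelation. The statistic depends on row order, so it is most informative when rows have a meaningful sequence.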

Assumption conclusions

Residual diagnostics did not indicate major violations of linear model assumptions. However, structured residual patterns were observed, consistent with an ordinal measure being treated as continuous and the limited variation in baseline scores. Outlier diagnostics indicated that a couple of observations may be of concern and should be further investigated.

Forest plot showing Original and Reproduced coefficients and 95% confidence intervals for Sleep disturbance change

Change in regression coefficients

term	O_B	R_B	Change.B	reproduce.B
Intercept		22.6248
agPerWtChange	0.03	0.0297	−0.0003	Reproduced
Age		−0.1094
comorbcatsum		0.4787
agmodup_act		0.0062
group:
Discussion – Internet Only		−0.4649
E-Mail – Internet Only		0.4175
asldsc		−0.3487
O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

term	O_lower	R_lower	Change.lci	Reproduce.lower
Intercept		14.3551
agPerWtChange	−0.07	−0.0741	−0.0041	Reproduced
Age		−0.2285
comorbcatsum		−0.1596
agmodup_act		−0.0359
group:
Discussion – Internet Only		−2.3415
E-Mail – Internet Only		−1.4551
asldsc		−0.4541
O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

term	O_upper	R_upper	Change.uci	Reproduce.upper
Intercept		30.8945
agPerWtChange	0.13	0.1335	0.0035	Reproduced
Age		0.0096
comorbcatsum		1.1170
agmodup_act		0.0484
group:
Discussion – Internet Only		1.4118
E-Mail – Internet Only		2.2900
asldsc		−0.2433
O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in Adjusted R²

O_R2Adj	R_R2Adj	Change.R2Adj	Reproduce.R2Adj
0.160	0.1603	0.0003	Reproduced
O_R2Adj = original R2 Adjusted; R_R2Adj = reproduced R2 Adjusted; Change.R2Adj = change in R2Adj (R2Adj - O_R2Adj); Reproduce.R2Adj = R2 Adjusted reproduced.

Change in global F

Term	O_F	R_F	Change.F	Reproduce.F
Intercept	6.86	6.8637	0.0037	Reproduced
O_F = original global F; R_F = reproduced global F; Change.F = change in R_F - O_F; Reproduce.F = Global F reproduced.

Change in p-values

Term	O_p	R_p	Change.p	Reproduce.p	SigChangeDirection
Intercept		<0.001
agPerWtChange	0.67	0.5734	−0.0966	Not Reproduced	Remains non-sig, B same direction
Age		0.0714
comorbcatsum		0.1408
agmodup_act		0.7707
group:
Discussion – Internet Only		0.6258
E-Mail – Internet Only		0.6608
asldsc		<0.001
O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p - O_p); Reproduce.p = p-value reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note: p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values

The p-value was not reproduced.

Conclusion computational reproducibility

This model was mostly computationally reproducible. A likely typographic error was identified in the reported p-value, although the p-value had the same interpretation and the regression coefficient did not change direction.

Methods

Bootstrapped results

A non-parametric bootstrap with bias-corrected and accelerated (BCa) confidence intervals was performed using 10,000 resamples.
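To illustrate the idea behind a case-resampling bootstrap with BCa intervals, the sketch below applies it to the slope of a single-predictor regression on simulated data. It is a simplified illustration under stated assumptions, not the code used for this report: the data are synthetic, the model has one predictor rather than seven, 2,000 resamples are used instead of 10,000 to keep it fast, and the names `slope` and `bca_interval` are hypothetical.

```python
import random
from statistics import NormalDist


def slope(pairs):
    """Ordinary least squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx


def bca_interval(pairs, stat, n_boot=2000, level=0.95, seed=1):
    """Bias-corrected and accelerated (BCa) bootstrap interval for stat(pairs)."""
    rng = random.Random(seed)
    n = len(pairs)
    theta_hat = stat(pairs)
    # Case resampling: draw n rows with replacement and recompute the statistic.
    boot = sorted(stat(rng.choices(pairs, k=n)) for _ in range(n_boot))
    nd = NormalDist()
    # Bias correction: normal quantile of the share of bootstrap estimates below theta_hat.
    z0 = nd.inv_cdf(sum(b < theta_hat for b in boot) / n_boot)
    # Acceleration: skewness-type term from leave-one-out (jackknife) estimates.
    jack = [stat(pairs[:i] + pairs[i + 1:]) for i in range(n)]
    jbar = sum(jack) / n
    accel = sum((jbar - j) ** 3 for j in jack) / (
        6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    )

    def adjusted_quantile(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))

    def pick(q):
        return boot[min(n_boot - 1, max(0, int(q * n_boot)))]

    alpha = (1 - level) / 2
    return theta_hat, pick(adjusted_quantile(alpha)), pick(adjusted_quantile(1 - alpha))


# Synthetic data (illustrative only): y depends weakly on x plus noise.
data_rng = random.Random(42)
data = []
for _ in range(60):
    x = data_rng.gauss(0, 1)
    data.append((x, 0.3 * x + data_rng.gauss(0, 1)))

estimate, lower, upper = bca_interval(data, slope)
print(f"slope = {estimate:.3f}, 95% BCa CI = ({lower:.3f}, {upper:.3f})")
```

Resampling whole rows keeps each participant's outcome and predictors together, so the interval does not rely on normally distributed residuals.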

Change in regression coefficients

Term	B	boot.B	B_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.0028	0.0026	0.0002	8.0500	No	No	No
z_agPerWtChange	0.0366	0.0377	−0.0011	−2.9100	No	No	No
z_Age	−0.1197	−0.1216	0.0019	1.5500	No	No	No
z_comorbcatsum	0.0980	0.0981	−0.0001	−0.0700	No	No	No
z_agmodup_act	0.0188	0.0197	−0.0009	−4.7200	No	No	No
groupDiscussion	−0.0750	−0.0752	0.0001	0.1800	No	No	No
groupE-Mail	0.0674	0.0655	0.0018	2.7300	No	No	No
z_asldsc	−0.4123	−0.4136	0.0013	0.3200	No	No	No
B = reproduced standardized regression coefficient; boot.B = bootstrapped standardized regression coefficient; B_diff = B - boot.B; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

Term	Lower	boot.Lower	Lower_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	−0.2047	−0.2249	0.0202	9.8500	No	No	No
z_agPerWtChange	−0.0914	−0.0949	0.0036	3.9000	No	No	No
z_Age	−0.2500	−0.2435	−0.0065	−2.6100	No	No	No
z_comorbcatsum	−0.0327	−0.0163	−0.0164	−50.2500	Yes	No	No
z_agmodup_act	−0.1084	−0.1312	0.0229	21.1000	Yes	No	No
groupDiscussion	−0.3779	−0.3820	0.0041	1.1000	No	No	No
groupE-Mail	−0.2348	−0.2608	0.0260	11.0700	Yes	No	No
z_asldsc	−0.5370	−0.5651	0.0281	5.2300	No	No	No
Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized reproduced lower CI; Lower_diff = Lower - boot.Lower; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.
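The difference columns and Yes/No flags in these tables can be sketched as follows. The percentage is assumed here to be taken relative to the absolute value of the reproduced estimate; that is inferred from the signs in the tables rather than stated in the legend. Results computed from the rounded table values will differ slightly from the printed percentages, which appear to be based on unrounded estimates. The function name `compare` is hypothetical.

```python
def compare(reproduced, bootstrapped):
    """Difference, percentage difference and threshold flags for one table cell."""
    if reproduced == 0:
        raise ValueError("percentage difference is undefined for a zero reference value")
    diff = reproduced - bootstrapped
    # Assumption: percentage difference is relative to |reproduced|.
    pct = 100.0 * diff / abs(reproduced)
    pct = max(-1000.0, min(1000.0, pct))  # truncation at +/-1000%, as in the legend
    return {
        "diff": diff,
        "pct": pct,
        "Diff_10%": abs(pct) >= 10.0,
        "Diff_0.1": abs(diff) >= 0.1,
        "Diff_0.2": abs(diff) >= 0.2,
    }


# Lower confidence limit for z_comorbcatsum, copied from the table above.
result = compare(-0.0327, -0.0163)
print(result)
```

This row shows why the two kinds of flag are reported together: a limit close to zero gives a large percentage change even though the absolute change in the standardized coefficient is well under 0.1.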

Change in upper 95% confidence interval

Term	Upper	boot.Upper	Upper_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.2103	0.2521	−0.0418	−19.8700	Yes	No	No
z_agPerWtChange	0.1646	0.1652	−0.0006	−0.3700	No	No	No
z_Age	0.0105	0.0030	0.0076	71.9900	Yes	No	No
z_comorbcatsum	0.2286	0.2556	−0.0269	−11.7700	Yes	No	No
z_agmodup_act	0.1460	0.1669	−0.0209	−14.3200	Yes	No	No
groupDiscussion	0.2278	0.2174	0.0104	4.5800	No	No	No
groupE-Mail	0.3696	0.3785	−0.0089	−2.4100	No	No	No
z_asldsc	−0.2877	−0.2837	−0.0039	−1.3600	No	No	No
Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized reproduced upper CI; Upper_diff = Upper - boot.Upper; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

Term	Range	boot.Range	Range_Diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.4151	0.4770	0.0620	14.9300	Yes	No	No
z_agPerWtChange	0.2560	0.2602	0.0042	1.6300	No	No	No
z_Age	0.2605	0.2464	−0.0141	−5.4200	No	No	No
z_comorbcatsum	0.2613	0.2718	0.0105	4.0100	No	No	No
z_agmodup_act	0.2544	0.2982	0.0438	17.2100	Yes	No	No
groupDiscussion	0.6057	0.5994	−0.0063	−1.0400	No	No	No
groupE-Mail	0.6044	0.6393	0.0349	5.7800	No	No	No
z_asldsc	0.2493	0.2813	0.0320	12.8400	Yes	No	No
Range = standardized reproduced CI range; boot.Range = bootstrapped standardized reproduced CI range; Range_Diff = change in CI range (boot.Range - Range); %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

Term	p-value	boot.p-value	changep	SigChangeDirection
Intercept	0.9789	0.9831	−0.0042	Remains non-sig, B same direction
z_agPerWtChange	0.5734	0.5681	0.0052	Remains non-sig, B same direction
z_Age	0.0714	0.0533	0.0182	Remains non-sig, B same direction
z_comorbcatsum	0.1408	0.1528	−0.0120	Remains non-sig, B same direction
z_agmodup_act	0.7707	0.7941	−0.0235	Remains non-sig, B same direction
groupDiscussion	0.6258	0.6254	0.0004	Remains non-sig, B same direction
groupE-Mail	0.6608	0.6861	−0.0253	Remains non-sig, B same direction
z_asldsc	<0.001	<0.001	−0.0000	Remains sig, B same direction
p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized reproduced p-value; changep = p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check distribution of bootstrap estimates

Conclusions based on the bootstrapped model

Model 3

Model results for Fatigue change

Term	B	Lower	Upper	p-value
Intercept
agPerWtChange	0.24	0.12	0.38	<0.001
Age
comorbcatsum
agmodup_act
group:
Discussion – Internet Only
E-Mail – Internet Only
afatsc
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit Statistics Fatigue change

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
		0.20			7.56
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Fatigue change

Term	SS	DF	MS	F	p-value
agPerWtChange
Age
comorbcatsum
agmodup_act
group
afatsc
Residuals
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Fatigue change

Term	B	SE	Lower	Upper	t	p-value
Intercept	30.488	5.609	19.431	41.546	5.436	<0.001
agPerWtChange	0.249	0.066	0.120	0.379	3.794	<0.001
Age	−0.233	0.075	−0.381	−0.084	−3.086	0.0023
comorbcatsum	0.840	0.404	0.043	1.637	2.078	0.0389
agmodup_act	−0.025	0.027	−0.078	0.027	−0.942	0.3474
group:
Discussion – Internet Only	−0.728	1.191	−3.076	1.621	−0.611	0.5420
E-Mail – Internet Only	0.300	1.189	−2.045	2.645	0.252	0.8009
afatsc	−0.368	0.073	−0.512	−0.224	−5.044	<0.001
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Fatigue change

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
0.450	0.203	0.176	1,468.766	6.954	7.563	7	208	<0.001
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Fatigue change

Term	SS	DF	MS	F	p-value
agPerWtChange	722.947	1	722.947	14.396	<0.001
Age	478.333	1	478.333	9.525	0.0023
comorbcatsum	216.902	1	216.902	4.319	0.0389
agmodup_act	44.540	1	44.540	0.887	0.3474
group	38.165	2	19.083	0.380	0.6843
afatsc	1,277.792	1	1,277.792	25.445	<0.001
Residuals	10,445.335	208	50.218
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.
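Each F statistic in the ANOVA table is the term's mean square divided by the residual mean square, and for single-degree-of-freedom terms it equals the square of the t statistic in the coefficient table. The sketch below recomputes both from the rounded values printed above, so agreement is to about two decimal places. The variable names are illustrative only.

```python
# (sum of squares, degrees of freedom) copied from the ANOVA table above.
anova = {
    "agPerWtChange": (722.947, 1),
    "Age": (478.333, 1),
    "comorbcatsum": (216.902, 1),
    "agmodup_act": (44.540, 1),
    "group": (38.165, 2),
    "afatsc": (1277.792, 1),
}
ss_resid, df_resid = 10445.335, 208
ms_resid = ss_resid / df_resid  # residual mean square

# F = (SS_term / DF_term) / MS_residual
f_stats = {term: (ss / df) / ms_resid for term, (ss, df) in anova.items()}

# t statistics copied from the coefficient table (single-df terms only).
t_values = {"agPerWtChange": 3.794, "Age": -3.086, "comorbcatsum": 2.078,
            "agmodup_act": -0.942, "afatsc": -5.044}

for term, f_value in f_stats.items():
    t_squared = t_values[term] ** 2 if term in t_values else float("nan")
    print(f"{term:15s} F = {f_value:7.3f}   t^2 = {t_squared:7.3f}")
```

The group factor has two degrees of freedom, so its F statistic has no single t counterpart; its two dummy coefficients are tested jointly.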

Visualisation of regression model

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non-linear relationships

Term	Statistic	p-value	Results
agPerWtChange	−0.010	0.9924	No linearity violation
Age	0.498	0.6191	No linearity violation
comorbcatsum	−0.162	0.8718	No linearity violation
agmodup_act	1.613	0.1083	No linearity violation
group
afatsc	0.079	0.9370	No linearity violation
Tukey test	0.729	0.4658	No linearity violation
Specification tests for each predictor use an added quadratic term; curvature in the fitted values is tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

No linearity violation was observed in either plots or tests.

Testing for homoscedasticity

Statistic	p-value	Parameter	Method
5.585	0.5890	7	studentized Breusch-Pagan test

Homoscedasticity results

The studentized Breusch-Pagan test supports homoscedasticity.
There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.

Model descriptives including Cook’s distance and leverage to understand outliers

Term	N	Mean	SD	Median	Min	Max	Skewness	Kurtosis
Fatigue change	216	−1.156	7.807	0.000	−23.300	29.800	0.157	1.429
agPerWtChange	216	−4.446	7.640	−2.875	−35.702	12.220	−1.003	1.664
Age	216	54.699	6.779	55.000	40.000	69.000	−0.007	−0.680
comorbcatsum	216	1.477	1.268	1.000	0.000	5.000	0.773	−0.070
agmodup_act	216	−1.961	18.689	−0.940	−71.016	47.405	−0.402	1.437
group*	216	1.963	0.823	2.000	1.000	3.000	0.068	−1.528
afatsc	216	51.487	6.716	51.000	33.700	71.600	−0.438	0.670
.fitted	216	−1.156	3.517	−1.414	−11.440	10.155	0.198	0.258
.resid	216	−0.000	6.970	0.286	−20.702	26.061	−0.019	1.063
.leverage	216	0.037	0.016	0.033	0.014	0.128	1.870	5.063
.sigma	216	7.086	0.031	7.097	6.855	7.104	−3.785	19.206
.cooksd	216	0.005	0.011	0.002	0.000	0.105	5.214	35.288
.std.resid	216	0.001	1.003	0.041	−2.995	3.784	−0.011	1.091
dfb.1_	216	0.000	0.075	−0.002	−0.281	0.431	1.266	8.439
dfb.aPWC	216	0.000	0.080	0.000	−0.452	0.546	1.542	17.409
dfb.Age	216	0.000	0.083	0.001	−0.345	0.476	0.939	9.811
dfb.cmrb	216	−0.000	0.070	0.000	−0.359	0.218	−1.181	5.949
dfb.agm_	216	−0.000	0.068	0.000	−0.284	0.550	2.229	21.110
dfb.grpD	216	−0.000	0.067	−0.001	−0.212	0.377	0.799	5.319
dfb.gE.M	216	0.000	0.072	−0.000	−0.198	0.340	0.564	2.970
dfb.afts	216	−0.000	0.063	0.000	−0.245	0.225	−0.373	3.447
dffit	216	0.004	0.204	0.007	−0.690	0.946	0.217	2.839
cov.r	216	1.041	0.066	1.059	0.622	1.133	−2.799	10.680
* categorical variable

Cook’s threshold

Cook’s distance measures the overall change in fit, if the ith observation is removed. Potentially influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted value changes when the ith observation is removed. Potentially influential observations are those with \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. Because this threshold can flag a large number of points, a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETAS value for each model parameter. Potentially influential observations are those with \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model, so a practical cut-off of 1 was used to flag observations with meaningful impact.
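For this model (n = 216 observations, p = 8 coefficients including the intercept), the sample-size-adjusted thresholds can be evaluated and compared with the largest absolute values in the model descriptives table. The sketch below is a worked calculation only; the maxima are copied from that table, and the Cook's distance practical cut-off of 0.5 is an assumed choice from the 0.5 to 1 range.

```python
import math

n, p = 216, 8  # observations; coefficients including the intercept

size_adjusted = {
    "Cook's distance": 4 / n,
    "DFFITS": 2 * math.sqrt(p) / math.sqrt(n),
    "DFBETAS": 2 / math.sqrt(n),
}
practical = {"Cook's distance": 0.5, "DFFITS": 1.0, "DFBETAS": 1.0}

# Largest absolute values copied from the model descriptives table for this model.
observed_max = {"Cook's distance": 0.105, "DFFITS": 0.946, "DFBETAS": 0.550}

for name in size_adjusted:
    print(
        f"{name:16s} size-adjusted = {size_adjusted[name]:.3f} "
        f"(exceeded: {observed_max[name] > size_adjusted[name]}), "
        f"practical = {practical[name]} "
        f"(exceeded: {observed_max[name] > practical[name]})"
    )
```

Every measure exceeds its size-adjusted threshold for at least one observation, but none exceeds the practical cut-off, which is the situation the practical cut-offs are meant to handle.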

Influence plot

COVRATIO plot

Observations of interest identified by the influence plot

ID	StudRes	Leverage	CookD	dfb.1_	dfb.aPWC	dfb.Age	dfb.cmrb	dfb.agm_	dfb.grpD	dfb.gE.M	dfb.afts	dffit	cov.r
16	−0.620	0.096	0.005	−0.100	0.027	0.031	−0.095	−0.078	−0.067	−0.003	0.133	−0.202	1.133
157	1.430	0.128	0.037	−0.070	−0.452	−0.030	0.103	−0.284	0.022	0.099	0.084	0.548	1.102
100	3.202	0.047	0.061	−0.198	−0.174	0.428	0.143	−0.097	0.377	0.051	−0.228	0.714	0.741
89	3.911	0.055	0.105	0.431	0.403	−0.327	−0.201	0.550	−0.063	0.340	−0.207	0.946	0.622
StudRes = studentized residuals; CookD = Cook's distance, a combined measure of leverage and influence; DFBETAS (dfb.*) measure how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic	p-value	Method
0.054	0.5665	Asymptotic one-sample Kolmogorov-Smirnov test

Statistic	p-value	Method
0.983	0.0098	Shapiro-Wilk normality test

Normality results

The Kolmogorov-Smirnov test supports residuals being normally distributed.
The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
QQ-plot looks roughly normal.

Assessing collinearity with VIF

Term	VIF	Tolerance
agPerWtChange	1.038	0.963
Age	1.058	0.945
comorbcatsum	1.061	0.943
agmodup_act	1.031	0.970
group	1.014	0.987
afatsc	1.014	0.986
VIF = Variance Inflation Factor.

Collinearity results

All VIF values are under three, indicating no collinearity issues.
Overall, when taking into account VIF and SE, the model does not have collinearity issues.

Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation	Statistic	p-value
−0.050	2.098	0.4860

Independence results

The Durbin–Watson test suggests there are no auto-correlation issues.
Although the study design was longitudinal, change scores were used; therefore, there was no violation of independence.

Assumption conclusions

Residual diagnostics did not indicate major violations of linear model assumptions. However, outlier diagnostics indicated that, although point estimates were unlikely to be substantially affected by influential points, confidence-interval width could be affected and should be further investigated.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Fatigue change

Change in regression coefficients

term	O_B	R_B	Change.B	reproduce.B
Intercept		30.4885
agPerWtChange	0.24	0.2492	0.0092	Incorrect Rounding
Age		−0.2328
comorbcatsum		0.8399
agmodup_act		−0.0251
group:
Discussion – Internet Only		−0.7277
E-Mail – Internet Only		0.3003
afatsc		−0.3681
O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.
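The labels in the last column can be mimicked with a simple rule: round the reproduced value to the number of decimals in the original; if it matches, the value is reproduced; if it matches only when truncated, it is an incorrect rounding; otherwise it is not reproduced. This rule is an assumption that is consistent with the labels shown in this report, not a statement of the criteria actually applied. The function name `classify` is hypothetical.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def classify(original, reproduced):
    """Compare a published value with a reproduced one at the published precision.

    Both arguments are strings so that the published number of decimals is preserved.
    """
    o = Decimal(original)
    r = Decimal(reproduced)
    quantum = Decimal(1).scaleb(o.as_tuple().exponent)  # e.g. 0.01 for "0.24"
    if r.quantize(quantum, rounding=ROUND_HALF_UP) == o:
        return "Reproduced"
    if r.quantize(quantum, rounding=ROUND_DOWN) == o:
        return "Incorrect Rounding"  # published value looks truncated, not rounded
    return "Not Reproduced"


# Published and reproduced values copied from the comparison tables in this report.
examples = {
    "B, Fatigue change": ("0.24", "0.2492"),
    "B, Sleep disturbance change": ("0.03", "0.0297"),
    "Adjusted R2, Fatigue change": ("0.200", "0.1761"),
}
labels = {name: classify(o, r) for name, (o, r) in examples.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Passing the values as strings keeps trailing zeros such as "0.200", so the comparison is made at the precision the authors reported.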

Change in lower 95% confidence intervals for coefficients

term	O_lower	R_lower	Change.lci	Reproduce.lower
Intercept		19.4307
agPerWtChange	0.12	0.1197	−0.0003	Reproduced
Age		−0.3815
comorbcatsum		0.0432
agmodup_act		−0.0776
group:
Discussion – Internet Only		−3.0765
E-Mail – Internet Only		−2.0447
afatsc		−0.5120
O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

term	O_upper	R_upper	Change.uci	Reproduce.upper
Intercept		41.5463
agPerWtChange	0.38	0.3787	−0.0013	Reproduced
Age		−0.0841
comorbcatsum		1.6366
agmodup_act		0.0274
group:
Discussion – Internet Only		1.6210
E-Mail – Internet Only		2.6452
afatsc		−0.2243
O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in Adjusted R²

O_R2Adj	R_R2Adj	Change.R2Adj	Reproduce.R2Adj
0.200	0.1761	−0.0239	Not Reproduced
O_R2Adj = original R2 Adjusted; R_R2Adj = reproduced R2 Adjusted; Change.R2Adj = change in R2Adj (R2Adj - O_R2Adj); Reproduce.R2Adj = R2 Adjusted reproduced.

Change in global F

Term	O_F	R_F	Change.F	Reproduce.F
Intercept	7.56	7.5633	0.0033	Reproduced
O_F = original global F; R_F = reproduced global F; Change.F = change in R_F - O_F; Reproduce.F = Global F reproduced.

Change in p-values

The p-value was reproduced.

Term	O_p	R_p	Change.p	Reproduce.p	SigChangeDirection
Intercept		<0.001
agPerWtChange	<0.001	<0.001	0.0000	Reproduced	Remains sig, B same direction
Age		0.0023
comorbcatsum		0.0389
agmodup_act		0.3474
group:
Discussion – Internet Only		0.5420
E-Mail – Internet Only		0.8009
afatsc		<0.001
O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p - O_p); Reproduce.p = p-value reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note: p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values

Conclusion computational reproducibility

This model was mostly computationally reproducible, with minor rounding errors. P-values were reproduced and had the same interpretation, and regression coefficients did not change direction. Unadjusted R² was mistakenly reported instead of adjusted R² for this model.
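The conclusion about R² can be checked from the reproduced fit statistics. With R² = 0.203, n = 216 and 7 model degrees of freedom, the adjusted R² is about 0.176, whereas the published 0.20 matches the unadjusted value rounded to two decimals. The sketch below is a worked calculation using the rounded values printed in this report.

```python
n, k = 216, 7   # observations; model degrees of freedom (DF1)
r2 = 0.203      # reproduced unadjusted R-squared

# Adjusted R-squared penalises R-squared for the number of model terms.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The global F statistic can also be recovered from R-squared.
f_global = (r2 / k) / ((1 - r2) / (n - k - 1))

print(f"adjusted R2 = {adj_r2:.3f}, unadjusted R2 to 2 dp = {r2:.2f}, global F = {f_global:.2f}")
```

The recovered F differs from the reported 7.563 only in the second decimal place because R² is rounded to three decimals here.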

Methods

Bootstrapping results

A non-parametric bootstrap with bias-corrected and accelerated (BCa) confidence intervals was performed using 10,000 resamples.

Change in regression coefficients

Term	B	boot.B	B_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.0179	0.0147	0.0032	17.7900	Yes	No	No
z_agPerWtChange	0.2439	0.2442	−0.0003	−0.1200	No	No	No
z_Age	−0.2021	−0.2025	0.0004	0.2000	No	No	No
z_comorbcatsum	0.1365	0.1391	−0.0026	−1.9100	No	No	No
z_agmodup_act	−0.0601	−0.0590	−0.0011	−1.7600	No	No	No
groupDiscussion	−0.0932	−0.0909	−0.0023	−2.5000	No	No	No
groupE-Mail	0.0385	0.0389	−0.0004	−1.1600	No	No	No
z_afatsc	−0.3167	−0.3156	−0.0011	−0.3400	No	No	No
B = reproduced standardized regression coefficient; boot.B = bootstrapped standardized regression coefficient; B_diff = B - boot.B; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.
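The standardized coefficients in this table can be related to the unstandardized coefficients reported earlier. The z_ prefixes suggest that the outcome and the continuous predictors were z-scored while the group indicators were left as 0/1; that is an inference from the table, not something this section states. Under that assumption, a continuous predictor's coefficient is B × SD(x) / SD(y) and a group indicator's coefficient is B / SD(y). The sketch below applies this to the rounded values printed in the report.

```python
sd_outcome = 7.807  # SD of Fatigue change, from the model descriptives table

# Unstandardized B (reproduced) and predictor SD, from the earlier tables.
continuous = {
    "z_agPerWtChange": (0.2492, 7.640),
    "z_Age": (-0.2328, 6.779),
    "z_comorbcatsum": (0.8399, 1.268),
    "z_agmodup_act": (-0.0251, 18.689),
    "z_afatsc": (-0.3681, 6.716),
}
indicators = {"groupDiscussion": -0.7277, "groupE-Mail": 0.3003}

# Continuous predictors: rescale by SD(x) / SD(y); indicators: rescale by 1 / SD(y) only.
standardized = {term: b * sd_x / sd_outcome for term, (b, sd_x) in continuous.items()}
standardized.update({term: b / sd_outcome for term, b in indicators.items()})

for term, beta in standardized.items():
    print(f"{term:16s} {beta: .4f}")
```

The results agree with the B column above to about three decimal places, the remaining differences being due to rounding of the inputs.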

Change in lower 95% confidence interval

Term	Lower	boot.Lower	Lower_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	−0.1883	−0.1789	−0.0094	−4.9900	No	No	No
z_agPerWtChange	0.1172	0.1121	0.0051	4.3600	No	No	No
z_Age	−0.3312	−0.3479	0.0167	5.0400	No	No	No
z_comorbcatsum	0.0070	0.0084	−0.0014	−20.0200	Yes	No	No
z_agmodup_act	−0.1859	−0.1655	−0.0204	−10.9600	Yes	No	No
groupDiscussion	−0.3941	−0.3733	−0.0207	−5.2600	No	No	No
groupE-Mail	−0.2619	−0.2560	−0.0059	−2.2600	No	No	No
z_afatsc	−0.4405	−0.4340	−0.0065	−1.4700	No	No	No
Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized reproduced lower CI; Lower_diff = Lower - boot.Lower; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

Term	Upper	boot.Upper	Upper_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.2241	0.2075	0.0166	7.4000	No	No	No
z_agPerWtChange	0.3706	0.3949	−0.0243	−6.5600	No	No	No
z_Age	−0.0730	−0.0476	−0.0255	−34.8700	Yes	No	No
z_comorbcatsum	0.2659	0.2625	0.0034	1.2800	No	No	No
z_agmodup_act	0.0657	0.0791	−0.0134	−20.3800	Yes	No	No
groupDiscussion	0.2076	0.1983	0.0093	4.4800	No	No	No
groupE-Mail	0.3388	0.3465	−0.0077	−2.2800	No	No	No
z_afatsc	−0.1929	−0.2101	0.0171	8.8800	No	No	No
Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized reproduced upper CI; Upper_diff = Upper - boot.Upper; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

Term	Range	boot.Range	Range_Diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.4124	0.3864	−0.0260	−6.3000	No	No	No
z_agPerWtChange	0.2534	0.2829	0.0294	11.6100	Yes	No	No
z_Age	0.2582	0.3004	0.0422	16.3300	Yes	No	No
z_comorbcatsum	0.2589	0.2541	−0.0048	−1.8500	No	No	No
z_agmodup_act	0.2516	0.2446	−0.0070	−2.7700	No	No	No
groupDiscussion	0.6017	0.5717	−0.0300	−4.9900	No	No	No
groupE-Mail	0.6007	0.6025	0.0018	0.3000	No	No	No
z_afatsc	0.2475	0.2239	−0.0236	−9.5300	No	No	No
Range = standardized reproduced CI range; boot.Range = bootstrapped standardized reproduced CI range; Range_Diff = change in CI range (boot.Range - Range); %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

Term	p-value	boot.p-value	changep	SigChangeDirection
Intercept	0.8641	0.8822	−0.0181	Remains non-sig, B same direction
z_agPerWtChange	<0.001	<0.001	−0.0004	Remains sig, B same direction
z_Age	0.0023	0.0078	−0.0055	Remains sig, B same direction
z_comorbcatsum	0.0389	0.0322	0.0067	Remains sig, B same direction
z_agmodup_act	0.3474	0.3363	0.0111	Remains non-sig, B same direction
groupDiscussion	0.5420	0.5310	0.0110	Remains non-sig, B same direction
groupE-Mail	0.8009	0.8019	−0.0009	Remains non-sig, B same direction
z_afatsc	<0.001	<0.001	0.0000	Remains sig, B same direction
p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized reproduced p-value; changep = p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check the distribution of bootstrap estimates

Conclusions based on the bootstrapped model

This model was inferentially reproducible. While some statistics changed by 10% or more, these differences were not meaningful with a change in standardized regression coefficients of less than 0.1. The direction of effects and statistical significance remained consistent between the reproduced and bootstrapped models.
