Paper 41: Weight loss is associated with improved quality of life among rural women completers of a web-based lifestyle intervention

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

March 28, 2026

References

Hageman PA, Mroz JE, Yoerger MA, Pullen CH (2019) Weight loss is associated with improved quality of life among rural women completers of a web-based lifestyle intervention. PLoS ONE 14(11): e0225446. https://doi.org/10.1371/journal.pone.0225446

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability, and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.
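The per-statistic classification described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function name `classify_statistic` and the one-unit tolerance used for Incorrect Rounding are assumptions, because the exact thresholds are not stated in this report.

```python
from decimal import Decimal, ROUND_HALF_UP


def classify_statistic(original: str, reproduced: float) -> str:
    """Compare a published value (as printed) with a reproduced value."""
    orig = Decimal(original)
    # Number of decimal places reported in the paper, e.g. 2 for "0.13".
    places = max(0, -orig.as_tuple().exponent)
    unit = Decimal(1).scaleb(-places)  # e.g. 0.01 for two decimals
    rounded = Decimal(str(reproduced)).quantize(unit, rounding=ROUND_HALF_UP)
    if rounded == orig:
        return "Reproduced"
    # Assumed tolerance: off by at most one unit in the last reported decimal.
    if abs(rounded - orig) <= unit:
        return "Incorrect Rounding"
    return "Not Reproduced"


# Values taken from the tables in this report.
print(classify_statistic("0.13", 0.1326))  # depression model, coefficient
print(classify_statistic("0.67", 0.5734))  # sleep-disturbance model, p-value
```

Passing the published value as a string keeps trailing zeros (for example "0.300"), so the number of reported decimals is not lost.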

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.
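The thresholds for a meaningful difference can be expressed as a small helper. This is a sketch under assumptions: the name `flag_difference` is hypothetical, and dividing by the absolute value of the reference estimate is inferred from the signs in the sensitivity tables rather than stated in the methods.

```python
def flag_difference(reference: float, comparison: float) -> dict:
    """Flag relative (>=10%) and absolute (>=0.1, >=0.2) differences."""
    diff = reference - comparison
    # Assumption: percentage change is relative to |reference|.
    pct = 100.0 * diff / abs(reference) if reference != 0 else float("inf")
    return {
        "diff": diff,
        "pct": pct,
        "diff_10pct": abs(pct) >= 10.0,
        "diff_0.1": abs(diff) >= 0.1,
        "diff_0.2": abs(diff) >= 0.2,
    }


# Upper confidence limit for z_agmodup_act in the depression model:
# reproduced 0.1521, bootstrapped 0.1284.
print(flag_difference(0.1521, 0.1284))
```

In this example the relative change exceeds 10% while the absolute change stays below 0.1, which is the pattern the report treats as not meaningful.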

Results from the reproduction of the Hageman et al. (2019) paper are presented below. An overall summary of results is presented first, followed by model-specific results organised within tab panels. Within each panel, the Original results tab displays the linear regression outputs extracted from the published paper. The Reproduced results tab presents estimates derived from the authors’ shared data, along with a comprehensive assessment of linear regression assumptions. The Differences tab compares the original and reproduced models to assess computational reproducibility. Finally, the Sensitivity analysis tab evaluates inferential reproducibility by examining whether identified assumption violations meaningfully affected the results.

Summary from statistical review

This paper examined weight change and health quality of life (HQOL) in a population after a health intervention. Multivariable linear regression was used to determine whether weight loss was associated with improvements in HQOL across seven health domains. The authors checked the raw data for normality but did not mention other assumptions, outliers, or collinearity. The direction, but not the size, of the regression coefficients was interpreted.

Data availability and software used

The authors provide a wide-format SPSS dataset in the supporting information, which has an in-built data dictionary. The analyses were conducted using SPSS.

Regression sample

Seven multivariable linear regression models were reported. Three outcomes (depression, sleep disturbance, and fatigue) were selected at random for assessment. The primary predictor was percentage change in weight, with models adjusted for age, number of comorbidities, change in physical activity from baseline, intervention group, and baseline outcome scores. The authors did not report results for all covariates included in the models, presenting estimates only for the primary variable of interest, percentage weight loss.

Computational reproducibility results

This paper was computationally reproducible. All reported statistics for the depression model were successfully reproduced, and the other two models were mostly reproducible. The p-value for percentage weight loss in the sleep-disturbance model was not reproduced. However, this discrepancy was considered a typographical error, as the reported value of 0.67 was reproduced as 0.57 and did not alter the statistical significance of the result. A minor rounding discrepancy was observed in one coefficient in the fatigue model, and R2 instead of adjusted R2 was mistakenly reported for this model.

Inferential reproducibility results

All three models were inferentially reproducible. Residual diagnostics did not indicate major violations of linear model assumptions; however, structured residual patterns were observed, consistent with an ordinal measure being treated as continuous and the limited variation in baseline scores. A small number of larger standardized residuals were also present, which may contribute to wider confidence intervals but did not materially affect point estimates or inference. Although some statistics differed by 10% or more between models, these differences were not considered meaningful, as changes in standardized regression coefficients were less than 0.10. The direction of effects and statistical significance remained consistent between the reproduced models and the bootstrapped sensitivity analyses.

Model 1

Model results for Depression change

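| Term | B | SE | Lower | Upper | t | p-value |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | | | | | | |
| agPerWtChange | 0.13 | | 0.03 | 0.24 | | 0.01 |
| Age | | | | | | |
| comorbcatsum | | | | | | |
| agmodup_act | | | | | | |
| group: | | | | | | |
| Discussion – Internet Only | | | | | | |
| E-Mail – Internet Only | | | | | | |
| adepsc | | | | | | |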

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Depression change

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | 0.30 | | | 14.09 | | | |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Depression change

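| Term | SS | DF | MS | F | p-value |
| --- | --- | --- | --- | --- | --- |
| agPerWtChange | | | | | |
| Age | | | | | |
| comorbcatsum | | | | | |
| agmodup_act | | | | | |
| group | | | | | |
| adepsc | | | | | |
| Residuals | | | | | |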

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Depression change

| Term | B | SE | Lower | Upper | t | p-value |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 35.127 | 4.400 | 26.452 | 43.803 | 7.983 | <0.001 |
| agPerWtChange | 0.133 | 0.054 | 0.027 | 0.238 | 2.473 | 0.0142 |
| Age | −0.233 | 0.062 | −0.355 | −0.112 | −3.785 | <0.001 |
| comorbcatsum | 1.259 | 0.329 | 0.610 | 1.908 | 3.824 | <0.001 |
| agmodup_act | 0.013 | 0.022 | −0.031 | 0.056 | 0.580 | 0.5624 |
| group: | | | | | | |
| Discussion – Internet Only | −0.352 | 0.976 | −2.275 | 1.571 | −0.361 | 0.7185 |
| E-Mail – Internet Only | 0.771 | 0.964 | −1.130 | 2.672 | 0.800 | 0.4248 |
| adepsc | −0.501 | 0.060 | −0.619 | −0.383 | −8.352 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Depression change

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.568 | 0.323 | 0.300 | 1,373.339 | 5.658 | 14.087 | 7 | 207 | <0.001 |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.
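As a consistency check on the fit statistics above, adjusted R2 and the global F can be recomputed from R2 and the degrees of freedom. This is a sketch using the standard textbook formulas; the function names are illustrative, and small discrepancies are expected because the reported R2 is rounded to three decimals.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R2 for n observations and k predictors (excluding intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def global_f(r2: float, n: int, k: int) -> float:
    """Global F statistic implied by R2."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Depression model: DF1 = 7 and DF2 = 207, so n = 7 + 207 + 1 = 215.
n, k, r2 = 215, 7, 0.323
print(round(adjusted_r2(r2, n, k), 3))
print(round(global_f(r2, n, k), 2))
```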

ANOVA table for Depression change

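| Term | SS | DF | MS | F | p-value |
| --- | --- | --- | --- | --- | --- |
| agPerWtChange | 203.349 | 1 | 203.349 | 6.116 | 0.0142 |
| Age | 476.181 | 1 | 476.181 | 14.323 | <0.001 |
| comorbcatsum | 486.085 | 1 | 486.085 | 14.621 | <0.001 |
| agmodup_act | 11.191 | 1 | 11.191 | 0.337 | 0.5624 |
| group | 44.457 | 2 | 22.229 | 0.669 | 0.5135 |
| adepsc | 2,318.876 | 1 | 2,318.876 | 69.748 | <0.001 |
| Residuals | 6,881.993 | 207 | 33.246 | | |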

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.
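The ANOVA table can be cross-checked against the coefficient table. With type III sums of squares, each F is the term's mean square divided by the residual mean square, and for single-degree-of-freedom terms F equals the squared t statistic. The sketch below uses the reported (rounded) values, so agreement is only approximate; the variable and function names are illustrative.

```python
# (term, t from the coefficient table or None, SS, DF, reported F)
anova_rows = [
    ("agPerWtChange", 2.473, 203.349, 1, 6.116),
    ("Age", -3.785, 476.181, 1, 14.323),
    ("comorbcatsum", 3.824, 486.085, 1, 14.621),
    ("agmodup_act", 0.580, 11.191, 1, 0.337),
    ("group", None, 44.457, 2, 0.669),  # 2 DF, so no single t statistic
    ("adepsc", -8.352, 2318.876, 1, 69.748),
]
residual_ms = 6881.993 / 207  # MS = SS / DF for the residual row


def f_from_ss(ss: float, df: int) -> float:
    """F statistic as term mean square over residual mean square."""
    return (ss / df) / residual_ms


for term, t, ss, df, f_reported in anova_rows:
    f_calc = f_from_ss(ss, df)
    t_sq = None if t is None else round(t * t, 3)
    print(term, round(f_calc, 3), f_reported, t_sq)
```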

Visualisation of regression model

The blue line shows the line of best fit, with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

| Term | Statistic | p-value | Results |
| --- | --- | --- | --- |
| agPerWtChange | 0.449 | 0.6539 | No linearity violation |
| Age | −0.277 | 0.7820 | No linearity violation |
| comorbcatsum | 0.381 | 0.7033 | No linearity violation |
| agmodup_act | −0.789 | 0.4309 | No linearity violation |
| group | | | |
| adepsc | 1.293 | 0.1973 | No linearity violation |
| Tukey test | 0.211 | 0.8329 | No linearity violation |

Specification tests for predictors use quadratic terms; for fitted values, curvature is tested using Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results
  • No linearity violation was observed in either plots or tests.
Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
| --- | --- | --- | --- |
| 9.747 | 0.2034 | 7 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.
Model descriptives including Cook’s distance and leverage to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Depression change | 215 | 0.020 | 6.890 | 0.000 | −16.300 | 32.300 | 0.707 | 2.309 |
| agPerWtChange | 215 | −4.471 | 7.648 | −2.879 | −35.702 | 12.220 | −0.996 | 1.650 |
| Age | 215 | 54.660 | 6.771 | 55.000 | 40.000 | 69.000 | 0.001 | −0.669 |
| comorbcatsum | 215 | 1.474 | 1.271 | 1.000 | 0.000 | 5.000 | 0.778 | −0.073 |
| agmodup_act | 215 | −2.098 | 18.624 | −0.952 | −71.016 | 47.405 | −0.404 | 1.480 |
| group* | 215 | 1.963 | 0.825 | 2.000 | 1.000 | 3.000 | 0.069 | −1.535 |
| adepsc | 215 | 47.368 | 6.722 | 49.000 | 41.000 | 62.200 | 0.394 | −1.337 |
| .fitted | 215 | 0.020 | 3.914 | 0.896 | −9.724 | 7.592 | −0.261 | −0.884 |
| .resid | 215 | −0.000 | 5.671 | −0.974 | −11.056 | 25.929 | 0.843 | 1.769 |
| .leverage | 215 | 0.037 | 0.016 | 0.034 | 0.015 | 0.138 | 2.200 | 8.297 |
| .sigma | 215 | 5.766 | 0.028 | 5.774 | 5.479 | 5.780 | −6.208 | 53.826 |
| .cooksd | 215 | 0.005 | 0.010 | 0.002 | 0.000 | 0.098 | 5.621 | 44.016 |
| .std.resid | 215 | 0.000 | 1.002 | −0.172 | −1.949 | 4.580 | 0.841 | 1.752 |
| dfb.1_ | 215 | 0.000 | 0.088 | −0.003 | −0.245 | 0.666 | 2.547 | 16.953 |
| dfb.aPWC | 215 | −0.000 | 0.069 | 0.001 | −0.526 | 0.196 | −2.301 | 15.355 |
| dfb.Age | 215 | −0.000 | 0.083 | −0.002 | −0.617 | 0.304 | −1.833 | 15.932 |
| dfb.cmrb | 215 | 0.000 | 0.068 | 0.001 | −0.294 | 0.240 | −0.333 | 3.575 |
| dfb.agm_ | 215 | −0.000 | 0.061 | 0.001 | −0.315 | 0.268 | −1.033 | 7.011 |
| dfb.grpD | 215 | 0.000 | 0.066 | −0.001 | −0.223 | 0.317 | 0.433 | 3.726 |
| dfb.gE.M | 215 | 0.000 | 0.074 | 0.000 | −0.235 | 0.388 | 0.920 | 4.978 |
| dfb.adps | 215 | −0.000 | 0.068 | 0.012 | −0.309 | 0.212 | −0.851 | 3.045 |
| dffit | 215 | 0.001 | 0.199 | −0.032 | −0.440 | 0.931 | 0.911 | 2.163 |
| cov.r | 215 | 1.041 | 0.068 | 1.058 | 0.459 | 1.134 | −4.218 | 27.075 |

* categorical variable

Cooks threshold

Cook’s distance measures the overall change in fit, if the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted value changes when the ith observation is removed. Potential influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice, this can flag a large number of points, so a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each parameter in the model. Potential influential observations are identified by \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model, so a practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical range of 0.9 to 1.1 was used, with values outside it flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.
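For reference, the size-dependent guideline thresholds quoted above can be evaluated for the depression model (n = 215 observations, p = 8 parameters including the intercept). This is a minimal sketch; the function name and dictionary keys are illustrative only.

```python
from math import sqrt


def influence_thresholds(n: int, p: int) -> dict:
    """Guideline cut-offs for common influence diagnostics."""
    return {
        "cooks_d": 4 / n,            # Cook's distance > 4/n
        "dffits": 2 * sqrt(p / n),   # |DFFITS| > 2*sqrt(p)/sqrt(n)
        "dfbetas": 2 / sqrt(n),      # |DFBETAS| > 2/sqrt(n)
        "covratio": 3 * p / n,       # |COVRATIO - 1| > 3p/n
    }


for name, value in influence_thresholds(n=215, p=8).items():
    print(f"{name}: {value:.4f}")
```

These guideline values are much stricter than the practical cut-offs used in the report (1 for DFFITS and DFBETAS, 0.9 to 1.1 for COVRATIO), which is why the practical cut-offs flag far fewer observations.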

Observations of interest identified by the influence plot

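| ID | StudRes | Leverage | CookD | dfb.1_ | dfb.aPWC | dfb.Age | dfb.cmrb | dfb.agm_ | dfb.grpD | dfb.gE.M | dfb.adps | dffit | cov.r |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20 | −0.269 | 0.086 | 0.001 | −0.020 | 0.018 | 0.004 | −0.001 | 0.074 | 0.016 | 0.019 | 0.024 | −0.082 | 1.134 |
| 97 | 3.133 | 0.029 | 0.035 | −0.245 | 0.035 | 0.279 | −0.141 | −0.216 | 0.033 | 0.315 | 0.063 | 0.543 | 0.738 |
| 157 | 1.661 | 0.138 | 0.055 | −0.139 | −0.526 | −0.040 | 0.108 | −0.315 | 0.005 | 0.123 | 0.197 | 0.666 | 1.085 |
| 131 | 4.820 | 0.036 | 0.098 | 0.666 | 0.068 | −0.617 | 0.115 | −0.182 | −0.038 | 0.388 | −0.309 | 0.931 | 0.459 |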

StudRes = studentized residuals; CookD = Cook's distance, a combined measure of leverage and influence; DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

Two observations had studentized residuals > 3. Both had low leverage and small Cook’s distance, with DFBETAS and DFFITS within conventional ranges. The COVRATIO values indicated observations that may affect confidence interval widths.
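The practical cut-offs can be applied to the four observations of interest as a quick check of the summary above. This is an illustrative sketch: the data structure and function name are assumptions, the Cook's distance cut-off of 0.5 is the lower end of the 0.5 to 1 range mentioned earlier, and `max_dfb` is the largest absolute DFBETAS read from each row of the table.

```python
# id: (StudRes, CookD, max |DFBETAS|, DFFITS, COVRATIO), from the table above
observations = {
    20: (-0.269, 0.001, 0.074, -0.082, 1.134),
    97: (3.133, 0.035, 0.315, 0.543, 0.738),
    157: (1.661, 0.055, 0.526, 0.666, 1.085),
    131: (4.820, 0.098, 0.666, 0.931, 0.459),
}


def flag_observation(stud_res, cook_d, max_dfb, dffit, cov_r):
    """Return the names of the practical cut-offs an observation exceeds."""
    flags = []
    if abs(stud_res) > 3:
        flags.append("studres")
    if cook_d > 0.5:
        flags.append("cooks_d")
    if max_dfb > 1:
        flags.append("dfbetas")
    if abs(dffit) > 1:
        flags.append("dffits")
    if not 0.9 <= cov_r <= 1.1:
        flags.append("covratio")
    return flags


for obs_id, values in observations.items():
    print(obs_id, flag_observation(*values))
```

Under these cut-offs only the studentized residual and COVRATIO criteria are triggered, which matches the conclusion that point estimates are stable while interval widths may be affected.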

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

| Statistic | p-value | Method |
| --- | --- | --- |
| 0.082 | 0.1083 | Asymptotic one-sample Kolmogorov-Smirnov test |
| 0.960 | <0.001 | Shapiro-Wilk normality test |

Normality results
  • The Kolmogorov-Smirnov test supports residuals being normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot looks roughly normal.
Assessing collinearity with VIF

| Term | VIF | Tolerance |
| --- | --- | --- |
| agPerWtChange | 1.040 | 0.961 |
| Age | 1.058 | 0.945 |
| comorbcatsum | 1.061 | 0.942 |
| agmodup_act | 1.038 | 0.963 |
| group | 1.015 | 0.985 |
| adepsc | 1.023 | 0.977 |

VIF = Variance Inflation Factor.

Collinearity results
  • All VIF values are under three, indicating no collinearity issues.
  • Overall, when taking into account VIF and SE, the model does not have collinearity issues.
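Tolerance is the reciprocal of VIF, so the two columns in the collinearity table carry the same information. The short check below recomputes tolerance from the reported VIF values; differences in the third decimal reflect rounding of the reported VIFs. The variable names are illustrative.

```python
# term: (VIF, reported tolerance), from the collinearity table above
vif_table = {
    "agPerWtChange": (1.040, 0.961),
    "Age": (1.058, 0.945),
    "comorbcatsum": (1.061, 0.942),
    "agmodup_act": (1.038, 0.963),
    "group": (1.015, 0.985),
    "adepsc": (1.023, 0.977),
}

for term, (vif, tolerance) in vif_table.items():
    # Tolerance = 1 / VIF; a VIF under 3 corresponds to tolerance above 1/3.
    print(term, round(1 / vif, 3), tolerance, vif < 3)
```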
Assessing independence with the Durbin–Watson test for autocorrelation

| AutoCorrelation | Statistic | p-value |
| --- | --- | --- |
| −0.063 | 2.123 | 0.3800 |

Independence results
  • The Durbin–Watson test suggests there are no auto-correlation issues.
  • Although the study design was longitudinal, change scores were used; therefore, there was no violation of independence.
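The Durbin–Watson statistic is approximately 2(1 − r), where r is the lag-1 autocorrelation of the residuals, so values near 2 indicate little autocorrelation. The sketch below applies this well-known approximation to the reported autocorrelation; the function name is illustrative, and the result is not expected to match the exact statistic to the third decimal.

```python
def dw_from_autocorrelation(r: float) -> float:
    """Approximate Durbin-Watson statistic from lag-1 autocorrelation."""
    return 2 * (1 - r)


# Reported lag-1 autocorrelation for the depression model residuals.
print(round(dw_from_autocorrelation(-0.063), 3))
```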
Assumption conclusions

Residual diagnostics did not indicate major violations of linear model assumptions; however, structured residual patterns were observed, consistent with an ordinal measure being treated as continuous and the limited variation in baseline depression scores. Outlier diagnostics indicated that point estimates were unlikely to be substantially affected by influential points, but confidence-interval width could be affected and should be further investigated.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Depression change

Change in regression coefficients

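| term | O_B | R_B | Change.B | reproduce.B |
| --- | --- | --- | --- | --- |
| Intercept | | 35.1272 | | |
| agPerWtChange | 0.13 | 0.1326 | 0.0026 | Reproduced |
| Age | | −0.2332 | | |
| comorbcatsum | | 1.2586 | | |
| agmodup_act | | 0.0127 | | |
| group: | | | | |
| Discussion – Internet Only | | −0.3521 | | |
| E-Mail – Internet Only | | 0.7711 | | |
| adepsc | | −0.5010 | | |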

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

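| term | O_lower | R_lower | Change.lci | Reproduce.lower |
| --- | --- | --- | --- | --- |
| Intercept | | 26.4519 | | |
| agPerWtChange | 0.03 | 0.0269 | −0.0031 | Reproduced |
| Age | | −0.3546 | | |
| comorbcatsum | | 0.6097 | | |
| agmodup_act | | −0.0306 | | |
| group: | | | | |
| Discussion – Internet Only | | −2.2754 | | |
| E-Mail – Internet Only | | −1.1297 | | |
| adepsc | | −0.6193 | | |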

O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

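| term | O_upper | R_upper | Change.uci | Reproduce.upper |
| --- | --- | --- | --- | --- |
| Intercept | | 43.8025 | | |
| agPerWtChange | 0.24 | 0.2382 | −0.0018 | Reproduced |
| Age | | −0.1117 | | |
| comorbcatsum | | 1.9076 | | |
| agmodup_act | | 0.0561 | | |
| group: | | | | |
| Discussion – Internet Only | | 1.5711 | | |
| E-Mail – Internet Only | | 2.6718 | | |
| adepsc | | −0.3827 | | |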

O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in Adjusted R2

| O_R2Adj | R_R2Adj | Change.R2Adj | Reproduce.R2Adj |
| --- | --- | --- | --- |
| 0.300 | 0.2998 | −0.0002 | Reproduced |

O_R2Adj = original R2 Adjusted; R_R2Adj = reproduced R2 Adjusted; Change.R2Adj = change in R2Adj (R2Adj - O_R2Adj); Reproduce.R2Adj = R2 Adjusted reproduced.

Change in global F

| Term | O_F | R_F | Change.F | Reproduce.F |
| --- | --- | --- | --- | --- |
| Intercept | 14.09 | 14.0872 | −0.0028 | Reproduced |

O_F = original global F; R_F = reproduced global F; Change.F = change in R_F - O_F; Reproduce.F = Global F reproduced.

Change in p-values

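| Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
| --- | --- | --- | --- | --- | --- |
| Intercept | | <0.001 | | | |
| agPerWtChange | 0.01 | 0.0142 | 0.0042 | Reproduced | Remains sig, B same direction |
| Age | | <0.001 | | | |
| comorbcatsum | | <0.001 | | | |
| agmodup_act | | 0.5624 | | | |
| group: | | | | | |
| Discussion – Internet Only | | 0.7185 | | | |
| E-Mail – Internet Only | | 0.4248 | | | |
| adepsc | | <0.001 | | | |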

O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p - O_p); Reproduce.p = p-value reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note that p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values
  • The p-value was reproduced.

Conclusion computational reproducibility

This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, residual diagnostics indicated a small number of observations that may contribute to wider confidence intervals. All continuous variables in the model were standardized, and inference was assessed using bootstrapped standardized regression coefficients and their corresponding 95% confidence intervals. Percentage and absolute changes in estimates and confidence-interval ranges relative to the original linear model were summarised using thresholds of 10% change and standardized coefficient differences of <0.10 and <0.20. Consistency of coefficient direction and statistical significance was also evaluated.
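A standardized coefficient for a continuous predictor can be approximated from the unstandardized coefficient as B × SD(x) / SD(y). The sketch below applies this to values reported elsewhere in this report (reproduced B, and SDs from the model descriptives). It is an illustration rather than the procedure actually used, which refitted the model on standardized variables, so small differences from the tabulated standardized coefficients are expected.

```python
def standardized_beta(b: float, sd_x: float, sd_y: float) -> float:
    """Approximate standardized coefficient from B and the two SDs."""
    return b * sd_x / sd_y


sd_outcome = 6.890  # SD of Depression change

# term: (reproduced B, SD of predictor)
predictors = {
    "agPerWtChange": (0.1326, 7.648),
    "Age": (-0.2332, 6.771),
    "comorbcatsum": (1.2586, 1.271),
}

for term, (b, sd_x) in predictors.items():
    print(term, round(standardized_beta(b, sd_x, sd_outcome), 4))
```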

Bootstrapped results

A non-parametric bootstrap with bias-corrected and accelerated (BCa) confidence intervals was performed using 10,000 resamples.
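To illustrate the idea, the sketch below computes a BCa interval for the slope of a simple linear regression using only the Python standard library. It is not the code used for this report: the data are synthetic, the model has a single predictor, the function names are hypothetical, and 2,000 resamples are used instead of 10,000 to keep the example fast.

```python
import random
from statistics import NormalDist, mean


def slope(xs, ys):
    """Ordinary least squares slope for one predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def bca_interval(xs, ys, n_boot=2000, alpha=0.05, seed=1):
    """Case-resampling bootstrap with a BCa confidence interval."""
    rng = random.Random(seed)
    n = len(xs)
    theta_hat = slope(xs, ys)
    boots = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) > 1:  # skip degenerate resamples
            boots.append(slope(bx, by))
    boots.sort()
    m = len(boots)
    nd = NormalDist()
    # Bias correction: share of bootstrap estimates below the estimate.
    prop = sum(b < theta_hat for b in boots) / m
    prop = min(max(prop, 1 / (m + 1)), m / (m + 1))
    z0 = nd.inv_cdf(prop)
    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    lo_q, hi_q = adjusted(alpha / 2), adjusted(1 - alpha / 2)
    lo = boots[min(m - 1, max(0, int(lo_q * m)))]
    hi = boots[min(m - 1, max(0, int(hi_q * m)))]
    return theta_hat, lo, hi


# Synthetic data for illustration only.
data_rng = random.Random(0)
xs = [data_rng.gauss(0, 1) for _ in range(60)]
ys = [0.5 * x + data_rng.gauss(0, 1) for x in xs]
est, lo, hi = bca_interval(xs, ys)
print(round(est, 3), round(lo, 3), round(hi, 3))
```

Compared with a plain percentile interval, BCa shifts the percentiles to account for median bias (z0) and skewness (a) in the bootstrap distribution, which is useful when a few large residuals are present.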

Change in regression coefficients

| Term | B | boot.B | B_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intercept | −0.0252 | −0.0263 | 0.0011 | 4.2500 | No | No | No |
| z_agPerWtChange | 0.1470 | 0.1507 | −0.0038 | −2.5600 | No | No | No |
| z_Age | −0.2294 | −0.2296 | 0.0002 | 0.0900 | No | No | No |
| z_comorbcatsum | 0.2317 | 0.2309 | 0.0008 | 0.3600 | No | No | No |
| z_agmodup_act | 0.0346 | 0.0362 | −0.0017 | −4.8300 | No | No | No |
| groupDiscussion | −0.0511 | −0.0491 | −0.0020 | −3.9800 | No | No | No |
| groupE-Mail | 0.1119 | 0.1123 | −0.0004 | −0.3300 | No | No | No |
| z_adepsc | −0.4944 | −0.4965 | 0.0021 | 0.4300 | No | No | No |

B = standardized reproduced regression coefficient; boot.B = bootstrapped standardized reproduced B; B_diff = change in B - boot.B; %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

| Term | Lower | boot.Lower | Lower_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intercept | −0.2149 | −0.1854 | −0.0295 | −13.7300 | Yes | No | No |
| z_agPerWtChange | 0.0298 | 0.0315 | −0.0017 | −5.6200 | No | No | No |
| z_Age | −0.3489 | −0.3750 | 0.0261 | 7.4700 | No | No | No |
| z_comorbcatsum | 0.1122 | 0.1172 | −0.0050 | −4.4100 | No | No | No |
| z_agmodup_act | −0.0829 | −0.0743 | −0.0087 | −10.4400 | Yes | No | No |
| groupDiscussion | −0.3302 | −0.3154 | −0.0148 | −4.4700 | No | No | No |
| groupE-Mail | −0.1640 | −0.1546 | −0.0093 | −5.6900 | No | No | No |
| z_adepsc | −0.6111 | −0.6111 | 0.0000 | 0.0000 | No | No | No |

Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized reproduced lower CI; Lower_diff = change in Lower - boot.Lower; %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

| Term | Upper | boot.Upper | Upper_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intercept | 0.1646 | 0.1463 | 0.0183 | 11.0900 | Yes | No | No |
| z_agPerWtChange | 0.2641 | 0.2551 | 0.0090 | 3.4200 | No | No | No |
| z_Age | −0.1099 | −0.0997 | −0.0102 | −9.2600 | No | No | No |
| z_comorbcatsum | 0.3511 | 0.3491 | 0.0021 | 0.5800 | No | No | No |
| z_agmodup_act | 0.1521 | 0.1284 | 0.0237 | 15.5700 | Yes | No | No |
| groupDiscussion | 0.2280 | 0.2098 | 0.0182 | 7.9700 | No | No | No |
| groupE-Mail | 0.3878 | 0.4137 | −0.0259 | −6.6800 | No | No | No |
| z_adepsc | −0.3777 | −0.3897 | 0.0120 | 3.1800 | No | No | No |

Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized reproduced upper CI; Upper_diff = change in Upper - boot.Upper; %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

| Term | Range | boot.Range | Range_Diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intercept | 0.3795 | 0.3317 | −0.0478 | −12.5900 | Yes | No | No |
| z_agPerWtChange | 0.2343 | 0.2236 | −0.0107 | −4.5700 | No | No | No |
| z_Age | 0.2390 | 0.2752 | 0.0362 | 15.1600 | Yes | No | No |
| z_comorbcatsum | 0.2389 | 0.2319 | −0.0070 | −2.9300 | No | No | No |
| z_agmodup_act | 0.2350 | 0.2026 | −0.0323 | −13.7600 | Yes | No | No |
| groupDiscussion | 0.5582 | 0.5253 | −0.0329 | −5.9000 | No | No | No |
| groupE-Mail | 0.5517 | 0.5683 | 0.0166 | 3.0000 | No | No | No |
| z_adepsc | 0.2334 | 0.2214 | −0.0120 | −5.1400 | No | No | No |

Range = standardized reproduced CI range; boot.Range = bootstrapped standardized reproduced CI range; Range_Diff = change in CI range (boot.Range - Range); %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

| Term | p-value | boot.p-value | changep | SigChangeDirection |
| --- | --- | --- | --- | --- |
| Intercept | 0.7937 | 0.7563 | 0.0375 | Remains non-sig, B same direction |
| z_agPerWtChange | 0.0142 | 0.0085 | 0.0057 | Remains sig, B same direction |
| z_Age | <0.001 | 0.0011 | −0.0009 | Remains sig, B same direction |
| z_comorbcatsum | <0.001 | <0.001 | 0.0001 | Remains sig, B same direction |
| z_agmodup_act | 0.5624 | 0.4808 | 0.0816 | Remains non-sig, B same direction |
| groupDiscussion | 0.7185 | 0.7126 | 0.0058 | Remains non-sig, B same direction |
| groupE-Mail | 0.4248 | 0.4397 | −0.0149 | Remains non-sig, B same direction |
| z_adepsc | <0.001 | <0.001 | 0.0000 | Remains sig, B same direction |

p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized reproduced p-value; changep = change in p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check the distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Conclusions based on the bootstrapped model

This model was inferentially reproducible. While some statistics changed by 10% or more, these differences were not meaningful, with a change in standardized regression coefficients of less than 0.1. The direction of effects and statistical significance remained consistent between the reproduced and bootstrapped models.

Model 2

Model results for Sleep disturbance change

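| Term | B | SE | Lower | Upper | t | p-value |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | | | | | | |
| agPerWtChange | 0.03 | | −0.07 | 0.13 | | 0.67 |
| Age | | | | | | |
| comorbcatsum | | | | | | |
| agmodup_act | | | | | | |
| group: | | | | | | |
| Discussion – Internet Only | | | | | | |
| E-Mail – Internet Only | | | | | | |
| asldsc | | | | | | |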

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Sleep disturbance change

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | 0.16 | | | 6.86 | | | |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Sleep disturbance change

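| Term | SS | DF | MS | F | p-value |
| --- | --- | --- | --- | --- | --- |
| agPerWtChange | | | | | |
| Age | | | | | |
| comorbcatsum | | | | | |
| agmodup_act | | | | | |
| group | | | | | |
| asldsc | | | | | |
| Residuals | | | | | |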

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Sleep disturbance change

| Term | B | SE | Lower | Upper | t | p-value |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 22.625 | 4.195 | 14.355 | 30.895 | 5.394 | <0.001 |
| agPerWtChange | 0.030 | 0.053 | −0.074 | 0.134 | 0.564 | 0.5734 |
| Age | −0.109 | 0.060 | −0.228 | 0.010 | −1.812 | 0.0714 |
| comorbcatsum | 0.479 | 0.324 | −0.160 | 1.117 | 1.478 | 0.1408 |
| agmodup_act | 0.006 | 0.021 | −0.036 | 0.048 | 0.292 | 0.7707 |
| group: | | | | | | |
| Discussion – Internet Only | −0.465 | 0.952 | −2.342 | 1.412 | −0.488 | 0.6258 |
| E-Mail – Internet Only | 0.417 | 0.950 | −1.455 | 2.290 | 0.439 | 0.6608 |
| asldsc | −0.349 | 0.053 | −0.454 | −0.243 | −6.521 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Model fit for Sleep disturbance change

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.433 | 0.188 | 0.160 | 1,373.052 | 5.572 | 6.864 | 7 | 208 | <0.001 |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Sleep disturbance change

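| Term | SS | DF | MS | F | p-value |
| --- | --- | --- | --- | --- | --- |
| agPerWtChange | 10.256 | 1 | 10.256 | 0.318 | 0.5734 |
| Age | 105.847 | 1 | 105.847 | 3.283 | 0.0714 |
| comorbcatsum | 70.472 | 1 | 70.472 | 2.186 | 0.1408 |
| agmodup_act | 2.747 | 1 | 2.747 | 0.085 | 0.7707 |
| group | 26.478 | 2 | 13.239 | 0.411 | 0.6638 |
| asldsc | 1,370.835 | 1 | 1,370.835 | 42.518 | <0.001 |
| Residuals | 6,706.214 | 208 | 32.241 | | |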

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the line of best fit, with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

| Term | Statistic | p-value | Results |
| --- | --- | --- | --- |
| agPerWtChange | 1.731 | 0.0850 | No linearity violation |
| Age | 1.479 | 0.1406 | No linearity violation |
| comorbcatsum | 1.315 | 0.1899 | No linearity violation |
| agmodup_act | −0.186 | 0.8525 | No linearity violation |
| group | | | |
| asldsc | −0.555 | 0.5798 | No linearity violation |
| Tukey test | −0.713 | 0.4758 | No linearity violation |

Specification tests for predictors use quadratic terms; for fitted values, curvature is tested using Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

No linearity violation was observed in either plots or tests.

Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
| --- | --- | --- | --- |
| 10.752 | 0.1498 | 7 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.
Model descriptives including Cook’s distance and leverage to understand outliers

Term

N

Mean

SD

Median

Min

Max

Skewness

Kurtosis

Sleep disturbance change

216

0.262

6.197

0.000

−20.400

31.300

0.568

3.132

agPerWtChange

216

−4.446

7.640

−2.875

−35.702

12.220

−1.003

1.664

Age

216

54.699

6.779

55.000

40.000

69.000

−0.007

−0.680

comorbcatsum

216

1.477

1.268

1.000

0.000

5.000

0.773

−0.070

agmodup_act

216

−1.961

18.689

−0.940

−71.016

47.405

−0.402

1.437

group*

216

1.963

0.823

2.000

1.000

3.000

0.068

−1.528

asldsc

216

48.531

7.327

48.400

32.000

68.800

−0.430

0.280

.fitted

216

0.262

2.684

0.110

−6.993

7.648

0.207

0.173

.resid

216

0.000

5.585

−0.158

−18.553

27.011

0.510

3.042

.leverage

216

0.037

0.016

0.033

0.014

0.127

1.908

5.714

.sigma

216

5.678

0.032

5.687

5.355

5.692

−6.442

53.018

.cooksd

216

0.005

0.015

0.001

0.000

0.168

7.363

65.581

.std.resid

216

0.001

1.004

−0.028

−3.335

4.889

0.530

3.130

dfb.1_

216

0.000

0.071

0.000

−0.297

0.316

0.043

4.002

dfb.aPWC

216

−0.000

0.073

0.000

−0.481

0.411

−0.986

14.448

dfb.Age

216

0.000

0.067

−0.001

−0.329

0.232

−0.511

5.335

dfb.cmrb

216

0.000

0.074

−0.001

−0.241

0.619

3.078

25.828

dfb.agm_

216

0.000

0.084

0.000

−0.492

0.562

−0.110

17.911

dfb.grpD

216

−0.000

0.070

−0.001

−0.326

0.245

−0.299

4.099

dfb.gE.M

216

−0.000

0.075

0.000

−0.421

0.312

−0.339

6.114

dfb.asld

216

−0.000

0.079

0.000

−0.572

0.263

−1.403

13.985

dffit

216

0.003

0.211

−0.005

−0.699

1.230

1.171

6.941

cov.r

216

1.042

0.075

1.059

0.413

1.164

−4.481

28.054

* categorical variable

Cook’s threshold

Cook’s distance measures the overall change in fit if the ith observation is removed. Potentially influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted values will change when the ith observation is removed. Potentially influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice, this rule can flag a large number of points, so a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each model parameter. Potentially influential observations are identified by \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model, so a practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. Values outside a practical range of 0.9 to 1.1 were flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.
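
To make the conventional guidelines concrete, the sketch below evaluates them for this model. It assumes n = 216 (from the descriptives table) and p = 8, counted from the eight DFBETAS columns (intercept plus seven coefficients); the function name is illustrative only.

```python
import math


def influence_thresholds(n, p):
    """Conventional size-adjusted cut-offs for common influence measures."""
    return {
        "cooks_d": 4 / n,
        "dffits": 2 * math.sqrt(p / n),
        "dfbetas": 2 / math.sqrt(n),
        # Observations outside this band exceed |COVRATIO - 1| > 3p/n.
        "covratio_band": (1 - 3 * p / n, 1 + 3 * p / n),
    }


# Assumed values: n from the descriptives table, p from the DFBETAS columns.
thresholds = influence_thresholds(n=216, p=8)
for name, value in thresholds.items():
    print(name, value)
```

These size-adjusted cut-offs are much stricter than the practical cut-offs of 1 (DFFITS, DFBETAS) and 0.5 to 1 (Cook's distance), which is why they tend to flag many more observations.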

Observations of interest identified by the influence plot

ID | StudRes | Leverage | CookD | dfb.1_ | dfb.aPWC | dfb.Age | dfb.cmrb | dfb.agm_ | dfb.grpD | dfb.gE.M | dfb.asld | dffit | cov.r
134 | −0.156 | 0.108 | 0.000 | −0.018 | 0.040 | 0.005 | −0.014 | 0.012 | −0.005 | −0.012 | 0.029 | −0.054 | 1.164
157 | 1.528 | 0.127 | 0.042 | −0.061 | −0.481 | −0.035 | 0.111 | −0.300 | 0.028 | 0.112 | 0.074 | 0.583 | 1.089
89 | 3.866 | 0.056 | 0.103 | 0.176 | 0.411 | −0.310 | −0.241 | 0.562 | −0.108 | 0.312 | 0.219 | 0.938 | 0.630
71 | 5.184 | 0.053 | 0.168 | 0.223 | −0.331 | 0.168 | 0.619 | 0.302 | −0.326 | −0.421 | −0.572 | 1.230 | 0.413

StudRes = studentized residuals; CookD = Cook's Distance, a combined measure of leverage and influence; DFBETAS (dfb.*) measure how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.
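
The practical cut-offs described earlier can be applied directly to the four observations in the table above. This is a minimal sketch: the cut-offs (|StudRes| > 3, Cook's distance > 0.5, |DFFITS| > 1, COVRATIO outside 0.9 to 1.1) are those stated in this report, while the function and variable names are hypothetical.

```python
# (ID, StudRes, CookD, dffit, cov.r) copied from the table above.
observations = [
    (134, -0.156, 0.000, -0.054, 1.164),
    (157, 1.528, 0.042, 0.583, 1.089),
    (89, 3.866, 0.103, 0.938, 0.630),
    (71, 5.184, 0.168, 1.230, 0.413),
]


def flag(stud_res, cook_d, dffit, cov_r):
    """Return the names of the practical cut-offs an observation exceeds."""
    flags = []
    if abs(stud_res) > 3:
        flags.append("StudRes")
    if cook_d > 0.5:
        flags.append("CookD")
    if abs(dffit) > 1:
        flags.append("DFFITS")
    if not 0.9 <= cov_r <= 1.1:
        flags.append("COVRATIO")
    return flags


flagged = {obs_id: flag(*values) for obs_id, *values in observations}
for obs_id, flags in flagged.items():
    print(obs_id, flags or "none")
```

Under these cut-offs no observation exceeds the Cook's distance threshold, while three have a COVRATIO outside the practical range.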

Results for outliers and influential points

Two observations had studentized residuals > 3. DFBETAS and Cook’s distance were within conventional ranges, but the DFFIT for the largest outlier may be of concern. The COVRATIO indicated observations that may affect confidence interval widths.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic | p-value | Method
0.055 | 0.5349 | Asymptotic one-sample Kolmogorov-Smirnov test
0.961 | <0.001 | Shapiro-Wilk normality test

Normality results
  • The Kolmogorov-Smirnov test supports residuals being normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • The Q–Q plot looks roughly normal.
Assessing collinearity with VIF

Term | VIF | Tolerance
agPerWtChange | 1.039 | 0.962
Age | 1.057 | 0.946
comorbcatsum | 1.061 | 0.943
agmodup_act | 1.032 | 0.969
group | 1.012 | 0.988
asldsc | 1.012 | 0.988

VIF = Variance Inflation Factor.
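
Tolerance is the reciprocal of the VIF, so the two columns carry the same information. The short sketch below recomputes tolerance from the rounded VIF values reported above; the dictionary names are illustrative.

```python
# VIF values copied from the collinearity table.
vif = {
    "agPerWtChange": 1.039,
    "Age": 1.057,
    "comorbcatsum": 1.061,
    "agmodup_act": 1.032,
    "group": 1.012,
    "asldsc": 1.012,
}

# Tolerance = 1 / VIF, rounded to the precision used in the table.
tolerance = {term: round(1 / value, 3) for term, value in vif.items()}

for term, value in tolerance.items():
    print(f"{term}: {value:.3f}")
```

A tolerance close to 1 means that very little of a predictor's variance is explained by the other predictors.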

Collinearity results
  • All VIF values are under three, indicating no collinearity issues.
  • Overall, when taking into account VIF and SE, the model does not have collinearity issues.
Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation | Statistic | p-value
−0.088 | 2.174 | 0.1460

Independence results
  • The Durbin–Watson test suggests there are no auto-correlation issues.
  • Although the study design was longitudinal, change scores were used, so the independence assumption was not violated.
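
The Durbin–Watson statistic is computed from successive residual differences and is approximately 2(1 − r), where r is the lag-1 autocorrelation. A minimal sketch follows; the function name is illustrative, and the toy residuals are made up purely to show the calculation.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the residual sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Toy residuals (not study data): perfectly alternating signs push DW above 2.
toy_dw = durbin_watson([1, -1, 1, -1])

# Approximation using the lag-1 autocorrelation reported in the table above.
approx_dw = 2 * (1 - (-0.088))
print(toy_dw, round(approx_dw, 3))
```

The approximation gives a value close to the reported statistic of 2.174. Values near 2 indicate little first-order autocorrelation, although the statistic depends on the order of the rows in the dataset.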
Assumption conclusions

Residual diagnostics did not indicate major violations of linear model assumptions. However, structured residual patterns were observed, consistent with an ordinal measure being treated as continuous and the limited variation in baseline scores. Outlier diagnostics indicated that a couple of observations may be a concern and should be further investigated.

Forest plot showing Original and Reproduced coefficients and 95% confidence intervals for Sleep disturbance change

Change in regression coefficients

term

O_B

R_B

Change.B

reproduce.B

Intercept

22.6248

agPerWtChange

0.03

0.0297

−0.0003

Reproduced

Age

−0.1094

comorbcatsum

0.4787

agmodup_act

0.0062

group:

Discussion – Internet Only

−0.4649

E-Mail – Internet Only

0.4175

asldsc

−0.3487

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

term

O_lower

R_lower

Change.lci

Reproduce.lower

Intercept

14.3551

agPerWtChange

−0.07

−0.0741

−0.0041

Reproduced

Age

−0.2285

comorbcatsum

−0.1596

agmodup_act

−0.0359

group:

Discussion – Internet Only

−2.3415

E-Mail – Internet Only

−1.4551

asldsc

−0.4541

O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

term

O_upper

R_upper

Change.uci

Reproduce.upper

Intercept

30.8945

agPerWtChange

0.13

0.1335

0.0035

Reproduced

Age

0.0096

comorbcatsum

1.1170

agmodup_act

0.0484

group:

Discussion – Internet Only

1.4118

E-Mail – Internet Only

2.2900

asldsc

−0.2433

O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in Adjusted R2

O_R2Adj | R_R2Adj | Change.R2Adj | Reproduce.R2Adj
0.160 | 0.1603 | 0.0003 | Reproduced

O_R2Adj = original R2 Adjusted; R_R2Adj = reproduced R2 Adjusted; Change.R2Adj = change in R2Adj (R2Adj - O_R2Adj); Reproduce.R2Adj = R2 Adjusted reproduced.

Change in global F

Term | O_F | R_F | Change.F | Reproduce.F
Intercept | 6.86 | 6.8637 | 0.0037 | Reproduced

O_F = original global F; R_F = reproduced global F; Change.F = change in R_F - O_F; Reproduce.F = Global F reproduced.

Change in p-values

Term

O_p

R_p

Change.p

Reproduce.p

SigChangeDirection

Intercept

<0.001

agPerWtChange

0.67

0.5734

−0.0966

Not Reproduced

Remains non-sig, B same direction

Age

0.0714

comorbcatsum

0.1408

agmodup_act

0.7707

group:

Discussion – Internet Only

0.6258

E-Mail – Internet Only

0.6608

asldsc

<0.001

O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p - O_p); Reproduce.p = p-values reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note: p-values that were <0.001 were set to 0.00099 for the purposes of comparison.
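
One way to compare a published value with a reproduced one is to round the reproduced value to the number of decimals reported in the paper and check for equality. The sketch below applies that rule to the agPerWtChange coefficient and p-value. The rule itself is an assumption made for illustration and may differ from the criteria used to fill the Reproduce columns; the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def compare(original: str, reproduced: str):
    """Return (reproduced - original, whether they agree at the reported precision)."""
    orig = Decimal(original)
    repro = Decimal(reproduced)
    change = repro - orig
    # quantize() rounds to the same number of decimals as the published value.
    agrees = repro.quantize(orig, rounding=ROUND_HALF_UP) == orig
    return change, agrees


b_change, b_agrees = compare("0.03", "0.0297")    # coefficient B
p_change, p_agrees = compare("0.67", "0.5734")    # p-value

print(b_change, b_agrees)
print(p_change, p_agrees)
```

Decimal arithmetic is used so that the reported changes are exact rather than subject to binary floating-point rounding.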

Results for p-values
  • The p-value was not reproduced.

Conclusion computational reproducibility

This model was mostly computationally reproducible. A likely typographic error was identified in the reported p-value, although the p-value had the same interpretation and the regression coefficients did not change direction.

Methods

The model was successfully reproduced; however, residual diagnostics indicated a small number of observations that may contribute to wider confidence intervals. All continuous variables in the model were standardized, and inference was assessed using bootstrapped standardized regression coefficients and their corresponding 95% confidence intervals. Percentage and absolute changes in estimates and confidence-interval ranges relative to the original linear model were summarised using thresholds of 10% change and standardized coefficient differences of <0.10 and <0.20. Consistency of coefficient direction and statistical significance was also evaluated.
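
For a continuous predictor, standardizing both the predictor and the outcome rescales the slope by SD(x) / SD(y). The sketch below applies this to three reproduced coefficients, using the standard deviations from the descriptives table for this model. It assumes the outcome was standardized as well, which is consistent with the near-zero standardized intercept; the function name is illustrative.

```python
def standardized_beta(b, sd_x, sd_y):
    """Convert an unstandardized slope to a standardized one."""
    return b * sd_x / sd_y


SD_Y = 6.197  # SD of Sleep disturbance change (descriptives table)

# Reproduced B and predictor SD for each term.
betas = {
    "agPerWtChange": standardized_beta(0.0297, 7.640, SD_Y),
    "Age": standardized_beta(-0.1094, 6.779, SD_Y),
    "asldsc": standardized_beta(-0.3487, 7.327, SD_Y),
}

for term, beta in betas.items():
    print(f"{term}: {beta:.4f}")
```

The results can be compared with the B column of the standardized coefficient table; small discrepancies arise from rounding of the inputs.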

Bootstrapped results

A non-parametric bootstrap with bias-corrected and accelerated (BCa) confidence intervals was performed using 10,000 resamples.
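
As an illustration of the technique only, the sketch below computes a BCa interval for the slope of a single-predictor regression on simulated data, using case resampling. It is not the code used for this report: the data, the seed, the 2,000 resamples, and all function names are assumptions chosen to keep the example small.

```python
import random
from statistics import NormalDist, mean


def slope(xs, ys):
    """OLS slope of y on x for a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def bca_interval(xs, ys, n_boot=2000, alpha=0.05, seed=1):
    """Return (estimate, lower, upper) using a BCa bootstrap interval."""
    rng = random.Random(seed)
    n = len(xs)
    estimate = slope(xs, ys)

    # Case resampling: draw (x, y) pairs with replacement.
    boots = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boots.append(slope([xs[i] for i in idx], [ys[i] for i in idx]))
    boots.sort()

    nd = NormalDist()
    # Bias correction: share of bootstrap estimates below the original estimate.
    share = sum(b < estimate for b in boots) / n_boot
    share = min(max(share, 1 / n_boot), 1 - 1 / n_boot)
    z0 = nd.inv_cdf(share)

    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den

    def adjusted_quantile(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    def pick(q):
        return boots[min(n_boot - 1, max(0, int(q * n_boot)))]

    lower = pick(adjusted_quantile(alpha / 2))
    upper = pick(adjusted_quantile(1 - alpha / 2))
    return estimate, lower, upper


# Simulated data (not study data): true slope 0.4 with unit-variance noise.
data_rng = random.Random(0)
xs = [data_rng.gauss(0, 1) for _ in range(60)]
ys = [0.4 * x + data_rng.gauss(0, 1) for x in xs]

estimate, lower, upper = bca_interval(xs, ys)
print(f"slope = {estimate:.3f}, 95% BCa CI = ({lower:.3f}, {upper:.3f})")
```

Compared with a plain percentile interval, BCa shifts the quantiles it reads from the bootstrap distribution to account for median bias (z0) and skewness (a).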

Change in regression coefficients

Term

B

boot.B

B_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.0028

0.0026

0.0002

8.0500

No

No

No

z_agPerWtChange

0.0366

0.0377

−0.0011

−2.9100

No

No

No

z_Age

−0.1197

−0.1216

0.0019

1.5500

No

No

No

z_comorbcatsum

0.0980

0.0981

−0.0001

−0.0700

No

No

No

z_agmodup_act

0.0188

0.0197

−0.0009

−4.7200

No

No

No

groupDiscussion

−0.0750

−0.0752

0.0001

0.1800

No

No

No

groupE-Mail

0.0674

0.0655

0.0018

2.7300

No

No

No

z_asldsc

−0.4123

−0.4136

0.0013

0.3200

No

No

No

B = standardized reproduced regression coefficient; boot.B = bootstrapped standardized reproduced B; B_diff = B - boot.B; %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

Term

Lower

boot.Lower

Lower_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

−0.2047

−0.2249

0.0202

9.8500

No

No

No

z_agPerWtChange

−0.0914

−0.0949

0.0036

3.9000

No

No

No

z_Age

−0.2500

−0.2435

−0.0065

−2.6100

No

No

No

z_comorbcatsum

−0.0327

−0.0163

−0.0164

−50.2500

Yes

No

No

z_agmodup_act

−0.1084

−0.1312

0.0229

21.1000

Yes

No

No

groupDiscussion

−0.3779

−0.3820

0.0041

1.1000

No

No

No

groupE-Mail

−0.2348

−0.2608

0.0260

11.0700

Yes

No

No

z_asldsc

−0.5370

−0.5651

0.0281

5.2300

No

No

No

Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized reproduced lower CI; Lower_diff = Lower - boot.Lower; %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.
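
The flag columns follow from the difference and percentage difference. The sketch below reproduces the z_comorbcatsum row of the lower confidence limit table. Dividing by the absolute value of the original estimate is an assumption inferred from the signs in the table, and the function name is hypothetical.

```python
def change_flags(original, bootstrapped):
    """Difference, percentage difference, and threshold flags for one row."""
    diff = original - bootstrapped
    # Assumed convention: percentage is relative to |original|.
    pct = 100 * diff / abs(original)
    return {
        "diff": diff,
        "pct": pct,
        "Diff_10%": abs(pct) >= 10,
        "Diff_0.1": abs(diff) >= 0.1,
        "Diff_0.2": abs(diff) >= 0.2,
    }


# z_comorbcatsum: Lower and boot.Lower from the table above.
row = change_flags(-0.0327, -0.0163)
print(row)
```

The percentage computed from the rounded table values differs slightly from the tabulated value, which was presumably calculated from unrounded estimates. The row also illustrates why a large percentage change can accompany a negligible absolute change when the original value is close to zero.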

Change in upper 95% confidence interval

Term

Upper

boot.Upper

Upper_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.2103

0.2521

−0.0418

−19.8700

Yes

No

No

z_agPerWtChange

0.1646

0.1652

−0.0006

−0.3700

No

No

No

z_Age

0.0105

0.0030

0.0076

71.9900

Yes

No

No

z_comorbcatsum

0.2286

0.2556

−0.0269

−11.7700

Yes

No

No

z_agmodup_act

0.1460

0.1669

−0.0209

−14.3200

Yes

No

No

groupDiscussion

0.2278

0.2174

0.0104

4.5800

No

No

No

groupE-Mail

0.3696

0.3785

−0.0089

−2.4100

No

No

No

z_asldsc

−0.2877

−0.2837

−0.0039

−1.3600

No

No

No

Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized reproduced upper CI; Upper_diff = Upper - boot.Upper; %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

Term

Range

boot.Range

Range_Diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.4151

0.4770

0.0620

14.9300

Yes

No

No

z_agPerWtChange

0.2560

0.2602

0.0042

1.6300

No

No

No

z_Age

0.2605

0.2464

−0.0141

−5.4200

No

No

No

z_comorbcatsum

0.2613

0.2718

0.0105

4.0100

No

No

No

z_agmodup_act

0.2544

0.2982

0.0438

17.2100

Yes

No

No

groupDiscussion

0.6057

0.5994

−0.0063

−1.0400

No

No

No

groupE-Mail

0.6044

0.6393

0.0349

5.7800

No

No

No

z_asldsc

0.2493

0.2813

0.0320

12.8400

Yes

No

No

Range = standardized reproduced CI range; boot.Range = bootstrapped standardized reproduced CI range; Range_Diff = change in CI range; %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

Term

p-value

boot.p-value

changep

SigChangeDirection

Intercept

0.9789

0.9831

−0.0042

Remains non-sig, B same direction

z_agPerWtChange

0.5734

0.5681

0.0052

Remains non-sig, B same direction

z_Age

0.0714

0.0533

0.0182

Remains non-sig, B same direction

z_comorbcatsum

0.1408

0.1528

−0.0120

Remains non-sig, B same direction

z_agmodup_act

0.7707

0.7941

−0.0235

Remains non-sig, B same direction

groupDiscussion

0.6258

0.6254

0.0004

Remains non-sig, B same direction

groupE-Mail

0.6608

0.6861

−0.0253

Remains non-sig, B same direction

z_asldsc

<0.001

<0.001

−0.0000

Remains sig, B same direction

p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized reproduced p-value; changep = p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Conclusions based on the bootstrapped model

This model was inferentially reproducible. While some statistics changed by 10% or more, these differences were not meaningful, with a change in standardized regression coefficients of less than 0.1. The direction of effects and statistical significance remained consistent between the reproduced and bootstrapped models.

Model 3

Model results for Fatigue change

Term

B

SE

Lower

Upper

t

p-value

Intercept

agPerWtChange

0.24

0.12

0.38

<0.001

Age

comorbcatsum

agmodup_act

group:

Discussion – Internet Only

E-Mail – Internet Only

afatsc

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit Statistics Fatigue change

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

0.20

7.56

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Fatigue change

Term

SS

DF

MS

F

p-value

agPerWtChange

Age

comorbcatsum

agmodup_act

group

afatsc

Residuals

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Fatigue change

Term | B | SE | Lower | Upper | t | p-value
Intercept | 30.488 | 5.609 | 19.431 | 41.546 | 5.436 | <0.001
agPerWtChange | 0.249 | 0.066 | 0.120 | 0.379 | 3.794 | <0.001
Age | −0.233 | 0.075 | −0.381 | −0.084 | −3.086 | 0.0023
comorbcatsum | 0.840 | 0.404 | 0.043 | 1.637 | 2.078 | 0.0389
agmodup_act | −0.025 | 0.027 | −0.078 | 0.027 | −0.942 | 0.3474
group:
Discussion – Internet Only | −0.728 | 1.191 | −3.076 | 1.621 | −0.611 | 0.5420
E-Mail – Internet Only | 0.300 | 1.189 | −2.045 | 2.645 | 0.252 | 0.8009
afatsc | −0.368 | 0.073 | −0.512 | −0.224 | −5.044 | <0.001

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Fatigue change

R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value
0.450 | 0.203 | 0.176 | 1,468.766 | 6.954 | 7.563 | 7 | 208 | <0.001

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Fatigue change

Term | SS | DF | MS | F | p-value
agPerWtChange | 722.947 | 1 | 722.947 | 14.396 | <0.001
Age | 478.333 | 1 | 478.333 | 9.525 | 0.0023
comorbcatsum | 216.902 | 1 | 216.902 | 4.319 | 0.0389
agmodup_act | 44.540 | 1 | 44.540 | 0.887 | 0.3474
group | 38.165 | 2 | 19.083 | 0.380 | 0.6843
afatsc | 1,277.792 | 1 | 1,277.792 | 25.445 | <0.001
Residuals | 10,445.335 | 208 | 50.218

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non-linear relationships

Term

Statistic

p-value

Results

agPerWtChange

−0.010

0.9924

No linearity violation

Age

0.498

0.6191

No linearity violation

comorbcatsum

−0.162

0.8718

No linearity violation

agmodup_act

1.613

0.1083

No linearity violation

group

afatsc

0.079

0.9370

No linearity violation

Tukey test

0.729

0.4658

No linearity violation

Specification tests for predictors use quadratic terms; for fitted values, curvature is tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

The blue line shows the linear relationship; the red line shows the relationship inferred by GAM modelling.

Linearity results

No linearity violation was observed in either plots or tests.

Testing for homoscedasticity

Statistic | p-value | Parameter | Method
5.585 | 0.5890 | 7 | studentized Breusch-Pagan test

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.
Model descriptives, including Cook’s distance and leverage, to understand outliers

Term

N

Mean

SD

Median

Min

Max

Skewness

Kurtosis

Fatigue change

216

−1.156

7.807

0.000

−23.300

29.800

0.157

1.429

agPerWtChange

216

−4.446

7.640

−2.875

−35.702

12.220

−1.003

1.664

Age

216

54.699

6.779

55.000

40.000

69.000

−0.007

−0.680

comorbcatsum

216

1.477

1.268

1.000

0.000

5.000

0.773

−0.070

agmodup_act

216

−1.961

18.689

−0.940

−71.016

47.405

−0.402

1.437

group*

216

1.963

0.823

2.000

1.000

3.000

0.068

−1.528

afatsc

216

51.487

6.716

51.000

33.700

71.600

−0.438

0.670

.fitted

216

−1.156

3.517

−1.414

−11.440

10.155

0.198

0.258

.resid

216

−0.000

6.970

0.286

−20.702

26.061

−0.019

1.063

.leverage

216

0.037

0.016

0.033

0.014

0.128

1.870

5.063

.sigma

216

7.086

0.031

7.097

6.855

7.104

−3.785

19.206

.cooksd

216

0.005

0.011

0.002

0.000

0.105

5.214

35.288

.std.resid

216

0.001

1.003

0.041

−2.995

3.784

−0.011

1.091

dfb.1_

216

0.000

0.075

−0.002

−0.281

0.431

1.266

8.439

dfb.aPWC

216

0.000

0.080

0.000

−0.452

0.546

1.542

17.409

dfb.Age

216

0.000

0.083

0.001

−0.345

0.476

0.939

9.811

dfb.cmrb

216

−0.000

0.070

0.000

−0.359

0.218

−1.181

5.949

dfb.agm_

216

−0.000

0.068

0.000

−0.284

0.550

2.229

21.110

dfb.grpD

216

−0.000

0.067

−0.001

−0.212

0.377

0.799

5.319

dfb.gE.M

216

0.000

0.072

−0.000

−0.198

0.340

0.564

2.970

dfb.afts

216

−0.000

0.063

0.000

−0.245

0.225

−0.373

3.447

dffit

216

0.004

0.204

0.007

−0.690

0.946

0.217

2.839

cov.r

216

1.041

0.066

1.059

0.622

1.133

−2.799

10.680

* categorical variable

Cook’s threshold

Cook’s distance measures the overall change in fit if the ith observation is removed. Potentially influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted values will change when the ith observation is removed. Potentially influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice, this rule can flag a large number of points, so a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each model parameter. Potentially influential observations are identified by \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model, so a practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. Values outside a practical range of 0.9 to 1.1 were flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.

Observations of interest identified by the influence plot

ID | StudRes | Leverage | CookD | dfb.1_ | dfb.aPWC | dfb.Age | dfb.cmrb | dfb.agm_ | dfb.grpD | dfb.gE.M | dfb.afts | dffit | cov.r
16 | −0.620 | 0.096 | 0.005 | −0.100 | 0.027 | 0.031 | −0.095 | −0.078 | −0.067 | −0.003 | 0.133 | −0.202 | 1.133
157 | 1.430 | 0.128 | 0.037 | −0.070 | −0.452 | −0.030 | 0.103 | −0.284 | 0.022 | 0.099 | 0.084 | 0.548 | 1.102
100 | 3.202 | 0.047 | 0.061 | −0.198 | −0.174 | 0.428 | 0.143 | −0.097 | 0.377 | 0.051 | −0.228 | 0.714 | 0.741
89 | 3.911 | 0.055 | 0.105 | 0.431 | 0.403 | −0.327 | −0.201 | 0.550 | −0.063 | 0.340 | −0.207 | 0.946 | 0.622

StudRes = studentized residuals; CookD = Cook's Distance, a combined measure of leverage and influence; DFBETAS (dfb.*) measure how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

Two observations had studentized residuals > 3. Both had low leverage and small Cook’s distance, with DFBETAS and DFFITS within conventional ranges. The COVRATIO indicated observations that may affect confidence interval widths.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic | p-value | Method
0.054 | 0.5665 | Asymptotic one-sample Kolmogorov-Smirnov test
0.983 | 0.0098 | Shapiro-Wilk normality test

Normality results
  • The Kolmogorov-Smirnov test supports residuals being normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • The Q–Q plot looks roughly normal.
Assessing collinearity with VIF

Term | VIF | Tolerance
agPerWtChange | 1.038 | 0.963
Age | 1.058 | 0.945
comorbcatsum | 1.061 | 0.943
agmodup_act | 1.031 | 0.970
group | 1.014 | 0.987
afatsc | 1.014 | 0.986

VIF = Variance Inflation Factor.

Collinearity results
  • All VIF values are under three, indicating no collinearity issues.
  • Overall, when taking into account VIF and SE, the model does not have collinearity issues.
Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation | Statistic | p-value
−0.050 | 2.098 | 0.4860

Independence results
  • The Durbin–Watson test suggests there are no auto-correlation issues.
  • Although the study design was longitudinal, change scores were used, so the independence assumption was not violated.
Assumption conclusions

Residual diagnostics did not indicate major violations of linear model assumptions. Outlier diagnostics indicated that point estimates were unlikely to be substantially affected by influential points, but confidence-interval width could be affected and should be further investigated.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Fatigue change

Change in regression coefficients

term

O_B

R_B

Change.B

reproduce.B

Intercept

30.4885

agPerWtChange

0.24

0.2492

0.0092

Incorrect Rounding

Age

−0.2328

comorbcatsum

0.8399

agmodup_act

−0.0251

group:

Discussion – Internet Only

−0.7277

E-Mail – Internet Only

0.3003

afatsc

−0.3681

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

term

O_lower

R_lower

Change.lci

Reproduce.lower

Intercept

19.4307

agPerWtChange

0.12

0.1197

−0.0003

Reproduced

Age

−0.3815

comorbcatsum

0.0432

agmodup_act

−0.0776

group:

Discussion – Internet Only

−3.0765

E-Mail – Internet Only

−2.0447

afatsc

−0.5120

O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

term

O_upper

R_upper

Change.uci

Reproduce.upper

Intercept

41.5463

agPerWtChange

0.38

0.3787

−0.0013

Reproduced

Age

−0.0841

comorbcatsum

1.6366

agmodup_act

0.0274

group:

Discussion – Internet Only

1.6210

E-Mail – Internet Only

2.6452

afatsc

−0.2243

O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in Adjusted R2

O_R2Adj | R_R2Adj | Change.R2Adj | Reproduce.R2Adj
0.200 | 0.1761 | −0.0239 | Not Reproduced

O_R2Adj = original R2 Adjusted; R_R2Adj = reproduced R2 Adjusted; Change.R2Adj = change in R2Adj (R2Adj - O_R2Adj); Reproduce.R2Adj = R2 Adjusted reproduced.
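
The gap between the original and reproduced values can be examined with the usual adjusted R2 formula, 1 − (1 − R2)(n − 1)/(n − k − 1). The sketch below uses the reproduced R2 of 0.203 with n = 216 observations and k = 7 model degrees of freedom, all taken from the fit statistics for this model; the function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for a model with k predictors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


R2 = 0.203  # reproduced unadjusted R2 for Fatigue change
adj = adjusted_r2(R2, n=216, k=7)

print(f"unadjusted R2 to 2 d.p. = {R2:.2f}")
print(f"adjusted R2 = {adj:.3f}")
```

The unadjusted value rounds to the 0.20 given in the original paper, while the adjusted value is close to the reproduced 0.176.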

Change in global F

Term | O_F | R_F | Change.F | Reproduce.F
Intercept | 7.56 | 7.5633 | 0.0033 | Reproduced

O_F = original global F; R_F = reproduced global F; Change.F = change in R_F - O_F; Reproduce.F = Global F reproduced.

Change in p-values

The p-value was reproduced.

Term

O_p

R_p

Change.p

Reproduce.p

SigChangeDirection

Intercept

<0.001

agPerWtChange

<0.001

<0.001

0.0000

Reproduced

Remains sig, B same direction

Age

0.0023

comorbcatsum

0.0389

agmodup_act

0.3474

group:

Discussion – Internet Only

0.5420

E-Mail – Internet Only

0.8009

afatsc

<0.001

O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values

Conclusion computational reproducibility

This model was mostly computationally reproducible, with minor rounding errors. P-values were reproduced and had the same interpretation, and regression coefficients did not change direction. The unadjusted R2 appears to have been mistakenly reported in place of the adjusted R2 for this model.

Methods

The model was successfully reproduced; however, residual diagnostics indicated a small number of observations that may contribute to wider confidence intervals. All continuous variables in the model were standardized, and inference was assessed using bootstrapped standardized regression coefficients and their corresponding 95% confidence intervals. Percentage and absolute changes in estimates and confidence-interval ranges relative to the original linear model were summarised using thresholds of 10% change and standardized coefficient differences of <0.10 and <0.20. Consistency of coefficient direction and statistical significance was also evaluated.

Bootstrapping results

A non-parametric bootstrap with bias-corrected and accelerated (BCa) confidence intervals was performed using 10,000 resamples.

Change in regression coefficients

Term

B

boot.B

B_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.0179

0.0147

0.0032

17.7900

Yes

No

No

z_agPerWtChange

0.2439

0.2442

−0.0003

−0.1200

No

No

No

z_Age

−0.2021

−0.2025

0.0004

0.2000

No

No

No

z_comorbcatsum

0.1365

0.1391

−0.0026

−1.9100

No

No

No

z_agmodup_act

−0.0601

−0.0590

−0.0011

−1.7600

No

No

No

groupDiscussion

−0.0932

−0.0909

−0.0023

−2.5000

No

No

No

groupE-Mail

0.0385

0.0389

−0.0004

−1.1600

No

No

No

z_afatsc

−0.3167

−0.3156

−0.0011

−0.3400

No

No

No

B = standardized reproduced regression coefficient; boot.B = bootstrapped standardized reproduced B; B_diff = B - boot.B; %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

| Term | Lower | boot.Lower | Lower_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | −0.1883 | −0.1789 | −0.0094 | −4.9900 | No | No | No |
| z_agPerWtChange | 0.1172 | 0.1121 | 0.0051 | 4.3600 | No | No | No |
| z_Age | −0.3312 | −0.3479 | 0.0167 | 5.0400 | No | No | No |
| z_comorbcatsum | 0.0070 | 0.0084 | −0.0014 | −20.0200 | Yes | No | No |
| z_agmodup_act | −0.1859 | −0.1655 | −0.0204 | −10.9600 | Yes | No | No |
| groupDiscussion | −0.3941 | −0.3733 | −0.0207 | −5.2600 | No | No | No |
| groupE-Mail | −0.2619 | −0.2560 | −0.0059 | −2.2600 | No | No | No |
| z_afatsc | −0.4405 | −0.4340 | −0.0065 | −1.4700 | No | No | No |

Lower = reproduced standardized lower 95% confidence limit; boot.Lower = bootstrapped standardized lower 95% confidence limit; Lower_diff = Lower − boot.Lower; %_Diff = percentage difference, truncated at ±1000%; Diff_10% = absolute percentage difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

| Term | Upper | boot.Upper | Upper_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.2241 | 0.2075 | 0.0166 | 7.4000 | No | No | No |
| z_agPerWtChange | 0.3706 | 0.3949 | −0.0243 | −6.5600 | No | No | No |
| z_Age | −0.0730 | −0.0476 | −0.0255 | −34.8700 | Yes | No | No |
| z_comorbcatsum | 0.2659 | 0.2625 | 0.0034 | 1.2800 | No | No | No |
| z_agmodup_act | 0.0657 | 0.0791 | −0.0134 | −20.3800 | Yes | No | No |
| groupDiscussion | 0.2076 | 0.1983 | 0.0093 | 4.4800 | No | No | No |
| groupE-Mail | 0.3388 | 0.3465 | −0.0077 | −2.2800 | No | No | No |
| z_afatsc | −0.1929 | −0.2101 | 0.0171 | 8.8800 | No | No | No |

Upper = reproduced standardized upper 95% confidence limit; boot.Upper = bootstrapped standardized upper 95% confidence limit; Upper_diff = Upper − boot.Upper; %_Diff = percentage difference, truncated at ±1000%; Diff_10% = absolute percentage difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in range of 95% confidence interval

| Term | Range | boot.Range | Range_Diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.4124 | 0.3864 | −0.0260 | −6.3000 | No | No | No |
| z_agPerWtChange | 0.2534 | 0.2829 | 0.0294 | 11.6100 | Yes | No | No |
| z_Age | 0.2582 | 0.3004 | 0.0422 | 16.3300 | Yes | No | No |
| z_comorbcatsum | 0.2589 | 0.2541 | −0.0048 | −1.8500 | No | No | No |
| z_agmodup_act | 0.2516 | 0.2446 | −0.0070 | −2.7700 | No | No | No |
| groupDiscussion | 0.6017 | 0.5717 | −0.0300 | −4.9900 | No | No | No |
| groupE-Mail | 0.6007 | 0.6025 | 0.0018 | 0.3000 | No | No | No |
| z_afatsc | 0.2475 | 0.2239 | −0.0236 | −9.5300 | No | No | No |

Range = reproduced standardized 95% CI range; boot.Range = bootstrapped standardized 95% CI range; Range_Diff = change in CI range (boot.Range − Range, so positive values indicate a wider bootstrapped interval); %_Diff = percentage difference, truncated at ±1000%; Diff_10% = absolute percentage difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

| Term | p-value | boot.p-value | changep | SigChangeDirection |
|---|---|---|---|---|
| Intercept | 0.8641 | 0.8822 | −0.0181 | Remains non-sig, B same direction |
| z_agPerWtChange | <0.001 | <0.001 | −0.0004 | Remains sig, B same direction |
| z_Age | 0.0023 | 0.0078 | −0.0055 | Remains sig, B same direction |
| z_comorbcatsum | 0.0389 | 0.0322 | 0.0067 | Remains sig, B same direction |
| z_agmodup_act | 0.3474 | 0.3363 | 0.0111 | Remains non-sig, B same direction |
| groupDiscussion | 0.5420 | 0.5310 | 0.0110 | Remains non-sig, B same direction |
| groupE-Mail | 0.8009 | 0.8019 | −0.0009 | Remains non-sig, B same direction |
| z_afatsc | <0.001 | <0.001 | 0.0000 | Remains sig, B same direction |

p-value = p-value from the reproduced standardized model; boot.p-value = bootstrapped p-value; changep = p-value − boot.p-value; SigChangeDirection = change in statistical significance and in the direction of B between the reproduced and bootstrapped models.
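The SigChangeDirection label combines two checks: whether each p-value falls below the significance threshold, and whether the two coefficients share a sign. A minimal sketch of that logic follows. The function name is hypothetical, the 0.05 threshold is an assumption (it is consistent with the labels in the table), and the labels for cases that do not occur in this table ("Sig to non-sig", "Non-sig to sig", "B changes direction") are invented for illustration.

```python
def sig_change_direction(p, p_boot, b, b_boot, alpha=0.05):
    """Label agreement in significance and coefficient sign between two models."""
    sig, sig_boot = p < alpha, p_boot < alpha
    if sig and sig_boot:
        status = "Remains sig"
    elif not sig and not sig_boot:
        status = "Remains non-sig"
    elif sig:
        status = "Sig to non-sig"  # hypothetical label, not seen in this report
    else:
        status = "Non-sig to sig"  # hypothetical label, not seen in this report

    same_sign = (b >= 0) == (b_boot >= 0)
    direction = "B same direction" if same_sign else "B changes direction"
    return f"{status}, {direction}"


# z_Age and Intercept rows, using the rounded values reported in this section.
print(sig_change_direction(0.0023, 0.0078, -0.2021, -0.2025))
print(sig_change_direction(0.8641, 0.8822, 0.0179, 0.0147))
```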

Check the distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.
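A visual check of this kind can be complemented by simple numerical summaries. The sketch below computes the sample skewness of a set of bootstrap estimates and the distance between their mean and the original estimate. The draws are simulated stand-ins, not the report's bootstrap output: the centre is the z_agPerWtChange coefficient reported above, while the spread of 0.065 and the function name `sample_skewness` are illustrative assumptions.

```python
import random
from statistics import mean, pstdev


def sample_skewness(values):
    """Moment-based skewness: mean cubed z-score (0 for a symmetric sample)."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


original_estimate = 0.2439  # z_agPerWtChange coefficient reported in this section

# Simulated stand-in for bootstrap estimates (illustrative spread, not real output).
rng = random.Random(42)
boot_estimates = [rng.gauss(original_estimate, 0.065) for _ in range(5000)]

skew = sample_skewness(boot_estimates)
centre_shift = mean(boot_estimates) - original_estimate
print(f"skewness = {skew:.3f}, mean minus original = {centre_shift:.4f}")
```

Skewness near zero and a small shift between the bootstrap mean and the original estimate are consistent with an approximately symmetric, well-centred distribution; multimodality would still need to be checked on a histogram or density plot.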

Conclusions based on the bootstrapped model

This model was inferentially reproducible. Although some statistics changed by 10% or more, these differences were not meaningful: every absolute change in the standardized regression coefficients and their confidence limits was below 0.1. The direction of effects and statistical significance remained consistent between the reproduced and bootstrapped models.