Paper 25: Impact of negative tuberculin skin test on growth among disadvantaged Bangladeshi children

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

March 15, 2026

Reference

Gaffar SMA, Chisti MJ, Mahfuz M, Ahmed T (2019) Impact of negative tuberculin skin test on growth among disadvantaged Bangladeshi children. PLoS ONE 14(11): e0224752. https://doi.org/10.1371/journal.pone.0224752

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.

Results from the reproduction of the Gaffar et al. (2019) paper are presented below. An overall summary of results is presented first, followed by model-specific results organised within tab panels. Within each panel, the Original results tab displays the linear regression outputs extracted from the published paper. The Reproduced results tab presents estimates derived from the authors’ shared data, along with a comprehensive assessment of linear regression assumptions. The Differences tab compares the original and reproduced models to assess computational reproducibility. Finally, the Sensitivity analysis tab evaluates inferential reproducibility by examining whether identified assumption violations meaningfully affected the results.

Summary from statistical review

This paper examines growth differences between Tuberculin skin test and negative undernourished children growth among disadvantaged Bangladeshi children. Statistical methods were described well and presented clearly in tables identifying the test used. There was the overuse of acronyms, but generally easy to read. No assumptions were discussed. The author’s used pre-post change scores for linear regression but did not explicitly mention independence. They have not interpreted the coefficients except in terms of significance.

Data availability and software used

The data was available in the supporting information as a wide CSV file, with no data dictionary provided. Linear regression results were analysed using Stata.

Regression sample

Twelve linear regression models identified, Three main outcomes with multivariable linear regression reported, therefore, randomisation was not required. The three outcomes considered were change in length for age Z score LAZ, Change in weight for age Z score WAZ, and change in weight for length Z score WLZ, the main independent variable was whether Tuberculin skin test was positive or negative with adjustment for sex (male or female) and post-intervention age.

Computational reproducibility results

The models assessed were mostly computationally reproducible, with only minor rounding differences. P-values reported in the paper showed minor rounding discrepancies, consistent with truncation to two decimal places rather than rounding, but these differences did not affect their interpretation. Regression coefficients retained their direction. Some data manipulation was required to create change scores, and age measured in days required converting into months. Overall, the study was manageable to reproduce.

Inferential reproducibility results

Two of the three models were inferentially reproducible, meaning that overall the paper was not inferentially reproducible. For the LAZ model, the Breusch–Pagan test was not statistically significant. However, the residuals versus fitted plot showed a funnel-shaped pattern, suggesting some heteroscedasticity and wild bootstrapping was used as a comparison. The Q–Q plot and Kolmogorov–Smirnov test indicated that the residuals were approximately normal, although the Shapiro–Wilk test suggested minor deviations from normality. The covariance ratio flagged potential outliers that could inflate coefficient variances and widen confidence intervals. The study sampling method was unclear and independence could not be tested as there were no identifiers in the dataset. The WAZ and WLZ models exhibited similar assumption issues but did not show evidence of heteroscedasticity and BCa bootsrapping was used as a comparison for these models. The direction and statistical significance of effects remained consistent between the reproduced and bootstrapped models for all outcomes. However, the WLZ model was not inferentially reproducible, with standardized coefficient range differing by 0.11 (14%). In contrast, the LAZ and WAZ models differences in standardized coefficients and CI ranges were <0.1, indicating inferential reproducibility.

Model 1

Model results for LAZ change

Term

B

SE

Lower

Upper

t

p-value

Intercept

TST:

Positive – Negative

−0.04

−0.15

0.08

0.51

Sex:

Female – Male

−0.02

−0.10

0.05

0.52

Age

0.003

−0.01

0.02

0.74

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for LAZ change

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for LAZ change

Term

SS

DF

MS

F

p-value

TST

Sex

Age

Residuals

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for LAZ change

Term

B

SE

Lower

Upper

t

p-value

Intercept

0.004

0.171

−0.333

0.340

0.021

0.9831

TST:

Positive – Negative

−0.039

0.059

−0.155

0.077

−0.657

0.5116

Sex:

Female – Male

−0.025

0.038

−0.100

0.051

−0.642

0.5213

Age

0.003

0.009

−0.015

0.021

0.321

0.7487

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for LAZ change

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

0.063

0.004

−0.009

105.179

0.294

0.314

3

239

0.8155

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for LAZ change

Term

SS

DF

MS

F

p-value

TST

0.038

1

0.038

0.432

0.5116

Sex

0.036

1

0.036

0.413

0.5213

Age

0.009

1

0.009

0.103

0.7487

Residuals

21.049

239

0.088

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

Term

Statistic

p-value

Results

TST

Sex

Age

−1.576

0.1163

No linearity violation

Tukey test

−0.098

0.9222

No linearity violation

Specification test for predictors using quadratic tests, for fitted values curvature is tested through Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results
  • No linearity violation was observed in either plots or tests.
Testing for homoscedasticity

Statistic

p-value

Parameter

Method

5.341

0.1484

3

studentized Breusch-Pagan test

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • Some heteroscedasticity is present in plots, and a sensitivity analysis using weighted or robust regression or wild bootstrapping is recommended.
Model descriptives including cook’s distance and leverage to understand outliers

Term

N

Mean

SD

Median

Min

Max

Skewness

Kurtosis

LAZ change

243

0.040

0.296

0.000

−0.810

1.020

0.358

0.329

TST*

243

1.119

0.325

1.000

1.000

2.000

2.334

3.461

Sex*

243

1.473

0.500

1.000

1.000

2.000

0.106

−1.997

Age

243

18.632

2.127

18.533

14.867

23.067

0.122

−1.174

.fitted

243

0.040

0.019

0.039

−0.015

0.069

−0.748

0.648

.resid

243

0.000

0.295

−0.030

−0.842

0.994

0.370

0.337

.leverage

243

0.016

0.011

0.013

0.008

0.054

1.967

2.780

.sigma

243

0.297

0.001

0.297

0.290

0.297

−3.483

15.964

.cooksd

243

0.004

0.006

0.002

0.000

0.042

3.296

12.789

.std.resid

243

−0.000

1.002

−0.102

−2.850

3.372

0.371

0.325

dfb.1_

243

0.000

0.061

0.000

−0.175

0.301

0.362

2.623

dfb.TSTP

243

−0.000

0.064

0.002

−0.308

0.362

0.938

11.083

dfb.SxMl

243

0.000

0.065

0.003

−0.205

0.228

0.160

0.448

dfb.Age

243

−0.000

0.059

−0.001

−0.255

0.174

−0.175

1.666

dffit

243

−0.002

0.126

−0.012

−0.350

0.413

0.491

0.684

cov.r

243

1.017

0.028

1.023

0.848

1.066

−2.475

10.174

* categorical variable

Cooks threshold

Cook’s distance measures the overall change in fit, if the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted values will change when the ith observation is removed. Potential influential observations \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\) where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can result in a large number of points identified, a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each parameter in the model. Potential influential observations \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. A practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients if the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 indicate inflated variances and reduced precision (wider confidence intervals), whereas values above 1 indicate deflated variances and increased precision (narrower confidence intervals). A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical cut-off between 0.9 to 1.1 was used to flag observations with meaningful impact on precision, although there is no agreed universal alternative cut-off.

Observations of interest identified by the influence plot

ID

StudRes

Leverage

CookD

dfb.1_

dfb.TSTP

dfb.SxMl

dfb.Age

dffit

cov.r

50

−1.064

0.050

0.015

0.125

−0.196

−0.062

−0.120

−0.244

1.050

19

3.268

0.011

0.028

0.119

−0.100

0.228

−0.117

0.340

0.862

233

1.409

0.054

0.028

−0.161

0.272

−0.104

0.174

0.337

1.040

89

3.448

0.014

0.039

0.301

−0.091

−0.205

−0.255

0.405

0.848

49

2.085

0.038

0.042

−0.015

0.362

0.123

0.002

0.413

0.983

StudRes = studentized residuals; CookD = Cook's Distance a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points
  • Two observations had studentized residuals greater than 3, both had low leverage and Cook’s distance values, with cook’s distance, DFBETAs, and DFFITS within reasonable ranges, but the covariance ratio values were substantially lower than one. Suggesting minimal impact on the coefficient estimates but potential inflation of standard errors and consequent widening of confidence intervals.
Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic

p-value

Method

0.065

0.2613

Asymptotic one-sample Kolmogorov-Smirnov test

Statistic

p-value

Method

0.988

0.0415

Shapiro-Wilk normality test

Normality results
  • The Kolmogorov-Smirnov supports residuals being normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot looks roughly normal.
Assessing collinearity with VIF

Term

VIF

Tolerance

TST

1.010

0.990

Sex

1.003

0.997

Age

1.007

0.993

VIF = Variance Inflation Factor.

Collinearity results
  • All VIF values are under three, indicating no collinearity issues.
  • Overall, when taking into account VIF and SE, the model does not have collinearity issues.
Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation

Statistic

p-value

0.256

1.483

<0.001

Independence results
  • The Durbin–Watson test suggests there are auto-correlation issues.
  • Autocorrelation of residuals was assessed using the Durbin–Watson test, which evaluates first-order serial correlation. Evidence of autocorrelation may arise from unmodelled clustering or repeated measurements, but can also reflect artefacts of data ordering (e.g. sorting by time or subject) rather than true temporal dependence.
  • It was unclear if the study design was independent as recruitment of children was not explained.
Assumption conclusions

The model met the assumption of linearity. The residuals versus fitted plot showed a funnel-shaped pattern, suggesting mild heteroscedasticity; however, the Breusch–Pagan test was not statistically significant. A sensitivity analysis using weighted regression or wild bootstrapping is recommended. The Q–Q plot and Kolmogorov–Smirnov test indicated that the residuals were approximately normal, although the Shapiro–Wilk test suggested some deviation from normality. The covariance ratio identified potential outliers that could inflate coefficient variances and widen confidence intervals. The study design and recruitment of children were not clearly described, making it unclear whether observations were independent.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for LAZ change

Change in regression coefficients

term

O_B

R_B

Change.B

reproduce.B

Intercept

0.0036

TST:

Positive – Negative

−0.040

−0.0388

0.0012

Reproduced

Sex:

Female – Male

−0.020

−0.0245

−0.0045

Reproduced

Age

0.003

0.0029

−0.0001

Reproduced

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

term

O_lower

R_lower

Change.lci

Reproduce.lower

Intercept

−0.3329

TST:

Positive – Negative

−0.15

−0.1551

−0.0051

Incorrect Rounding

Sex:

Female – Male

−0.10

−0.0998

0.0002

Reproduced

Age

−0.01

−0.0148

−0.0048

Reproduced

O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

term

O_upper

R_upper

Change.uci

Reproduce.upper

Intercept

0.3402

TST:

Positive – Negative

0.08

0.0775

−0.0025

Reproduced

Sex:

Female – Male

0.05

0.0507

0.0007

Reproduced

Age

0.02

0.0206

0.0006

Reproduced

O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in p-values

Term

O_p

R_p

Change.p

Reproduce.p

SigChangeDirection

Intercept

0.9831

TST:

Positive – Negative

0.51

0.5116

0.0016

Reproduced

Remains non-sig, B same direction

Sex:

Female – Male

0.52

0.5213

0.0013

Reproduced

Remains non-sig, B same direction

Age

0.74

0.7487

0.0087

Incorrect Rounding

Remains non-sig, B same direction

O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Bland Altman plot showing differences between original and reproduced p-values for LAZ change

Results for p-values

P-values for this model were mostly reproduced. One p-value differed due to a rounding error, but this did not alter its statistical significance.

Conclusion computational reproducibility

This model was mostly computationally reproducible, with minor rounding errors. P-values were reproduced and had the same interpretation, and regression coefficients did not change direction.

Methods

The model was successfully reproduced; however, the residuals showed mild heteroscedasticity. To further verify the findings, bootstrapped standardized regression coefficients and their 95% confidence intervals were examined. Percentage and absolute changes in estimates and confidence-interval bounds relative to the linear model were summarised using thresholds of 10% change and standardized coefficient differences of <0.10 and <0.20. Coefficient direction and statistical significance were assessed for consistency. Wild bootstrapping was used to account for potential heteroscedasticity.

Results

Bootstrapped results

Wild bootstrapping was performed with 10,000 iterations.

Change in regression coefficients

Term

B

boot.B

B_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

TSTPositive

−0.1313

−0.1287

−0.0026

−1.9700

No

No

No

SexMale

0.0830

0.0826

0.0004

0.4500

No

No

No

z_Age

0.0208

0.0207

0.0001

0.5100

No

No

No

B = standardized regression coefficient reproduced B; boot.B = boostrapped standardized reproduced B; B_diff = change in B - boot.B; %_Diff = percentage difference; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

Term

Lower

boot.Lower

Lower_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

TSTPositive

−0.5248

−0.5063

−0.0185

−3.5200

No

No

No

SexMale

−0.1716

−0.1712

−0.0004

−0.2500

No

No

No

z_Age

−0.1068

−0.0933

−0.0135

−12.6600

Yes

No

No

Lower = standardized reproduced lower CI; boot.Lower = boostrapped standardized reproduced lower CI; Lower_diff = change in Lower - boot.Lower; %_change = percentage difference; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

Term

Upper

boot.Upper

Upper_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

TSTPositive

0.2622

0.2427

0.0195

7.4300

No

No

No

SexMale

0.3376

0.3371

0.0005

0.1500

No

No

No

z_Age

0.1484

0.1327

0.0157

10.5800

Yes

No

No

Upper = standardized reproduced upper CI; boot.Upper = boostrapped standardized reproduced upper CI; Upper_diff = change in Upper - boot.Upper; %_change = percentage difference; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

Term

Range

boot.Range

Range_Diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

TSTPositive

0.7870

0.7491

−0.0379

−4.8200

No

No

No

SexMale

0.5092

0.5083

−0.0010

−0.1900

No

No

No

z_Age

0.2552

0.2260

−0.0292

−11.4500

Yes

No

No

Range = standardized reproduced CI range; boot.B = boostrapped standardized reproduced CI range; Range_diff = change in CI Range ; %_change = percentage difference, Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

Term

p-value

boot.p-value

changep

SigChangeDirection

TSTPositive

0.5116

0.5006

0.0110

Remains non-sig, B same direction

SexMale

0.5213

0.5200

0.0013

Remains non-sig, B same direction

z_Age

0.7487

0.7193

0.0294

Remains non-sig, B same direction

p-value = standardized reproduced p-value; boot.p-value = boostrapped standardized reproduced p-value; changep = change in p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Conclusions based on bootstrapped model

This model was inferentially reproducible. While some statistics changed by 10% or more these differences were not meaningful with change in standardized regression coefficients and confidence intervals less than 0.1. The direction of effects and statistical significance remained consistent between the reproduced and bootstrapped models.

Model 2

Model results for WAZ change

Term

B

SE

Lower

Upper

t

p-value

Intercept

TST:

Positive – Negative

0.04

−0.11

0.18

0.62

Sex:

Female – Male

0.01

−0.08

0.11

0.80

Age

0.01

−0.01

0.04

0.23

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for WAZ change

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for WAZ change

Term

SS

DF

MS

F

p-value

TST

Sex

Age

Residuals

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results WAZ change

Term

B

SE

Lower

Upper

t

p-value

Intercept

−0.198

0.217

−0.626

0.230

−0.911

0.3634

TST:

Positive – Negative

0.037

0.075

−0.111

0.185

0.491

0.6236

Sex:

Female – Male

0.012

0.049

−0.084

0.108

0.246

0.8059

Age

0.014

0.011

−0.009

0.036

1.180

0.2391

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Model fit for WAZ change

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

0.081

0.007

−0.006

222.309

0.375

0.532

3

239

0.6609

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for WAZ change

Term

SS

DF

MS

F

p-value

TST

0.034

1

0.034

0.241

0.6236

Sex

0.009

1

0.009

0.061

0.8059

Age

0.199

1

0.199

1.393

0.2391

Residuals

34.086

239

0.143

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

Term

Statistic

p-value

Results

TST

Sex

Age

0.154

0.8776

No linearity violation

Tukey test

0.073

0.9415

No linearity violation

Specification test for predictors using quadratic tests, for fitted values curvature is tested through Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

No linearity violation was observed in either plots or tests.

Testing for homoscedasticity

Statistic

p-value

Parameter

Method

1.068

0.7848

3

studentized Breusch-Pagan test

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.
Model descriptives including cook’s distance and leverage to understand outliers

Term

N

Mean

SD

Median

Min

Max

Skewness

Kurtosis

WAZ change

243

0.065

0.377

0.070

−0.950

1.830

0.512

1.509

TST*

243

1.119

0.325

1.000

1.000

2.000

2.334

3.461

Sex*

243

1.473

0.500

1.000

1.000

2.000

0.106

−1.997

Age

243

18.632

2.127

18.533

14.867

23.067

0.122

−1.174

.fitted

243

0.065

0.031

0.064

0.006

0.150

0.161

−0.827

.resid

243

−0.000

0.375

−0.016

−1.011

1.753

0.498

1.471

.leverage

243

0.016

0.011

0.013

0.008

0.054

1.967

2.780

.sigma

243

0.378

0.002

0.378

0.361

0.378

−6.668

64.342

.cooksd

243

0.004

0.010

0.002

0.000

0.090

5.314

34.707

.std.resid

243

0.000

1.003

−0.042

−2.741

4.663

0.488

1.450

dfb.1_

243

−0.000

0.066

0.001

−0.254

0.249

−0.403

2.442

dfb.TSTP

243

−0.000

0.073

0.001

−0.486

0.406

−1.197

18.331

dfb.SxMl

243

0.000

0.065

−0.002

−0.293

0.211

−0.091

1.521

dfb.Age

243

0.000

0.066

−0.001

−0.238

0.248

0.360

2.111

dffit

243

0.000

0.135

−0.004

−0.608

0.504

−0.074

2.801

cov.r

243

1.017

0.031

1.023

0.700

1.070

−5.086

44.810

* categorical variable

Cooks threshold

Cook’s distance measures the overall change in fit, if the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted values will change when the ith observation is removed. Potential influential observations \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\) where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can result in a large number of points identified, a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each parameter in the model. Potential influential observations \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. A practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients if the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 indicate inflated variances and reduced precision (wider confidence intervals), whereas values above 1 indicate deflated variances and increased precision (narrower confidence intervals). A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical cut-off between 0.9 to 1.1 was used to flag observations with meaningful impact on precision, although there is no agreed universal alternative cut-off.

Observations of interest identified by the influence plot

ID

StudRes

Leverage

CookD

dfb.1_

dfb.TSTP

dfb.SxMl

dfb.Age

dffit

cov.r

50

0.506

0.050

0.003

−0.060

0.093

0.030

0.057

0.116

1.066

233

−0.706

0.054

0.007

0.081

−0.136

0.052

−0.087

−0.169

1.066

2

3.028

0.010

0.023

0.095

−0.091

0.211

−0.093

0.310

0.883

110

4.881

0.009

0.048

−0.040

−0.089

−0.293

0.111

0.458

0.700

57

−2.299

0.048

0.065

0.249

−0.420

−0.134

−0.238

−0.516

0.978

238

−2.779

0.046

0.090

−0.240

−0.486

0.201

0.222

−0.608

0.938

StudRes = studentized residuals; CookD = Cook's Distance a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points
  • Two observations had studentized residuals greater than 3, both observations had low leverage and Cook’s distance values, with cook’s distance, DFBETAs, and DFFITS within reasonable ranges, but the covariance ratio values were substantially lower than one. Suggesting minimal impact on the coefficient estimates but potential inflation of standard errors and consequent widening of confidence intervals.
Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic

p-value

Method

0.032

0.9661

Asymptotic one-sample Kolmogorov-Smirnov test

Statistic

p-value

Method

0.981

0.0029

Shapiro-Wilk normality test

Normality results
  • The Kolmogorov-Smirnov supports residuals being normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot looks roughly normal.
Assessing collinearity with VIF

Term

VIF

Tolerance

TST

1.010

0.990

Sex

1.003

0.997

Age

1.007

0.993

VIF = Variance Inflation Factor.

Collinearity results
  • All VIF values are under three, indicating no collinearity issues.
  • Overall, when taking into account VIF and SE, the model does not have collinearity issues.
Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation

Statistic

p-value

0.084

1.808

0.1480

Independence results
  • The Durbin–Watson test suggests there are no auto-correlation issues.
  • Autocorrelation of residuals was assessed using the Durbin–Watson test, which evaluates first-order serial correlation. Evidence of autocorrelation may arise from unmodelled clustering or repeated measurements, but can also reflect artefacts of data ordering (e.g. sorting by time or subject) rather than true temporal dependence.
  • It was unclear if the study design was independent as recruitment of children was not explained.
Assumption conclusions

The model was found to meet the assumptions of linearity and homoscedasticity. The Q-Q plot and the Kolmogorov-Smirnov test indicated that residuals were approximately normal, but the Shapiro-Wilk test indicated mild deviations of normality. The covariance ratio flagged potential outliers that could inflate coefficient variances and widen confidence intervals. The study design and recruitment of children were not clearly described, making it unclear whether observations were independent.

Forest plot showing Original and Reproduced coefficients and 95% confidence intervals for WAZ change

Change in regression coefficients

term

O_B

R_B

Change.B

reproduce.B

Intercept

−0.1980

TST:

Positive – Negative

0.04

0.0369

−0.0031

Reproduced

Sex:

Female – Male

0.01

0.0120

0.0020

Reproduced

Age

0.01

0.0135

0.0035

Reproduced

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

term

O_lower

R_lower

Change.lci

Reproduce.lower

Intercept

−0.6263

TST:

Positive – Negative

−0.11

−0.1111

−0.0011

Reproduced

Sex:

Female – Male

−0.08

−0.0838

−0.0038

Reproduced

Age

−0.01

−0.0090

0.0010

Reproduced

O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

term

O_upper

R_upper

Change.uci

Reproduce.upper

Intercept

0.2303

TST:

Positive – Negative

0.18

0.1849

0.0049

Reproduced

Sex:

Female – Male

0.11

0.1077

−0.0023

Reproduced

Age

0.04

0.0361

−0.0039

Reproduced

O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in p-values

Term

O_p

R_p

Change.p

Reproduce.p

SigChangeDirection

Intercept

0.3634

TST:

Positive – Negative

0.62

0.6236

0.0036

Reproduced

Remains non-sig, B same direction

Sex:

Female – Male

0.80

0.8059

0.0059

Incorrect Rounding

Remains non-sig, B same direction

Age

0.23

0.2391

0.0091

Incorrect Rounding

Remains non-sig, B same direction

O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Bland Altman plot showing differences between original and reproduced p-values for WAZ change

Results for p-values

P-values for model had rounding error, indicating that p-values were most likely truncated to two decimal places rather than being rounded, but this did not alter its statistical significance.

Conclusion computational reproducibility

This model was mostly computationally reproducible, with minor rounding errors. P-values were reproduced and had the same interpretation, and regression coefficients did not change direction.

Methods

The linear model was computationally reproduced. Residual diagnostics suggested possible non-normality. Non-parametric BCa bootstrapping (10,000 resamples) was used to obtain standardized coefficients and 95% confidence intervals (CIs), and CI widths were examined. Percentage and absolute changes in estimates and CI bounds, relative to the original linear-model estimates, were summarised using <0.10 and <0.20 thresholds. Coefficient direction and statistical significance were assessed for consistency.

Results

Bootstrapped results

Bias-Corrected and Accelerated (BCa) bootstrap confidence intervals were calculated using 10,000 resamples.

Change in regression coefficients

Term

B

boot.B

B_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.0033

0.0049

−0.0016

−47.7200

Yes

No

No

TSTPositive

0.0980

0.1020

−0.0040

−4.0500

No

No

No

SexMale

−0.0318

−0.0342

0.0025

7.7200

No

No

No

z_Age

0.0764

0.0762

0.0002

0.2000

No

No

No

B = standardized regression coefficient reproduced B; boot.B = boostrapped standardized reproduced B; B_diff = change in B - boot.B; %_Diff = percentage difference; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

Term

Lower

boot.Lower

Lower_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

−0.1758

−0.1724

−0.0034

−1.9400

No

No

No

TSTPositive

−0.2950

−0.3653

0.0704

23.8600

Yes

No

No

SexMale

−0.2860

−0.2838

−0.0023

−0.7900

No

No

No

z_Age

−0.0511

−0.0512

0.0001

0.1300

No

No

No

Lower = standardized reproduced lower CI; boot.Lower = boostrapped standardized reproduced lower CI; Lower_diff = change in Lower - boot.Lower; %_change = percentage difference; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

Term

Upper

boot.Upper

Upper_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.1825

0.1920

−0.0096

−5.2500

No

No

No

TSTPositive

0.4910

0.5027

−0.0118

−2.4000

No

No

No

SexMale

0.2225

0.2193

0.0032

1.4600

No

No

No

z_Age

0.2038

0.2074

−0.0036

−1.7700

No

No

No

Upper = standardized reproduced upper CI; boot.Upper = boostrapped standardized reproduced upper CI; Upper_diff = change in Upper - boot.Upper; %_change = percentage difference; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

Term

Range

boot.Range

Range_Diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.3583

0.3644

0.0062

1.7200

No

No

No

TSTPositive

0.7859

0.8681

0.0821

10.4500

Yes

No

No

SexMale

0.5085

0.5030

−0.0055

−1.0800

No

No

No

z_Age

0.2549

0.2586

0.0037

1.4400

No

No

No

Range = standardized reproduced CI range; boot.B = boostrapped standardized reproduced CI range; Range_diff = change in CI Range ; %_change = percentage difference, Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

Term

p-value

boot.p-value

changep

SigChangeDirection

Intercept

0.9708

0.9572

0.0136

Remains non-sig, B same direction

TSTPositive

0.6236

0.6429

−0.0193

Remains non-sig, B same direction

SexMale

0.8059

0.7891

0.0168

Remains non-sig, B same direction

z_Age

0.2391

0.2506

−0.0116

Remains non-sig, B same direction

p-value = standardized reproduced p-value; boot.p-value = boostrapped standardized reproduced p-value; changep = change in p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Conclusions based on bootstrapped model

This model was inferentially reproducible. While some statistics changed by 10% or more these differences were not meaningful with change in standardized regression coefficients less than 0.1. The direction of effects and statistical significance remained consistent between the two models.

Model 3

Model results for WLZ change

Term

B

SE

Lower

Upper

t

p-value

Intercept

TST:

Positive – Negative

0.09

−0.11

0.29

0.37

Sex:

Female – Male

0.05

−0.08

0.18

0.43

Age

0.05

0.02

0.08

0.002

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit Statistics WLZ change

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for WLZ change

Term

SS

DF

MS

F

p-value

TST

Sex

Age

Residuals

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for WLZ change

Term

B

SE

Lower

Upper

t

p-value

Intercept

−0.874

0.290

−1.446

−0.302

−3.010

0.0029

TST:

Positive – Negative

0.090

0.100

−0.108

0.287

0.896

0.3712

Sex:

Female – Male

0.050

0.065

−0.078

0.178

0.776

0.4386

Age

0.047

0.015

0.017

0.077

3.086

0.0023

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for WLZ change

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

0.205

0.042

0.030

362.901

0.500

3.484

3

239

0.0166

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for WLZ change

Term

SS

DF

MS

F

p-value

TST

0.204

1

0.204

0.803

0.3712

Sex

0.153

1

0.153

0.602

0.4386

Age

2.422

1

2.422

9.522

0.0023

Residuals

60.792

239

0.254

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

Term

Statistic

p-value

Results

TST

Sex

Age

0.749

0.4548

No linearity violation

Tukey test

0.448

0.6544

No linearity violation

Specification test for predictors using quadratic tests, for fitted values curvature is tested through Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

Plots and tests show an unaccounted non-linear relationship may be present.

Testing for homoscedasticity

Statistic

p-value

Parameter

Method

1.296

0.7300

3

studentized Breusch-Pagan test

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.
Model descriptives including cook’s distance and leverage to understand outliers

Term

N

Mean

SD

Median

Min

Max

Skewness

Kurtosis

WLZ change

243

0.043

0.512

0.080

−1.190

2.140

0.178

0.352

TST*

243

1.119

0.325

1.000

1.000

2.000

2.334

3.461

Sex*

243

1.473

0.500

1.000

1.000

2.000

0.106

−1.997

Age

243

18.632

2.127

18.533

14.867

23.067

0.122

−1.174

.fitted

243

0.043

0.105

0.037

−0.163

0.309

0.105

−0.990

.resid

243

0.000

0.501

0.047

−1.191

2.046

0.173

0.206

.leverage

243

0.016

0.011

0.013

0.008

0.054

1.967

2.780

.sigma

243

0.504

0.002

0.505

0.488

0.505

−5.350

48.829

.cooksd

243

0.005

0.008

0.002

0.000

0.070

4.403

24.268

.std.resid

243

0.000

1.003

0.093

−2.417

4.076

0.167

0.195

dfb.1_

243

0.000

0.067

0.001

−0.218

0.236

−0.216

1.573

dfb.TSTP

243

−0.000

0.075

0.001

−0.427

0.349

−1.211

12.496

dfb.SxMl

243

0.000

0.065

−0.003

−0.253

0.176

−0.213

0.134

dfb.Age

243

0.000

0.066

−0.001

−0.225

0.210

0.212

1.421

dffit

243

0.001

0.136

0.009

−0.534

0.433

−0.240

1.451

cov.r

243

1.017

0.026

1.023

0.769

1.068

−3.925

33.616

* categorical variable

Cooks threshold

Cook’s distance measures the overall change in fit, if the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted values will change when the ith observation is removed. Potential influential observations \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\) where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can result in a large number of points identified, a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each parameter in the model. Potential influential observations \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. A practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients if the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 indicate inflated variances and reduced precision (wider confidence intervals), whereas values above 1 indicate deflated variances and increased precision (narrower confidence intervals). A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical cut-off between 0.9 to 1.1 was used to flag observations with meaningful impact on precision, although there is no agreed universal alternative cut-off.

Observations of interest identified by the influence plot

ID

StudRes

Leverage

CookD

dfb.1_

dfb.TSTP

dfb.SxMl

dfb.Age

dffit

cov.r

50

0.987

0.050

0.013

−0.116

0.182

0.058

0.112

0.226

1.053

233

−1.511

0.054

0.032

0.173

−0.291

0.111

−0.187

−0.361

1.035

110

4.216

0.009

0.037

−0.034

−0.077

−0.253

0.096

0.396

0.769

57

−2.172

0.048

0.059

0.236

−0.397

−0.127

−0.225

−0.488

0.988

238

−2.442

0.046

0.070

−0.211

−0.427

0.176

0.195

−0.534

0.965

StudRes = studentized residuals; CookD = Cook's Distance a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points
  • One observation had studentized residuals greater than 3, this observation had low leverage and Cook’s distance values, with cook’s distance, DFBETAs, and DFFITS within reasonable ranges, but the covariance ratio value was substantially lower than one. Suggesting minimal impact on the coefficient estimates but potential inflation of standard errors and consequent widening of confidence intervals.
Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic

p-value

Method

0.045

0.7071

Asymptotic one-sample Kolmogorov-Smirnov test

Statistic

p-value

Method

0.988

0.0353

Shapiro-Wilk normality test

Normality results
  • The Kolmogorov-Smirnov supports residuals being normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot looks roughly normal.
Assessing collinearity with VIF

Term

VIF

Tolerance

TST

1.010

0.990

Sex

1.003

0.997

Age

1.007

0.993

VIF = Variance Inflation Factor.

Collinearity results
  • All VIF values are under three, indicating no collinearity issues.
  • Overall, when taking into account VIF and SE, the model does not have collinearity issues.
Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation

Statistic

p-value

0.188

1.609

<0.001

Independence results
  • The Durbin–Watson test suggests there are auto-correlation issues.
  • Autocorrelation of residuals was assessed using the Durbin–Watson test, which evaluates first-order serial correlation. Evidence of autocorrelation may arise from unmodelled clustering or repeated measurements, but can also reflect artefacts of data ordering (e.g. sorting by time or subject) rather than true temporal dependence.
  • It was unclear if the study design was independent as recruitment of children was not explained.
Assumption conclusions

The model was found to meet the assumptions of linearity and homoscedasticity. The Q-Q plot and the Kolmogorov-Smirnov test indicated that residuals were approximately normal, but the Shapiro-Wilk test indicated there may be mild devation from normality. The covariance ratio flagged potential outliers that could inflate coefficient variances and widen confidence intervals. The study design and recruitment of children were not clearly described, making it unclear whether observations were independent.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for WLZ change

Change in regression coefficients

term

O_B

R_B

Change.B

reproduce.B

Intercept

−0.8739

TST:

Positive – Negative

0.09

0.0899

−0.0001

Reproduced

Sex:

Female – Male

0.05

0.0504

0.0004

Reproduced

Age

0.05

0.0472

−0.0028

Reproduced

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

term

O_lower

R_lower

Change.lci

Reproduce.lower

Intercept

−1.4459

TST:

Positive – Negative

−0.11

−0.1077

0.0023

Reproduced

Sex:

Female – Male

−0.08

−0.0775

0.0025

Reproduced

Age

0.02

0.0171

−0.0029

Reproduced

O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

term

O_upper

R_upper

Change.uci

Reproduce.upper

Intercept

−0.3020

TST:

Positive – Negative

0.29

0.2875

−0.0025

Reproduced

Sex:

Female – Male

0.18

0.1782

−0.0018

Reproduced

Age

0.08

0.0773

−0.0027

Reproduced

O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in p-values

Term

O_p

R_p

Change.p

Reproduce.p

SigChangeDirection

Intercept

0.0029

TST:

Positive – Negative

0.37

0.3712

0.0012

Reproduced

Remains non-sig, B same direction

Sex:

Female – Male

0.43

0.4386

0.0086

Incorrect Rounding

Remains non-sig, B same direction

Age

0.002

0.0023

0.0003

Reproduced

Remains sig, B same direction

O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Bland Altman Plot showing differences between original and reproduced p-values for WLZ change

Results for p-values

P-values for this model were mostly reproduced. One p-value differed due to a rounding error, but this did not alter its statistical significance.

Conclusion computational reproducibility

This model was mostly computationally reproducible, with minor rounding errors. P-values were reproduced and had the same interpretation, and regression coefficients did not change direction.

Methods

The linear model was computationally reproduced. Residual diagnostics suggested possible non-normality. Non-parametric BCa bootstrapping (10,000 resamples) was used to obtain standardized coefficients and 95% confidence intervals (CIs), and CI widths were examined. Percentage and absolute changes in estimates and CI bounds, relative to the original linear-model estimates, were summarised using <0.10 and <0.20 thresholds. Coefficient direction and statistical significance were assessed for consistency.

Results

Bootstrapped results

Bias-Corrected and Accelerated (BCa) bootstrap confidence intervals were calculated using 10,000 resamples.

Change in regression coefficients

Term

B

boot.B

B_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.0256

0.0266

−0.0010

−3.9800

No

No

No

TSTPositive

0.1755

0.1794

−0.0039

−2.2300

No

No

No

SexMale

−0.0983

−0.1004

0.0021

2.0900

No

No

No

z_Age

0.1961

0.1958

0.0003

0.1300

No

No

No

B = standardized regression coefficient reproduced B; boot.B = boostrapped standardized reproduced B; B_diff = change in B - boot.B; %_Diff = percentage difference; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

Term

Lower

boot.Lower

Lower_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

−0.1503

−0.1411

−0.0092

−6.1000

No

No

No

TSTPositive

−0.2104

−0.2947

0.0842

40.0400

Yes

No

No

SexMale

−0.3481

−0.3504

0.0024

0.6800

No

No

No

z_Age

0.0709

0.0685

0.0024

3.3400

No

No

No

Lower = standardized reproduced lower CI; boot.Lower = boostrapped standardized reproduced lower CI; Lower_diff = change in Lower - boot.Lower; %_change = percentage difference; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

Term

Upper

boot.Upper

Upper_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.2015

0.2039

−0.0023

−1.1600

No

No

No

TSTPositive

0.5614

0.5883

−0.0269

−4.7900

No

No

No

SexMale

0.1514

0.1500

0.0014

0.9300

No

No

No

z_Age

0.3212

0.3245

−0.0033

−1.0200

No

No

No

Upper = standardized reproduced upper CI; boot.Upper = boostrapped standardized reproduced upper CI; Upper_diff = change in Upper - boot.Upper; %_change = percentage difference; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

Term

Range

boot.Range

Range_Diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.3518

0.3450

−0.0068

−1.9400

No

No

No

TSTPositive

0.7719

0.8830

0.1111

14.4000

Yes

Yes

No

SexMale

0.4994

0.5004

0.0010

0.1900

No

No

No

z_Age

0.2503

0.2560

0.0057

2.2600

No

No

No

Range = standardized reproduced CI range; boot.B = boostrapped standardized reproduced CI range; Range_diff = change in CI Range ; %_change = percentage difference, Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

Term

p-value

boot.p-value

changep

SigChangeDirection

Intercept

0.7747

0.7609

0.0137

Remains non-sig, B same direction

TSTPositive

0.3712

0.4215

−0.0503

Remains non-sig, B same direction

SexMale

0.4386

0.4282

0.0105

Remains non-sig, B same direction

z_Age

0.0023

0.0026

−0.0004

Remains sig, B same direction

p-value = standardized reproduced p-value; boot.p-value = boostrapped standardized reproduced p-value; changep = change in p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check distribution of the bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Conclusions based on the bootstrapped model

This model was not inferentially reproduced with differences in the confidence interval range of standardized coefficients for TST changing by or by 14%.