Paper 42: Physical fitness and physical activity of 6-7-year-old children according to weight status and sports participation

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

April 5, 2026

References

Riso E-M, Toplaan L, Viira P, Vaiksaar S, Jurimae J (2019) Physical fitness and physical activity of 6-7-year-old children according to weight status and sports participation. PLoS ONE 14(6): e0218901. https://doi.org/10.1371/journal.pone.0218901

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability, and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.
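The exact thresholds are not spelled out above, so the following is only a minimal sketch of how a rounding-aware comparison of this kind could work. It assumes a hypothetical rule: a statistic is Reproduced when the reproduced value, rounded to the number of decimal places reported in the paper, equals the published value; Incorrect Rounding when it misses by at most one unit in the last reported decimal place; and Not Reproduced otherwise. The function name `classify_statistic` and the one-unit tolerance are illustrative assumptions, not the project's actual code.

```python
from decimal import Decimal, ROUND_HALF_UP


def classify_statistic(original: str, reproduced: str) -> str:
    """Compare a published value with a reproduced one (hypothetical rule).

    Values are passed as strings so that the number of decimal places
    reported in the paper is preserved exactly.
    """
    orig = Decimal(original)
    repro = Decimal(reproduced)
    # One unit in the last decimal place reported in the paper, e.g. 0.001.
    unit = Decimal(1).scaleb(orig.as_tuple().exponent)
    rounded = repro.quantize(unit, rounding=ROUND_HALF_UP)
    if rounded == orig:
        return "Reproduced"
    if abs(repro - orig) <= unit:  # assumed tolerance: one unit in the last place
        return "Incorrect Rounding"
    return "Not Reproduced"


# Published versus reproduced values taken from the tables in this report.
print(classify_statistic("0.797", "0.7965"))   # p-value, Model 2
print(classify_statistic("-0.057", "-0.0530"))  # standardized B, Model 1
```

Decimal arithmetic is used rather than binary floating point so that a value such as 0.7965 rounds half-up to 0.797 without representation error.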

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.
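As an illustration of the thresholds just described, the sketch below flags a change between two standardized coefficients. It assumes the relative change is expressed as a percentage of the first (reference) coefficient; the function name `flag_change` and the dictionary keys are hypothetical and are not taken from the project's code.

```python
def flag_change(reference: float, comparison: float) -> dict:
    """Flag relative (>=10%) and absolute (>=0.1, >=0.2) coefficient changes."""
    diff = reference - comparison
    abs_diff = abs(diff)
    if reference != 0:
        pct_diff = 100 * diff / abs(reference)
    else:
        # Relative change is undefined for a zero reference coefficient.
        pct_diff = 0.0 if diff == 0 else float("inf")
    return {
        "abs_diff": abs_diff,
        "pct_diff": pct_diff,
        "diff_10pct": abs(pct_diff) >= 10,  # relative change of 10% or more
        "diff_0.1": abs_diff >= 0.1,        # moderate absolute change
        "diff_0.2": abs_diff >= 0.2,        # substantial absolute change
    }


# Illustrative inputs: two small standardized coefficients.
result = flag_change(-0.0181, -0.0226)
print(result)
```

With small coefficients such as these, the relative flag can trigger while both absolute flags stay off, which is why the absolute thresholds are used to judge whether a change is meaningful.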

Results from the reproduction of the Riso et al. (2019) paper are presented below. An overall summary of results is presented first, followed by model-specific results organised within tab panels. Within each panel, the Original results tab displays the linear regression outputs extracted from the published paper. The Reproduced results tab presents estimates derived from the authors’ shared data, along with a comprehensive assessment of linear regression assumptions. The Differences tab compares the original and reproduced models to assess computational reproducibility. Finally, the Sensitivity analysis tab evaluates inferential reproducibility by examining whether identified assumption violations meaningfully affected the results.

Summary from statistical review

This paper investigated associations between body composition and physical activity in children aged six to seven years. Linear regression was used in two directions: first, with three measures of body composition as predictors of physical activity performance, and second, with time spent in different physical activity intensities used to predict performance outcomes. For each of the 30 analyses, three models were fitted: an unadjusted model and two models adjusted for confounders. Although the methods indicate that standardized regression coefficients were reported, the results were interpreted in the text as unit changes rather than standard deviation changes, indicating a mismatch between the reported metrics and their interpretation. Interpretation was based primarily on statistical significance rather than scientific or clinical importance, with no reporting of 95% confidence intervals or standard errors. No assessment or discussion of model assumptions or influential observations was reported, although multicollinearity was assessed. In addition, randomisation occurred at the kindergarten level rather than the individual level; this clustering was neither discussed nor accounted for in the analyses.

Data availability and software used

The authors provided the data as a PDF, which took substantial effort to convert into a usable format. SPSS was used to conduct the statistical analyses.

Regression sample

The paper reported a total of 90 linear regression models. Of these, 63 models examined associations of physical fitness and physical activity with body composition, and a further 27 models explored associations between physical fitness and both sedentary time and different levels of physical activity. Three models were selected at random for detailed assessment. The first model examined fat mass predicted by vigorous physical activity, adjusting for age, sex, parental educational level, and sports club participation. The second model assessed the association between BMI and sedentary time. The third model examined the association between fat-free mass and sedentary time. The authors reported only the coefficient and p-value of the main independent variable, along with the R² values.

Computational reproducibility results

The two univariate models were successfully reproduced; however, the multivariable model for fat mass could not be reproduced. It was unclear whether this was due to differences in variable adjustment from those specified in the paper or whether errors were introduced during data extraction from the PDF. The number of observations used in the regression models was also unclear. The dataset appeared to contain 283 participants; however, a table describing the overall sample reported 256 participants, suggesting that analyses may have been restricted to participants with available BMI data. Descriptive statistics for the model variables were largely reproducible. The only discrepancy was for vigorous physical activity, which was reported as 21 (10) but reproduced as 20.54 (10.55); this difference was considered consistent with rounding error. Multiple sample-size specifications were explored, including analyses with and without restriction to participants with complete BMI data; however, the multivariable model results remained unreproducible under all specifications. Consequently, the paper was not considered computationally reproducible.

Inferential reproducibility results

This paper was not considered inferentially reproducible. The BMI model residuals indicated deviations from normality, and both univariate models contained observations that could influence the precision of the estimates. Although the change in standardised regression coefficients between the reproduced and bootstrapped models was less than 0.1, suggesting that departures from normality or the presence of influential observations were not consequential, the study employed randomisation at the kindergarten level. As a result, observations were clustered and could not be assumed to be independent. Because this clustering was neither accounted for nor discussed, and could not be formally assessed due to the absence of cluster identifiers in the available data, the study was not inferentially reproducible.

Model 1

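Model results for Fat Mass (original)

| Term | B | SE | Lower | Upper | t | p-value | stdEst |
|---|---|---|---|---|---|---|---|
| Intercept | | | | | | | |
| VPAday | | | | | | 0.615 | −0.057 |
| Age | | | | | | | |
| Gender: Girl – Boy | | | | | | | |
| Educational_level: University degree – No university degree | | | | | | | |
| SC_attendance: Yes – No | | | | | | | |

SE = Standard error; Lower = lower limit of the 95% confidence interval; Upper = upper limit of the 95% confidence interval; stdEst = standardized B.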

Fit statistics for Fat Mass (original)

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| | 0.020 | | | | | | | |

R2Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

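ANOVA table for Fat Mass (original)

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| VPAday | | | | | |
| Age | | | | | |
| Gender | | | | | |
| Educational_level | | | | | |
| SC_attendance | | | | | |
| Residuals | | | | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.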

Model results for Fat Mass (reproduced)

| Term | B | SE | Lower | Upper | t | p-value | stdEst |
|---|---|---|---|---|---|---|---|
| Intercept | 7.524 | 2.158 | 3.264 | 11.784 | 3.486 | <0.001 | |
| VPAday | −0.010 | 0.015 | −0.039 | 0.019 | −0.673 | 0.5020 | −0.053 |
| Age | −0.225 | 0.311 | −0.840 | 0.389 | −0.724 | 0.4702 | −0.055 |
| Gender: Girl – Boy | 0.022 | 0.317 | −0.603 | 0.647 | 0.069 | 0.9448 | 0.011 |
| Educational_level: University degree – No university degree | −0.361 | 0.372 | −1.095 | 0.373 | −0.972 | 0.3326 | −0.177 |
| SC_attendance: Yes – No | −0.223 | 0.368 | −0.950 | 0.504 | −0.606 | 0.5454 | −0.109 |

SE = Standard error; Lower = lower limit of the 95% confidence interval; Upper = upper limit of the 95% confidence interval; stdEst = standardized B.

Fit statistics for Fat Mass (reproduced)

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.124 | 0.015 | −0.013 | 769.350 | 2.019 | 0.538 | 5 | 172 | 0.7472 |

R2Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Fat Mass (reproduced)

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| VPAday | 1.910 | 1 | 1.910 | 0.453 | 0.5020 |
| Age | 2.211 | 1 | 2.211 | 0.524 | 0.4702 |
| Gender | 0.020 | 1 | 0.020 | 0.005 | 0.9448 |
| Educational_level | 3.984 | 1 | 3.984 | 0.944 | 0.3326 |
| SC_attendance | 1.549 | 1 | 1.549 | 0.367 | 0.5454 |
| Residuals | 725.925 | 172 | 4.220 | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Change in regression coefficients

| Term | O_std.B | R_std.B | changestdB | reproduce.std.B |
|---|---|---|---|---|
| Intercept | | | | |
| VPAday | −0.057 | −0.0530 | 0.0040 | Not Reproduced |
| Age | | −0.0553 | | |
| Gender: Girl – Boy | | 0.0108 | | |
| Educational_level: University degree – No university degree | | −0.1770 | | |
| SC_attendance: Yes – No | | −0.1093 | | |

O_std.B = original standardized B; R_std.B = reproduced standardized B; changestdB = R_std.B − O_std.B; reproduce.std.B = whether the standardized B was reproduced.

Change in R2

| O_R2 | R_R2 | Change.R2 | Reproduce.R2 |
|---|---|---|---|
| 0.020 | 0.0154 | −0.0046 | Not Reproduced |

O_R2 = original R2; R_R2 = reproduced R2; Change.R2 = R_R2 − O_R2; Reproduce.R2 = whether R2 was reproduced.

Change in p-values

| Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
|---|---|---|---|---|---|
| Intercept | | <0.001 | | | |
| VPAday | 0.615 | 0.5020 | −0.1130 | Not Reproduced | |
| Age | | 0.4702 | | | |
| Gender: Girl – Boy | | 0.9448 | | | |
| Educational_level: University degree – No university degree | | 0.3326 | | | |
| SC_attendance: Yes – No | | 0.5454 | | | |

O_p = original p-value; R_p = reproduced p-value; Change.p = R_p − O_p; Reproduce.p = whether the p-value was reproduced; SigChangeDirection = change in statistical significance and in the direction of B between the original and reproduced models. Note: p-values reported as <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values
  • The p-value was not reproduced for this model.

Conclusion computational reproducibility

This model was not computationally reproducible.

As this model was not computationally reproducible, inferential reproducibility was not assessed: because the original analysis could not be reproduced, statistical assumptions could not be meaningfully compared or interpreted.

Model 2

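Model results for BMI (original)

| Term | B | SE | Lower | Upper | t | p-value | stdEst |
|---|---|---|---|---|---|---|---|
| Intercept | | | | | | | |
| SEDday | | | | | | 0.797 | −0.018 |

SE = Standard error; Lower = lower limit of the 95% confidence interval; Upper = upper limit of the 95% confidence interval; stdEst = standardized B.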

Fit statistics for BMI (original)

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| | 0.000 | | | | | | | |

R2Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for BMI (original)

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| SEDday | | | | | |
| Residuals | | | | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

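Model results for BMI (reproduced)

| Term | B | SE | Lower | Upper | t | p-value | stdEst |
|---|---|---|---|---|---|---|---|
| Intercept | 16.148 | 0.571 | 15.022 | 17.273 | 28.294 | <0.001 | |
| SEDday | −0.000 | 0.001 | −0.003 | 0.002 | −0.258 | 0.7965 | −0.018 |

SE = Standard error; Lower = lower limit of the 95% confidence interval; Upper = upper limit of the 95% confidence interval; stdEst = standardized B.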

Fit statistics for BMI (reproduced)

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.018 | 0.000 | −0.005 | 804.381 | 1.763 | 0.067 | 1 | 199 | 0.7965 |

R2Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.
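In a linear regression with a single predictor, the standardized slope equals the Pearson correlation between predictor and outcome, and its absolute value equals R. As a quick consistency check, the correlation can be recovered from the t statistic and the residual degrees of freedom. The sketch below applies that identity to the rounded BMI values reported above (t = −0.258, DF2 = 199); the function name is hypothetical.

```python
import math


def standardized_slope_from_t(t_value: float, df_resid: int) -> float:
    """Pearson r (the standardized slope in simple regression) from t and df."""
    return t_value / math.sqrt(t_value ** 2 + df_resid)


# Rounded values from the reproduced BMI model.
beta_bmi = standardized_slope_from_t(-0.258, 199)
print(round(beta_bmi, 4))
```

Because the inputs are rounded to three decimals, agreement with the tabulated stdEst and R can only be expected to the reported precision.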

ANOVA table for BMI (reproduced)

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| SEDday | 0.209 | 1 | 0.209 | 0.067 | 0.7965 |
| Residuals | 624.826 | 199 | 3.140 | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.
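The entries in the reproduced BMI ANOVA table are linked by simple arithmetic, which gives a quick way to check the table's internal consistency. The sketch below recomputes the mean squares, F, and R2 from the rounded sums of squares reported above, and also uses the fact that F equals the square of the slope's t statistic in a single-predictor model. Variable names are illustrative.

```python
# Rounded values from the reproduced BMI model.
ss_model, df_model = 0.209, 1
ss_resid, df_resid = 624.826, 199
t_slope = -0.258

ms_model = ss_model / df_model                 # mean square for SEDday
ms_resid = ss_resid / df_resid                 # residual mean square
f_stat = ms_model / ms_resid                   # F = MS_model / MS_residual
r_squared = ss_model / (ss_model + ss_resid)   # share of variation explained

print(round(ms_resid, 3), round(f_stat, 3), round(t_slope ** 2, 3), round(r_squared, 4))
```

To the reported precision, these values agree with the tabulated residual mean square (3.140), F (0.067) and R2 (0.0003 before rounding to three decimals).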

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non-linear relationships

| Term | Statistic | p-value | Results |
|---|---|---|---|
| SEDday | 1.050 | 0.2950 | No linearity violation |
| Tukey test | 1.050 | 0.2937 | No linearity violation |

Predictors were tested for curvature using quadratic terms; curvature in the fitted values was tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

No linearity violation was observed in either plots or tests.

Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
|---|---|---|---|
| 0.091 | 0.7628 | 1 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.
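The p-value reported for the studentized Breusch-Pagan test can be checked by hand, assuming the statistic is referred to a chi-square distribution whose degrees of freedom equal the Parameter column (here 1). For one degree of freedom the upper-tail probability has a closed form in terms of the complementary error function, so no statistics library is needed. The function name is hypothetical.

```python
import math


def chi2_sf_df1(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Rounded Breusch-Pagan statistic from the BMI model.
p_value = chi2_sf_df1(0.091)
print(round(p_value, 3))
```

The result matches the reported p-value of 0.7628 to within the error introduced by rounding the statistic to three decimals.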
Model descriptives including Cook’s distance and leverage to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| BMI | 201 | 16.004 | 1.768 | 15.709 | 12.559 | 23.531 | 1.202 | 1.892 |
| SEDday | 201 | 410.458 | 92.354 | 392.429 | 251.333 | 901.025 | 2.078 | 6.127 |
| .fitted | 201 | 16.004 | 0.032 | 16.010 | 15.832 | 16.060 | −2.078 | 6.127 |
| .resid | 201 | −0.000 | 1.768 | −0.297 | −3.486 | 7.525 | 1.200 | 1.873 |
| .leverage | 201 | 0.010 | 0.014 | 0.006 | 0.005 | 0.146 | 6.014 | 44.720 |
| .sigma | 201 | 1.772 | 0.009 | 1.775 | 1.694 | 1.776 | −4.972 | 32.829 |
| .cooksd | 201 | 0.006 | 0.031 | 0.001 | 0.000 | 0.425 | 12.382 | 163.800 |
| .std.resid | 201 | 0.000 | 1.004 | −0.168 | −1.980 | 4.257 | 1.199 | 1.854 |
| dfb.1_ | 201 | −0.000 | 0.089 | −0.002 | −0.856 | 0.266 | −3.796 | 42.003 |
| dfb.SEDd | 201 | 0.000 | 0.090 | 0.000 | −0.290 | 0.916 | 4.770 | 53.002 |
| dffit | 201 | 0.002 | 0.115 | −0.012 | −0.300 | 0.932 | 2.932 | 20.480 |
| cov.r | 201 | 1.010 | 0.024 | 1.014 | 0.839 | 1.125 | −1.741 | 17.097 |

* categorical variable

Cook’s distance threshold

Cook’s distance measures the overall change in fit when the ith observation is removed. Potentially influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.

DFFITS threshold

DFFITS measures how many standard deviations the fitted value for the ith observation changes when that observation is removed. Potentially influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of model parameters (including the intercept) and n is the number of observations. Because this rule can flag a large number of points, a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETAS threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETAS value for each model parameter. Potentially influential observations are identified by \(|\text{DFBETAS}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model, so a practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical range of 0.9 to 1.1 was used, with values outside it flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.
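To make the size-adjusted guidelines in this section concrete, the sketch below evaluates the four formulas for the BMI model, taking n = 201 observations and p = 2 parameters (intercept and SEDday). The function name `influence_cutoffs` and the dictionary keys are hypothetical.

```python
import math


def influence_cutoffs(n: int, p: int) -> dict:
    """Size-adjusted guideline cut-offs for common influence measures."""
    return {
        "cooks_d": 4 / n,                                   # Cook's distance > 4/n
        "dffits": 2 * math.sqrt(p / n),                     # |DFFITS| > 2*sqrt(p/n)
        "dfbetas": 2 / math.sqrt(n),                        # |DFBETAS| > 2/sqrt(n)
        "covratio_band": (1 - 3 * p / n, 1 + 3 * p / n),    # |COVRATIO - 1| > 3p/n
    }


cutoffs = influence_cutoffs(n=201, p=2)
for name, value in cutoffs.items():
    print(name, value)
```

For this model the size-adjusted cut-offs are far stricter than the practical ones: the largest Cook's distance in the data (0.425) exceeds 4/n many times over yet stays below the practical threshold of 0.5.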

Observations of interest identified by the influence plot

| ID | StudRes | Leverage | CookD | dfb.1_ | dfb.SEDd | dffit | cov.r |
|---|---|---|---|---|---|---|---|
| 243 | −0.263 | 0.077 | 0.003 | 0.068 | −0.074 | −0.076 | 1.094 |
| 68 | 4.454 | 0.005 | 0.045 | 0.089 | −0.020 | 0.316 | 0.839 |
| 42 | 3.545 | 0.008 | 0.049 | 0.254 | −0.204 | 0.323 | 0.901 |
| 159 | 2.253 | 0.146 | 0.425 | −0.856 | 0.916 | 0.932 | 1.125 |

StudRes = studentized residual; CookD = Cook's distance, a combined measure of leverage and influence; dfb.* = DFBETAS, which measure how much a specific regression coefficient changes (in standard errors) when an observation is removed; dffit = DFFITS, which measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

One observation had a relatively high Cook’s distance (0.425), indicating a moderate influence on the overall fitted model. A second observation had a studentised residual greater than 3 and a large DFFITS value (0.9), suggesting that this observation substantially influenced its own fitted value. Despite this, the observation had low Cook’s distance, indicating limited impact on the overall model estimates. Several observations had COVRATIO values greater than 1, implying that their removal could reduce standard errors and lead to narrower confidence intervals.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

| Statistic | p-value | Method |
|---|---|---|
| 0.111 | 0.0146 | Asymptotic one-sample Kolmogorov-Smirnov test |
| 0.922 | <0.001 | Shapiro-Wilk normality test |

Normality results
  • The Kolmogorov-Smirnov test indicates residuals may not be normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot indicates the residuals are not normally distributed.
Assessing independence with the Durbin–Watson test for autocorrelation

| AutoCorrelation | Statistic | p-value |
|---|---|---|
| −0.203 | 2.397 | <0.001 |

Independence results
  • The Durbin–Watson test suggests there are auto-correlation issues.
  • The study design is not independent and should be assessed using linear mixed models or generalized estimating equations.
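For reference, the Durbin–Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals, and it is approximately 2(1 − r), where r is the lag-1 autocorrelation. The sketch below shows the calculation on a toy residual series and checks the approximation against the values reported above. The function name is illustrative, and the statistic depends on the order of the rows in the dataset.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals taken in their row order."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Toy series with alternating signs (negative autocorrelation gives DW > 2).
toy_dw = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Approximation 2 * (1 - r) using the reported lag-1 autocorrelation of -0.203.
approx_dw = 2 * (1 - (-0.203))
print(toy_dw, round(approx_dw, 3))
```

The approximation gives about 2.41, close to the reported statistic of 2.397. Because the test looks only at neighbouring rows, it cannot by itself detect or rule out clustering within kindergartens.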
Assumption conclusions

There were no meaningful departures from the assumptions of linearity or homoscedasticity. However, a small number of observations may influence estimated means and the width of confidence intervals and would warrant further investigation. Residuals were not normally distributed. In addition, the study design may violate the assumption of independence, as sampling occurred within kindergartens; however, this could not be formally assessed because kindergarten identifiers were not available in the provided dataset.

Change in regression coefficients

| Term | O_std.B | R_std.B | changestdB | reproduce.std.B |
|---|---|---|---|---|
| Intercept | | | | |
| SEDday | −0.018 | −0.0183 | −0.0003 | Reproduced |

O_std.B = original standardized B; R_std.B = reproduced standardized B; changestdB = R_std.B − O_std.B; reproduce.std.B = whether the standardized B was reproduced.

Change in R2

| O_R2 | R_R2 | Change.R2 | Reproduce.R2 |
|---|---|---|---|
| 0.000 | 0.0003 | 0.0003 | Reproduced |

O_R2 = original R2; R_R2 = reproduced R2; Change.R2 = R_R2 − O_R2; Reproduce.R2 = whether R2 was reproduced.

Change in p-values

| Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
|---|---|---|---|---|---|
| Intercept | | <0.001 | | | |
| SEDday | 0.797 | 0.7965 | −0.0005 | Reproduced | |

O_p = original p-value; R_p = reproduced p-value; Change.p = R_p − O_p; Reproduce.p = whether the p-value was reproduced; SigChangeDirection = change in statistical significance and in the direction of B between the original and reproduced models. Note: p-values reported as <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values
  • The p-value for this model was reproduced.

Conclusion computational reproducibility

This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, residual diagnostics indicated a small number of observations that may contribute to wider confidence intervals. All continuous variables in the model were standardised, and inference was assessed using bootstrapped standardised regression coefficients and their corresponding 95% confidence intervals. Percentage and absolute changes in estimates and confidence-interval ranges relative to the original linear model were summarised using thresholds of 10% change and standardised coefficient differences of <0.10 and <0.20. Consistency of coefficient direction and statistical significance was also evaluated.

Bootstrapped results

A non-parametric bootstrap with bias-corrected and accelerated (BCa) confidence intervals was performed using 10,000 resamples.
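The resampling idea can be sketched in a few lines. The example below is not the analysis code used for this report: it substitutes the simpler percentile interval for the BCa interval, uses 2,000 rather than 10,000 resamples, and runs on synthetic data generated purely for illustration. All function and variable names are hypothetical.

```python
import random
import statistics


def zscore(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def ols_slope(x, y):
    """Least-squares slope of y on x for a single predictor."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def bootstrap_slope_ci(x, y, n_boot=2000, seed=1):
    """Percentile 95% CI for the standardized slope (case resampling)."""
    rng = random.Random(seed)
    zx, zy = zscore(x), zscore(y)
    n = len(zx)
    estimates = []
    for _ in range(n_boot):
        # Resample whole observations (x, y pairs) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(ols_slope([zx[i] for i in idx], [zy[i] for i in idx]))
    estimates.sort()
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return ols_slope(zx, zy), lower, upper


# Synthetic data for illustration only (not the study data).
data_rng = random.Random(42)
x = [data_rng.gauss(400, 90) for _ in range(60)]
y = [16 + data_rng.gauss(0, 1.8) for _ in x]
estimate, lower, upper = bootstrap_slope_ci(x, y)
print(estimate, lower, upper)
```

Resampling whole observations keeps each predictor value paired with its outcome. It still treats observations as independent, so it does not address clustering within kindergartens.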

Change in regression coefficients

| Term | B | boot.B | B_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | −0.0243 | −0.0261 | 0.0018 | 7.3400 | No | No | No |
| z_SEDday | −0.0181 | −0.0226 | 0.0046 | 25.2600 | Yes | No | No |

B = reproduced standardized regression coefficient; boot.B = bootstrapped standardized regression coefficient; B_diff = B − boot.B; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

| Term | Lower | boot.Lower | Lower_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | −0.1618 | −0.1530 | −0.0089 | −5.4800 | No | No | No |
| z_SEDday | −0.1559 | −0.1629 | 0.0069 | 4.4500 | No | No | No |

Lower = reproduced standardized lower confidence limit; boot.Lower = bootstrapped standardized lower confidence limit; Lower_diff = Lower − boot.Lower; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

| Term | Upper | boot.Upper | Upper_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.1133 | 0.1237 | −0.0105 | −9.2300 | No | No | No |
| z_SEDday | 0.1198 | 0.1491 | −0.0292 | −24.4000 | Yes | No | No |

Upper = reproduced standardized upper confidence limit; boot.Upper = bootstrapped standardized upper confidence limit; Upper_diff = Upper − boot.Upper; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

| Term | Range | boot.Range | Range_Diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.2751 | 0.2767 | 0.0016 | 0.5800 | No | No | No |
| z_SEDday | 0.2758 | 0.3120 | 0.0362 | 13.1200 | Yes | No | No |

Range = reproduced standardized CI range; boot.Range = bootstrapped standardized CI range; Range_Diff = change in CI range; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

| Term | p-value | boot.p-value | changep | SigChangeDirection |
|---|---|---|---|---|
| Intercept | 0.7281 | 0.7070 | 0.0210 | Remains non-sig, B same direction |
| z_SEDday | 0.7965 | 0.7777 | 0.0189 | Remains non-sig, B same direction |

p-value = p-value from the reproduced standardized model; boot.p-value = p-value from the bootstrapped standardized model; changep = p-value − boot.p-value; SigChangeDirection = change in statistical significance and in the direction of B between the reproduced and bootstrapped models.

Check distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Conclusions based on the bootstrapped model

This model was inferentially reproducible. While some statistics changed by 10% or more, these differences were not meaningful, with a change in standardized regression coefficients of less than 0.1. The direction of effects and statistical significance remained consistent between the reproduced and bootstrapped models.

Conclusions Inferential reproducibility

The study employed randomisation at the kindergarten level. Consequently, observations were clustered and could not be assumed to be independent. Because this clustering was neither accounted for nor discussed in the analysis, and could not be assessed because kindergarten identifiers were not provided in the dataset, the model was not inferentially reproducible.

Model 3

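Model results for Fat free mass (original)

| Term | B | SE | Lower | Upper | t | p-value | stdEst |
|---|---|---|---|---|---|---|---|
| Intercept | | | | | | | |
| SEDday | | | | | | 0.412 | −0.060 |

SE = Standard error; Lower = lower limit of the 95% confidence interval; Upper = upper limit of the 95% confidence interval; stdEst = standardized B.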

Fit statistics for Fat free mass (original)

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| | 0.004 | | | | | | | |

R2Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Fat free mass (original)

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| SEDday | | | | | |
| Residuals | | | | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Fat free mass (reproduced)

| Term | B | SE | Lower | Upper | t | p-value | stdEst |
|---|---|---|---|---|---|---|---|
| Intercept | 20.700 | 0.921 | 18.884 | 22.516 | 22.483 | <0.001 | |
| SEDday | −0.002 | 0.002 | −0.006 | 0.003 | −0.822 | 0.4123 | −0.060 |

SE = Standard error; Lower = lower limit of the 95% confidence interval; Upper = upper limit of the 95% confidence interval; stdEst = standardized B.

Fit statistics for Fat free mass (reproduced)

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.060 | 0.004 | −0.002 | 942.812 | 2.811 | 0.675 | 1 | 189 | 0.4123 |

R2Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Fat free mass (reproduced)

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| SEDday | 5.390 | 1 | 5.390 | 0.675 | 0.4123 |
| Residuals | 1,508.950 | 189 | 7.984 | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non-linear relationships

| Term | Statistic | p-value | Results |
|---|---|---|---|
| SEDday | −0.001 | 0.9993 | No linearity violation |
| Tukey test | −0.001 | 0.9993 | No linearity violation |

Predictors were tested for curvature using quadratic terms; curvature in the fitted values was tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

No linearity violation was observed in either plots or tests.

Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
|---|---|---|---|
| 0.169 | 0.6808 | 1 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.
Model descriptives including Cook’s distance and leverage to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| Fat free mass | 191 | 19.963 | 2.823 | 19.494 | 14.352 | 27.692 | 0.448 | −0.165 |
| SEDday | 191 | 411.419 | 93.945 | 393.050 | 251.333 | 901.025 | 2.056 | 5.876 |
| .fitted | 191 | 19.963 | 0.168 | 19.996 | 19.085 | 20.250 | −2.056 | 5.876 |
| .resid | 191 | 0.000 | 2.818 | −0.490 | −5.358 | 7.618 | 0.458 | −0.175 |
| .leverage | 191 | 0.010 | 0.015 | 0.006 | 0.005 | 0.148 | 5.865 | 42.516 |
| .sigma | 191 | 2.826 | 0.010 | 2.829 | 2.778 | 2.833 | −2.457 | 6.799 |
| .cooksd | 191 | 0.006 | 0.017 | 0.002 | 0.000 | 0.143 | 5.672 | 35.940 |
| .std.resid | 191 | 0.000 | 1.003 | −0.180 | −1.901 | 2.706 | 0.455 | −0.176 |
| dfb.1_ | 191 | −0.000 | 0.082 | −0.000 | −0.431 | 0.477 | 0.097 | 12.318 |
| dfb.SEDd | 191 | 0.000 | 0.083 | −0.001 | −0.521 | 0.462 | −0.020 | 17.131 |
| dffit | 191 | −0.000 | 0.111 | −0.014 | −0.539 | 0.470 | 0.214 | 5.343 |
| cov.r | 191 | 1.011 | 0.021 | 1.014 | 0.941 | 1.171 | 1.949 | 17.633 |

* categorical variable

Cook’s threshold

Cook’s distance measures the overall change in fit when the ith observation is removed. Potentially influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFITS measures how many standard deviations the fitted value changes when the ith observation is removed. Potentially influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of model parameters (including the intercept) and n is the number of observations. Because this guideline can flag a large number of points, a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each model parameter. Potentially influential observations are identified by \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model, so a practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. As a practical cut-off, values outside the range 0.9 to 1.1 were flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.

Observations of interest identified by the influence plot

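| ID | StudRes | Leverage | CookD | dfb.1_ | dfb.SEDd | dffit | cov.r |
|---|---|---|---|---|---|---|---|
| 97 | 2.753 | 0.008 | 0.028 | 0.174 | −0.133 | 0.240 | 0.941 |
| 188 | 2.729 | 0.019 | 0.069 | −0.268 | 0.320 | 0.377 | 0.953 |
| 159 | 1.127 | 0.148 | 0.110 | −0.431 | 0.462 | 0.470 | 1.171 |
| 243 | −1.849 | 0.078 | 0.143 | 0.477 | −0.521 | −0.539 | 1.058 |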

StudRes = studentized residuals; CookD = Cook's distance, a combined measure of leverage and influence; DFBETAS (dfb.*) measure how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS (dffit) measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.
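To make the cut-offs described above concrete, the following sketch evaluates the size-adjusted guidelines for n = 191 and p = 2 (intercept plus SEDday, assumed from the model shown) and applies both the guidelines and the practical cut-offs to the four observations in the table. The dictionary keys and flag names are illustrative, and the values are transcribed from the table as rounded.

```python
import math

n, p = 191, 2  # observations; parameters (intercept + SEDday, an assumption)

# Size-adjusted guideline thresholds quoted in the text
cook_cut = 4 / n
dffits_cut = 2 * math.sqrt(p / n)
dfbetas_cut = 2 / math.sqrt(n)
covratio_cut = 3 * p / n

# Rounded values transcribed from the table of observations of interest
rows = {
    97: dict(cookd=0.028, dfb_int=0.174, dfb_sed=-0.133, dffit=0.240, cov_r=0.941),
    188: dict(cookd=0.069, dfb_int=-0.268, dfb_sed=0.320, dffit=0.377, cov_r=0.953),
    159: dict(cookd=0.110, dfb_int=-0.431, dfb_sed=0.462, dffit=0.470, cov_r=1.171),
    243: dict(cookd=0.143, dfb_int=0.477, dfb_sed=-0.521, dffit=-0.539, cov_r=1.058),
}


def flags(r):
    """Return guideline and practical flags for one observation."""
    max_dfb = max(abs(r["dfb_int"]), abs(r["dfb_sed"]))
    return {
        "cook_guideline": r["cookd"] > cook_cut,
        "cook_practical": r["cookd"] > 0.5,
        "dffits_guideline": abs(r["dffit"]) > dffits_cut,
        "dffits_practical": abs(r["dffit"]) > 1,
        "dfbetas_guideline": max_dfb > dfbetas_cut,
        "dfbetas_practical": max_dfb > 1,
        "covratio_guideline": abs(r["cov_r"] - 1) > covratio_cut,
        "covratio_practical": not (0.9 <= r["cov_r"] <= 1.1),
    }


results = {obs_id: flags(r) for obs_id, r in rows.items()}
practical_hits = [(obs_id, name)
                  for obs_id, f in results.items()
                  for name, hit in f.items()
                  if name.endswith("_practical") and hit]

print(round(cook_cut, 4), round(dffits_cut, 4),
      round(dfbetas_cut, 4), round(covratio_cut, 4))
print(practical_hits)
```

On these rounded values, all four observations exceed every size-adjusted guideline, but only observation 159 exceeds a practical cut-off (COVRATIO above 1.1).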

Results for outliers and influential points

Two observations exhibited studentised residuals close to 3 but had low leverage and small Cook’s distance, with DFBETAS and DFFITS within accepted ranges. One observation had a COVRATIO value substantially greater than 1, indicating potential influence on the precision of the estimates rather than on the point estimates themselves.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

| Statistic | p-value | Method |
|---|---|---|
| 0.083 | 0.1428 | Asymptotic one-sample Kolmogorov-Smirnov test |
| 0.978 | 0.0046 | Shapiro-Wilk normality test |

Normality results
  • The Kolmogorov-Smirnov test supports residuals being normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • The Q–Q plot looks roughly normal.
Assessing independence with the Durbin–Watson test for autocorrelation

| AutoCorrelation | Statistic | p-value |
|---|---|---|
| −0.069 | 2.129 | 0.3840 |

Independence results
  • The Durbin–Watson test suggests there are no auto-correlation issues.
  • Observations in this study design are not independent, so the data should be analysed using linear mixed models or generalized estimating equations.
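For reference, the Durbin–Watson statistic is the sum of squared differences between successive residuals divided by the residual sum of squares, and it is roughly 2(1 − r), where r is the lag-1 autocorrelation. The sketch below is a minimal illustration with hypothetical function names and a toy residual vector, not the study data. Because the statistic depends only on the ordering of rows, it cannot detect clustering by kindergarten unless the rows happen to be sorted by cluster.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the residual sum of squares."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation, treating the residuals as having mean zero."""
    numerator = sum(residuals[i] * residuals[i - 1]
                    for i in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Toy residuals that alternate in sign (strong negative autocorrelation)
toy = [1.0, -1.0, 1.0, -1.0]
dw_toy = durbin_watson(toy)

# Approximate check of the reported values: r = -0.069, DW = 2.129
dw_approx = 2 * (1 - (-0.069))
print(dw_toy, round(dw_approx, 3))
```

The approximation gives about 2.138 against the reported 2.129; a small difference is expected because the identity is only exact up to end-point terms.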
Assumption conclusions

There were no meaningful departures from the assumptions of linearity, normality or homoscedasticity. However, a small number of observations may influence estimated means and the width of confidence intervals and would warrant further investigation. In addition, the study design may violate the assumption of independence, as sampling occurred within kindergartens; however, this could not be formally assessed because kindergarten identifiers were not available in the provided dataset.

Change in regression coefficients

| term | O_std.B | R_std.B | changestdB | reproduce.std.B |
|---|---|---|---|---|
| Intercept | | | | |
| SEDday | −0.06 | −0.0597 | 0.0003 | Reproduced |

O_std.B = original standardised B; R_std.B = reproduced standardised B; changestdB = R_std.B − O_std.B; reproduce.std.B = standardised B reproduced.

Change in R2

| O_R2 | R_R2 | Change.R2 | Reproduce.R2 |
|---|---|---|---|
| 0.004 | 0.0036 | −0.0004 | Reproduced |

O_R2 = original R2; R_R2 = reproduced R2; Change.R2 = R_R2 − O_R2; Reproduce.R2 = R2 reproduced.

Change in p-values

Term

O_p

R_p

Change.p

Reproduce.p

SigChangeDirection

Intercept

<0.001

SEDday

0.412

0.4123

0.0003

Reproduced

O_p = original p-value; R_p = reproduced p-value; Change.p = R_p − O_p; Reproduce.p = p-values reproduced; SigChangeDirection = change in statistical significance and direction of B between original and reproduced models. Note that p-values reported as <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values
  • The p-values in this model were reproduced.

Conclusion computational reproducibility

This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, residual diagnostics indicated a small number of observations that may contribute to wider confidence intervals. All continuous variables in the model were standardised, and inference was assessed using bootstrapped standardised regression coefficients and their corresponding 95% confidence intervals. Percentage and absolute changes in estimates and confidence-interval ranges relative to the original linear model were summarised using thresholds of 10% change and standardised coefficient differences of <0.10 and <0.20. Consistency of coefficient direction and statistical significance was also evaluated.

Bootstrapping results

A non-parametric bootstrap with bias-corrected and accelerated (BCa) confidence intervals was performed using 10,000 resamples.
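The general shape of a case-resampling bootstrap for a standardised slope can be sketched as follows. This is a simplified illustration rather than the procedure used in this review: it uses percentile intervals instead of BCa, 2,000 rather than 10,000 resamples, synthetic data in place of the study data, and variables standardised once before resampling (whether the review re-standardised within each resample is not stated). All function names are hypothetical.

```python
import random
import statistics


def standardise(values):
    """Centre to mean 0 and scale to SD 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def slope(x, y):
    """Least-squares slope of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def percentile_bootstrap_slope(x, y, n_boot=2000, alpha=0.05, seed=1):
    """Resample cases with replacement; return mean estimate and percentile CI."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = rng.choices(range(n), k=n)          # one bootstrap resample of rows
        estimates.append(slope([x[i] for i in idx], [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(estimates), lower, upper


# Synthetic data only: 191 cases with a weak negative association
data_rng = random.Random(42)
sed = [data_rng.gauss(411, 94) for _ in range(191)]
ffm = [20 - 0.002 * s + data_rng.gauss(0, 2.8) for s in sed]

z_sed, z_ffm = standardise(sed), standardise(ffm)
point_estimate = slope(z_sed, z_ffm)
boot_mean, ci_low, ci_high = percentile_bootstrap_slope(z_sed, z_ffm)
print(round(point_estimate, 4), round(boot_mean, 4),
      round(ci_low, 4), round(ci_high, 4))
```

BCa intervals additionally correct the percentile end points for bias and skewness in the bootstrap distribution, which matters most when that distribution is asymmetric.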

Change in regression coefficients

| Term | B | boot.B | B_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.0269 | 0.0261 | 0.0008 | 3.0000 | No | No | No |
| z_SEDday | −0.0596 | −0.0606 | 0.0010 | 1.5900 | No | No | No |

B = reproduced standardized regression coefficient; boot.B = bootstrapped standardized regression coefficient; B_diff = B − boot.B; %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.
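The comparison columns can be reproduced with a few lines of arithmetic. The sketch below assumes the percentage difference is the difference divided by the absolute value of the original estimate, which is inferred from the signs and magnitudes in the tables rather than stated in the report; the function name and dictionary keys are illustrative.

```python
def compare(original, bootstrapped):
    """Difference, truncated percentage difference, and threshold flags."""
    diff = original - bootstrapped
    pct = 100 * diff / abs(original)          # assumed denominator: |original|
    pct = max(-1000.0, min(1000.0, pct))      # truncate at +/-1000%
    return {
        "diff": diff,
        "pct": pct,
        "diff_10pct": abs(pct) >= 10,
        "diff_0.1": abs(diff) >= 0.1,
        "diff_0.2": abs(diff) >= 0.2,
    }


# Rounded values transcribed from the coefficient table
intercept = compare(0.0269, 0.0261)
z_sedday = compare(-0.0596, -0.0606)
print(round(intercept["diff"], 4), round(z_sedday["diff"], 4))
print(intercept["diff_10pct"], z_sedday["diff_10pct"])
```

Percentages computed from the rounded table values will not match the tabulated %_Diff exactly, presumably because the report worked from unrounded estimates; the threshold flags are unaffected here.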

Change in lower 95% confidence interval

| Term | Lower | boot.Lower | Lower_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | −0.1184 | −0.1133 | −0.0051 | −4.3000 | No | No | No |
| z_SEDday | −0.2027 | −0.2230 | 0.0203 | 9.9900 | No | No | No |

Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized reproduced lower CI; Lower_diff = Lower − boot.Lower; %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

| Term | Upper | boot.Upper | Upper_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.1721 | 0.1720 | 0.0001 | 0.0700 | No | No | No |
| z_SEDday | 0.0835 | 0.0902 | −0.0067 | −8.0200 | No | No | No |

Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized reproduced upper CI; Upper_diff = Upper − boot.Upper; %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

| Term | Range | boot.Range | Range_Diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.2904 | 0.2852 | −0.0052 | −1.8000 | No | No | No |
| z_SEDday | 0.2863 | 0.3132 | 0.0270 | 9.4200 | No | No | No |

Range = standardized reproduced CI range; boot.Range = bootstrapped standardized reproduced CI range; Range_Diff = change in CI range (boot.Range − Range, as the tabulated values show); %_Diff = percentage difference, with percentage changes truncated at ±1000%; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

| Term | p-value | boot.p-value | changep | SigChangeDirection |
|---|---|---|---|---|
| Intercept | 0.7156 | 0.7179 | −0.0024 | Remains non-sig, B same direction |
| z_SEDday | 0.4123 | 0.4454 | −0.0331 | Remains non-sig, B same direction |

p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized reproduced p-value; changep = p-value − boot.p-value; SigChangeDirection = change in statistical significance and direction of B between the reproduced and bootstrapped models.

Check the distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Conclusions based on the bootstrapped model

This model was inferentially reproducible. Change in standardized regression coefficients was below 10% and less than 0.1. The direction of effects and statistical significance remained consistent between the reproduced and bootstrapped models.

Conclusions Inferential reproducibility

The study employed randomisation at the kindergarten level. Consequently, observations were clustered and could not be assumed to be independent. Because this clustering was neither accounted for nor discussed in the analysis, and could not be assessed because kindergarten identifiers were not included in the provided dataset, the model was not inferentially reproducible.