Paper 46: Ventilatory efficiency during constant-load test at lactate threshold intensity: Endurance versus resistance exercises

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

April 5, 2026

References

Albesa-Albiol L, Serra-Paya N, Garnacho-Castaño MA, Guirao Cano L, Pleguezuelos Cobo E, Maté-Muñoz JL, et al. (2019) Ventilatory efficiency during constant-load test at lactate threshold intensity: Endurance versus resistance exercises. PLoS ONE 14(5): e0216824. https://doi.org/10.1371/journal.pone.0216824

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability, and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.
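The categorisation described above can be sketched as a small helper. This is an illustrative assumption about the tolerance logic, not the project's actual code; the function name `classify_result` is hypothetical.

```python
def classify_result(original: float, reproduced: float, decimals: int) -> str:
    """Classify a reproduced statistic against the published value,
    with a tolerance matched to the number of decimals reported.
    Illustrative sketch; category names follow the report."""
    # Agreement once the reproduced value is rounded to the reported precision
    if round(reproduced, decimals) == round(original, decimals):
        return "Reproduced"
    # Off by no more than one unit in the last reported digit
    if abs(original - reproduced) <= 10 ** (-decimals):
        return "Incorrect Rounding"
    return "Not Reproduced"

print(classify_result(5.7886, 5.78861, 4))
```

A statistic is then rolled up to the paper level (Reproduced, Mostly Reproduced, Partially Reproduced, Not Reproduced) from these per-value categories.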

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.
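The standardized-coefficient comparison can be sketched as follows. The data and function names are illustrative assumptions; the thresholds are those stated above (10% relative change, 0.1 moderate, 0.2 substantial).

```python
import numpy as np

def standardized_beta(x: np.ndarray, y: np.ndarray) -> float:
    """Slope of y on x after z-scoring both variables: change in the
    outcome (in SD units) per one-SD increase in the predictor.
    Simple-regression sketch only."""
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return float(np.polyfit(zx, zy, 1)[0])

def meaningful_change(b_orig: float, b_new: float) -> str:
    """Apply the report's criteria for a meaningful difference between
    two standardized coefficients (illustrative mapping)."""
    abs_diff = abs(b_new - b_orig)
    rel_diff = abs_diff / abs(b_orig)
    if abs_diff >= 0.2:
        return "substantial"
    if abs_diff >= 0.1 or rel_diff >= 0.10:
        return "moderate"
    return "negligible"

# The values reported later for Model 1: 0.88 with vs 0.91 without the outlier
print(meaningful_change(0.8842, 0.9141))
```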

Results from the reproduction of the Albesa-Albiol et al. (2019) paper are presented below. An overall summary of results is provided first, followed by model-specific results displayed in tab panels. In each panel, the original results tab shows the linear regression results extracted from the published paper. The reproduced results tab presents the results reproduced from the data provided by the authors, including a comprehensive check of linear regression assumptions. The third tab displays the differences between the original and reproduced models, assessing computational reproducibility. The final tab presents sensitivity analysis which evaluates inferential reproducibility by examining whether identified assumption violations had a meaningful effect on the results.

Summary from statistical review

This paper examines ventilatory efficiency during endurance and resistance exercises. Repeated measures ANOVA was the primary analysis, and linear regression was used in conjunction with correlation to explore relationships between variables such as ventilation and carbon dioxide production. While the ANOVA appears to account for repeated measures (SPSS output was provided), the regression analysis did not, as there were many more observations than participants. Normality was assessed by applying the Shapiro-Wilk test to the raw data rather than to the model residuals, and no other assumptions were discussed. Regression coefficients were not discussed, only displayed in equations on scatterplots.

Data availability and software used

The authors provided a poorly formatted Excel file spread across multiple worksheets, with data copied and pasted into arbitrary columns and no participant identifiers or formal variable names. Variables required for the regression analyses had to be extracted manually, one at a time. The regression analyses were conducted using SPSS.

Regression sample

Four regressions were shown in scatterplots with regression lines, equations and correlation coefficients. The three randomly chosen regressions were Cycle ergometer VO₂, Half-squat VO₂, and Cycle ergometer ventilatory efficiency.

Computational reproducibility results

All three regression analyses were computationally reproduced. However, reproducing the regressions for cycle and half-squat VO₂, which used ergometer_log_VE as the independent variable, was challenging. The authors provided the data in an unlabelled Excel file, making the outcome and predictor variables ambiguous and easy to confuse. In addition, the data were presented graphically using scatterplots in which several outlying observations were omitted, giving the appearance of a more linear relationship than was present in the full dataset. Further complicating interpretation, the x-axis was labelled ergometer_log10_VE, although the plotted values corresponded to the raw data. This mislabelling made the reported model equations difficult to interpret, as the equations specified a log base 10 transformed predictor, whereas the transformation applied in the analyses was the natural logarithm.

Inferential reproducibility results

This paper was not inferentially reproducible, as all three models exhibited violations of the linearity assumption. In the Cycle Ergometer VO₂ model, these violations were largely driven by a single extremely influential observation. Linear (BIC = −61), quadratic (BIC = −87), and cubic (BIC = −215) specifications were therefore examined. Although the quadratic and cubic models showed BIC reductions exceeding 10 relative to the linear model, none provided an adequate fit. The cubic specification appeared to be dominated by extreme observations rather than reflecting the underlying trend, while the linear and quadratic models were pulled downward at the upper end of the predictor range.

After removal of the influential observation, the linear model remained a poor fit (BIC = −193), whereas the quadratic model provided the best fit (BIC = −230). Log transformation of the variables visually improved the alignment between the fitted line and the observed data; however, one high-leverage observation continued to have disproportionate influence on the slope fitted using ordinary least squares. Robust regression was therefore applied to down-weight this observation, producing a fitted relationship that more closely reflected the central trend of the data.

The remaining two models did not exhibit extreme outliers but showed violations of linearity, with linear terms failing to adequately characterise the associations. Both models were therefore refitted with log-transformed predictor and outcome variables. The log–log specification provided an adequate final model for the second analysis. In contrast, the third model continued to display residual curvature, and a natural cubic spline with three degrees of freedom was adopted as the final specification to allow modest curvature while limiting model complexity. Further modelling is constrained by the absence of participant-level identifiers. Without these identifiers, it is not possible to fit random-effects models or to formally assess whether the observed patterns reflect within-participant clustering rather than the underlying association.

Model 1

Model results for Cycle ergometer VO₂

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −0.8679 | | | | | |
| Cycle_ergometer_log_VE | 5.7886 | | | | | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Cycle ergometer VO₂

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.875 | | | | | | | | |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Cycle ergometer VO₂

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Cycle_ergometer_log_VE | | | | | |
| Residuals | | | | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Cycle ergometer VO₂

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −0.868 | 0.085 | −1.036 | −0.700 | −10.172 | <0.001 |
| Cycle_ergometer_log_VE | 5.789 | 0.158 | 5.478 | 6.099 | 36.696 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval; Calculated using type III SS.

Fit statistics for Cycle ergometer VO₂

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.884 | 0.782 | 0.781 | −99.541 | 0.210 | 1,346.579 | 1 | 376 | <0.001 |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Cycle ergometer VO₂

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Cycle_ergometer_log_VE | 59.952 | 1 | 59.952 | 1,346.579 | <0.001 |
| Residuals | 16.740 | 376 | 0.045 | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

| Term | Statistic | p-value | Results |
|---|---|---|---|
| Cycle_ergometer_log_VE | −2.178 | 0.0300 | Linearity violation |
| Tukey test | −2.178 | 0.0294 | Linearity violation |

Specification tests for the predictors use quadratic terms; curvature in the fitted values is tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results
  • Plots and tests show an unaccounted non-linear relationship may be present.
  • The outlier in these data causes linearity problems: the regression line is pulled down and does not fit the bulk of the observations.
Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
|---|---|---|---|
| 40.441 | <0.001 | 1 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test indicates heteroscedasticity.
  • Some heteroscedasticity is present in plots, and a sensitivity analysis using weighted or robust regression or wild bootstrapping is recommended.
Model descriptives including Cook’s distance and leverage to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| Cycle ergometer VO₂ | 378 | 2.238 | 0.451 | 2.178 | 0.829 | 3.534 | −0.142 | 0.906 |
| Cycle_ergometer_log_VE | 378 | 0.536 | 0.069 | 0.537 | 0.188 | 0.890 | −0.716 | 4.114 |
| .fitted | 378 | 2.238 | 0.399 | 2.240 | 0.220 | 4.286 | −0.716 | 4.114 |
| .resid | 378 | −0.000 | 0.211 | −0.002 | −1.956 | 0.609 | −2.465 | 20.365 |
| .leverage | 378 | 0.005 | 0.007 | 0.003 | 0.003 | 0.073 | 6.399 | 54.992 |
| .sigma | 378 | 0.211 | 0.002 | 0.211 | 0.183 | 0.211 | −16.364 | 291.909 |
| .cooksd | 378 | 0.013 | 0.188 | 0.001 | 0.000 | 3.629 | 18.981 | 362.736 |
| .std.resid | 378 | −0.000 | 1.011 | −0.010 | −9.627 | 2.992 | −2.617 | 22.382 |
| dfb.1_ | 378 | 0.002 | 0.167 | −0.003 | −0.229 | 2.943 | 14.915 | 255.211 |
| dfb.C___ | 378 | −0.002 | 0.172 | 0.001 | −3.043 | 0.246 | −14.901 | 255.544 |
| dffit | 378 | −0.001 | 0.181 | −0.001 | −3.100 | 0.833 | −13.119 | 226.552 |
| cov.r | 378 | 1.005 | 0.022 | 1.007 | 0.616 | 1.035 | −14.550 | 247.008 |


Cooks threshold

Cook’s distance measures the overall change in model fit when the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFITS measures how many standard deviations the fitted value changes when the ith observation is removed. Potential influential observations satisfy \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can flag a large number of points, so a practical cut-off of 1 was used to identify observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each parameter in the model. Potential influential observations satisfy \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model, so a practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical rule was used instead: observations with COVRATIO outside 0.9 to 1.1 were flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.
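The rule-of-thumb cut-offs described in the preceding paragraphs reduce to simple formulas in n and p. A minimal sketch (the function name is illustrative):

```python
import math

def influence_thresholds(n: int, p: int) -> dict:
    """Rule-of-thumb influence cut-offs for n observations and
    p parameters (including the intercept). Illustrative helper."""
    return {
        "cooks_d": 4 / n,                    # Cook's distance > 4/n
        "dffits": 2 * math.sqrt(p / n),      # |DFFITS| > 2*sqrt(p/n)
        "dfbetas": 2 / math.sqrt(n),         # |DFBETAS| > 2/sqrt(n)
        "covratio_band": 3 * p / n,          # flag |COVRATIO - 1| beyond this
    }

# Model 1 of this paper: 378 observations, 2 parameters
print(influence_thresholds(n=378, p=2))
```

With n = 378 these formulaic thresholds are tiny (e.g. 4/378 ≈ 0.011 for Cook's distance), which is why the report falls back on practical cut-offs of 1 and the 0.9 to 1.1 COVRATIO band.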

Observations of interest identified by the influence plot

| ID | StudRes | Leverage | CookD | dfb.1_ | dfb.C___ | dffit | cov.r |
|---|---|---|---|---|---|---|---|
| 266 | −4.359 | 0.011 | 0.105 | 0.378 | −0.410 | −0.468 | 0.921 |
| 106 | 3.024 | 0.071 | 0.340 | 0.831 | −0.817 | 0.833 | 1.031 |
| 84 | −11.075 | 0.073 | 3.629 | 2.943 | −3.043 | −3.100 | 0.616 |

StudRes = studentized residuals; CookD = Cook's Distance a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

Three observations had studentized residuals greater than three in absolute value. Two of these did not make a substantial difference to the estimates, as their DFBETAS, DFFITS, and COVRATIO statistics were within conventional thresholds. The third observation was an extreme outlier, with a studentized residual of approximately −11, a Cook’s distance of approximately 3.6, and DFBETAS and DFFITS values around three in absolute value, indicating that both the regression coefficients and their standard errors would change substantially if this observation were removed. The COVRATIO for this observation was approximately 0.6, suggesting that it also affected the precision of the parameter estimates. Residual diagnostics further suggested that this observation may be contributing to violations of the linearity assumption.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

| Statistic | p-value | Method |
|---|---|---|
| 0.094 | 0.0025 | Asymptotic one-sample Kolmogorov-Smirnov test |

| Statistic | p-value | Method |
|---|---|---|
| 0.838 | <0.001 | Shapiro-Wilk normality test |

Normality results
  • The Kolmogorov-Smirnov test indicates residuals may not be normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot indicates the residuals are not normally distributed.
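The two normality tests above can be sketched with scipy on standardized residuals. The residuals here are simulated with a heavy left tail to echo the pattern in this model's diagnostics; none of the numbers are the paper's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Simulated standardized residuals: mostly normal, with a heavy left tail
resid = np.concatenate([rng.normal(0, 1, 370), rng.normal(-6, 1, 8)])

# Shapiro-Wilk test of normality
sw_stat, sw_p = stats.shapiro(resid)
# One-sample Kolmogorov-Smirnov test against the standard normal
ks_stat, ks_p = stats.kstest(resid, "norm")
print(f"Shapiro-Wilk W = {sw_stat:.3f} (p = {sw_p:.3g})")
print(f"Kolmogorov-Smirnov D = {ks_stat:.3f} (p = {ks_p:.3g})")
```

Note that with a few hundred observations the Shapiro-Wilk test is far more sensitive to tail outliers than the KS test, which is why the Q-Q plot remains the primary diagnostic.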
Assessing independence with the Durbin–Watson test for autocorrelation

| AutoCorrelation | Statistic | p-value |
|---|---|---|
| 0.602 | 0.795 | <0.001 |

Independence results
  • The Durbin–Watson test suggests there are auto-correlation issues.
  • The study design is not independent and should be assessed using linear mixed models or generalized estimating equations.
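The Durbin-Watson diagnostic above can be sketched on simulated AR(1) errors, which mimic repeated measures recorded in time order. The autocorrelation of 0.6 is chosen only to illustrate how a statistic near 0.8 arises; it is not fitted to the paper's data.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
# AR(1) errors with lag-1 autocorrelation of about 0.6 (illustrative)
e = np.zeros(378)
for t in range(1, 378):
    e[t] = 0.6 * e[t - 1] + rng.normal(0, 0.1)

dw = durbin_watson(e)
# DW is approximately 2(1 - r1): values well below 2 indicate positive
# autocorrelation, consistent with the 0.795 reported above
print(f"Durbin-Watson = {dw:.3f}")
```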
Assumption conclusions

The model assumptions were dominated by an extreme outlier, with a studentized residual of approximately −11, a Cook’s distance of approximately 3.6, and DFBETAS and DFFITS values around three in absolute value, indicating that both the regression coefficients and their standard errors would change substantially if this observation were removed. The COVRATIO for this observation was approximately 0.6, suggesting a material impact on the precision of the parameter estimates. Residual diagnostics further indicated that this observation may be contributing to violations of the linearity assumption.

Residual plots also suggested departures from normality, evidence of autocorrelation, and mild heteroscedasticity. In addition, the study design involved repeated measures, which were not accounted for in the analysis. Taken together, these issues indicate that the fitted model poorly represents the underlying data structure.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Cycle ergometer VO₂

Change in regression coefficients

| Term | O_B | R_B | Change.B | Reproduce.B |
|---|---|---|---|---|
| Intercept | −0.8679 | −0.8679 | 0.0000 | Reproduced |
| Cycle_ergometer_log_VE | 5.7886 | 5.7886 | 0.0000 | Reproduced |

O_B = original B; R_B = reproduced B; Change.B = R_B − O_B; Reproduce.B = whether B was reproduced.

Change in p-values

| Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
|---|---|---|---|---|---|
| Intercept | | <0.001 | | | |
| Cycle_ergometer_log_VE | <0.001 | <0.001 | 0.0000 | Reproduced | Remains sig, B same direction |

O_p = original p-value; R_p = reproduced p-value; Change.p = R_p − O_p; Reproduce.p = whether the p-value was reproduced; SigChangeDirection = statistical significance and direction of B between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values

The p-value for the model was reproduced.

Conclusion computational reproducibility

This model was computationally reproducible: all reported statistics that were assessed were reproduced.

Methods

The model was successfully reproduced; however, residual diagnostics indicated non-linearity and the presence of influential observations that may affect both point estimates and confidence intervals. The data were standardized and the model fitted with and without the outlying observation to compare the differences in coefficients.

Linear, quadratic, and cubic models were therefore visualised using the same variable transformations as specified in the original model. Model fit was compared using AIC and BIC, with comparisons restricted to models fitted to the same dataset. Models with and without the influential outlier were assessed descriptively through visualisation rather than formal information-criterion comparison. To address assumption violations, the outcome variable (cycle ergometer VO₂) was log-transformed and the model refitted. Robust regression was then applied to reduce the effect of influential observations.

Results

After standardisation, the regression coefficients changed only slightly (0.88 with the outlier included vs 0.91 excluded). On this basis alone, one might conclude that the outlier does not substantially affect the results. However, this conclusion would be misleading. The apparent stability of the coefficient arises because the linear model provides a poor fit to the data both with and without the outlier, rather than because the outlier is inconsequential.

| Characteristic | All data: N | Beta | 95% CI¹ | p-value | Outlier removed: N | Beta | 95% CI¹ | p-value |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 378 | 0.0000 | −0.0473, 0.0473 | >0.999 | 377 | 0.0000 | −0.0411, 0.0411 | >0.999 |
| z_Cycle_ergometer_log_VE | 378 | 0.8842 | 0.8368, 0.9315 | <0.001 | 377 | 0.9141 | 0.8729, 0.9552 | <0.001 |

¹ CI = Confidence Interval

Returning to the unstandardized data, the initial model showed substantial violations of the regression assumptions, driven by an extreme influential outlier, as shown by fitting three models: linear (BIC = −61), quadratic (BIC = −87), and cubic (BIC = −215). The cubic and quadratic models showed BIC reductions of more than 10; however, none of the models provided an adequate fit. The cubic model appeared to follow individual extreme observations rather than the underlying trend, while both the linear and quadratic fits were artificially pulled downward at the upper end of the predictor range. After removal of the outlier, the linear model remained a poor representation of the data (BIC = −193); the quadratic model fitted best (BIC = −230), ahead of the cubic (BIC = −224), although residual curvature and continued influence from observations at the upper end of the distribution remained.

Comparison of non linear models with full data

Comparison of non linear models with outlier removed

The full dataset was re-analysed with the outlier retained, applying log-transformations to both the predictor and outcome variables. Visually, the resulting log–log model performed similarly to the quadratic model fitted to the reduced dataset. In the log–log specification, ventilation (VE) was strongly associated with oxygen uptake (VO₂), with a 1% increase in VE associated with an approximately 2.9% increase in VO₂.

Although the log–log transformation did not substantially improve the residual diagnostics, it visually improved the alignment of the fitted line with the observed data. However, one high-leverage observation exerted disproportionate influence on the OLS slope. Robust regression was therefore examined as a sensitivity analysis, downweighting this observation and producing a fitted line that more closely reflected the central trend of the data. The exponent for VE indicates that VO₂ is proportional to VE raised to the power of 3.4. This implies an elastic relationship, where proportional changes in VE are associated with amplified proportional changes in VO₂. For example, doubling VE would be associated with an approximate \(2^{3.4} \approx 10.6\)-fold increase in VO₂.

Log-log model with comparison of OLS and robust regression

Comparison of Robust regression with full data on log scale

| Variable | OLS: Estimate | SE | Lower | Upper | p-value | Robust: Estimate | SE | Lower | Upper | p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | −0.785 | 0.038 | −0.860 | −0.710 | <0.001 | −1.014 | 0.048 | −1.107 | −0.920 | <0.001 |
| Cycle_ergometer_log_VE | 2.922 | 0.071 | 2.784 | 3.061 | <0.001 | 3.357 | 0.084 | 3.192 | 3.522 | <0.001 |

Further modelling is constrained by the absence of participant-level identifiers. Without these identifiers, it is not possible to fit random-effects models or formally investigate whether the observed patterns are driven by within-participant clustering rather than the underlying association.

Model 2

Model results for Half-squat VO₂

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −0.5116 | | | | | |
| Half_squat_log_VE | 4.3696 | | | | | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Half-squat VO₂

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.853 | | | | | | | | |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Half-squat VO₂

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Half_squat_log_VE | | | | | |
| Residuals | | | | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results Half-squat VO₂

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −0.512 | 0.069 | −0.648 | −0.376 | −7.396 | <0.001 |
| Half_squat_log_VE | 4.370 | 0.142 | 4.091 | 4.648 | 30.830 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval; Calculated using type III SS.

Model fit for Half-squat VO₂

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.850 | 0.722 | 0.721 | −494.657 | 0.123 | 950.494 | 1 | 366 | <0.001 |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Half-squat VO₂

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Half_squat_log_VE | 14.355 | 1 | 14.355 | 950.494 | <0.001 |
| Residuals | 5.527 | 366 | 0.015 | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

| Term | Statistic | p-value | Results |
|---|---|---|---|
| Half_squat_log_VE | 4.352 | <0.001 | Linearity violation |
| Tukey test | 4.352 | <0.001 | Linearity violation |

Specification tests for the predictors use quadratic terms; curvature in the fitted values is tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

  • Plots and tests show an unaccounted non-linear relationship may be present.

Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
|---|---|---|---|
| 0.006 | 0.9373 | 1 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test does not indicate heteroscedasticity.
  • There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.
Model descriptives including Cook’s distance and leverage to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| Half-squat VO₂ | 368 | 1.612 | 0.233 | 1.594 | 0.913 | 2.269 | 0.248 | 0.171 |
| Half_squat_log_VE | 368 | 0.486 | 0.045 | 0.493 | 0.324 | 0.601 | −0.387 | 0.741 |
| .fitted | 368 | 1.612 | 0.198 | 1.645 | 0.902 | 2.113 | −0.387 | 0.741 |
| .resid | 368 | −0.000 | 0.123 | −0.012 | −0.440 | 0.437 | 0.497 | 0.740 |
| .leverage | 368 | 0.005 | 0.005 | 0.004 | 0.003 | 0.038 | 3.593 | 17.028 |
| .sigma | 368 | 0.123 | 0.000 | 0.123 | 0.121 | 0.123 | −3.742 | 19.314 |
| .cooksd | 368 | 0.003 | 0.011 | 0.001 | 0.000 | 0.134 | 8.094 | 78.252 |
| .std.resid | 368 | 0.000 | 1.002 | −0.100 | −3.582 | 3.592 | 0.503 | 0.754 |
| dfb.1_ | 368 | 0.000 | 0.065 | 0.000 | −0.222 | 0.506 | 2.627 | 20.490 |
| dfb.H___ | 368 | −0.000 | 0.065 | 0.000 | −0.491 | 0.236 | −2.256 | 18.506 |
| dffit | 368 | 0.005 | 0.084 | −0.006 | −0.202 | 0.527 | 1.878 | 7.841 |
| cov.r | 368 | 1.005 | 0.009 | 1.008 | 0.939 | 1.041 | −2.275 | 11.999 |


Cooks threshold

Cook’s distance measures the overall change in model fit when the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFITS measures how many standard deviations the fitted value changes when the ith observation is removed. Potential influential observations satisfy \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can flag a large number of points, so a practical cut-off of 1 was used to identify observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each model parameter. Potential influential observations satisfy \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model, so a practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical rule was used instead: observations with COVRATIO outside 0.9 to 1.1 were flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.

Observations of interest identified by the influence plot

ID  StudRes  Leverage  CookD  dfb.1_  dfb.H___  dffit   cov.r
1    0.687   0.036     0.009   0.131  −0.128     0.133  1.041
88  −3.641   0.003     0.020   0.050  −0.068    −0.202  0.939
64   2.344   0.038     0.107   0.457  −0.448     0.465  1.014
7    3.652   0.020     0.134   0.506  −0.491     0.527  0.955

StudRes = studentized residuals; CookD = Cook's distance, a combined measure of leverage and influence. DFBETAS (dfb.*) measure how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points
  • Although two observations had studentized residuals greater than three, Cook’s distance, DFBETAS, DFFITS and COVRATIO were within acceptable limits, suggesting minimal influence on model estimates.
Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic  p-value  Method
0.074      0.0357   Asymptotic one-sample Kolmogorov-Smirnov test
0.979      <0.001   Shapiro-Wilk normality test

Normality results
  • The Kolmogorov-Smirnov test indicates residuals may not be normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot indicates the residuals are not normally distributed.
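The Kolmogorov–Smirnov distance underlying the test above is straightforward to compute by hand; a numpy sketch on illustrative data (synthetic, not the study residuals) contrasting normal and right-skewed samples of the same size as this model:

```python
import numpy as np
from math import erf, sqrt

def ks_statistic(z):
    """One-sample KS distance between standardized values and N(0, 1)."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    cdf = np.array([0.5 * (1 + erf(v / sqrt(2))) for v in z])  # standard normal CDF
    # KS distance is the largest gap between the empirical and theoretical CDFs
    upper = np.max(np.arange(1, n + 1) / n - cdf)
    lower = np.max(cdf - np.arange(0, n) / n)
    return max(upper, lower)

rng = np.random.default_rng(4)
normal_z = rng.standard_normal(368)                       # consistent with normality
skewed = rng.exponential(1.0, 368)                        # right-skewed, illustrative only
skewed_z = (skewed - skewed.mean()) / skewed.std(ddof=1)
print(ks_statistic(normal_z) < ks_statistic(skewed_z))    # True: skew inflates the distance
```

Note that standardizing with the sample mean and SD before a KS test (as sketched here) makes the tabulated p-values only approximate; the Lilliefors correction addresses this.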
Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation  Statistic  p-value
0.555            0.883      <0.001

Independence results
  • The Durbin–Watson test suggests there are auto-correlation issues.
  • Observations in this study design are not independent and should be assessed using linear mixed models or generalized estimating equations.
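The Durbin–Watson statistic itself is simple to compute from ordered residuals; a numpy sketch on illustrative AR(1) residuals (synthetic, not the study data):

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2). Values near 2 suggest no
    first-order autocorrelation; values well below 2 suggest positive autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(0)
# AR(1) residuals mimic repeated measures recorded in sequence (illustrative only)
e = np.zeros(368)
for t in range(1, 368):
    e[t] = 0.55 * e[t - 1] + rng.standard_normal()
print(durbin_watson(e) < 1.5)  # True: positive autocorrelation pulls DW well below 2
```

With a first-order autocorrelation of r, DW is approximately 2(1 − r), which is consistent with the reported values here (r = 0.555, DW = 0.883).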
Assumption conclusions

The residual diagnostics indicated departures from linearity and normality; however, the identified outliers were not considered influential. There was no indication of substantial heteroscedasticity.

Forest plot showing Original and Reproduced coefficients and 95% confidence intervals for Half-squat VO₂

Change in regression coefficients

term               O_B      R_B      Change.B  reproduce.B
Intercept          −0.5116  −0.5116  0.0000    Reproduced
Half_squat_log_VE  4.3696   4.3696   0.0000    Reproduced

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in p-values

Term               O_p     R_p     Change.p  Reproduce.p  SigChangeDirection
Intercept          <0.001
Half_squat_log_VE  <0.001  <0.001  0.0000    Reproduced   Remains sig, B same direction

O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p − O_p); Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values
  • The p-value for this model was reproduced.

Conclusion computational reproducibility

This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, residual diagnostics indicated that the residuals were non-normally distributed and showed evidence of non-linearity. Linear, quadratic, and cubic models were fitted and compared using AIC and BIC, with standardized predictors used for reporting of coefficients. To address assumption violations, the outcome variable (Half squat VO₂) was log-transformed, the model was refitted, and the residuals were re-examined.

Results

Visual inspection of the linear, quadratic and cubic models suggested pronounced curvature at the lower end of the predictor range, with the linear model being pulled down at the upper end. Model fit statistics indicated substantial improvement for the quadratic model, with reductions in AIC and BIC exceeding 10 relative to the linear model, providing strong evidence that a linear specification was inadequate.

Comparison of non-linear terms standardized coefficients
Characteristic
Linear
Quadratic
Cubic
Beta 95% CI1 p-value Beta 95% CI1 p-value Beta 95% CI1 p-value
(Intercept) -0.0007 -0.0549, 0.0534 0.978 -0.0731 -0.1353, -0.0110 0.021 -0.0648 -0.1277, -0.0020 0.043
z_Half_squat_log_VE 0.8487 0.7945, 0.9028 <0.001 0.8767 0.8223, 0.9310 <0.001 0.9247 0.8443, 1.0051 <0.001
z_Half_squat_log_VE_sq


0.0724 0.0397, 0.1051 <0.001 0.0586 0.0217, 0.0954 0.002
z_Half_squat_log_VE_cube





-0.0142 -0.0317, 0.0033 0.112
1 CI = Confidence Interval
Comparison of non-linear terms original scales

Comparison of non-linear terms fit

Model_type  R2adj  sigma      df  logLik     AIC       BIC       deviance   df.residual  nobs
Linear      0.721  0.5279874  1   −286.1314  578.2628  589.9870  102.03009  366          368
Quadratic   0.734  0.5155056  2   −276.8238  561.6476  577.2800  96.99730   365          368
Cubic       0.735  0.5144226  3   −275.5451  561.0901  580.6305  96.32552   364          368

Because the X variable (Half_squat_log_VE) had already been log-transformed, rather than fitting a further quadratic term a simpler solution was to reduce the curvature by also logging the y variable (Half_squat_VO2). In the log–log model, the exponent for VE indicates that VO₂ is proportional to VE raised to the power of approximately 2.8. This implies an elastic relationship, where proportional changes in VE are associated with amplified proportional changes in VO₂. For example, doubling VE would be associated with an approximate 2\(^{2.8}\) ≈ 7-fold increase in VO₂.
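The ≈7-fold figure follows directly from the reported log–log slope (2.7633); a one-line arithmetic check:

```python
# In a log-log model, multiplying VE by k multiplies predicted VO2 by k ** beta.
beta = 2.7633  # reported slope for Half_squat_log_VE in the log-log model
fold_change_for_doubling = 2 ** beta
print(round(fold_change_for_doubling, 1))  # 6.8 -- i.e. roughly a 7-fold increase
```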

Model with log coefficients
Characteristic Beta 95% CI1 p-value
(Intercept) -0.8760 -0.9593, -0.7926 <0.001
Half_squat_log_VE 2.7633 2.5925, 2.9341 <0.001
1 CI = Confidence Interval
Model with log transformed variables

Residual diagnostics for the log–log model indicated no evidence of residual nonlinearity. Although a small number of observations had relatively large residuals, Cook’s distance values were well below conventional thresholds, suggesting no undue influence on the fitted model. Further modelling is constrained by the absence of participant-level identifiers. Without these identifiers, it is not possible to fit random-effects models or formally investigate whether the observed patterns are driven by within-participant clustering rather than the underlying association.

Residual plots

Model 3

Model results for Cycle ergometer VE

Term       B        SE  Lower  Upper  t  p-value
Intercept  −3.4427
VCO2       27.816                        <0.001

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit Statistics Cycle ergometer VE

R      R2  R2Adj  AIC  RMSE  F  DF1  DF2  p-value
0.892

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Cycle ergometer VE

Term       SS  DF  MS  F  p-value
VCO2
Residuals

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Cycle ergometer VE

Term       B       SE     Lower   Upper   t       p-value
Intercept  −3.443  1.281  −5.961  −0.924  −2.687  0.0075
VCO2       27.816  0.598  26.641  28.992  46.525  <0.001

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval; Calculated using type III SS.

Fit statistics for Cycle ergometer VE

R      R2     R2Adj  AIC        RMSE   F          DF1  DF2  p-value
0.915  0.838  0.837  2,653.921  5.617  2,164.557  1    419  <0.001

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Cycle ergometer VE

Term       SS          DF   MS          F          p-value
VCO2       68,627.952  1    68,627.952  2,164.557  <0.001
Residuals  13,284.526  419  31.705

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Visualisation of regression model

The blue line shows the line of best fit, with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non-linear relationships

Term        Statistic  p-value  Results
VCO2        4.399      <0.001   Linearity violation
Tukey test  4.399      <0.001   Linearity violation

Curvature in individual predictors is assessed with quadratic specification tests; curvature in the fitted values is tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

Plots and tests show that an unaccounted-for non-linear relationship may be present.

Testing for homoscedasticity

Statistic  p-value  Parameter  Method
0.448      0.5033   1          studentized Breusch-Pagan test

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.
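The studentized (Koenker) form of the Breusch–Pagan test reduces to n·R² from regressing the squared residuals on the predictors; a numpy sketch on illustrative data (synthetic, not the study residuals):

```python
import numpy as np

def breusch_pagan_lm(resid, x):
    """Koenker's studentized Breusch-Pagan LM statistic: n * R^2 from
    regressing squared residuals on the predictor(s); compare to chi-square."""
    u2 = np.asarray(resid, dtype=float) ** 2
    X = np.column_stack([np.ones(len(u2)), np.asarray(x, dtype=float)])
    fitted = X @ np.linalg.lstsq(X, u2, rcond=None)[0]
    r2 = 1 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return len(u2) * r2

rng = np.random.default_rng(2)
x = rng.uniform(0.6, 3.5, 421)            # VCO2-like predictor range (illustrative)
e_homo = rng.standard_normal(421)         # constant variance
e_het = rng.standard_normal(421) * x      # variance grows with x
print(breusch_pagan_lm(e_het, x) > 3.84)  # True: exceeds the chi-square(1) 5% cut-off
```

With homoscedastic residuals the statistic behaves like a chi-square(1) draw, which is consistent with the small reported value (0.448, p = 0.5033).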
Model descriptives including Cook’s distance and leverage to understand outliers

Term                N    Mean    SD      Median  Min      Max      Skewness  Kurtosis
Cycle ergometer VE  421  54.790  13.965  51.500  16.100   115.400  0.752     1.550
VCO2                421  2.093   0.460   2.027   0.629    3.515    0.141     0.925
.fitted             421  54.790  12.783  52.941  14.054   94.332   0.141     0.925
.resid              421  0.000   5.624   −0.419  −15.779  41.353   3.261     19.897
.leverage           421  0.005   0.004   0.003   0.002    0.027    2.566     6.753
.sigma              421  5.631   0.033   5.636   5.261    5.637    −9.667    99.964
.cooksd             421  0.003   0.015   0.000   0.000    0.207    9.025     95.495
.std.resid          421  0.000   1.002   −0.074  −2.809   7.358    3.259     19.840
dfb.1_              421  0.000   0.059   −0.001  −0.560   0.360    −2.046    34.262
dfb.VCO2            421  −0.000  0.061   0.001   −0.284   0.620    3.916     39.777
dffit               421  0.005   0.079   −0.005  −0.202   0.656    3.968     23.510
cov.r               421  1.005   0.021   1.007   0.765    1.032    −9.239    94.524


Cooks threshold

Cook’s distance measures the overall change in fit if the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.
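For this model the 4/n screen is far stricter than the practical 0.5–1 range; a short check using the reported n = 421 and the maximum Cook's distance from the model descriptives:

```python
n = 421                      # cycle-ergometer model observations
screen = 4 / n               # ~0.0095: flags many mildly influential points
max_cooksd = 0.207           # largest Cook's distance in the model descriptives

print(max_cooksd > screen)   # True: flagged by the 4/n screening rule
print(max_cooksd > 0.5)      # False: below the practical 0.5-1 cut-off
```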

DFFIT threshold

DFFITS measures how many standard deviations the ith fitted value changes when the ith observation is removed. Potential influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice this threshold can flag a large number of points, so a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each parameter in the model. Potential influential observations are identified by \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. A practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentised residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical cut-off between 0.9 to 1.1 was used to flag observations with meaningful impact on precision, although there is no agreed universal alternative cutoff.

ID   StudRes  Leverage  CookD  dfb.1_  dfb.VCO2  dffit   cov.r
346  −0.329   0.025     0.001   0.046  −0.050    −0.053  1.030
94    0.368   0.027     0.002   0.061  −0.058     0.061  1.032
311   7.875   0.004     0.100   0.360  −0.284     0.478  0.765
312   7.875   0.004     0.100   0.360  −0.284     0.478  0.765
232   3.620   0.017     0.108  −0.388   0.436     0.471  0.961
234   4.364   0.022     0.207  −0.560   0.620     0.656  0.940

StudRes = studentized residuals; CookD = Cook's distance, a combined measure of leverage and influence. DFBETAS (dfb.*) measure how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points
  • Four observations had studentized residuals > 3, with Cook’s distance, DFBETAS and DFFITS within conventional ranges. However, two COVRATIO values were substantially below 1, indicating potential inflation of standard errors (wider confidence intervals); these observations should therefore be investigated further.
Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic  p-value  Method
0.133      <0.001   Asymptotic one-sample Kolmogorov-Smirnov test
0.756      <0.001   Shapiro-Wilk normality test

Normality results
  • The Kolmogorov-Smirnov test indicates residuals may not be normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot indicates the residuals are not normally distributed.
Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation  Statistic  p-value
0.798            0.403      <0.001

Independence results
  • The Durbin–Watson test suggests there are auto-correlation issues.
  • Observations in this study design are not independent and should be assessed using linear mixed models or generalized estimating equations.
Assumption conclusions

Residual plots indicated departures from normality. Diagnostic plots and tests also suggested potential non-linearity, indicating that the fitted linear models may not have fully captured the underlying relationships. COVRATIO values suggested reduced precision of the confidence intervals, and residuals showed evidence of autocorrelation. Given the repeated-measures study design, the assumption of independence was unlikely to hold.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Cycle ergometer VE

Change in regression coefficients

term       O_B      R_B      Change.B  reproduce.B
Intercept  −3.4427  −3.4427  0.0000    Reproduced
VCO2       27.8160  27.8164  0.0004    Reproduced

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in p-values

Term       O_p     R_p     Change.p  Reproduce.p  SigChangeDirection
Intercept  0.0075
VCO2       <0.001  <0.001  0.0000    Reproduced   Remains sig, B same direction

O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p − O_p); Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values
  • The p-value for this model was reproduced.

Conclusion computational reproducibility

  • This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, residual diagnostics indicated departures from normality and evidence of nonlinearity. The predictors were standardized, and linear, quadratic, and cubic specifications were fitted and compared using AIC and BIC. Model visualisations were presented on the original scale, as specified by the authors. To address assumption violations, variables were log-transformed and the model refitted. Residual curvature persisted following transformation; therefore, quadratic and spline models were subsequently compared using AIC and BIC to assess relative model fit. A natural cubic spline with three degrees of freedom was used to capture residual curvature after log-transformation, while limiting model complexity and improving stability at the extremes of the predictor range.

Results

Visual inspection of the linear, quadratic and cubic models suggested pronounced curvature at the upper end of the predictor range. Model fit statistics indicated substantial improvement for the quadratic model, with reductions in AIC and BIC exceeding 10 relative to the linear model, providing strong evidence that a linear specification was inadequate. Therefore, this model was not inferentially reproducible.
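The polynomial comparison can be sketched end-to-end with ordinary least squares fits and a Gaussian AIC; an illustrative numpy example on synthetic data with genuine curvature (not the study measurements):

```python
import numpy as np

def gaussian_aic(y, fitted, k):
    """AIC for a Gaussian OLS model with k coefficients plus the error variance."""
    n = len(y)
    rss = np.sum((np.asarray(y) - np.asarray(fitted)) ** 2)
    return n * np.log(rss / n) + 2 * (k + 1)

rng = np.random.default_rng(3)
x = rng.standard_normal(421)
y = 0.9 * x + 0.1 * x ** 2 + rng.normal(0, 0.4, 421)  # mild true curvature

aic = {}
for degree in (1, 2, 3):
    coefs = np.polyfit(x, y, degree)                  # OLS polynomial fit
    aic[degree] = gaussian_aic(y, np.polyval(coefs, x), degree + 1)

print(aic[1] - aic[2] > 10)  # True: quadratic improves AIC by well over 10
```

When the true curvature is quadratic, the cubic term buys little additional fit and is penalised by AIC/BIC, mirroring the pattern in the comparison table.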

Comparison of non-linear terms standardized coefficients
Characteristic
Linear
Quadratic
Cubic
Beta 95% CI1 p-value Beta 95% CI1 p-value Beta 95% CI1 p-value
(Intercept) 0.0004 -0.0383, 0.0390 0.985 -0.0492 -0.0930, -0.0053 0.028 -0.0516 -0.0955, -0.0077 0.021
z_VCO2 0.9152 0.8766, 0.9539 <0.001 0.9081 0.8702, 0.9461 <0.001 0.8652 0.7979, 0.9324 <0.001
z_VCO2_sq


0.0495 0.0274, 0.0717 <0.001 0.0504 0.0283, 0.0726 <0.001
z_VCO2_cube





0.0109 -0.0032, 0.0249 0.129
1 CI = Confidence Interval
Comparison of non-linear terms original scales

Fit comparison of non-linear

Model_type  R2adj  sigma      df  logLik     AIC       BIC       deviance  df.residual  nobs
Linear      0.837  0.4035717  1   −214.3548  434.7097  446.8376  68.24257  419          421
Quadratic   0.844  0.3950157  2   −204.8304  417.6607  433.8313  65.22362  418          421
Cubic       0.845  0.3943966  3   −203.6658  417.3317  437.5448  64.86378  417          421

To reduce curvature, both the predictor and outcome variables were log-transformed. Residual diagnostics indicated some remaining non-linearity, so quadratic and spline models were fitted and compared with the linear model using AIC and BIC; both substantially improved fit relative to the linear model (ΔAIC > 10). The spline model provided a marginally better fit than the quadratic model (ΔAIC = 6.6; ΔBIC = 2.6) and, after visual inspection, was selected as the final model.
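The quadratic-versus-spline deltas can be reproduced arithmetically from the reported log–log fit statistics (lower AIC/BIC is better):

```python
# AIC/BIC reported for the log-log models in the comparison below
aic = {"linear": -807.4135, "quadratic": -818.9034, "spline": -825.5198}
bic = {"linear": -795.2856, "quadratic": -802.7329, "spline": -805.3067}

delta_aic = aic["quadratic"] - aic["spline"]  # positive: spline preferred
delta_bic = bic["quadratic"] - bic["spline"]
print(round(delta_aic, 1), round(delta_bic, 1))  # 6.6 2.6
```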

Comparison log-log model

Model_type     R2adj  sigma       df  logLik    AIC        BIC        deviance  df.residual  nobs
log Linear     0.873  0.09230962  1   406.7068  −807.4135  −795.2856  3.570327  419          421
log Quadratic  0.877  0.09095109  2   413.4517  −818.9034  −802.7329  3.457738  418          421
log Spline     0.879  0.09013299  3   417.7599  −825.5198  −805.3067  3.387689  417          421

The spline terms were all statistically significant; however, individual spline coefficients are not directly interpretable. The fitted spline indicated a clear non-linear association between log(VCO₂) and log(ventilation).

Further modelling is constrained by the absence of participant-level identifiers. Without these identifiers, it is not possible to fit random-effects models or formally investigate whether the observed patterns are driven by within-participant clustering rather than the underlying association.

Spline fit

Spline results

term                    estimate  std.error  statistic  p.value
(Intercept)             2.832     0.049      57.401     <0.001
ns(log(VCO2), df = 3)1  0.991     0.027      36.507     <0.001
ns(log(VCO2), df = 3)2  2.275     0.109      20.952     <0.001
ns(log(VCO2), df = 3)3  1.402     0.033      42.652     <0.001