Paper 46: Ventilatory efficiency during constant-load test at lactate threshold intensity: Endurance versus resistance exercises

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

April 5, 2026

References

Albesa-Albiol L, Serra-Paya N, Garnacho-Castaño MA, Guirao Cano L, Pleguezuelos Cobo E, Maté-Muñoz JL, et al. (2019) Ventilatory efficiency during constant-load test at lactate threshold intensity: Endurance versus resistance exercises. PLoS ONE 14(5): e0216824. https://doi.org/10.1371/journal.pone.0216824

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability, and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.
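The categorisation described above can be sketched as a small helper. This is an illustrative assumption about the tolerance logic, not the project's actual code; the function name `classify_result` is hypothetical.

```python
def classify_result(original: float, reproduced: float, decimals: int) -> str:
    """Classify a reproduced statistic against the published value,
    with a tolerance matched to the number of decimals reported.
    Illustrative sketch; category names follow the report."""
    # Agreement once the reproduced value is rounded to the reported precision
    if round(reproduced, decimals) == round(original, decimals):
        return "Reproduced"
    # Off by no more than one unit in the last reported digit
    if abs(original - reproduced) <= 10 ** (-decimals):
        return "Incorrect Rounding"
    return "Not Reproduced"

print(classify_result(5.7886, 5.78861, 4))
```

A statistic is then rolled up to the paper level (Reproduced, Mostly Reproduced, Partially Reproduced, Not Reproduced) from these per-value categories.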

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.
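The standardized-coefficient comparison can be sketched as follows. The data and function names are illustrative assumptions; the thresholds are those stated above (10% relative change, 0.1 moderate, 0.2 substantial).

```python
import numpy as np

def standardized_beta(x: np.ndarray, y: np.ndarray) -> float:
    """Slope of y on x after z-scoring both variables: change in the
    outcome (in SD units) per one-SD increase in the predictor.
    Simple-regression sketch only."""
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return float(np.polyfit(zx, zy, 1)[0])

def meaningful_change(b_orig: float, b_new: float) -> str:
    """Apply the report's criteria for a meaningful difference between
    two standardized coefficients (illustrative mapping)."""
    abs_diff = abs(b_new - b_orig)
    rel_diff = abs_diff / abs(b_orig)
    if abs_diff >= 0.2:
        return "substantial"
    if abs_diff >= 0.1 or rel_diff >= 0.10:
        return "moderate"
    return "negligible"

# The values reported later for Model 1: 0.88 with vs 0.91 without the outlier
print(meaningful_change(0.8842, 0.9141))
```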

Results from the reproduction of the Albesa-Albiol et al. (2019) paper are presented below. An overall summary of results is provided first, followed by model-specific results displayed in tab panels. In each panel, the original results tab shows the linear regression results extracted from the published paper. The reproduced results tab presents the results reproduced from the data provided by the authors, including a comprehensive check of linear regression assumptions. The third tab displays the differences between the original and reproduced models, assessing computational reproducibility. The final tab presents sensitivity analysis which evaluates inferential reproducibility by examining whether identified assumption violations had a meaningful effect on the results.

Summary from statistical review

This paper examines ventilatory efficiency during endurance and resistance exercises. Repeated measures ANOVA was the primary analysis, and linear regression was used in conjunction with correlation to explore relationships between variables such as ventilation and carbon dioxide production. While the ANOVA appears to account for repeated measures (SPSS output was provided), the regression analysis did not, as there were many more observations than participants. Normality was assessed by applying the Shapiro-Wilk test to the raw data rather than to the model residuals, and no other assumptions were discussed. Regression coefficients were not discussed, only displayed in equations on scatterplots.

Data availability and software used

The authors provided a poorly formatted Excel file spread across multiple worksheets, with data copied and pasted into arbitrary columns and no participant identifiers or formal variable names. Variables required for the regression analyses had to be extracted manually, one at a time. The regression analyses were conducted using SPSS.

Regression sample

Four regressions were shown in scatterplots with regression lines, equations and correlation coefficients. The three randomly chosen regressions were Cycle ergometer VO₂, Half-squat VO₂, and Cycle ergometer ventilatory efficiency.

Computational reproducibility results

All three regression analyses were computationally reproduced. However, reproducing the regressions for cycle and half-squat VO₂, which used ergometer_log_VE as the independent variable, was challenging. The authors provided the data in an unlabelled Excel file, making the outcome and predictor variables ambiguous and easy to confuse. In addition, the data were presented graphically using scatterplots in which several outlying observations were omitted, giving the appearance of a more linear relationship than was present in the full dataset. Further complicating interpretation, the x-axis was labelled ergometer_log10_VE, although the plotted values corresponded to the raw data. This mislabelling made the reported model equations difficult to interpret, as the equations specified a log base 10 transformed predictor, whereas the transformation applied in the analyses was the natural logarithm.

Inferential reproducibility results

This paper was not inferentially reproducible, as all three models exhibited violations of the linearity assumption. In the Cycle Ergometer VO₂ model, these violations were largely driven by a single extremely influential observation. Linear (BIC = −61), quadratic (BIC = −87), and cubic (BIC = −215) specifications were therefore examined. Although the quadratic and cubic models showed BIC reductions exceeding 10 relative to the linear model, none provided an adequate fit. The cubic specification appeared to be dominated by extreme observations rather than reflecting the underlying trend, while the linear and quadratic models were pulled downward at the upper end of the predictor range.

After removal of the influential observation, the linear model remained a poor fit (BIC = −193), whereas the quadratic model provided the best fit (BIC = −230). Log transformation of the variables visually improved the alignment between the fitted line and the observed data; however, one high-leverage observation continued to have disproportionate influence on the slope fitted using ordinary least squares. Robust regression was therefore applied to down-weight this observation, producing a fitted relationship that more closely reflected the central trend of the data.

The remaining two models did not exhibit extreme outliers but showed violations of linearity, with linear terms failing to adequately characterise the associations. Both models were therefore refitted with log-transformed predictor and outcome variables. The log–log specification provided an adequate final model for the second analysis. In contrast, the third model continued to display residual curvature, and a natural cubic spline with three degrees of freedom was adopted as the final specification to allow modest curvature while limiting model complexity. Further modelling is constrained by the absence of participant-level identifiers. Without these identifiers, it is not possible to fit random-effects models or to formally assess whether the observed patterns reflect within-participant clustering rather than the underlying association.

Model 1

Model results for Cycle ergometer VO₂

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −0.8679 | | | | | |
| Cycle_ergometer_log_VE | 5.7886 | | | | | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Cycle ergometer VO₂

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.875 | | | | | | | | |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Cycle ergometer VO₂

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Cycle_ergometer_log_VE | | | | | |
| Residuals | | | | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Cycle ergometer VO₂

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −0.868 | 0.085 | −1.036 | −0.700 | −10.172 | <0.001 |
| Cycle_ergometer_log_VE | 5.789 | 0.158 | 5.478 | 6.099 | 36.696 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval; Calculated using type III SS.

Fit statistics for Cycle ergometer VO₂

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.884 | 0.782 | 0.781 | −99.541 | 0.210 | 1,346.579 | 1 | 376 | <0.001 |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Cycle ergometer VO₂

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Cycle_ergometer_log_VE | 59.952 | 1 | 59.952 | 1,346.579 | <0.001 |
| Residuals | 16.740 | 376 | 0.045 | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

| Term | Statistic | p-value | Results |
|---|---|---|---|
| Cycle_ergometer_log_VE | −2.178 | 0.0300 | Linearity violation |
| Tukey test | −2.178 | 0.0294 | Linearity violation |

Specification tests for the predictors use quadratic terms; curvature in the fitted values is tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results
  • Plots and tests show an unaccounted non-linear relationship may be present.
  • The outlier in these data causes linearity problems: the regression line is pulled down and does not fit the bulk of the observations.
Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
|---|---|---|---|
| 40.441 | <0.001 | 1 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test indicates heteroscedasticity.
  • Some heteroscedasticity is present in plots, and a sensitivity analysis using weighted or robust regression or wild bootstrapping is recommended.
Model descriptives including Cook’s distance and leverage to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| Cycle ergometer VO₂ | 378 | 2.238 | 0.451 | 2.178 | 0.829 | 3.534 | −0.142 | 0.906 |
| Cycle_ergometer_log_VE | 378 | 0.536 | 0.069 | 0.537 | 0.188 | 0.890 | −0.716 | 4.114 |
| .fitted | 378 | 2.238 | 0.399 | 2.240 | 0.220 | 4.286 | −0.716 | 4.114 |
| .resid | 378 | −0.000 | 0.211 | −0.002 | −1.956 | 0.609 | −2.465 | 20.365 |
| .leverage | 378 | 0.005 | 0.007 | 0.003 | 0.003 | 0.073 | 6.399 | 54.992 |
| .sigma | 378 | 0.211 | 0.002 | 0.211 | 0.183 | 0.211 | −16.364 | 291.909 |
| .cooksd | 378 | 0.013 | 0.188 | 0.001 | 0.000 | 3.629 | 18.981 | 362.736 |
| .std.resid | 378 | −0.000 | 1.011 | −0.010 | −9.627 | 2.992 | −2.617 | 22.382 |
| dfb.1_ | 378 | 0.002 | 0.167 | −0.003 | −0.229 | 2.943 | 14.915 | 255.211 |
| dfb.C___ | 378 | −0.002 | 0.172 | 0.001 | −3.043 | 0.246 | −14.901 | 255.544 |
| dffit | 378 | −0.001 | 0.181 | −0.001 | −3.100 | 0.833 | −13.119 | 226.552 |
| cov.r | 378 | 1.005 | 0.022 | 1.007 | 0.616 | 1.035 | −14.550 | 247.008 |


Cooks threshold

Cook’s distance measures the overall change in model fit when the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFITS measures how many standard deviations the fitted value changes when the ith observation is removed. Potential influential observations satisfy \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can flag a large number of points, so a practical cut-off of 1 was used to identify observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each parameter in the model. Potential influential observations satisfy \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model, so a practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical rule was used instead: observations with COVRATIO outside 0.9 to 1.1 were flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.
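The rule-of-thumb cut-offs described in the preceding paragraphs reduce to simple formulas in n and p. A minimal sketch (the function name is illustrative):

```python
import math

def influence_thresholds(n: int, p: int) -> dict:
    """Rule-of-thumb influence cut-offs for n observations and
    p parameters (including the intercept). Illustrative helper."""
    return {
        "cooks_d": 4 / n,                    # Cook's distance > 4/n
        "dffits": 2 * math.sqrt(p / n),      # |DFFITS| > 2*sqrt(p/n)
        "dfbetas": 2 / math.sqrt(n),         # |DFBETAS| > 2/sqrt(n)
        "covratio_band": 3 * p / n,          # flag |COVRATIO - 1| beyond this
    }

# Model 1 of this paper: 378 observations, 2 parameters
print(influence_thresholds(n=378, p=2))
```

With n = 378 these formulaic thresholds are tiny (e.g. 4/378 ≈ 0.011 for Cook's distance), which is why the report falls back on practical cut-offs of 1 and the 0.9 to 1.1 COVRATIO band.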

Observations of interest identified by the influence plot

| ID | StudRes | Leverage | CookD | dfb.1_ | dfb.C___ | dffit | cov.r |
|---|---|---|---|---|---|---|---|
| 266 | −4.359 | 0.011 | 0.105 | 0.378 | −0.410 | −0.468 | 0.921 |
| 106 | 3.024 | 0.071 | 0.340 | 0.831 | −0.817 | 0.833 | 1.031 |
| 84 | −11.075 | 0.073 | 3.629 | 2.943 | −3.043 | −3.100 | 0.616 |

StudRes = studentized residuals; CookD = Cook's Distance a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

Three observations had studentized residuals greater than three in absolute value. Two of these did not make a substantial difference to the estimates, as their DFBETAS, DFFITS, and COVRATIO statistics were within conventional thresholds. The third observation was an extreme outlier, with a studentized residual of approximately −11, a Cook’s distance of approximately 3.6, and DFBETAS and DFFITS values around three in absolute value, indicating that both the regression coefficients and their standard errors would change substantially if this observation were removed. The COVRATIO for this observation was approximately 0.6, suggesting that it also affected the precision of the parameter estimates. Residual diagnostics further suggested that this observation may be contributing to violations of the linearity assumption.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

| Statistic | p-value | Method |
|---|---|---|
| 0.094 | 0.0025 | Asymptotic one-sample Kolmogorov-Smirnov test |

| Statistic | p-value | Method |
|---|---|---|
| 0.838 | <0.001 | Shapiro-Wilk normality test |

Normality results
  • The Kolmogorov-Smirnov test indicates residuals may not be normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot indicates the residuals are not normally distributed.
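The two normality tests above can be sketched with scipy on standardized residuals. The residuals here are simulated with a heavy left tail to echo the pattern in this model's diagnostics; none of the numbers are the paper's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Simulated standardized residuals: mostly normal, with a heavy left tail
resid = np.concatenate([rng.normal(0, 1, 370), rng.normal(-6, 1, 8)])

# Shapiro-Wilk test of normality
sw_stat, sw_p = stats.shapiro(resid)
# One-sample Kolmogorov-Smirnov test against the standard normal
ks_stat, ks_p = stats.kstest(resid, "norm")
print(f"Shapiro-Wilk W = {sw_stat:.3f} (p = {sw_p:.3g})")
print(f"Kolmogorov-Smirnov D = {ks_stat:.3f} (p = {ks_p:.3g})")
```

Note that with a few hundred observations the Shapiro-Wilk test is far more sensitive to tail outliers than the KS test, which is why the Q-Q plot remains the primary diagnostic.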
Assessing independence with the Durbin–Watson test for autocorrelation

| AutoCorrelation | Statistic | p-value |
|---|---|---|
| 0.602 | 0.795 | <0.001 |

Independence results
  • The Durbin–Watson test suggests there are auto-correlation issues.
  • The study design is not independent and should be assessed using linear mixed models or generalized estimating equations.
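The Durbin-Watson diagnostic above can be sketched on simulated AR(1) errors, which mimic repeated measures recorded in time order. The autocorrelation of 0.6 is chosen only to illustrate how a statistic near 0.8 arises; it is not fitted to the paper's data.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
# AR(1) errors with lag-1 autocorrelation of about 0.6 (illustrative)
e = np.zeros(378)
for t in range(1, 378):
    e[t] = 0.6 * e[t - 1] + rng.normal(0, 0.1)

dw = durbin_watson(e)
# DW is approximately 2(1 - r1): values well below 2 indicate positive
# autocorrelation, consistent with the 0.795 reported above
print(f"Durbin-Watson = {dw:.3f}")
```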
Assumption conclusions

The model assumptions were dominated by an extreme outlier, with a studentized residual of approximately −11, a Cook’s distance of approximately 3.6, and DFBETAS and DFFITS values around three in absolute value, indicating that both the regression coefficients and their standard errors would change substantially if this observation were removed. The COVRATIO for this observation was approximately 0.6, suggesting a material impact on the precision of the parameter estimates. Residual diagnostics further indicated that this observation may be contributing to violations of the linearity assumption.

Residual plots also suggested departures from normality, evidence of autocorrelation, and mild heteroscedasticity. In addition, the study design involved repeated measures, which were not accounted for in the analysis. Taken together, these issues indicate that the fitted model poorly represents the underlying data structure.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Cycle ergometer VO₂

Change in regression coefficients

| Term | O_B | R_B | Change.B | Reproduce.B |
|---|---|---|---|---|
| Intercept | −0.8679 | −0.8679 | 0.0000 | Reproduced |
| Cycle_ergometer_log_VE | 5.7886 | 5.7886 | 0.0000 | Reproduced |

O_B = original B; R_B = reproduced B; Change.B = R_B − O_B; Reproduce.B = whether B was reproduced.

Change in p-values

| Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
|---|---|---|---|---|---|
| Intercept | | <0.001 | | | |
| Cycle_ergometer_log_VE | <0.001 | <0.001 | 0.0000 | Reproduced | Remains sig, B same direction |

O_p = original p-value; R_p = reproduced p-value; Change.p = R_p − O_p; Reproduce.p = whether the p-value was reproduced; SigChangeDirection = statistical significance and direction of B between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values

The p-value for the model was reproduced.

Conclusion computational reproducibility

This model was computationally reproducible: all reported statistics that were assessed were reproduced.

Methods

The model was successfully reproduced; however, residual diagnostics indicated non-linearity and the presence of influential observations that may affect both point estimates and confidence intervals. The data were standardized and the model fitted with and without the outlying observation to compare the differences in coefficients.

Linear, quadratic, and cubic models were therefore visualised using the same variable transformations as specified in the original model. Model fit was compared using AIC and BIC, with comparisons restricted to models fitted to the same dataset. Models with and without the influential outlier were assessed descriptively through visualisation rather than formal information-criterion comparison. To address assumption violations, the outcome variable (cycle ergometer VO₂) was log-transformed and the model refitted. Robust regression was then applied to reduce the effect of influential observations.

Results

After standardisation, the regression coefficients changed only slightly (0.88 with the outlier included vs 0.91 excluded). On this basis alone, one might conclude that the outlier does not substantially affect the results. However, this conclusion would be misleading. The apparent stability of the coefficient arises because the linear model provides a poor fit to the data both with and without the outlier, rather than because the outlier is inconsequential.

| Characteristic | All data: N | Beta | 95% CI¹ | p-value | Outlier removed: N | Beta | 95% CI¹ | p-value |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 378 | 0.0000 | −0.0473, 0.0473 | >0.999 | 377 | 0.0000 | −0.0411, 0.0411 | >0.999 |
| z_Cycle_ergometer_log_VE | 378 | 0.8842 | 0.8368, 0.9315 | <0.001 | 377 | 0.9141 | 0.8729, 0.9552 | <0.001 |

¹ CI = Confidence Interval

Returning to the unstandardized data, the initial model showed substantial violations of the regression assumptions, driven by an extreme influential outlier, as shown by fitting three models: linear (BIC = −61), quadratic (BIC = −87), and cubic (BIC = −215). The cubic and quadratic models showed BIC reductions of more than 10; however, none of the models provided an adequate fit. The cubic model appeared to follow individual extreme observations rather than the underlying trend, while both the linear and quadratic fits were artificially pulled downward at the upper end of the predictor range. After removal of the outlier, the linear model remained a poor representation of the data (BIC = −193); the quadratic model fitted best (BIC = −230), ahead of the cubic (BIC = −224), although residual curvature and continued influence from observations at the upper end of the distribution remained.

Comparison of non linear models with full data

Comparison of non linear models with outlier removed

The full dataset was re-analysed with the outlier retained, applying log-transformations to both the predictor and outcome variables. Visually, the resulting log–log model performed similarly to the quadratic model fitted to the reduced dataset. In the log–log specification, ventilation (VE) was strongly associated with oxygen uptake (VO₂), with a 1% increase in VE associated with an approximately 2.9% increase in VO₂.

Although the log–log transformation did not substantially improve the residual diagnostics, it visually improved the alignment of the fitted line with the observed data. However, one high-leverage observation exerted disproportionate influence on the OLS slope. Robust regression was therefore examined as a sensitivity analysis, downweighting this observation and producing a fitted line that more closely reflected the central trend of the data. The exponent for VE indicates that VO₂ is proportional to VE raised to the power of 3.4. This implies an elastic relationship, where proportional changes in VE are associated with amplified proportional changes in VO₂. For example, doubling VE would be associated with an approximate \(2^{3.4} \approx 10.6\)-fold increase in VO₂.

Log-log model with comparison of OLS and robust regression

Comparison of Robust regression with full data on log scale

| Variable | OLS: Estimate | SE | Lower | Upper | p-value | Robust: Estimate | SE | Lower | Upper | p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | −0.785 | 0.038 | −0.860 | −0.710 | <0.001 | −1.014 | 0.048 | −1.107 | −0.920 | <0.001 |
| Cycle_ergometer_log_VE | 2.922 | 0.071 | 2.784 | 3.061 | <0.001 | 3.357 | 0.084 | 3.192 | 3.522 | <0.001 |

Further modelling is constrained by the absence of participant-level identifiers. Without these identifiers, it is not possible to fit random-effects models or formally investigate whether the observed patterns are driven by within-participant clustering rather than the underlying association.

Model 2

Model results for Half-squat VO₂

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −0.5116 | | | | | |
| Half_squat_log_VE | 4.3696 | | | | | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Half-squat VO₂

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.853 | | | | | | | | |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Half-squat VO₂

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Half_squat_log_VE | | | | | |
| Residuals | | | | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results Half-squat VO₂

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | −0.512 | 0.069 | −0.648 | −0.376 | −7.396 | <0.001 |
| Half_squat_log_VE | 4.370 | 0.142 | 4.091 | 4.648 | 30.830 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval; Calculated using type III SS.

Model fit for Half-squat VO₂

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.850 | 0.722 | 0.721 | −494.657 | 0.123 | 950.494 | 1 | 366 | <0.001 |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Half-squat VO₂

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Half_squat_log_VE | 14.355 | 1 | 14.355 | 950.494 | <0.001 |
| Residuals | 5.527 | 366 | 0.015 | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

| Term | Statistic | p-value | Results |
|---|---|---|---|
| Half_squat_log_VE | 4.352 | <0.001 | Linearity violation |
| Tukey test | 4.352 | <0.001 | Linearity violation |

Specification tests for the predictors use quadratic terms; curvature in the fitted values is tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

  • Plots and tests show an unaccounted non-linear relationship may be present.

Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
|---|---|---|---|
| 0.006 | 0.9373 | 1 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test does not indicate heteroscedasticity.
  • There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.
Model descriptives including Cook’s distance and leverage to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| Half-squat VO₂ | 368 | 1.612 | 0.233 | 1.594 | 0.913 | 2.269 | 0.248 | 0.171 |
| Half_squat_log_VE | 368 | 0.486 | 0.045 | 0.493 | 0.324 | 0.601 | −0.387 | 0.741 |
| .fitted | 368 | 1.612 | 0.198 | 1.645 | 0.902 | 2.113 | −0.387 | 0.741 |
| .resid | 368 | −0.000 | 0.123 | −0.012 | −0.440 | 0.437 | 0.497 | 0.740 |
| .leverage | 368 | 0.005 | 0.005 | 0.004 | 0.003 | 0.038 | 3.593 | 17.028 |
| .sigma | 368 | 0.123 | 0.000 | 0.123 | 0.121 | 0.123 | −3.742 | 19.314 |
| .cooksd | 368 | 0.003 | 0.011 | 0.001 | 0.000 | 0.134 | 8.094 | 78.252 |
| .std.resid | 368 | 0.000 | 1.002 | −0.100 | −3.582 | 3.592 | 0.503 | 0.754 |
| dfb.1_ | 368 | 0.000 | 0.065 | 0.000 | −0.222 | 0.506 | 2.627 | 20.490 |
| dfb.H___ | 368 | −0.000 | 0.065 | 0.000 | −0.491 | 0.236 | −2.256 | 18.506 |
| dffit | 368 | 0.005 | 0.084 | −0.006 | −0.202 | 0.527 | 1.878 | 7.841 |
| cov.r | 368 | 1.005 | 0.009 | 1.008 | 0.939 | 1.041 | −2.275 | 11.999 |


Cooks threshold

Cook’s distance measures the overall change in model fit when the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFITS measures how many standard deviations the fitted value changes when the ith observation is removed. Potential influential observations satisfy \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can flag a large number of points, so a practical cut-off of 1 was used to identify observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each model parameter. Potential influential observations satisfy \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model, so a practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical rule was used instead: observations with COVRATIO outside 0.9 to 1.1 were flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.

Observations of interest identified by the influence plot

ID  StudRes  Leverage  CookD  dfb.1_  dfb.H___  dffit   cov.r
1    0.687   0.036     0.009   0.131  −0.128     0.133  1.041
88  −3.641   0.003     0.020   0.050  −0.068    −0.202  0.939
64   2.344   0.038     0.107   0.457  −0.448     0.465  1.014
7    3.652   0.020     0.134   0.506  −0.491     0.527  0.955

StudRes = studentized residuals; CookD = Cook's distance, a combined measure of leverage and influence. DFBETAS (dfb.*) measure how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points
  • Although two observations had studentized residuals greater than three, Cook’s distance, DFBETAS, DFFITS and COVRATIO were within acceptable limits, suggesting minimal influence on model estimates.
Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic  p-value  Method
0.074      0.0357   Asymptotic one-sample Kolmogorov-Smirnov test
0.979      <0.001   Shapiro-Wilk normality test

Normality results
  • The Kolmogorov-Smirnov test indicates residuals may not be normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot indicates the residuals are not normally distributed.
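The Kolmogorov–Smirnov distance underlying the test above is straightforward to compute by hand; a numpy sketch on illustrative data (synthetic, not the study residuals) contrasting normal and right-skewed samples of the same size as this model:

```python
import numpy as np
from math import erf, sqrt

def ks_statistic(z):
    """One-sample KS distance between standardized values and N(0, 1)."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    cdf = np.array([0.5 * (1 + erf(v / sqrt(2))) for v in z])  # standard normal CDF
    # KS distance is the largest gap between the empirical and theoretical CDFs
    upper = np.max(np.arange(1, n + 1) / n - cdf)
    lower = np.max(cdf - np.arange(0, n) / n)
    return max(upper, lower)

rng = np.random.default_rng(4)
normal_z = rng.standard_normal(368)                       # consistent with normality
skewed = rng.exponential(1.0, 368)                        # right-skewed, illustrative only
skewed_z = (skewed - skewed.mean()) / skewed.std(ddof=1)
print(ks_statistic(normal_z) < ks_statistic(skewed_z))    # True: skew inflates the distance
```

Note that standardizing with the sample mean and SD before a KS test (as sketched here) makes the tabulated p-values only approximate; the Lilliefors correction addresses this.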
Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation  Statistic  p-value
0.555            0.883      <0.001

Independence results
  • The Durbin–Watson test suggests there are auto-correlation issues.
  • Observations in this study design are not independent and should be assessed using linear mixed models or generalized estimating equations.
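The Durbin–Watson statistic itself is simple to compute from ordered residuals; a numpy sketch on illustrative AR(1) residuals (synthetic, not the study data):

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2). Values near 2 suggest no
    first-order autocorrelation; values well below 2 suggest positive autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(0)
# AR(1) residuals mimic repeated measures recorded in sequence (illustrative only)
e = np.zeros(368)
for t in range(1, 368):
    e[t] = 0.55 * e[t - 1] + rng.standard_normal()
print(durbin_watson(e) < 1.5)  # True: positive autocorrelation pulls DW well below 2
```

With a first-order autocorrelation of r, DW is approximately 2(1 − r), which is consistent with the reported values here (r = 0.555, DW = 0.883).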
Assumption conclusions

The residual diagnostics indicated departures from linearity and normality; however, the identified outliers were not considered influential. There was no indication of substantial heteroscedasticity.

Forest plot showing Original and Reproduced coefficients and 95% confidence intervals for Half-squat VO₂

Change in regression coefficients

term               O_B      R_B      Change.B  reproduce.B
Intercept          −0.5116  −0.5116  0.0000    Reproduced
Half_squat_log_VE  4.3696   4.3696   0.0000    Reproduced

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in p-values

Term               O_p     R_p     Change.p  Reproduce.p  SigChangeDirection
Intercept          <0.001
Half_squat_log_VE  <0.001  <0.001  0.0000    Reproduced   Remains sig, B same direction

O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p − O_p); Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values
  • The p-value for this model was reproduced.

Conclusion computational reproducibility

This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, residual diagnostics indicated that the residuals were non-normally distributed and showed evidence of non-linearity. Linear, quadratic, and cubic models were fitted and compared using AIC and BIC, with standardized predictors used for reporting of coefficients. To address assumption violations, the outcome variable (Half squat VO₂) was log-transformed, the model was refitted, and the residuals were re-examined.

Results

Visual inspection of the linear, quadratic and cubic models suggested pronounced curvature at the lower end of the predictor range, with the linear model being pulled down at the upper end. Model fit statistics indicated substantial improvement for the quadratic model, with reductions in AIC and BIC exceeding 10 relative to the linear model, providing strong evidence that a linear specification was inadequate.

Comparison of non-linear terms standardized coefficients
Characteristic
Linear
Quadratic
Cubic
Beta 95% CI1 p-value Beta 95% CI1 p-value Beta 95% CI1 p-value
(Intercept) -0.0007 -0.0549, 0.0534 0.978 -0.0731 -0.1353, -0.0110 0.021 -0.0648 -0.1277, -0.0020 0.043
z_Half_squat_log_VE 0.8487 0.7945, 0.9028 <0.001 0.8767 0.8223, 0.9310 <0.001 0.9247 0.8443, 1.0051 <0.001
z_Half_squat_log_VE_sq


0.0724 0.0397, 0.1051 <0.001 0.0586 0.0217, 0.0954 0.002
z_Half_squat_log_VE_cube





-0.0142 -0.0317, 0.0033 0.112
1 CI = Confidence Interval
Comparison of non-linear terms original scales

Comparison of non-linear terms fit

Model_type  R2adj  sigma      df  logLik     AIC       BIC       deviance   df.residual  nobs
Linear      0.721  0.5279874  1   −286.1314  578.2628  589.9870  102.03009  366          368
Quadratic   0.734  0.5155056  2   −276.8238  561.6476  577.2800  96.99730   365          368
Cubic       0.735  0.5144226  3   −275.5451  561.0901  580.6305  96.32552   364          368

Because the X variable (Half_squat_log_VE) had already been log-transformed, rather than fitting a further quadratic term a simpler solution was to reduce the curvature by also logging the y variable (Half_squat_VO2). In the log–log model, the exponent for VE indicates that VO₂ is proportional to VE raised to the power of approximately 2.8. This implies an elastic relationship, where proportional changes in VE are associated with amplified proportional changes in VO₂. For example, doubling VE would be associated with an approximate 2\(^{2.8}\) ≈ 7-fold increase in VO₂.
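The ≈7-fold figure follows directly from the reported log–log slope (2.7633); a one-line arithmetic check:

```python
# In a log-log model, multiplying VE by k multiplies predicted VO2 by k ** beta.
beta = 2.7633  # reported slope for Half_squat_log_VE in the log-log model
fold_change_for_doubling = 2 ** beta
print(round(fold_change_for_doubling, 1))  # 6.8 -- i.e. roughly a 7-fold increase
```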

Model with log coefficients
Characteristic Beta 95% CI1 p-value
(Intercept) -0.8760 -0.9593, -0.7926 <0.001
Half_squat_log_VE 2.7633 2.5925, 2.9341 <0.001
1 CI = Confidence Interval
Model with log transformed variables

Residual diagnostics for the log–log model indicated no evidence of residual nonlinearity. Although a small number of observations had relatively large residuals, Cook’s distance values were well below conventional thresholds, suggesting no undue influence on the fitted model. Further modelling is constrained by the absence of participant-level identifiers. Without these identifiers, it is not possible to fit random-effects models or formally investigate whether the observed patterns are driven by within-participant clustering rather than the underlying association.

Residual plots

Model 3

Model results for Cycle ergometer VE

Term       B        SE  Lower  Upper  t  p-value
Intercept  −3.4427
VCO2       27.816                        <0.001

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit Statistics Cycle ergometer VE

R      R2  R2Adj  AIC  RMSE  F  DF1  DF2  p-value
0.892

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Cycle ergometer VE

Term       SS  DF  MS  F  p-value
VCO2
Residuals

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Cycle ergometer VE

Term       B       SE     Lower   Upper   t       p-value
Intercept  −3.443  1.281  −5.961  −0.924  −2.687  0.0075
VCO2       27.816  0.598  26.641  28.992  46.525  <0.001

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval; Calculated using type III SS.

Fit statistics for Cycle ergometer VE

R      R2     R2Adj  AIC        RMSE   F          DF1  DF2  p-value
0.915  0.838  0.837  2,653.921  5.617  2,164.557  1    419  <0.001

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Cycle ergometer VE

Term       SS          DF   MS          F          p-value
VCO2       68,627.952  1    68,627.952  2,164.557  <0.001
Residuals  13,284.526  419  31.705

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Visualisation of regression model

The blue line shows the line of best fit, with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non-linear relationships

Term        Statistic  p-value  Results
VCO2        4.399      <0.001   Linearity violation
Tukey test  4.399      <0.001   Linearity violation

Curvature in individual predictors is assessed with quadratic specification tests; curvature in the fitted values is tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

Plots and tests show that an unaccounted-for non-linear relationship may be present.

Testing for homoscedasticity

Statistic  p-value  Parameter  Method
0.448      0.5033   1          studentized Breusch-Pagan test

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.
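The studentized (Koenker) form of the Breusch–Pagan test reduces to n·R² from regressing the squared residuals on the predictors; a numpy sketch on illustrative data (synthetic, not the study residuals):

```python
import numpy as np

def breusch_pagan_lm(resid, x):
    """Koenker's studentized Breusch-Pagan LM statistic: n * R^2 from
    regressing squared residuals on the predictor(s); compare to chi-square."""
    u2 = np.asarray(resid, dtype=float) ** 2
    X = np.column_stack([np.ones(len(u2)), np.asarray(x, dtype=float)])
    fitted = X @ np.linalg.lstsq(X, u2, rcond=None)[0]
    r2 = 1 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return len(u2) * r2

rng = np.random.default_rng(2)
x = rng.uniform(0.6, 3.5, 421)            # VCO2-like predictor range (illustrative)
e_homo = rng.standard_normal(421)         # constant variance
e_het = rng.standard_normal(421) * x      # variance grows with x
print(breusch_pagan_lm(e_het, x) > 3.84)  # True: exceeds the chi-square(1) 5% cut-off
```

With homoscedastic residuals the statistic behaves like a chi-square(1) draw, which is consistent with the small reported value (0.448, p = 0.5033).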
Model descriptives including Cook’s distance and leverage to understand outliers

Term                N    Mean    SD      Median  Min      Max      Skewness  Kurtosis
Cycle ergometer VE  421  54.790  13.965  51.500  16.100   115.400  0.752     1.550
VCO2                421  2.093   0.460   2.027   0.629    3.515    0.141     0.925
.fitted             421  54.790  12.783  52.941  14.054   94.332   0.141     0.925
.resid              421  0.000   5.624   −0.419  −15.779  41.353   3.261     19.897
.leverage           421  0.005   0.004   0.003   0.002    0.027    2.566     6.753
.sigma              421  5.631   0.033   5.636   5.261    5.637    −9.667    99.964
.cooksd             421  0.003   0.015   0.000   0.000    0.207    9.025     95.495
.std.resid          421  0.000   1.002   −0.074  −2.809   7.358    3.259     19.840
dfb.1_              421  0.000   0.059   −0.001  −0.560   0.360    −2.046    34.262
dfb.VCO2            421  −0.000  0.061   0.001   −0.284   0.620    3.916     39.777
dffit               421  0.005   0.079   −0.005  −0.202   0.656    3.968     23.510
cov.r               421  1.005   0.021   1.007   0.765    1.032    −9.239    94.524


Cooks threshold

Cook’s distance measures the overall change in fit if the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.
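For this model the 4/n screen is far stricter than the practical 0.5–1 range; a short check using the reported n = 421 and the maximum Cook's distance from the model descriptives:

```python
n = 421                      # cycle-ergometer model observations
screen = 4 / n               # ~0.0095: flags many mildly influential points
max_cooksd = 0.207           # largest Cook's distance in the model descriptives

print(max_cooksd > screen)   # True: flagged by the 4/n screening rule
print(max_cooksd > 0.5)      # False: below the practical 0.5-1 cut-off
```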

DFFIT threshold

DFFITS measures how many standard deviations the ith fitted value changes when the ith observation is removed. Potential influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice this threshold can flag a large number of points, so a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each parameter in the model. Potential influential observations are identified by \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. A practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentised residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical cut-off between 0.9 to 1.1 was used to flag observations with meaningful impact on precision, although there is no agreed universal alternative cutoff.

ID   StudRes  Leverage  CookD  dfb.1_  dfb.VCO2  dffit   cov.r
346  −0.329   0.025     0.001   0.046  −0.050    −0.053  1.030
94    0.368   0.027     0.002   0.061  −0.058     0.061  1.032
311   7.875   0.004     0.100   0.360  −0.284     0.478  0.765
312   7.875   0.004     0.100   0.360  −0.284     0.478  0.765
232   3.620   0.017     0.108  −0.388   0.436     0.471  0.961
234   4.364   0.022     0.207  −0.560   0.620     0.656  0.940

StudRes = studentized residuals; CookD = Cook's distance, a combined measure of leverage and influence. DFBETAS (dfb.*) measure how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points
  • Four observations had studentized residuals > 3, with Cook’s distance, DFBETAS and DFFITS within conventional ranges. However, two COVRATIO values were substantially below 1, indicating potential inflation of standard errors (wider confidence intervals); these observations should therefore be investigated further.
Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic  p-value  Method
0.133      <0.001   Asymptotic one-sample Kolmogorov-Smirnov test
0.756      <0.001   Shapiro-Wilk normality test

Normality results
  • The Kolmogorov-Smirnov test indicates residuals may not be normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot indicates the residuals are not normally distributed.
Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation  Statistic  p-value
0.798            0.403      <0.001

Independence results
  • The Durbin–Watson test suggests there are auto-correlation issues.
  • Observations in this study design are not independent and should be assessed using linear mixed models or generalized estimating equations.
Assumption conclusions

Residual plots indicated departures from normality. Diagnostic plots and tests also suggested potential non-linearity, indicating that the fitted linear models may not have fully captured the underlying relationships. COVRATIO values suggested reduced precision of the confidence intervals, and residuals showed evidence of autocorrelation. Given the repeated-measures study design, the assumption of independence was unlikely to hold.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Cycle ergometer VE

Change in regression coefficients

term       O_B      R_B      Change.B  reproduce.B
Intercept  −3.4427  −3.4427  0.0000    Reproduced
VCO2       27.8160  27.8164  0.0004    Reproduced

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in p-values

Term       O_p     R_p     Change.p  Reproduce.p  SigChangeDirection
Intercept  0.0075
VCO2       <0.001  <0.001  0.0000    Reproduced   Remains sig, B same direction

O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p − O_p); Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values
  • The p-value for this model was reproduced.

Conclusion computational reproducibility

  • This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, residual diagnostics indicated departures from normality and evidence of nonlinearity. The predictors were standardized, and linear, quadratic, and cubic specifications were fitted and compared using AIC and BIC. Model visualisations were presented on the original scale, as specified by the authors. To address assumption violations, variables were log-transformed and the model refitted. Residual curvature persisted following transformation; therefore, quadratic and spline models were subsequently compared using AIC and BIC to assess relative model fit. A natural cubic spline with three degrees of freedom was used to capture residual curvature after log-transformation, while limiting model complexity and improving stability at the extremes of the predictor range.

Results

Visual inspection of the linear, quadratic and cubic models suggested pronounced curvature at the upper end of the predictor range. Model fit statistics indicated substantial improvement for the quadratic model, with reductions in AIC and BIC exceeding 10 relative to the linear model, providing strong evidence that a linear specification was inadequate. Therefore, this model was not inferentially reproducible.
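The polynomial comparison can be sketched end-to-end with ordinary least squares fits and a Gaussian AIC; an illustrative numpy example on synthetic data with genuine curvature (not the study measurements):

```python
import numpy as np

def gaussian_aic(y, fitted, k):
    """AIC for a Gaussian OLS model with k coefficients plus the error variance."""
    n = len(y)
    rss = np.sum((np.asarray(y) - np.asarray(fitted)) ** 2)
    return n * np.log(rss / n) + 2 * (k + 1)

rng = np.random.default_rng(3)
x = rng.standard_normal(421)
y = 0.9 * x + 0.1 * x ** 2 + rng.normal(0, 0.4, 421)  # mild true curvature

aic = {}
for degree in (1, 2, 3):
    coefs = np.polyfit(x, y, degree)                  # OLS polynomial fit
    aic[degree] = gaussian_aic(y, np.polyval(coefs, x), degree + 1)

print(aic[1] - aic[2] > 10)  # True: quadratic improves AIC by well over 10
```

When the true curvature is quadratic, the cubic term buys little additional fit and is penalised by AIC/BIC, mirroring the pattern in the comparison table.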

Comparison of non-linear terms standardized coefficients
Characteristic
Linear
Quadratic
Cubic
Beta 95% CI1 p-value Beta 95% CI1 p-value Beta 95% CI1 p-value
(Intercept) 0.0004 -0.0383, 0.0390 0.985 -0.0492 -0.0930, -0.0053 0.028 -0.0516 -0.0955, -0.0077 0.021
z_VCO2 0.9152 0.8766, 0.9539 <0.001 0.9081 0.8702, 0.9461 <0.001 0.8652 0.7979, 0.9324 <0.001
z_VCO2_sq


0.0495 0.0274, 0.0717 <0.001 0.0504 0.0283, 0.0726 <0.001
z_VCO2_cube





0.0109 -0.0032, 0.0249 0.129
1 CI = Confidence Interval
Comparison of non-linear terms original scales

Fit comparison of non-linear

Model_type  R2adj  sigma      df  logLik     AIC       BIC       deviance  df.residual  nobs
Linear      0.837  0.4035717  1   −214.3548  434.7097  446.8376  68.24257  419          421
Quadratic   0.844  0.3950157  2   −204.8304  417.6607  433.8313  65.22362  418          421
Cubic       0.845  0.3943966  3   −203.6658  417.3317  437.5448  64.86378  417          421

To reduce curvature, both the predictor and outcome variables were log-transformed. Residual diagnostics indicated some remaining non-linearity, so quadratic and spline models were fitted and compared with the linear model using AIC and BIC; both substantially improved fit relative to the linear model (ΔAIC > 10). The spline model provided a marginally better fit than the quadratic model (ΔAIC = 6.6; ΔBIC = 2.6) and, after visual inspection, was selected as the final model.
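The quadratic-versus-spline deltas can be reproduced arithmetically from the reported log–log fit statistics (lower AIC/BIC is better):

```python
# AIC/BIC reported for the log-log models in the comparison below
aic = {"linear": -807.4135, "quadratic": -818.9034, "spline": -825.5198}
bic = {"linear": -795.2856, "quadratic": -802.7329, "spline": -805.3067}

delta_aic = aic["quadratic"] - aic["spline"]  # positive: spline preferred
delta_bic = bic["quadratic"] - bic["spline"]
print(round(delta_aic, 1), round(delta_bic, 1))  # 6.6 2.6
```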

Comparison log-log model

Model_type     R2adj  sigma       df  logLik    AIC        BIC        deviance  df.residual  nobs
log Linear     0.873  0.09230962  1   406.7068  −807.4135  −795.2856  3.570327  419          421
log Quadratic  0.877  0.09095109  2   413.4517  −818.9034  −802.7329  3.457738  418          421
log Spline     0.879  0.09013299  3   417.7599  −825.5198  −805.3067  3.387689  417          421

The spline terms were all statistically significant; however, individual spline coefficients are not directly interpretable. The fitted spline indicated a clear non-linear association between log(VCO₂) and log(ventilation).

Further modelling is constrained by the absence of participant-level identifiers. Without these identifiers, it is not possible to fit random-effects models or formally investigate whether the observed patterns are driven by within-participant clustering rather than the underlying association.

Spline fit

Spline results

term                    estimate  std.error  statistic  p.value
(Intercept)             2.832     0.049      57.401     <0.001
ns(log(VCO2), df = 3)1  0.991     0.027      36.507     <0.001
ns(log(VCO2), df = 3)2  2.275     0.109      20.952     <0.001
ns(log(VCO2), df = 3)3  1.402     0.033      42.652     <0.001