Paper 33: Ultrasonography and electrophysiological study of median nerve in patients with essential tremor

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

March 15, 2026

References

Lee HL, Kim J-s, Kim H, Kim Is, Kim J-w, Kim Y-e, et al. (2019) Ultrasonography and electrophysiological study of median nerve in patients with essential tremor. PLoS ONE 14(4): e0215750. https://doi.org/10.1371/journal.pone.0215750

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.
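The categorisation of individual statistics can be sketched as follows. This is an illustration only: the exact thresholds used in the project are not given in this report, so the rule below (a match after rounding to the reported number of decimal places, otherwise a discrepancy of at most one unit in the last reported decimal place) is an assumption, and `classify_statistic` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP


def decimal_places(reported: str) -> int:
    """Count the digits after the decimal point in a reported value."""
    text = reported.replace("−", "-")
    return len(text.split(".")[1]) if "." in text else 0


def classify_statistic(reported: str, reproduced: float) -> str:
    """Classify one reproduced statistic against the published value.

    Assumed rule (not taken from the report):
    - "Reproduced": equal after rounding to the reported decimal places
    - "Incorrect Rounding": within one unit of the last reported decimal place
    - "Not Reproduced": anything else
    """
    original = Decimal(reported.replace("−", "-"))
    unit = Decimal(1).scaleb(-decimal_places(reported))  # e.g. 0.001 for 3 dp
    value = Decimal(str(reproduced))
    if value.quantize(unit, rounding=ROUND_HALF_UP) == original:
        return "Reproduced"
    if abs(value - original) <= unit:
        return "Incorrect Rounding"
    return "Not Reproduced"


# Model 1: published B = −0.044, reproduced B = −0.0444
print(classify_statistic("−0.044", -0.0444))
```

Working from the published string rather than a float keeps the number of reported decimal places, which is what the tolerance is tailored to.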

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.
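The three thresholds for a meaningful difference can be written as a small comparison function. This is a minimal sketch, assuming the percentage change is taken relative to the linear-model value; `compare_estimates` is a hypothetical name, and the example inputs are the standardized confidence-interval ranges reported for Model 1 later in this report.

```python
def compare_estimates(linear: float, sensitivity: float) -> dict:
    """Flag meaningful differences between a linear-model quantity and its
    sensitivity-analysis counterpart (both on the standardized scale)."""
    diff = abs(sensitivity - linear)
    # Assumption: relative change is expressed against the linear-model value.
    pct = diff / abs(linear) * 100 if linear != 0 else float("inf")
    return {
        "abs_diff": round(diff, 4),
        "pct_diff": round(pct, 2),
        "diff_10pct": pct >= 10,   # relative change of 10% or more
        "diff_0.1": diff >= 0.1,   # moderate absolute difference
        "diff_0.2": diff >= 0.2,   # substantial absolute difference
    }


# CI ranges for Model 1 (linear model vs bootstrapped mixed model)
print(compare_estimates(0.6599, 0.8170))  # z_Sensory_amplitude
print(compare_estimates(0.6599, 1.0367))  # z_age
```

Under these rules the sensory amplitude range crosses the 10% and 0.1 thresholds only, whereas the age range crosses all three.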

Results from the reproduction of the Lee et al. (2019) paper are presented below. An overall summary of results is presented first, followed by model-specific results organised within tab panels. Within each panel, the Original results tab displays the linear regression outputs extracted from the published paper. The Reproduced results tab presents estimates derived from the authors’ shared data, along with a comprehensive assessment of linear regression assumptions. The Differences tab compares the original and reproduced models to assess computational reproducibility. Finally, the Sensitivity analysis tab evaluates inferential reproducibility by examining whether identified assumption violations meaningfully affected the results.

Summary from statistical review

This study examined the median nerve in patients with essential tremor. ANCOVA and linear regression were identified as the analysis methods, but the authors did not report checking model assumptions, outliers, or collinearity. Both hands of each patient were included in the study without correction for repeated measures. Regression coefficients were interpreted only in terms of their direction. Forward stepwise selection was used for the linear regression models.

Data availability and software used

The authors provided only half of the data: no data were supplied for the control group. This was not stated in their data availability statement, which indicated that all relevant data are within the manuscript and its Supporting Information file, although it was mentioned in the paper itself. The data were provided as a long-format SPSS file, but the authors did not take advantage of the built-in data dictionary. All analyses were performed in SPSS.

Regression sample

The authors used both ANCOVA and linear regression, all of which were counted as linear regression. Three models were randomly selected because no primary model of interest was identified. Two models had Carpal inlet as the outcome, with each explanatory variable (Sensory NCV and Sensory amplitude) modelled separately in forward stepwise models, adjusted for age. The third model was a univariate comparison of the Carpal outlet / mid-forearm ratio between the essential tremor and control groups; because no control data were provided, this model could not be assessed for reproducibility.

Computational reproducibility results

Based on the two models, the data provided were computationally reproducible. While some confidence intervals were reported in the text, only coefficients and p-values were provided for the selected models. In addition, only a subset of the data was available, with no data provided for the control group. It was also not immediately clear on first reading that Table 3 comprised 16 separate models. Providing full model results in the supplementary materials would have improved clarity and reproducibility.

Inferential reproducibility results

Neither model was inferentially reproducible. The primary limitation was the lack of adjustment for repeated measures, as both left and right hands from each participant were included in the analyses, with ICCs > 0.3. Evidence of heteroscedasticity was also observed. Consequently, linear mixed models with a wild bootstrap were fitted to account for within-participant clustering and heteroscedasticity. Comparisons of coefficients and 95% confidence intervals indicated medium (>0.10) to large (>0.20) differences in the standardized confidence-interval ranges. The direction of the coefficients and the statistical significance were consistent.

Model 1

Model results for Carpal inlet

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | | | | | | |
| Sensory_amplitude | −0.044 | | | | | <0.001 |
| age | | | | | | |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Carpal inlet

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| | | | | | | | | |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Carpal inlet

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Sensory_amplitude | | | | | |
| age | | | | | |
| Residuals | | | | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Carpal inlet

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | 19.604 | 1.952 | 15.642 | 23.567 | 10.044 | <0.001 |
| Sensory_amplitude | −0.044 | 0.012 | −0.069 | −0.020 | −3.636 | <0.001 |
| age | −0.138 | 0.026 | −0.191 | −0.085 | −5.291 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Carpal inlet

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.668 | 0.446 | 0.414 | 159.777 | 1.783 | 14.068 | 2 | 35 | <0.001 |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Carpal inlet

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Sensory_amplitude | 45.619 | 1 | 45.619 | 13.221 | <0.001 |
| age | 96.602 | 1 | 96.602 | 27.997 | <0.001 |
| Residuals | 120.763 | 35 | 3.450 | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.
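As an arithmetic check, the quantities in the reproduced Model 1 tables can be recomputed from one another. The sketch below uses only the rounded values printed in this report, so agreement is expected to within rounding. The observation that the reported RMSE matches a divisor of n rather than the residual degrees of freedom is inferred from the numbers and is not stated in the report.

```python
import math

n_obs = 38
df_resid = 35
ss_terms = {"Sensory_amplitude": 45.619, "age": 96.602}  # type III SS, 1 df each
ss_resid = 120.763
r2 = 0.446

# Mean square = SS / DF; each F statistic is a term's MS over the residual MS.
ms_resid = ss_resid / df_resid
f_stats = {term: (ss / 1) / ms_resid for term, ss in ss_terms.items()}

# Two common "error" summaries: residual SD (divides by residual df)
# and RMSE (here assumed to divide by n).
sigma = math.sqrt(ms_resid)
rmse = math.sqrt(ss_resid / n_obs)

# Adjusted R2 penalises R2 for the number of fitted parameters.
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid

print(round(ms_resid, 3), {k: round(v, 2) for k, v in f_stats.items()})
print(round(sigma, 3), round(rmse, 3), round(adj_r2, 3))
```

The recomputed F statistics, RMSE and adjusted R2 agree with the tables above to within rounding, which supports reading the ANOVA table as internally consistent.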

Visualisation of regression model

The blue line shows the line of best fit, with shading representing the 95% confidence interval, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

| Term | Statistic | p-value | Results |
|---|---|---|---|
| Sensory_amplitude | 1.198 | 0.2393 | No linearity violation |
| age | 1.049 | 0.3018 | No linearity violation |
| Tukey test | 0.896 | 0.3704 | No linearity violation |

Specification tests for the predictors use quadratic terms; for the fitted values, curvature is tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results
  • No linearity violation was observed in either plots or tests.
Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
|---|---|---|---|
| 11.848 | 0.0027 | 2 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test indicates heteroscedasticity.
  • Some heteroscedasticity is present in plots, and a sensitivity analysis using weighted or robust regression or wild bootstrapping is recommended.
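The p-value of the Breusch-Pagan test can be checked by hand, assuming the Parameter column gives the degrees of freedom of the chi-square reference distribution. With 2 degrees of freedom the upper-tail probability has the closed form exp(−x/2), so no statistics library is needed. The helper name `chi2_sf_df2` is hypothetical.

```python
import math


def chi2_sf_df2(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom.

    For df = 2 the chi-square distribution is an exponential distribution
    with mean 2, so P(X > x) = exp(-x / 2).
    """
    return math.exp(-statistic / 2)


# Model 1: studentized Breusch-Pagan statistic = 11.848 on 2 df
print(round(chi2_sf_df2(11.848), 4))
```

This shortcut only holds for 2 degrees of freedom; a model with a different number of predictors would need the general chi-square survival function.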
Model descriptives, including Cook’s distance and leverage, to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| Carpal inlet | 38 | 9.513 | 2.426 | 8.650 | 5.800 | 15.300 | 0.969 | 0.073 |
| Sensory_amplitude | 38 | 47.368 | 32.264 | 35.500 | 10.000 | 119.000 | 0.912 | −0.661 |
| age | 38 | 57.947 | 15.141 | 58.000 | 21.000 | 77.000 | −0.641 | −0.333 |
| .fitted | 38 | 9.513 | 1.620 | 9.000 | 7.175 | 12.767 | 0.586 | −0.951 |
| .resid | 38 | 0.000 | 1.807 | 0.223 | −3.067 | 3.667 | 0.086 | −0.936 |
| .leverage | 38 | 0.079 | 0.042 | 0.067 | 0.032 | 0.187 | 1.097 | 0.245 |
| .sigma | 38 | 1.857 | 0.031 | 1.867 | 1.770 | 1.885 | −1.256 | 0.798 |
| .cooksd | 38 | 0.033 | 0.049 | 0.017 | 0.000 | 0.193 | 1.958 | 2.980 |
| .std.resid | 38 | 0.004 | 1.018 | 0.124 | −1.777 | 2.030 | 0.089 | −0.929 |
| dfb.1_ | 38 | 0.002 | 0.232 | 0.002 | −0.697 | 0.617 | −0.134 | 2.102 |
| dfb.Sns_ | 38 | −0.001 | 0.193 | 0.003 | −0.625 | 0.498 | −0.680 | 2.270 |
| dfb.age | 38 | −0.002 | 0.219 | 0.004 | −0.568 | 0.646 | 0.114 | 1.872 |
| dffit | 38 | 0.018 | 0.326 | 0.030 | −0.731 | 0.779 | 0.180 | −0.006 |
| cov.r | 38 | 1.089 | 0.106 | 1.125 | 0.792 | 1.243 | −0.843 | 0.157 |

* categorical variable

Cooks threshold

Cook’s distance measures the overall change in fit, if the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFITS measures how many standard deviations the fitted value changes when the ith observation is removed. Potential influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can flag a large number of points, so a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETAS value for each parameter in the model. Potential influential observations are identified by \(|\text{DFBETAS}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. A practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance) of the estimated regression coefficients if the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 indicate inflated variances and reduced precision (wider confidence intervals), whereas values above 1 indicate deflated variances and increased precision (narrower confidence intervals). A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical range of 0.9 to 1.1 was used, with values outside it flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.
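For a concrete sense of scale, the size-adjusted guidelines described above can be evaluated for the models in this report, which have n = 38 observations and p = 3 parameters (intercept plus two predictors). This is a minimal sketch; `influence_thresholds` is a hypothetical helper name.

```python
import math


def influence_thresholds(n: int, p: int) -> dict:
    """Size-adjusted cut-offs for common influence diagnostics.

    n: number of observations; p: number of parameters including the intercept.
    """
    return {
        "cooks_d": 4 / n,                           # Cook's distance > 4/n
        "dffits": 2 * math.sqrt(p) / math.sqrt(n),  # |DFFITS| > 2*sqrt(p)/sqrt(n)
        "dfbetas": 2 / math.sqrt(n),                # |DFBETAS| > 2/sqrt(n)
        "covratio_low": 1 - 3 * p / n,              # |COVRATIO - 1| > 3p/n
        "covratio_high": 1 + 3 * p / n,
    }


for name, value in influence_thresholds(n=38, p=3).items():
    print(f"{name}: {value:.3f}")
```

With n = 38 these guidelines are roughly 0.105 for Cook's distance, 0.562 for DFFITS, 0.324 for DFBETAS, and 0.763 to 1.237 for COVRATIO. They are noticeably stricter than the practical cut-offs of 0.5 to 1 (Cook's distance) and 1 (DFFITS, DFBETAS), which explains why observations can exceed the size-adjusted guideline while still being described as within conventional ranges.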

Observations of interest identified by the influence plot

| ID | StudRes | Leverage | CookD | dfb.1_ | dfb.Sns_ | dfb.age | dffit | cov.r |
|---|---|---|---|---|---|---|---|---|
| 8 | 2.130 | 0.054 | 0.078 | 0.341 | −0.364 | −0.230 | 0.509 | 0.792 |
| 34 | 1.050 | 0.187 | 0.085 | 0.316 | −0.009 | −0.368 | 0.504 | 1.220 |
| 15 | 2.083 | 0.109 | 0.162 | 0.617 | −0.625 | −0.487 | 0.730 | 0.854 |
| 13 | −1.836 | 0.137 | 0.167 | −0.697 | 0.498 | 0.646 | −0.731 | 0.951 |
| 33 | 1.623 | 0.187 | 0.193 | 0.489 | −0.014 | −0.568 | 0.779 | 1.073 |

StudRes = studentized residuals; CookD = Cook's Distance a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

Cook’s distance, DFBETAS and DFFITS were within conventional ranges. However, some COVRATIO values suggest that multiple points may affect confidence-interval widths.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

| Statistic | p-value | Method |
|---|---|---|
| 0.089 | 0.8998 | Exact one-sample Kolmogorov-Smirnov test |
| 0.973 | 0.4893 | Shapiro-Wilk normality test |

Normality results
  • The Kolmogorov-Smirnov test supports residuals being normally distributed.
  • The Shapiro-Wilk test supports residuals being normally distributed.
  • QQ-plot looks roughly normal.
Assessing collinearity with VIF

| Term | VIF | Tolerance |
|---|---|---|
| Sensory_amplitude | 1.668 | 0.600 |
| age | 1.668 | 0.600 |

VIF = Variance Inflation Factor.

Collinearity results
  • All VIF values are under three, indicating no collinearity issues.
  • Overall, when taking into account VIF and SE, the model does not have collinearity issues.
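The VIF and tolerance columns are two views of the same quantity: tolerance is the reciprocal of the VIF, and 1 − tolerance is the R² from regressing one predictor on the others. With only two predictors, that R² is the squared correlation between them. The sketch below applies these identities to the reported VIF of 1.668; the function names are hypothetical.

```python
import math


def tolerance_from_vif(vif: float) -> float:
    """Tolerance is the reciprocal of the variance inflation factor."""
    return 1 / vif


def predictor_r2_from_vif(vif: float) -> float:
    """R2 of a predictor regressed on the other predictors: VIF = 1 / (1 - R2)."""
    return 1 - 1 / vif


vif = 1.668  # reported for both Sensory_amplitude and age in Model 1
tolerance = tolerance_from_vif(vif)
r2 = predictor_r2_from_vif(vif)
abs_correlation = math.sqrt(r2)  # valid here because there are only two predictors

print(round(tolerance, 3), round(r2, 3), round(abs_correlation, 3))
```

On this reading, a VIF of 1.668 corresponds to an absolute correlation of about 0.63 between sensory amplitude and age. That is a moderate association, yet still below the VIF threshold of three used here.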
Assessing independence with the Durbin–Watson test for autocorrelation

| AutoCorrelation | Statistic | p-value |
|---|---|---|
| 0.324 | 1.246 | 0.0160 |

Independence results
  • The Durbin–Watson test suggests there are auto-correlation issues.
  • The study design is not independent and should be assessed using linear mixed models or generalized estimating equations.
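To illustrate why paired hands can trigger the Durbin–Watson test, the statistic can be computed directly from an ordered residual series. The residual values below are hypothetical and are not taken from the study data; they mimic a file in which the two hands of each participant sit in adjacent rows and share a similar residual.

```python
def durbin_watson(residuals: list[float]) -> float:
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals.

    Values near 2 suggest no lag-1 autocorrelation, values below 2 suggest
    positive autocorrelation, and values above 2 negative autocorrelation.
    """
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals: consecutive pairs are the two hands of one participant.
paired = [1.2, 0.9, -0.8, -1.1, 0.5, 0.7, -0.4, -0.6]
print(durbin_watson(paired) < 2)  # clustered residuals pull the statistic below 2
```

Because the statistic depends on row order, it is a rough signal of clustering rather than a formal test of it, which is consistent with the recommendation above to model the design with mixed models or generalized estimating equations.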
Assumption conclusions

The residual plots and the Breusch–Pagan test suggested some heteroscedasticity. Normality checks (Q–Q plots and tests) indicated that the residuals were approximately normally distributed, with no clear evidence of nonlinearity. Outlier diagnostics suggested that individual observations did not strongly influence point estimates, although confidence-interval widths may be affected. Evidence of autocorrelation was observed, consistent with the repeated-measures design, indicating that the independence assumption may not hold.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Carpal inlet

Change in regression coefficients

| term | O_B | R_B | Change.B | reproduce.B |
|---|---|---|---|---|
| Intercept | | 19.6042 | | |
| Sensory_amplitude | −0.044 | −0.0444 | −0.0004 | Reproduced |
| age | | −0.1378 | | |

O_B = original B; R_B = reproduced B; Change.B = R_B − O_B; reproduce.B = whether B was reproduced.

Change in p-values

| Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
|---|---|---|---|---|---|
| Intercept | | <0.001 | | | |
| Sensory_amplitude | <0.001 | <0.001 | 0.0000 | Reproduced | Remains sig, B same direction |
| age | | <0.001 | | | |

O_p = original p-value; R_p = reproduced p-value; Change.p = R_p − O_p; Reproduce.p = whether the p-value was reproduced; SigChangeDirection = change in statistical significance and in the direction of B between the original and reproduced models. Note that p-values reported as <0.001 were set to 0.00099 for the purposes of comparison.
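The handling of censored p-values and the SigChangeDirection label can be sketched as below. The substitution of 0.00099 for "<0.001" follows the table note; the 0.05 significance level and the wording of any label other than "Remains sig" and "Remains non-sig" are assumptions, and both function names are hypothetical.

```python
def parse_p(reported: str) -> float:
    """Convert a reported p-value to a number; '<0.001' becomes 0.00099."""
    text = reported.strip()
    if text == "<0.001":
        return 0.00099
    if text.startswith(("<", ">")):
        raise ValueError(f"unsupported censored p-value: {reported}")
    return float(text)


def sig_change_direction(o_p: str, r_p: str, o_b: float, r_b: float,
                         alpha: float = 0.05) -> str:
    """Describe whether significance and coefficient sign are preserved."""
    o_sig, r_sig = parse_p(o_p) < alpha, parse_p(r_p) < alpha
    if o_sig and r_sig:
        status = "Remains sig"
    elif not o_sig and not r_sig:
        status = "Remains non-sig"
    elif o_sig:
        status = "Becomes non-sig"  # assumed label
    else:
        status = "Becomes sig"      # assumed label
    same_sign = (o_b > 0) == (r_b > 0)
    direction = "B same direction" if same_sign else "B changes direction"
    return f"{status}, {direction}"


change_p = parse_p("<0.001") - parse_p("<0.001")
print(change_p, sig_change_direction("<0.001", "<0.001", -0.044, -0.0444))
```

Mapping both censored values to the same number is why Change.p is reported as exactly 0.0000 for sensory amplitude even though the underlying p-values are not known to that precision.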

Results for p-values

The p-value was reproduced for this model.

Conclusion computational reproducibility

This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, diagnostics indicated potential heteroscedasticity and residual non-independence. To further assess the robustness of the findings, wild bootstrapping was used to obtain bootstrapped standardized regression coefficients and their 95% confidence intervals, accounting for potential heteroscedasticity. Absolute and percentage changes in coefficient estimates and confidence-interval bounds, relative to the linear model, were summarised using thresholds of a 10% change and absolute standardized coefficient differences of 0.10 and 0.20. Concordance of coefficient direction was assessed, and the stability of statistical significance across models was evaluated.
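The standardized coefficients used in this sensitivity analysis can be related to the unstandardized ones through the standard deviations in the model descriptives: a standardized coefficient equals the raw coefficient multiplied by SD(predictor) / SD(outcome). The sketch below applies this to the rounded values printed in this report, so small discrepancies in the last decimal place are expected. `standardize` is a hypothetical helper name.

```python
def standardize(b: float, sd_x: float, sd_y: float) -> float:
    """Convert a raw regression coefficient to a standardized coefficient."""
    return b * sd_x / sd_y


sd_carpal_inlet = 2.426  # outcome SD from the Model 1 descriptives

# Raw reproduced coefficients and predictor SDs as printed in this report.
z_sensory_amplitude = standardize(-0.0444, sd_x=32.264, sd_y=sd_carpal_inlet)
z_age = standardize(-0.1378, sd_x=15.141, sd_y=sd_carpal_inlet)

print(round(z_sensory_amplitude, 3), round(z_age, 3))
```

These values are close to the standardized coefficients reported for the reproduced linear model (−0.5909 and −0.8599), with the remaining gap attributable to rounding of the inputs.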

Results

Linear vs Mixed model

The AIC and BIC were lower for the mixed model than for the linear model. The unadjusted intraclass correlation coefficient (ICC) from the null model was 0.4, suggesting a substantial degree of clustering. As ignoring clustering can lead to underestimated standard errors, inference was obtained using wild bootstrap linear mixed-effects models.

Reproduced Linear Model

| Term | B | SE | t | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|---|---|
| (Intercept) | 0.0000 | 0.1242 | 0.0000 | −0.2521 | 0.2521 | >0.999 |
| z_Sensory_amplitude | −0.5909 | 0.1625 | −3.6361 | −0.9209 | −0.2610 | <0.001 |
| z_age | −0.8599 | 0.1625 | −5.2913 | −1.1899 | −0.5300 | <0.001 |

Sigma = 0.766; AIC = 92.4; BIC = 99.0; Residual df = 35; No. Obs. = 38
SE = Standard Error

Linear Mixed Model

| Term | B | SE | t | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|---|---|
| (Intercept) | −0.0000 | 0.1527 | −0.0000 | −0.3106 | 0.3106 | >0.999 |
| z_Sensory_amplitude | −0.5727 | 0.1796 | −3.1885 | −0.9382 | −0.2073 | 0.003 |
| z_age | −0.8484 | 0.1920 | −4.4195 | −1.2390 | −0.4578 | <0.001 |
| serial_no.SD (Intercept) | 0.5879 | | | | | |
| Residual.SD (Observations) | 0.4408 | | | | | |

Sigma = 0.441; AIC = 84.4; BIC = 92.6; No. Obs. = 38
SE = Standard Error
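The random-intercept and residual standard deviations reported for the linear mixed model imply an intraclass correlation for the adjusted model, computed as the between-participant variance divided by the total variance. Note that this is a covariate-adjusted ICC derived here from the printed SDs. It is a different quantity from the unadjusted ICC of 0.4 from the null model quoted above, and it is not reported by the authors of this review.

```python
def icc_from_sds(sd_between: float, sd_residual: float) -> float:
    """Intraclass correlation from random-intercept and residual SDs."""
    var_between = sd_between ** 2
    var_residual = sd_residual ** 2
    return var_between / (var_between + var_residual)


# SDs from the Linear Mixed Model table (standardized scale)
adjusted_icc = icc_from_sds(sd_between=0.5879, sd_residual=0.4408)
print(round(adjusted_icc, 2))
```

Either version of the ICC points the same way: a sizeable share of the residual variation sits between participants rather than between hands, which is why treating the 38 hands as independent observations understates uncertainty.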

Bootstrapped results

Wild bootstrapping was performed with 10,000 iterations.
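The idea behind a wild bootstrap can be sketched on a much simpler problem than the one analysed here. The example below is an assumption-laden illustration, not the procedure used in this report: it fits an ordinary least-squares line with one predictor (not a linear mixed model), uses Rademacher (±1) multipliers, takes percentile intervals, and runs on synthetic heteroscedastic data. All names are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one predictor."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def wild_bootstrap_slope(x, y, n_boot=2000, seed=1):
    """Percentile 95% CI for the slope using a Rademacher wild bootstrap."""
    rng = random.Random(seed)
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        # Each residual keeps its own magnitude (preserving heteroscedasticity)
        # but has its sign flipped at random.
        y_star = [fi + ei * rng.choice((-1.0, 1.0)) for fi, ei in zip(fitted, resid)]
        slopes.append(fit_line(x, y_star)[1])
    slopes.sort()
    lower = slopes[int(0.025 * n_boot)]
    upper = slopes[int(0.975 * n_boot) - 1]
    return slope, lower, upper


# Synthetic data: the noise SD grows with x, so the errors are heteroscedastic.
data_rng = random.Random(0)
x = [float(i) for i in range(1, 31)]
y = [10.0 - 0.3 * xi + data_rng.gauss(0, 0.1 * xi) for xi in x]

slope, lower, upper = wild_bootstrap_slope(x, y)
print(f"slope = {slope:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because each resampled residual stays attached to its own observation, the bootstrap distribution reflects unequal error variances, which a residual bootstrap that shuffles residuals across observations would not. Extending the idea to clustered data requires resampling at the participant level as well, which this sketch does not attempt.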

Change in regression coefficients

| Term | B | boot.B | B_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.0000 | 0.0003 | −0.0003 | −1,000.0000 | Yes | No | No |
| z_Sensory_amplitude | −0.5909 | −0.5721 | −0.0189 | −3.1900 | No | No | No |
| z_age | −0.8599 | −0.8474 | −0.0126 | −1.4600 | No | No | No |

B = reproduced standardized regression coefficient; boot.B = bootstrapped standardized regression coefficient; B_diff = B − boot.B; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

| Term | Lower | boot.Lower | Lower_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | −0.2521 | −0.3218 | 0.0697 | 27.6300 | Yes | No | No |
| z_Sensory_amplitude | −0.9209 | −0.9854 | 0.0645 | 7.0000 | No | No | No |
| z_age | −1.1899 | −1.3684 | 0.1785 | 15.0000 | Yes | Yes | No |

Lower = reproduced standardized lower CI bound; boot.Lower = bootstrapped standardized lower CI bound; Lower_diff = Lower − boot.Lower; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

| Term | Upper | boot.Upper | Upper_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.2521 | 0.3292 | −0.0771 | −30.5600 | Yes | No | No |
| z_Sensory_amplitude | −0.2610 | −0.1684 | −0.0926 | −35.5000 | Yes | No | No |
| z_age | −0.5300 | −0.3317 | −0.1983 | −37.4100 | Yes | Yes | No |

Upper = reproduced standardized upper CI bound; boot.Upper = bootstrapped standardized upper CI bound; Upper_diff = Upper − boot.Upper; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

| Term | Range | boot.Range | Range_Diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.5042 | 0.6509 | 0.1467 | 29.1000 | Yes | Yes | No |
| z_Sensory_amplitude | 0.6599 | 0.8170 | 0.1571 | 23.8100 | Yes | Yes | No |
| z_age | 0.6599 | 1.0367 | 0.3768 | 57.1000 | Yes | Yes | Yes |

Range = reproduced standardized CI range; boot.Range = bootstrapped standardized CI range; Range_Diff = boot.Range − Range; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

| Term | p-value | boot.p-value | changep | SigChangeDirection |
|---|---|---|---|---|
| Intercept | 1.0000 | 0.9987 | 0.0013 | Remains non-sig, B same direction |
| z_Sensory_amplitude | <0.001 | 0.0018 | −0.0009 | Remains sig, B same direction |
| z_age | <0.001 | <0.001 | −0.0002 | Remains sig, B same direction |

p-value = p-value from the reproduced standardized model; boot.p-value = bootstrapped p-value; changep = p-value − boot.p-value; SigChangeDirection = change in statistical significance and in the direction of B between the reproduced and bootstrapped models.

Check the distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Conclusions based on bootstrapped model

Substantial correlation of the residuals was observed because the study design involved repeated measures: the unadjusted ICC was 0.4, and the AIC for the linear mixed model was 8 units lower than for the linear model. The linear mixed model was bootstrapped; compared with the linear model, the standardized difference in CI range was 0.16 for sensory amplitude and 0.38 for age. The model was therefore not inferentially reproducible.

Model 2

Model results for Carpal inlet

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | | | | | | |
| Sensory_NCV | −0.269 | | | | | <0.001 |
| age | | | | | | |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Carpal inlet

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| | | | | | | | | |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Carpal inlet

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Sensory_NCV | | | | | |
| age | | | | | |
| Residuals | | | | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Carpal inlet

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | 28.104 | 3.066 | 21.881 | 34.327 | 9.168 | <0.001 |
| Sensory_NCV | −0.269 | 0.055 | −0.380 | −0.158 | −4.913 | <0.001 |
| age | −0.115 | 0.020 | −0.155 | −0.075 | −5.833 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Model fit for Carpal inlet

| R | R2 | R2Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.740 | 0.548 | 0.522 | 152.022 | 1.610 | 21.215 | 2 | 35 | <0.001 |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Carpal inlet

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Sensory_NCV | 67.911 | 1 | 67.911 | 24.138 | <0.001 |
| age | 95.728 | 1 | 95.728 | 34.025 | <0.001 |
| Residuals | 98.471 | 35 | 2.813 | | |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the line of best fit, with shading representing the 95% confidence interval, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

| Term | Statistic | p-value | Results |
|---|---|---|---|
| Sensory_NCV | −1.264 | 0.2149 | No linearity violation |
| age | 0.528 | 0.6012 | No linearity violation |
| Tukey test | 1.693 | 0.0904 | No linearity violation |

Specification tests for the predictors use quadratic terms; for the fitted values, curvature is tested with Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

No linearity violation was observed in either plots or tests.

Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
|---|---|---|---|
| 1.310 | 0.5195 | 2 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • Some heteroscedasticity is present in plots, and a sensitivity analysis using weighted or robust regression or wild bootstrapping is recommended.
Model descriptives, including Cook’s distance and leverage, to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| Carpal inlet | 38 | 9.513 | 2.426 | 8.650 | 5.800 | 15.300 | 0.969 | 0.073 |
| Sensory_NCV | 38 | 44.395 | 5.460 | 45.000 | 32.000 | 55.000 | −0.293 | −0.452 |
| age | 38 | 57.947 | 15.141 | 58.000 | 21.000 | 77.000 | −0.641 | −0.333 |
| .fitted | 38 | 9.513 | 1.796 | 9.077 | 6.353 | 13.869 | 0.475 | −0.528 |
| .resid | 38 | 0.000 | 1.631 | −0.505 | −2.751 | 4.134 | 0.697 | −0.368 |
| .leverage | 38 | 0.079 | 0.045 | 0.061 | 0.031 | 0.219 | 1.385 | 1.296 |
| .sigma | 38 | 1.677 | 0.033 | 1.690 | 1.542 | 1.702 | −2.185 | 5.255 |
| .cooksd | 38 | 0.031 | 0.044 | 0.011 | 0.000 | 0.182 | 1.876 | 2.992 |
| .std.resid | 38 | 0.001 | 1.015 | −0.311 | −1.730 | 2.504 | 0.677 | −0.440 |
| dfb.1_ | 38 | 0.001 | 0.201 | −0.000 | −0.463 | 0.652 | 0.471 | 2.115 |
| dfb.S_NC | 38 | −0.000 | 0.184 | 0.002 | −0.630 | 0.402 | −0.778 | 2.517 |
| dfb.age | 38 | −0.002 | 0.198 | 0.000 | −0.674 | 0.497 | −0.341 | 2.780 |
| dffit | 38 | 0.004 | 0.314 | −0.068 | −0.599 | 0.754 | 0.644 | −0.127 |
| cov.r | 38 | 1.092 | 0.126 | 1.127 | 0.623 | 1.392 | −1.363 | 4.042 |

* categorical variable

Cooks threshold

Cook’s distance measures the overall change in fit, if the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFITS measures how many standard deviations the fitted value changes when the ith observation is removed. Potential influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can flag a large number of points, so a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETAS value for each parameter in the model. Potential influential observations are identified by \(|\text{DFBETAS}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. A practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance) of the estimated regression coefficients if the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 indicate inflated variances and reduced precision (wider confidence intervals), whereas values above 1 indicate deflated variances and increased precision (narrower confidence intervals). A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical range of 0.9 to 1.1 was used, with values outside it flagged as having a meaningful impact on precision, although there is no universally agreed alternative cut-off.

Observations of interest identified by the influence plot

| ID | StudRes | Leverage | CookD | dfb.1_ | dfb.S_NC | dfb.age | dffit | cov.r |
|---|---|---|---|---|---|---|---|---|
| 34 | 0.220 | 0.219 | 0.005 | 0.080 | −0.045 | −0.110 | 0.117 | 1.392 |
| 1 | 2.724 | 0.031 | 0.067 | 0.035 | 0.071 | −0.138 | 0.488 | 0.623 |
| 8 | 2.158 | 0.057 | 0.085 | 0.396 | −0.390 | −0.148 | 0.531 | 0.787 |
| 15 | 1.782 | 0.137 | 0.158 | 0.652 | −0.630 | −0.337 | 0.710 | 0.967 |
| 33 | 1.558 | 0.190 | 0.182 | 0.343 | −0.085 | −0.674 | 0.754 | 1.095 |

StudRes = studentized residuals; CookD = Cook's Distance a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.
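Applying the practical cut-offs described above to the five Model 2 observations of interest shows which diagnostics are driving the concern. The sketch below copies the values from the table; the field names and the `max_abs_dfbetas` summary (the largest absolute DFBETAS in each row) are introduced here for illustration.

```python
# Values copied from the Model 2 table of observations of interest.
observations = [
    {"id": 34, "cooks_d": 0.005, "max_abs_dfbetas": 0.110, "dffits": 0.117, "cov_r": 1.392},
    {"id": 1,  "cooks_d": 0.067, "max_abs_dfbetas": 0.138, "dffits": 0.488, "cov_r": 0.623},
    {"id": 8,  "cooks_d": 0.085, "max_abs_dfbetas": 0.396, "dffits": 0.531, "cov_r": 0.787},
    {"id": 15, "cooks_d": 0.158, "max_abs_dfbetas": 0.652, "dffits": 0.710, "cov_r": 0.967},
    {"id": 33, "cooks_d": 0.182, "max_abs_dfbetas": 0.674, "dffits": 0.754, "cov_r": 1.095},
]

# Practical cut-offs: Cook's distance 0.5, |DFFITS| and |DFBETAS| 1,
# COVRATIO outside 0.9 to 1.1.
flagged = {
    "cooks_d": sorted(o["id"] for o in observations if o["cooks_d"] >= 0.5),
    "dfbetas": sorted(o["id"] for o in observations if o["max_abs_dfbetas"] >= 1),
    "dffits": sorted(o["id"] for o in observations if abs(o["dffits"]) >= 1),
    "cov_r": sorted(o["id"] for o in observations if not 0.9 <= o["cov_r"] <= 1.1),
}
print(flagged)
```

Only the COVRATIO rule flags any of these observations (IDs 1, 8 and 34), which matches the pattern of stable point estimates with potentially affected confidence-interval widths.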

Results for outliers and influential points

Cook’s distance, DFBETAS and DFFITS were within conventional ranges. However, some COVRATIO values suggest that multiple points may affect confidence-interval widths.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

| Statistic | p-value | Method |
|---|---|---|
| 0.136 | 0.4474 | Exact one-sample Kolmogorov-Smirnov test |
| 0.941 | 0.0435 | Shapiro-Wilk normality test |

Normality results
  • The Kolmogorov-Smirnov test did not reject normality of the residuals (p = 0.447).
  • The Shapiro-Wilk test indicates that the residuals may not be normally distributed (p = 0.044).
  • The Q–Q plot indicates that the residuals are not normally distributed.
Assessing collinearity with VIF

| Term | VIF | Tolerance |
|---|---|---|
| Sensory_NCV | 1.172 | 0.853 |
| age | 1.172 | 0.853 |

VIF = Variance Inflation Factor.

Collinearity results
  • All VIF values are under three, indicating no collinearity issues.
  • Overall, when taking into account VIF and SE, the model does not have collinearity issues.
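
With exactly two predictors, the VIF depends only on their pairwise correlation r, since VIF = 1/(1 − r²) and tolerance = 1/VIF. The sketch below back-calculates the implied correlation from the tabulated VIF. It is illustrative arithmetic rather than part of the original analysis, and the sign of r cannot be recovered from the VIF.

```python
import math

vif = 1.172  # VIF reported for both Sensory_NCV and age

tolerance = 1 / vif            # tolerance is the reciprocal of VIF
r_squared = 1 - tolerance      # R^2 from regressing one predictor on the other
abs_r = math.sqrt(r_squared)   # magnitude of the pairwise correlation

print(f"tolerance = {tolerance:.3f}")
print(f"implied R^2 = {r_squared:.3f}, |r| = {abs_r:.3f}")
```

The reciprocal reproduces the tabulated tolerance of 0.853, and the implied correlation between Sensory_NCV and age has a magnitude of roughly 0.38.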
Assessing independence with the Durbin–Watson test for autocorrelation

| AutoCorrelation | Statistic | p-value |
|---|---|---|
| 0.318 | 1.148 | 0.0060 |

Independence results
  • The Durbin–Watson test suggests that the residuals are autocorrelated.
  • Because of the study design, the observations are not independent; the data should be analysed using linear mixed models or generalized estimating equations.
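
For readers unfamiliar with the statistic, the following is a minimal sketch of how the Durbin–Watson statistic and the lag-1 autocorrelation are computed from a sequence of residuals. The residuals are synthetic and hypothetical (consecutive pairs loosely mimic two measurements per subject); they are not the study data. Both quantities depend on the order of the rows.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


def lag1_autocorrelation(residuals):
    """Lag-1 sample autocorrelation of the mean-centred residuals."""
    mean = sum(residuals) / len(residuals)
    centred = [e - mean for e in residuals]
    num = sum(a * b for a, b in zip(centred, centred[1:]))
    den = sum(c ** 2 for c in centred)
    return num / den


# Synthetic residuals (hypothetical): adjacent values tend to share a sign
resid = [0.8, 0.6, -0.5, -0.7, 0.3, 0.4, -0.9, -0.6, 0.5, 0.2]

dw = durbin_watson(resid)
r1 = lag1_autocorrelation(resid)
print(f"Durbin-Watson = {dw:.3f}, lag-1 autocorrelation = {r1:.3f}")
```

A statistic near 2 is consistent with no first-order autocorrelation; values below 2 point to positive autocorrelation, as seen when rows from the same subject sit next to each other.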
Assumption conclusions

Residual diagnostics suggested mild heteroscedasticity, although the Breusch–Pagan test was not statistically significant. The Q–Q plot indicated minor deviations from normality, while there was no clear evidence of nonlinearity. Outlier and influence diagnostics showed that individual observations did not materially affect point estimates, although confidence-interval widths may be sensitive to influential cases. Evidence of autocorrelation was observed, consistent with the repeated-measures design, indicating that the assumption of independent observations may not be fully satisfied.

Forest plot showing Original and Reproduced coefficients and 95% confidence intervals for Carpal inlet

Change in regression coefficients

| term | O_B | R_B | Change.B | reproduce.B |
|---|---|---|---|---|
| Intercept | | 28.1040 | | |
| Sensory_NCV | −0.269 | −0.2686 | 0.0004 | Reproduced |
| age | | −0.1150 | | |

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in p-values

| Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
|---|---|---|---|---|---|
| Intercept | | <0.001 | | | |
| Sensory_NCV | <0.001 | <0.001 | 0.0000 | Reproduced | Remains sig, B same direction |
| age | | <0.001 | | | |

O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value (R_p - O_p); Reproduce.p = p-values reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note that p-values reported as <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values

The p-value for this model was reproduced.

Conclusion computational reproducibility

This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, diagnostics suggested potential heteroscedasticity, non-normality and non-independence of residuals. To further assess the robustness of the findings, wild bootstrapping was used to obtain bootstrapped standardized regression coefficients and their 95% confidence intervals, accounting for potential heteroscedasticity. Absolute and percentage changes in coefficient estimates and confidence-interval bounds, relative to the linear model, were summarised using thresholds of a 10% change and standardized coefficient differences of <0.10 and <0.20. Concordance of coefficient direction was assessed, and the stability of statistical significance was evaluated across models.

Results

Linear vs Mixed model

The AIC and BIC were lower for the mixed model compared to the linear model. The unadjusted intraclass correlation coefficient (ICC) from the null model was 0.309, suggesting a substantial degree of clustering.

Reproduced Linear Model

| Term | B | SE | t | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|---|---|
| (Intercept) | 0.0000 | 0.1121 | 0.0000 | −0.2277 | 0.2277 | >0.999 |
| z_Sensory_NCV | −0.6045 | 0.1230 | −4.9130 | −0.8543 | −0.3547 | <0.001 |
| z_age | −0.7177 | 0.1230 | −5.8331 | −0.9675 | −0.4679 | <0.001 |

Sigma = 0.691; AIC = 84.7; BIC = 91.2; Residual df = 35; No. Obs. = 38
SE = Standard Error

Linear Mixed Model

| Term | B | SE | t | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|---|---|
| (Intercept) | 0.0000 | 0.1351 | 0.0000 | −0.2748 | 0.2748 | >0.999 |
| z_Sensory_NCV | −0.5556 | 0.1305 | −4.2574 | −0.8212 | −0.2901 | <0.001 |
| z_age | −0.6990 | 0.1457 | −4.7957 | −0.9955 | −0.4024 | <0.001 |
| serial_no.SD (Intercept) | 0.5013 | | | | | |
| Residual.SD (Observations) | 0.4368 | | | | | |

Sigma = 0.437; AIC = 79.4; BIC = 87.6; No. Obs. = 38
SE = Standard Error
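
The model comparison can be checked with simple arithmetic on the values reported in the two summaries. The sketch below computes the differences in AIC and BIC and the ICC implied by the variance components of the fitted mixed model. That ICC is conditional on the fixed effects, so it is a different quantity from the unadjusted ICC of 0.309 obtained from the null model; the variable names are illustrative.

```python
# Values copied from the two model summaries
sd_between = 0.5013   # serial_no.SD (Intercept)
sd_within = 0.4368    # Residual.SD (Observations)
aic_lm, bic_lm = 84.7, 91.2      # linear model
aic_lmm, bic_lmm = 79.4, 87.6    # linear mixed model

var_between = sd_between ** 2
var_within = sd_within ** 2

# ICC conditional on the fixed effects (not the null-model ICC)
icc_conditional = var_between / (var_between + var_within)

delta_aic = aic_lm - aic_lmm   # positive values favour the mixed model
delta_bic = bic_lm - bic_lmm

print(f"ICC (conditional on fixed effects) = {icc_conditional:.3f}")
print(f"AIC difference = {delta_aic:.1f}, BIC difference = {delta_bic:.1f}")
```

With the tabulated values, the AIC is about 5.3 units and the BIC about 3.6 units lower for the mixed model, and the conditional ICC is approximately 0.57.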

Bootstrapped results

As ignoring clustering can lead to underestimated standard errors, inference was obtained using wild bootstrap linear mixed-effects models with 10000 resamples.
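
To illustrate the general idea of a wild bootstrap that respects clustering, the sketch below flips the sign of the residuals once per cluster (Rademacher weights), rebuilds the outcome, refits the model and takes percentile limits. It is a deliberately simplified sketch and not the implementation used in this review: it assumes synthetic data, a single predictor fitted by ordinary least squares rather than a linear mixed model, and 2,000 rather than 10,000 resamples. All names and the cluster structure are hypothetical.

```python
import random

random.seed(2026)

# Synthetic data (hypothetical): 19 clusters with 2 observations each
n_clusters, per_cluster = 19, 2
cluster, x, y = [], [], []
for c in range(n_clusters):
    u = random.gauss(0, 0.5)  # shared cluster effect induces within-cluster correlation
    for _ in range(per_cluster):
        xi = random.gauss(0, 1)
        cluster.append(c)
        x.append(xi)
        y.append(-0.6 * xi + u + random.gauss(0, 0.45))


def ols(x, y):
    """Intercept and slope for simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


intercept, slope = ols(x, y)
fitted = [intercept + slope * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]


def wild_cluster_bootstrap(n_boot=2000):
    """Return sorted bootstrap slopes using one Rademacher weight per cluster."""
    slopes = []
    for _ in range(n_boot):
        sign = [random.choice((-1.0, 1.0)) for _ in range(n_clusters)]
        y_star = [f + sign[c] * e for f, e, c in zip(fitted, resid, cluster)]
        slopes.append(ols(x, y_star)[1])
    return sorted(slopes)


boot = wild_cluster_bootstrap()
lower = boot[int(0.025 * len(boot))]        # approximate 2.5th percentile
upper = boot[int(0.975 * len(boot)) - 1]    # approximate 97.5th percentile
print(f"slope = {slope:.3f}, 95% percentile CI = ({lower:.3f}, {upper:.3f})")
```

Because every observation in a cluster receives the same weight, the within-cluster dependence of the residuals is preserved in each resample, which is why such intervals tend to be wider than those from a model that assumes independence.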

Change in regression coefficients

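| Term | B | boot.B | B_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.0000 | 0.0007 | −0.0007 | −1,000.0000 | Yes | No | No |
| z_Sensory_NCV | −0.6045 | −0.5552 | −0.0493 | −8.1600 | No | No | No |
| z_age | −0.7177 | −0.6983 | −0.0194 | −2.7000 | No | No | No |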

B = reproduced standardized regression coefficient; boot.B = bootstrapped standardized regression coefficient; B_diff = B - boot.B; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.
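
The comparison columns can be reproduced from the two estimates in each row. The sketch below is one way to do so, under two assumptions that are inferred from the tabulated values rather than stated in the report: the percentage difference is the difference divided by the absolute value of the linear-model estimate, and a zero denominator is treated as an infinite change that is then truncated at ±1000%. The function name `compare` is hypothetical.

```python
import math


def compare(original, boot, cap=1000.0):
    """Absolute and percentage difference with the threshold flags."""
    diff = original - boot
    if original == 0:
        # Assumed handling: undefined percentage treated as infinite, then capped
        pct = math.copysign(math.inf, diff) if diff != 0 else 0.0
    else:
        pct = 100 * diff / abs(original)
    pct = max(-cap, min(cap, pct))  # truncate at +/-1000%
    return {
        "diff": round(diff, 4),
        "pct": round(pct, 2),
        "diff_10pct": abs(pct) >= 10,
        "diff_0.1": abs(diff) >= 0.1,
        "diff_0.2": abs(diff) >= 0.2,
    }


# (B, boot.B) pairs copied from the table of regression coefficients
rows = {
    "Intercept": (0.0000, 0.0007),
    "z_Sensory_NCV": (-0.6045, -0.5552),
    "z_age": (-0.7177, -0.6983),
}
results = {term: compare(b, boot_b) for term, (b, boot_b) in rows.items()}

for term, res in results.items():
    print(term, res)
```

The intercept row shows why truncation is needed: a standardized intercept of zero makes any percentage change unbounded, so its Diff_10% flag carries little information compared with the absolute-difference flags.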

Change in lower 95% confidence interval

| Term | Lower | boot.Lower | Lower_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | −0.2277 | −0.2832 | 0.0555 | 24.3900 | Yes | No | No |
| z_Sensory_NCV | −0.8543 | −0.8853 | 0.0310 | 3.6300 | No | No | No |
| z_age | −0.9675 | −1.0429 | 0.0755 | 7.8000 | No | No | No |

Lower = reproduced standardized lower CI limit; boot.Lower = bootstrapped standardized lower CI limit; Lower_diff = Lower - boot.Lower; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

| Term | Upper | boot.Upper | Upper_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.2277 | 0.2893 | −0.0616 | −27.0800 | Yes | No | No |
| z_Sensory_NCV | −0.3547 | −0.2314 | −0.1233 | −34.7700 | Yes | Yes | No |
| z_age | −0.4679 | −0.3493 | −0.1186 | −25.3400 | Yes | Yes | No |

Upper = reproduced standardized upper CI limit; boot.Upper = bootstrapped standardized upper CI limit; Upper_diff = Upper - boot.Upper; %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

| Term | Range | boot.Range | Range_Diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Intercept | 0.4553 | 0.5725 | 0.1172 | 25.7300 | Yes | Yes | No |
| z_Sensory_NCV | 0.4996 | 0.6539 | 0.1543 | 30.9000 | Yes | Yes | No |
| z_age | 0.4996 | 0.6936 | 0.1941 | 38.8400 | Yes | Yes | No |

Range = reproduced standardized CI range; boot.Range = bootstrapped standardized CI range; Range_Diff = change in CI range (boot.Range - Range); %_Diff = percentage difference (truncated at ±1000%); Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

| Term | p-value | boot.p-value | changep | SigChangeDirection |
|---|---|---|---|---|
| Intercept | 1.0000 | 0.9951 | 0.0049 | Remains non-sig, B same direction |
| z_Sensory_NCV | <0.001 | <0.001 | −0.0002 | Remains sig, B same direction |
| z_age | <0.001 | <0.001 | −0.0002 | Remains sig, B same direction |

p-value = reproduced standardized p-value; boot.p-value = bootstrapped standardized p-value; changep = p-value - boot.p-value; SigChangeDirection = statistical significance and B change between the reproduced and bootstrapped models.

Check distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Conclusions based on the bootstrapped model

Substantial correlation of the residuals was observed, consistent with the repeated-measures design: the unadjusted ICC was 0.309, and the AIC for the linear mixed model was about 5 units lower than for the linear model. The linear mixed model was bootstrapped; compared with the linear model, the standardized difference in CI range was 0.1543 for Sensory_NCV and 0.1941 for age. Both exceeded the 0.10 threshold, and the model was judged not inferentially reproducible.