Paper 32: Is there a fair allocation of healthcare research funds by the European Union?

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

March 15, 2026

Reference

Kalo Z, van den Akker LHM, Voko Z, Csana di M, Pitter JG (2019) Is there a fair allocation of healthcare research funds by the European Union? PLoS ONE 14(4): e0207046. https://doi.org/10.1371/journal.pone.0207046

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.

Results from the reproduction of the Kalo et al. (2019) paper are presented below. An overall summary of results is presented first, followed by model-specific results organised within tab panels. Within each panel, the Original results tab displays the linear regression outputs extracted from the published paper. The Reproduced results tab presents estimates derived from the authors’ shared data, along with a comprehensive assessment of linear regression assumptions. The Differences tab compares the original and reproduced models to assess computational reproducibility. Finally, the Sensitivity analysis tab evaluates inferential reproducibility by examining whether identified assumption violations meaningfully affected the results.

Summary from statistical review

This paper investigated factors associated with grant allocation in the European Union. The outcome in the linear regression analyses was country-level grant funding. The Multivariable model appears consistent with a backward selection approach, although this was not explicitly described as such. Assumptions of linear regression were assessed using residual diagnostics and graphical checks. Collinearity and outliers were examined; however, influential observations were retained in the final models without accompanying sensitivity analyses.

Data availability and software used

The authors provided the data in a Word (DOCX) file as supporting information, in a wide format. Importing the data was difficult and required additional cleaning due to the file format. The authors used Stata for the regression analyses.

Regression sample

There were two multivariable models examining health research funding per 100,000 people, which where related to the main research question, with initial variable model which had predictors of (1) average gross domestic product per capita (GDP per capita), (2) average population size, (3) disease burden in disability-adjusted life years (DALY) per 100,000 inhabitants , and (4) excellence of research capabilities (citations). The second and final multivariable model, had DALY removed. There were also four univariate analysis, of which citations was selected randomly for reproduction.

Computational reproducibility results

This paper was not computationally reproducible, as one of the three models could not be reproduced. The univariate analysis for citations was reproducible, and the final multivariable model was mostly reproducible, with only minor rounding differences. However, the initial multivariable model was not reproducible.

To investigate this further, the univariate association for DALYs was examined. The authors reported that “disease burden showed a statistically significant negative association of −97,410 EUR per 100,000 inhabitants, and per 1,000 additional DALYs per 100,000 inhabitants (p = 0.003; R² = 0.30).” While this presentation is confusing due to the simultaneous use of per 1,000 and per 100,000 scaling, rescaling the reproduced coefficient by 1,000 yields −97,727, matching the reported scale but not the exact value (paper: −97,410). This suggests that the discrepancy is not due solely to scaling; however, the reported p-value and R² were reproducible. In the initial multivariable model, the DALY coefficient was −30.15, whereas the reproduced estimate was −29.85. Although this difference was small and on the correct scale, all other coefficients and their associated statistics differed, suggesting that the data provided by the authors may differ from that used in the analyses.

Inferential reproducibility results

This paper was not inferentially reproducible. The univariate model for citations was successfully reproduced, with standardized coefficients and confidence interval bounds below 0.10, and with consistent coefficient directions and statistical significance. However, the final multivariable model contained a highly influential observation (Cook’s distance = 3.9). Although the authors acknowledged this outlier, no sensitivity analysis was conducted to assess its impact. Sensitivity analyses conducted here indicated substantial changes in effect sizes and linearity conclusions (as described below).

To mitigate the influence of an extreme observation, funding and GDP per capita were log-transformed, while citations were retained on the original scale. Each additional citation was associated with an 8.2% increase in funding (95% CI: 3.7% to 12.9%, p < 0.001). A 10% increase in GDP per capita was associated with an approximately 12% increase in funding. As the model was fitted on the log scale, fit statistics (e.g. R2, AIC, BIC) are not directly comparable with those from models on the original scale. The focus was therefore on obtaining a valid inferential model, assessed through residual diagnostics to ensure that the log transformation improved adherence to model assumptions and mitigated the impact of outliers. Although the influential observation continued to have some impact, its influence was substantially reduced, with Cook’s distance below 0.5. On the log scale, residual diagnostics did not indicate evidence of non-linearity in the association between GDP per capita and funding. Mild heteroscedasticity was observed, robust standard errors were used as a sensitivity analysis.

Model 1

Model results for Health research funding per 100,000

Term

B

SE

Lower

Upper

t

p-value

Intercept

Citations

139046

<0.001

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Health research funding per 100,000

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

0.64

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Health research funding per 100,000

Term

SS

DF

MS

F

p-value

Citations

Residuals

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Health research funding per 100,000

Term

B

SE

Lower

Upper

t

p-value

Intercept

−1,521,547.155

418,253.860

−2,381,280.277

−661,814.032

−3.638

0.0012

Citations

139,046.123

20,511.914

96,883.280

181,208.966

6.779

<0.001

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Health research funding per 100,000

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

0.799

0.639

0.625

837.129

675,118.574

45.952

1

26

<0.001

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Health research funding per 100,000

Term

SS

DF

MS

F

p-value

Citations

22,555,382,202,358.328

1

22,555,382,202,358.328

45.952

<0.001

Residuals

12,761,982,493,625.779

26

490,845,480,524.068

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

Term

Statistic

p-value

Results

Citations

1.366

0.1841

No linearity violation

Tukey test

1.366

0.1719

No linearity violation

Specification test for predictors using quadratic tests, for fitted values curvature is tested through Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

The residual and univariate plots show some signs of mild curvature but it was not significant therefore found to be linear.

Testing for homoscedasticity

Statistic

p-value

Parameter

Method

6.071

0.0137

1

studentized Breusch-Pagan test

Homoscedasticity results
  • The studentized Breusch-Pagan test indicates heteroscedasticity.
  • Some heteroscedasticity is present in plots, and a sensitivity analysis using weighted or robust regression or wild bootstrapping is recommended.
Model descriptives including cook’s distance and leverage to understand outliers

Term

N

Mean

SD

Median

Min

Max

Skewness

Kurtosis

Health research funding per 100,000

28

1,167,902.821

1,143,700.373

782,287.500

54,627.000

4,074,757.000

0.984

−0.178

Citations

28

19.342

6.573

18.230

8.600

32.250

0.144

−1.090

.fitted

28

1,167,902.821

913,993.723

1,013,263.669

−325,750.496

2,962,690.314

0.144

−1.090

.resid

28

0.000

687,507.104

44,991.747

−1,783,550.216

1,568,137.969

−0.033

0.426

.leverage

28

0.071

0.037

0.058

0.037

0.179

1.043

0.297

.sigma

28

700,025.221

24,751.280

710,911.088

611,566.783

714,429.733

−2.376

5.032

.cooksd

28

0.046

0.085

0.007

0.000

0.368

2.435

5.624

.std.resid

28

0.005

1.024

0.067

−2.636

2.379

−0.011

0.430

dfb.1_

28

0.001

0.193

−0.006

−0.582

0.386

−0.431

1.765

dfb.Cttn

28

−0.000

0.249

−0.012

−0.558

0.791

0.623

2.572

dffit

28

0.022

0.329

0.020

−0.813

0.952

0.240

1.497

cov.r

28

1.082

0.135

1.120

0.623

1.216

−1.978

3.560

* categorical variable

Cooks threshold

Cook’s distance measures the overall change in fit, if the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted values will change when the ith observation is removed. Potential influential observations \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\) where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can result in a large number of points identified, a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each parameter in the model. Potential influential observations \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model. A practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical cut-off between 0.9 to 1.1 was used to flag observations with meaningful impact on precision, although there is no agreed universal alternative cut-off.

Observations of interest identified by the influence plot

ID

StudRes

Leverage

CookD

dfb.1_

dfb.Cttn

dffit

cov.r

4

0.972

0.135

0.074

0.374

−0.328

0.383

1.161

9

−1.172

0.179

0.147

0.386

−0.489

−0.546

1.183

16

−3.020

0.068

0.252

0.343

−0.558

−0.813

0.623

20

2.638

0.115

0.368

−0.582

0.791

0.952

0.748

StudRes = studentized residuals; CookD = Cook's Distance a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

One observation had a studentized residual greater than 3, indicating a small outlier; however, this value remained within conventional thresholds for Cook’s distance, DFBETA and DFFit. The COVRATIO indicated observations that may affect confidence intervals widths.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic

p-value

Method

0.109

0.8568

Exact one-sample Kolmogorov-Smirnov test

Statistic

p-value

Method

0.976

0.7493

Shapiro-Wilk normality test

Normality results
  • The Kolmogorov-Smirnov supports residuals being normally distributed.
  • The Shapiro-Wilk supports residuals being normally distributed.
  • QQ-plot looks roughly normal.
Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation

Statistic

p-value

0.015

1.967

0.8800

Independence results
  • The Durbin–Watson test suggests there are no auto-correlation issues.
  • The study design seems to be independent.
Assumption conclusions

No meaningful departures from the assumptions of linearity and independence were observed. The Breusch-Pagan test was statistically significant and the residuals vs. fitted plot showed a funnel shape, suggesting heteroscedasticity and a sensitivity analysis using weighted or robust regression is recommended. Normality tests and the Q-Q plot indicated the residuals were approximately normally distributed. Outlier diagnostics indicated that point estimates were unlikely to be substantially affected by influential points, but confidence-interval width could be affected and should be further investigated.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Health research funding per 100,000

Change in regression coefficients

term

O_B

R_B

Change.B

reproduce.B

Intercept

−1,521,547.2

Citations

139046

139,046.1

0.1231

Reproduced

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in R2

O_R2

R_R2

Change.R2

Reproduce.R2

0.640

0.6386

−0.0014

Reproduced

O_R2 = original R2; R_R2 = reproduced R2; Change.R2 = change in R_R2 - O_R2

Change in p-values

Term

O_p

R_p

Change.p

Reproduce.p

SigChangeDirection

Intercept

0.0012

Citations

<0.001

<0.001

0.0000

Reproduced

Remains sig, B same direction

O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values

The p-value was reproduced.

Conclusion computational reproducibility

This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, there were indications that the residuals may exhibit slight heteroscedasticity. To further verify the findings, bootstrapped standardized regression coefficients and their 95% confidence intervals were examined. Percentage and absolute changes in estimates and confidence-interval bounds relative to the linear model were summarised using thresholds of 10% change and standardized coefficient differences of <0.10 and <0.20. Coefficient direction and statistical significance were assessed for consistency. Wild bootstrapping was used to account for potential heteroscedasticity.

Results

Bootstrapped results

Wild bootstrapping was performed with 10,000 iterations.

Change in regression coefficients

Term

B

boot.B

B_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.0000

−0.0005

0.0005

1,000.0000

Yes

No

No

z_Citations

0.7992

0.7988

0.0003

0.0400

No

No

No

B = standardized regression coefficient reproduced B; boot.B = boostrapped standardized reproduced B; B_diff = change in B - boot.B; %_Diff = percentage difference, percentage changes were truncated at ±1000%; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

Term

Lower

boot.Lower

Lower_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

−0.2380

−0.2165

−0.0214

−9.0100

No

No

No

z_Citations

0.5568

0.5542

0.0026

0.4700

No

No

No

Lower = standardized reproduced lower CI; boot.Lower = boostrapped standardized reproduced lower CI; Lower_diff = change in Lower - boot.Lower; %_change = percentage difference, percentage changes were truncated at ±1000%; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

Term

Upper

boot.Upper

Upper_diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.2380

0.2207

0.0173

7.2600

No

No

No

z_Citations

1.0415

1.0518

−0.0104

−0.9900

No

No

No

Upper = standardized reproduced upper CI; boot.Upper = boostrapped standardized reproduced upper CI; Upper_diff = change in Upper - boot.Upper; %_change = percentage difference, percentage changes were truncated at ±1000%; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

Term

Range

boot.Range

Range_Diff

%_Diff

Diff_10%

Diff_0.1

Diff_0.2

Intercept

0.4759

0.4372

−0.0387

−8.1400

No

No

No

z_Citations

0.4847

0.4976

0.0130

2.6700

No

No

No

Range = standardized reproduced CI range; boot.B = boostrapped standardized reproduced CI range; Range_diff = change in CI Range ; %_change = percentage difference, Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

Term

p-value

boot.p-value

changep

SigChangeDirection

Intercept

1.0000

0.9961

0.0039

Remains non-sig, B changes direction

z_Citations

<0.001

<0.001

0.0000

Remains sig, B same direction

p-value = standardized reproduced p-value; boot.p-value = boostrapped standardized reproduced p-value; changep = change in p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check the distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Conclusions based on the bootstrapped model

This model was inferentially reproducible. While some statistics changed by 10% or more, these differences were not meaningful, with a change in standardized regression coefficients of less than 0.1. The direction of effects and statistical significance remained consistent between the reproduced and bootstrapped models.

Model 2

Model results for Health research funding per 100,000

Term

B

SE

Lower

Upper

t

p-value

Intercept

Average_GDP_PC

18.41

−3.64

40.46

0.097

DALY_per_100000

−30.15

−79.93

19.64

0.222

Citations

100283

60269

140298

<0.001

population_quartile:

2nd quartile – 1st quartile

673962

69270

1278655

0.031

3rd quartile – 1st quartile

633305

6829

1259782

0.048

4th quartile – 1st quartile

190753

−436460

1352454

0.534

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Health research funding per 100,000

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

0.83

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Health research funding per 100,000

Term

SS

DF

MS

F

p-value

Average_GDP_PC

DALY_per_100000

Citations

population_quartile

0.082

Residuals

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results Health research funding per 100,000

Term

B

SE

Lower

Upper

t

p-value

Intercept

−692,099.707

986,584.344

−2,743,814.168

1,359,614.754

−0.702

0.4907

Average_GDP_PC

18.498

10.642

−3.633

40.628

1.738

0.0968

DALY_per_100000

−29.848

24.084

−79.934

20.237

−1.239

0.2289

Citations

100,211.969

19,259.752

60,159.122

140,264.816

5.203

<0.001

population_quartile:

2nd quartile – 1st quartile

672,227.000

291,219.684

66,602.513

1,277,851.487

2.308

0.0313

3rd quartile – 1st quartile

632,255.747

302,023.406

4,163.690

1,260,347.804

2.093

0.0486

4th quartile – 1st quartile

189,682.956

302,507.429

−439,415.682

818,781.594

0.627

0.5374

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Model fit for Health research funding per 100,000

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

0.908

0.825

0.775

826.816

469,729.694

16.508

6

21

<0.001

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Health research funding per 100,000

Term

SS

DF

MS

F

p-value

Average_GDP_PC

888,916,574,062.690

1

888,916,574,062.690

3.022

0.0968

DALY_per_100000

451,875,188,381.914

1

451,875,188,381.914

1.536

0.2289

Citations

7,964,757,698,839.988

1

7,964,757,698,839.988

27.073

<0.001

population_quartile

2,247,263,386,569.635

3

749,087,795,523.212

2.546

0.0834

Residuals

6,178,087,581,216.055

21

294,194,646,724.574

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Forest plot showing Original and Reproduced coefficients and 95% confidence intervals for Health research funding per 100,000

Change in regression coefficients

term

O_B

R_B

Change.B

reproduce.B

Intercept

−692,099.70702

Average_GDP_PC

18.41

18.49765

0.0877

Not Reproduced

DALY_per_100000

−30.15

−29.84845

0.3016

Not Reproduced

Citations

100283

100,211.96884

−71.0312

Not Reproduced

population_quartile:

2nd quartile – 1st quartile

673962

672,227.00026

−1734.9997

Not Reproduced

3rd quartile – 1st quartile

633305

632,255.74730

−1049.2527

Not Reproduced

4th quartile – 1st quartile

190753

189,682.95613

−1070.0439

Not Reproduced

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

term

O_lower

R_lower

Change.lci

Reproduce.lower

Intercept

−2,743,814.168434

Average_GDP_PC

−3.64

−3.632591

0.0074

Incorrect Rounding

DALY_per_100000

−79.93

−79.934011

−0.0040

Reproduced

Citations

60269

60,159.121670

−109.8783

Not Reproduced

population_quartile:

2nd quartile – 1st quartile

69270

66,602.513155

−2667.4868

Not Reproduced

3rd quartile – 1st quartile

6829

4,163.690195

−2665.3098

Not Reproduced

4th quartile – 1st quartile

−436460

−439,415.681550

−2955.6815

Not Reproduced

O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

term

O_upper

R_upper

Change.uci

Reproduce.upper

Intercept

1,359,614.75440

Average_GDP_PC

40.46

40.62789

0.1679

Not Reproduced

DALY_per_100000

19.64

20.23712

0.5971

Not Reproduced

Citations

140298

140,264.81600

−33.1840

Not Reproduced

population_quartile:

2nd quartile – 1st quartile

1278655

1,277,851.48736

−803.5126

Not Reproduced

3rd quartile – 1st quartile

1259782

1,260,347.80441

565.8044

Not Reproduced

4th quartile – 1st quartile

1352454

818,781.59380

−533672.4062

Not Reproduced

O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in R2

O_R2

R_R2

Change.R2

Reproduce.R2

0.830

0.8251

−0.0049

Reproduced

O_R2 = original R2; R_R2 = reproduced R2; Change.R2 = change in R_R2 - O_R2

Change in p-values

Term

O_p

R_p

Change.p

Reproduce.p

SigChangeDirection

Intercept

0.4907

Average_GDP_PC

0.097

0.0968

−0.0002

Reproduced

Remains non-sig, B same direction

DALY_per_100000

0.222

0.2289

0.0069

Not Reproduced

Remains non-sig, B same direction

Citations

<0.001

<0.001

0.0000

Reproduced

Remains sig, B same direction

population_quartile:

2nd quartile – 1st quartile

0.031

0.0313

0.0003

Reproduced

Remains sig, B same direction

3rd quartile – 1st quartile

0.048

0.0486

0.0006

Incorrect Rounding

Remains sig, B same direction

4th quartile – 1st quartile

0.534

0.5374

0.0034

Not Reproduced

Remains non-sig, B same direction

O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Bland Altman plot showing differences between original and reproduced p-values for Health research funding per 100,000

Results for p-values

Some p-values were reproduced, others were not.

Conclusion computational reproducibility

This model was not computationally reproducible.

As this model was not computationally reproducible, inferential reproducibility was not considered, since the original analyses could not be reproduced and therefore, statistical assumptions could not be meaningfully compared or interpreted.

Model 3

Model results for Health research funding per 100,000

Term

B

SE

Lower

Upper

t

p-value

Intercept

Average_GDP_PC

27.55

11.32

43.79

0.002

Citations

97560

57385

137736

<0.001

population_quartile:

2nd quartile – 1st quartile

700203

90777

1309630

0.026

3rd quartile – 1st quartile

721330

105599

1337061

0.024

4th quartile – 1st quartile

293956

−315950

903861

0.328

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit Statistics Health research funding per 100,000

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

0.82

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Health research funding per 100,000

Term

SS

DF

MS

F

p-value

Average_GDP_PC

Citations

population_quartile

0.067

Residuals

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Health research funding per 100,000

Term

B

SE

Lower

Upper

t

p-value

Intercept

−1,826,305.610

372,998.828

−2,599,857.834

−1,052,753.386

−4.896

<0.001

Average_GDP_PC

27.554

7.830

11.317

43.792

3.519

0.0019

Citations

97,560.323

19,372.294

57,384.644

137,736.002

5.036

<0.001

population_quartile:

2nd quartile – 1st quartile

700,203.348

293,858.942

90,777.201

1,309,629.494

2.383

0.0262

3rd quartile – 1st quartile

721,329.701

296,899.014

105,598.833

1,337,060.569

2.430

0.0237

4th quartile – 1st quartile

293,955.476

294,090.076

−315,950.013

903,860.964

1.000

0.3284

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Health research funding per 100,000

R

R2

R2Adj

AIC

RMSE

F

DF1

DF2

p-value

0.901

0.812

0.770

826.792

486,604.957

19.039

5

22

<0.001

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Health research funding per 100,000

Term

SS

DF

MS

F

p-value

Average_GDP_PC

3,732,393,212,495.432

1

3,732,393,212,495.432

12.385

0.0019

Citations

7,643,156,607,888.156

1

7,643,156,607,888.156

25.362

<0.001

population_quartile

2,482,817,319,025.463

3

827,605,773,008.488

2.746

0.0673

Residuals

6,629,962,769,597.969

22

301,361,944,072.635

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

Term

Statistic

p-value

Results

Average_GDP_PC

−1.765

0.0921

No linearity violation

Citations

1.455

0.1604

No linearity violation

population_quartile

Tukey test

2.740

0.0061

Linearity violation

Specification test for predictors using quadratic tests, for fitted values curvature is tested through Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

Plots and tests show an unaccounted non-linear relationship may be present.

Testing for homoscedasticity

Statistic

p-value

Parameter

Method

6.416

0.2678

5

studentized Breusch-Pagan test

Homoscedasticity results
  • The studentized Breusch-Pagan test supports homoscedasticity.
  • Some heteroscedasticity is present in plots, and a sensitivity analysis using weighted or robust regression or wild bootstrapping is recommended.
Model descriptives including cook’s distance and leverage to understand outliers

Term

N

Mean

SD

Median

Min

Max

Skewness

Kurtosis

Health research funding per 100,000

28

1,167,902.821

1,143,700.373

782,287.500

54,627.000

4,074,757.000

0.984

−0.178

Average_GDP_PC

28

24,617.461

15,978.688

20,416.670

5,388.890

79,566.670

1.366

2.479

Citations

28

19.342

6.573

18.230

8.600

32.250

0.144

−1.090

population_quartile*

28

2.500

1.139

2.500

1.000

4.000

0.000

−1.475

.fitted

28

1,167,902.821

1,030,774.644

955,108.840

−470,784.916

2,981,548.152

0.333

−1.208

.resid

28

0.000

495,534.234

−37,693.413

−757,719.152

1,295,335.078

0.474

−0.364

.leverage

28

0.214

0.120

0.180

0.146

0.786

3.886

15.730

.sigma

28

546,365.025

23,828.707

554,394.043

464,893.826

561,881.093

−2.446

5.305

.cooksd

28

0.183

0.743

0.022

0.000

3.959

4.681

20.843

.std.resid

28

−0.047

1.096

−0.075

−2.542

2.634

0.072

−0.099

dfb.1_

28

−0.042

0.372

−0.015

−1.512

0.661

−1.970

6.563

dfb.A_GD

28

−0.136

0.988

0.006

−5.124

0.601

−4.571

20.211

dfb.Cttn

28

0.074

0.593

−0.009

−0.803

2.825

3.395

13.760

dfb.p_2q

28

0.053

0.417

0.000

−0.537

1.920

3.087

11.660

dfb.p_3q

28

0.046

0.352

0.002

−0.351

1.314

1.970

4.272

dfb.p_4q

28

0.052

0.373

0.008

−0.381

1.757

3.317

12.726

dffit

28

−0.182

1.201

−0.032

−5.666

1.544

−3.267

12.706

cov.r

28

1.292

0.338

1.377

0.170

1.720

−1.332

2.005

* categorical variable

Cooks threshold

Cook’s distance measures the overall change in fit, if the ith observation is removed. Potentially influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice, a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted values will change when the ith observation is removed. Potential influential observations \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\) where p is the number of predictors (including the intercept), and n is the number of observations. In practice, this can result in a large number of points identified, a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each model parameter. Potential influential observations \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model. A practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical cut-off between 0.9 to 1.1 was used to flag observations with meaningful impact on precision, although there is no agreed universal alternative cut-off.

Observations of interest identified by the influence plot

ID

StudRes

Leverage

CookD

dfb.1_

dfb.A_GD

dfb.Cttn

dfb.p_2q

dfb.p_3q

dfb.p_4q

dffit

cov.r

9

−1.727

0.304

0.199

0.661

0.250

−0.803

−0.537

0.125

−0.055

−1.140

0.856

20

3.111

0.198

0.285

−0.657

0.178

0.578

−0.034

0.822

0.051

1.544

0.170

18

−2.955

0.786

3.959

−1.512

−5.124

2.825

1.920

1.314

1.757

−5.666

0.767

StudRes = studentized residuals; CookD = Cook's Distance a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

Two observations had studentized residuals near 3. One of these observations was an inflential outlier with Cook’s distance of approximately 4, with DFBETAS and DFFITS also outside conventional ranges. COVRATIO values substantially < 1, suggesting the observation may effect point estimates but potential inflation of standard errors (wider confidence intervals).

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic

p-value

Method

0.107

0.8713

Exact one-sample Kolmogorov-Smirnov test

Statistic

p-value

Method

0.986

0.9608

Shapiro-Wilk normality test

Normality results
  • The Kolmogorov-Smirnov supports residuals being normally distributed.
  • The Shapiro-Wilk supports residuals being normally distributed.
  • QQ-plot looks roughly normal.
Assessing collinearity with VIF

Term

VIF

Tolerance

Average_GDP_PC

1.184

0.844

Citations

1.205

0.830

population_quartile

1.009

0.991

VIF = Variance Inflation Factor.

Collinearity results
  • All VIF values are under three, indicating no collinearity issues.
  • Overall, when taking into account VIF and SE, the model does not have collinearity issues.
Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation

Statistic

p-value

−0.007

2.001

0.9440

Independence results
  • The Durbin–Watson test suggests there are no auto-correlation issues.
  • The study design seems to be independent.
Assumption conclusions

No meaningful departures from the assumptions of normality or independence were observed. However, residual plots and diagnostic tests indicated potential issues with linearity, suggesting that a non-linear relationship may not have been adequately captured, and residuals may be mildly heteroscedastic. Additionally, the presence of observations with high DFBETA, DFFIT and COVRatio indicates that further investigation of potential influential points is warranted.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Health research funding per 100,000

Change in regression coefficients

term

O_B

R_B

Change.B

reproduce.B

Intercept

−1,826,305.60981

Average_GDP_PC

27.55

27.55404

0.0040

Reproduced

Citations

97560

97,560.32327

0.3233

Reproduced

population_quartile:

2nd quartile – 1st quartile

700203

700,203.34761

0.3476

Reproduced

3rd quartile – 1st quartile

721330

721,329.70087

−0.2991

Reproduced

4th quartile – 1st quartile

293956

293,955.47583

−0.5242

Incorrect Rounding

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

term

O_lower

R_lower

Change.lci

Reproduce.lower

Intercept

−2,599,857.83390

Average_GDP_PC

11.32

11.31659

−0.0034

Reproduced

Citations

57385

57,384.64406

−0.3559

Reproduced

population_quartile:

2nd quartile – 1st quartile

90777

90,777.20122

0.2012

Reproduced

3rd quartile – 1st quartile

105599

105,598.83287

−0.1671

Reproduced

4th quartile – 1st quartile

−315950

−315,950.01284

−0.0128

Reproduced

O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

term

O_upper

R_upper

Change.uci

Reproduce.upper

Intercept

−1,052,753.3857

Average_GDP_PC

43.79

43.7915

0.0015

Reproduced

Citations

137736

137,736.0025

0.0025

Reproduced

population_quartile:

2nd quartile – 1st quartile

1309630

1,309,629.4940

−0.5060

Incorrect Rounding

3rd quartile – 1st quartile

1337061

1,337,060.5689

−0.4311

Reproduced

4th quartile – 1st quartile

903861

903,860.9645

−0.0355

Reproduced

O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in R2

O_R2

R_R2

Change.R2

Reproduce.R2

0.820

0.8123

−0.0077

Incorrect Rounding

O_R2 = original R2; R_R2 = reproduced R2; Change.R2 = change in R_R2 - O_R2

Change in p-values

Term

O_p

R_p

Change.p

Reproduce.p

SigChangeDirection

Intercept

<0.001

Average_GDP_PC

0.002

0.0019

−0.0001

Reproduced

Remains sig, B same direction

Citations

<0.001

<0.001

0.0000

Reproduced

Remains sig, B same direction

population_quartile:

2nd quartile – 1st quartile

0.026

0.0262

0.0002

Reproduced

Remains sig, B same direction

3rd quartile – 1st quartile

0.024

0.0237

−0.0003

Reproduced

Remains sig, B same direction

4th quartile – 1st quartile

0.328

0.3284

0.0004

Reproduced

Remains non-sig, B same direction

O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Bland Altman Plot showing differences between original and reproduced p-values for Health research funding per 100,000

Results for p-values

P-values were reproduced with all differences close to zero.

Conclusion computational reproducibility

This model was mostly computationally reproducible, with minor rounding errors. P-values were reproduced and had the same interpretation, and regression coefficients did not change direction.

Methods

The model was successfully reproduced; however, residual diagnostics indicated potential heteroscedasticity, nonlinearity, and observations that may influence point estimates or confidence-interval widths. Evidence of non-linearity was primarily associated with GDP. To assess the impact of influential observations, a sensitivity analysis was conducted in which all continuous variables were standardized, and models were fitted with and without the influential observation to evaluate changes in effect size.

To further examine departures from linearity, models fitted on the original scale were visualised by comparing linear, quadratic, and cubic specifications, both with and without the influential observation. As the apparent polynomial relationship was driven by the influential observation, an additional sensitivity analysis was undertaken in which both the outcome and GDP were log-transformed to assess whether this specification improved linearity and model assumptions.

Results

This model was not inferentially reproducible with outliers and linearity issue substantially changing the results.

Standardized z_Funding_per_100000 showing with and without outlier

The authors identified an outlying value for Average_GDP_PC corresponding to Luxembourg; however, this observation was retained in the analysis without a sensitivity assessment. Removal of this observation resulted in substantial changes in coefficient estimates, with GDP more than doubling without the outlier, consistent with being an influential point identified by a large Cook’s distance. Although it is generally appropriate not to exclude outliers without strong justification, influential observations should be explicitly examined through sensitivity analyses to assess their impact on model estimates and inference.

Characteristic
All_data
Outlier Removed
N Beta 95% CI1 p-value N Beta 95% CI1 p-value
(Intercept) 28 -0.37 -0.75, 0.00 0.051 27 -0.10 -0.48, 0.28 0.594
z_Citations 28 0.56 0.33, 0.79 <0.001 27 0.29 0.01, 0.57 0.040
z_Average_GDP_PC 28 0.38 0.16, 0.61 0.002 27 0.64 0.35, 0.93 <0.001
population_quartile 28


27


    1st quartile



    2nd quartile
0.61 0.08, 1.1 0.026
0.18 -0.35, 0.72 0.484
    3rd quartile
0.63 0.09, 1.2 0.024
0.33 -0.17, 0.83 0.182
    4th quartile
0.26 -0.28, 0.79 0.328
-0.13 -0.66, 0.40 0.614
1 CI = Confidence Interval
Comparison of Funding_per_100000 with non-linear terms for Average_GDP_PC with outlier included

It can be seen that linear, quadratic and cubic models do not fit the data well, all models are heavily impacted by the outlier.

Comparison of Funding_per_100000 with non-linear terms for Average_GDP_PC without outlier included

After excluding the influential observation, a quadratic association between funding and GDP was evident. Although this sensitivity analysis is informative for understanding the underlying functional form, the observation should be retained in the primary analysis and modelled appropriately. Given that a single observation can exert disproportionate influence, the data were log-transformed to stabilise the variance and reduce the impact of extreme values.

To mitigate the influence of an extreme observation, funding and GDP per capita were log-transformed, while citations were retained on the original scale.

Back transformed coefficients
Characteristic exp(β) 95% CI1 p-value
(Intercept) 0.841 0.020, 34.665 0.924
Citations 1.082 1.037, 1.129 <0.001
log_Average_GDP_PC 3.301 2.162, 5.042 <0.001
population_quartile


    1st quartile
    2nd quartile 1.213 0.676, 2.176 0.501
    3rd quartile 1.387 0.770, 2.498 0.261
    4th quartile 1.059 0.588, 1.907 0.842
1 CI = Confidence Interval

Each additional citation was associated with an 8.2% increase in funding (95% CI: 3.7% to 12.9%, p < 0.001). There were no statistically significant differences in funding between population quartiles. For example, countries in the second population quartile had 1.2 times the funding as those in the first quartile (95% CI: 0.68 to 2.18, p = 0.501).

The exponentiated coefficient for log(GDP) (3.3) represents the change in funding associated with an approximately 2.7-fold increase in GDP (because natural logarithms are used) and is therefore difficult to interpret directly. Because both funding and GDP were log-transformed, the coefficient for log(GDP) represents an elasticity. The coefficient (1.194; see table below) indicates that funding is proportional to GDP raised to the power of 1.194, meaning funding increases with GDP according to this power relationship. For example, a 10% increase in GDP corresponds to an estimated 12% increase in funding (1.101.194 - 1).

As the model was fitted on the log scale, fit statistics (e.g. R2, AIC, BIC) are not directly comparable with those from models on the original scale. The focus was therefore on obtaining a valid inferential model, assessed through residual diagnostics to ensure that the log transformation improved adherence to model assumptions and mitigated the impact of outliers. Although the influential observation continued to have some impact, its influence was substantially reduced, with Cook’s distance below 0.5. On the log scale, Residual diagnostics did not indicate evidence of non-linearity between GDP per capita and funding.

Residual plots

Residual diagnostics indicated mild heteroscedasticity. As a sensitivity analysis, heteroscedasticity-robust standard errors were applied, leading to inflated standard errors relative to the conventional OLS estimates and was therefore appropriate.

Comparison of OLS and robust standard errors with log coefficients

Ordinary least squares

Robust SE (HC3)

Variable

Estimate

SE

Lower

Upper

p-value

Estimate

SE

Lower

Upper

p-value

(Intercept)

−0.173

1.793

−3.892

3.546

0.9239

−0.173

2.621

−5.310

4.963

0.9478

Citations

0.079

0.020

0.037

0.122

<0.001

0.079

0.032

0.016

0.142

0.0228

log_Average_GDP_PC

1.194

0.204

0.771

1.618

<0.001

1.194

0.317

0.572

1.816

0.0011

population_quartile2nd quartile

0.193

0.282

−0.392

0.777

0.5011

0.193

0.446

−0.682

1.067

0.6699

population_quartile3rd quartile

0.327

0.284

−0.261

0.916

0.2613

0.327

0.383

−0.424

1.078

0.4026

population_quartile4th quartile

0.057

0.284

−0.531

0.646

0.8417

0.057

0.393

−0.713

0.828

0.8854