Paper 19: Associations between industry involvement and study characteristics at the time of trial registration in biomedical research

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

April 24, 2026

References

Seidler AL, Hunter KE, Chartres N, Askie LM (2019) Associations between industry involvement and study characteristics at the time of trial registration in biomedical research. PLoS ONE 14(9): e0222117. https://doi.org/10.1371/journal.pone.0222117

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.
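The rounding-tolerance rule can be sketched as follows; `classify` and its exact tolerances are illustrative stand-ins, not the project's actual code:

```python
def classify(original: float, reproduced: float, decimals: int) -> str:
    """Categorise one reported statistic against its reproduced value.

    Illustrative rule: agreement within half a unit of the last reported
    decimal place counts as Reproduced; disagreement by at most one unit
    in that place is treated as Incorrect Rounding.
    """
    eps = 1e-9                                # guard against float noise
    half_unit = 0.5 * 10 ** (-decimals)
    if abs(original - reproduced) <= half_unit + eps:
        return "Reproduced"
    if abs(original - round(reproduced, decimals)) <= 10 ** (-decimals) + eps:
        return "Incorrect Rounding"
    return "Not Reproduced"

# Values from this report: the coefficient matched, the upper CI did not.
print(classify(-152.99, -152.9907, 2))   # Reproduced
print(classify(-74.98, -73.9875, 2))     # Not Reproduced
```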

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.
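The change thresholds can be expressed directly; the helper below is a hypothetical illustration, applied to the standardised sponsor coefficient from this report's bootstrap comparison:

```python
def meaningful_change(b_orig: float, b_new: float) -> dict:
    """Flag a standardised-coefficient change against the report's thresholds."""
    abs_diff = abs(b_new - b_orig)
    rel_diff = abs_diff / abs(b_orig) * 100 if b_orig != 0 else float("inf")
    return {
        "relative_10pct": rel_diff >= 10,   # relative change of 10% or more
        "absolute_0.1": abs_diff >= 0.1,    # moderate absolute difference
        "absolute_0.2": abs_diff >= 0.2,    # substantial absolute difference
    }

# Reproduced vs bootstrapped standardised coefficient from this report
print(meaningful_change(-0.2459, -0.2458))   # all False
```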

Results from the reproduction of the Seidler et al. (2019) paper are presented below. An overall summary of results is presented first, followed by model-specific results organised within tab panels. Within each panel, the Original results tab displays the linear regression outputs extracted from the published paper. The Reproduced results tab presents estimates derived from the authors’ shared data, along with a comprehensive assessment of linear regression assumptions. The Differences tab compares the original and reproduced models to assess computational reproducibility. Finally, the Sensitivity analysis tab evaluates inferential reproducibility by examining whether identified assumption violations meaningfully affected the results.

Summary from statistical review

This study examined associations between industry involvement and study characteristics at the time of trial registration in biomedical research. Linear regression was a minor part of the paper, used to estimate the mean difference in target sample size by industry involvement. Interpretation of the regression coefficient was poor: the authors stated only that ‘The mean difference between trials with and without industry involvement was -153 (95% CI = -231 to -74, p < .001)’, rather than stating that trials with industry involvement had a smaller target sample size, MD = -153 (95% CI = -231 to -74, p < 0.001), than trials without. Although this seems relatively minor, it impedes interpretation. Assumptions were not discussed.

Data availability and software used

The data were provided in the supporting information as an Excel file in both long and wide formats, with no data dictionary. All analyses were performed using R.

Regression sample

There was one linear regression reported with target sample size as the outcome variable, therefore no randomisation of models was required.

Computational reproducibility results

The results for this paper were mostly reproduced. A summary of results was presented in the text (coefficient and 95% confidence interval), with more decimal places presented in the Supporting information, which was used for reproducing results. These results were moderately difficult to reproduce because the authors provided raw data that required manipulating and merging four datasheets into both long and wide formats. Understanding how to reproduce the results required reading both the paper and the supporting information, as six outliers were excluded but this was not mentioned in the main text of the paper. The original analysis found that the target sample size was smaller with industry involvement, MD = -153 (95% CI = -231 to -74, p < 0.001). There was a typographical error in the supplementary results, which reported the upper confidence limit as -74.98; the reproduced value was -73.987, which was correctly reported in the text of the paper as -74.

Inferential reproducibility results

The regression in this paper was not inferentially reproducible, with gross violations of normality and heteroscedasticity. Removing outliers did not address the skewed distribution of target sample size, as indicated by large discrepancies between means and medians and SDs greatly exceeding IQRs (e.g., no industry: mean = 249, 95% CI 211–295, SD = 689; median = 70, IQR = 125; industry: mean = 96, 95% CI 26–166, SD = 198; median = 45, IQR = 76). Outlier removal reduced the variance but risked bias, as large sample sizes are often required when rare events occur or small effects are of interest.
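The skew heuristics can be checked directly from the reported group summaries (values copied from the paragraph above): a mean far above the median, and an SD exceeding the IQR, both suggest right skew.

```python
# Reported descriptives for target sample size by sponsor group
groups = {
    "no industry": {"mean": 249, "median": 70, "sd": 689, "iqr": 125},
    "industry":    {"mean": 96,  "median": 45, "sd": 198, "iqr": 76},
}
# Flag right skew where mean >> median and SD > IQR
skew_flags = {
    name: g["mean"] > g["median"] and g["sd"] > g["iqr"]
    for name, g in groups.items()
}
print(skew_flags)   # both groups flag as right-skewed
```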

Target sample size was standardised and a heteroscedastic bootstrap model was fitted, which showed a standardised change in confidence interval range of –0.1. While bootstrapping provided a comparison on the original additive scale, the clear mean–variance relationship (variance increasing with the mean) indicated multiplicative rather than additive error. Alternative distributions were therefore considered. The negative binomial and gamma models improved fit, but residuals continued to show overdispersion. The inverse Gaussian model provided the best fit, with an AIC 4,784 lower than linear regression (an AIC difference of 10 is considered large), and produced narrower confidence intervals than the negative binomial and gamma models, suggesting greater precision. This model estimated that the mean sample size was 61% lower with industry involvement (ratio = 0.39, 95% CI 0.30–0.50), with geometric means of 249 (95% CI 211–295) for no industry involvement and 96 (95% CI 78–118) for industry involvement. Notably, the confidence interval for industry involvement under the inverse Gaussian model was much narrower than under linear regression, reflecting improved fit and greater precision in estimating the mean.

Model 1

Model results for Target Sample Size

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept |  |  |  |  |  |  |
| Sponsor_Type: Any industry involvement – No industry involvement | −152.99 |  | −231.99 | −74.98 |  | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Target Sample Size

| R | R2 | R2 Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
|  |  |  |  |  |  |  |  |  |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Target Sample Size

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Sponsor_Type |  |  |  |  |  |
| Residuals |  |  |  |  |  |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Target Sample Size

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | 249.145 | 18.435 | 212.981 | 285.308 | 13.515 | <0.001 |
| Sponsor_Type: Any industry involvement – No industry involvement | −152.991 | 40.274 | −231.994 | −73.987 | −3.799 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Target Sample Size

| R | R2 | R2 Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.100 | 0.010 | 0.009 | 22,400.217 | 618.730 | 14.430 | 1 | 1425 | <0.001 |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Target Sample Size

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Sponsor_Type | 5,532,050.506 | 1 | 5,532,050.506 | 14.430 | <0.001 |
| Residuals | 546,293,544.369 | 1425 | 383,363.891 |  |  |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
|---|---|---|---|
| 5.368 | 0.0205 | 1 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test indicates heteroscedasticity.
  • The variance was substantially larger in the no industry involvement group, which can be seen visually in the plots.
Model descriptives including Cook’s distance and leverage to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| Target Sample Size | 1427 | 217.088 | 622.073 | 60.000 | 1.000 | 6,800.000 | 6.863 | 54.913 |
| Sponsor_Type* | 1427 | 1.790 | 0.407 | 2.000 | 1.000 | 2.000 | −1.426 | 0.033 |
| .fitted | 1427 | 217.088 | 62.285 | 249.145 | 96.154 | 249.145 | −1.426 | 0.033 |
| .resid | 1427 | −0.000 | 618.947 | −149.145 | −247.145 | 6,550.855 | 6.829 | 54.701 |
| .leverage | 1427 | 0.001 | 0.001 | 0.001 | 0.001 | 0.003 | 1.426 | 0.033 |
| .sigma | 1427 | 619.162 | 1.662 | 619.363 | 594.534 | 619.381 | −10.785 | 125.432 |
| .cooksd | 1427 | 0.000 | 0.003 | 0.000 | 0.000 | 0.050 | 10.414 | 117.972 |
| .std.resid | 1427 | −0.000 | 1.000 | −0.241 | −0.399 | 10.585 | 6.829 | 54.696 |
| dfb.1_ | 1427 | −0.000 | 0.009 | −0.000 | −0.009 | 0.188 | 14.536 | 263.958 |
| dfb.S_ii | 1427 | 0.000 | 0.016 | −0.003 | −0.168 | 0.150 | 3.171 | 47.966 |
| dffit | 1427 | 0.000 | 0.031 | −0.007 | −0.012 | 0.328 | 6.644 | 51.734 |
| cov.r | 1427 | 1.001 | 0.010 | 1.002 | 0.851 | 1.005 | −10.397 | 118.087 |

* categorical variable

Cook’s threshold

Cook’s distance measures the overall change in model fit when an observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted value changes when an observation is removed. Potential influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can flag a large number of points, so DFFIT \(\pm 1\) is often used to identify highly influential observations.

DFBETA threshold

DFBETA measures the change in a regression coefficient, in units of its standard error, when a particular observation is removed from the model. There is a DFBETA for each parameter in the model. Potential influential observations are identified by \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. In practice, DFBETA \(\pm 1\) is often used to identify outliers.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentised residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical cut-off between 0.9 and 1.1 was used to flag observations with meaningful impact on precision, although there is no agreed universal alternative cut-off.

Observations of interest identified by the influence plot

| ID | StudRes | Leverage | CookD | dfb.1_ | dfb.S_ii | dffit | cov.r |
|---|---|---|---|---|---|---|---|
| 4 | 0.039 | 0.003 | 0.000 | 0.002 | −0.002 | 0.002 | 1.005 |
| 16 | 0.310 | 0.003 | 0.000 | 0.018 | −0.016 | 0.018 | 1.005 |
| 1188 | 10.694 | 0.001 | 0.047 | 0.000 | 0.146 | 0.319 | 0.859 |
| 832 | 11.023 | 0.001 | 0.050 | 0.000 | 0.150 | 0.328 | 0.851 |

StudRes = studentized residuals; CookD = Cook's Distance, a combined measure of leverage and influence; DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points
  • Although the largest studentised residual was 11, Cook’s distance, DFBETAS, and DFFITS were within conventional thresholds, indicating the outliers were not influential on the mean. The COVRATIO, however, flagged several observations that may inflate the variance.
Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

| Statistic | p-value | Method |
|---|---|---|
| 0.345 | <0.001 | Asymptotic one-sample Kolmogorov-Smirnov test |
| 0.328 | <0.001 | Shapiro-Wilk normality test |

Normality results
  • The Kolmogorov-Smirnov test indicates residuals may not be normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot indicates the residuals are not normally distributed.
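The two normality tests above can be sketched with `scipy`; the residuals here are simulated with heavy right skew as a stand-in for the model's actual residuals, which behave similarly and also reject normality.

```python
import numpy as np
from scipy import stats

# Simulated, centred residuals with heavy right skew (exponential)
rng = np.random.default_rng(0)
resid = rng.exponential(scale=600, size=1427)
resid = resid - resid.mean()

# Shapiro-Wilk on the raw residuals; KS against a standard normal
sw_stat, sw_p = stats.shapiro(resid)
ks_stat, ks_p = stats.kstest(resid / resid.std(), "norm")
print(f"Shapiro-Wilk W = {sw_stat:.3f}, p = {sw_p:.3g}")
print(f"Kolmogorov-Smirnov D = {ks_stat:.3f}, p = {ks_p:.3g}")
```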
Assessing independence with the Durbin–Watson test for autocorrelation

| AutoCorrelation | Statistic | p-value |
|---|---|---|
| 0.061 | 1.878 | 0.0540 |

Independence results
  • The Durbin–Watson test suggests there are no auto-correlation issues.
  • The study design does not guarantee independent observations; clustering should be assessed using linear mixed models or generalized estimating equations.
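A minimal Durbin–Watson check with `statsmodels`, on simulated independent residuals; values near 2 indicate no first-order autocorrelation (this model reported DW = 1.878).

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Independent normal residuals should give a DW statistic close to 2
rng = np.random.default_rng(0)
resid = rng.normal(size=1427)
dw = durbin_watson(resid)
print(f"DW = {dw:.3f}")
```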
Assumption conclusions

Assumptions of linear regression were not satisfied: residuals showed marked non-normality and heteroscedasticity, and a clear mean–variance relationship suggested a multiplicative error structure, indicating that alternative distributions should be considered. Independence also appeared questionable, with potential clustering where some organisations contributed multiple observations; this should be examined using mixed-effects models.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Target Sample Size

Change in regression coefficients

| Term | O_B | R_B | Change.B | Reproduce.B |
|---|---|---|---|---|
| Intercept |  | 249.1445 |  |  |
| Sponsor_Type: Any industry involvement – No industry involvement | −152.99 | −152.9907 | −0.0007 | Reproduced |

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

| Term | O_lower | R_lower | Change.lci | Reproduce.lower |
|---|---|---|---|---|
| Intercept |  | 212.9812 |  |  |
| Sponsor_Type: Any industry involvement – No industry involvement | −231.99 | −231.9938 | −0.0038 | Reproduced |

O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; Change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

| Term | O_upper | R_upper | Change.uci | Reproduce.upper |
|---|---|---|---|---|
| Intercept |  | 285.3078 |  |  |
| Sponsor_Type: Any industry involvement – No industry involvement | −74.98 | −73.9875 | 0.9925 | Not Reproduced |

O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; Change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in p-values

| Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
|---|---|---|---|---|---|
| Intercept |  | <0.001 |  |  |  |
| Sponsor_Type: Any industry involvement – No industry involvement | <0.001 | <0.001 | 0.0000 | Reproduced | Remains sig, B same direction |

O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value R_p - O_p; Reproduce.p = p-values reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values

The p-value for this model was reproduced.

Conclusion computational reproducibility

This model was mostly computationally reproducible. The regression coefficient and p-value were reproduced, with a typographical error identified in the upper confidence interval.

Methods

While the model was mostly computationally reproduced, the residuals showed substantial non-normality and heteroscedasticity. The original aim was to estimate the mean difference in target sample size by sponsor type; since linear regression is equivalent to a two-sample t-test in this setting, a rank-based alternative such as the Mann–Whitney U test could provide a simple comparison. However, potential clustering by organisation may also have been present. This was not assessed in the current sensitivity analysis, as the documentation was unclear, but ideally the authors should have considered a mixed-effects model to account for such clustering.
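The suggested rank-based alternative can be sketched as follows; group sizes and lognormal parameters are assumptions chosen to mimic the reported medians (~70 vs ~45), since the actual registry data are not re-analysed here.

```python
import numpy as np
from scipy import stats

# Simulated right-skewed target sample sizes by sponsor group
rng = np.random.default_rng(42)
no_industry = rng.lognormal(mean=np.log(70), sigma=1.3, size=1127)
industry = rng.lognormal(mean=np.log(45), sigma=1.2, size=300)

# Mann-Whitney U: distribution-free comparison of the two groups
u_stat, p_value = stats.mannwhitneyu(no_industry, industry, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.3g}")
```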

Target sample size was standardised and a heteroscedastic bootstrap model was fitted. While bootstrapping provides a comparison on the original additive scale, the mean–variance relationship warranted further modelling because the outcome was a count-like variable and initial descriptive statistics indicated overdispersion (with variance exceeding the mean). Therefore, comparisons of alternative distributions were conducted on the unstandardised outcome.

A negative binomial model was first fitted; however, residual diagnostics continued to show evidence of dispersion, and visual inspection of fitted values revealed a clear mean–variance relationship. To better capture this, other candidate distributions were considered: a gamma model, which assumes a quadratic mean–variance relationship, and an inverse Gaussian model, which allows for a cubic mean–variance relationship. Model performance was then evaluated using the Akaike Information Criterion (AIC) to determine the best-fitting distribution.

Results

The regression coefficients were consistent between the bootstrap and reproduced models, indicating no difference in point estimates. However, the upper and lower confidence limits shifted by more than 10%, although neither shift exceeded the absolute threshold of ±0.1. The standardised confidence interval range narrowed by 0.1, with the bootstrap interval 42% narrower than that of the reproduced model, indicating the model is not inferentially reproducible.

Bootstrapping results

Wild bootstrapping was performed with 10,000 iterations.
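A minimal wild bootstrap for the group coefficient, assuming Rademacher weights (the report does not state the weight distribution used); the data are simulated with group-dependent, skewed errors, with 10,000 iterations as above.

```python
import numpy as np

# Simulated two-group outcome with heteroscedastic, mean-zero errors
rng = np.random.default_rng(0)
n = 500
x = rng.binomial(1, 0.2, n)
err = (rng.exponential(400, n) - 400) * (1 - 0.5 * x)
y = 249 - 153 * x + err

# Fit OLS once, then resample by flipping residual signs (wild bootstrap)
X = np.column_stack([np.ones(n), x])
pinv = np.linalg.pinv(X)
beta = pinv @ y
resid = y - X @ beta
fitted = X @ beta

boot = np.empty(10_000)
for b in range(boot.size):
    w = rng.choice([-1.0, 1.0], size=n)          # Rademacher weights
    boot[b] = (pinv @ (fitted + resid * w))[1]   # refit on perturbed outcome

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"group effect {beta[1]:.1f}, wild-bootstrap 95% CI ({lo:.1f}, {hi:.1f})")
```

Because the weights are resampled per observation, the bootstrap preserves each point's own error variance, which is what makes the interval heteroscedasticity-robust.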

Change in regression coefficients

| Term | B | boot.B | B_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Sponsor_Type1 | −0.2459 | −0.2458 | −0.0001 | −0.0500 | No | No | No |

B = standardized reproduced B; boot.B = bootstrapped standardized reproduced B; B_diff = change in B - boot.B; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

| Term | Lower | boot.Lower | Lower_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Sponsor_Type1 | −0.3729 | −0.3216 | −0.0513 | −13.7600 | Yes | No | No |

Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized reproduced lower CI; Lower_diff = change in Lower - boot.Lower; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

| Term | Upper | boot.Upper | Upper_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Sponsor_Type1 | −0.1189 | −0.1718 | 0.0528 | 44.4300 | Yes | No | No |

Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized reproduced upper CI; Upper_diff = change in Upper - boot.Upper; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

| Term | Range | boot.Range | Range_Diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Sponsor_Type1 | 0.2540 | 0.1498 | −0.1042 | −41.0200 | Yes | Yes | No |

Range = standardized reproduced CI range; boot.Range = bootstrapped standardized reproduced CI range; Range_Diff = change in CI range; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

| Term | p-value | boot.p-value | changep | SigChangeDirection |
|---|---|---|---|---|
| Sponsor_Type1 | 0.0002 | 0.0000 | 0.0002 | Remains sig, B same direction |

p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized reproduced p-value; changep = change in p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Comparison of distributions

The linear model had an AIC of 22400. A negative binomial model improved fit (AIC 17757), but DHARMa residual diagnostics indicated deviations from ideal assumptions, including overdispersion, outliers, and a lack-of-fit signal. A gamma model was then fitted; although overdispersion and outliers were no longer evident, its AIC was higher (19239) and the residual diagnostics worsened, indicating misspecification. Finally, an inverse Gaussian model achieved the lowest AIC (17616). Its diagnostics improved, with outliers and overdispersion no longer an issue; a slight residual departure from the target distribution remained, but the inverse Gaussian was retained as the best overall fit (inverse Gaussian residuals shown below). Because outliers and potential organisational clustering were not formally evaluated, the final choice of distribution should be based on models that explicitly account for these features.

Note on “normality” in DHARMa: for GLMs (e.g., negative binomial, gamma, inverse Gaussian), residuals are not expected to be Gaussian. DHARMa uses simulated, scaled residuals that should be uniform on (0,1) under a correctly specified model. Apparent “non-normality” in this context means non-uniformity of these residuals (e.g., a significant KS test or QQ plot deviations), which signals potential misspecification (wrong mean–variance form, link, or missing structure), not a violation of Gaussian residual assumptions.

| Model | AIC | (Intercept) | 95% CI | p-value | Sponsor_Type1 | 95% CI | p-value |
|---|---|---|---|---|---|---|---|
| Linear (Beta) | 22400 | 249 | 213, 285 | <0.001 | −153 | −232, −74 | <0.001 |
| Negative Binomial (Ratio) | 17757 | 249 | 231, 269 | <0.001 | 0.39 | 0.33, 0.46 | <0.001 |
| Gamma (Ratio) | 19239 | 249 | 214, 291 | <0.001 | 0.39 | 0.28, 0.54 | <0.001 |
| Inverse Gaussian (Ratio) | 17616 | 249 | 210, 295 | <0.001 | 0.39 | 0.30, 0.50 | <0.001 |

CI = Confidence Interval; Beta = mean difference (linear model); Ratio = ratio of means.

Residuals for the inverse Gaussian model

Inferential reproducibility conclusion

This model was not inferentially reproducible: residuals were heteroscedastic, grossly non-normal, and demonstrated a clear mean–variance relationship indicative of multiplicative error. The bootstrap model showed a standardised change in the confidence interval range of –0.1, whereas an inverse Gaussian model substantially improved fit, reducing the AIC by 4,784.