Paper 19: Associations between industry involvement and study characteristics at the time of trial registration in biomedical research

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

April 24, 2026

References

Seidler AL, Hunter KE, Chartres N, Askie LM (2019) Associations between industry involvement and study characteristics at the time of trial registration in biomedical research. PLoS ONE 14(9): e0222117. https://doi.org/10.1371/journal.pone.0222117

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.
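The rounding-tolerance rule can be sketched as follows; `classify` and its exact tolerances are illustrative stand-ins, not the project's actual code:

```python
def classify(original: float, reproduced: float, decimals: int) -> str:
    """Categorise one reported statistic against its reproduced value.

    Illustrative rule: agreement within half a unit of the last reported
    decimal place counts as Reproduced; disagreement by at most one unit
    in that place is treated as Incorrect Rounding.
    """
    eps = 1e-9                                # guard against float noise
    half_unit = 0.5 * 10 ** (-decimals)
    if abs(original - reproduced) <= half_unit + eps:
        return "Reproduced"
    if abs(original - round(reproduced, decimals)) <= 10 ** (-decimals) + eps:
        return "Incorrect Rounding"
    return "Not Reproduced"

# Values from this report: the coefficient matched, the upper CI did not.
print(classify(-152.99, -152.9907, 2))   # Reproduced
print(classify(-74.98, -73.9875, 2))     # Not Reproduced
```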

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.
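The change thresholds can be expressed directly; the helper below is a hypothetical illustration, applied to the standardised sponsor coefficient from this report's bootstrap comparison:

```python
def meaningful_change(b_orig: float, b_new: float) -> dict:
    """Flag a standardised-coefficient change against the report's thresholds."""
    abs_diff = abs(b_new - b_orig)
    rel_diff = abs_diff / abs(b_orig) * 100 if b_orig != 0 else float("inf")
    return {
        "relative_10pct": rel_diff >= 10,   # relative change of 10% or more
        "absolute_0.1": abs_diff >= 0.1,    # moderate absolute difference
        "absolute_0.2": abs_diff >= 0.2,    # substantial absolute difference
    }

# Reproduced vs bootstrapped standardised coefficient from this report
print(meaningful_change(-0.2459, -0.2458))   # all False
```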

Results from the reproduction of the Seidler et al. (2019) paper are presented below. An overall summary of results is presented first, followed by model-specific results organised within tab panels. Within each panel, the Original results tab displays the linear regression outputs extracted from the published paper. The Reproduced results tab presents estimates derived from the authors’ shared data, along with a comprehensive assessment of linear regression assumptions. The Differences tab compares the original and reproduced models to assess computational reproducibility. Finally, the Sensitivity analysis tab evaluates inferential reproducibility by examining whether identified assumption violations meaningfully affected the results.

Summary from statistical review

This study examined associations between industry involvement and study characteristics at the time of trial registration in biomedical research. Linear regression was a minor part of the paper, used to estimate the mean difference in target sample size by industry involvement. Interpretation of the regression coefficient was poor: the authors stated only that ‘The mean difference between trials with and without industry involvement was -153 (95% CI = -231 to -74, p < .001)’, rather than stating that trials with industry involvement had a smaller target sample size, MD = -153 (95% CI = -231 to -74, p < 0.001), than trials without. Although this seems relatively minor, it impedes interpretation. Assumptions were not discussed.

Data availability and software used

The data were provided in the supporting information as an Excel file in both long and wide formats, with no data dictionary. All analyses were performed using R.

Regression sample

There was one linear regression reported with target sample size as the outcome variable, therefore no randomisation of models was required.

Computational reproducibility results

The results for this paper were mostly reproduced. A summary of results was presented in the text (coefficient and 95% confidence interval), with more decimal places presented in the Supporting information, which was used for reproducing results. These results were moderately difficult to reproduce because the authors provided raw data that required manipulating and merging four datasheets into both long and wide formats. Understanding how to reproduce the results required reading both the paper and the supporting information, as six outliers were excluded but this was not mentioned in the main text of the paper. The original analysis found that the target sample size was smaller with industry involvement, MD = -153 (95% CI = -231 to -74, p < 0.001). There was a typographical error in the supplementary results, which reported the upper confidence limit as -74.98; the reproduced value was -73.987, which was correctly reported in the text of the paper as -74.

Inferential reproducibility results

The regression in this paper was not inferentially reproducible, with gross violations of normality and heteroscedasticity. Removing outliers did not address the skewed distribution of target sample size, as indicated by large discrepancies between means and medians and SDs greatly exceeding IQRs (e.g., no industry: mean = 249, 95% CI 211–295, SD = 689; median = 70, IQR = 125; industry: mean = 96, 95% CI 26–166, SD = 198; median = 45, IQR = 76). Outlier removal reduced the variance but risked bias, as large sample sizes are often required when rare events occur or small effects are of interest.
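The skew heuristics can be checked directly from the reported group summaries (values copied from the paragraph above): a mean far above the median, and an SD exceeding the IQR, both suggest right skew.

```python
# Reported descriptives for target sample size by sponsor group
groups = {
    "no industry": {"mean": 249, "median": 70, "sd": 689, "iqr": 125},
    "industry":    {"mean": 96,  "median": 45, "sd": 198, "iqr": 76},
}
# Flag right skew where mean >> median and SD > IQR
skew_flags = {
    name: g["mean"] > g["median"] and g["sd"] > g["iqr"]
    for name, g in groups.items()
}
print(skew_flags)   # both groups flag as right-skewed
```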

Target sample size was standardised and a heteroscedastic bootstrap model was fitted, which showed a standardised change in confidence interval range of –0.1. While bootstrapping provided a comparison on the original additive scale, the clear mean–variance relationship (variance increasing with the mean) indicated multiplicative rather than additive error. Alternative distributions were therefore considered. The negative binomial and gamma models improved fit, but residuals continued to show overdispersion. The inverse Gaussian model provided the best fit, with an AIC 4,784 lower than linear regression (an AIC difference of 10 is considered large), and produced narrower confidence intervals than the negative binomial and gamma models, suggesting greater precision. This model estimated that the mean sample size was 61% lower with industry involvement (ratio = 0.39, 95% CI 0.30–0.50), with geometric means of 249 (95% CI 211–295) for no industry involvement and 96 (95% CI 78–118) for industry involvement. Notably, the confidence interval for industry involvement under the inverse Gaussian model was much narrower than under linear regression, reflecting improved fit and greater precision in estimating the mean.

Model 1

Model results for Target Sample Size

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept |  |  |  |  |  |  |
| Sponsor_Type: Any industry involvement – No industry involvement | −152.99 |  | −231.99 | −74.98 |  | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Target Sample Size

| R | R2 | R2 Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
|  |  |  |  |  |  |  |  |  |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Target Sample Size

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Sponsor_Type |  |  |  |  |  |
| Residuals |  |  |  |  |  |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Target Sample Size

| Term | B | SE | Lower | Upper | t | p-value |
|---|---|---|---|---|---|---|
| Intercept | 249.145 | 18.435 | 212.981 | 285.308 | 13.515 | <0.001 |
| Sponsor_Type: Any industry involvement – No industry involvement | −152.991 | 40.274 | −231.994 | −73.987 | −3.799 | <0.001 |

SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval.

Fit statistics for Target Sample Size

| R | R2 | R2 Adj | AIC | RMSE | F | DF1 | DF2 | p-value |
|---|---|---|---|---|---|---|---|---|
| 0.100 | 0.010 | 0.009 | 22,400.217 | 618.730 | 14.430 | 1 | 1425 | <0.001 |

R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Target Sample Size

| Term | SS | DF | MS | F | p-value |
|---|---|---|---|---|---|
| Sponsor_Type | 5,532,050.506 | 1 | 5,532,050.506 | 14.430 | <0.001 |
| Residuals | 546,293,544.369 | 1425 | 383,363.891 |  |  |

SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Testing for homoscedasticity

| Statistic | p-value | Parameter | Method |
|---|---|---|---|
| 5.368 | 0.0205 | 1 | studentized Breusch-Pagan test |

Homoscedasticity results
  • The studentized Breusch-Pagan test indicates heteroscedasticity.
  • The variance was substantially larger in the no industry involvement group, which can be seen visually in the plots.
Model descriptives including Cook’s distance and leverage to understand outliers

| Term | N | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| Target Sample Size | 1427 | 217.088 | 622.073 | 60.000 | 1.000 | 6,800.000 | 6.863 | 54.913 |
| Sponsor_Type* | 1427 | 1.790 | 0.407 | 2.000 | 1.000 | 2.000 | −1.426 | 0.033 |
| .fitted | 1427 | 217.088 | 62.285 | 249.145 | 96.154 | 249.145 | −1.426 | 0.033 |
| .resid | 1427 | −0.000 | 618.947 | −149.145 | −247.145 | 6,550.855 | 6.829 | 54.701 |
| .leverage | 1427 | 0.001 | 0.001 | 0.001 | 0.001 | 0.003 | 1.426 | 0.033 |
| .sigma | 1427 | 619.162 | 1.662 | 619.363 | 594.534 | 619.381 | −10.785 | 125.432 |
| .cooksd | 1427 | 0.000 | 0.003 | 0.000 | 0.000 | 0.050 | 10.414 | 117.972 |
| .std.resid | 1427 | −0.000 | 1.000 | −0.241 | −0.399 | 10.585 | 6.829 | 54.696 |
| dfb.1_ | 1427 | −0.000 | 0.009 | −0.000 | −0.009 | 0.188 | 14.536 | 263.958 |
| dfb.S_ii | 1427 | 0.000 | 0.016 | −0.003 | −0.168 | 0.150 | 3.171 | 47.966 |
| dffit | 1427 | 0.000 | 0.031 | −0.007 | −0.012 | 0.328 | 6.644 | 51.734 |
| cov.r | 1427 | 1.001 | 0.010 | 1.002 | 0.851 | 1.005 | −10.397 | 118.087 |

* categorical variable

Cook’s threshold

Cook’s distance measures the overall change in model fit when an observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted value changes when an observation is removed. Potential influential observations are identified by \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\), where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can flag a large number of points, so DFFIT \(\pm 1\) is often used to identify highly influential observations.

DFBETA threshold

DFBETA measures the change in a regression coefficient, in units of its standard error, when a particular observation is removed from the model. There is a DFBETA for each parameter in the model. Potential influential observations are identified by \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets this threshold can flag a high number of observations with only minor influence on the model. In practice, DFBETA \(\pm 1\) is often used to identify outliers.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentised residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical cut-off between 0.9 and 1.1 was used to flag observations with meaningful impact on precision, although there is no agreed universal alternative cut-off.

Observations of interest identified by the influence plot

| ID | StudRes | Leverage | CookD | dfb.1_ | dfb.S_ii | dffit | cov.r |
|---|---|---|---|---|---|---|---|
| 4 | 0.039 | 0.003 | 0.000 | 0.002 | −0.002 | 0.002 | 1.005 |
| 16 | 0.310 | 0.003 | 0.000 | 0.018 | −0.016 | 0.018 | 1.005 |
| 1188 | 10.694 | 0.001 | 0.047 | 0.000 | 0.146 | 0.319 | 0.859 |
| 832 | 11.023 | 0.001 | 0.050 | 0.000 | 0.150 | 0.328 | 0.851 |

StudRes = studentized residuals; CookD = Cook's Distance, a combined measure of leverage and influence; DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio, which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points
  • Although the largest studentised residual was 11, Cook’s distance, DFBETAS, and DFFITS were within conventional thresholds, indicating the outliers were not influential on the mean. The COVRATIO, however, flagged several observations that may inflate the variance.
Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

| Statistic | p-value | Method |
|---|---|---|
| 0.345 | <0.001 | Asymptotic one-sample Kolmogorov-Smirnov test |
| 0.328 | <0.001 | Shapiro-Wilk normality test |

Normality results
  • The Kolmogorov-Smirnov test indicates residuals may not be normally distributed.
  • The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
  • QQ-plot indicates the residuals are not normally distributed.
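The two normality tests above can be sketched with `scipy`; the residuals here are simulated with heavy right skew as a stand-in for the model's actual residuals, which behave similarly and also reject normality.

```python
import numpy as np
from scipy import stats

# Simulated, centred residuals with heavy right skew (exponential)
rng = np.random.default_rng(0)
resid = rng.exponential(scale=600, size=1427)
resid = resid - resid.mean()

# Shapiro-Wilk on the raw residuals; KS against a standard normal
sw_stat, sw_p = stats.shapiro(resid)
ks_stat, ks_p = stats.kstest(resid / resid.std(), "norm")
print(f"Shapiro-Wilk W = {sw_stat:.3f}, p = {sw_p:.3g}")
print(f"Kolmogorov-Smirnov D = {ks_stat:.3f}, p = {ks_p:.3g}")
```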
Assessing independence with the Durbin–Watson test for autocorrelation

| AutoCorrelation | Statistic | p-value |
|---|---|---|
| 0.061 | 1.878 | 0.0540 |

Independence results
  • The Durbin–Watson test suggests there are no auto-correlation issues.
  • The study design does not guarantee independent observations; clustering should be assessed using linear mixed models or generalized estimating equations.
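A minimal Durbin–Watson check with `statsmodels`, on simulated independent residuals; values near 2 indicate no first-order autocorrelation (this model reported DW = 1.878).

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Independent normal residuals should give a DW statistic close to 2
rng = np.random.default_rng(0)
resid = rng.normal(size=1427)
dw = durbin_watson(resid)
print(f"DW = {dw:.3f}")
```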
Assumption conclusions

Assumptions of linear regression were not satisfied: residuals showed marked non-normality and heteroscedasticity, and a clear mean–variance relationship suggested a multiplicative error structure, indicating that alternative distributions should be considered. Independence also appeared questionable, with potential clustering where some organisations contributed multiple observations; this should be examined using mixed-effects models.

Forest plot showing original and reproduced coefficients and 95% confidence intervals for Target Sample Size

Change in regression coefficients

| Term | O_B | R_B | Change.B | Reproduce.B |
|---|---|---|---|---|
| Intercept |  | 249.1445 |  |  |
| Sponsor_Type: Any industry involvement – No industry involvement | −152.99 | −152.9907 | −0.0007 | Reproduced |

O_B = original B; R_B = reproduced B; Change.B = change in R_B - O_B; Reproduce.B = B reproduced.

Change in lower 95% confidence intervals for coefficients

| Term | O_lower | R_lower | Change.lci | Reproduce.lower |
|---|---|---|---|---|
| Intercept |  | 212.9812 |  |  |
| Sponsor_Type: Any industry involvement – No industry involvement | −231.99 | −231.9938 | −0.0038 | Reproduced |

O_lower = original lower confidence interval; R_lower = reproduced lower confidence interval; Change.lci = change in R_lower - O_lower; Reproduce.lower = lower confidence interval reproduced.

Change in upper 95% confidence intervals for coefficients

| Term | O_upper | R_upper | Change.uci | Reproduce.upper |
|---|---|---|---|---|
| Intercept |  | 285.3078 |  |  |
| Sponsor_Type: Any industry involvement – No industry involvement | −74.98 | −73.9875 | 0.9925 | Not Reproduced |

O_upper = original upper confidence interval; R_upper = reproduced upper confidence interval; Change.uci = change in R_upper - O_upper; Reproduce.upper = upper confidence interval reproduced.

Change in p-values

| Term | O_p | R_p | Change.p | Reproduce.p | SigChangeDirection |
|---|---|---|---|---|---|
| Intercept |  | <0.001 |  |  |  |
| Sponsor_Type: Any industry involvement – No industry involvement | <0.001 | <0.001 | 0.0000 | Reproduced | Remains sig, B same direction |

O_p = original p-value; R_p = reproduced p-value; Change.p = change in p-value R_p - O_p; Reproduce.p = p-values reproduced; SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Results for p-values

The p-value for this model was reproduced.

Conclusion computational reproducibility

This model was mostly computationally reproducible. The regression coefficient and p-value were reproduced, with a typographical error identified in the upper confidence interval.

Methods

While the model was mostly computationally reproduced, the residuals showed substantial non-normality and heteroscedasticity. The original aim was to estimate the mean difference in target sample size by sponsor type; since linear regression is equivalent to a two-sample t-test in this setting, a rank-based alternative such as the Mann–Whitney U test could provide a simple comparison. However, potential clustering by organisation may also have been present. This was not assessed in the current sensitivity analysis, as the documentation was unclear, but ideally the authors should have considered a mixed-effects model to account for such clustering.
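The suggested rank-based alternative can be sketched as follows; group sizes and lognormal parameters are assumptions chosen to mimic the reported medians (~70 vs ~45), since the actual registry data are not re-analysed here.

```python
import numpy as np
from scipy import stats

# Simulated right-skewed target sample sizes by sponsor group
rng = np.random.default_rng(42)
no_industry = rng.lognormal(mean=np.log(70), sigma=1.3, size=1127)
industry = rng.lognormal(mean=np.log(45), sigma=1.2, size=300)

# Mann-Whitney U: distribution-free comparison of the two groups
u_stat, p_value = stats.mannwhitneyu(no_industry, industry, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.3g}")
```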

Target sample size was standardised and a heteroscedastic bootstrap model was fitted. While bootstrapping provides a comparison on the original additive scale, the mean–variance relationship warranted further modelling because the outcome was a count-like variable and initial descriptive statistics indicated overdispersion (with variance exceeding the mean). Therefore, comparisons of alternative distributions were conducted on the unstandardised outcome.

A negative binomial model was first fitted; however, residual diagnostics continued to show evidence of dispersion, and visual inspection of fitted values revealed a clear mean–variance relationship. To better capture this, other candidate distributions were considered: a gamma model, which assumes a quadratic mean–variance relationship, and an inverse Gaussian model, which allows for a cubic mean–variance relationship. Model performance was then evaluated using the Akaike Information Criterion (AIC) to determine the best-fitting distribution.

Results

The regression coefficients were consistent between the bootstrap and reproduced models, indicating no difference in point estimates. However, the upper and lower confidence limits shifted by more than 10%, although neither shift exceeded the absolute threshold of ±0.1. The standardised confidence interval range narrowed by 0.1, with the bootstrap interval 42% narrower than that of the reproduced model, indicating the model is not inferentially reproducible.

Bootstrapping results

Wild bootstrapping was performed with 10,000 iterations.
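A minimal wild bootstrap for the group coefficient, assuming Rademacher weights (the report does not state the weight distribution used); the data are simulated with group-dependent, skewed errors, with 10,000 iterations as above.

```python
import numpy as np

# Simulated two-group outcome with heteroscedastic, mean-zero errors
rng = np.random.default_rng(0)
n = 500
x = rng.binomial(1, 0.2, n)
err = (rng.exponential(400, n) - 400) * (1 - 0.5 * x)
y = 249 - 153 * x + err

# Fit OLS once, then resample by flipping residual signs (wild bootstrap)
X = np.column_stack([np.ones(n), x])
pinv = np.linalg.pinv(X)
beta = pinv @ y
resid = y - X @ beta
fitted = X @ beta

boot = np.empty(10_000)
for b in range(boot.size):
    w = rng.choice([-1.0, 1.0], size=n)          # Rademacher weights
    boot[b] = (pinv @ (fitted + resid * w))[1]   # refit on perturbed outcome

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"group effect {beta[1]:.1f}, wild-bootstrap 95% CI ({lo:.1f}, {hi:.1f})")
```

Because the weights are resampled per observation, the bootstrap preserves each point's own error variance, which is what makes the interval heteroscedasticity-robust.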

Change in regression coefficients

| Term | B | boot.B | B_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Sponsor_Type1 | −0.2459 | −0.2458 | −0.0001 | −0.0500 | No | No | No |

B = standardized reproduced B; boot.B = bootstrapped standardized reproduced B; B_diff = change in B - boot.B; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

| Term | Lower | boot.Lower | Lower_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Sponsor_Type1 | −0.3729 | −0.3216 | −0.0513 | −13.7600 | Yes | No | No |

Lower = standardized reproduced lower CI; boot.Lower = bootstrapped standardized reproduced lower CI; Lower_diff = change in Lower - boot.Lower; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

| Term | Upper | boot.Upper | Upper_diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Sponsor_Type1 | −0.1189 | −0.1718 | 0.0528 | 44.4300 | Yes | No | No |

Upper = standardized reproduced upper CI; boot.Upper = bootstrapped standardized reproduced upper CI; Upper_diff = change in Upper - boot.Upper; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

| Term | Range | boot.Range | Range_Diff | %_Diff | Diff_10% | Diff_0.1 | Diff_0.2 |
|---|---|---|---|---|---|---|---|
| Sponsor_Type1 | 0.2540 | 0.1498 | −0.1042 | −41.0200 | Yes | Yes | No |

Range = standardized reproduced CI range; boot.Range = bootstrapped standardized reproduced CI range; Range_Diff = change in CI range; %_Diff = percentage difference; Diff_10% = difference ≥10%; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

| Term | p-value | boot.p-value | changep | SigChangeDirection |
|---|---|---|---|---|
| Sponsor_Type1 | 0.0002 | 0.0000 | 0.0002 | Remains sig, B same direction |

p-value = standardized reproduced p-value; boot.p-value = bootstrapped standardized reproduced p-value; changep = change in p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Comparison of distributions

The linear model had an AIC of 22400. A negative binomial model improved fit (AIC 17757), but DHARMa residual diagnostics indicated deviations from ideal assumptions, including overdispersion, outliers, and a lack-of-fit signal. A gamma model was then fitted; although overdispersion and outliers were no longer evident, its AIC was higher (19239) and the residual diagnostics worsened, indicating misspecification. Finally, an inverse Gaussian model achieved the lowest AIC (17616). Its diagnostics improved, with outliers and overdispersion no longer an issue; a slight residual departure from the target distribution remained, but the inverse Gaussian was retained as the best overall fit (inverse Gaussian residuals shown below). Because outliers and potential organisational clustering were not formally evaluated, the final choice of distribution should be based on models that explicitly account for these features.

Note on “normality” in DHARMa: for GLMs (e.g., negative binomial, gamma, inverse Gaussian), residuals are not expected to be Gaussian. DHARMa uses simulated, scaled residuals that should be uniform on (0,1) under a correctly specified model. Apparent “non-normality” in this context means non-uniformity of these residuals (e.g., a significant KS test or QQ plot deviations), which signals potential misspecification (wrong mean–variance form, link, or missing structure), not a violation of Gaussian residual assumptions.

| Model | AIC | (Intercept) | 95% CI | p-value | Sponsor_Type1 | 95% CI | p-value |
|---|---|---|---|---|---|---|---|
| Linear (Beta) | 22400 | 249 | 213, 285 | <0.001 | −153 | −232, −74 | <0.001 |
| Negative Binomial (Ratio) | 17757 | 249 | 231, 269 | <0.001 | 0.39 | 0.33, 0.46 | <0.001 |
| Gamma (Ratio) | 19239 | 249 | 214, 291 | <0.001 | 0.39 | 0.28, 0.54 | <0.001 |
| Inverse Gaussian (Ratio) | 17616 | 249 | 210, 295 | <0.001 | 0.39 | 0.30, 0.50 | <0.001 |

CI = Confidence Interval; Beta = mean difference (linear model); Ratio = ratio of means.

Residuals for the inverse Gaussian model

Inferential reproducibility conclusion

This model was not inferentially reproducible: residuals were heteroscedastic, grossly non-normal, and demonstrated a clear mean–variance relationship indicative of multiplicative error. The bootstrap model showed a standardised change in the confidence interval range of –0.1, whereas an inverse Gaussian model substantially improved fit, reducing the AIC by 4,784.