Paper 38: Association of serum 25-hydroxyvitamin D concentrations with sleep phenotypes in a German community sample

Author

Lee Jones - Senior Biostatistician - Statistical Review

Published

April 5, 2026

References

Dogan-Sander E, Willenberg A, Batmaz İ, Enzenbach C, Wirkner K, Kohls E, et al. (2019) Association of serum 25-hydroxyvitamin D concentrations with sleep phenotypes in a German community sample. PLoS ONE 14(7): e0219318. https://doi.org/10.1371/journal.pone.0219318

Disclosure

This reproducibility project was conducted to the best of our ability, with careful attention to statistical methods and assumptions. The research team comprises four senior biostatisticians (three of whom are accredited), with 20 to 30 years of experience in statistical modelling and analysis of healthcare data. While statistical assumptions play a crucial role in analysis, their evaluation is inherently subjective, and contextual knowledge can influence judgements about the importance of assumption violations. Differences in interpretation may arise among statisticians and researchers, leading to reasonable disagreements about methodological choices.

Our approach aimed to reproduce published analyses as faithfully as possible, using the details provided in the original papers. We acknowledge that other statisticians may have differing success in reproducing results due to variations in data handling and implicit methodological choices not fully described in publications. However, we maintain that research articles should contain sufficient detail for any qualified statistician to reproduce the analyses independently.

Methods used in our reproducibility analyses

There were two parts to our study. First, 100 articles published in PLOS ONE were randomly selected from the health domain and sent for post-publication peer review by statisticians. Of these, 95 included linear regression analyses and were therefore assessed for reporting quality. The statisticians evaluated what was reported, including regression coefficients, 95% confidence intervals, and p-values, as well as whether model assumptions were described and how those assumptions were evaluated. This report provides a brief summary of the initial statistical review.

The second part of the study involved reproducing linear regression analyses for papers with available data to assess both computational and inferential reproducibility. All papers were initially assessed for data availability and the statistical software used. From those with accessible data, the first 20 papers (from the original random sample) were evaluated for computational reproducibility. Within each paper, individual linear regression models were identified and assigned a unique number. A maximum of three models per paper were selected for assessment. When more than three models were reported, priority was given to the final model or the primary models of interest as identified by the authors; any remaining models were selected at random.

To assess computational reproducibility, differences between the original and reproduced results were evaluated using absolute discrepancies and rounding error thresholds, tailored to the number of decimal places reported in each paper. Results for each reported statistic, e.g., regression coefficient, were categorised as Reproduced, Incorrect Rounding, or Not Reproduced, depending on how closely they matched the original values. Each paper was then classified as Reproduced, Mostly Reproduced, Partially Reproduced, or Not Reproduced. The mostly reproduced category included cases with minor rounding or typographical errors, whereas partially reproduced indicated substantial errors were observed, but some results were reproduced.

For models deemed at least partially computationally reproducible, inferential reproducibility was further assessed by examining whether statistical assumptions were met and by conducting sensitivity analyses, including bootstrapping where appropriate. We examined changes in standardized regression coefficients, which reflect the change in the outcome (in standard deviation units) for a one standard deviation increase in the predictor. Meaningful differences were defined as a relative change of 10% or more, or absolute differences of 0.1 (moderate) and 0.2 (substantial). When non-linear relationships were identified, inferential reproducibility was assessed by comparing model fit measures, including R², Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). When the Gaussian distribution was not appropriate for the dependent variable, alternative distributions were considered, and model fit was evaluated using AIC and BIC.

Results from the reproduction of the Dogan-Sander et al. (2019) paper are presented below. An overall summary of results is presented first, followed by model-specific results organised within tab panels. Within each panel, the Original results tab displays the linear regression outputs extracted from the published paper. The Reproduced results tab presents estimates derived from the authors’ shared data, along with a comprehensive assessment of linear regression assumptions. The Differences tab compares the original and reproduced models to assess computational reproducibility. Finally, the Sensitivity analysis tab evaluates inferential reproducibility by examining whether identified assumption violations meaningfully affected the results.

Summary from statistical review

This paper explored the association between vitamin D and sleep. Seven sleep phenotypes were modelled using linear regression to explore the relationships between sleep vitamin D and adjustment for covariates using a stepwise modelling approach, although direction of stepwise modelling was not described. Assumptions were checked using residuals. The authors have gone to great lengths to describe statistical analysis. Still, they have not interpreted the results in the same detail, with over-reliance on p-values and no standard error or confidence intervals shown.

Data availability and software used

The authors provided the data in a wide formatted Excel file in the supporting information with no accompanying data dictionary. SPSS and Minitab were reported as used for statistical analysis, SPSS was assumed for linear regressions as this aligned to the dummy coding descriptions in the paper.

Regression sample

There were seven main multivariable models of interest. Three outcomes randomly selected for reproducibility were Subjective Sleep Quality, Night Sleep Efficiency, and Midsleep time.

Computational reproducibility results

This paper was not computationally reproducible, as two of the three reported models could not be reproduced. Although the authors devoted substantial effort to describing their statistical methods, including the variables used, the modelling approach, and assumption checking, several aspects of the reporting impeded reproducibility. The authors described the use of Box–Cox transformations, variable centring, and the creation of dummy variables, including interaction terms; however, none of the derived variables were provided. As a result, more than 30 derived variables had to be created across the three models. Although the Box–Cox λ values were reported, there are multiple valid ways to implement these transformations. Initial attempts produced shifted sums of squares despite identical p-values, indicating differences in implementation. The Subjective Sleep Quality model was ultimately reproduced after an alternative Box–Cox equations approach was applied. The authors also provided standardized regression coefficients without identifying that they were standardized.

The Night Sleep Efficiency and Midsleep Time models could not be reproduced due to inconsistencies in the reported sample sizes. The authors stated that the whole dataset contained 1,045 observations and that alcohol consumption was the only predictor with missing data (27 observations). However, the Subjective Sleep Quality model was reported to include 985 observations, yet no observations were excluded for missing alcohol consumption, even though three observations had missing values. This suggests that the model may have been fitted using a reduced dataset carried forward from a previous analysis. Further inconsistencies were observed for the remaining models. The Night Sleep Efficiency model was reported to include 1,005 observations but was reproduced with 1,045 observations, consistent with the descriptive statistics. The Midsleep Time model was reported to include 1,021 observations, but it was reproduced with 1,027. In addition, this model exhibited inconsistent degrees of freedom, with nine coefficients in the model but ten degrees of freedom reported in the ANOVA table. Collectively, these inconsistencies in the construction of derived variables, sample sizes, and model degrees of freedom render two of the three analyses computationally not reproducible.

Inferential reproducibility results

Although the authors provide a detailed description of their modelling strategy, this description focuses primarily on statistical procedures and assumptions rather than the clinical or substantive context of the analysis. While the bootstrapped regression coefficients and p-values were reproduced, the model was not inferentially reproducible. Only the final model was assessed; however, the modelling process described in the manuscript reflected two key misconceptions about statistical modelling. First, the authors described stepwise variable selection as a solution to multicollinearity. Highly collinear predictors were entered into a stepwise selection procedure, a practice known to produce unstable coefficient estimates and p-values. Because correlated predictors compete to explain the same variance, minor changes to the data or model specification can cause variables to enter or exit the model, change coefficient signs, or alter statistical significance. As a result, findings may appear computationally reproduced while remaining inferentially unstable. Collinearity should be addressed prior to any stepwise selection. Second, categorical predictors were dummy coded and treated as separate independent variables within the selection process, including interaction terms. This resulted in interactions being evaluated independently of their corresponding main effects and multi-level categorical variables being fragmented across multiple tests. Together, these modelling decisions undermine the stability, interpretability, and inferential validity of the reported results.

Recommended Changes

Provide tables in the Supporting Information that present all analyses conducted in the paper, including full model outputs such as regression coefficients and all variables used for adjustment.
Stepwise regression does not address multicollinearity. When predictors are highly correlated, either reduce the number of correlated variables or use regularisation approaches such as LASSO or ridge regression.
When interaction terms are included, the corresponding main effects should also be retained in the model to ensure interpretability.
Provide a dataset that includes all derived variables used in the analysis, including transformed variables (e.g. Box–Cox transformations) and centered or standardized variables.
Ensure consistency between what is described in the methods and what is reported in the results. In particular, clearly distinguish between standardized and unstandardized regression coefficients, as this distinction is critical for interpretation.
Regression coefficients should be interpreted and related back to the outcome. For example, when standardized coefficients are reported, interpretation should be framed in terms of a one–standard deviation increase in the predictor and the corresponding change (in standard deviation units) in the outcome, with consideration of whether the magnitude of the effect is clinically meaningful.
Consider using bootstrapping rather than Box–Cox transformations so that regression coefficients are more easily interpretable.
Ensure complete reporting of the stepwise model selection procedure, including the direction of selection (forward, backward, or bidirectional) and the criteria for variable entry and removal.
Include a data dictionary.
Consider creating a reproducible analysis workflow and sharing the code.

Model 1

Model results for Subjective Sleep Quality

Term	p-value	stdEst
Intercept
Age	<0.001	0.139
BMI	0.011	0.079
depression_yes	<0.001	0.132
male_age	<0.001	−0.288
male_thyroiddisorder	0.009	0.083
male_winter	0.012	0.085
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval: stdEst = standardized B.

Fit statistics for Subjective Sleep Quality

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
	0.096	0.090			17.22	6	978	<0.001
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Subjective Sleep Quality

Term	SS	DF	MS
Age
BMI
depression_yes
male_age
male_thyroiddisorder
male_winter
Residuals	41.088	978	0.042
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Subjective Sleep Quality

Term	B	SE	Lower	Upper	t	p-value	stdEst
Intercept	0.349	0.048	0.254	0.443	7.241	<0.001
Age	0.003	0.001	0.001	0.004	4.314	<0.001	0.139
BMI	0.004	0.001	0.001	0.007	2.533	0.0115	0.079
depression_yes	0.099	0.023	0.054	0.144	4.323	<0.001	0.132
male_age	−0.002	0.000	−0.002	−0.002	−7.963	<0.001	−0.288
male_thyroiddisorder	0.072	0.028	0.018	0.127	2.625	0.0088	0.083
male_winter	0.051	0.020	0.011	0.090	2.524	0.0118	0.085
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval: stdEst = Standardized B

Fit statistics for Subjective Sleep Quality

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
0.309	0.096	0.090	−315.946	0.204	17.224	6	978	<0.001
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Subjective Sleep Quality

Term	SS	DF	MS	F	p-value
Age	0.784	1	0.784	18.613	<0.001
BMI	0.270	1	0.270	6.417	0.0115
depression_yes	0.787	1	0.787	18.688	<0.001
male_age	2.670	1	2.670	63.412	<0.001
male_thyroiddisorder	0.290	1	0.290	6.893	0.0088
male_winter	0.268	1	0.268	6.371	0.0118
Residuals	41.172	978	0.042
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

The blue line shows the best line of fit with shading representing 95% confidence intervals, while holding all other covariates constant. The dots show partial residuals, which reflect the observed data adjusted for all other predictors except the one being plotted.

Checking residuals plots for patterns

Blue line showing quadratic fit for residuals

Testing residuals for non linear relationships

Term	Statistic	p-value	Results
Age	−1.348	0.1779	No linearity violation
BMI	0.336	0.7370	No linearity violation
depression_yes	0.148	0.8821	No linearity violation
male_age	−1.791	0.0736	No linearity violation
male_thyroiddisorder	0.415	0.6779	No linearity violation
male_winter	0.354	0.7232	No linearity violation
Tukey test	−0.833	0.4050	No linearity violation
Specification test for predictors using quadratic tests, for fitted values curvature is tested through Tukey's one-degree-of-freedom test for nonadditivity.

Checking univariate relationships with the dependent variable using scatterplots

Blue line shows linear relationship, red line indicates relationship inferred by GAM modelling

Linearity results

No linearity violation was observed in either plots or tests.

Testing for homoscedasticity

Statistic	p-value	Parameter	Method
5.114	0.5293	6	studentized Breusch-Pagan test

Homoscedasticity results

The studentized Breusch-Pagan test supports homoscedasticity.
There is no distinct funnelling pattern observed, supporting homoscedasticity of residuals.

Model descriptives including cook’s distance and leverage to understand outliers

Term	N	Mean	SD	Median	Min	Max	Skewness	Kurtosis
Subjective Sleep Quality	985	0.566	0.215	0.593	0.000	1.089	−0.024	−0.164
Age	985	58.593	11.670	60.000	20.000	79.000	−0.337	−0.555
BMI	985	27.717	4.499	27.139	16.934	55.360	1.088	2.813
depression_yes	985	0.091	0.288	0.000	0.000	1.000	2.832	6.027
male_age	985	29.267	30.852	0.000	0.000	79.000	0.224	−1.750
male_thyroiddisorder	985	0.065	0.247	0.000	0.000	1.000	3.525	10.433
male_winter	985	0.153	0.360	0.000	0.000	1.000	1.922	1.695
.fitted	985	0.566	0.066	0.561	0.447	0.803	0.350	−0.539
.resid	985	0.000	0.205	−0.001	−0.579	0.511	−0.099	−0.179
.leverage	985	0.007	0.005	0.005	0.002	0.045	2.193	7.193
.sigma	985	0.205	0.000	0.205	0.204	0.205	−2.115	5.012
.cooksd	985	0.001	0.002	0.000	0.000	0.012	3.069	12.365
.std.resid	985	−0.000	1.000	−0.007	−2.824	2.497	−0.099	−0.182
dfb.1_	985	0.000	0.031	0.000	−0.188	0.121	−0.213	4.175
dfb.Age	985	−0.000	0.032	−0.000	−0.138	0.204	0.302	4.046
dfb.BMI	985	−0.000	0.029	−0.000	−0.169	0.122	−0.634	5.987
dfb.dpr_	985	0.000	0.034	−0.000	−0.248	0.208	0.029	15.379
dfb.ml_g	985	−0.000	0.032	0.000	−0.126	0.121	−0.026	1.804
dfb.ml_t	985	−0.000	0.030	0.000	−0.223	0.247	0.333	25.320
dfb.ml_w	985	0.000	0.030	0.000	−0.187	0.162	0.253	6.955
dffit	985	−0.000	0.083	−0.001	−0.287	0.292	−0.030	0.563
cov.r	985	1.007	0.011	1.009	0.954	1.055	−1.203	3.575
* categorical variable

Cooks threshold

Cook’s distance measures the overall change in fit, if the ith observation is removed. Potential influential observations are identified by \(\text{Cook's Distance}_i > \frac{4}{n}\), where n is the number of observations. In practice a threshold of 0.5 to 1 is often used to identify influential observations.

DFFIT threshold

DFFIT measures how many standard deviations the fitted values will change when the ith observation is removed. Potential influential observations \(\left| \text{DFFITS}_i \right| > \frac{2\sqrt{p}}{\sqrt{n}}\) where p is the number of predictors (including the intercept) and n is the number of observations. In practice this can result in a large number of points identified, a practical cut-off of 1 was used to flag observations with meaningful impact.

DFBETA threshold

DFBETAS quantify the influence of the ith observation on the jth regression coefficient as the change in that coefficient when the observation is omitted, expressed in units of the coefficient’s estimated standard error. There is a DFBETA for each parameter in the model. Potential influential observations \(|\text{DFBETA}_{ij}| > \frac{2}{\sqrt{n}}\), where n is the number of observations. In larger datasets, this threshold can flag a high number of observations with only minor influence on the model. A practical cut-off of 1 was used to flag observations with meaningful impact.

Influence plot

Observations with high leverage (horizontal) and large residuals (vertical, typically at ±2 or ±3 studentized residuals) are concerning, as they may disproportionately influence the model. This combination is reflected by large bubbles with high Cook’s distance indicated by darker shadings of blue.

COVRATIO plot

COVRATIO measures the overall change in the precision (covariance matrix) of the estimated regression coefficients when the ith observation is removed. Values close to 1 indicate little influence on the model’s precision. Values below 1 suggest that an observation inflates the variances and reduces precision, resulting in wider confidence intervals, whereas values above 1 suggest deflated variances and narrower confidence intervals. A commonly cited guideline is \(\left|\mathrm{COVRATIO}_i - 1\right| > \frac{3p}{n}\), where p is the number of parameters and n is the number of observations. A practical cut-off between 0.9 to 1.1 was used to flag observations with meaningful impact on precision, although there is no agreed universal alternative cut-off.

Observations of interest identified by the influence plot

ID	StudRes	Leverage	CookD	dfb.1_	dfb.Age	dfb.BMI	dfb.dpr_	dfb.ml_g	dfb.ml_t	dfb.ml_w	dffit	cov.r
973	0.183	0.045	0.000	−0.023	−0.012	0.039	−0.002	−0.004	0.001	−0.001	0.040	1.055
960	−2.814	0.003	0.003	−0.095	−0.002	0.084	0.039	0.069	0.003	0.003	−0.155	0.955
373	−2.835	0.003	0.004	−0.035	0.090	−0.062	0.040	0.055	0.001	0.010	−0.165	0.954
993	−0.996	0.043	0.006	0.095	0.062	−0.165	0.008	−0.002	−0.124	0.025	−0.212	1.045
435	−1.969	0.021	0.012	0.121	−0.062	−0.088	0.007	−0.036	−0.223	0.062	−0.287	1.000
20	2.083	0.019	0.012	0.117	−0.048	−0.093	−0.020	0.011	0.247	−0.046	0.292	0.996
StudRes = studentized residuals; CookD = Cook's Distance a combined measure of leverage and influence. DFBETAS (dfb.*) measures how much a specific regression coefficient changes (in standard errors) when an observation is removed; DFFITS measures how much the fitted (predicted) value for an observation changes (in standard deviations) when that observation is removed; cov.r = coefficient covariance ratio which measures how much the overall variance (precision) of the coefficients changes when that observation is removed.

Results for outliers and influential points

Two observations had studentized residuals close to three 3, however, both observations had low leverage and small Cook’s distance, with DFBETAS, COVRATIO, DFFITS within conventional ranges, indicating no issues with outliers.

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Statistic	p-value	Method
0.022	0.7067	Asymptotic one-sample Kolmogorov-Smirnov test

Statistic	p-value	Method
0.996	0.0132	Shapiro-Wilk normality test

Normality results

The Kolmogorov-Smirnov supports residuals being normally distributed.
The Shapiro-Wilk normality test indicates residuals may not be normally distributed.
QQ-plot looks roughly normal.

Assessing collinearity with VIF

Term	VIF	Tolerance
Age	1.122	0.891
BMI	1.047	0.955
depression_yes	1.014	0.986
male_age	1.411	0.709
male_thyroiddisorder	1.082	0.924
male_winter	1.223	0.818
VIF = Variance Inflation Factor.

Collinearity results

All VIF values are under three, indicating no collinearity issues.
Overall, when taking into account VIF and SE, the model does not have collinearity issues.

Assessing independence with the Durbin–Watson test for autocorrelation

AutoCorrelation	Statistic	p-value
0.052	1.896	0.1040

Independence results

The Durbin–Watson test suggests there are no auto-correlation issues.
The study design seems to be independent.

Assumption conclusions

The residuals indicated no meaningful departures from the assumptions. While the Shapiro-Wilk normality test showed minor deviations in residual normality, the Q-Q plot showed an approximately normal distribution. There were no outliers of concern.

Change in regression coefficients

term	O_std.B	R_std.B	changestdB	reproduce.std.B
Intercept
Age	0.139	0.1390	0.0000	Reproduced
BMI	0.079	0.0788	−0.0002	Reproduced
depression_yes	0.132	0.1324	0.0004	Reproduced
male_age	−0.288	−0.2877	0.0003	Reproduced
male_thyroiddisorder	0.083	0.0830	0.0000	Reproduced
male_winter	0.085	0.0849	−0.0001	Reproduced
O_std.B = original Standarzised B; R_B = reproduced Standarzised B; Change.std.B = change in R_std.B - O_std.B; Reproduce.std.B = Standarzised B reproduced.

Change in R²

O_R2	R_R2	Change.R2	Reproduce.R2
0.096	0.0956	−0.0004	Reproduced
O_R2 = original R2; R_R2 = reproduced R2; Change.R2 = change in R_R2 - O_R2

Change in global F

Term	O_F	R_F	Change.F	Reproduce.F
Intercept	17.22	17.2241	0.0041	Reproduced
O_F = original global F; R_F = reproduced global F; Change.F = change in R_F - O_F; Reproduce.F = Global F reproduced.

Change in degrees of freedom

O_DF1	R_DF1	Change.DF1	Reproduce.DF1	O_DF2	R_DF2	Change.DF2	Reproduce.DF2
6	6	0	Reproduced	978	978	0	Reproduced
O_DF1 = original degrees of freedom for the model; R_DF1 = reproduced degrees of freedom for the model; Change.DF1 = change in R_DF1 - O_DF1; Reproduce.DF1 = reproduced degrees of freedom for the model (R_DF1 = O_DF1); O_DF2 = original degrees of freedom for the residuals; R_DF2 = reproduced degrees of freedom for the residuals; Change.DF2 = change in R_DF2 - O_DF2; Reproduce.DF2 = reproduced degrees of freedom for the residuals (R_DF2 = O_DF2).

Change in p-values

Term	O_p	R_p	Change.p	Reproduce.p
Intercept		<0.001
Age	<0.001	<0.001	0.0000	Reproduced
BMI	0.011	0.0115	0.0005	Reproduced
depression_yes	<0.001	<0.001	0.0000	Reproduced
male_age	<0.001	<0.001	0.0000	Reproduced
male_thyroiddisorder	0.009	0.0088	−0.0002	Reproduced
male_winter	0.012	0.0118	−0.0002	Reproduced
O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Bland Altman plot showing differences between original and reproduced p-values for Subjective Sleep Quality

Results for p-values

The P-values for this model were reproduced.

Conclusion computational reproducibility

This model was computationally reproducible, with all reported statistics that were assessed being reproducible.

Methods

The model was successfully reproduced; however, residual diagnostics indicated there may be minor departure of residual normality. All variables in the model were standardized, and inference was assessed using bootstrapped standardized regression coefficients and their corresponding 95% confidence intervals. Percentage and absolute changes in estimates and confidence-interval ranges relative to the original linear model were summarised using thresholds of 10% change and standardized coefficient differences of <0.10 and <0.20. Consistency of coefficient direction and statistical significance was also evaluated.

Bootstrapped results

A non-parametric bootstrap with bias-corrected and accelerated (BCa) confidence intervals was performed using 10,000 resamples.

Change in regression coefficients

Term	B	boot.B	B_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.0073	0.0078	−0.0005	−6.9500	No	No	No
z_Age	0.1391	0.1387	0.0004	0.2800	No	No	No
z_BMI	0.0791	0.0796	−0.0005	−0.5900	No	No	No
z_depression_yes	0.1357	0.1355	0.0002	0.1500	No	No	No
z_male_age	−0.2891	−0.2891	0.0001	0.0200	No	No	No
z_male_thyroiddisorder	0.0825	0.0823	0.0003	0.3200	No	No	No
z_male_winter	0.0844	0.0843	0.0002	0.1800	No	No	No
B = standardized regression coefficient reproduced B; boot.B = boostrapped standardized reproduced B; B_diff = change in B - boot.B; %_Diff = percentage difference, percentage changes were truncated at ±1000%; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in lower 95% confidence interval

Term	Lower	boot.Lower	Lower_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	−0.0524	−0.0530	0.0006	1.2200	No	No	No
z_Age	0.0758	0.0784	−0.0026	−3.4400	No	No	No
z_BMI	0.0178	0.0231	−0.0052	−29.4500	Yes	No	No
z_depression_yes	0.0741	0.0721	0.0020	2.7600	No	No	No
z_male_age	−0.3603	−0.3600	−0.0003	−0.0800	No	No	No
z_male_thyroiddisorder	0.0208	0.0246	−0.0037	−17.8500	Yes	No	No
z_male_winter	0.0188	0.0232	−0.0045	−23.7000	Yes	No	No
Lower = standardized reproduced lower CI; boot.Lower = boostrapped standardized reproduced lower CI; Lower_diff = change in Lower - boot.Lower; %_change = percentage difference, percentage changes were truncated at ±1000%; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in upper 95% confidence interval

Term	Upper	boot.Upper	Upper_diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.0670	0.0671	−0.0000	−0.0000	No	No	No
z_Age	0.2023	0.2030	−0.0007	−0.3400	No	No	No
z_BMI	0.1404	0.1341	0.0063	4.5200	No	No	No
z_depression_yes	0.1973	0.2033	−0.0059	−3.0100	No	No	No
z_male_age	−0.2178	−0.2188	0.0010	0.4500	No	No	No
z_male_thyroiddisorder	0.1442	0.1410	0.0032	2.2000	No	No	No
z_male_winter	0.1500	0.1467	0.0033	2.1900	No	No	No
Upper = standardized reproduced upper CI; boot.Upper = boostrapped standardized reproduced upper CI; Upper_diff = change in Upper - boot.Upper; %_change = percentage difference, percentage changes were truncated at ±1000%; Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in Range of 95% confidence interval

Term	Range	boot.Range	Range_Diff	%_Diff	Diff_10%	Diff_0.1	Diff_0.2
Intercept	0.1194	0.1201	0.0006	0.5400	No	No	No
z_Age	0.1265	0.1246	−0.0019	−1.5100	No	No	No
z_BMI	0.1226	0.1110	−0.0116	−9.4600	No	No	No
z_depression_yes	0.1232	0.1312	0.0080	6.4800	No	No	No
z_male_age	0.1425	0.1412	−0.0013	−0.8800	No	No	No
z_male_thyroiddisorder	0.1234	0.1165	−0.0069	−5.5900	No	No	No
z_male_winter	0.1312	0.1235	−0.0077	−5.9000	No	No	No
Range = standardized reproduced CI range; boot.B = boostrapped standardized reproduced CI range; Range_diff = change in CI Range ; %_change = percentage difference, Diff_10% = difference ≥10% ; Diff_0.1 and Diff_0.2 = absolute difference ≥0.1 and ≥0.2, respectively.

Change in p-value significance and regression coefficient direction

Term	p-value	boot.p-value	changep	SigChangeDirection
Intercept	0.8096	0.7976	0.0119	Remains non-sig, B same direction
z_Age	<0.001	<0.001	0.0000	Remains sig, B same direction
z_BMI	0.0115	0.0048	0.0066	Remains sig, B same direction
z_depression_yes	<0.001	<0.001	−0.0000	Remains sig, B same direction
z_male_age	<0.001	<0.001	0.0000	Remains sig, B same direction
z_male_thyroiddisorder	0.0088	0.0056	0.0032	Remains sig, B same direction
z_male_winter	0.0118	0.0076	0.0042	Remains sig, B same direction
p-value = standardized reproduced p-value; boot.p-value = boostrapped standardized reproduced p-value; changep = change in p-value - boot.p-value; SigChangeDirection = statistical significance and B change between reproduced and bootstrapped model.

Check the distribution of bootstrap estimates

The bootstrap distribution of each coefficient appeared approximately normal and centered near the original estimate (red dashed line), suggesting that the estimates are relatively stable. No strong skewness or multimodality was observed.

Bootstrapped model conclusions

This model was inferentially reproducible. While some statistics changed by 10% or more, these differences were not meaningful, with a change in standardized regression coefficients of less than 0.1. The direction of effects and statistical significance remained consistent between the reproduced and bootstrapped models.

Inferential reproducibility conclusions

While the bootstrapped regression coefficients and p-values were reproduced, the model was not inferentially reproducible. Although only the final model was assessed, the described modelling process was flawed and contained two major misconceptions about statistical modelling. The authors described using a stepwise selection procedure as a solution to multicollinearity. Highly collinear variables were entered into a stepwise selection procedure, which may result in unstable coefficient estimates and p-values. Because correlated predictors compete to explain the same variance, these small changes can cause variables to enter or exit the model, flip coefficient signs, or change statistical significance. This is why results may appear computationally reproduced yet remain inferentially unstable; therefore, collinearity should be resolved before putting variables into stepwise regression. In addition, categorical predictors were dummy coded and treated as separate independent variables within the selection process, including interaction terms. This led to interactions being evaluated independently of their corresponding main effects and to multi-level categorical variables being split across multiple tests. Together, these modelling decisions undermine the stability, interpretability, and inferential validity of the reported results.

Model 2

Model results for Night Sleep Efficiency

Term	p-value	stdEst
Intercept
male	<0.001	−0.167
married	<0.001	0.112
BMI_C	<0.001	−0.150
male_C_BMI_C	0.004	−0.091
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval: stdEst = standardized B.

Fit statistics for Night Sleep Efficiency

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
	0.063	0.059			16.727	4	999	0.001
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Night Sleep Efficiency

Term	SS	DF	MS
male
married
BMI_C
male_C_BMI_C
Residuals	272,800,000,000,000,000	999	273,100,000,000,000
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results Night Sleep Efficiency

Term	B	SE	Lower	Upper	t	p-value	stdEst
Intercept	13,563,697.118	244,256.181	13,084,406.007	14,042,988.228	55.531	<0.001
male	−1,538,000.840	260,712.679	−2,049,583.675	−1,026,418.005	−5.899	<0.001	−0.179
married	925,453.473	274,169.607	387,464.813	1,463,442.133	3.375	<0.001	0.103
BMI_C	−112,842.175	29,743.287	−171,205.870	−54,478.480	−3.794	<0.001	−0.119
male_C_BMI_C	−165,852.368	59,892.677	−283,376.632	−48,328.104	−2.769	0.0057	−0.087
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval: stdEst = Standardized B

Model fit for Night Sleep Efficiency

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
0.235	0.055	0.052	34,835.103	4,167,525.463	15.234	4	1040	<0.001
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA table for Night Sleep Efficiency

Term	SS	DF	MS	F	p-value
male	607,335,010,136,740.000	1	607,335,010,136,740.000	34.801	<0.001
married	198,842,876,129,404.000	1	198,842,876,129,404.000	11.394	<0.001
BMI_C	251,191,211,811,660.000	1	251,191,211,811,660.000	14.393	<0.001
male_C_BMI_C	133,824,421,831,412.000	1	133,824,421,831,412.000	7.668	0.0057
Residuals	18,149,840,569,741,496.000	1040	17,451,769,778,597.592
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

Change in regression coefficients

term	O_std.B	R_std.B	changestdB	reproduce.std.B
Intercept
male	−0.167	−0.1793	−0.0123	Not Reproduced
married	0.112	0.1026	−0.0094	Not Reproduced
BMI_C	−0.150	−0.1188	0.0312	Not Reproduced
male_C_BMI_C	−0.091	−0.0867	0.0043	Not Reproduced
O_std.B = original Standarzised B; R_B = reproduced Standarzised B; Change.std.B = change in R_std.B - O_std.B; Reproduce.std.B = Standarzised B reproduced.

Change in R²

O_R2	R_R2	Change.R2	Reproduce.R2
0.063	0.0553	−0.0077	Not Reproduced
O_R2 = original R2; R_R2 = reproduced R2; Change.R2 = change in R_R2 - O_R2

Change in global F

Term	O_F	R_F	Change.F	Reproduce.F
Intercept	16.727	15.2340	−1.4930	Not Reproduced
O_F = original global F; R_F = reproduced global F; Change.F = change in R_F - O_F; Reproduce.F = Global F reproduced.

Change in degrees of freedom

O_DF1	R_DF1	Change.DF1	Reproduce.DF1	O_DF2	R_DF2	Change.DF2	Reproduce.DF2
4	4	0	Reproduced	999	1,040	41	Not Reproduced
O_DF1 = original degrees of freedom for the model; R_DF1 = reproduced degrees of freedom for the model; Change.DF1 = change in R_DF1 - O_DF1; Reproduce.DF1 = reproduced degrees of freedom for the model (R_DF1 = O_DF1); O_DF2 = original degrees of freedom for the residuals; R_DF2 = reproduced degrees of freedom for the residuals; Change.DF2 = change in R_DF2 - O_DF2; Reproduce.DF2 = reproduced degrees of freedom for the residuals (R_DF2 = O_DF2).

Change in p-values

Term	O_p	R_p	Change.p	Reproduce.p
Intercept		<0.001
male	<0.001	<0.001	0.0000	Reproduced
married	<0.001	<0.001	0.0000	Reproduced
BMI_C	<0.001	<0.001	0.0000	Reproduced
male_C_BMI_C	0.004	0.0057	0.0017	Not Reproduced
O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Bland Altman plot showing differences between original and reproduced p-values for Night Sleep Efficiency

Results for p-values

Some of the p-values for this model were reproduced.

Conclusion computational reproducibility

This model was not computationally reproducible. There are clear differences between degrees of freedom between models, indicating a different number of observations.

As this model was not computationally reproducible, inferential reproducibility was not considered, since the original analyses could not be reproduced and therefore, statistical assumptions could not be meaningfully compared or interpreted.

Model 3

Model results for Midsleep time

Term	p-value	stdEst
Intercept
VitD_concentration_C	0.011	−0.080
Age_C	0.001	−0.154
low_SES_C	0.050	−0.060
alcohol_consumption_C	<0.001	0.192
spring_C	0.034	−0.064
employed_C	<0.001	−0.349
male_age	0.004	−0.086
single_age	0.001	−0.114
male_alcohol_consumption	0.001	−0.127
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval: stdEst = standardized B.

Fit Statistics Midsleep time

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
	0.074	0.065			7.976	10	1036	<0.001
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Midsleep time

Term	SS	DF	MS
VitD_concentration_C
Age_C
low_SES_C
alcohol_consumption_C
spring_C
employed_C
male_age
single_age
male_alcohol_consumption
Residuals	170.441	1000	0.17
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square.

Model results for Midsleep time

Term	B	SE	Lower	Upper	t	p-value	stdEst
Intercept	1.849	0.022	1.806	1.893	83.035	<0.001
VitD_concentration_C	−0.003	0.001	−0.005	−0.000	−2.282	0.0227	−0.072
Age_C	−0.007	0.002	−0.011	−0.004	−4.393	<0.001	−0.200
low_SES_C	−0.074	0.037	−0.146	−0.002	−2.006	0.0451	−0.062
alcohol_consumption_C	0.006	0.002	0.002	0.010	2.865	0.0043	0.231
spring_C	−0.064	0.028	−0.118	−0.010	−2.314	0.0208	−0.071
employed_C	−0.302	0.037	−0.375	−0.230	−8.139	<0.001	−0.356
male_age	0.001	0.001	0.000	0.002	2.027	0.0430	0.079
single_age	0.002	0.001	−0.000	0.003	1.887	0.0594	0.060
male_alcohol_consumption	−0.005	0.002	−0.009	−0.000	−2.029	0.0427	−0.178
SE = Standard error; Lower = lower confidence interval; Upper = upper confidence interval: stdEst = Standardized B

Fit statistics for Midsleep time

R	R2	R2Adj	AIC	RMSE	F	DF1	DF2	p-value
0.290	0.084	0.076	1,085.973	0.406	10.346	9	1017	<0.001
R2 Adj = Adjusted R2; AIC = Akaike Information Criterion; RMSE = The Root Mean Squared Error; DF1 = Degrees of freedom for the model; DF2 = Degrees of freedom for the residuals.

ANOVA Table for Midsleep time

Term	SS	DF	MS	F	p-value
VitD_concentration_C	0.867	1	0.867	5.205	0.0227
Age_C	3.215	1	3.215	19.294	<0.001
low_SES_C	0.671	1	0.671	4.026	0.0451
alcohol_consumption_C	1.367	1	1.367	8.206	0.0043
spring_C	0.892	1	0.892	5.356	0.0208
employed_C	11.036	1	11.036	66.240	<0.001
male_age	0.684	1	0.684	4.107	0.0430
single_age	0.594	1	0.594	3.562	0.0594
male_alcohol_consumption	0.686	1	0.686	4.119	0.0427
Residuals	169.444	1017	0.167
SS = Sum of Squares; DF = Degrees of freedom; MS = Mean Square; Calculated using type III SS.

Visualisation of regression model

Change in regression coefficients

term	O_std.B	R_std.B	changestdB	reproduce.std.B
Intercept
VitD_concentration_C	−0.080	−0.0718	0.0082	Not Reproduced
Age_C	−0.154	−0.2004	−0.0464	Not Reproduced
low_SES_C	−0.060	−0.0623	−0.0023	Not Reproduced
alcohol_consumption_C	0.192	0.2314	0.0394	Not Reproduced
spring_C	−0.064	−0.0713	−0.0073	Not Reproduced
employed_C	−0.349	−0.3564	−0.0074	Not Reproduced
male_age	−0.086	0.0787	0.1647	Not Reproduced
single_age	−0.114	0.0603	0.1743	Not Reproduced
male_alcohol_consumption	−0.127	−0.1782	−0.0512	Not Reproduced
O_std.B = original Standarzised B; R_B = reproduced Standarzised B; Change.std.B = change in R_std.B - O_std.B; Reproduce.std.B = Standarzised B reproduced.

Change in R²

O_R2	R_R2	Change.R2	Reproduce.R2
0.074	0.0839	0.0099	Not Reproduced
O_R2 = original R2; R_R2 = reproduced R2; Change.R2 = change in R_R2 - O_R2

Change in global F

Term	O_F	R_F	Change.F	Reproduce.F
Intercept	7.976	10.3461	2.3701	Not Reproduced
O_F = original global F; R_F = reproduced global F; Change.F = change in R_F - O_F; Reproduce.F = Global F reproduced.

Change in degrees of freedom

O_DF1	R_DF1	Change.DF1	Reproduce.DF1	O_DF2	R_DF2	Change.DF2	Reproduce.DF2
10	9	−1	Not Reproduced	1,036	1,017	−19	Not Reproduced
O_DF1 = original degrees of freedom for the model; R_DF1 = reproduced degrees of freedom for the model; Change.DF1 = change in R_DF1 - O_DF1; Reproduce.DF1 = reproduced degrees of freedom for the model (R_DF1 = O_DF1); O_DF2 = original degrees of freedom for the residuals; R_DF2 = reproduced degrees of freedom for the residuals; Change.DF2 = change in R_DF2 - O_DF2; Reproduce.DF2 = reproduced degrees of freedom for the residuals (R_DF2 = O_DF2).

Change in p-values

Term	O_p	R_p	Change.p	Reproduce.p
Intercept		<0.001
VitD_concentration_C	0.011	0.0227	0.0117	Not Reproduced
Age_C	0.001	<0.001	0.0000	Reproduced
low_SES_C	0.050	0.0451	−0.0049	Not Reproduced
alcohol_consumption_C	<0.001	0.0043	0.0033	Not Reproduced
spring_C	0.034	0.0208	−0.0132	Not Reproduced
employed_C	<0.001	<0.001	0.0000	Reproduced
male_age	0.004	0.0430	0.0390	Not Reproduced
single_age	0.001	0.0594	0.0584	Not Reproduced
male_alcohol_consumption	0.001	0.0427	0.0417	Not Reproduced
O_p = original p-value; R_p = reproduced p-value; Changep = change in p-value R_p - O_p; Reproduce.p = p-values reproduced. SigChangeDirection = statistical significance and B change between original and reproduced models. Note, p-values that were <0.001 were set to 0.00099 for the purposes of comparison.

Bland Altman Plot showing differences between original and reproduced p-values for Midsleep time

Results for p-values

The p-values for this model were not reproduced.

Conclusion computational reproducibility

This model was not computationally reproducible. There are clear differences between degrees of freedom between models, indicating a different number of observations. The degrees of freedom for the fitted model differ from the number of regression coefficients, so two different models for the same analysis have been reported.

As this model was not computationally reproducible, inferential reproducibility was not considered, since the original analyses could not be reproduced, and therefore, statistical assumptions could not be meaningfully compared or interpreted.

References

Disclosure

Methods used in our reproducibility analyses

Summary from statistical review

Data availability and software used

Regression sample

Computational reproducibility results

Inferential reproducibility results

Recommended Changes

Model 1

Model results for Subjective Sleep Quality

Fit statistics for Subjective Sleep Quality

ANOVA table for Subjective Sleep Quality

Model results for Subjective Sleep Quality

Fit statistics for Subjective Sleep Quality

ANOVA table for Subjective Sleep Quality

Visualisation of regression model

Checking residuals plots for patterns

Testing residuals for non linear relationships

Checking univariate relationships with the dependent variable using scatterplots

Linearity results

Testing for homoscedasticity

Homoscedasticity results

Model descriptives including cook’s distance and leverage to understand outliers

Cooks threshold

DFFIT threshold

DFBETA threshold

Influence plot

COVRATIO plot

Observations of interest identified by the influence plot

Results for outliers and influential points

Checking for normality of the residuals using a Q–Q plot

Normality of residuals using Shapiro-Wilk and Kolmogorov-Smirnov tests

Normality results

Assessing collinearity with VIF

Collinearity results

Assessing independence with the Durbin–Watson test for autocorrelation

Independence results

Assumption conclusions

Change in regression coefficients

Change in R2

Change in global F

Change in degrees of freedom

Change in p-values

Bland Altman plot showing differences between original and reproduced p-values for Subjective Sleep Quality

Results for p-values

Conclusion computational reproducibility

Methods

Bootstrapped results

Change in regression coefficients

Change in lower 95% confidence interval

Change in upper 95% confidence interval

Change in Range of 95% confidence interval

Change in p-value significance and regression coefficient direction

Check the distribution of bootstrap estimates

Bootstrapped model conclusions

Inferential reproducibility conclusions

Model 2

Model results for Night Sleep Efficiency

Fit statistics for Night Sleep Efficiency

ANOVA table for Night Sleep Efficiency

Model results Night Sleep Efficiency

Model fit for Night Sleep Efficiency

ANOVA table for Night Sleep Efficiency

Visualisation of regression model

Change in regression coefficients

Change in R2

Change in global F

Change in degrees of freedom

Change in p-values

Bland Altman plot showing differences between original and reproduced p-values for Night Sleep Efficiency

Results for p-values

Conclusion computational reproducibility

Model 3

Model results for Midsleep time

Fit Statistics Midsleep time

ANOVA Table for Midsleep time

Model results for Midsleep time

Fit statistics for Midsleep time

ANOVA Table for Midsleep time

Change in R²

Change in R²

Change in R²