Academic context. Prepared as the major assessment for PUBH 726 Applied Biostatistics 2 (Regression Methods), Postgraduate Diploma in Public Health, University of Otago (2025). The dataset was provided by the course and is not real-world data; the analysis was conducted independently. The diploma was completed with distinction overall.

About this project

The primary research question was whether weekly exercise time was associated with stress levels in adolescents aged 11 to 18, after adjustment for other plausible predictors. A secondary question asked whether sleep duration showed any independent association with stress, and whether that association varied by age.

The assessment was deliberately constructed to test the analytical workflow end to end: data integrity, model building, confounder identification, diagnostic checking, sensitivity analysis, and lay translation. None of these steps is glamorous on its own. The value comes from doing all of them carefully and honestly.

Approach

  1. Data preparation. Each variable was checked for missingness, duplication, and implausible values. One participant who reported 170 hours of weekly exercise was excluded, because a week contains only 168 hours. The final analytical sample was 1,332 adolescents.
  2. Descriptive statistics. A standard Table 1 was produced summarising the sample, with means, standard deviations, and medians used appropriately depending on the distribution of each variable.
  3. Model building. A full model was fitted with exercise time, age, sex, BMI, and sleep duration. Multicollinearity was checked using variance inflation factors. Backward elimination was used to remove non-significant predictors. Before a predictor was dropped, it was tested as a potential confounder of the exposure–outcome relationship using the 10% change rule: a predictor is retained if removing it changes the exercise coefficient by more than 10%.
  4. Diagnostics. Linearity was assessed via scatterplots of stress against each continuous predictor. Homoscedasticity was assessed via a plot of residuals against fitted values. Normality of residuals was assessed via histogram and Q-Q plot. Influential points were assessed using DFBETA.
  5. Interaction. For the secondary question, an interaction term between sleep duration and centred age was tested in a separate model.
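
The multicollinearity check and the 10% change rule from step 3 can be sketched in a few lines. This is an illustration only, not the code used for the assessment: the data are synthetic and invented for the example, the helper names (ols_coefs, vif, pct_change_in_exposure) are hypothetical, and plain NumPy least squares stands in for whatever statistical package was actually used.

```python
import numpy as np

def ols_coefs(X, y):
    """Least-squares coefficients for y ~ X; element 0 is the intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2), where R^2 comes
    from regressing that column on all the other columns."""
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        fitted = np.column_stack([np.ones(len(target)), others]) @ ols_coefs(others, target)
        r2 = 1 - np.sum((target - fitted) ** 2) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

def pct_change_in_exposure(X, y, exposure, candidate):
    """Percent change in the exposure coefficient when column `candidate` is dropped."""
    full = ols_coefs(X, y)[1 + exposure]
    reduced_X = np.delete(X, candidate, axis=1)
    # The exposure's index shifts left if the dropped column came before it.
    shifted = exposure - (1 if candidate < exposure else 0)
    reduced = ols_coefs(reduced_X, y)[1 + shifted]
    return 100.0 * abs(reduced - full) / abs(full)

# Synthetic data, invented for this sketch only (NOT the course dataset).
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(11, 18, n)
exercise = 2 + 0.8 * age + rng.normal(0, 2, n)   # exposure, built to correlate with age
unrelated = rng.normal(0, 1, n)                  # covariate with no real role
stress = 10 + 0.3 * exercise + 0.6 * age + rng.normal(0, 3, n)

X = np.column_stack([exercise, age, unrelated])  # column 0 is the exposure
vifs = vif(X)
change_drop_age = pct_change_in_exposure(X, stress, exposure=0, candidate=1)
change_drop_unrelated = pct_change_in_exposure(X, stress, exposure=0, candidate=2)

print("VIFs:", np.round(vifs, 2))
print("Dropping age moves the exposure coefficient by more than 10%:", change_drop_age > 10)
print("Dropping the unrelated covariate does:", change_drop_unrelated > 10)
```

In this made-up example age is built to be a confounder, so removing it shifts the exposure coefficient well past the 10% threshold and it would be kept in the model, while the unrelated covariate can be dropped without materially changing the estimate.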

Key findings

  • After adjustment for age, sex, and sleep duration, exercise time was not significantly associated with stress (coefficient 0.004 per additional weekly hour; 95% CI −0.018 to 0.027; p = 0.71). The confidence interval comfortably included zero.
  • Age was associated with higher stress (0.61 points per year of age; 95% CI 0.38 to 0.85; p < 0.001).
  • Male sex was associated with lower stress (−0.77; 95% CI −1.23 to −0.31; p = 0.001).
  • Sleep duration of 8 or more hours was associated with substantially lower stress (−1.35; 95% CI −1.82 to −0.88; p < 0.001).
  • The interaction between sleep duration and centred age was not statistically significant (p = 0.53), suggesting that the protective association of adequate sleep does not vary meaningfully by age within this adolescent range.
  • The final model explained 6.1% of the variation in stress (R² = 0.061). This modest value indicates that exercise time, age, sex, and sleep duration together account for only a small share of the variation in adolescent stress.
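
As a quick sanity check on the figures above, a reported estimate and its 95% confidence interval can be converted back into an approximate standard error, test statistic, and two-sided p-value. The sketch below assumes a normal (Wald) approximation with a critical value of 1.96, which is reasonable for a sample of this size but is an assumption rather than a description of how the original software computed its p-values. The function name wald_check is hypothetical.

```python
import math

def wald_check(estimate, lower, upper, z_crit=1.96):
    """Recover the standard error, z statistic, and two-sided normal p-value
    from a point estimate and its 95% confidence limits."""
    se = (upper - lower) / (2 * z_crit)      # CI width is 2 * z_crit * SE
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))     # equals 2 * (1 - Phi(|z|))
    return se, z, p

# Estimates and confidence limits exactly as reported in the key findings.
reported = {
    "exercise (per weekly hour)": (0.004, -0.018, 0.027),
    "age (per year)": (0.61, 0.38, 0.85),
    "male sex": (-0.77, -1.23, -0.31),
    "sleep of 8+ hours": (-1.35, -1.82, -0.88),
}

results = {name: wald_check(*values) for name, values in reported.items()}
for name, (se, z, p) in results.items():
    print(f"{name}: SE ~ {se:.3f}, z ~ {z:.2f}, p ~ {p:.3g}")
```

Because the inputs are rounded to two or three decimal places, the recovered p-values will not match the reported ones exactly. For exercise time the check gives a p-value of roughly 0.7, in line with the reported 0.71, and the other three predictors come out well below 0.01.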

Lay summary

In this group of 1,332 adolescents, the amount of weekly exercise was not associated with how stressed they reported feeling. However, adolescents who got fewer than eight hours of sleep, who were older, and who were female all tended to report higher stress, and these patterns held after accounting for the other factors. The strongest takeaway is that sleep matters: getting eight or more hours a night was associated with meaningfully lower stress, even after accounting for age, sex, and how much the adolescent exercised.

What this project demonstrates

This piece is the closest thing in the portfolio to a standalone data analysis project. It is the kind of work I could do for a research team, a public health unit, a health-tech company doing applied analytics, or any organisation that needs a quantitative result explained clearly to a non-technical audience. The skills behind it are multivariable linear regression, model diagnostics, confounder testing, interaction analysis, and, just as importantly, the discipline of translating a statistical model into language a Minister, a policy team, or a community board can actually use.