The Statistics Behind Everyday Science

Apery Kira
Every day we hear and read about how a certain study shows (or does not show) evidence for something, but we do not usually consider whether commentators (or even the authors!) have interpreted the statistical results properly. Today, the field of statistics plays an important role in many academic disciplines, including epidemiology. Dr. Ashok Chaurasia, who is an Assistant Professor of Statistics in the School of Public Health and Health Systems at the University of Waterloo and a CSEB Board member, sat down with me to discuss two topics from his list of research interests: variable selection and missing data.

We first discussed variable selection in regression models. It is well known that many variables influence our health behaviours. For example, peer pressure or parental behaviours can promote or reduce binge drinking. Therefore, researchers should include such variables in their regression models as potential effect modifiers or confounders. However, as the number of relevant variables increases, the list of possible models (combinations of these variables) increases exponentially. From this large list of models, we seek to select the ‘best’ model. By ‘best’ we mean the optimal model for achieving some specific purpose, such as prediction or interpretation, assuming an optimal model exists that closely resembles what exists in nature (the ‘truth’ epidemiologists try to depict using statistical models). In the search for the ‘best’ model, Professor Chaurasia cautions against making statements of causality because association does not imply causation. For causation, one needs a controlled setting such as a randomized controlled trial. Even then, a single well-designed trial is not enough on its own to establish causation.
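To get a feel for how quickly the list of candidate models grows, here is a small illustrative sketch in Python. It is not taken from Professor Chaurasia's work: the variable names are hypothetical, and it assumes a 'model' is simply a choice of which variables to include (no interactions or transformations), so that p variables give 2^p possible models.

```python
from itertools import combinations


def all_models(variables):
    """List every subset of the candidate variables.

    The empty subset stands for the intercept-only model, so p
    variables always produce 2**p candidate models.
    """
    models = []
    for size in range(len(variables) + 1):
        models.extend(combinations(variables, size))
    return models


# Hypothetical predictors of binge drinking, for illustration only.
variables = ["peer_pressure", "parental_drinking", "age", "sex"]
models = all_models(variables)
print(len(models))  # 16

# The count doubles with every added variable.
for p in (5, 10, 20, 30):
    print(p, "variables ->", 2**p, "candidate models")
```

With only four variables the list can still be checked by hand, but at 30 variables there are already more than a billion candidates, which is why a principled selection strategy is needed.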

Professor Chaurasia’s other research interest is the common problem of missing data, specifically the handling of missing data in epidemiological studies. For some context, suppose we intend to collect survey data on 100 subjects, but 20 subjects do not provide complete data, i.e. they do not answer all of the questions. How can we analyze the data efficiently without discarding the incomplete cases? An analysis that discards incomplete cases is often biased and inefficient, and is therefore not preferred. An interesting method of handling missing data is Multiple Imputation (MI; Rubin 1987), which under certain assumptions allows researchers to impute values for the missing data by borrowing information through ‘matching’ incomplete cases to cases that provided complete data. The matching process is repeated multiple times to account for the additional uncertainty due to the missing values. Each repetition yields an imputed dataset; the same regression analysis is then run on each imputed dataset, and the results are combined into a single set of estimates.
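The idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the procedure Professor Chaurasia uses: the data are simulated, the names (`group`, `drinks`) are hypothetical, the 'matching' is a basic hot-deck draw from complete cases in the same group, and the quantity estimated is a mean rather than a regression coefficient. The combining step follows Rubin's rules, in which the total variance is the average within-dataset variance plus (1 + 1/m) times the between-dataset variance.

```python
import random
import statistics


def impute_once(records, rng):
    """Fill each missing 'drinks' value with the value of a randomly
    chosen complete case from the same group (a simple hot-deck match)."""
    donors = {}
    for r in records:
        if r["drinks"] is not None:
            donors.setdefault(r["group"], []).append(r["drinks"])
    completed = []
    for r in records:
        value = r["drinks"]
        if value is None:
            value = rng.choice(donors[r["group"]])
        completed.append(value)
    return completed


def pool(estimates, variances):
    """Combine the per-dataset results using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # spread across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Simulated survey: 100 subjects, 20 of whom did not report 'drinks'.
rng = random.Random(0)
records = []
for i in range(100):
    group = "A" if i % 2 == 0 else "B"
    drinks = rng.gauss(3.0 if group == "A" else 5.0, 1.0)
    records.append({"group": group, "drinks": None if i < 20 else drinks})

m = 20  # number of imputed datasets
estimates, variances = [], []
for _ in range(m):
    values = impute_once(records, rng)
    estimates.append(statistics.mean(values))
    variances.append(statistics.variance(values) / len(values))

pooled_mean, total_variance = pool(estimates, variances)
print("pooled mean:", round(pooled_mean, 2))
print("standard error:", round(total_variance ** 0.5, 3))
```

The between-imputation term is what separates MI from filling in each gap once: it widens the standard error to reflect that the imputed values are guesses. A full MI procedure would also account for uncertainty in the imputation model itself, which this sketch leaves out.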

With just these two examples, we can see the importance of statistical methods in epidemiology. It is safe to say that without these methods, our ability to draw valid scientific conclusions would be severely hindered!

Apery Kira is an Honours Health Studies undergraduate student at the University of Waterloo, currently pursuing a health research specialization. She is in the process of refining her research interests and expresses a strong interest in biostatistics. Apery aims to pursue a career in clinical and health research. During her spare time, Apery enjoys reading, hiking, and baking.