Statistics 186
Spring 2025
Causal inference concerns the challenging problem of addressing
questions such as, "Would vaccinating children 16 and younger against
COVID-19 lead to fewer deaths among public school teachers?" and "Would
providing Harvard students access to a mobile health application designed to
help them manage school stress lead to improved school performance?" This
class will include 4 modules. The first module introduces the nuanced
world of causal inference along with a fundamental tool: the language of
potential outcomes. The second module covers randomized experiments
and how data from randomized experiments can be used to make causal
statements. The third module introduces the rather tricky problem of
using observational (non-randomized) data to attempt to make causal
statements. The final module introduces a new and challenging area
in which the goal is to make causal inference about the effect of sequences of
treatments.
Professor: Susan Murphy (samurphy@g.harvard.edu).
Classroom & Times: Science Ctr 228, MW 4:30 pm-5:45 pm (EST). No class 2/17, 3/17, 3/19.
TF: Noah Dasanaike (noahdasanaike@g.harvard.edu)
Sections and Office Hours:
Susan Murphy's Office Hours: Monday 6-7pm, 316.09 Science Ctr
Noah Dasanaike's Office Hours: Wednesday 9:45-11:45am, CGIS K415 (book here)
Course Webpage: Stat 186 Canvas site
Book: Imbens, G., & Rubin, D.
(2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An
Introduction. Cambridge: Cambridge University Press.
doi:10.1017/CBO9781139025751. No purchase is necessary; you can download pdf
copies of the book chapters from the Library Reserves section of the Stat 186 Canvas site.
You can also purchase the hard copy or an Adobe e-book.
Other Papers: Scientific papers may be assigned.
Recommended Texts: Hernan MA, Robins JM
(2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. See https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Prerequisites: Stat 110, Stat 111, Stat 139. Probability and
statistical inference are needed extensively, and statistical linear models are
often used.
Computing and Simulation: Some
homework problems will mainly involve statistical reasoning and probability
whereas other homework problems will require programming in R. Students may use other software programs such as Python and Matlab, but we will only provide support for R. I recommend
RStudio as an interface for R. Both R and RStudio are freely available. You are
welcome to use generative AI (ChatGPT, Gemini, etc.)
for assistance in coding.
Typical Class:
4:30pm: Ensure that you have turned in the Quiz on Canvas
4:30pm: Sit with your assigned group
4:30-5:00pm: 30 Min. Lecture
5:00-5:10pm: Breakout with your group (discuss quiz and questions posed in lecture)
5:10-5:20pm: Class Discussion (one of the groups leads the discussion)
5:20-5:45pm: 25 Min. Lecture
Topics Covered:
Module 1: [Potential Outcomes, Assignment Mechanisms]
Dates: 1/27, 1/29
This module provides the crucial
underpinning for the entire course. It will help you think
critically about everyday statements about cause and effect. We
provide a language for investigating causality; this language will help you
translate statements such as "Students who walk at least 10,000 steps per day
are less likely to be stressed" into mathematical statements and then, if needed,
reframe these statements to enhance precision. This in turn allows us to be
precise in how we use data to investigate causality.
Assigned Reading Material: Chapters 1 & 3 of Imbens and Rubin
Module 2: [Randomized Experiments and Associated Data Analyses]
Much of our course focuses on Module 2.
This module will help you understand why randomized experiments facilitate causal
inference and how to reason about and
conduct causal inference with experimental data. It also provides first
hints about how you might conduct causal inference when you have
observational data instead of experimental data.
1. Classical Randomized Experiments
Dates: 2/03, 2/05
This section concerns experimental
settings in which we determine the assignment mechanism, that is, the
probability distribution of the randomized treatment assignment. We will learn
about some of the pros and cons of different approaches to randomization.
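As a rough illustration of what "assignment mechanism" means, the sketch below contrasts two textbook designs: a Bernoulli trial (independent coin flips for each unit) and a completely randomized experiment (a fixed number of treated units). It is written in Python purely for illustration; the function names, the seed, and the sample sizes are assumptions, not course materials.

```python
import random

def bernoulli_assignment(n, p, rng):
    """Each unit is treated independently with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def completely_randomized_assignment(n, n_treated, rng):
    """Exactly n_treated of the n units are treated, all subsets equally likely."""
    treated = set(rng.sample(range(n), n_treated))
    return [1 if i in treated else 0 for i in range(n)]

# Hypothetical example: 10 units, seed chosen arbitrarily for reproducibility.
rng = random.Random(186)
w_bernoulli = bernoulli_assignment(10, 0.5, rng)
w_complete = completely_randomized_assignment(10, 5, rng)

print("Bernoulli trial:        ", w_bernoulli, "treated:", sum(w_bernoulli))
print("Completely randomized:  ", w_complete, "treated:", sum(w_complete))
```

One trade-off this makes visible: the Bernoulli design can, by chance, produce very unbalanced groups, whereas the completely randomized design always treats exactly the chosen number of units.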
Assigned Reading Material: Chapter 4 of Imbens and Rubin.
Other interesting material: "Statistical Properties
of Randomization in Clinical Trials" by Lachin, "Properties of Simple
Randomization in Clinical Trials" by Lachin, and "Randomization in Clinical
Trials: Conclusions and Recommendations" by Lachin, Matts and Wei. These papers
are under Files on Canvas.
2. Fisher's Approach to Causal Reasoning about Treatment Effects for the Population of N Individuals in the Sample
Dates: 2/05, 2/10, 2/12
This section concerns finite-sample
inference: you will learn how to conduct
causal inference about the N units (individuals) in the experiment.
Randomization tests are crucial tools.
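To give a flavor of a randomization test, here is a minimal sketch in Python. It assumes a completely randomized experiment, the sharp null hypothesis of no treatment effect for any unit, and the absolute difference in means as the test statistic. The data and the function name are made up for illustration.

```python
from itertools import combinations

def fisher_exact_p_value(outcomes, treated_idx):
    """Exact randomization p-value under the sharp null of no effect for any unit."""
    n = len(outcomes)
    n_treated = len(treated_idx)

    def diff_in_means(treated):
        treated = set(treated)
        y1 = [outcomes[i] for i in range(n) if i in treated]
        y0 = [outcomes[i] for i in range(n) if i not in treated]
        return sum(y1) / len(y1) - sum(y0) / len(y0)

    observed = abs(diff_in_means(treated_idx))
    # Under the sharp null the outcomes are fixed; only the assignment varies,
    # so we recompute the statistic for every possible assignment.
    stats = [abs(diff_in_means(c)) for c in combinations(range(n), n_treated)]
    # Small tolerance guards against floating-point ties.
    return sum(s >= observed - 1e-12 for s in stats) / len(stats)

# Hypothetical data: 6 units, the first 3 were assigned to treatment.
outcomes = [12.0, 9.0, 11.0, 4.0, 5.0, 6.0]
treated_idx = [0, 1, 2]
p_value = fisher_exact_p_value(outcomes, treated_idx)
print(p_value)
```

With these numbers, 2 of the 20 possible assignments give a statistic at least as extreme as the observed one, so the p-value is 0.1. Enumerating all assignments is feasible only for small N; for larger experiments one would typically sample assignments at random instead.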
Assigned Reading Material: Chapter 5 of Imbens and Rubin. Section 5.1 is critical but very dense. I
suggest rereading Section 5.1 as you go through the other
sections so that you gradually begin to understand it.
Under Files>Readings
Other interesting material: "Statistical Properties
of Randomization in Clinical Trials" by Lachin, "Properties of Simple
Randomization in Clinical Trials" by Lachin, and "Randomization in Clinical
Trials: Conclusions and Recommendations" by Lachin, Matts and Wei. These papers
are under Files on Canvas.
3. Neyman's Approach to Causal Reasoning about Treatment Effects for the Population of N Units in the Sample and its Extension to Using the Sample to Conduct Causal Inference about Treatment Effects in a Large Population
Dates: 02/19, 02/24, 02/26
Often we aim to use the sample of
N units (individuals) to inform decisions about a larger population (for example, should we
provide the new take-home chemotherapy to all adolescents recovering from
leukemia instead of the current take-home chemotherapy?). You will learn why
this type of causal inference both requires more assumptions and, at the same time, is less
restrictive than finite-sample causal inference. You will learn first
statistical approaches to conducting this type of causal inference.
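As one example of such an approach, the sketch below computes the difference-in-means estimate of the average treatment effect together with the usual conservative standard error (the sum of the within-group sample variances, each divided by its group size). The interval uses a large-sample normal approximation, which is an assumption and is not reliable for a sample this small. The data and the function name are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def neyman_estimate(y_treated, y_control):
    """Difference in means, conservative standard error, and approximate 95% interval."""
    tau_hat = mean(y_treated) - mean(y_control)
    # statistics.variance is the sample variance (divides by n - 1).
    var_hat = variance(y_treated) / len(y_treated) + variance(y_control) / len(y_control)
    se = sqrt(var_hat)
    # 1.96 is the normal quantile for a 95% interval (large-sample approximation).
    return tau_hat, se, (tau_hat - 1.96 * se, tau_hat + 1.96 * se)

# Hypothetical outcomes from a small completely randomized experiment.
y_treated = [12.0, 9.0, 11.0]
y_control = [4.0, 5.0, 6.0]
tau_hat, se, interval = neyman_estimate(y_treated, y_control)
print(tau_hat, se, interval)
```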
Assigned Reading Material: Chapter 6 of Imbens and Rubin
4. Using Regression to Conduct Causal Inference
Dates: 03/03, 03/05, 03/10, 03/12
On 03/03 we will return briefly to Fisher's FEP finite-sample inference. Be sure to
read Meaning of Power.pdf on Canvas under Readings before this class. For 03/05, read Chapter 5, Section 5.7, as well
as Chapter 7 in Imbens and Rubin.
Regression is one of the earliest tools used to conduct causal
inference and remains one of the most common.
In regression we often add outside knowledge about the form of the
mean of the outcome conditional on covariates. In this section you will learn
how you can use covariates to improve your ability to detect and conduct
inference about causal effects. You will learn about the consequences of
misspecifying the regression.
Assigned Reading Material: Chapter 7 of Imbens and Rubin; additional reading material may be
assigned.
Module 3: [Observational Studies]
In many areas of science, experiments are
unethical; for example, we might be interested in the causal effect of parental
divorce on children's elementary school performance. In other cases, for monetary or
societal reasons, data from experiments are not available. These are all
settings in which the assignment mechanism is unknown. In this module you will
learn about first approaches to conducting causal inference in these thorny
problems.
1. Unconfounded Treatment Assignment
Dates: 03/24, 03/26, 03/31, 04/02
In this section you will learn about the
critical role the propensity score plays in conducting causal inference, in
particular in settings in which science, along with high-quality
observational data, can be harnessed to explain the assignment mechanism.
Assigned Reading Material: Chapter 12 of Imbens and Rubin; additional reading material may
be assigned.
2. Estimating the Propensity Score
Dates: 04/07, 04/09
In the analysis of observational data with propensity scores, you will need to
estimate the propensity score. You will learn about methods for doing this and
how to think about estimation when the goal is to reduce confounding as opposed
to fitting a good model.
Assigned Reading Material: Chapter 13 of Imbens and Rubin; additional reading
material may be assigned.
3. Using the Propensity Score to Conduct Causal Inference in Observational Studies
Tentative Dates: 04/14, 04/16, 04/21, 04/23
Here we discuss how to
use the propensity score via stratification/blocking in causal inference. If
time permits, we will discuss a second approach, namely propensity score
weighting.
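A minimal sketch of the stratification/blocking idea, in Python for illustration only: units are sorted by an estimated propensity score, split into equal-size strata, and the within-stratum differences in means are averaged with weights proportional to stratum size. The propensity scores are assumed to have been estimated already. The data, the number of strata, the function name, and the choice to skip strata lacking either treated or control units are all simplifying assumptions.

```python
from statistics import mean

def stratified_estimate(y, w, e_hat, n_strata=2):
    """Blocking estimator: size-weighted average of within-stratum differences in means."""
    order = sorted(range(len(y)), key=lambda i: e_hat[i])
    size = len(order) // n_strata
    total, used = 0.0, 0
    for s in range(n_strata):
        # The last stratum absorbs any remainder.
        block = order[s * size:] if s == n_strata - 1 else order[s * size:(s + 1) * size]
        treated = [y[i] for i in block if w[i] == 1]
        control = [y[i] for i in block if w[i] == 0]
        if not treated or not control:
            continue  # no overlap in this stratum; skipped in this sketch
        total += len(block) * (mean(treated) - mean(control))
        used += len(block)
    if used == 0:
        raise ValueError("no stratum contains both treated and control units")
    return total / used

# Hypothetical data: outcomes, treatment indicators, estimated propensity scores.
y = [5.0, 3.0, 4.0, 2.0, 10.0, 9.0, 7.0, 11.0]
w = [1, 0, 0, 0, 1, 1, 0, 1]
e_hat = [0.20, 0.25, 0.30, 0.35, 0.60, 0.65, 0.70, 0.75]
estimate = stratified_estimate(y, w, e_hat, n_strata=2)
print(estimate)
```

Here the low-score stratum gives a difference of 2 and the high-score stratum a difference of 3, so the stratified estimate is 2.5. The naive difference in means ignoring the strata is 8.75 - 4 = 4.75, because in this made-up example both treatment and outcomes tend to be higher in the high-score stratum.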
Assigned Reading Material: Chapter 17 of Imbens and Rubin; additional reading material may be
assigned.
Module 4: [Dynamic Treatment Regimes & Sequential Experimentation]
Tentative Date: 2 classes TBD
This is causal inference
on steroids!! In this module you will learn about how to reason about potential
outcomes when treatments are sequential. When treatments are sequential, it is easy
for the analysis method to accidentally introduce confounding even though the
treatments are randomized.
Assigned Reading Material: MRTs
for Developing Digital Interventions. This paper is written for
behavioral scientists.
Grading: Course grades will be based on a weighted average of
homework scores (30%), quizzes (10%), participation (10%), a midterm exam on
03/26 (20%), and a final exam on a date to be determined (30%). Additional
information about each of these components is below. The course is
letter-graded by default, but you have my permission to switch to SAT/UNSAT
grading if you prefer. If you are considering SAT/UNSAT you should discuss it
with your advisor, and check whether it would count for what you want it to
count for. A grade of SAT corresponds to a letter grade of C- or above.
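For those who like to see the arithmetic, here is a small Python sketch of the weighted average. It assumes each component score is already on a 0-100 scale and already reflects any dropped quizzes or homework. The function name and the example scores are hypothetical. The missed-midterm case follows the rule stated in the Midterm section (the final then counts for 50%).

```python
# Component weights as stated in the Grading paragraph.
WEIGHTS = {
    "homework": 0.30,
    "quizzes": 0.10,
    "participation": 0.10,
    "midterm": 0.20,
    "final": 0.30,
}

def course_score(scores, missed_midterm=False):
    """Weighted average of component scores (each assumed to be on a 0-100 scale)."""
    weights = dict(WEIGHTS)
    if missed_midterm:
        # The midterm's 20% moves to the final, which then counts for 50%.
        weights["final"] += weights.pop("midterm")
    return sum(weight * scores[name] for name, weight in weights.items())

# Hypothetical component scores.
scores = {"homework": 90, "quizzes": 80, "participation": 100, "midterm": 70, "final": 85}
with_midterm = course_score(scores)
without_midterm = course_score(scores, missed_midterm=True)
print(with_midterm, without_midterm)
```

With these example scores the weighted average is 84.5, or 87.5 if the midterm is missed and the final counts for 50%.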
Quizzes: Assigned Readings are listed above in each section. To help
with various circumstances (expected or unexpected), your lowest three (3)
quizzes will be dropped. Monday's Quiz is available on Canvas starting at
4:30pm EST on Sunday and closes at 4:30pm EST on Monday, at the beginning of
class; similarly, Wednesday's Quiz is available on Canvas starting at 4:30pm
EST on Tuesday and closes at 4:30pm EST on Wednesday, at the beginning of
class. Once you start a quiz, you have 10 minutes to complete
it. Collaboration on Quizzes is not
permitted. Use of generative AI on Quizzes is not permitted.
Homework: Problem sets will be assigned every other Thursday at 4pm EST
via Canvas and will be due in Canvas two weeks later, on Thursday at 4pm EST.
The first assignment will appear in Canvas at 4pm on 1/30 and is due
in Canvas at 4pm on 2/13.
Homework must be submitted via the
Canvas course website; no submissions on paper or by email will be accepted.
Your submission must be a single PDF file, no more than 20 MB in size, except
that computer code can be uploaded in a separate supplementary file if that is
more convenient for you (i.e., a .R or .Rmd file with
your R code). The outputs from your code, e.g., plots and summary statistics,
must still be in your main PDF file. Your homework can be typeset, written
using a tablet, or scanned from handwritten work, but must be clear and easily
legible (not blurry or faint), and correctly rotated (e.g., not upside down).
Always check your submission: download it after uploading it in Canvas, and
make sure that it is the correct file and that it got uploaded successfully.
To help with various circumstances
(expected or unexpected), your lowest homework score will be dropped. Further,
you can petition us for up to 2 late homework submissions by emailing Noah Dasanaike (noahdasanaike@g.harvard.edu) by
4pm EST on the Wednesday preceding the homework's due date. A late homework
submission is due 24 hours after the original deadline (that is, Friday at 4pm EST)
and is submitted directly to Noah Dasanaike.
Unless otherwise specified, please show
your work, simplify fully, and give clear, careful justifications for your
answers (using words and sentences to
explain your logic, not just formulas).
Homework Collaboration Policy: Beginning the first week, students are
randomly divided into collaborative groups of people every other week, on
Thursdays at 4pm. This is your discussion + homework group for the next two weeks.
Each student individually submits their homework solution with a list of who
was in their assigned collaboration group. You
must write up your solutions yourself and in your own words. Copying
someone else's solution or a solution from generative AI, or just making
trivial changes to others' solutions (including a
generative AI solution) for the sake of not copying verbatim, is not acceptable.
For example, in problems where you must make up a story or example, two
students should not have the exact same answer, or almost the same answer except
one has an example with dogs chasing cats and the other has an example with
cats chasing mice, with the same structure and the same numbers. I highly
recommend starting problem sets early enough so that you have time to work hard
on the problems on your own first, before discussing them with your group.
Using generative AI tools such as
ChatGPT to help with homework is allowed, but not recommended. In my own
testing of ChatGPT (with GPT-4), it could do some Stat 186 problems correctly
but also made many mistakes. It could be a useful tool for suggesting ideas
(and to chat about the material with or compare answers with) but it is
error-prone. Furthermore, working hard on the homework problems is crucial for
learning the material and preparing for the midterm and final, so even if
ChatGPT did excellent work on the Stat 186 homeworks,
relying on it too much would likely be harmful for overall understanding and
for performance in the class. In any case, your solutions must reflect your own
understanding of the material, explained in your own way, rather than being
copied from any other source.
Participation: Active participation is expected, through attending class
(Mondays and Wednesdays), completing quizzes and engaging in discussions. Stat
186 is a challenging course covering subtle concepts, so let's all try to help
create a supportive, collaborative community.
Midterm: The Midterm is on Wednesday, March
26 at 4:30pm ET in our classroom. You can bring one 8.5 by 11 sheet of
notes (using both back and front) with you to the midterm. Otherwise, the
midterm is closed book, with no internet access and no computer access.
There is no makeup midterm. To help
with various circumstances (expected or unexpected), if you miss the midterm,
your score on your final exam will count for 50% of your grade.
Final: The Final is on Thursday, May 15 (location TBD). You can
bring two 8.5 by 11 sheets of notes (using both back and front) with you to the
final. Otherwise, the final is closed book, no internet access, and no computer
access.
Accommodations: Harvard University's goal is to remove barriers for disabled
students related to inaccessible elements of instruction or design in this
course. If reasonable accommodations are necessary to provide access, please
contact the Disability Access Office
(DAO). Accommodations do not alter fundamental requirements of the course and
are not retroactive. Students should request accommodations as early as
possible, since they may take time to implement. Students should notify DAO at
any time during the semester if adjustments to their communicated accommodation
plan are needed.
Where to get tech help: The Academic
Resources Center has
resources. For tech help you can chat with to see if they can help and/or you
can call the HUIT help
desk.