Algorithms to Investigate Causal Paths to Explain the Incidence of Cardiovascular Disease
Simon Thornley, MPH, MBChB, FAFPHM
Professional Teaching Fellow, Research Fellow, PhD candidate
The University of Auckland, New Zealand

Summary
- Background to study
- Directed Acyclic Graphs (DAGs): What are they? What can they be used for?
- How do computers draw DAGs?
- A look at a case study including risk factors for CVD

My PhD
- Cardiovascular risk prediction
- Screen healthy adults
- Put high-risk ones on drugs
- Distortion of natural history of disease
- How to deal with it when analysing CVD risk?

Primary prevention
In the 1970s, risk factors were identified for the treatment of CVD, from cohort studies:
- Raised blood pressure
- Diabetes status
- Cigarette smoking
- LDL cholesterol level
- Age
Targets for drug treatment.

Assumption
Not just risk factors, but on the causal pathway to disease.

Are they canaries or the miner?

Drug treatment: a summary
[Figure: summary measure of effect (OR/HR) by drug type; axis marked 1.5 (harm), 1.0 (no effect), 0.5 (benefit)]

Drug effects in observational studies
Being on a drug indicates higher, rather than lower, risk after adjustment for all other factors?!
Explanations:
- Unmeasured confounding
- Measurement error
- Drug does harm
For example: Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. BMJ 2007;335(7611):136.

Sydney: Professorial fellow
I've worked a lot with blood pressure epidemiology, and blood pressure-lowering drug use is always associated with higher risk in all observational studies. That is because people who get treated differ from those who don't in too many respects to be able to capture post hoc. That's why observational studies can never replace randomised trials. Estimating causal effect [sic] can only be attempted under very special circumstances in observational studies.

Continued
After much flogging of the analyst [If you followed my advice about the design of the study] you would probably find some evidence of a protective effect of statins (unless all RCTs of statins are wrong).

Statistics and causality
Statistics
- Assesses parameters of a distribution from samples.
- Infers associations.
- Estimates probabilities of past and future events, if experimental conditions remain the same.

Causal analysis
- Infers probabilities under conditions that are changing, e.g. treatments or interventions.

The problem: variable selection
Association with outcome
- Based on relationship with outcome variable (p-value)
- Minimising an information metric (AIC, BIC, Mallows' Cp): fit of data to model (joint probability of data given model), penalised for model complexity
Causal relationship

- What about causal relationships between variables?
- Confounding: shared common cause of exposure and disease.

What are DAGs?
- Graphic: a picture of nodes (variables) and arcs or edges (causal influence)
- Directed: directed causal effects shown
- Acyclic: no arrows from effects to causes

Why use DAGs?
- Encode expert knowledge
- Make assumptions about research question explicit; allow debate
- Link causal to statistical model for causal inference

What do we use DAGs for?
What could give rise to an observed association between exposure and disease?

EXPLAINING OBSERVED ASSOCIATIONS
Confounding
E and D share a common cause (confounding).
[Figure: Exposure ← Confounder → Disease]
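
How a shared cause produces an association can be checked with a little arithmetic. The sketch below (Python, for illustration only, with made-up probabilities and hypothetical function names) builds the joint distribution for a DAG with arcs Confounder → Exposure and Confounder → Disease and no Exposure → Disease arc, then compares the crude odds ratio with the odds ratio inside each stratum of the confounder.

```python
def joint_probability(p_c, p_e_given_c, p_d_given_c):
    """Return P(E=e, D=d) when C -> E and C -> D, with no E -> D arc."""
    joint = {}
    for e in (0, 1):
        for d in (0, 1):
            total = 0.0
            for c in (0, 1):
                pc = p_c if c else 1 - p_c
                pe = p_e_given_c[c] if e else 1 - p_e_given_c[c]
                pd = p_d_given_c[c] if d else 1 - p_d_given_c[c]
                # E and D are independent once C is fixed
                total += pc * pe * pd
            joint[(e, d)] = total
    return joint


def odds_ratio(p11, p10, p01, p00):
    """Odds ratio from the four cell probabilities of a 2x2 table."""
    return (p11 * p00) / (p10 * p01)


# Assumed (illustrative) probabilities, not taken from the study
p_confounder = 0.5
p_exposure = {0: 0.2, 1: 0.8}   # P(E=1 | C=c)
p_disease = {0: 0.1, 1: 0.3}    # P(D=1 | C=c)

joint = joint_probability(p_confounder, p_exposure, p_disease)
crude_or = odds_ratio(joint[(1, 1)], joint[(1, 0)], joint[(0, 1)], joint[(0, 0)])

stratum_ors = []
for c in (0, 1):
    pe, pd = p_exposure[c], p_disease[c]
    stratum_ors.append(
        odds_ratio(pe * pd, pe * (1 - pd), (1 - pe) * pd, (1 - pe) * (1 - pd))
    )

print(f"crude OR = {crude_or:.2f}")
print("stratum ORs =", [round(v, 2) for v in stratum_ors])
```

Under these assumed probabilities the crude odds ratio is about 2.16, while within each stratum of the confounder it is 1: the association comes entirely from the common cause.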

Collider
Association induced by conditioning on a common effect of Exposure and Disease (e.g. selection bias).
[Figure: Exposure → Hospitalisation ← Disease]

True causal association?
[Figure: Exposure → Disease]

Researcher drawn DAG: Serum urate and CVD
[Figure: DAG with nodes Diabetes, Creatinine, HbA1c, BP meds, Obesity, Sex, BP (t-1), BP, Nutrition, Propensity to take preventive treatment, Urate, Gout, HDL, Trigs, LDL (t-1), Ethnic group, Statin therapy, CVD, HDL, Trigs, LDL (t), Smoking]

A computer can do it for us
- Several algorithms available (from computer science, artificial intelligence).
- Start with chi-square tests of independence
- Conditional tests (similar to Mantel-Haenszel test)
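
A minimal sketch of the chi-square test of independence that such algorithms start from, in Python rather than R purely for illustration. The function name `chi_square_2x2` is hypothetical. The counts are the smoking-by-CVD cells from the crude associations table in this deck. The Yates continuity correction is an assumption, since the slides do not say whether one was applied.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square test of independence for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) on 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction (an assumption here), floored at zero
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Smoking status by CVD outcome: 101 CVD events and 6155 non-events in total
smokers_cvd, smokers_no_cvd = 28, 1082
nonsmokers_cvd, nonsmokers_no_cvd = 101 - 28, 6155 - 1082

stat, p = chi_square_2x2(smokers_cvd, smokers_no_cvd,
                         nonsmokers_cvd, nonsmokers_no_cvd)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```

The result can be compared with the P-value of 0.012 reported for smoking status in the crude associations table. A P-value below 0.05 leads the algorithm to keep the smoke-CVD arc at this stage.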

Aim
- Use algorithm to draw DAG for variables used to assess CVD risk
- Inform structure of regression model for causal enquiry and prediction

Technical details may induce somnolence, so do not attempt to drive or operate large machinery after listening to this section.

HOW THE ARTIFICIAL INTELLIGENCE ALGORITHM WORKS

Chi-square tests
- Null: P(smoke, CVD) = P(smoke) P(CVD). No relationship.
- Alt: P(smoke, CVD) ≠ P(smoke) P(CVD). Yes, a relationship exists (association).
- The chi-square distribution gives the distribution of the test statistic assuming independence (null); if the observed statistic lies in its tail (P < 0.05), then assume the null is false.

Conditional chi-square test
- Null: P(smoke, CVD | age) = P(smoke | age) P(CVD | age). No relationship.
- Alt: P(smoke, CVD | age) ≠ P(smoke | age) P(CVD | age). Yes, a relationship exists (causal, if the alternative hypothesis is supported for all subsets of conditioning variables).
- Equivalent to the MH chi-squared test.

THE ALGORITHM
Simplified
1: Determine causal neighbours
- Start with arcs (dependence) between all variables.
- Let set of variables = U.
- For each pair of nodes X (e.g. smoke) and Y (e.g. CVD), determine whether X is independent of Y given any subset of U - {X, Y}.
- If so, drop the edge between X and Y.
- Repeat for all pairs.
2: Causal direction of triplets
- Find colliders: for each triplet X, Y, Z in which X-Z and Y-Z are joined by an edge but X and Y are not (X-Z-Y), if X is dependent on Y | (S ∪ {Z}) for all subsets S of U - {X, Y, Z}, then orient the arcs so that X → Z ← Y.
- Repeat for all triplets.
3: Avoid cycles
- Then orient the other edges so as not to introduce cycles (effect causes cause).
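
Steps 1 and 2 above can be sketched as follows, in Python for illustration only. This is not the bnlearn implementation: the function names and the `indep(x, y, given)` callable are hypothetical, a hand-written oracle for the graph X → Z ← Y stands in for the conditional chi-square test, and step 3 (orienting the remaining edges without creating cycles) is left out.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a set, smallest first."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)


def learn_skeleton(variables, indep):
    """Step 1: start fully connected; drop X-Y if any subset separates them."""
    edges = {frozenset(pair) for pair in combinations(variables, 2)}
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        if any(indep(x, y, s) for s in subsets(others)):
            edges.discard(frozenset((x, y)))
    return edges


def orient_colliders(variables, edges, indep):
    """Step 2: orient X -> Z <- Y for triplets X-Z-Y with X, Y not adjacent."""
    arcs = set()
    for z in variables:
        neighbours = [v for v in variables if frozenset((v, z)) in edges]
        for x, y in combinations(neighbours, 2):
            if frozenset((x, y)) in edges:
                continue  # X and Y are adjacent, so this is not X-Z-Y
            rest = [v for v in variables if v not in (x, y, z)]
            # Collider rule: X stays dependent on Y whenever Z is conditioned on
            if all(not indep(x, y, s | {z}) for s in subsets(rest)):
                arcs.add((x, z))
                arcs.add((y, z))
    return arcs


def collider_oracle(x, y, given):
    """Hypothetical oracle for X -> Z <- Y: X and Y are independent
    only while Z is NOT in the conditioning set."""
    return {x, y} == {"X", "Y"} and "Z" not in given


variables = ["X", "Y", "Z"]
skeleton = learn_skeleton(variables, collider_oracle)
arcs = orient_colliders(variables, skeleton, collider_oracle)
print(sorted(tuple(sorted(edge)) for edge in skeleton))
print(sorted(arcs))
```

With real data the oracle would be replaced by a conditional independence test at the chosen false positive proportion. Because every subset is enumerated, the number of tests grows quickly with the number of variables.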

Note that not all directions may be determined, since, for example, X → Y → Z and X ← Y ← Z are equivalent patterns of conditional dependence.

AI and CVD risk prediction
A WORKED EXAMPLE

PREDICT cohort study
Population
- Patients 30 to 80 years old, free of CVD and heart failure
- CVD risk assessment at GP between 2006 and 2009
- At least 2 years of follow-up

Variables
- Combined CVD events (death or hospital admission): cumulative incidence
- Age at enrolment
- Sex
- Diabetes
- Smoking
- Ethnic group
- Statin and antihypertensive drug use
- Systolic blood pressure
- Family history
- Total to high-density-lipoprotein cholesterol ratio

Software
- bnlearn with R (M. Scutari)
- False positive proportion: 5%
- Tests option: Monte Carlo chi-square, due to small cell counts
- Categorical data only: continuous variables categorised into deciles.

Banned list
- Sex, ethnic group and age must not be caused by any other variable.
- Family history must not be caused by drug treatment variables.
- The outcome, fatal and nonfatal CVD, must not cause any other variable.

The populations
CRUDE ASSOCIATIONS WITH CVD

Characteristic                                CVD (col%)      No CVD (col%)   Total (col%)    Test stat.      P-value
Total                                         101             6155            6256
Gender: men                                   61 (60.4)       3395 (55.2)     3456 (55.2)     Chisq.          0.343
Age at enrolment, mean (SD)                   61.7 (10.2)     54.1 (10.5)     54.2 (10.5)     T-test          < 0.001
Ethnic group                                                                                  Fisher's exact  0.036
  Other                                       62 (61.4)       4348 (70.6)     4410 (70.5)
  Maori                                       22 (21.8)       826 (13.4)      848 (13.6)
  Pacific                                     16 (15.8)       773 (12.6)      789 (12.6)
  Indian                                      1 (1.0)         208 (3.4)       209 (3.3)
Smoking status: yes                           28 (27.7)       1082 (17.6)     1110 (17.7)     Chisq.          0.012
Systolic blood pressure (mmHg), median (IQR)  140 (130, 150)  130 (120, 142)  130 (120, 143)  Rank sum test   < 0.001
Diagnosis of diabetes: yes                    24 (23.8)       896 (14.6)      920 (14.7)      Chisq.          0.0143
Total to HDL-cholesterol ratio, median (IQR)  3.7 (3.1, 4.8)  3.8 (3.1, 4.7)  3.8 (3.1, 4.7)  Rank sum        0.744
Statin treatment at baseline: yes             20 (19.8)       860 (14.0)      880 (14.1)      Chisq.          0.127
Antihypertensive treatment at baseline: yes   48 (47.5)       1637 (26.6)     1685 (26.9)     Chisq.          < 0.001

LET RIP!

The DAG
[Figure: DAG with nodes Ethnic group, Systolic blood pressure, Sex, CVD, Diabetes, Age, Family history of CVD, TC:HDL ratio, Smoking status, Statin use, Antihypertensive use]
[Figure: DAG as plotted, with node labels sex, ethni, age, FHx, TC/HDL, smoke, Statin, AntiH, SBP, CVD, Diabetes]

HOW STRONG ARE THE ARCS?
Use regression

Arc strength
- Software reports p-values, but these are dependent on sample size.

- Instead use regression.
- Cause = independent variable (x); effect = dependent variable (y)
- If effect binary: logistic regression; derive odds ratios
- If effect continuous: linear regression
- For continuous variables: compare 16th and 84th centiles (analogous to a binary variable comparison)
- Adjust for confounders and effect modifiers (e.g. age)

Arc strength
Cause                            Effect                       Low+    High+    Beta-coeff. (95% CI)     Odds ratio (95% CI)
Age                              CVD                          43.4    65.2     1.54 (1.11 to 1.97)      4.65 (3.03 to 7.14)
Age                              Statin use                   43.4    65.2     0.84 (0.69 to 0.99)      2.31 (1.99 to 2.69)
Age                              Antihypertensive use         43.4    65.2     1.44 (1.31 to 1.57)      4.23 (3.72 to 4.82)
Age                              Family history of CVD        43.4    65.2     -0.31 (-0.43 to -0.20)   0.73 (0.65 to 0.82)
Age                              Systolic blood pressure      43.4    65.2     10.42 (9.5 to 11.34)     N/A
Ethnic group (adj. for age)      Family history of CVD        Other   Indian   0.02 (-0.28 to 0.32)     1.02 (0.75 to 1.37)
                                                              Other   Maori    -0.24 (-0.41 to -0.07)   0.79 (0.67 to 0.93)
                                                              Other   Pacific  -1.03 (-1.24 to -0.82)   0.36 (0.29 to 0.44)
Ethnic group (adj. for age)      Smoker                       Other   Indian   -0.66 (-1.16 to -0.17)   0.51 (0.31 to 0.84)
                                                              Other   Maori    1.19 (1.02 to 1.35)      3.28 (2.78 to 3.88)
                                                              Other   Pacific  0.64 (0.46 to 0.83)      1.91 (1.58 to 2.30)
Diabetes (adj. for age)          Statin use                   No      Yes      1.94 (1.77 to 2.10)      6.94 (5.90 to 8.16)
Diabetes (adj. for age)          Antihypertensive use         No      Yes      1.68 (1.53 to 1.84)      5.38 (4.60 to 6.28)
Statin use (adj. for age)        Antihypertensive use         No      Yes      1.70 (1.55 to 1.86)      5.49 (4.69 to 6.42)
Antihypertensive (adj. for age)  Systolic blood pressure      No      Yes      7.30 (6.28 to 8.33)      N/A
Ethnic group (adj. for age)      Diabetes                     Other   Indian   1.89 (1.57 to 2.21)      6.64 (4.83 to 9.12)
                                                              Other   Maori    1.21 (1.01 to 1.41)      3.36 (2.74 to 4.11)
                                                              Other   Pacific  2.04 (1.85 to 2.22)      7.65 (6.36 to 9.22)
Smoker (no adj.)                 CVD                          No      Yes      0.59 (0.15 to 1.03)      1.80 (1.16 to 2.79)
Smoker (no adj.)                 Total:HDL cholesterol ratio  No      Yes      0.51 (0.43 to 0.59)      N/A
Sex (adj. for age)               TC:HDL                       Female  Male     0.55 (0.49 to 0.61)      N/A

So what?
- DAG seems plausible

- Cigarette smoking and age only causal influences on CVD.
- Many causal influences on drug use
- Drugs do not influence CVD risk?
- Cigarette smoking mediator of ethnic group effects
- Is researcher drawn DAG compatible with data? If not, why not?
- Only age and smoking necessary to adjust for when testing causal hypotheses?

Barren proxy
- Variable that has no influence on exposure and outcome (not a true confounder), but is influenced by (a proxy for) one.
- Here, TC:HDL ratio, when considering the smoking → CVD relationship.
[Figure: TC/HDL ← Smoke → CVD]

Limitations
- Limited by sample size: type-2 error rate likely to be high (e.g. only 101 CVD events).
- 5% type-1 error rate.
- With this algorithm, early errors in statistical tests can propagate through the algorithm.

- Cross-sectional relationships may be prone to survival bias.
- Assumptions: no hidden or latent variables, independent subject data, no errors in tests.

Assumption free regression
[Figure: nodes Diabetes, Age, Blood pressure, CVD, TC:HDL, Gender, Smoking]

Summary
- DAGs are useful when considering variable selection for regression modelling.
- Possible to draw DAGs either from data or from informed scientific knowledge.
- Useful to compare researcher drawn DAG with that from data.
- Can help visualise relationships between variables.
- Software available and relatively easy to use.

THE STORY CONTINUES
"The main reason we take so many drugs is that drug companies don't sell drugs, they sell lies about drugs. This is what makes drugs so different from anything else in life. Virtually everything we know about drugs is what the companies have chosen to tell us and our doctors."

Publication bias
Ioannidis JPA, Trikalinos TA. An exploratory test for an excess of significant findings. Clin. Trials 2007;4(3):245-53.
Calculate expected number of positive studies, given:
- Sample size of individual studies
- Number of events in controls
- Summary effect (assumed true)

Statin meta-analysis

Further reading
Pearl, Judea (2010) "An Introduction to Causal Inference," The International Journal of Biostatistics: Vol. 6: Iss. 2, Article 7. DOI: 10.2202/1557-4679.1203. Available at: http://www.bepress.com/ijb/vol6/iss2/7