phantasmagoria

Jacques' correlation coefficient


I'm reading through the third edition of Eloquent JavaScript by Marijn Haverbeke. Formula 4.1 in Chapter 4, "Data Structures: Objects and Arrays," kind of comes out of nowhere, and I wanted to see if I could figure out why it works. If other readers find this, I'm curious whether you found a derivation that's less convoluted than the one that follows.

\begin{equation*} \phi = \frac{n_{11} n_{00} - n_{10} n_{01}}{\sqrt{n_{1 \bullet} n_{0 \bullet} n_{\bullet 1} n_{\bullet 0}}} \end{equation*}

We watch Jacques for one day, waiting for them to perform an activity like 'touched tree.' Define two Bernoulli random variables $S$ and $A$. $S$ is 1 if Jacques transforms into a squirrel and 0 otherwise, and $A$ is 1 if Jacques performs the activity of interest and 0 otherwise. The two random variables are not necessarily independent: Jacques may be more likely to transform into a squirrel if they touch a tree. Next, we can mechanically apply formulas from an introductory probability class.

\begin{align*} E(S) &= p_{10} + p_{11} \\ \text{Var}(S) &= (p_{10} + p_{11})(1 - p_{10} - p_{11}) \\ E(A) &= p_{01} + p_{11} \\ \text{Var}(A) &= (p_{01} + p_{11})(1 - p_{01} - p_{11}) \end{align*}

The little p's come from the probability mass function of the random vector $(S, A)$, so $p_{sa} = P(S = s, A = a)$. The first subscript is a 1 in the event that Jacques turns into a squirrel, and the second subscript is a 1 in the event that Jacques performs the given activity.

Then,

\begin{equation*} \text{Cov}(S, A) = E[SA] - E(S)E(A) = p_{11} - (p_{10} + p_{11})(p_{01} + p_{11}) \end{equation*}

The expected value of the random variable $SA$ is easy, because it is a new Bernoulli variable that is 1 only in the event that Jacques does the activity and turns into a squirrel.
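As a sanity check on the moments so far, here is a minimal Python sketch that computes the expectations, variances, and covariance by brute-force summation over the four outcomes and compares them with the closed forms. The probabilities in `pmf` are made-up numbers chosen only for illustration, and `expect` is a hypothetical helper, not something from the book.

```python
import math
from itertools import product

# Hypothetical joint pmf for (S, A): keys are (s, a), values sum to 1.
pmf = {(0, 0): 0.70, (0, 1): 0.15, (1, 0): 0.05, (1, 1): 0.10}


def expect(f):
    """Expected value of f(S, A), summing over all four outcomes."""
    return sum(f(s, a) * pmf[(s, a)] for s, a in product((0, 1), repeat=2))


# Moments computed directly from the definition.
e_s = expect(lambda s, a: s)
e_a = expect(lambda s, a: a)
var_s = expect(lambda s, a: (s - e_s) ** 2)
var_a = expect(lambda s, a: (a - e_a) ** 2)
cov = expect(lambda s, a: (s - e_s) * (a - e_a))

# Closed forms from the text, written with the same subscripts.
p10, p01, p11 = pmf[(1, 0)], pmf[(0, 1)], pmf[(1, 1)]
var_s_closed = (p10 + p11) * (1 - p10 - p11)
var_a_closed = (p01 + p11) * (1 - p01 - p11)
cov_closed = p11 - (p10 + p11) * (p01 + p11)

print(math.isclose(var_s, var_s_closed))
print(math.isclose(var_a, var_a_closed))
print(math.isclose(cov, cov_closed))
```

Any other four non-negative numbers that sum to 1 should work equally well in `pmf`.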

With all these pieces,

\begin{equation*} \text{Corr}(S, A) = \frac{p_{11} - (p_{10} + p_{11})(p_{01} + p_{11})}{\sqrt{(p_{10} + p_{11})(1 - p_{10} - p_{11})}\sqrt{(p_{01} + p_{11})(1 - p_{01} - p_{11})}} \end{equation*}
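This expression can also be tidied up by hand. The sketch below assumes only that the four probabilities sum to 1, so that $p_{00} = 1 - p_{10} - p_{01} - p_{11}$ (the symbol $p_{00}$ does not appear in the formulas above and is introduced here for the outcome where neither event happens).

```latex
\begin{align*}
p_{11} - (p_{10} + p_{11})(p_{01} + p_{11})
  &= p_{11}(1 - p_{10} - p_{01} - p_{11}) - p_{10} p_{01}
   = p_{11} p_{00} - p_{10} p_{01} \\
1 - p_{10} - p_{11} &= p_{00} + p_{01} \\
1 - p_{01} - p_{11} &= p_{00} + p_{10} \\
\text{Corr}(S, A)
  &= \frac{p_{11} p_{00} - p_{10} p_{01}}
          {\sqrt{(p_{11} + p_{10})(p_{00} + p_{01})(p_{11} + p_{01})(p_{00} + p_{10})}}
\end{align*}
```

In this form the numerator is a difference of products of two probabilities, and the denominator is the square root of a product of four sums of probabilities. Replacing every $p_{sa}$ with a count divided by a common total therefore puts the same factor (one over the total squared) in both numerator and denominator, and it cancels.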

If we watch Jacques over $n$ days, we can estimate each probability by its observed relative frequency,

\begin{equation*} p_{sa} = \frac{n_{sa}}{n} = \frac{n_{sa}}{n_{11} + n_{00} + n_{10} + n_{01}} \end{equation*}

The little n's are just the number of times we observe the particular outcome associated with the given probability over the $n$ days. If we substitute those equations into the equation for $\text{Corr}(S, A)$ and ask Mathematica to simplify the horror that results, we get

\begin{equation*} \phi = \frac{n_{11} n_{00} - n_{10} n_{01}}{\sqrt{(n_{11} + n_{10}) (n_{00} + n_{01}) (n_{11} + n_{01}) (n_{00} + n_{10})}} \end{equation*}

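which is what we wanted: the four factors under the square root are, in order, the row and column totals $n_{1 \bullet}$, $n_{0 \bullet}$, $n_{\bullet 1}$, and $n_{\bullet 0}$ from Formula 4.1.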
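The result can also be checked numerically. The following is a rough Python sketch, not the book's implementation: `phi` evaluates the count-based formula, and `pearson` is a hypothetical helper that computes the ordinary correlation coefficient from a day-by-day list of 0/1 observations. The counts are illustrative values, not data from this post.

```python
import math


def phi(n11, n00, n10, n01):
    """Count-based formula; the first subscript is S, the second is A."""
    numerator = n11 * n00 - n10 * n01
    denominator = math.sqrt(
        (n11 + n10) * (n00 + n01) * (n11 + n01) * (n00 + n10)
    )
    return numerator / denominator


def pearson(xs, ys):
    """Plain correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)


# Illustrative counts keyed by (s, a); assumed values for this sketch.
counts = {(1, 1): 1, (0, 0): 76, (1, 0): 4, (0, 1): 9}

# Expand the counts into one (s, a) observation per day.
days = [(s, a) for (s, a), k in counts.items() for _ in range(k)]
s_values = [s for s, _ in days]
a_values = [a for _, a in days]

from_counts = phi(counts[(1, 1)], counts[(0, 0)], counts[(1, 0)], counts[(0, 1)])
from_days = pearson(s_values, a_values)
print(from_counts, from_days)
```

The two printed values should agree up to floating-point rounding. Note that both functions divide by zero if any row or column total is zero, that is, if $S$ or $A$ never varies over the observed days.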