Jacques' correlation coefficient

Last modified Sunday, March 20, 2022.

I'm reading through the third edition of Eloquent JavaScript by Marijn Haverbeke. Formula 4.1 in Chapter 4 Data Structures: Objects and Arrays kind of comes out of nowhere, and I wanted to see if I could figure out why it works. If other readers find this, I'm curious if you found a derivation that's less convoluted than the one that follows.

\begin{equation*} \phi = \frac{n_{11} n_{00} - n_{10} n_{01}}{\sqrt{n_{1 \bullet} n_{0 \bullet} n_{\bullet 1} n_{\bullet 0}}} \end{equation*}

We watch Jacques one day, waiting for Jacques to perform an activity like 'touched tree.' Define two Bernoulli random variables $S$ and $A$ . $S$ records if Jacques transforms into a squirrel, and $A$ records if Jacques performs the activity of interest. The two random variables are not independent. Jacques may be more likely to transform into a squirrel if they touch a tree. Next, we can mechanically apply formulas from the class we took on introductory probability.

\begin{align*} E(S) &= p_{10} + p_{11} \\ \text{Var}(S) &= (p_{10} + p_{11})(1 - p_{10} - p_{11}) \\ E(A) &= p_{01} + p_{11} \\ \text{Var}(A) &= (p_{01} + p_{11})(1 - p_{01} - p_{11}) \end{align*}

The little p's come from the probability mass function of the random vector $(S, A)$ . The first subscript is a 1 in the event that Jacques turns into a squirrel, and the second subscript is a 1 in the event that Jacques performs the given activity.

Then,

\begin{equation*} \text{Cov}(S, A) = E[SA] - E(S)E(A) = p_{11} - (p_{10} + p_{11})(p_{01} + p_{11}) \end{equation*}

The expected value of the random variable $SA$ is easy, because it is a new Bernoulli variable that is 1 only in the event that Jacques does the activity and turns into a squirrel.

With all these pieces,

\begin{equation*} \text{Corr}(S, A) = \frac{p_{11} - (p_{10} + p_{11})(p_{01} + p_{11})}{\sqrt{(p_{10} + p_{11})(1 - p_{10} - p_{11})}\sqrt{(p_{01} + p_{11})(1 - p_{01} - p_{11})}} \end{equation*}

If we watch Jacques over $n$ days,

p_{sa} = \frac{n_{sa}}{n} = \frac{n_{sa}}{n_{11} + n_{00} + n_{10} + n_{01} }

The little n's are just the number of times we observe the particular outcome associated with the given probability over $n$ days. If we substitute those equations into the equation for $\text{Corr}(S, A)$ and ask Mathematica to simplify the horror that results we get

\begin{equation*} \phi = \frac{n_{11} n_{00} - n_{10} n_{01}}{\sqrt{(n_{11} + n_{10}) (n_{00} + n_{01}) (n_{11} + n_{01}) (n_{00} + n_{10})}} \end{equation*}

which is what we wanted.

Next: Queues in JavaScript
Previous: Hash tables