Daniel Lemire's blog

, 2 min read

If all your attributes are independent two-by-two… are all your attributes independent?

Suppose that you are working on a business problem where you have multiple attributes… maybe you have a table with multiple columns such as “age, gender, income, location, occupation”.

You might be interested in determining whether there are relations between some of these attributes. Maybe the income depends on the gender or the age?

That is fairly easy to do. You can take the gender column and the income column and do some statistics. You can compute Pearson’s correlation or some other measure.

If you have N attributes, you have N (N-1) / 2 distinct pairs of attributes, so there can be many pairs to check, but it is not so bad.

However, what if you have established that there is no significant relationship between any of your attributes when you take them two-by-two. Are you done?

Again, you could check for all possible sets (e.g., {age, gender, income}, {income, location, occupation}). The set of all possible sets is called the power set. It contains 2N sets. So it grows exponentially with N, which means that for any large value of N, it is not practical to check all such sets.

But maybe you think that because you checked all pairs, you are done.

Maybe not.

Suppose that x and y are two attributes taking random integer values. So there is no sensible dependency between x and y. Then introduce z which is given by z = x + y. Clearly, x and y determine z. But there is no pairwise dependency between any of x, y, z.

To be precise, in Java, if you do the following

Random r = new Random();
for(int k = 0; k < N; k++) {
   x[k] = r.nextInt();
   y[k] = r.nextInt();
   z[k] = x[k] + y[k];
}

then there is no correlation between (y, z) or (x, z) even though x + y = z.

So if you look only at (x, y), (y, z) and (x, z), this tells less than you might think about (x, y, z).

Thus, checking relationships pairwise is only the beginning…