Daniel Lemire's blog

If all your attributes are independent two-by-two… are all your attributes independent?

16 thoughts on “If all your attributes are independent two-by-two… are all your attributes independent?”

  1. Michael says:

    Hi, if z = x + y, surely there are correlations between (x, z) and (y, z)? Perhaps there needs to be a nonlinear relationship – e.g. an XOR of binary random variables would have this property in this situation

    1. Hi, if z = x + y, surely there are correlations between (x, z) and (y, z)?

      Is there?

      Given full knowledge of x, what can you say about z? By my assumptions, all you can say is that z is x plus some random unknown value… so you cannot know what z is given x.

      1. Tristan Hume says:

        If you have a data set with many (x, y, z) triples and you measure the correlation between (x, z) and (y, z) with linear regression, you’ll find that both have a positive correlation. The correlation will be noisy, but in general higher values of x will likely come with higher values of z. Thus, in a statistical sense, (x, z) and (y, z) aren’t independent.

        1. Thus in a statistical sense (x,z) and (y,z) aren’t independent.

          They are in this example:

          https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2017/12/12/Test.java

          The point of my blog post is that pairwise relations do not capture all relations.

          1. Thomas van Dijk says:

            This experimental result seems to occur because you are adding two uniformly-random full-range integers, and the addition overflows. This “destroys” the correlation. Replace both calls to r.nextInt() with r.nextInt(100), and I get a correlation of about 0.7 on the second test. I didn’t do the math, but I can imagine that “your” x, y and z are indeed all independent.

            Your actual point is, of course, still completely valid: pairwise independence does not imply mutual independence. Thanks for the public service announcement 🙂
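
            To make the two cases concrete, here is a minimal self-contained sketch (not the Test.java linked above; the class and variable names are illustrative) that estimates the correlation of x and z = x + y with full-range and with bounded integers:

            ```java
            import java.util.Random;

            public class OverflowCorrelation {
                // Pearson correlation coefficient of two equal-length samples.
                static double correlation(double[] a, double[] b) {
                    int n = a.length;
                    double ma = 0, mb = 0;
                    for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
                    ma /= n; mb /= n;
                    double cov = 0, va = 0, vb = 0;
                    for (int i = 0; i < n; i++) {
                        cov += (a[i] - ma) * (b[i] - mb);
                        va += (a[i] - ma) * (a[i] - ma);
                        vb += (b[i] - mb) * (b[i] - mb);
                    }
                    return cov / Math.sqrt(va * vb);
                }

                public static void main(String[] args) {
                    int n = 1_000_000;
                    Random r = new Random(42);
                    double[] x1 = new double[n], z1 = new double[n];
                    double[] x2 = new double[n], z2 = new double[n];
                    for (int i = 0; i < n; i++) {
                        // Full-range integers: the sum wraps around modulo 2^32.
                        int a = r.nextInt(), b = r.nextInt();
                        x1[i] = a; z1[i] = a + b;
                        // Bounded integers: no overflow, z genuinely grows with x.
                        int c = r.nextInt(100), d = r.nextInt(100);
                        x2[i] = c; z2[i] = c + d;
                    }
                    System.out.println("full-range corr(x, z) = " + correlation(x1, z1)); // near 0
                    System.out.println("bounded    corr(x, z) = " + correlation(x2, z2)); // near 0.7
                }
            }
            ```

            With full-range integers the measured correlation hovers near 0, while with r.nextInt(100) it comes out near 1/sqrt(2) ≈ 0.707, matching the 0.7 reported above.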

          2. Leonid Boytsov says:

            p(z=a, x=b) = p(y=a-b, x=b) = p(y=a-b)*p(x=b)
            Now, for independence to hold, we need p(z=a, x=b) = p(z=a)*p(x=b), so p(z=a) = p(x+y=a) must always equal p(y=a-b) for arbitrary a and b. Why would this generally be the case?
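
            In the overflowing example it happens to be exactly the case. A sketch of the argument, assuming x and y are independent and uniform over the 2^32 machine integers and z = x + y is taken modulo 2^32:

            ```latex
            p(y = a - b) = 2^{-32} \quad \text{for all } a, b,
            \qquad
            p(z = a) = \sum_{b} p(x = b)\, p(y = a - b) = 2^{-32}.
            ```

            Hence p(z=a, x=b) = p(z=a)*p(x=b) and the pair (x, z) is independent (and symmetrically (y, z)). With a bounded y such as r.nextInt(100), p(y = a-b) varies with a-b and independence fails.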

        2. It depends on the variances of the random variables x and y. If x has a much larger variance than y, you may see a strong correlation between x and z but almost none between y and z. If x and y both have small variances, you would be correct.
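
          For independent x and y under ordinary, non-wrapping arithmetic, the standard computation makes this precise:

          ```latex
          \operatorname{Cov}(x, z) = \operatorname{Cov}(x, x + y) = \operatorname{Var}(x),
          \qquad
          \operatorname{corr}(x, z) = \frac{\sigma_x}{\sqrt{\sigma_x^2 + \sigma_y^2}}.
          ```

          So corr(x, z) approaches 1 when the variance of x dominates and 0 when the variance of y dominates; corr(y, z) behaves symmetrically.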

      2. Ilari Vallivaara says:

        Given full knowledge of x, what can you say about z? By my assumptions, all you can say is that z is x plus some random unknown value… so you cannot know what z is given x.

        You can say things about the distribution of z. More exactly, you can tell that the distribution depends on x. And this is what is required for the random variables to be dependent – not that you know what z is given x.

        There is a possibility that I have missed something profound here, but I genuinely do not think this example was meant to be based on integer overflow.

        1. You can say things about the distribution of z. More exactly, you can tell that the distribution depends on x. And this is what is required for the random variables to be dependent – not that you know what z is given x.

          Your current income z is my income x plus some unknown real number y. So z = x + y. Do I understand correctly that by your standard, it is fair to say that your income depends on my income?

          1. Ilari Vallivaara says:

            Your current income z is my income x plus some unknown real number y. So z = x + y. Do I understand correctly that by your standard, it is fair to say that your income depends on my income?

            Yes, if we are talking about dependence of random variables, as a response to claims like “then there is no correlation between (y, z) or (x, z) even though x + y = z.” Because there most certainly is a correlation, and the random variables are not independent.

            Let’s say your income x is either 10k or 100k a year (p = 0.5 for both). Let’s say my income is z = x + y, where y ~ N(0, 10^2) (a small random variation). Even though I cannot tell exactly what my income is given your income, I can still say that it strongly depends on it. Do you think it’s not fair or accurate to say that?

            It’s still not clear to me whether you chose your Java example on purpose and computed the correlation over overflowing values, or what its main point was. With sensible distributions for x and y – like the suggested r.nextInt(100) – it is easy to verify that if z = x + y, then there is dependence (correlation) between (x, z) (and (y, z)).

            Or maybe I’m totally lost, and the overflowing Java example was just some trick instead of a simple example trying to demonstrate non-correlation. Is correlation even defined for overflowing or wrapping values? Or are we trying to formulate a uniform distribution over all (mathematical) integers or reals? I thought this was not the case, as the earlier post talked about values in sensible ranges (age, income, etc.).

            Could you elaborate on the purpose of your example and what it is supposed to demonstrate?
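
            A quick numerical check of the income example above (a sketch; the class name, seed, and sample size are illustrative):

            ```java
            import java.util.Random;

            public class IncomeCorrelation {
                public static void main(String[] args) {
                    int n = 100_000;
                    Random r = new Random(1);
                    double[] x = new double[n], z = new double[n];
                    for (int i = 0; i < n; i++) {
                        // x is 10k or 100k with probability 0.5 each.
                        x[i] = r.nextBoolean() ? 10_000 : 100_000;
                        // z = x + y with y ~ N(0, 10^2): tiny noise next to the gap in x.
                        z[i] = x[i] + 10.0 * r.nextGaussian();
                    }
                    double mx = 0, mz = 0;
                    for (int i = 0; i < n; i++) { mx += x[i]; mz += z[i]; }
                    mx /= n; mz /= n;
                    double cov = 0, vx = 0, vz = 0;
                    for (int i = 0; i < n; i++) {
                        cov += (x[i] - mx) * (z[i] - mz);
                        vx += (x[i] - mx) * (x[i] - mx);
                        vz += (z[i] - mz) * (z[i] - mz);
                    }
                    // Prints a correlation of essentially 1.0: the 90k gap between
                    // the two income levels dwarfs noise with standard deviation 10.
                    System.out.println("corr(x, z) = " + cov / Math.sqrt(vx * vz));
                }
            }
            ```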

  2. Julian Hyde says:

    Thanks for this work, Daniel. It disproves one of the key assumptions I made in Calcite’s data profiler https://issues.apache.org/jira/browse/CALCITE-1616. Now I need to find an alternative approach.

    For this work, my definition of “independence” is as follows. x and y are independent if the number of distinct values of (x, y) is as you would expect given the number of distinct values of (x) and (y) individually in a particular sample.

    Example 1. In the Customers table, zipcode is dependent on id. There are 1,000,000 records in the table, 1,000,000 distinct values of id, 41,665 distinct values of zipcode, and 1,000,000 distinct values of (id, zipcode) combined. id is in fact a key, so it is unsurprising that zipcode is dependent upon it.

    Example 2. In the Customers table, zipcode is almost dependent on state. Among the 1,000,000 records, there are 50 states, 41,665 zipcodes, and 41,811 combinations of (state, zipcode). In other words, a few zipcodes cross state boundaries, but not many.

    Of course, you can’t prove that there is a functional dependency by looking at a sample of the data; someone could come along in a minute and add a record that breaks the dependency. But for some applications (e.g. query optimization) it is useful to be able to find groups of columns that are approximately functionally dependent.

    Your example reminds me of exotic normal forms. If (x, y), (y, z) and (x, z) are unique, then we have a textbook example of a relation that is in 4th normal form but not 5th normal form. See https://en.wikipedia.org/wiki/Fifth_normal_form. Conventional wisdom is that tables that are in 4NF but not in 5NF are rare in the real world, but I’m not sure the same can be said if (x, y), (y, z) and (x, z) are not necessarily keys.
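
    One way to turn that definition into a test is to compare the observed number of distinct (x, y) pairs against the number expected under independence. The sketch below assumes uniform marginals and is only an illustration, not necessarily what CALCITE-1616 implements:

    ```java
    public class DistinctPairs {
        // Expected number of distinct (x, y) pairs among n rows if x and y
        // were independent and uniform over dx and dy values: each of the
        // dx*dy combinations appears in a given row with probability
        // 1/(dx*dy), hence appears at least once with probability
        // 1 - (1 - 1/(dx*dy))^n.
        static double expectedDistinctPairs(long dx, long dy, long n) {
            double combos = (double) dx * dy;
            return combos * (1.0 - Math.pow(1.0 - 1.0 / combos, n));
        }

        public static void main(String[] args) {
            // Example 2 above: 1,000,000 rows, 50 states, 41,665 zipcodes.
            double expected = expectedDistinctPairs(50, 41_665, 1_000_000);
            // Prints roughly 794,000 -- far more than the 41,811 observed
            // combinations, so (state, zipcode) is very far from independent.
            System.out.println("expected distinct pairs if independent = " + expected);
        }
    }
    ```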

    1. You could build a 2D matrix consisting of the counts of each distinct tuple. Then, for each x, you can compute P(Y=y|X=x) by dividing the count at (x, y) by the sum of the column; if this number is close to 1, you have a functional dependency of Y on X. If P(X=x|Y=y), computed by dividing by the sum of the row, is close to zero, you know the dependency is not mutual. For instance, the probability that the state is NY given that the zip is 10001 is 1, whereas the probability that the zip is 10001 given that the state is NY is approximately zero.
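
      A minimal sketch of that count-matrix idea (the toy data and names are for illustration only):

      ```java
      import java.util.HashMap;
      import java.util.Map;

      public class FunctionalDependency {
          public static void main(String[] args) {
              // Toy sample of (zipcode, state) pairs.
              String[][] rows = {
                  {"10001", "NY"}, {"10001", "NY"}, {"10002", "NY"},
                  {"90210", "CA"}, {"90210", "CA"}, {"73301", "TX"},
              };

              // Sparse 2D matrix: counts.get(x).get(y) = rows with that pair.
              Map<String, Map<String, Integer>> counts = new HashMap<>();
              for (String[] row : rows) {
                  counts.computeIfAbsent(row[0], k -> new HashMap<>())
                        .merge(row[1], 1, Integer::sum);
              }

              // P(Y=y|X=x) = count(x, y) / count(x). If the largest such
              // probability is ~1 for every x, Y is functionally dependent on X.
              for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
                  int total = 0;
                  for (int c : e.getValue().values()) total += c;
                  for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                      System.out.printf("P(state=%s | zip=%s) = %.2f%n",
                              c.getKey(), e.getKey(), (double) c.getValue() / total);
                  }
              }
          }
      }
      ```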

      1. Julian Hyde says:

        Richard, rather than storing a count at each (x, y), I am storing a boolean – whether any records exist that have that (state, zipcode) combination – and then taking an approximate sum of the booleans. Thus the whole matrix is summarized by a single integer. Clearly your approach has more information, but my challenge is to make do with less information, because I want to cover all possible matrices.
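
        In other words, the boolean matrix degenerates into the set of observed (x, y) pairs, and only that set’s (approximate) cardinality is kept. An exact stand-in (a sketch; an approximate distinct-count structure such as HyperLogLog could replace the HashSet):

        ```java
        import java.util.HashSet;
        import java.util.Set;

        public class PairCardinality {
            public static void main(String[] args) {
                String[][] rows = {
                    {"10001", "NY"}, {"10001", "NY"}, {"10002", "NY"},
                    {"90210", "CA"}, {"73301", "TX"},
                };
                // The boolean matrix as a set: an entry is "true" iff seen.
                Set<String> seen = new HashSet<>();
                for (String[] row : rows) {
                    seen.add(row[0] + "\u0000" + row[1]); // cheap pair encoding
                }
                // The "sum of the booleans" is just the set's cardinality.
                System.out.println("distinct (zip, state) pairs: " + seen.size());
            }
        }
        ```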

  3. jld says:

    Don’t know much about statistics, but I guess this is related:
    Multivariate Dependence Beyond Shannon Information

  4. Julian Hyde says:

    jld, I don’t know much about statistics either, but the paper you cite seems to be exactly on target. I will read and digest – thanks!

  5. Maynard Handley says:

    This is well known to mathematicians: pairwise independence of random variables does not imply mutual independence.

    https://en.wikipedia.org/wiki/Pairwise_independence

    A richer version of this comes up when thinking about stochastic processes, which are defined by ALL the joint distribution functions at all collections of times (two distinct times, three distinct times, etc.). If there is no information beyond the two-point distributions, you have a very common case (essentially Markov), but that’s not the ONLY possible situation.
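
    The canonical example from that Wikipedia page can be checked numerically; a minimal sketch with two independent fair bits and their XOR:

    ```java
    import java.util.Random;

    public class PairwiseNotMutual {
        public static void main(String[] args) {
            Random r = new Random(7);
            int n = 1_000_000;
            int[][] xz = new int[2][2], yz = new int[2][2], xy = new int[2][2];
            for (int i = 0; i < n; i++) {
                int x = r.nextInt(2), y = r.nextInt(2), z = x ^ y;
                xz[x][z]++; yz[y][z]++; xy[x][y]++;
            }
            // Each pair of variables is uniform over its four combinations
            // (every count is about n/4), so any two of x, y, z are
            // independent. Yet z is completely determined by (x, y).
            System.out.println("count(x=0, z=0) = " + xz[0][0]);
            System.out.println("count(y=1, z=1) = " + yz[1][1]);
            System.out.println("count(x=1, y=1) = " + xy[1][1]);
        }
    }
    ```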