Daniel Lemire's blog

Science and Technology links (December 8th 2018)

8 thoughts on “Science and Technology links (December 8th 2018)”

  1. Andrew Dalke says:

    The follow-up letters to Rosenblatt's point make statements like:

    Joel et al. “The high degree of overlap in the form of brain features between females and males combined with the prevalence of mosaicism within brains are at variance with the assumption that sex divides human brains into two separate populations. Moreover, the fact that the large majority of brains consist of unique mosaics of “male-end,” “female-end,” and intermediate (i.e., common in both females and males) features precludes any attempt to predict an individual’s unique brain mosaic on the basis of sex category”

    and Chekroud et al. “Based on these criteria, the authors convincingly establish that there is little evidence for this strict sexually dimorphic view of human brains, counter to the popular lay conception of a “male” and “female” brain.”

    1. His finding is stated as follows:

      By fitting a linear support vector machine (2) to the voxel-based morphometry data reported in ref. 1, we achieve a cross-validated prediction accuracy of about 80% (depending on the random splits). We thus conclude that, whereas the univariate brain attributes (voxel morphometry) are bad predictors of gender, the multivariate brain morphometry is a very good predictor of gender.

      Thus you can predict gender from brain morphometry with an accuracy of 80%.

      Of course, the result might be wrong, but it is a simple classification exercise using available data. One can verify it quickly.

      As far as I can tell, it was never contested. Thus it is reasonable to assume that it is so: if you give me the morphometry of a brain, I can predict the gender well.
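
      To make the exercise concrete, here is a minimal sketch of that kind of check on synthetic data (the real exercise used the voxel-based morphometry from Joel et al.; the sample sizes and effect sizes below are made up): each individual feature barely separates the groups, yet a cross-validated linear SVM on all of them together can classify well.

      ```python
      # Cross-validated linear SVM on synthetic "morphometry-like" features.
      # Per-feature group differences are small, but the multivariate
      # classifier can still be a good predictor.
      import numpy as np
      from sklearn.svm import SVC
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(0)
      n_per_group, n_features = 400, 500
      shift = 0.15 * rng.standard_normal(n_features)   # small per-feature mean difference

      X = np.vstack([
          rng.standard_normal((n_per_group, n_features)) + shift,
          rng.standard_normal((n_per_group, n_features)) - shift,
      ])
      y = np.repeat([1, 0], n_per_group)

      scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)
      print("cross-validated accuracy:", scores.mean())
      ```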

      1. Andrew Dalke says:

        The reply by Joel et al. addresses that 80% result directly.

        Rosenblatt (7) correctly identified an individual’s sex category about 80% of the time … Chekroud et al. (8) correctly identified an individual’s sex category about 89.5–95% of the time, but accuracy dropped to 65–74% when head-size-related measurements were regressed out. This latter finding is in line with previous reports that observed sex/gender differences are largely attributed to differences in brain size (9, 10) (see also figure S4 in ref. 1). Although the different supervised learning methods achieve better accuracy in predicting sex category than the simple method described above, they have the same conceptual problem, namely, it is unclear what the biological meaning of the new space is and in what sense brains that seem close in this space are more similar than brains that seem distant. Moreover, it is unclear whether the brain variability that is represented in the new space is related to sex or rather to physiological, psychological, or social variables that correlate with sex (e.g., weight, socioeconomic status, or type of education) or to a chance difference between the males and females in the sample (2, 4). One way to answer this question is by checking whether a model created to predict sex category in one dataset can accurately predict sex category in another dataset. Using SVM, we found that accuracy may drop dramatically (sometimes to less than 50%) when a model created using a dataset from one geographical region (Tel-Aviv, Beijing, or Cambridge) was tested on the other datasets.
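
        A minimal sketch of the two checks described in the reply, on synthetic data (the sites, the size effect, and all numbers are made up for illustration): train a classifier on one site and test it on another, and regress a brain-size proxy out of every feature before re-fitting.

        ```python
        # Two robustness checks on simulated data: (1) train on one "site" and
        # test on another, (2) regress a brain-size proxy out of every feature
        # first. Everything here is synthetic, for illustration only.
        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.svm import SVC

        rng = np.random.default_rng(1)

        def make_site(n=300, n_features=100, size_effect=1.0):
            sex = rng.integers(0, 2, n)                        # 0 = female, 1 = male
            brain_size = 1.0 + 0.12 * sex + 0.05 * rng.standard_normal(n)
            # features largely driven by overall size, plus feature-specific noise
            X = size_effect * brain_size[:, None] + 0.3 * rng.standard_normal((n, n_features))
            return X, sex, brain_size

        def regress_out(X, covariate):
            """Remove the linear effect of a covariate (e.g. head size) from each feature."""
            reg = LinearRegression().fit(covariate[:, None], X)
            return X - reg.predict(covariate[:, None])

        X_a, y_a, size_a = make_site()                         # "site A"
        X_b, y_b, size_b = make_site(size_effect=0.6)          # "site B", different scaling

        clf = SVC(kernel="linear").fit(X_a, y_a)
        print("train on A, test on B:", clf.score(X_b, y_b))

        clf_adj = SVC(kernel="linear").fit(regress_out(X_a, size_a), y_a)
        print("same, size regressed out:", clf_adj.score(regress_out(X_b, size_b), y_b))
        ```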

        1. Interesting. I had guessed that brain size was an important variable in this problem, and it appears that I was right, but I am surprised by the strength of the effect. Maybe I shouldn’t have been.

          It does not seem right to reject size-related features, but it is an interesting qualification.

          I am not sure I understand the quote, however. The fact that a model can learn to predict gender based on brain features is a data point… but the fact that one model fails to generalize across genetically different populations tells you nothing at all.

          Being able to build a model is informative; failing to do so proves nothing.

          Or do they mean to imply that a single model cannot cover multiple ethnicities? Why would they think so?

          1. Andrew Dalke says:

            The single model of “male genitalia” and “female genitalia” – strongly bimodal, with “intersex” as a third category – does cover multiple ethnicities, so if you don’t think “male brain”/“female brain” does so, then why would you say there are male/female brains?

            Are there male heights and female heights? Someone 157 cm tall is more likely to be female than male, while someone 188 cm tall is more likely to be male. Does that make 157 cm a “female” height? Clearly not, as there are short men, and even subpopulations where most men are under that height.

            I think the argument is that if you try to classify brain features as male and female, you’ll find that far more people have “intersex” brains, with some male and some female features, than people whose features are all male or all female. The numbers cited are ‘0–8.2% internally consistent brains and 23–53% substantially variable brains’.
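
            As a toy version of that argument (this is not Joel et al.’s actual analysis; the 0.3 effect size and the tercile rule are made up for illustration), give every feature a small average group difference, label each person’s feature as “male-end”, “female-end”, or intermediate using population terciles, and count how many people land at the same end on all features:

            ```python
            # Toy "mosaic" count: small average group differences per feature,
            # yet almost nobody sits at one end across all features.
            import numpy as np

            rng = np.random.default_rng(2)
            n, n_features = 10_000, 10
            sex = rng.integers(0, 2, n)                                  # 0 = female, 1 = male
            X = rng.standard_normal((n, n_features)) + 0.3 * (2 * sex[:, None] - 1)

            lo, hi = np.quantile(X, [1 / 3, 2 / 3], axis=0)              # per-feature terciles
            label = np.where(X <= lo, -1, np.where(X >= hi, 1, 0))       # -1, +1, or intermediate

            consistent = np.mean(np.all(label == -1, axis=1) | np.all(label == 1, axis=1))
            mixed = np.mean(np.any(label == -1, axis=1) & np.any(label == 1, axis=1))
            print(f"internally consistent: {consistent:.1%}, mixed-end: {mixed:.1%}")
            ```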

  2. Nathan Kurz says:

    For #3, I think there are some serious issues with the paper that Cook’s simulation is based on (https://dataprivacylab.org/projects/identifiability/paper1.pdf).

    Cook does a simulation using a fixed population per zipcode and a uniform probability of any dob over 0–78 years, and gets an 84% probability of unique identifiability. But the paper, which supposedly accounts for the actual distribution of population across zipcodes and the actual clustered age brackets, gets 87%. The problem is that Cook’s approach should be an upper bound, and any clustering should lower the probability.

    So I read the paper, and found that rather than doing a simulation, the author just used a simple binary “yes/no” for all residents in a zipcode depending on the number of people in their age bracket. On the bright side, this is clearly described in Section 4.3.1 (apart from what I’m hoping is a crucial typo). On the dark side, this means the 87% number doesn’t bear any relation to the simulation that Cook ran, or the actual number of people that are identifiable.

    To get a better idea of what the real number would be, I rewrote a modified version of his simulation to use the actual zipcode populations (https://blog.splitwise.com/2013/09/18/the-2010-us-census-population-by-zip-code-totally-free/). Then I ran it (on Power9!) and got 64% identifiability. If you were to add in the age-specific information (which I didn’t find in my quick searching), this number would drop further, although I don’t know by how much.

    So while I think the paper is right that (zip, dob, sex) does uniquely identify some large percentage of Americans, I’m disappointed that the exact number being touted turns out to be so flimsy. Maybe I’m wrong, but it feels like none of the people currently promoting the paper did any verification on whether it was actually right. When Cook talks about the 20 rejections for the paper, it makes me wonder if maybe peer review was actually doing a good job.
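
    For reference, a sketch of that kind of simulation (the census file name and column layout below are assumptions; the real script used the zipcode populations linked above): for each zipcode, give every resident a uniform random (date of birth, sex) pair out of roughly 79 × 365.25 × 2 possibilities and count how many residents are unique within their zipcode.

    ```python
    # Monte Carlo estimate of how many residents are uniquely identified by
    # (zipcode, date of birth, sex), assuming a uniform dob over ages 0-78.
    # The file name and column name are hypothetical placeholders.
    import csv
    import numpy as np

    D = int(79 * 365.25 * 2)                     # possible (dob, sex) pairs
    rng = np.random.default_rng(0)
    unique = total = 0

    with open("2010_census_population_by_zipcode.csv") as f:
        for row in csv.DictReader(f):
            n = int(row["population"])           # residents in this zipcode
            if n == 0:
                continue
            draws = rng.integers(0, D, size=n)   # one (dob, sex) code per resident
            _, counts = np.unique(draws, return_counts=True)
            unique += int((counts == 1).sum())   # residents with a unique combination
            total += n

    print(f"uniquely identifiable: {unique / total:.1%}")
    ```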

  3. Nathan Kurz says:

    “If you were to add in the age-specific information (which I didn’t find in my quick searching), this number would drop further, although I don’t know by how much.”

    I did figure out how to download the age-bracketed zipcode population data from the census.gov website and massaged it into a form I could work with. The age clustering didn’t have much further effect on the percentage identifiable, shifting the estimate by only a couple of percentage points.

    It’s a little hard to compare the numbers directly, as the five-year age brackets go through age 90 and the previous assumption was a maximum age of 78, but my conclusion is that 63% is a good final estimate. That is, if we use the 5-year age brackets from the 2010 census and the actual populations for zipcodes, a little under 63% of Americans are uniquely identifiable by (zip, sex, dob).
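
    Under the same uniformity assumption there is also a closed-form check that needs no simulation: within a cell of n people who share a zipcode and a 5-year age bracket, a given resident is unique with probability (1 - 1/D)^(n - 1), where D is the number of possible (date of birth, sex) pairs inside the bracket, so the expected identifiable fraction is a population-weighted average over cells. A sketch (the input file and column names are hypothetical):

    ```python
    # Expected fraction of residents unique within their (zipcode, 5-year age
    # bracket) cell, assuming dates of birth are uniform within the bracket.
    # Input layout is a hypothetical placeholder.
    import csv

    D = int(5 * 365.25 * 2)                      # (dob, sex) pairs within one 5-year bracket
    unique = total = 0

    with open("2010_census_population_by_zip_and_age_bracket.csv") as f:
        for row in csv.DictReader(f):
            n = int(row["population"])           # residents in this zip + bracket cell
            if n > 0:
                unique += n * (1 - 1 / D) ** (n - 1)
                total += n

    print(f"expected uniquely identifiable: {unique / total:.1%}")
    ```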

  4. Victor Stewart says:

    I’ve always heard that heart and brain cells don’t regenerate and once gone are gone, etc.

    But this has always made little sense to me. Speaking of heart cells, athletic training in the highest heart-rate zone builds denser heart muscle that contracts more forcefully, and the zone just below it increases the volume of blood pumped per beat. And top athletes are known to have enlarged hearts for these reasons.

    Thus… clearly change is afoot?

    And there are similar arguments about brain plasticity.