Technologists must take responsibility for the toxic ideologies that our data sets and algorithms reflect.
I’ve often been told, “The data does not lie.” However, that has never been my experience. For me, the data nearly always lies. Google Image search results for “healthy skin” show only light-skinned women, and a query on “Black girls” still returns pornography. The CelebA face data set has labels of “big nose” and “big lips” that are disproportionately assigned to darker-skinned female faces like mine. ImageNet-trained models label me a “bad person,” a “drug addict,” or a “failure.” Data sets for detecting skin cancer are missing samples of darker skin types.
White supremacy often appears violently—in gunshots at a crowded Walmart or church service, in the sharp remark of a hate-fueled accusation or a rough shove on the street—but sometimes it takes a more subtle form, like these lies. When those of us building AI systems continue to allow the blatant lie of white supremacy to be embedded in everything from how we collect data to how we define data sets and how we choose to use them, it signifies a disturbing tolerance.
Non-white people are not outliers. Globally, we are the norm, and this doesn’t seem to be changing anytime soon. Data sets so specifically built in and for white spaces represent the constructed reality, not the natural one. To have accuracy calculated in the absence of my lived experience not only offends me, but also puts me in real danger.