Study finds AI algorithms biased against yellower skin

After reports in 2018 that leading facial analysis algorithms were less accurate on people with darker skin tones, companies including Google and Meta used skin tone measurements to test the effectiveness of their AI software. New research from Sony suggests these tests are blind to an important aspect of human skin color diversity.

Sony researchers say the skin color measurement methods in common use today represent skin color only on a sliding scale from lightest to darkest, or from white to black, and so ignore the contribution of yellow and red hues to the range of human skin color. They found that generative artificial intelligence systems, image-cropping algorithms and photo analysis tools all struggled particularly with yellower skin. The same weakness may apply to a variety of technologies whose accuracy has been shown to be affected by skin tone, such as artificial intelligence software for facial recognition, body tracking and deepfake detection, or gadgets such as heart rate monitors and motion detectors.

Alice Xiang, principal research scientist and global head of AI ethics at Sony, said: "If products are just evaluated in this very singular way, there's a lot of bias that goes undetected and unmitigated. Our hope is that the work we're doing here can help replace some of the existing skin tone scales that really only focus on light versus dark colors."

But not everyone is convinced that existing options are insufficient for grading AI systems. Ellis Monk, a sociologist at Harvard University, said the 10-tone skin color scale he launched with Google last year offers options from light to dark, but it is not one-dimensional. "I have to admit, I'm a little confused by the suggestion that undertones and hues have been ignored in previous research on this," Monk said. "The research effort was devoted to deciding which skin tones to prioritize on the scale, and at which points." He chose the 10 skin tones on his scale based on his own research on colorism and after consulting with other experts and people from underrepresented communities.

X. Eyeé, CEO of AI ethics consulting firm Malo Santo and founder of Google's skin color research team, said the Monk scale was never intended to be a final solution and called Sony's work an important advance. But Eyeé also cautioned that camera positioning can affect CIELAB color values in images, one of several issues that make the standard a potentially unreliable reference point. "Before we can apply skin tone measurements to real-world AI algorithms, such as camera filters and video conferencing, more work needs to be done to ensure measurement consistency," Eyeé said.

The debate over scales is not just academic. Finding an appropriate measure of what AI researchers call "fairness" is a top priority for the tech industry, as lawmakers, including in the European Union and the United States, discuss requiring companies to audit their AI systems and flag risks and flaws. Researchers at Sony said weak assessment methods could undermine some of the regulations' practical benefits.

Regarding skin color, Xiang said efforts to develop further improvements are warranted: "We need to keep trying to make progress. Different measures may prove useful depending on the situation. I'm pleased that there is growing interest in this area after being ignored for so long."

Google spokesman Brian Gabriel said the company welcomed the new study and was reviewing it.

Human skin color comes from the interaction of light with proteins, blood cells, and pigments such as melanin. The standard way to test whether an algorithm is biased by skin tone is to compare its performance across skin tones using the Fitzpatrick scale, which has six options from lightest to darkest. This scale was originally developed by dermatologists to assess the skin's response to UV rays. Last year, artificial intelligence researchers in the tech world praised Google's launch of the Monk scale, saying it was more inclusive.

CIELAB, the international color standard for photo editing and manufacturing, provides a more faithful way to represent the broad spectrum of skin, Sony researchers said in a study presented this week at the International Conference on Computer Vision in Paris. When they applied the CIELAB standard to analyze photos of different people, they found that skin differed not only in tone (the depth of color) but also in hue (the gradation of color).

The inability of common skin tone scales to capture the red and yellow hues in human skin appears to have let some biases in imaging algorithms go undetected. Sony researchers tested open-source artificial intelligence systems, including an image cropper developed by Twitter and a pair of image-generating algorithms, and found that the algorithms favored redder skin, meaning that large numbers of people with yellower skin were underrepresented in the final images the algorithms output. This has the potential to disadvantage diverse populations, including people from East Asia, South Asia, Latin America and the Middle East.

Sony researchers have come up with a new way of representing skin color to capture previously overlooked diversity. Their system uses two coordinates instead of one number to describe skin color in images. It specifies where a skin color falls both from light to dark and from yellow to red, which the cosmetics industry sometimes describes as warm to cool undertones.

The new method works by isolating all pixels in an image showing skin, converting each pixel's RGB color value into a CIELAB code, and then calculating the average tone and hue across the skin pixels. One example from the study showed apparent profile photos of former NFL star Terrell Owens and late actress Eva Gabor with the same skin tone but different hues, with Owens' image appearing more reddish and Gabor's image more yellowish.
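The steps described above can be sketched in a few lines of Python. This is a minimal illustration and not Sony's code: it assumes the skin pixels have already been isolated and are supplied as 8-bit sRGB triples, it uses the standard sRGB-to-CIELAB conversion with a D65 white point, and the function names `rgb_to_lab` and `skin_tone_and_hue` are hypothetical. Taking the hue angle from the averaged a* and b* values (rather than averaging per-pixel angles) is also an assumption made here for simplicity.

```python
import math

# D65 reference white (CIE 1931 2-degree observer), used by sRGB.
_XN, _YN, _ZN = 0.95047, 1.0, 1.08883


def _srgb_to_linear(channel):
    """Undo the sRGB gamma curve for one 8-bit channel (0-255)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t):
    """Piecewise cube-root function from the CIELAB definition."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def rgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIELAB (L*, a*, b*)."""
    rl, gl, bl = (_srgb_to_linear(v) for v in (r, g, b))
    # Linear sRGB to CIE XYZ.
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    fx, fy, fz = _lab_f(x / _XN), _lab_f(y / _YN), _lab_f(z / _ZN)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def skin_tone_and_hue(skin_pixels):
    """Return (mean lightness L*, hue angle in degrees) for skin pixels.

    Assumption: `skin_pixels` is a non-empty list of (r, g, b) tuples
    that some earlier, unspecified step has identified as skin.
    """
    labs = [rgb_to_lab(*p) for p in skin_pixels]
    n = len(labs)
    mean_l = sum(lab[0] for lab in labs) / n
    mean_a = sum(lab[1] for lab in labs) / n
    mean_b = sum(lab[2] for lab in labs) / n
    # Hue angle: smaller angles lean red, larger angles lean yellow.
    hue = math.degrees(math.atan2(mean_b, mean_a))
    return mean_l, hue


# Two made-up skin patches of similar lightness but different hue.
reddish = [(200, 140, 130), (198, 138, 128)]
yellowish = [(200, 160, 110), (198, 158, 108)]
print(skin_tone_and_hue(reddish))
print(skin_tone_and_hue(yellowish))
```

In this sketch the first number plays the role of the light-to-dark coordinate and the second the red-to-yellow coordinate, so two images can share a similar tone while differing clearly in hue.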

When the Sony team applied their approach to data and artificial intelligence systems available online, they discovered significant problems. The researchers found that in CelebAMask-HQ, a popular dataset of celebrity faces used to train facial recognition and other computer vision programs, 82% of images skewed toward red skin hues, while in FFHQ, another dataset developed by NVIDIA, 66% skewed toward red. Two generative AI models trained on FFHQ reproduced this bias: about four out of every five images each of them generated skewed toward red hues.

The problem doesn't stop there. When AI programs ArcFace, FaceNet and Dlib were asked to identify whether two portraits corresponded to the same person, they performed better on redder skin, according to Sony's research. Davis King, the developer of Dlib, said he was not surprised by the bias because the model was primarily trained on photos of American celebrities.

Cloud AI tools offered by Microsoft Azure and Amazon Web Services for detecting smiles also work better on redder tones. Sarah Bird, who leads artificial intelligence engineering at Microsoft, said the company has been increasing its investments in fairness and transparency. Amazon spokesman Patrick Neighorn said: "We welcome collaboration with the research community and we are carefully reviewing this study." NVIDIA declined to comment.

Xiang, who has yellower skin herself, said that uncovering the limitations of today's artificial intelligence testing methods concerns her. Sony will use the new system to analyze its own human-centered computer vision models as they come up for review, she said, declining to specify which ones. "We all have different shades of skin. This should not be used to discriminate against us," she said.

There's another potential advantage to Sony's approach. Measurements like Google's Monk scale require humans to classify where a specific individual's skin falls on the spectrum. AI developers say that task introduces variability, because people's perceptions can be influenced by their location or by their own conceptions of race and identity.

Sony's approach is fully automated and requires no human judgment. But Harvard's Monk questions whether that's better. Objective measurement methods like Sony's can end up simplifying or ignoring other complexities of human diversity. "If our goal is to remove bias, and bias is a social phenomenon, then I'm not so sure we should remove from the analysis how humans view skin color socially," he said.