MegaPixels
Analyzing Transnational Flows of Face Recognition Image Training Data
Where does face data originate and who's using it?

Face Datasets and Information Supply Chains

National AI strategies often rely on transnational data sources to capitalize on recent advancements in deep learning and neural networks. Researchers who benefit from these transnational data flows can achieve quick and significant gains across diverse sectors, from health care to biometrics. But new challenges emerge when national AI strategies collide with national interests.

Our earlier research on the MS Celeb and Duke datasets, published with the Financial Times, revealed that several computer vision image datasets created by US companies and universities were unexpectedly also used for research by the National University of Defense Technology in China. They were also used by top Chinese surveillance firms including SenseTime, SenseNets, CloudWalk, Hikvision, and Megvii/Face++, which have all been linked to the oppressive surveillance of Uighur Muslims in Xinjiang.

In this new research for the Munich Security Conference's Transnational Security Report, we provide summary statistics about the origins and endpoints of facial recognition information supply chains. To make it more personal, we gathered additional data on the number of public photos from embassies that are currently being used in facial recognition datasets.

24 Million Non-Cooperative Faces

In total, we found over 24 million non-cooperative, non-consensual face images in 30 publicly available face recognition and face analysis datasets. Of these 24 million images, over 15 million face images are from Internet search engines, over 5.8 million from Flickr.com, over 2.5 million from the Internet Movie Database (IMDb.com), and nearly 500,000 from CCTV footage. All 24 million images were collected without any explicit consent, a type of face image that researchers call "in the wild".

Next, we manually verified 1,134 publicly available research papers that cite these datasets to determine who was using the face data and where it was being used. Even though the vast majority of the images originated in the United States, the publicly available research citations show that only about 25% of citations are from the United States, while the majority of citations are from China.
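A tally like the one described above is straightforward once each verified paper carries a country label. The following is only a minimal sketch under that assumption: the function name `citation_shares` and the sample records are hypothetical and do not reflect the real figures or the authors' actual tooling.

```python
from collections import Counter
from typing import Dict, Iterable


def citation_shares(countries: Iterable[str]) -> Dict[str, float]:
    """Return each country's share of citations as a percentage."""
    counts = Counter(countries)
    total = sum(counts.values())
    if total == 0:
        # No citations means no shares to report.
        return {}
    return {country: 100.0 * n / total for country, n in counts.items()}


# Hypothetical sample: one country label per verified paper.
sample = ["CN", "CN", "US", "CN", "DE", "US", "CN", "GB"]
shares = citation_shares(sample)

# Print the countries from largest to smallest share.
for country, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {pct:.1f}%")
```

Assigning exactly one country per paper is a simplification; papers with authors in several countries would need an explicit counting rule.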

Over 6,000 Embassy Photos Found in Facial Recognition Training Datasets

Out of the 24 million images analyzed, over 6,000 embassy images were found in face recognition training datasets. These images were found by cross-referencing the Flickr IDs between datasets, which located 5,667 images in the MegaFace dataset and 389 images in the IBM Diversity in Faces dataset. Both of these datasets are widely used in academic, industry, and defense research projects. An additional 2,372 images were found in the Who Goes There dataset, which is used for facial ethnicity analysis research.

In total, at least 8,428 embassy images are being used in facial recognition and facial analysis studies in at least 42 countries.
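Cross-referencing by Flickr ID amounts to a set intersection between the embassy photo IDs and the IDs listed in each dataset. The following is a minimal sketch of that idea, not the authors' actual pipeline: the function name `count_matches`, the dataset labels, and all IDs are hypothetical placeholders.

```python
from typing import Dict, Iterable, Set


def count_matches(
    embassy_ids: Iterable[str],
    dataset_ids: Dict[str, Iterable[str]],
) -> Dict[str, int]:
    """Count how many embassy Flickr photo IDs appear in each dataset."""
    embassy_set: Set[str] = set(embassy_ids)
    # Intersecting the sets yields the IDs present in both collections.
    return {
        name: len(embassy_set & set(ids))
        for name, ids in dataset_ids.items()
    }


# Hypothetical IDs for illustration only; not real Flickr photo IDs.
embassy_photo_ids = ["1001", "1002", "1003", "1004"]
datasets = {
    "dataset_a": ["1001", "1002", "9999"],
    "dataset_b": ["1003", "8888"],
}

matches = count_matches(embassy_photo_ids, datasets)
total = sum(matches.values())
print(matches, total)
```

The total reported above is the sum of the per-dataset counts (5,667 + 389 + 2,372 = 8,428). A sum of this kind counts a photo once for each dataset that contains it.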

Embassy Photos in Face Recognition Datasets

The embassy and consulate photos below were all found in the MegaFace or IBM Diversity in Faces facial recognition training datasets. Consulates were only included if marked as "EMBASSY" by the U.S. Department of State’s Social Media Presence List. Photos were chosen because they include an embassy logo.

US Embassy Yaounde, Cameroon
US Embassy Madrid
US Embassy Kabul
US Embassy San Jose
US Embassy Romania
US Embassy Stockholm
US Embassy Malta
US Embassy Kabul Flickr photo found in the MegaFace dataset
US Embassy Canberra Flickr photo found in the MegaFace dataset
US Embassy Tokyo Flickr photo in the MegaFace dataset
US Embassy Kingston Flickr photo in MegaFace dataset

To make this analysis slightly more personal for Munich Security Conference readers, several photos from the US Consulate in Munich were found. Coincidentally, one of the images is from the Deutsch-amerikanischer Datenschutztag.

US Consulate Munich Deutsch-amerikanischer Datenschutztag (data protection day). Photo found in the MegaFace face recognition training dataset
US Consulate Munich Flickr image in the MegaFace dataset

This brief research aims to shed light on the emerging politics of data. A photo is no longer just a photo when it can also be surveillance training data, and datasets can no longer be separated from the development of software when software is now built with data. "Our relationship to computers has changed", says Geoffrey Hinton, one of the founders of modern day neural networks and deep learning. "Instead of programming them, we now show them and they figure it out." [1]

As data becomes more political, national AI strategies might also want to include transnational dataset strategies.

This research post is ongoing and will be updated during July and August 2019.

Supplementary Information

The list of embassies used for this analysis is from the U.S. Department of State’s Social Media Presence List combined with manual search results. In some cases, the official U.S. Dept. of State list describes consulates and missions as embassies. For example, the US Consulate Munich and the US Mission Canada are marked as "EMBASSY". Consulates and missions listed as embassies by the U.S. Dept. of State list are included in this analysis.
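The inclusion rule above can be expressed as a filter on the listed type rather than on the account name. This is a minimal sketch only: the field names `name` and `type`, the function `select_embassies`, and the "Example Consulate" row are assumptions for illustration and do not describe the actual format of the State Department list.

```python
from typing import Dict, List


def select_embassies(accounts: List[Dict[str, str]]) -> List[str]:
    """Keep accounts whose listed type is EMBASSY, whatever the name says."""
    return [
        account["name"]
        for account in accounts
        if account.get("type", "").strip().upper() == "EMBASSY"
    ]


# Hypothetical rows; the field names "name" and "type" are assumptions.
rows = [
    {"name": "US Consulate Munich", "type": "EMBASSY"},
    {"name": "US Mission Canada", "type": "EMBASSY"},
    {"name": "Example Consulate", "type": "CONSULATE"},
]

selected = select_embassies(rows)
print(selected)
```

Filtering on the listed type is what lets a consulate or mission enter the analysis when the list marks it as "EMBASSY".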

Cite Our Work

If you find this analysis helpful, please cite our work:

@online{megapixels,
  author = {Harvey, Adam and LaPlace, Jules},
  title = {MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets},
  year = 2019,
  url = {https://megapixels.cc/},
  urldate = {2019-04-18}
}

References