|
------------
status: published
title: Transnational Flows of Face Recognition Image Training Data
slug: munich-security-conference
desc: Analyzing Transnational Flows of Face Recognition Image Training Data
subdesc: Where does face data originate and who's using it?
cssclass: dataset
image: assets/background.jpg
published: 2019-6-28
updated: 2019-6-29
authors: Adam Harvey
------------
## Face Datasets and Information Supply Chains
### sidebar
+ Images Analyzed: 24,302,637
+ Datasets Analyzed: 30
+ Years: 2006 - 2018
+ Status: Ongoing Investigation
+ Last Updated: June 28, 2019
### end sidebar
National AI strategies often rely on transnational data sources to capitalize on recent advancements in deep learning and neural networks. Researchers who benefit from these transnational data flows can achieve quick and significant gains across diverse sectors, from health care to biometrics. But new challenges emerge when national AI strategies collide with national interests.
Our [earlier research](https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e) on the [MS Celeb](/datasets/msceleb) and [Duke](/datasets/duke_mtmc) datasets, published with the Financial Times, revealed that several computer vision image datasets created by US companies and universities were unexpectedly also used for research by the National University of Defense Technology in China. They were also used by top Chinese surveillance firms including SenseTime, SenseNets, CloudWalk, Hikvision, and Megvii/Face++, all of which have been linked to the oppressive surveillance of Uighur Muslims in Xinjiang.
In this new research for the [Munich Security Conference's Transnational Security Report](https://tsr.securityconference.de) we provide summary statistics about the origins and endpoints of facial recognition information supply chains. To make it more personal, we gathered additional data on the number of public photos from embassies that are currently being used in facial recognition datasets.
### 24 Million Non-Cooperative Faces
In total, we analyzed 30 publicly available face recognition and face analysis datasets that collectively include over 24 million non-cooperative images. Of these 24 million images, over 15 million face images are from Internet search engines, over 5.8 million from Flickr.com, over 2.5 million from the Internet Movie Database (IMDb.com), and nearly 500,000 from CCTV footage. All 24 million images were collected without any explicit consent; researchers call this type of face image "in the wild".
Next, we manually verified 1,134 publicly available research papers that cite these datasets to determine who was using the data and where it was being used. Even though the vast majority of the images originated in the United States, the publicly available research citations show that only about 25% of citations are from the country of origin, while the majority of citations are from China.
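As a rough illustration of the source breakdown above, the following Python sketch turns the rounded figures quoted in the text into approximate shares. The counts are the rounded numbers from the prose ("over 15 million", "nearly 500,000", and so on), not the exact values behind the charts, and the names `source_counts` and `source_shares` are hypothetical helpers introduced only for this example.

```python
# Rounded counts quoted in the text. These are approximations of the
# prose figures, not the exact values in the underlying CSV files.
source_counts = {
    "Internet search engines": 15_000_000,  # "over 15 million"
    "Flickr.com": 5_800_000,                # "over 5.8 million"
    "IMDb.com": 2_500_000,                  # "over 2.5 million"
    "CCTV footage": 500_000,                # "nearly 500,000"
}


def source_shares(counts):
    """Return the total and each source's fraction of that total."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


total, shares = source_shares(source_counts)
print(f"Rounded total: {total:,}")
# List sources from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1%}")
```

Because most of the quoted figures are lower bounds, the rounded total comes out slightly below the 24,302,637 images listed in the sidebar, so the shares should be read as approximate.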
=== columns 2
```
single_pie_chart /site/research/munich_security_conference/assets/megapixels_origins_top.csv
Caption: Sources of Publicly Available Face Image Training Data 2006 - 2018
Top: 10
OtherLabel: Other
```
===
```
single_pie_chart /site/research/munich_security_conference/assets/summary_countries.csv
Caption: Locations Where Face Data Is Used Based on Public Research Citations
Top: 14
OtherLabel: Other
```
=== end columns
### Over 6,000 Embassy Photos Found in Facial Recognition Training Datasets
Of the 5.8 million Flickr images in publicly available face recognition training datasets, over 6,000 photos came from embassy Flickr accounts. These images were mainly used in the MegaFace and IBM Diversity in Faces datasets. Over 2,000 more images were included in the Who Goes There dataset, which is used for facial ethnicity analysis research, for a total of over 8,000 embassy images used in facial analysis studies. A few of the embassy images found in facial recognition datasets are shown below.
=== columns 2
```
single_pie_chart /site/research/munich_security_conference/assets/country_counts.csv
Caption: Photos from these embassies are being used to train face recognition software
Top: 4
OtherLabel: Other
Colors: categoryRainbow
```
=====
```
single_pie_chart /site/research/munich_security_conference/assets/embassy_counts_summary_dataset.csv
Caption: Embassy images were found in these datasets
Top: 4
OtherLabel: Other
Colors: categoryRainbow
```
=== end columns





This brief research aims to shed light on the emerging politics of data. A photo is no longer just a photo when it can also be surveillance training data, and datasets can no longer be separated from the development of software when software is now built with data. "Our relationship to computers has changed", says Geoffrey Hinton, one of the founders of modern-day neural networks and deep learning. "Instead of programming them, we now show them and they figure it out."[^hinton]
As data becomes more political, national AI strategies might also want to include transnational dataset strategies.
*This research post is ongoing and will be updated during July and August, 2019.*
### Further Reading
- [MS Celeb Dataset Analysis](/datasets/msceleb)
- [Brainwash Dataset Analysis](/datasets/brainwash)
- [Duke MTMC Dataset Analysis](/datasets/duke_mtmc)
- [Unconstrained College Students Dataset Analysis](/datasets/uccs)
- [Duke MTMC dataset author apologizes to students](https://www.dukechronicle.com/article/2019/06/duke-university-facial-recognition-data-set-study-surveillance-video-students-china-uyghur)
- [BBC coverage of MS Celeb dataset takedown](https://www.bbc.com/news/technology-48555149)
- [Spiegel coverage of MS Celeb dataset takedown](https://www.spiegel.de/netzwelt/web/microsoft-gesichtserkennung-datenbank-mit-zehn-millionen-fotos-geloescht-a-1271221.html)
{% include 'supplementary_header.html' %}
```
load_file /site/research/munich_security_conference/assets/embassy_counts_public.csv
Headings: Images, Dataset, Embassy, Flickr ID, URL, Guest, Host
```
The list of embassies used for this analysis is from the [U.S. Department of State’s Social Media Presence List](https://www.state.gov/global-social-media-presence/) combined with manual search results. In some cases, the official U.S. Dept. of State list describes consulates and missions as embassies. For example, the US Consulate Munich and the US Mission Canada are marked as "EMBASSY". Only consulates and missions listed as embassies in the U.S. Dept. of State list are included in this analysis.
{% include 'cite_our_work.html' %}
### Footnotes
[^hinton]: "Heroes of Deep Learning: Andrew Ng interviews Geoffrey Hinton". Published on Aug 8, 2017. <https://www.youtube.com/watch?v=-eyhCTvrEtE>
|