<!doctype html>
<html>
<head>
  <title>MegaPixels</title>
  <meta charset="utf-8" />
  <meta name="author" content="Adam Harvey" />
  <meta name="description" content="Microsoft Celeb 1M is a target list and dataset of web images used for research and development of face recognition technologies" />
  <meta name="referrer" content="no-referrer" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <link rel='stylesheet' href='/assets/css/fonts.css' />
  <link rel='stylesheet' href='/assets/css/css.css' />
  <link rel='stylesheet' href='/assets/css/leaflet.css' />
  <link rel='stylesheet' href='/assets/css/applets.css' />
</head>
<body>
  <header>
    <a class='slogan' href="/">
      <div class='logo'></div>
      <div class='site_name'>MegaPixels</div>
      <div class='splash'>Microsoft Celeb</div>
    </a>
    <div class='links'>
      <a href="/datasets/">Datasets</a>
      <a href="/about/">About</a>
    </div>
  </header>
  <div class="content content-dataset">
    
  <section class='intro_section' style='background-image: url(https://nyc3.digitaloceanspaces.com/megapixels/v1/datasets/msceleb/assets/background.jpg)'><div class='inner'><div class='hero_desc'><span class='bgpad'>Microsoft Celeb 1M is a target list and dataset of web images used for research and development of face recognition technologies</span></div><div class='hero_subdesc'><span class='bgpad'>The MS Celeb dataset includes over 10 million images of about 100K people and a target list of 1 million individuals
</span></div></div></section><section><h2>Microsoft Celeb Dataset (MS Celeb)</h2>
</section><section><div class='right-sidebar'><div class='meta'>
    <div class='gray'>Published</div>
    <div>2016</div>
  </div><div class='meta'>
    <div class='gray'>Images</div>
    <div>10,000,000 </div>
  </div><div class='meta'>
    <div class='gray'>Identities</div>
    <div>100,000 </div>
  </div><div class='meta'>
    <div class='gray'>Purpose</div>
    <div>Large-scale face recognition</div>
  </div><div class='meta'>
    <div class='gray'>Created by</div>
    <div>Microsoft Research</div>
  </div><div class='meta'>
    <div class='gray'>Funded by</div>
    <div>Microsoft Research</div>
  </div><div class='meta'>
    <div class='gray'>Website</div>
    <div><a href='http://www.msceleb.org/' target='_blank' rel='nofollow noopener'>msceleb.org</a></div>
  </div></div><p>Microsoft Celeb (MS Celeb) is a dataset of 10 million face images scraped from the Internet and used for research and development of large-scale biometric recognition systems. According to Microsoft Research, which created and published the <a href="http://msceleb.org">dataset</a> in 2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of nearly 100,000 individuals. Microsoft's goal in building this dataset was to distribute an initial training dataset of images of 100,000 individuals and use it to accelerate research into recognizing a target list of one million individuals from their face images "using all the possibly collected face images of this individual on the web as training data".<a class="footnote_shim" name="[^msceleb_orig]_1"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 2">2</a></p>
<p>These one million people, defined by Microsoft Research as "celebrities", are often merely people who must maintain an online presence for their professional lives. Microsoft's list of 1 million people is an expansive exploitation of the current reality that for many people, including academics, policy makers, writers, artists, and especially journalists, maintaining an online presence is mandatory. That necessity should not allow Microsoft (or anyone else) to use their biometrics for research and development of surveillance technology. Many of the names in the target list even belong to people critical of the very technology Microsoft is using their names and biometric information to build. The list includes digital rights activists like Jillian York; artists critical of surveillance including Trevor Paglen, Hito Steyerl, Kyle McDonald, Jill Magid, and Aram Bartholl; Intercept founders Laura Poitras, Jeremy Scahill, and Glenn Greenwald; Data and Society founder danah boyd; and even Julie Brill, the former FTC commissioner responsible for protecting consumers' privacy, to name a few.</p>
<h3>Microsoft's 1 Million Target List</h3>
<p>Below is a selection of names from the list of 1 million individuals, curated to illustrate Microsoft's expansive and exploitative practice of scraping the Internet for biometric training data. The entire name file can be downloaded from <a href="https://msceleb.org">msceleb.org</a>. Names appearing with * indicate that Microsoft also distributed images.</p>
</section><section><div class='columns columns-2'><div class='column'><table>
<thead><tr>
<th>Name</th>
<th>ID</th>
<th>Profession</th>
<th>Images</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jeremy Scahill</td>
<td>/m/02p_8_n</td>
<td>Journalist</td>
<td>x</td>
</tr>
<tr>
<td>Jillian York</td>
<td>/m/0g9_3c3</td>
<td>Digital rights activist</td>
<td>x</td>
</tr>
<tr>
<td>Astra Taylor</td>
<td>/m/05f6_39</td>
<td>Author, activist</td>
<td>x</td>
</tr>
<tr>
<td>Jonathan Zittrain</td>
<td>/m/01f75c</td>
<td>EFF board member</td>
<td>no</td>
</tr>
<tr>
<td>Bruce Schneier</td>
<td>m.095js</td>
<td>Cryptologist and author</td>
<td>yes</td>
</tr>
<tr>
<td>Julie Brill</td>
<td>m.0bs3s9g</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Kim Zetter</td>
<td>/m/09r4j3</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Ethan Zuckerman</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Jill Magid</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Kyle McDonald</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Trevor Paglen</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>R. Luke DuBois</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
</tbody>
</table>
</div><div class='column'><table>
<thead><tr>
<th>Name</th>
<th>ID</th>
<th>Profession</th>
<th>Images</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ai Weiwei</td>
<td>/m/0278dyq</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Jer Thorp</td>
<td>/m/01h8lg</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Edward Felten</td>
<td>/m/028_7k</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Evgeny Morozov</td>
<td>/m/05sxhgd</td>
<td>Scholar and technology critic</td>
<td>yes</td>
</tr>
<tr>
<td>danah boyd</td>
<td>/m/06zmx5</td>
<td>Data and Society founder</td>
<td>x</td>
</tr>
<tr>
<td>Laura Poitras</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Shoshana Zuboff</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Eyal Weizman</td>
<td>m.0g54526</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Aram Bartholl</td>
<td>m.06_wjyc</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>James Risen</td>
<td>m.09pk6b</td>
<td>x</td>
<td>x</td>
</tr>
</tbody>
</table>
</div></div></section><section><p>After publishing this list, researchers from Microsoft Asia worked with researchers affiliated with China's National University of Defense Technology (controlled by China's Central Military Commission) and used the MS Celeb dataset for their <a href="https://www.semanticscholar.org/paper/Faces-as-Lighting-Probes-via-Unsupervised-Deep-Yi-Zhu/b301fd2fc33f24d6f75224e7c0991f4f04b64a65">research paper</a>, "Faces as Lighting Probes via Unsupervised Deep Highlight Extraction", which has potential applications in 3D face recognition.</p>
<p>In an article published by the Financial Times based on data discovered during this investigation, Samm Sacks (senior fellow at New America and China tech policy expert) commented that this research raised "red flags because of the nature of the technology, the authors' affiliations, combined with what we know about how this technology is being deployed in China right now".<a class="footnote_shim" name="[^madhu_ft]_1"> </a><a href="#[^madhu_ft]" class="footnote" title="Footnote 3">3</a></p>
<p>Four more papers published by SenseTime, which also use the MS Celeb dataset, raise similar flags. SenseTime is a Beijing-based company that provides surveillance technology to Chinese authorities and has been <a href="https://www.nytimes.com/2019/04/14/technology/china-surveillance-artificial-intelligence-racial-profiling.html">flagged</a> as complicit in potential human rights violations.</p>
<p>One of the four SenseTime papers, "Exploring Disentangled Feature Representation Beyond Face Identification", shows how SenseTime is developing automated face analysis technology to infer race, narrow eyes, nose size, and chin size, all of which could be used to target vulnerable ethnic groups based on their facial appearance.<a class="footnote_shim" name="[^disentangled]_1"> </a><a href="#[^disentangled]" class="footnote" title="Footnote 4">4</a></p>
<p>In December 2018, Microsoft President <a href="https://blogs.microsoft.com/on-the-issues/2018/12/06/facial-recognition-its-time-for-action/">Brad Smith</a> called for governmental regulation of face recognition, citing the potential for misuse, a rare admission that Microsoft's surveillance-driven business model had lost its bearings. More recently, Smith also <a href="https://www.reuters.com/article/us-microsoft-ai/microsoft-turned-down-facial-recognition-sales-on-human-rights-concerns-idUSKCN1RS2FV">announced</a> that Microsoft would seemingly take a stand against potential misuse and had decided not to sell face recognition to an unnamed United States law enforcement agency, citing that its technology was not accurate enough to be used on minorities because it was trained mostly on white male faces.</p>
<p>What the decision to block the sale announces is not so much that Microsoft has upgraded its ethics, but that it has publicly acknowledged it can't sell a data-driven product without data. Microsoft can't sell face recognition for faces it can't train on.</p>
<p>Until now, that data has been freely harvested from the Internet and packaged in training sets like MS Celeb, which are overwhelmingly <a href="https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html">white</a> and <a href="https://gendershades.org">male</a>. Without balanced data, facial recognition contains blind spots. And without datasets like MS Celeb, powerful yet inaccurate facial recognition services like Microsoft's Azure Cognitive Services would not be able to see at all.</p>
<p>Microsoft didn't only create MS Celeb for other researchers to use; it also used the dataset internally. In a publicly available 2017 Microsoft Research project called "<a href="https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/">One-shot Face Recognition by Promoting Underrepresented Classes</a>", Microsoft leveraged the MS Celeb dataset to analyze its algorithms and advertise the results. Interestingly, Microsoft's <a href="https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/">corporate version</a> does not mention that the MS Celeb dataset was used, but the <a href="https://www.semanticscholar.org/paper/One-shot-Face-Recognition-by-Promoting-Classes-Guo/6cacda04a541d251e8221d70ac61fda88fb61a70">open-access version</a> of the paper, published on arxiv.org that same year, explicitly mentions that Microsoft Research tested their algorithms "on the MS-Celeb-1M low-shot learning benchmark task."</p>
<p>We suggest that if Microsoft Research wants biometric data for surveillance research and development, it should start with its own researchers' biometric data instead of scraping the Internet for journalists, artists, writers, and academics.</p>
</section><section>
  <h3>Who used Microsoft Celeb?</h3>

  <p>
    This bar chart presents a ranking of the top countries where dataset citations originated.  Mouse over individual columns to see yearly totals. These charts show at most the top 10 countries.
  </p>
 
 </section>

<section class="applet_container">
 <div class="applet" data-payload="{&quot;command&quot;: &quot;chart&quot;}"></div>
</section>

<section class="applet_container">
 <div class="applet" data-payload="{&quot;command&quot;: &quot;piechart&quot;}"></div>
</section>

<section>
	
	<h3>Biometric Trade Routes</h3>

	<p>
		To help understand how Microsoft Celeb has been used around the world by commercial, military, and academic organizations, existing publicly available research citing the Microsoft Celeb dataset was collected, verified, and geocoded to show the biometric trade routes of people appearing in the images. Click on the markers to reveal research projects at that location.
	</p>
 
 </section>

<section class="applet_container fullwidth">
 <div class="applet" data-payload="{&quot;command&quot;: &quot;map&quot;}"></div>
</section>

<div class="caption">
	<ul class="map-legend">
	<li class="edu">Academic</li>
	<li class="com">Commercial</li>
	<li class="gov">Military / Government</li>
	</ul>
	<div class="source">Citation data is collected using <a href="https://semanticscholar.org" target="_blank">SemanticScholar.org</a>; dataset usage is then verified and geolocated.</div>
</div>


<section class="applet_container">

  <h3>Dataset Citations</h3>
  <p>
    The dataset citations used in the visualizations were collected from <a href="https://www.semanticscholar.org">Semantic Scholar</a>, a website which aggregates and indexes research papers.  Each citation was geocoded using names of institutions found in the PDF front matter, or as listed on other resources.  These papers have been manually verified to show that researchers downloaded and used the dataset to train or test machine learning algorithms.
  </p>

  <div class="applet" data-payload="{&quot;command&quot;: &quot;citations&quot;}"></div>
</section><section>

  <div class="hr-wave-holder">
      <div class="hr-wave-line hr-wave-line1"></div>
      <div class="hr-wave-line hr-wave-line2"></div>
  </div>

  <h2>Supplementary Information</h2>
  
</section><section><h3>References</h3><section><ul class="footnotes"><li>1 <a name="[^brad_smith]" class="footnote_shim"></a><span class="backlinks"></span>Brad Smith cite
</li><li>2 <a name="[^msceleb_orig]" class="footnote_shim"></a><span class="backlinks"><a href="#[^msceleb_orig]_1">a</a></span>MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition
</li><li>3 <a name="[^madhu_ft]" class="footnote_shim"></a><span class="backlinks"><a href="#[^madhu_ft]_1">a</a></span>Microsoft worked with Chinese military university on artificial intelligence
</li><li>4 <a name="[^disentangled]" class="footnote_shim"></a><span class="backlinks"><a href="#[^disentangled]_1">a</a></span>Exploring Disentangled Feature Representation Beyond Face Identification
</li></ul></section></section>

  </div>
  <footer>
    <div>
      <a href="/">MegaPixels.cc</a>
      <a href="/datasets/">Datasets</a>
      <a href="/about/">About</a>
      <a href="/about/press/">Press</a>
      <a href="/about/legal/">Legal and Privacy</a>
    </div>
    <div>
      MegaPixels &copy;2017-19 Adam R. Harvey /&nbsp;
      <a href="https://ahprojects.com">ahprojects.com</a>
    </div>
  </footer>
</body>

<script src="/assets/js/dist/index.js"></script>
</html>