<!doctype html>
<html>
<head>
  <title>MegaPixels: Microsoft Celeb Dataset</title>
  <meta charset="utf-8" />
  <meta name="author" content="Adam Harvey" />
  <meta name="description" content="MS Celeb is a dataset of 10 million face images harvested from the Internet" />
  <meta property="og:title" content="MegaPixels: Microsoft Celeb Dataset"/>
  <meta property="og:type" content="website"/>
  <meta property="og:summary" content="MegaPixels is an art and research project about face recognition datasets created \"in the wild\"/>
  <meta property="og:image" content="https://nyc3.digitaloceanspaces.com/megapixels/v1/datasets/msceleb/assets/background.jpg" />
  <meta property="og:url" content="https://megapixels.cc/datasets/msceleb/"/>
  <meta property="og:site_name" content="MegaPixels" />
  <meta name="referrer" content="no-referrer" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no"/>
  <meta name="apple-mobile-web-app-status-bar-style" content="black">
  <meta name="apple-mobile-web-app-capable" content="yes">

  <link rel="apple-touch-icon" sizes="57x57" href="/assets/img/favicon/apple-icon-57x57.png">
  <link rel="apple-touch-icon" sizes="60x60" href="/assets/img/favicon/apple-icon-60x60.png">
  <link rel="apple-touch-icon" sizes="72x72" href="/assets/img/favicon/apple-icon-72x72.png">
  <link rel="apple-touch-icon" sizes="76x76" href="/assets/img/favicon/apple-icon-76x76.png">
  <link rel="apple-touch-icon" sizes="114x114" href="/assets/img/favicon/apple-icon-114x114.png">
  <link rel="apple-touch-icon" sizes="120x120" href="/assets/img/favicon/apple-icon-120x120.png">
  <link rel="apple-touch-icon" sizes="144x144" href="/assets/img/favicon/apple-icon-144x144.png">
  <link rel="apple-touch-icon" sizes="152x152" href="/assets/img/favicon/apple-icon-152x152.png">
  <link rel="apple-touch-icon" sizes="180x180" href="/assets/img/favicon/apple-icon-180x180.png">
  <link rel="icon" type="image/png" sizes="192x192"  href="/assets/img/favicon/android-icon-192x192.png">
  <link rel="icon" type="image/png" sizes="32x32" href="/assets/img/favicon/favicon-32x32.png">
  <link rel="icon" type="image/png" sizes="96x96" href="/assets/img/favicon/favicon-96x96.png">
  <link rel="icon" type="image/png" sizes="16x16" href="/assets/img/favicon/favicon-16x16.png">
  <link rel="manifest" href="/assets/img/favicon/manifest.json">
  <meta name="msapplication-TileColor" content="#ffffff">
  <meta name="msapplication-TileImage" content="/ms-icon-144x144.png">
  <meta name="theme-color" content="#ffffff">
  
  <link rel='stylesheet' href='/assets/css/fonts.css' />
  <link rel='stylesheet' href='/assets/css/css.css' />
  <link rel='stylesheet' href='/assets/css/leaflet.css' />
  <link rel='stylesheet' href='/assets/css/applets.css' />
  <link rel='stylesheet' href='/assets/css/mobile.css' />
</head>
<body>
  <header>
    <a class='slogan' href="/">
      <div class='logo'></div>
      <div class='site_name'>MegaPixels</div>
      <div class='page_name'>Microsoft Celeb</div>
    </a>
    <div class='links'>
      <a href="/datasets/">Datasets</a>
      <a href="/about/">About</a>
      <a href="/about/news">News</a>
    </div>
  </header>
  <div class="content content-dataset">
    
  <section class='intro_section' style='background-image: url(https://nyc3.digitaloceanspaces.com/megapixels/v1/datasets/msceleb/assets/background.jpg)'><div class='inner'><div class='hero_desc'><span class='bgpad'>MS Celeb is a dataset of 10 million face images harvested from the Internet</span></div><div class='hero_subdesc'><span class='bgpad'>The MS Celeb dataset includes 10 million images of 100,000 people and an additional target list of 1,000,000 individuals
</span></div></div></section><section><h2>Microsoft Celeb Dataset (MS Celeb)</h2>
</section><section><div class='right-sidebar'><div class='meta'>
    <div class='gray'>Published</div>
    <div>2016</div>
  </div><div class='meta'>
    <div class='gray'>Images</div>
    <div>8,200,000 </div>
  </div><div class='meta'>
    <div class='gray'>Identities</div>
    <div>100,000 </div>
  </div><div class='meta'>
    <div class='gray'>Purpose</div>
    <div>Face recognition</div>
  </div><div class='meta'>
    <div class='gray'>Created by</div>
    <div>Microsoft Research</div>
  </div><div class='meta'>
    <div class='gray'>Funded by</div>
    <div>Microsoft Research</div>
  </div><div class='meta'>
    <div class='gray'>Website</div>
    <div><a href='http://www.msceleb.org/' target='_blank' rel='nofollow noopener'>msceleb.org</a></div>
  </div></div><p>Microsoft Celeb (MS-Celeb-1M) is a dataset of 10 million face images harvested from the Internet for the purpose of developing face recognition technologies. According to Microsoft Research, who created and published the <a href="https://www.microsoft.com/en-us/research/publication/ms-celeb-1m-dataset-benchmark-large-scale-face-recognition-2/">dataset</a> in 2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of nearly 100,000 individuals. Microsoft's goal in building this dataset was to distribute an initial training dataset of 100,000 individuals' biometric data to accelerate research into recognizing a larger target list of one million people "using all the possibly collected face images of this individual on the web as training data".<a class="footnote_shim" name="[^msceleb_orig]_1"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 1">1</a></p>
<p>While the majority of people in this dataset are American and British actors, the exploitative use of the term "celebrity" extends far beyond Hollywood. Many of the names in the MS Celeb face recognition dataset are merely people who must maintain an online presence for their professional lives: journalists, artists, musicians, activists, policy makers, writers, and academics. Many people in the target list are even vocal critics of the very technology Microsoft is using their name and biometric information to build. It includes digital rights activists like Jillian York; artists critical of surveillance including Trevor Paglen, Jill Magid, and Aram Bartholl; Intercept founders Laura Poitras, Jeremy Scahill, and Glenn Greenwald; Data and Society founder danah boyd; Shoshana Zuboff, author of <em>Surveillance Capitalism</em>; and even Julie Brill, the former FTC commissioner responsible for protecting consumer privacy.</p>
<h3>Microsoft's 1 Million Target List</h3>
<p>Microsoft Research distributed two main digital assets: a dataset of approximately 10,000,000 images of 100,000 individuals, and a target list of exactly 1 million names. The remaining 900,000 names, for which no images were distributed, complete the target list, which is used as search queries to gather more images for each subject.</p>
<p>For example in a research project authored by researchers from SenseTime's Joint Lab at the Chinese University of Hong Kong called "<a href="https://arxiv.org/pdf/1809.01407.pdf">Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition</a>", approximately 7 million images from an additional 285,000 subjects were added to their dataset. The images were obtained by crawling the internet using the MS Celeb target list as search queries.</p>
<p>Below is a selection of 24 names from both the target list and the image list, curated to illustrate Microsoft's expansive and exploitative practice of scraping the Internet for biometric training data on "celebrities". Names with a number indicate how many images were distributed by Microsoft. Since publishing this analysis, Microsoft has quietly taken down their <a href="https://msceleb.org">msceleb.org</a> website, but a cleaned list of 94,682 identities used in the dataset is still available on GitHub at <a href="https://github.com/PINTOFSTU/C-MS-Celeb">https://github.com/PINTOFSTU/C-MS-Celeb</a>, which references another <a href="https://www.hindawi.com/journals/cin/2018/4512473/abs/">NUDT-affiliated project</a>. IDs are in the format "m.abc123" and can be accessed through <a href="https://developers.google.com/knowledge-graph/reference/rest/v1/">Google's Knowledge Graph</a> as "/m/abc123" to obtain subject names and descriptions.</p>
<p>NB: names without a number indicate that Microsoft only distributed your name and encouraged researchers to download your face images to build a biometric profile. Names with a number indicate that Microsoft definitely included your face images in their dataset. Even if Microsoft did not include images, it's more likely than not that your face was used for MS-Celeb-1M related challenges by organizations including NUDT, Megvii, SenseTime, IBM, Hitachi, and others.</p>
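<p>The Knowledge Graph lookup described above can be sketched as follows. This is a minimal illustration, not part of the original analysis; the entity ID and API key are placeholders:</p>

```python
# Sketch: resolving an MS Celeb ID like "m.abc123" through the
# Google Knowledge Graph Search API ("m.abc123" -> "/m/abc123").
import urllib.parse

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def kg_lookup_url(msceleb_id: str, api_key: str) -> str:
    """Build a Knowledge Graph query URL for one MS Celeb identity ID."""
    mid = "/" + msceleb_id.replace(".", "/", 1)  # "m.abc123" -> "/m/abc123"
    params = urllib.parse.urlencode({"ids": mid, "key": api_key})
    return f"{KG_ENDPOINT}?{params}"

# Fetching the name and description for one entity (requires a valid key):
# import json, urllib.request
# with urllib.request.urlopen(kg_lookup_url("m.abc123", "YOUR_API_KEY")) as resp:
#     result = json.load(resp)["itemListElement"][0]["result"]
#     print(result.get("name"), "-", result.get("description"))
```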
</section><section><div class='columns columns-2'><div class='column'><table>
<thead><tr>
<th>Name (images)</th>
<th>Profession</th>
</tr>
</thead>
<tbody>
<tr>
<td>Adrian Chen</td>
<td>Journalist</td>
</tr>
<tr>
<td>Ai Weiwei (220)</td>
<td>Artist, activist</td>
</tr>
<tr>
<td>Aram Bartholl</td>
<td>Conceptual artist</td>
</tr>
<tr>
<td>Astra Taylor</td>
<td>Author, director, activist</td>
</tr>
<tr>
<td>Bruce Schneier (107)</td>
<td>Cryptologist</td>
</tr>
<tr>
<td>Cory Doctorow (104)</td>
<td>Blogger, journalist</td>
</tr>
<tr>
<td>danah boyd</td>
<td>Data &amp; Society founder</td>
</tr>
<tr>
<td>Edward Felten</td>
<td>Former FTC Chief Technologist</td>
</tr>
<tr>
<td>Evgeny Morozov (108)</td>
<td>Tech writer, researcher</td>
</tr>
<tr>
<td>Glenn Greenwald (86)</td>
<td>Journalist, author</td>
</tr>
<tr>
<td>Hito Steyerl</td>
<td>Artist, writer</td>
</tr>
<tr>
<td>James Risen</td>
<td>Journalist</td>
</tr>
</tbody>
</table>
</div><div class='column'><table>
<thead><tr>
<th>Name (images)</th>
<th>Profession</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jeremy Scahill (200)</td>
<td>Journalist</td>
</tr>
<tr>
<td>Jill Magid</td>
<td>Artist</td>
</tr>
<tr>
<td>Jillian York</td>
<td>Digital rights activist</td>
</tr>
<tr>
<td>Jonathan Zittrain</td>
<td>EFF board member</td>
</tr>
<tr>
<td>Julie Brill</td>
<td>Former FTC Commissioner</td>
</tr>
<tr>
<td>Kim Zetter</td>
<td>Journalist, author</td>
</tr>
<tr>
<td>Laura Poitras (104)</td>
<td>Filmmaker</td>
</tr>
<tr>
<td>Luke DuBois</td>
<td>Artist</td>
</tr>
<tr>
<td>Michael Anti</td>
<td>Political blogger</td>
</tr>
<tr>
<td>Manal al-Sharif (101)</td>
<td>Women's rights activist</td>
</tr>
<tr>
<td>Shoshana Zuboff</td>
<td>Author, academic</td>
</tr>
<tr>
<td>Trevor Paglen</td>
<td>Artist, researcher</td>
</tr>
</tbody>
</table>
</div></div></section><section><p>After the MS Celeb dataset was first introduced in 2016, researchers affiliated with Microsoft Asia worked with researchers affiliated with China's <a href="https://en.wikipedia.org/wiki/National_University_of_Defense_Technology">National University of Defense Technology (NUDT)</a> (controlled by China's Central Military Commission) and used the MS Celeb images for their research paper on using "<a href="https://www.semanticscholar.org/paper/Faces-as-Lighting-Probes-via-Unsupervised-Deep-Yi-Zhu/b301fd2fc33f24d6f75224e7c0991f4f04b64a65">Faces as Lighting Probes via Unsupervised Deep Highlight Extraction</a>" with potential applications in 3D face recognition.</p>
<p>In an April 10, 2019 <a href="https://www.ft.com/content/9378e7ee-5ae6-11e9-9dde-7aedca0a081a">article</a> published by the Financial Times, based on data surfaced during this investigation, Samm Sacks (a senior fellow at the New America think tank) commented that this research raised "red flags because of the nature of the technology, the author's affiliations, combined with what we know about how this technology is being deployed in China right now", adding that "the [Chinese] government is using these technologies to build surveillance systems and to detain minorities [in Xinjiang]".<a class="footnote_shim" name="[^madhu_ft]_1"> </a><a href="#[^madhu_ft]" class="footnote" title="Footnote 2">2</a></p>
<p>Four more papers published by SenseTime that also use the MS Celeb dataset raise similar flags. SenseTime is a computer vision surveillance company that until <a href="https://uhrp.org/news-commentary/china%E2%80%99s-sensetime-sells-out-xinjiang-security-joint-venture">April 2019</a> provided surveillance to Chinese authorities to monitor and track Uighur Muslims in Xinjiang province, and had been <a href="https://www.nytimes.com/2019/04/14/technology/china-surveillance-artificial-intelligence-racial-profiling.html">flagged</a> numerous times as having potential links to human rights violations.</p>
<p>One of the four SenseTime papers, "<a href="https://www.semanticscholar.org/paper/Exploring-Disentangled-Feature-Representation-Face-Liu-Wei/1fd5d08394a3278ef0a89639e9bfec7cb482e0bf">Exploring Disentangled Feature Representation Beyond Face Identification</a>", shows how SenseTime used the MS Celeb dataset to develop automated face analysis technology that infers attributes such as race, narrow eyes, nose size, and chin size, all of which could be used to target vulnerable ethnic groups based on their facial appearance.</p>
<p>Earlier in 2019, Microsoft President and Chief Legal Officer <a href="https://blogs.microsoft.com/on-the-issues/2018/12/06/facial-recognition-its-time-for-action/">Brad Smith</a> called for governmental regulation of face recognition, citing the potential for misuse, a rare admission that Microsoft's surveillance-driven business model had lost its bearings. More recently, Smith <a href="https://www.reuters.com/article/us-microsoft-ai/microsoft-turned-down-facial-recognition-sales-on-human-rights-concerns-idUSKCN1RS2FV">announced</a> that Microsoft would seemingly take a stand against such potential misuse and had decided not to sell face recognition to an unnamed United States agency, citing a lack of accuracy. In effect, Microsoft's face recognition software was not suitable for use on minorities because it was trained mostly on white male faces.</p>
<p>What the decision to block the sale announces is not so much that Microsoft had upgraded their ethics policy, but that Microsoft publicly acknowledged it can't sell a data-driven product without data. In other words, Microsoft can't sell face recognition if they don't have enough face training data to build it.</p>
<p>Until now, that data has been freely harvested from the Internet and packaged into training sets like MS Celeb, which are overwhelmingly <a href="https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html">white</a> and <a href="https://gendershades.org">male</a>. Without balanced data, facial recognition contains blind spots. But without large-scale datasets like MS Celeb, powerful yet inaccurate facial recognition services like Microsoft Azure Cognitive Services would be even less usable.</p>
</section><section class='images'><div class='image'><img src='https://nyc3.digitaloceanspaces.com/megapixels/v1/datasets/msceleb/assets/msceleb_montage.jpg' alt=' A visualization of 2,000 of the 100,000 identities included in the MS-Celeb-1M dataset distributed by Microsoft Research. License: Open Data Commons Public Domain Dedication (PDDL)'><div class='caption'> A visualization of 2,000 of the 100,000 identities included in the MS-Celeb-1M dataset distributed by Microsoft Research. License: Open Data Commons Public Domain Dedication (PDDL)</div></div></section><section><p>Microsoft didn't create MS Celeb only for other researchers to use; they also used it internally. In a publicly available 2017 Microsoft Research project called "<a href="https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/">One-shot Face Recognition by Promoting Underrepresented Classes</a>," Microsoft used the MS Celeb face dataset to build their algorithms and advertise the results. Interestingly, Microsoft's <a href="https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/">corporate version</a> of the paper does not mention they used the MS Celeb dataset, but the <a href="https://www.semanticscholar.org/paper/One-shot-Face-Recognition-by-Promoting-Classes-Guo/6cacda04a541d251e8221d70ac61fda88fb61a70">open-access version</a> published on arxiv.org does. It states that Microsoft Research analyzed their algorithms using "the MS-Celeb-1M low-shot learning benchmark task."<a class="footnote_shim" name="[^one_shot]_1"> </a><a href="#[^one_shot]" class="footnote" title="Footnote 5">5</a></p>
<p>Typically, researchers phrase this differently and say that they only used a dataset to validate their algorithm. But validation data can't be easily separated from the training process. To develop a neural network model, image training datasets are split into three parts: train, test, and validation. Training data is used to fit a model, while the validation and test data provide feedback about hyperparameters, biases, and outputs. In reality, test and validation data steer and influence the final results of neural networks.</p>
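<p>The three-way split described above can be sketched as follows. The proportions and seed here are illustrative, not taken from any particular paper:</p>

```python
# Minimal sketch of a train/validation/test split for an image dataset.
import random

def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle items and partition them into train, validation, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],                 # used to fit model parameters
            items[n_train:n_train + n_val],  # used to tune hyperparameters
            items[n_train + n_val:])         # held out for final evaluation

train_set, val_set, test_set = split_dataset(range(1000))
# len(train_set), len(val_set), len(test_set) -> 800, 100, 100
```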
<h2>Runaway Data</h2>
<p>Despite the recent termination of the <a href="https://msceleb.org">msceleb.org</a> website, the dataset still exists in several repositories on GitHub and on the hard drives of countless researchers, and it will likely continue to be used in research projects around the world.</p>
<p>For example, on October 28, 2019, the MS Celeb dataset will be used for a new competition called "<a href="https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/">Lightweight Face Recognition Challenge &amp; Workshop</a>" where the best face recognition entries will be awarded $5,000 from Huawei and $3,000 from DeepGlint. The competition is part of the <a href="http://iccv2019.thecvf.com/program/workshops">ICCV 2019 conference</a>. This time the challenge is no longer being organized by Microsoft, who created the dataset, but instead by Imperial College London (UK) and <a href="https://github.com/deepinsight/insightface">InsightFace</a> (CN). The organizers provide a <a href="https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/">25GB download of cropped faces</a> from MS Celeb (in .rec format) for anyone to access.</p>
<p>And in June, shortly after we <a href="https://twitter.com/adamhrv/status/1134511293526937600">posted</a> about the disappearance of the MS Celeb dataset, it reemerged on <a href="https://academictorrents.com/details/9e67eb7cc23c9417f39778a8e06cca5e26196a97/tech">Academic Torrents</a>. As of June 10, the MS Celeb dataset files have been redistributed in at least 9 countries and downloaded 44 times without any restrictions. The files were seeded and are mostly distributed by an AI company based in China called Hyper.ai, which states that it redistributes MS Celeb and other datasets for "teachers and students of service industry-related practitioners and research institutes."<a class="footnote_shim" name="[^hyperai_readme]_1"> </a><a href="#[^hyperai_readme]" class="footnote" title="Footnote 6">6</a></p>
<p>Earlier in 2019, images from MS Celeb were also repackaged into another face dataset called <em>Racial Faces in the Wild (RFW)</em>. To create it, the RFW authors uploaded face images from the MS Celeb dataset to the Face++ API and used the inferred racial scores to segregate people into four subsets: Caucasian, Asian, Indian, and African, each with 3,000 subjects. That dataset then appeared in a subsequent research project from researchers affiliated with IIIT-Delhi and IBM TJ Watson called <a href="https://arxiv.org/abs/1904.01219">Deep Learning for Face Recognition: Pride or Prejudiced?</a>, which aims to reduce bias but also inadvertently furthers racist language and ideologies that cannot be repeated here.</p>
<p>The estimated racial scores for the MS Celeb face images used in the RFW dataset were computed using the Face++ API, which is owned by Megvii Inc, a company that has been repeatedly linked to the oppressive surveillance of Uighur Muslims in Xinjiang, China. According to posts from the <a href="https://chinai.substack.com/p/chinai-newsletter-11-companies-involved-in-expanding-chinas-public-security-apparatus-in-xinjiang">ChinAI Newsletter</a> and <a href="https://www.buzzfeednews.com/article/ryanmac/us-money-funding-facial-recognition-sensetime-megvii">BuzzFeedNews</a>, Megvii announced in 2017 at the China-Eurasia Security Expo in Ürümqi, Xinjiang, that it would be the official technical support unit of the "Public Security Video Laboratory" in Xinjiang, China. If they didn't already have one, it's highly likely that Megvii now has a copy of everyone's biometric faceprint from the MS Celeb dataset, either from uploads to the Face++ API or through the research projects explicitly referencing MS Celeb dataset usage, such as a 2018 paper called <a href="https://arxiv.org/pdf/1808.06210.pdf">GridFace: Face Rectification via Learning Local Homography Transformations</a> jointly published by three authors, all of whom worked for Megvii.</p>
<h2>Commercial Usage</h2>
<p>Microsoft's <a href="http://web.archive.org/web/20180218212120/http://www.msceleb.org/download/sampleset">MS Celeb website</a> says it was created for "non-commercial research purpose only." Publicly available research citations and competitions show otherwise.</p>
<p>In 2017 Microsoft Research organized a face recognition competition at the International Conference on Computer Vision (ICCV), one of the top two computer vision conferences worldwide, where industry and academia used the MS Celeb dataset to compete for the highest performance scores. The 2017 winner was Beijing-based OrionStar Technology Co., Ltd. In their <a href="https://www.prnewswire.com/news-releases/orionstar-wins-challenge-to-recognize-one-million-celebrity-faces-with-artificial-intelligence-300494265.html">press release</a>, OrionStar boasted a 13% improvement on the difficult set over the previous year's winner. The prior year's competitors included Beijing-based Faceall Technology Co., Ltd., a company providing face recognition for "smart city" applications.</p>
<p>Considering the multiple citations from commercial organizations (Canon, Hitachi, IBM, Megvii/Face++, Microsoft, Microsoft Asia, SenseTime, OrionStar, Faceall), military use (National University of Defense Technology in China), the proliferation of subset data (Racial Faces in the Wild), and the real-time visible proliferation via Academic Torrents, it's fairly clear that Microsoft has lost control of their MS Celeb dataset and the biometric data of nearly 100,000 individuals.</p>
<p>To provide insight into where these 10 million face images have traveled, over 100 research papers have been verified and geolocated to show who used the dataset and where they used it.</p>
</section><section>
  <h3>Who used Microsoft Celeb?</h3>

  <p>
    This bar chart presents a ranking of the top countries where dataset citations originated. Mouse over individual columns to see yearly totals. These charts show at most the top 10 countries.
  </p>
 
 </section>

<section class="applet_container">
 <div class="applet" data-payload="{&quot;command&quot;: &quot;chart&quot;}"></div>
</section>

<section class="applet_container">
 <div class="applet" data-payload="{&quot;command&quot;: &quot;piechart&quot;}"></div>
</section>

<section>
	
	<h3>Information Supply Chain</h3>

	<p>
		To help understand how Microsoft Celeb has been used around the world by commercial, military, and academic organizations, existing publicly available research citing the Microsoft Celeb dataset was collected, verified, and geocoded to show the biometric trade routes of the people appearing in the images. Click on the markers to reveal research projects at each location.
	</p>
 
 </section>

<section class="applet_container fullwidth">
 <div class="applet" data-payload="{&quot;command&quot;: &quot;map&quot;}"></div>
</section>

<div class="caption">
	<ul class="map-legend">
	<li class="edu">Academic</li>
	<li class="com">Commercial</li>
	<li class="gov">Military / Government</li>
	</ul>
	<div class="source">Citation data is collected using <a href="https://semanticscholar.org" target="_blank">SemanticScholar.org</a> then dataset usage verified and geolocated.</div >
</div>


<section class="applet_container">

  <h3>Dataset Citations</h3>
  <p>
    The dataset citations used in the visualizations were collected from <a href="https://www.semanticscholar.org">Semantic Scholar</a>, a website which aggregates and indexes research papers.  Each citation was geocoded using names of institutions found in the PDF front matter, or as listed on other resources.  These papers have been manually verified to show that researchers downloaded and used the dataset to train or test machine learning algorithms. If you use our data, please <a href="/about/attribution">cite our work</a>.
  </p>

  <div class="applet" data-payload="{&quot;command&quot;: &quot;citations&quot;}"></div>
</section><section>

  <div class="hr-wave-holder">
      <div class="hr-wave-line hr-wave-line1"></div>
      <div class="hr-wave-line hr-wave-line2"></div>
  </div>

  <h2>Supplementary Information</h2>
  
</section><section><h5>FAQs and Fact Check</h5>
<ul>
<li><strong>The MS Celeb images were not derived from Creative Commons sources</strong>. They were obtained by "retriev[ing] approximately 100 images per celebrity from popular search engines"<a class="footnote_shim" name="[^msceleb_orig]_2"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 1">1</a>. The dataset actually includes many copyrighted images. Microsoft doesn't provide any image URLs, but manually reviewing a small portion of the dataset shows many images with watermarked "Copyright" text over them. TinEye could be used to determine the image origins in aggregate more accurately.</li>
<li><strong>Microsoft did not distribute images of all one million people.</strong> They distributed images for about 100,000 and then encouraged other researchers to download the remaining 900,000 people "by using all the possibly collected face images of this individual on the web as training data."<a class="footnote_shim" name="[^msceleb_orig]_3"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 1">1</a></li>
<li><strong>Microsoft had not deleted or stopped distribution of their MS Celeb dataset at the time of most press reports on June 4, 2019.</strong> Until at least June 6, 2019 the Microsoft Research data portal provided the MS Celeb dataset for download: <a href="http://web.archive.org/web/20190606150005/https://msropendata.com/datasets/98fdfc70-85ee-5288-a69f-d859bbe9c737">http://web.archive.org/web/20190606150005/https://msropendata.com/datasets/98fdfc70-85ee-5288-a69f-d859bbe9c737</a></li>
</ul>
</section><section><h3>References</h3><section><ul class="footnotes"><li>1 <a name="[^msceleb_orig]" class="footnote_shim"></a><span class="backlinks"><a href="#[^msceleb_orig]_1">a</a><a href="#[^msceleb_orig]_2">b</a><a href="#[^msceleb_orig]_3">c</a></span>MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. Accessed April 18, 2019. <a href="http://web.archive.org/web/20190418151913/http://msceleb.org/">http://web.archive.org/web/20190418151913/http://msceleb.org/</a>
</li><li>2 <a name="[^madhu_ft]" class="footnote_shim"></a><span class="backlinks"><a href="#[^madhu_ft]_1">a</a></span>Murgia, Madhumita. Microsoft worked with Chinese military university on artificial intelligence. Financial Times. April 10, 2019.
</li><li>3 <a name="[^rfw]" class="footnote_shim"></a><span class="backlinks"></span>Wang, Mei; Deng, Weihong; Hu, Jiani; Peng, Jianteng; Tao, Xunqiang; Huang, Yaohai. Racial Faces in-the-Wild: Reducing Racial Bias by Deep Unsupervised Domain Adaptation. 2018. <a href="http://arxiv.org/abs/1812.00194">http://arxiv.org/abs/1812.00194</a>
</li><li>4 <a name="[^pride_prejudice]" class="footnote_shim"></a><span class="backlinks"></span>Nagpal, Shruti; Singh, Maneet; Singh, Richa; Vatsa, Mayank; Ratha, Nalini K.. Deep Learning for Face Recognition: Pride or Prejudiced? 2019. <a href="http://arxiv.org/abs/1904.01219">http://arxiv.org/abs/1904.01219</a>
</li><li>5 <a name="[^one_shot]" class="footnote_shim"></a><span class="backlinks"><a href="#[^one_shot]_1">a</a></span>Guo, Yandong; Zhang,Lei. One-shot Face Recognition by Promoting Underrepresented Classes. 2017. <a href="https://arxive.org/abs/1707.05574">https://arxive.org/abs/1707.05574</a>
</li><li>6 <a name="[^hyperai_readme]" class="footnote_shim"></a><span class="backlinks"><a href="#[^hyperai_readme]_1">a</a></span>readme.txt. MS-Celeb-1M download via Academic Torrents. Accessed June 9, 2019. <a href="https://academictorrents.com/details/9e67eb7cc23c9417f39778a8e06cca5e26196a97/tech">https://academictorrents.com/details/9e67eb7cc23c9417f39778a8e06cca5e26196a97/tech</a>
</li></ul></section></section>

  </div>
  <footer>
    <ul class="footer-left">
      <li><a href="/">MegaPixels.cc</a></li>
      <li><a href="/datasets/">Datasets</a></li>
      <li><a href="/about/">About</a></li>
      <li><a href="/about/news/">News</a></li>
      <li><a href="/about/legal/">Legal &amp; Privacy</a></li>
    </ul>
    <ul class="footer-right">
      <li>MegaPixels &copy;2017-19 &nbsp;<a href="https://ahprojects.com">Adam R. Harvey</a></li>
      <li>Made with support from &nbsp;<a href="https://mozilla.org">Mozilla</a></li>
    </ul>
  </footer>
</body>

<script src="/assets/js/dist/index.js"></script>
</html>