summaryrefslogtreecommitdiff
path: root/site/content/pages/datasets/msceleb/index.md
blob: 58bacf1e064542937cc4a865da93e0ccf9fff91d (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
------------

status: published
title: Microsoft Celeb Dataset
desc: Microsoft Celeb 1M is a dataset of 10 million face images harvested from the Internet
subdesc: The MS Celeb dataset includes 100K people and a target list of 1 million individuals
slug: msceleb
cssclass: dataset
image: assets/background.jpg
year: 2015
published: 2019-4-18
updated: 2019-4-18
authors: Adam Harvey

------------

## Microsoft Celeb Dataset (MS Celeb)

### sidebar
### end sidebar

Microsoft Celeb (MS Celeb) is a dataset of 10 million face images scraped from the Internet and used for research and development of large-scale biometric recognition systems. According to Microsoft Research, who created and published the [dataset](https://www.microsoft.com/en-us/research/publication/ms-celeb-1m-dataset-benchmark-large-scale-face-recognition-2/) in 2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of nearly 100,000 individuals. Microsoft's goal in building this dataset was to distribute an initial training dataset of 100,000 individuals' images to accelerate research into recognizing a larger target list of one million people "using all the possibly collected face images of this individual on the web as training data".[^msceleb_orig]

These one million people, defined by Microsoft Research as "celebrities", are often merely people who must maintain an online presence for their professional lives. Microsoft's list of 1 million people is an expansive exploitation of the current reality that for many people, including academics, policy makers, writers, artists, activists, and  journalists; maintaining an online presence is mandatory.  This fact should not allow Microsoft nor anyone else to use their biometrics for research and development of surveillance technology. Many names in the target list even include people critical of the very technology Microsoft is using their name and biometric information to build. The list includes digital rights activists like Jillian York; artists critical of surveillance including Trevor Paglen, Jill Magid, and Aram Bartholl; Intercept founders Laura Poitras, Jeremy Scahill, and Glenn Greenwald; Data and Society founder danah boyd; and even Julie Brill, the former FTC commissioner responsible for protecting consumer privacy, to name only 8 out of 1 million.

### Microsoft's 1 Million Target List

Below is a selection of 24 names from the full target list, curated to illustrate Microsoft's expansive and exploitative practice of scraping the Internet for biometric training data. The entire name file can be downloaded from [msceleb.org](https://www.msceleb.org). You can email <a href="mailto:msceleb@microsoft.com?subject=MS-Celeb-1M Removal Request&body=Dear%20Microsoft%2C%0A%0AI%20recently%20discovered%20that%20you%20use%20my%20identity%20for%20commercial%20use%20in%20your%20MS-Celeb-1M%20dataset%20used%20for%20research%20and%20development%20of%20face%20recognition.%20I%20do%20not%20wish%20to%20be%20included%20in%20your%20dataset%20in%20any%20format.%20%0A%0APlease%20remove%20my%20name%20and%2For%20any%20associated%20images%20immediately%20and%20send%20a%20confirmation%20once%20you've%20updated%20your%20%22Top1M_MidList.Name.tsv%22%20file.%0A%0AThanks%20for%20promptly%20handing%20this%2C%0A%5B%20your%20name%20%5D">msceleb@microsoft.com</a> to have your name removed. Subjects whose images were distributed by Microsoft are indicated with the total image count. No number indicates the name is only exists in target list.

=== columns 2

| Name (images) |  Profession |
| --- | --- | --- |
| Adrian Chen | Journalist |
| Ai Weiwei (220) | Artist, activist |
| Aram Bartholl | Conceptual artist |
| Astra Taylor |  Author, director, activist |
| Bruce Schneier (107)  | Cryptologist |
| Cory Doctorow (104) | Blogger, journalist |
| danah boyd | Data &amp; Society founder |
| Edward Felten | Former FTC Chief Technologist |
| Evgeny Morozov (108) | Tech writer, researcher |
| Glenn Greenwald (86) | Journalist, author |
| Hito Steyerl | Artist, writer |
| James Risen | Journalist |

====

| Name (images) |  Profession |
| --- | --- | --- |
| Jeremy Scahill (200) | Journalist |
| Jill Magid | Artist |
| Jillian York | Digital rights activist |
| Jonathan Zittrain |  EFF board member |
| Julie Brill | Former FTC Commissioner|
| Kim Zetter | Journalist, author |
| Laura Poitras (104) | Filmmaker |
| Luke DuBois | Artist  |
| Michael Anti | Political blogger |
| Manal al-Sharif (101) | Womens's rights activist |
| Shoshana Zuboff | Author, academic |
| Trevor Paglen | Artist, researcher |

=== end columns


After publishing this list, researchers affiliated with Microsoft Asia then worked with researchers affiliated with China's [National University of Defense Technology](https://en.wikipedia.org/wiki/National_University_of_Defense_Technology) (controlled by China's Central Military Commission) and used the MS Celeb image dataset for their research paper on using "[Faces as Lighting Probes via Unsupervised Deep Highlight Extraction](https://www.semanticscholar.org/paper/Faces-as-Lighting-Probes-via-Unsupervised-Deep-Yi-Zhu/b301fd2fc33f24d6f75224e7c0991f4f04b64a65)" with potential applications in 3D face recognition.

In an April 10, 2019 [article](https://www.ft.com/content/9378e7ee-5ae6-11e9-9dde-7aedca0a081a) published by Financial Times based on data surfaced during this investigation, Samm Sacks (a senior fellow at the New America think tank) commented that this research raised "red flags because of the nature of the technology, the author's affiliations, combined with what we know about how this technology is being deployed in China right now". Adding, that "the [Chinese] government is using these technologies to build surveillance systems and to detain minorities [in Xinjiang]".[^madhu_ft]

Four more papers published by SenseTime that also use the MS Celeb dataset raise similar flags. SenseTime is a computer vision surveillance company that until [April 2019](https://uhrp.org/news-commentary/china%E2%80%99s-sensetime-sells-out-xinjiang-security-joint-venture) provided surveillance to Chinese authorities to monitor and track Uighur Muslims in Xinjiang province, and had been [flagged](https://www.nytimes.com/2019/04/14/technology/china-surveillance-artificial-intelligence-racial-profiling.html) numerous times as having potential links to human rights violations.

One of the 4 SenseTime papers, "[Exploring Disentangled Feature Representation Beyond Face Identification](https://www.semanticscholar.org/paper/Exploring-Disentangled-Feature-Representation-Face-Liu-Wei/1fd5d08394a3278ef0a89639e9bfec7cb482e0bf)", shows how SenseTime was developing automated face analysis technology to infer race, narrow eyes, nose size, and chin size, all of which could be used to target vulnerable ethnic groups based on their facial appearances.

Earlier in 2019, Microsoft President and Chief Legal Officer [Brad Smith](https://blogs.microsoft.com/on-the-issues/2018/12/06/facial-recognition-its-time-for-action/) called for the governmental regulation of face recognition, citing the potential for misuse, a rare admission that Microsoft's surveillance-driven business model had lost its bearing. More recently Smith also [announced](https://www.reuters.com/article/us-microsoft-ai/microsoft-turned-down-facial-recognition-sales-on-human-rights-concerns-idUSKCN1RS2FV) that Microsoft would seemingly take a stand against such potential misuse, and had decided to not sell face recognition to an unnamed United States agency, citing a lack of accuracy. In effect, Microsoft's face recognition software was not suitable to be used on minorities because it was trained mostly on white male faces.

What the decision to block the sale announces is not so much that Microsoft had upgraded their ethics, but that Microsoft publicly acknowledged it can't sell a data-driven product without data. In other words, Microsoft can't sell face recognition for faces they can't train on.

Until now, that data has been freely harvested from the Internet and packaged in training sets like MS Celeb, which are overwhelmingly [white](https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html) and [male](https://gendershades.org). Without balanced data, facial recognition contains blind spots. And without datasets like MS Celeb, the powerful yet inaccurate facial recognition services like Microsoft's Azure Cognitive Service the services might not exist at all.

![caption: A visualization of 2,000 of the 100,000 identity included in the image dataset distributed by Microsoft Research. Credit: megapixels.cc. License: Open Data Commons Public Domain Dedication (PDDL)](assets/msceleb_montage.jpg)

Microsoft didn't only create MS Celeb for other researchers to use, they also used it internally. In a publicly available 2017 Microsoft Research project called "[One-shot Face Recognition by Promoting Underrepresented Classes](https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/)," Microsoft leveraged the MS Celeb dataset to build their algorithms and advertise the results. Interestingly, Microsoft's [corporate version](https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/) of the paper does not mention they used the MS Celeb datset, but the [open-access version](https://www.semanticscholar.org/paper/One-shot-Face-Recognition-by-Promoting-Classes-Guo/6cacda04a541d251e8221d70ac61fda88fb61a70) published on arxiv.org explicitly mentions that Microsoft Research introspected their algorithms "on the MS-Celeb-1M low-shot learning benchmark task."

We suggest that if Microsoft Research wants to make biometric data publicly available for surveillance research and development, they should start with releasing their researchers' own biometric data, instead of scraping the Internet for journalists, artists, writers, actors, athletes, musicians, and academics.

{% include 'dashboard.html' %}

{% include 'supplementary_header.html' %}

### Footnotes

[^msceleb_orig]: MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition
[^madhu_ft]: Murgia, Madhumita. Microsoft worked with Chinese military university on artificial intelligence. Financial Times. April 10, 2019.