Diffstat (limited to 'site/content/pages')
-rw-r--r-- site/content/pages/about/attribution.md | 12
-rw-r--r-- site/content/pages/about/index.md | 4
-rw-r--r-- site/content/pages/about/legal.md | 4
-rw-r--r-- site/content/pages/about/news.md | 55
-rw-r--r-- site/content/pages/about/press.md | 42
-rw-r--r-- site/content/pages/about/updates.md | 38
-rw-r--r-- site/content/pages/datasets/brainwash/index.md | 9
-rw-r--r-- site/content/pages/datasets/index.md | 4
-rw-r--r-- site/content/pages/datasets/msceleb/index.md | 50
9 files changed, 106 insertions, 112 deletions
diff --git a/site/content/pages/about/attribution.md b/site/content/pages/about/attribution.md
index bf190478..2f8807a4 100644
--- a/site/content/pages/about/attribution.md
+++ b/site/content/pages/about/attribution.md
@@ -11,12 +11,12 @@ authors: Adam Harvey
------------
-# Legal
+# Attribution
<section class="about-menu">
<ul>
<li><a href="/about/">About</a></li>
-<li><a href="/about/press/">Press</a></li>
+<li><a href="/about/news/">News</a></li>
<li><a class="current" href="/about/attribution/">Attribution</a></li>
<li><a href="/about/legal/">Legal / Privacy</a></li>
</ul>
@@ -36,21 +36,23 @@ If you use the MegaPixels data or any data derived from it, please cite the orig
}
</pre>
-If you redistribute any data from this site, you must also include this [license](assets/megapixels_license.pdf) in PDF format
+If you redistribute any data from this site, you must also include this [license](/assets/legal/megapixels_license.pdf) in PDF format
The MegaPixel dataset is made available under the Open Data Commons Attribution License (https://opendatacommons.org/licenses/by/1.0/) and for academic use only.
READABLE SUMMARY OF Open Data Commons Attribution License
+As long as you:
+
+> Attribute: You must attribute any public use of the database, or works produced from the database, in the manner specified in the license. For any use or redistribution of the database, or works produced from it, you must make clear to others the license of the database and keep intact any notices on the original database.
+
You are free:
> To Share: To copy, distribute and use the dataset
To Create: To produce works from the dataset
To Adapt: To modify, transform and build upon the database
-As long as you:
-> Attribute: You must attribute any public use of the database, or works produced from the database, in the manner specified in the license. For any use or redistribution of the database, or works produced from it, you must make clear to others the license of the database and keep intact any notices on the original database.
diff --git a/site/content/pages/about/index.md b/site/content/pages/about/index.md
index 36e28d22..f238387c 100644
--- a/site/content/pages/about/index.md
+++ b/site/content/pages/about/index.md
@@ -16,7 +16,7 @@ authors: Adam Harvey
<section class="about-menu">
<ul>
<li><a class="current" href="/about/">About</a></li>
-<li><a href="/about/press/">Press</a></li>
+<li><a href="/about/news/">News</a></li>
<li><a href="/about/attribution/">Attribution</a></li>
<li><a href="/about/legal/">Legal / Privacy</a></li>
</ul>
@@ -49,6 +49,8 @@ MegaPixels aims to provide a critical perspective on machine learning image data
MegaPixels is an independent project, designed as a public resource for educators, students, journalists, and researchers. Each dataset presented on this site undergoes a thorough review of its images, intent, and funding sources. Though the goals are similar to publishing an academic paper, MegaPixels is a website-first research project, with an academic publication to follow.
+A dataset of verified geocoded citations and dataset statistics will be published in Fall 2019 along with a research paper as part of a research fellowship for [Karlsruhe HfG](http://kim.hfg-karlsruhe.de/).
+
=== columns 3
##### Team
diff --git a/site/content/pages/about/legal.md b/site/content/pages/about/legal.md
index e88fbb17..08538e9d 100644
--- a/site/content/pages/about/legal.md
+++ b/site/content/pages/about/legal.md
@@ -11,12 +11,12 @@ authors: Adam Harvey
------------
-# Legal
+# Legal and Privacy
<section class="about-menu">
<ul>
<li><a href="/about/">About</a></li>
-<li><a href="/about/press/">Press</a></li>
+<li><a href="/about/news/">News</a></li>
<li><a href="/about/attribution/">Attribution</a></li>
<li><a class="current" href="/about/legal/">Legal / Privacy</a></li>
</ul>
diff --git a/site/content/pages/about/news.md b/site/content/pages/about/news.md
new file mode 100644
index 00000000..de3e8f95
--- /dev/null
+++ b/site/content/pages/about/news.md
@@ -0,0 +1,55 @@
+------------
+
+status: published
+title: MegaPixels News, Press and Recent Events
+desc: MegaPixels News, Press and Recent Events
+slug: news
+cssclass: about
+published: 2018-12-04
+updated: 2018-12-04
+authors: Adam Harvey
+
+------------
+
+# News
+
+<section class="about-menu">
+<ul>
+<li><a href="/about/">About</a></li>
+<li><a class="current" href="/about/news/">News</a></li>
+<li><a href="/about/attribution/">Attribution</a></li>
+<li><a href="/about/legal/">Legal / Privacy</a></li>
+</ul>
+</section>
+
+
+Since launching MegaPixels in April 2019, several of the datasets mentioned have disappeared and one surveillance workshop was canceled. Below is a list of responses, reactions, and press:
+
+
+##### June 2019
+
+- June 7: Additional coverage of FT's story by [BBC](https://www.bbc.com/news/technology-48555149), [Spiegel.de](https://www.spiegel.de/netzwelt/web/microsoft-gesichtserkennung-datenbank-mit-zehn-millionen-fotos-geloescht-a-1271221.html), [IrishTimes](https://www.irishtimes.com/business/technology/microsoft-quietly-deletes-largest-public-face-recognition-data-set-1.3916825), and [Gizmodo](https://gizmodo.com/microsoft-quietly-pulls-its-database-of-100-000-faces-u-1835296212)
+- June 6: Financial Times covers the abrupt disappearance of four facial recognition datasets: [Microsoft quietly deletes largest public face recognition data set](https://www.ft.com/content/7d3e0d6a-87a0-11e9-a028-86cea8523dc2) by Madhumita Murgia
+- June 2: A person tracking surveillance workshop at CVPR ([reid-mct.github.io/2019](https://reid-mct.github.io/2019/)) has been canceled due to the [Duke MTMC dataset](/datasets/duke_mtmc) no longer being available: "Due to some unforeseen circumstances, the test data has not been available. The multi-target multi-camera tracking and person re-identification challenge is canceled. We sincerely apologize for any inconvenience caused."
+- June 2: The [Duke MTMC dataset](/datasets/duke_mtmc) website ([vision.cs.duke.edu/DukeMTMC](http://vision.cs.duke.edu/DukeMTMC)) has abruptly gone blank. An archive from April 18 is still available on the Wayback Machine ([web.archive.org/web/20190418085103/http://vision.cs.duke.edu/DukeMTMC/](https://web.archive.org/web/20190418085103/http://vision.cs.duke.edu/DukeMTMC/))
+- June 1: The [Brainwash](/datasets/brainwash) face/head dataset has been taken down by its author at [exhibits.stanford.edu/data/catalog/sx925dc9385](https://exhibits.stanford.edu/data/catalog/sx925dc9385). "This data was removed from access at the request of the depositor."
+- June 1: The [UCCS dataset page](/dataset/uccs) has been updated with a response from the author to clarify that he did not provide any face data to government agencies. Funding was for technology transfer. This site never mentioned that he did provide data to government agencies, only that his work benefited their objectives.
+
+##### May 2019
+
+- May 31: Semantic Scholar appears to be censoring citations used in this project. Two of the citations linking the [Brainwash](/datasets/brainwash) dataset to research from the National University of Defense Technology (NUDT) in China have been disabled. [NUDT citation 1](https://www.semanticscholar.org/paper/A-Replacement-Algorithm-of-Non-Maximum-Suppression-Zhao-Wang/591a4bfa6380c9fcd5f3ae690e3ac5c09b7bf37b), [NUDT citation 2](https://www.semanticscholar.org/paper/Localized-region-context-and-object-feature-fusion-Li-Dou/b02d31c640b0a31fb18c4f170d841d8e21ffb66c), and the [original paper](https://www.semanticscholar.org/paper/End-to-End-People-Detection-in-Crowded-Scenes-Stewart-Andriluka/1bd1645a629f1b612960ab9bba276afd4cf7c666) show that the NUDT citation has been censored (see the references section on Semantic Scholar pages)
+- May 28: The [Microsoft Celeb](/datasets/msceleb) (MS-Celeb-1M) face dataset website is now 404 and all the download links were deactivated. It appears that someone at Microsoft Research has shuttered access to the MS Celeb dataset. Yet it remains available, as of writing this, on [Imperial College London's website](https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/) and on <https://msropendata.com/datasets/98fdfc70-85ee-5288-a69f-d859bbe9c737>
+- May 29, 2019: Stories about the [UnConstrained College Students Dataset](/datasets/uccs) appeared on [Engadget](https://www.engadget.com/2019/05/28/uccs-facial-recognition-study-students/), [AP News](https://www.apnews.com/003bec760eae4d8085265af9e5175254), [New York Times](https://www.nytimes.com/aponline/2019/05/28/us/ap-us-facial-recognition.html), [US News](https://www.usnews.com/news/best-states/colorado/articles/2019-05-28/colorado-campus-photographed-for-facial-recognition-research), [Daily Dot](https://www.dailydot.com/layer8/college-students-secret-face-recognition-project/), [Washington Post](https://www.washingtonpost.com/business/technology/colorado-students-photographed-for-facial-recognition-study/2019/05/28/0838be48-8165-11e9-b585-e36b16a531aa_story.html), [MSN](https://www.msn.com/en-us/news/politics/colorado-students-unknowingly-photographed-for-facial-recognition-study/ar-AAC2Zkv), [International Association of Privacy Professionals](https://iapp.org/news/a/students-photographed-for-facial-recognition-study/), [The Denver Channel](https://www.youtube.com/watch?v=61NPPD6Mhys), [Daily Mail](https://www.dailymail.co.uk/sciencetech/article-7079865/Spy-cameras-imaged-1-700-unwitting-subjects-facial-recognition-study-funded-U-S-government.html), [New York Post](https://nypost.com/2019/05/29/college-students-secretly-photographed-for-facial-recognition-study/), [Yahoo! News](https://news.yahoo.com/colorado-students-photographed-facial-recognition-162127139.html)
+- May 27, 2019: Denver Post writes about the UCCS dataset: [CU Colorado Springs students secretly photographed for government-backed facial-recognition research](https://www.denverpost.com/2019/05/27/cu-colorado-springs-facial-recognition-research/)
+- May 22, 2019: Interview with CS Indy about the UCCS dataset [UCCS secretly photographed students to advance facial recognition technology](https://www.csindy.com/coloradosprings/uccs-secretly-photographed-students-to-advance-facial-recognition-technology/Content?oid=19664437) by J. Adrian Stanley
+
+##### April 2019
+
+- April 20: Washington Post Editorial Board responds to Financial Times article based on data surfaced in MegaPixels project: [Opinion | Microsoft worked with a Chinese military university on AI. Does that make sense?](https://www.washingtonpost.com/opinions/microsoft-worked-with-a-chinese-military-university-on-ai-does-that-make-sense/2019/04/21/a0fb82c6-5d59-11e9-842d-7d3ed7eb3957_story.html)
+- April 19: Financial Times feature on MegaPixels project: [Who's Using Your Face](https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e) by Madhumita Murgia
+- April 19: MegaPixels data cited by report in Financial Times: [Western AI researchers partnered with Chinese surveillance firm](https://www.ft.com/content/41be9878-61d9-11e9-b285-3acd5d43599e) by Madhumita Murgia
+- April 10: [Microsoft worked with Chinese military university on artificial intelligence](https://www.ft.com/content/9378e7ee-5ae6-11e9-9dde-7aedca0a081a) based on data surfaced in [MS Celeb dataset](/datasets/msceleb)
+
+##### 2018
+
+- Aug 22: HRT Transgender dataset on Verge.com: [Transgender YouTubers had their videos grabbed to train facial recognition software](https://www.theverge.com/2017/8/22/16180080/transgender-youtubers-ai-facial-recognition-dataset) by James Vincent
diff --git a/site/content/pages/about/press.md b/site/content/pages/about/press.md
deleted file mode 100644
index 91480e93..00000000
--- a/site/content/pages/about/press.md
+++ /dev/null
@@ -1,42 +0,0 @@
-------------
-
-status: published
-title: MegaPixels Press and News
-desc: MegaPixels Press and News
-slug: press
-cssclass: about
-published: 2018-12-04
-updated: 2018-12-04
-authors: Adam Harvey
-
-------------
-
-# Press
-
-<section class="about-menu">
-<ul>
-<li><a href="/about/">About</a></li>
-<li><a class="current" href="/about/press/">Press</a></li>
-<li><a href="/about/attribution/">Attribution</a></li>
-<li><a href="/about/legal/">Legal / Privacy</a></li>
-</ul>
-</section>
-
-
-##### In the News
-
-- May 29, 2019: UnConstrained College Students Dataset on [Engadget](https://www.engadget.com/2019/05/28/uccs-facial-recognition-study-students/), [AP News](https://www.apnews.com/003bec760eae4d8085265af9e5175254), [New York Times](https://www.nytimes.com/aponline/2019/05/28/us/ap-us-facial-recognition.html), [US News](https://www.usnews.com/news/best-states/colorado/articles/2019-05-28/colorado-campus-photographed-for-facial-recognition-research), [Daily Dot](https://www.dailydot.com/layer8/college-students-secret-face-recognition-project/), [Washington Post](https://www.washingtonpost.com/business/technology/colorado-students-photographed-for-facial-recognition-study/2019/05/28/0838be48-8165-11e9-b585-e36b16a531aa_story.html), [MSN](https://www.msn.com/en-us/news/politics/colorado-students-unknowingly-photographed-for-facial-recognition-study/ar-AAC2Zkv), [International Association of Privacy Professionals](https://iapp.org/news/a/students-photographed-for-facial-recognition-study/), [The Denver Channel](https://www.youtube.com/watch?v=61NPPD6Mhys), [Daily Mail](https://www.dailymail.co.uk/sciencetech/article-7079865/Spy-cameras-imaged-1-700-unwitting-subjects-facial-recognition-study-funded-U-S-government.html), [New York Post](https://nypost.com/2019/05/29/college-students-secretly-photographed-for-facial-recognition-study/), [Yahoo! News](https://news.yahoo.com/colorado-students-photographed-facial-recognition-162127139.html)
-- May 27, 2019: [CU Colorado Springs students secretly photographed for government-backed facial-recognition research](https://www.denverpost.com/2019/05/27/cu-colorado-springs-facial-recognition-research/)
-- May 22, 2019: [UCCS secretly photographed students to advance facial recognition technology](https://www.csindy.com/coloradosprings/uccs-secretly-photographed-students-to-advance-facial-recognition-technology/Content?oid=19664437) by J. Adrian Stanley
-- April 19, 2019: [Who's Using Your Face](https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e) by Madhumita Murgia for FT.com
-
-##### Cited by
-
-- April 19, 2019: [Western AI researchers partnered with Chinese surveillance firm](https://www.ft.com/content/41be9878-61d9-11e9-b285-3acd5d43599e) by Madhumita Murgia for FT.com
-
-
-##### Related
-
-- April 20: Washington Post Editorial Board [Opinion | Microsoft worked with a Chinese military university on AI. Does that make sense?](https://www.washingtonpost.com/opinions/microsoft-worked-with-a-chinese-military-university-on-ai-does-that-make-sense/2019/04/21/a0fb82c6-5d59-11e9-842d-7d3ed7eb3957_story.html)
-- April 10, 2019: [Microsoft worked with Chinese military university on artificial intelligence](https://www.ft.com/content/9378e7ee-5ae6-11e9-9dde-7aedca0a081a) based on data surfaced in [MS Celeb dataset](/datasets/msceleb)
-- Aug 22, 2018: [Transgender YouTubers had their videos grabbed to train facial recognition software](https://www.theverge.com/2017/8/22/16180080/transgender-youtubers-ai-facial-recognition-dataset) by James Vincent
diff --git a/site/content/pages/about/updates.md b/site/content/pages/about/updates.md
deleted file mode 100644
index 3cac2143..00000000
--- a/site/content/pages/about/updates.md
+++ /dev/null
@@ -1,38 +0,0 @@
-------------
-
-status: published
-title: MegaPixels Site Updates
-desc: MegaPixels Site Updates
-slug: updates
-cssclass: about
-published: 2019-06-02
-updated: 2019-06-02
-authors: Adam Harvey
-
-------------
-
-# Updates and Responses
-
-<section class="about-menu">
-<ul>
-<li><a href="/about/">About</a></li>
-<li><a class="current" href="/about/updates/">Updates</a></li>
-<li><a href="/about/press/">Press</a></li>
-<li><a href="/about/attribution/">Attribution</a></li>
-<li><a href="/about/legal/">Legal / Privacy</a></li>
-</ul>
-</section>
-
-Since publishing this project, several of datasets have disappeared. Below is a chronical of recents events related to the datasets on this site.
-
-June 2019
-
-- June 2: The Duke MTMC main webpage was deactivated and the entire dataset seems to be no longer available from Duke
-- June 2: The has been https://reid-mct.github.io/2019/
-- June 1: The Brainwash face/head dataset has been taken down by its author after posting it about it
-
-May 2019
-
-- May 31: Semantic Scholar appears to be censoring citations used in this project. Two of the citations linking the Brainwash dataset to a military research in China have been intentionally disabled.
-- May 28: The Microsoft Celeb (MS Celeb) face dataset website is now 404 and all the download links are deactivated. It appears that Microsoft Reserach has shuttered access to their MS Celeb dataset. Yet it remains available, as of June 2, on [Imperial College London's website](https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/)
-- \ No newline at end of file
diff --git a/site/content/pages/datasets/brainwash/index.md b/site/content/pages/datasets/brainwash/index.md
index 861eabc0..2a5346b5 100644
--- a/site/content/pages/datasets/brainwash/index.md
+++ b/site/content/pages/datasets/brainwash/index.md
@@ -19,18 +19,21 @@ authors: Adam Harvey
### sidebar
### end sidebar
-Brainwash is a dataset of livecam images taken from San Francisco's Brainwash Cafe. It includes 11,917 images of "everyday life of a busy downtown cafe"[^readme] captured at 100 second intervals throughout the entire day. The Brainwash dataset includes 3 full days of webcam images taken on October 27, November 13, and November 24 in 2014. According the author's [research paper](https://www.semanticscholar.org/paper/End-to-End-People-Detection-in-Crowded-Scenes-Stewart-Andriluka/1bd1645a629f1b612960ab9bba276afd4cf7c666) introducing the dataset, the images were acquired with the help of Angelcam.com. [^end_to_end]
+Brainwash is a dataset of livecam images taken from San Francisco's Brainwash Cafe. It includes 11,917 images of "everyday life of a busy downtown cafe"[^readme] captured at 100 second intervals throughout the day. The Brainwash dataset includes 3 full days of webcam images taken on October 27, November 13, and November 24 in 2014. According to the author's [research paper](https://www.semanticscholar.org/paper/End-to-End-People-Detection-in-Crowded-Scenes-Stewart-Andriluka/1bd1645a629f1b612960ab9bba276afd4cf7c666) introducing the dataset, the images were acquired with the help of Angelcam.com. [^end_to_end]
-The Brainwash dataset is unique because it uses images from a publicly available webcam that records people inside a privately owned business without any consent. No ordinary cafe customer would ever suspect that their image would end up in dataset used for surveillance research and development, but that is exactly what happened to customers at Brainwash cafe in San Francisco.
+The Brainwash dataset is unique because it uses images from a publicly available webcam that records people inside a privately owned business without their consent. No ordinary cafe customer could ever suspect that their image would end up in a dataset used for surveillance research and development, but that is exactly what happened to customers at Brainwash Cafe in San Francisco.
Although Brainwash appears to be a less popular dataset, it was notably used in 2016 and 2017 by researchers affiliated with the National University of Defense Technology in China for two [research](https://www.semanticscholar.org/paper/Localized-region-context-and-object-feature-fusion-Li-Dou/b02d31c640b0a31fb18c4f170d841d8e21ffb66c) [projects](https://www.semanticscholar.org/paper/A-Replacement-Algorithm-of-Non-Maximum-Suppression-Zhao-Wang/591a4bfa6380c9fcd5f3ae690e3ac5c09b7bf37b) on advancing the capabilities of object detection to more accurately isolate the target region in an image. [^localized_region_context] [^replacement_algorithm] The [National University of Defense Technology](https://en.wikipedia.org/wiki/National_University_of_Defense_Technology) is controlled by China's top military body, the Central Military Commission.
-The Brainwash dataset also appears in a 2018 research paper affiliated with Megvii (Face++) that used images from Brainwash Cafe "to validate the generalization ability of [their] CrowdHuman dataset for head detection."[^crowdhuman]. Megvii is the parent company of Face++, who has provided surveillance technology to [monitor Uighur Muslims](https://www.nytimes.com/2019/04/14/technology/china-surveillance-artificial-intelligence-racial-profiling.html) in Xinjiang and may be [blacklisted](https://www.bloomberg.com/news/articles/2019-05-22/trump-weighs-blacklisting-two-chinese-surveillance-companies) in the United States.
+The Brainwash dataset also appears in a 2018 research paper affiliated with Megvii (Face++) that used images from Brainwash cafe "to validate the generalization ability of [their] CrowdHuman dataset for head detection."[^crowdhuman] Megvii is the parent company of Face++, which has provided surveillance technology to [monitor Uighur Muslims](https://www.nytimes.com/2019/04/14/technology/china-surveillance-artificial-intelligence-racial-profiling.html) in Xinjiang and may be [blacklisted](https://www.bloomberg.com/news/articles/2019-05-22/trump-weighs-blacklisting-two-chinese-surveillance-companies) in the United States.
#### Updates
Since [posting](https://twitter.com/adamhrv/status/1132201604999000065) about this dataset and [showing](https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e) its connections to the National University of Defense Technology in China, the Brainwash dataset is no longer available for download. As of June 2, 2019 it has been "removed from access at the request of the depositor."
+The two papers associated with the National University of Defense Technology in China have also been affected. The citations linking back to the Brainwash dataset paper no longer appear in the Semantic Scholar API search results. The citation references on the pages for [NUDT citation 1](https://www.semanticscholar.org/paper/A-Replacement-Algorithm-of-Non-Maximum-Suppression-Zhao-Wang/591a4bfa6380c9fcd5f3ae690e3ac5c09b7bf37b) and [NUDT citation 2](https://www.semanticscholar.org/paper/Localized-region-context-and-object-feature-fusion-Li-Dou/b02d31c640b0a31fb18c4f170d841d8e21ffb66c) now display the text "Sorry, this paper is not in our corpus", no longer linking back to the [original Brainwash paper](https://www.semanticscholar.org/paper/End-to-End-People-Detection-in-Crowded-Scenes-Stewart-Andriluka/1bd1645a629f1b612960ab9bba276afd4cf7c666), effectively censoring the NUDT connections from API search results.
+
+
![caption: A sample image from the Brainwash dataset used for training face and head detection algorithms for surveillance. The dataset contains a total of 11,917 images and 81,973 annotated heads. Graphic by megapixels.cc based on Brainwash dataset by Russel et al. License: <a href="https://opendatacommons.org/licenses/pddl/summary/index.html">Open Data Commons Public Domain Dedication</a> (PDDL)](assets/brainwash_example.jpg)
![caption: A visualization of the active regions for 81,973 head annotations in the Brainwash dataset training partition. Graphic by megapixels.cc based on Brainwash dataset by Russel et al. License: <a href="https://opendatacommons.org/licenses/pddl/summary/index.html">Open Data Commons Public Domain Dedication</a> (PDDL)](assets/brainwash_saliency_map.jpg)
diff --git a/site/content/pages/datasets/index.md b/site/content/pages/datasets/index.md
index 6e96f19e..8dbee237 100644
--- a/site/content/pages/datasets/index.md
+++ b/site/content/pages/datasets/index.md
@@ -13,4 +13,6 @@ sync: false
# Dataset Analyses
-Explore face and person recognition datasets contributing to the growing crisis of authoritarian biometric surveillance. This first group of 5 datasets focuses on image usage connected to foreign surveillance and defense organizations. Since publishing this project in April 2019, the [Brainwash](https://purl.stanford.edu/sx925dc9385), [Duke MTMC](http://vision.cs.duke.edu/DukeMTMC/), and [MS Celeb](http://msceleb.org/) datasets have been taken down by their authors. The [UCCS](https://vast.uccs.edu/Opensetface/) dataset was temporarily deactivated due to metadata exposure and the [Town Centre data](http://www.robots.ox.ac.uk/ActiveVision/Research/Projects/2009bbenfold_headpose/project.html) remains active.
+Explore face and person recognition datasets contributing to the growing crisis of authoritarian biometric surveillance. This first group of 5 datasets focuses on image usage connected to foreign surveillance and defense organizations.
+
+Since publishing this project in April 2019, the [Brainwash](https://purl.stanford.edu/sx925dc9385), [Duke MTMC](http://vision.cs.duke.edu/DukeMTMC/), and [MS Celeb](http://msceleb.org/) datasets have been taken down by their authors. The [UCCS](https://vast.uccs.edu/Opensetface/) dataset was temporarily deactivated due to metadata exposure and the [Town Centre data](http://www.robots.ox.ac.uk/ActiveVision/Research/Projects/2009bbenfold_headpose/project.html) remains active.
diff --git a/site/content/pages/datasets/msceleb/index.md b/site/content/pages/datasets/msceleb/index.md
index 22a799e0..5095da3d 100644
--- a/site/content/pages/datasets/msceleb/index.md
+++ b/site/content/pages/datasets/msceleb/index.md
@@ -19,17 +19,20 @@ authors: Adam Harvey
### sidebar
### end sidebar
-Microsoft Celeb (MS Celeb or MS-Celeb-1M) is a dataset of 10 million face images harvested from the Internet for the purpose of developing face recognition technologies. According to Microsoft Research, who created and published the [dataset](https://www.microsoft.com/en-us/research/publication/ms-celeb-1m-dataset-benchmark-large-scale-face-recognition-2/) in 2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of nearly 100,000 individuals. Microsoft's goal in building this dataset was to distribute an initial training dataset of 100,000 individuals' biometric data to accelerate research into recognizing a larger target list of one million people "using all the possibly collected face images of this individual on the web as training data".[^msceleb_orig]
+Microsoft Celeb (MS-Celeb-1M) is a dataset of 10 million face images harvested from the Internet for the purpose of developing face recognition technologies. According to Microsoft Research, who created and published the [dataset](https://www.microsoft.com/en-us/research/publication/ms-celeb-1m-dataset-benchmark-large-scale-face-recognition-2/) in 2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of nearly 100,000 individuals. Microsoft's goal in building this dataset was to distribute an initial training dataset of 100,000 individuals' biometric data to accelerate research into recognizing a larger target list of one million people "using all the possibly collected face images of this individual on the web as training data".[^msceleb_orig]
While the majority of people in this dataset are American and British actors, the exploitative use of the term "celebrity" extends far beyond Hollywood. Many of the names in the MS Celeb face recognition dataset are merely people who must maintain an online presence for their professional lives: journalists, artists, musicians, activists, policy makers, writers, and academics. Many people in the target list are even vocal critics of the very technology Microsoft is using their name and biometric information to build. It includes digital rights activists like Jillian York; artists critical of surveillance including Trevor Paglen, Jill Magid, and Aram Bartholl; Intercept founders Laura Poitras, Jeremy Scahill, and Glenn Greenwald; Data and Society founder danah boyd; Shoshana Zuboff, author of *Surveillance Capitalism*; and even Julie Brill, the former FTC commissioner responsible for protecting consumer privacy.
### Microsoft's 1 Million Target List
-Microsoft Research distributed two main digital assets: a dataset of approximately 10,000,000 images of 100,000 individuals and a target list of exactly 1 million names. The 900,000 names without images are the target list, which is used to gather more images for these individuals.
+Microsoft Research distributed two main digital assets: a dataset of approximately 10,000,000 images of 100,000 individuals and a target list of exactly 1 million names. The 900,000 names without images are the target list, which is used to gather more images for each subject.
-For example in a research project authored by researchers from SenseTime's Joint Lab at the Chinese University of Hong Kong called "[Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition](https://arxiv.org/pdf/1809.01407.pdf)", approximately 7 million images from an additional 285,000 subjects were added to their dataset. The images were obtained by crawling the internet using the MS Celeb target list as the search query.
+For example, in a research project authored by researchers from SenseTime's Joint Lab at the Chinese University of Hong Kong called "[Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition](https://arxiv.org/pdf/1809.01407.pdf)", approximately 7 million images from an additional 285,000 subjects were added to their dataset. The images were obtained by crawling the internet using the MS Celeb target list as search queries.
+
+Below is a selection of 24 names from both the target list and image list curated to illustrate Microsoft's expansive and exploitative practice of scraping the Internet for biometric training data for "celebrities". Names with a number indicate how many images were distributed by Microsoft. Since publishing this analysis, Microsoft has quietly taken down their [msceleb.org](https://msceleb.org) website but a cleaned list of 94,682 identities used in the dataset is still available on GitHub from <https://github.com/PINTOFSTU/C-MS-Celeb>, which references another [NUDT-affiliated project](https://www.hindawi.com/journals/cin/2018/4512473/abs/). IDs are in the format "m.abc123" and can be accessed through [Google's Knowledge Graph](https://developers.google.com/knowledge-graph/reference/rest/v1/) as "/m/abc123" to obtain subject names and descriptions.
+
+NB: names without a number indicate that Microsoft only distributed your name and encouraged researchers to download your face images to build a biometric profile. Names with a number indicate that Microsoft definitely included your face images in their dataset. If images were not included by Microsoft, it's more likely than not that your face was used for MS-Celeb-1M related challenges by organizations including NUDT, Megvii, SenseTime, IBM, Hitachi, and others.
-Below is a selection of 24 names from both the target list and image list curated to illustrate Microsoft's expansive and exploitative practice of scraping the Internet for biometric training data. Names with a number indicate how many images were distributed by Microsoft. Since publishing the analysis, Microsoft has quietly taken down their [msceleb.org](https://msceleb.org) website but a partial list of the identifiers is still available on [github.com/JinRC/C-MS-Celeb/](https://github.com/JinRC/C-MS-Celeb/). The IDs are in the format "m.abc123" and can be accessed through [Google's Knowledge Graph](https://developers.google.com/knowledge-graph/reference/rest/v1/) as "/m/abc123" to obtain the subject names.
=== columns 2
@@ -68,7 +71,7 @@ Below is a selection of 24 names from both the target list and image list curate
=== end columns
-After the MS Celeb dataset was introduced in 2016, researchers affiliated with Microsoft Asia worked with researchers affiliated with China's [National University of Defense Technology](https://en.wikipedia.org/wiki/National_University_of_Defense_Technology) (controlled by China's Central Military Commission) and used the MS Celeb images for their research paper on using "[Faces as Lighting Probes via Unsupervised Deep Highlight Extraction](https://www.semanticscholar.org/paper/Faces-as-Lighting-Probes-via-Unsupervised-Deep-Yi-Zhu/b301fd2fc33f24d6f75224e7c0991f4f04b64a65)" with potential applications in 3D face recognition.
+After the MS Celeb dataset was first introduced in 2016, researchers affiliated with Microsoft Asia worked with researchers affiliated with China's [National University of Defense Technology (NUDT)](https://en.wikipedia.org/wiki/National_University_of_Defense_Technology) (controlled by China's Central Military Commission) and used the MS Celeb images for their research paper on using "[Faces as Lighting Probes via Unsupervised Deep Highlight Extraction](https://www.semanticscholar.org/paper/Faces-as-Lighting-Probes-via-Unsupervised-Deep-Yi-Zhu/b301fd2fc33f24d6f75224e7c0991f4f04b64a65)" with potential applications in 3D face recognition.
In an April 10, 2019 [article](https://www.ft.com/content/9378e7ee-5ae6-11e9-9dde-7aedca0a081a) published by Financial Times based on data surfaced during this investigation, Samm Sacks (a senior fellow at the New America think tank) commented that this research raised "red flags because of the nature of the technology, the author's affiliations, combined with what we know about how this technology is being deployed in China right now". Adding, that "the [Chinese] government is using these technologies to build surveillance systems and to detain minorities [in Xinjiang]".[^madhu_ft]
@@ -78,42 +81,48 @@ One of the 4 SenseTime papers, "[Exploring Disentangled Feature Representation B
Earlier in 2019, Microsoft President and Chief Legal Officer [Brad Smith](https://blogs.microsoft.com/on-the-issues/2018/12/06/facial-recognition-its-time-for-action/) called for the governmental regulation of face recognition, citing the potential for misuse, a rare admission that Microsoft's surveillance-driven business model had lost its bearing. More recently Smith also [announced](https://www.reuters.com/article/us-microsoft-ai/microsoft-turned-down-facial-recognition-sales-on-human-rights-concerns-idUSKCN1RS2FV) that Microsoft would seemingly take a stand against such potential misuse, and had decided to not sell face recognition to an unnamed United States agency, citing a lack of accuracy. In effect, Microsoft's face recognition software was not suitable to be used on minorities because it was trained mostly on white male faces.
-What the decision to block the sale announces is not so much that Microsoft had upgraded their ethics policy, but that Microsoft publicly acknowledged it can't sell a data-driven product without data. In other words, Microsoft can't sell face recognition if they don't have enough data to build it.
+What the decision to block the sale announces is not so much that Microsoft had upgraded their ethics policy, but that Microsoft publicly acknowledged it can't sell a data-driven product without data. In other words, Microsoft can't sell face recognition if they don't have enough face training data to build it.
-Until now, that data has been freely harvested from the Internet and packaged in training sets like MS Celeb, which are overwhelmingly [white](https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html) and [male](https://gendershades.org). Without balanced data, facial recognition contains blind spots. But without the large-scale datasets like MS Celeb, the powerful yet inaccurate facial recognition services like Microsoft's Azure Cognitive would be even less usable.
+Until now, that data has been freely harvested from the Internet and packaged in training sets like MS Celeb, which are overwhelmingly [white](https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html) and [male](https://gendershades.org). Without balanced data, facial recognition contains blind spots. But without the large-scale datasets like MS Celeb, the powerful yet inaccurate facial recognition services like Microsoft Azure Cognitive would be even less usable.
![caption: A visualization of 2,000 of the 100,000 identities included in the MS-Celeb-1M dataset distributed by Microsoft Research. License: Open Data Commons Public Domain Dedication (PDDL)](assets/msceleb_montage.jpg)
Microsoft didn't only create MS Celeb for other researchers to use, they also used it internally. In a publicly available 2017 Microsoft Research project called "[One-shot Face Recognition by Promoting Underrepresented Classes](https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/)," Microsoft used the MS Celeb face dataset to build their algorithms and advertise the results. Interestingly, Microsoft's [corporate version](https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/) of the paper does not mention they used the MS Celeb dataset, but the [open-access version](https://www.semanticscholar.org/paper/One-shot-Face-Recognition-by-Promoting-Classes-Guo/6cacda04a541d251e8221d70ac61fda88fb61a70) published on arxiv.org does. It states that Microsoft Research analyzed their algorithms using "the MS-Celeb-1M low-shot learning benchmark task."[^one_shot]
-Typically researchers will phrase this differently and say they use data to validate their algorithm. But in reality neural network algorithms without data are only blueprints for how to use the data. Neural network algorithms are used to extract knowledge and distill it into an active format where it can be used for inference. Passing a face image through a face recognition neural network is to pass that image through the entire dataset.
+Typically researchers will phrase this differently and say that they only use a dataset to validate their algorithm. But validation data can't be easily separated from the training process. To develop a neural network model, image training datasets are split into three parts: train, test, and validation. Training data is used to fit a model, and the validation and test data are used to provide feedback about the hyperparameters, biases, and outputs. In reality, test and validation data steers and influences the final results of neural networks.
## Runaway Data
-Despite Microsoft's recent action to quietly shut down their large scale distribution of non-cooperative biometrics on the [MS Celeb](https://msceleb.org) website, the dataset still exists in several repositories on GitHub, the hard drives of countless researchers, and will likely continue to be used in research projects around the world.
+Despite the recent termination of the [msceleb.org](https://msceleb.org) website, the dataset still exists in several repositories on GitHub, the hard drives of countless researchers, and will likely continue to be used in research projects around the world.
+
+For example, on October 28, 2019, the MS Celeb dataset will be used for a new competition called "[Lightweight Face Recognition Challenge & Workshop](https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/)" where the best face recognition entries will be awarded $5,000 from Huawei and $3,000 from DeepGlint. The competition is part of the [ICCV 2019 conference](http://iccv2019.thecvf.com/program/workshops). This time the challenge is no longer being organized by Microsoft, who created the dataset, but instead by Imperial College London (UK) and [InsightFace](https://github.com/deepinsight/insightface) (CN). The organizers provide a [25GB download of cropped faces](https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/) from MS Celeb for anyone to download (in .rec format).
-The most recent of which is a paper uploaded to arxiv.org on April 2, 2019 jointly authored by researchers from IIIT-Delhi and IBM TJ Watson Research Center. In their paper titled [Deep Learning for Face Recognition: Pride or Prejudiced?](https://arxiv.org/abs/1904.01219), the researchers use a new dataset, called *Racial Faces in the Wild* (RFW), made entirely from the original images of the MS Celeb dataset. To create it, the RFW authors uploaded everyone's image from the MS Celeb dataset to Face++ and used the inferred racial scores to segregate people into four subsets: Caucasian, Asian, Indian, and African each with 3,000 subjects.
+And in June, shortly after a [post](https://twitter.com/adamhrv/status/1134511293526937600) about the disappearance of the MS Celeb dataset, it reemerged on [Academic Torrents](https://academictorrents.com/details/9e67eb7cc23c9417f39778a8e06cca5e26196a97/tech). As of June 10, the MS Celeb dataset files have been redistributed in at least 9 countries and downloaded 44 times without any restrictions. The files were seeded and are mostly distributed by an AI company based in China called Hyper.ai, which states that it redistributes MS Celeb and other datasets for "teachers and students of service industry-related practitioners and research institutes."[^hyperai_readme]
-Face++ is a face recognition product from Megvii Inc. who has been repeatedly linked to the oppressive surveillance of Uighur Muslims in Xinjiang, China. According to posts from the [ChinAI Newsletter](https://chinai.substack.com/p/chinai-newsletter-11-companies-involved-in-expanding-chinas-public-security-apparatus-in-xinjiang) and [BuzzFeedNews](https://www.buzzfeednews.com/article/ryanmac/us-money-funding-facial-recognition-sensetime-megvii), Megvii announced in 2017 at the China-Eurasia Security Expo in Ürümqi, Xinjiang, that it would be the official technical support unit of the "Public Security Video Laboratory" in Xinjiang, China.
+Earlier in 2019, images from MS Celeb were also repackaged into another face dataset called *Racial Faces in the Wild (RFW)*. To create it, the RFW authors uploaded face images from the MS Celeb dataset to the Face++ API and used the inferred racial scores to segregate people into four subsets: Caucasian, Asian, Indian, and African, each with 3,000 subjects. That dataset then appeared in a subsequent research project from researchers affiliated with IIIT-Delhi and IBM TJ Watson called [Deep Learning for Face Recognition: Pride or Prejudiced?](https://arxiv.org/abs/1904.01219), which aims to reduce bias but also inadvertently furthers racist language and ideologies that cannot be repeated here.
+
+The estimated racial scores for the MS Celeb face images used in the RFW dataset were computed using the Face++ API, which is owned by Megvii Inc, a company that has been repeatedly linked to the oppressive surveillance of Uighur Muslims in Xinjiang, China. According to posts from the [ChinAI Newsletter](https://chinai.substack.com/p/chinai-newsletter-11-companies-involved-in-expanding-chinas-public-security-apparatus-in-xinjiang) and [BuzzFeedNews](https://www.buzzfeednews.com/article/ryanmac/us-money-funding-facial-recognition-sensetime-megvii), Megvii announced in 2017 at the China-Eurasia Security Expo in Ürümqi, Xinjiang, that it would be the official technical support unit of the "Public Security Video Laboratory" in Xinjiang, China. If they didn't already, it's highly likely that Megvii has a copy of everyone's biometric faceprint from the MS Celeb dataset, either from uploads to the Face++ API or through the research projects explicitly referencing MS Celeb dataset usage, such as a 2018 paper called [GridFace: Face Rectification via Learning Local Homography Transformations](https://arxiv.org/pdf/1808.06210.pdf) jointly published by 3 authors, all of whom worked for Megvii.
## Commercial Usage
-Megvii publicly acknowledges using the MS Celeb face dataset in their 2018 research project called [GridFace: Face Rectification via Learning Local Homography Transformations](https://arxiv.org/pdf/1808.06210.pdf). The paper has three authors, all of whom were associated with Megvii, indicating that the dataset has been used for research associated with commercial activity. However, on Microsoft's [website](http://web.archive.org/web/20180218212120/http://www.msceleb.org/download/sampleset) they state that the dataset was released "for non-commercial research purpose only."
+Microsoft's [MS Celeb website](http://web.archive.org/web/20180218212120/http://www.msceleb.org/download/sampleset) says it was created for "non-commercial research purpose only." Publicly available research citations and competitions show otherwise.
-A more clear example of commercial use happened in 2017 when Microsoft Research organized a face recognition competition at the International Conference on Computer Vision (ICCV), one of the top 2 computer vision conferences worldwide, where industry and academia compete to achieve the highest performance using their recognition technology. In 2017, the winner of the MS-Celeb-1M challenge was Beijing-based OrionStar Technology Co., Ltd.. In their [press release](https://www.prnewswire.com/news-releases/orionstar-wins-challenge-to-recognize-one-million-celebrity-faces-with-artificial-intelligence-300494265.html), OrionStar boast 13% increase on the difficult set over last year's winner.
+In 2017 Microsoft Research organized a face recognition competition at the International Conference on Computer Vision (ICCV), one of the top 2 computer vision conferences worldwide, where industry and academia used the MS Celeb dataset to compete for the highest performance scores. The 2017 winner was Beijing-based OrionStar Technology Co., Ltd. In their [press release](https://www.prnewswire.com/news-releases/orionstar-wins-challenge-to-recognize-one-million-celebrity-faces-with-artificial-intelligence-300494265.html), OrionStar boasted a 13% increase on the difficult set over last year's winner. The prior year's competitors included Beijing-based Faceall Technology Co., Ltd., a company providing face recognition for "smart city" applications.
-Microsoft Research also ran a similar competition in 2016 that with other commercial participants including Beijing Faceall Technology Co., Ltd., a company providing face recognition for "smart city" applications.
+Considering the multiple citations from commercial organizations (Canon, Hitachi, IBM, Megvii/Face++, Microsoft, Microsoft Asia, SenseTime, OrionStar, Faceall), military use (National University of Defense Technology in China), the proliferation of subset data (Racial Faces in the Wild), and the real-time visible proliferation via Academic Torrents, it's fairly clear that Microsoft has lost control of their MS Celeb dataset and the biometric data of nearly 100,000 individuals.
-On October 28, 2019, the MS Celeb dataset will be used for yet competition called "[Lightweight Face Recognition Challenge & Workshop](https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/)" where the best face recognition entry will be awarded $5,000 from Huawei and $3,000 from DeepGlint. The competition is part of the [ICCV 2019 conference](http://iccv2019.thecvf.com/program/workshops). This time the challenge is no longer being organized by Microsoft, who created the dataset, but instead by Imperial College London (UK) and [InsightFace](https://github.com/deepinsight/insightface) (CN).
+To provide insight into where these 10 million face images have traveled, over 100 research papers have been verified and geolocated to show who used the dataset and where they used it.
-Even though Microsoft has shuttered access to the official distribution website [msceleb.org](https://msceleb.org) the dataset can still be easily downloaded from [https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/](https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/) without agreeing to any terms for usage or further distribution.
+{% include 'dashboard.html' %}
-Considering the multiple citations from commercial organizations (Canon, Hitachi, IBM, Megvii, Microsoft, Microsoft Asia, SenseTime), military use (National University of Defense Technology in China), and the proliferation of subsets being used for new face recognition competitions it's fairly clear that Microsoft is no longer in control of their MS Celeb dataset nor the biometric data of nearly 10 million images of 100,000 individuals whose images were distributed in the dataset.
+{% include 'supplementary_header.html' %}
-To provide insight into where these 10 million faces images have traveled, we mapped all the publicly available research citations to show who used the dataset and where it was used.
+##### FAQs and Fact Check
-{% include 'dashboard.html' %}
+- **The MS Celeb images were not derived from Creative Commons sources**. They were obtained by "retriev[ing] approximately 100 images per celebrity from popular search engines"[^msceleb_orig]. The dataset actually includes many copyrighted images. Microsoft doesn't provide any image URLs, but manually reviewing a small portion of images from the dataset shows many images with watermarked "Copyright" text over the image. TinEye could be used to more accurately determine the image origins in aggregate.
+- **Microsoft did not distribute images of all one million people.** They distributed images for about 100,000 and then encouraged other researchers to download the remaining 900,000 people "by using all the possibly collected face images of this individual on the web as training data."[^msceleb_orig]
+- **Microsoft had not deleted or stopped distribution of their MS Celeb dataset at the time of most press reports on June 4.** Until at least June 6, 2019 the Microsoft Research data portal provided the MS Celeb dataset for download: <http://web.archive.org/web/20190606150005/https://msropendata.com/datasets/98fdfc70-85ee-5288-a69f-d859bbe9c737>
### Footnotes
@@ -121,4 +130,5 @@ To provide insight into where these 10 million faces images have traveled, we ma
[^madhu_ft]: Murgia, Madhumita. Microsoft worked with Chinese military university on artificial intelligence. Financial Times. April 10, 2019.
[^rfw]: Wang, Mei; Deng, Weihong; Hu, Jiani; Peng, Jianteng; Tao, Xunqiang; Huang, Yaohai. Racial Faces in-the-Wild: Reducing Racial Bias by Deep Unsupervised Domain Adaptation. 2018. http://arxiv.org/abs/1812.00194
[^pride_prejudice]: Nagpal, Shruti; Singh, Maneet; Singh, Richa; Vatsa, Mayank; Ratha, Nalini K.. Deep Learning for Face Recognition: Pride or Prejudiced? 2019. http://arxiv.org/abs/1904.01219
-[^one_shot]: Guo, Yandong; Zhang,Lei. One-shot Face Recognition by Promoting Underrepresented Classes. 2017. https://arxive.org/abs/1707.05574
\ No newline at end of file
+[^one_shot]: Guo, Yandong; Zhang, Lei. One-shot Face Recognition by Promoting Underrepresented Classes. 2017. https://arxive.org/abs/1707.05574
+[^hyperai_readme]: readme.txt. MS-Celeb-1M download via Academic Torrents. Accessed June 9, 2019. https://academictorrents.com/details/9e67eb7cc23c9417f39778a8e06cca5e26196a97/tech
\ No newline at end of file