path: root/site/content/pages/datasets/msceleb/index.md
Diffstat (limited to 'site/content/pages/datasets/msceleb/index.md')
-rw-r--r-- site/content/pages/datasets/msceleb/index.md | 31
1 files changed, 21 insertions, 10 deletions
diff --git a/site/content/pages/datasets/msceleb/index.md b/site/content/pages/datasets/msceleb/index.md
index 5468773b..5095da3d 100644
--- a/site/content/pages/datasets/msceleb/index.md
+++ b/site/content/pages/datasets/msceleb/index.md
@@ -29,7 +29,10 @@ Microsoft Research distributed two main digital assets: a dataset of approximate
For example in a research project authored by researchers from SenseTime's Joint Lab at the Chinese University of Hong Kong called "[Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition](https://arxiv.org/pdf/1809.01407.pdf)", approximately 7 million images from an additional 285,000 subjects were added to their dataset. The images were obtained by crawling the internet using the MS Celeb target list as search queries.
-Below is a selection of 24 names from both the target list and image list curated to illustrate Microsoft's expansive and exploitative practice of scraping the Internet for biometric training data for "celebrities". Names with a number indicate how many images were distributed by Microsoft. Since publishing the analysis, Microsoft has quietly taken down their [msceleb.org](https://msceleb.org) website but a partial list of the identifiers is still available on [github.com/JinRC/C-MS-Celeb/](https://github.com/JinRC/C-MS-Celeb/). The IDs are in the format "m.abc123" and can be accessed through [Google's Knowledge Graph](https://developers.google.com/knowledge-graph/reference/rest/v1/) as "/m/abc123" to obtain the subject names.
+Below is a selection of 24 names from both the target list and image list curated to illustrate Microsoft's expansive and exploitative practice of scraping the Internet for biometric training data for "celebrities". Names with a number indicate how many images were distributed by Microsoft. Since publishing this analysis, Microsoft has quietly taken down their [msceleb.org](https://msceleb.org) website but a cleaned list of 94,682 identities used in the dataset is still available on GitHub from <https://github.com/PINTOFSTU/C-MS-Celeb>, which references another [NUDT-affiliated project](https://www.hindawi.com/journals/cin/2018/4512473/abs/). IDs are in the format "m.abc123" and can be accessed through [Google's Knowledge Graph](https://developers.google.com/knowledge-graph/reference/rest/v1/) as "/m/abc123" to obtain subject names and descriptions.
+
+NB: names without a number indicate that Microsoft only distributed your name and encouraged researchers to download your face images to build a biometric profile. Names with a number indicate that Microsoft definitely included your face images in their dataset. If images were not included by Microsoft, it's more likely than not that your face was used for MS-Celeb-1M related challenges by organizations including NUDT, Megvii, SenseTime, IBM, Hitachi, and others.
+
=== columns 2
@@ -68,7 +71,7 @@ Below is a selection of 24 names from both the target list and image list curate
=== end columns
-After the MS Celeb dataset was introduced in 2016, researchers affiliated with Microsoft Asia worked with researchers affiliated with China's [National University of Defense Technology](https://en.wikipedia.org/wiki/National_University_of_Defense_Technology) (controlled by China's Central Military Commission) and used the MS Celeb images for their research paper on using "[Faces as Lighting Probes via Unsupervised Deep Highlight Extraction](https://www.semanticscholar.org/paper/Faces-as-Lighting-Probes-via-Unsupervised-Deep-Yi-Zhu/b301fd2fc33f24d6f75224e7c0991f4f04b64a65)" with potential applications in 3D face recognition.
+After the MS Celeb dataset was first introduced in 2016, researchers affiliated with Microsoft Asia worked with researchers affiliated with China's [National University of Defense Technology (NUDT)](https://en.wikipedia.org/wiki/National_University_of_Defense_Technology) (controlled by China's Central Military Commission) and used the MS Celeb images for their research paper on using "[Faces as Lighting Probes via Unsupervised Deep Highlight Extraction](https://www.semanticscholar.org/paper/Faces-as-Lighting-Probes-via-Unsupervised-Deep-Yi-Zhu/b301fd2fc33f24d6f75224e7c0991f4f04b64a65)" with potential applications in 3D face recognition.
In an April 10, 2019 [article](https://www.ft.com/content/9378e7ee-5ae6-11e9-9dde-7aedca0a081a) published by Financial Times based on data surfaced during this investigation, Samm Sacks (a senior fellow at the New America think tank) commented that this research raised "red flags because of the nature of the technology, the author's affiliations, combined with what we know about how this technology is being deployed in China right now". Adding, that "the [Chinese] government is using these technologies to build surveillance systems and to detain minorities [in Xinjiang]".[^madhu_ft]
@@ -93,31 +96,39 @@ Typically researchers will phrase this differently and say that they only use a
Despite the recent termination of the [msceleb.org](https://msceleb.org) website, the dataset still exists in several repositories on GitHub, the hard drives of countless researchers, and will likely continue to be used in research projects around the world.
-For example, on October 28, 2019, the MS Celeb dataset will be used for a new competition called "[Lightweight Face Recognition Challenge & Workshop](https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/)" where the best face recognition entries will be awarded $5,000 from Huawei and $3,000 from DeepGlint. The competition is part of the [ICCV 2019 conference](http://iccv2019.thecvf.com/program/workshops). This time the challenge is no longer being organized by Microsoft, who created the dataset, but instead by Imperial College London (UK) and [InsightFace](https://github.com/deepinsight/insightface) (CN).
+For example, on October 28, 2019, the MS Celeb dataset will be used for a new competition called "[Lightweight Face Recognition Challenge & Workshop](https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/)" where the best face recognition entries will be awarded $5,000 from Huawei and $3,000 from DeepGlint. The competition is part of the [ICCV 2019 conference](http://iccv2019.thecvf.com/program/workshops). This time the challenge is no longer being organized by Microsoft, who created the dataset, but instead by Imperial College London (UK) and [InsightFace](https://github.com/deepinsight/insightface) (CN). The organizers provide a [25GB download of cropped faces](https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/) from MS Celeb for anyone to download (in .rec format).
-And earlier in 2019 images from the MS Celeb were repackaged into another face dataset called *Racial Faces in the Wild (RFW)*. To create it, the RFW authors uploaded face images from the MS Celeb dataset to the Face++ API and used the inferred racial scores to segregate people into four subsets: Caucasian, Asian, Indian, and African each with 3,000 subjects. That dataset then appeared in a subsequent research project from researchers affiliated with IIIT-Delhi and IBM TJ Watson called [Deep Learning for Face Recognition: Pride or Prejudiced?](https://arxiv.org/abs/1904.01219), which aims to reduce bias but also inadvertently furthers racist language and ideologies in the paper.
+And in June, shortly after a [post](https://twitter.com/adamhrv/status/1134511293526937600) about the disappearance of the MS Celeb dataset, it reemerged on [Academic Torrents](https://academictorrents.com/details/9e67eb7cc23c9417f39778a8e06cca5e26196a97/tech). As of June 10, the MS Celeb dataset files have been redistributed in at least 9 countries and downloaded 44 times without any restrictions. The files were seeded and are mostly distributed by an AI company based in China called Hyper.ai, which states that it redistributes MS Celeb and other datasets for "teachers and students of service industry-related practitioners and research institutes."[^hyperai_readme]
-The technology that was used to compute the estimated racial scores for the MS Celeb face images used in the RFW dataset, Face++, is owned by Megvii Inc, who has been repeatedly linked to the oppressive surveillance of Uighur Muslims in Xinjiang, China. According to posts from the [ChinAI Newsletter](https://chinai.substack.com/p/chinai-newsletter-11-companies-involved-in-expanding-chinas-public-security-apparatus-in-xinjiang) and [BuzzFeedNews](https://www.buzzfeednews.com/article/ryanmac/us-money-funding-facial-recognition-sensetime-megvii), Megvii announced in 2017 at the China-Eurasia Security Expo in Ürümqi, Xinjiang, that it would be the official technical support unit of the "Public Security Video Laboratory" in Xinjiang, China. If they didn't already, it's highly likely that Megvii has a copy of everyone's biometric faceprint from the MS Celeb dataset.
+Earlier in 2019, images from the MS Celeb dataset were also repackaged into another face dataset called *Racial Faces in the Wild (RFW)*. To create it, the RFW authors uploaded face images from the MS Celeb dataset to the Face++ API and used the inferred racial scores to segregate people into four subsets: Caucasian, Asian, Indian, and African, each with 3,000 subjects. That dataset then appeared in a subsequent research project from researchers affiliated with IIIT-Delhi and IBM TJ Watson called [Deep Learning for Face Recognition: Pride or Prejudiced?](https://arxiv.org/abs/1904.01219), which aims to reduce bias but also inadvertently furthers racist language and ideologies that cannot be repeated here.
-Megvii also publicly acknowledges using the MS Celeb face dataset in their 2018 research project called [GridFace: Face Rectification via Learning Local Homography Transformations](https://arxiv.org/pdf/1808.06210.pdf). The paper has three authors, all of whom were associated with Megvii.
+The estimated racial scores for the MS Celeb face images used in the RFW dataset were computed using the Face++ API, which is owned by Megvii Inc, a company that has been repeatedly linked to the oppressive surveillance of Uighur Muslims in Xinjiang, China. According to posts from the [ChinAI Newsletter](https://chinai.substack.com/p/chinai-newsletter-11-companies-involved-in-expanding-chinas-public-security-apparatus-in-xinjiang) and [BuzzFeedNews](https://www.buzzfeednews.com/article/ryanmac/us-money-funding-facial-recognition-sensetime-megvii), Megvii announced in 2017 at the China-Eurasia Security Expo in Ürümqi, Xinjiang, that it would be the official technical support unit of the "Public Security Video Laboratory" in Xinjiang, China. If they didn't already, it's highly likely that Megvii has a copy of everyone's biometric faceprint from the MS Celeb dataset, either from uploads to the Face++ API or through the research projects explicitly referencing MS Celeb dataset usage, such as a 2018 paper called [GridFace: Face Rectification via Learning Local Homography Transformations](https://arxiv.org/pdf/1808.06210.pdf) jointly published by 3 authors, all of whom worked for Megvii.
## Commercial Usage
-
-The Microsoft Celeb dataset [website](http://web.archive.org/web/20180218212120/http://www.msceleb.org/download/sampleset) says it was created for "non-commercial research purpose only." Publicly available research citations and competitions show otherwise.
+Microsoft's [MS Celeb website](http://web.archive.org/web/20180218212120/http://www.msceleb.org/download/sampleset) says it was created for "non-commercial research purpose only." Publicly available research citations and competitions show otherwise.
In 2017 Microsoft Research organized a face recognition competition at the International Conference on Computer Vision (ICCV), one of the top 2 computer vision conferences worldwide, where industry and academia used the MS Celeb dataset to compete for the highest performance scores. The 2017 winner was Beijing-based OrionStar Technology Co., Ltd.. In their [press release](https://www.prnewswire.com/news-releases/orionstar-wins-challenge-to-recognize-one-million-celebrity-faces-with-artificial-intelligence-300494265.html), OrionStar boasted a 13% increase on the difficult set over last year's winner. The prior year's competitors included Beijing-based Faceall Technology Co., Ltd., a company providing face recognition for "smart city" applications.
-Considering the multiple citations from commercial organizations (Canon, Hitachi, IBM, Megvii/Face++, Microsoft, Microsoft Asia, SenseTime), military use (National University of Defense Technology in China), and the proliferation of subset data (Racial Faces in the Wild) being used to develop face recognition technology for commercial or defense purposes it's fairly clear that Microsoft has lost control of their MS Celeb dataset and biometric data of nearly 100,000 individuals.
+Considering the multiple citations from commercial organizations (Canon, Hitachi, IBM, Megvii/Face++, Microsoft, Microsoft Asia, SenseTime, OrionStar, Faceall), military use (National University of Defense Technology in China), the proliferation of subset data (Racial Faces in the Wild), and the real-time visible proliferation via Academic Torrents, it's fairly clear that Microsoft has lost control of their MS Celeb dataset and the biometric data of nearly 100,000 individuals.
To provide insight into where these 10 million face images have traveled, over 100 research papers have been verified and geolocated to show who used the dataset and where they used it.
{% include 'dashboard.html' %}
+{% include 'supplementary_header.html' %}
+
+##### FAQs and Fact Check
+
+- **The MS Celeb images were not derived from Creative Commons sources**. They were obtained by "retriev[ing] approximately 100 images per celebrity from popular search engines"[^msceleb_orig]. The dataset actually includes many copyrighted images. Microsoft doesn't provide any image URLs, but manually reviewing a small portion of images from the dataset shows many images with watermarked "Copyright" text over the image. TinEye could be used to more accurately determine the image origins in aggregate.
+- **Microsoft did not distribute images of all one million people.** They distributed images for about 100,000 and then encouraged other researchers to download the remaining 900,000 people "by using all the possibly collected face images of this individual on the web as training data."[^msceleb_orig]
+- **Microsoft had not deleted or stopped distribution of their MS Celeb at the time of most press reports on June 4.** Until at least June 6, 2019 the Microsoft Research data portal provided the MS Celeb dataset for download: <http://web.archive.org/web/20190606150005/https://msropendata.com/datasets/98fdfc70-85ee-5288-a69f-d859bbe9c737>
+
### Footnotes
[^msceleb_orig]: MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. Accessed April 18, 2019. http://web.archive.org/web/20190418151913/http://msceleb.org/
[^madhu_ft]: Murgia, Madhumita. Microsoft worked with Chinese military university on artificial intelligence. Financial Times. April 10, 2019.
[^rfw]: Wang, Mei; Deng, Weihong; Hu, Jiani; Peng, Jianteng; Tao, Xunqiang; Huang, Yaohai. Racial Faces in-the-Wild: Reducing Racial Bias by Deep Unsupervised Domain Adaptation. 2018. http://arxiv.org/abs/1812.00194
[^pride_prejudice]: Nagpal, Shruti; Singh, Maneet; Singh, Richa; Vatsa, Mayank; Ratha, Nalini K.. Deep Learning for Face Recognition: Pride or Prejudiced? 2019. http://arxiv.org/abs/1904.01219
-[^one_shot]: Guo, Yandong; Zhang,Lei. One-shot Face Recognition by Promoting Underrepresented Classes. 2017. https://arxive.org/abs/1707.05574
\ No newline at end of file
+[^one_shot]: Guo, Yandong; Zhang,Lei. One-shot Face Recognition by Promoting Underrepresented Classes. 2017. https://arxive.org/abs/1707.05574
+[^hyperai_readme]: readme.txt. MS-Celeb-1M download via Academic Torrents. Accessed June 9, 2019. https://academictorrents.com/details/9e67eb7cc23c9417f39778a8e06cca5e26196a97/tech
\ No newline at end of file
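
Note on the ID format mentioned in the first hunk: the page says dataset IDs written as "m.abc123" can be looked up in Google's Knowledge Graph as "/m/abc123". A minimal sketch of that conversion follows. It assumes the Knowledge Graph Search API's `entities:search` endpoint with its `ids` and `key` parameters; the helper names `to_kg_mid` and `build_lookup_url` and the placeholder API key are hypothetical, and "m.abc123" is the page's own placeholder rather than a real ID. The sketch only builds the request URL and sends nothing.

```python
from urllib.parse import urlencode

# Endpoint of Google's Knowledge Graph Search API (assumed here as the lookup target).
KG_SEARCH_URL = "https://kgsearch.googleapis.com/v1/entities:search"


def to_kg_mid(dataset_id: str) -> str:
    """Convert an ID written as "m.abc123" into the form "/m/abc123"."""
    prefix, sep, suffix = dataset_id.strip().partition(".")
    # Only accept IDs that look like "m.<something>"; anything else is rejected.
    if prefix != "m" or not sep or not suffix:
        raise ValueError(f"unexpected ID format: {dataset_id!r}")
    return f"/m/{suffix}"


def build_lookup_url(dataset_id: str, api_key: str) -> str:
    """Build (but do not send) a lookup URL for a single dataset ID."""
    params = {"ids": to_kg_mid(dataset_id), "key": api_key}
    return f"{KG_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder; a real key is needed to query the API.
    print(build_lookup_url("m.abc123", "YOUR_API_KEY"))
```

Validating the prefix before rewriting keeps malformed rows in an identifier list from silently turning into unrelated queries.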