Diffstat (limited to 'site/public/datasets')
| -rw-r--r-- | site/public/datasets/msceleb/index.html | 26 |
1 files changed, 17 insertions, 9 deletions
diff --git a/site/public/datasets/msceleb/index.html b/site/public/datasets/msceleb/index.html index b57077a8..bbf648cf 100644 --- a/site/public/datasets/msceleb/index.html +++ b/site/public/datasets/msceleb/index.html @@ -76,10 +76,12 @@ </div><div class='meta'> <div class='gray'>Website</div> <div><a href='http://www.msceleb.org/' target='_blank' rel='nofollow noopener'>msceleb.org</a></div> - </div></div><p>Microsoft Celeb (MS Celeb) is a dataset of 10 million face images harvested from the Internet for the purpose of developing face recognition technologies. According to Microsoft Research, who created and published the <a href="https://www.microsoft.com/en-us/research/publication/ms-celeb-1m-dataset-benchmark-large-scale-face-recognition-2/">dataset</a> in 2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of nearly 100,000 individuals. Microsoft's goal in building this dataset was to distribute an initial training dataset of 100,000 individuals' biometric data to accelerate research into recognizing a larger target list of one million people "using all the possibly collected face images of this individual on the web as training data".<a class="footnote_shim" name="[^msceleb_orig]_1"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 1">1</a></p> + </div></div><p>Microsoft Celeb (MS Celeb or MS-Celeb-1M) is a dataset of 10 million face images harvested from the Internet for the purpose of developing face recognition technologies. According to Microsoft Research, who created and published the <a href="https://www.microsoft.com/en-us/research/publication/ms-celeb-1m-dataset-benchmark-large-scale-face-recognition-2/">dataset</a> in 2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of nearly 100,000 individuals. 
Microsoft's goal in building this dataset was to distribute an initial training dataset of 100,000 individuals' biometric data to accelerate research into recognizing a larger target list of one million people "using all the possibly collected face images of this individual on the web as training data".<a class="footnote_shim" name="[^msceleb_orig]_1"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 1">1</a></p> <p>While the majority of people in this dataset are American and British actors, the exploitative use of the term "celebrity" extends far beyond Hollywood. Many of the names in the MS Celeb face recognition dataset are merely people who must maintain an online presence for their professional lives: journalists, artists, musicians, activists, policy makers, writers, and academics. Many people in the target list are even vocal critics of the very technology Microsoft is using their name and biometric information to build. It includes digital rights activists like Jillian York; artists critical of surveillance including Trevor Paglen, Jill Magid, and Aram Bartholl; Intercept founders Laura Poitras, Jeremy Scahill, and Glenn Greenwald; Data and Society founder danah boyd; Shoshana Zuboff, author of <em>Surveillance Capitalism</em>; and even Julie Brill, the former FTC commissioner responsible for protecting consumer privacy.</p> <h3>Microsoft's 1 Million Target List</h3> -<p>Microsoft Research distributed two main digital assets: a list of 1 million names, and a dataset of 10,000,000 images of 100,000 individuals. The 900,000 names without images are the problem researchers are trying to solve: million scale face recognition. Below is a selection of 24 names from the target list and image list curated to illustrate Microsoft's expansive and exploitative practice of scraping the Internet for biometric training data. Names with a number indicate how many images were distributed by Microsoft. 
Since publishing the analysis, Microsoft has quietly taken down their website <a href="https://msceleb.org">msceleb.org</a> but a partial list of the identifiers is still available on <a href="https://github.com/JinRC/C-MS-Celeb/">github.com/JinRC/C-MS-Celeb/</a>. The IDs are in the format "m.abc123" and can be accessed through <a href="https://developers.google.com/knowledge-graph/reference/rest/v1/">Google's knowledge graph</a> as "/m/abc123" to obtain the subject names.</p> +<p>Microsoft Research distributed two main digital assets: a dataset of approximately 10,000,000 images of 100,000 individuals and a target list of exactly 1 million names. The 900,000 names without images are the target list, which is used to gather more images for these individuals.</p> +<p>For example, in a research project authored by researchers from SenseTime's Joint Lab at the Chinese University of Hong Kong called "<a href="https://arxiv.org/pdf/1809.01407.pdf">Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition</a>", approximately 7 million images from an additional 285,000 subjects were added to their dataset. The images were obtained by crawling the internet using the MS Celeb target list as the search query.</p> +<p>Below is a selection of 24 names from both the target list and image list curated to illustrate Microsoft's expansive and exploitative practice of scraping the Internet for biometric training data. Names with a number indicate how many images were distributed by Microsoft. Since publishing the analysis, Microsoft has quietly taken down their <a href="https://msceleb.org">msceleb.org</a> website, but a partial list of the identifiers is still available on <a href="https://github.com/JinRC/C-MS-Celeb/">github.com/JinRC/C-MS-Celeb/</a>. 
The IDs are in the format "m.abc123" and can be accessed through <a href="https://developers.google.com/knowledge-graph/reference/rest/v1/">Google's Knowledge Graph</a> as "/m/abc123" to obtain the subject names.</p> </section><section><div class='columns columns-2'><div class='column'><table> <thead><tr> <th>Name (images)</th> @@ -194,21 +196,26 @@ </tr> </tbody> </table> -</div></div></section><section><p>After MS Celeb was first published in 2016, researchers affiliated with Microsoft Asia worked with researchers affiliated with China's <a href="https://en.wikipedia.org/wiki/National_University_of_Defense_Technology">National University of Defense Technology</a> (controlled by China's Central Military Commission) and used the MS Celeb images for their research paper on using "<a href="https://www.semanticscholar.org/paper/Faces-as-Lighting-Probes-via-Unsupervised-Deep-Yi-Zhu/b301fd2fc33f24d6f75224e7c0991f4f04b64a65">Faces as Lighting Probes via Unsupervised Deep Highlight Extraction</a>" with potential applications in 3D face recognition.</p> +</div></div></section><section><p>After the MS Celeb dataset was introduced in 2016, researchers affiliated with Microsoft Asia worked with researchers affiliated with China's <a href="https://en.wikipedia.org/wiki/National_University_of_Defense_Technology">National University of Defense Technology</a> (controlled by China's Central Military Commission) and used the MS Celeb images for their research paper on using "<a href="https://www.semanticscholar.org/paper/Faces-as-Lighting-Probes-via-Unsupervised-Deep-Yi-Zhu/b301fd2fc33f24d6f75224e7c0991f4f04b64a65">Faces as Lighting Probes via Unsupervised Deep Highlight Extraction</a>" with potential applications in 3D face recognition.</p> <p>In an April 10, 2019 <a href="https://www.ft.com/content/9378e7ee-5ae6-11e9-9dde-7aedca0a081a">article</a> published by Financial Times based on data surfaced during this investigation, Samm Sacks (a senior fellow at the New America think 
tank) commented that this research raised "red flags because of the nature of the technology, the author's affiliations, combined with what we know about how this technology is being deployed in China right now". Adding, that "the [Chinese] government is using these technologies to build surveillance systems and to detain minorities [in Xinjiang]".<a class="footnote_shim" name="[^madhu_ft]_1"> </a><a href="#[^madhu_ft]" class="footnote" title="Footnote 2">2</a></p> <p>Four more papers published by SenseTime that also use the MS Celeb dataset raise similar flags. SenseTime is a computer vision surveillance company that until <a href="https://uhrp.org/news-commentary/china%E2%80%99s-sensetime-sells-out-xinjiang-security-joint-venture">April 2019</a> provided surveillance to Chinese authorities to monitor and track Uighur Muslims in Xinjiang province, and had been <a href="https://www.nytimes.com/2019/04/14/technology/china-surveillance-artificial-intelligence-racial-profiling.html">flagged</a> numerous times as having potential links to human rights violations.</p> -<p>One of the 4 SenseTime papers, "<a href="https://www.semanticscholar.org/paper/Exploring-Disentangled-Feature-Representation-Face-Liu-Wei/1fd5d08394a3278ef0a89639e9bfec7cb482e0bf">Exploring Disentangled Feature Representation Beyond Face Identification</a>", shows how SenseTime was developing automated face analysis technology to infer race, narrow eyes, nose size, and chin size, all of which could be used to target vulnerable ethnic groups based on their facial appearances.</p> +<p>One of the 4 SenseTime papers, "<a href="https://www.semanticscholar.org/paper/Exploring-Disentangled-Feature-Representation-Face-Liu-Wei/1fd5d08394a3278ef0a89639e9bfec7cb482e0bf">Exploring Disentangled Feature Representation Beyond Face Identification</a>", shows how SenseTime was developing automated face analysis technology to infer race, narrow eyes, nose size, and chin size, all of which could be used to target 
vulnerable ethnic groups based on their facial appearances, and using the MS Celeb dataset to build their technology.</p> <p>Earlier in 2019, Microsoft President and Chief Legal Officer <a href="https://blogs.microsoft.com/on-the-issues/2018/12/06/facial-recognition-its-time-for-action/">Brad Smith</a> called for the governmental regulation of face recognition, citing the potential for misuse, a rare admission that Microsoft's surveillance-driven business model had lost its bearing. More recently Smith also <a href="https://www.reuters.com/article/us-microsoft-ai/microsoft-turned-down-facial-recognition-sales-on-human-rights-concerns-idUSKCN1RS2FV">announced</a> that Microsoft would seemingly take a stand against such potential misuse, and had decided to not sell face recognition to an unnamed United States agency, citing a lack of accuracy. In effect, Microsoft's face recognition software was not suitable to be used on minorities because it was trained mostly on white male faces.</p> <p>What the decision to block the sale announces is not so much that Microsoft had upgraded their ethics policy, but that Microsoft publicly acknowledged it can't sell a data-driven product without data. In other words, Microsoft can't sell face recognition if they don't have enough data to build it.</p> -<p>Until now, that data has been freely harvested from the Internet and packaged in training sets like MS Celeb, which are overwhelmingly <a href="https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html">white</a> and <a href="https://gendershades.org">male</a>. Without balanced data, facial recognition contains blind spots. 
But without the large-scale datasets like MS Celeb, the powerful yet inaccurate facial recognition services like Microsoft's Azure Cognitive Service the services might not exist at all.</p> -</section><section class='images'><div class='image'><img src='https://nyc3.digitaloceanspaces.com/megapixels/v1/datasets/msceleb/assets/msceleb_montage.jpg' alt=' A visualization of 2,000 of the 100,000 identities included in the MS-Celeb-1M dataset distributed by Microsoft Research. License: Open Data Commons Public Domain Dedication (PDDL)'><div class='caption'> A visualization of 2,000 of the 100,000 identities included in the MS-Celeb-1M dataset distributed by Microsoft Research. License: Open Data Commons Public Domain Dedication (PDDL)</div></div></section><section><p>Microsoft didn't only create MS Celeb for other researchers to use, they also used it internally. In a publicly available 2017 Microsoft Research project called "<a href="https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/">One-shot Face Recognition by Promoting Underrepresented Classes</a>," Microsoft leveraged the MS Celeb dataset to build their algorithms and advertise the results. Interestingly, Microsoft's <a href="https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/">corporate version</a> of the paper does not mention they used the MS Celeb datset, but the <a href="https://www.semanticscholar.org/paper/One-shot-Face-Recognition-by-Promoting-Classes-Guo/6cacda04a541d251e8221d70ac61fda88fb61a70">open-access version</a> published on arxiv.org explicitly mentions that Microsoft Research analyzed their algorithms using "the MS-Celeb-1M low-shot learning benchmark task."</p> -<p>Typically researchers will phrase this differently and say they use data to validate their algorithm. In reality algorithms without data are only concepts or blueprints for how to use the data. 
Algorithms are used to extract the knowledge and distill it into an active format where it can be used for inference. Passing a face image through a face recognition neural network is to pass that image through the entire dataset.</p> +<p>Until now, that data has been freely harvested from the Internet and packaged in training sets like MS Celeb, which are overwhelmingly <a href="https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html">white</a> and <a href="https://gendershades.org">male</a>. Without balanced data, facial recognition contains blind spots. But without the large-scale datasets like MS Celeb, the powerful yet inaccurate facial recognition services like Microsoft's Azure Cognitive would be even less usable.</p> +</section><section class='images'><div class='image'><img src='https://nyc3.digitaloceanspaces.com/megapixels/v1/datasets/msceleb/assets/msceleb_montage.jpg' alt=' A visualization of 2,000 of the 100,000 identities included in the MS-Celeb-1M dataset distributed by Microsoft Research. License: Open Data Commons Public Domain Dedication (PDDL)'><div class='caption'> A visualization of 2,000 of the 100,000 identities included in the MS-Celeb-1M dataset distributed by Microsoft Research. License: Open Data Commons Public Domain Dedication (PDDL)</div></div></section><section><p>Microsoft didn't only create MS Celeb for other researchers to use, they also used it internally. In a publicly available 2017 Microsoft Research project called "<a href="https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/">One-shot Face Recognition by Promoting Underrepresented Classes</a>," Microsoft used the MS Celeb face dataset to build their algorithms and advertise the results. 
Interestingly, Microsoft's <a href="https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/">corporate version</a> of the paper does not mention they used the MS Celeb dataset, but the <a href="https://www.semanticscholar.org/paper/One-shot-Face-Recognition-by-Promoting-Classes-Guo/6cacda04a541d251e8221d70ac61fda88fb61a70">open-access version</a> published on arxiv.org does. It states that Microsoft Research analyzed their algorithms using "the MS-Celeb-1M low-shot learning benchmark task."<a class="footnote_shim" name="[^one_shot]_1"> </a><a href="#[^one_shot]" class="footnote" title="Footnote 5">5</a></p> +<p>Typically researchers will phrase this differently and say they use data to validate their algorithm. But in reality neural network algorithms without data are only blueprints for how to use the data. Neural network algorithms are used to extract knowledge and distill it into an active format where it can be used for inference. Passing a face image through a face recognition neural network is to pass that image through the entire dataset.</p> <h2>Runaway Data</h2> <p>Despite Microsoft's recent action to quietly shut down their large scale distribution of non-cooperative biometrics on the <a href="https://msceleb.org">MS Celeb</a> website, the dataset still exists in several repositories on GitHub, the hard drives of countless researchers, and will likely continue to be used in research projects around the world.</p> <p>The most recent of which is a paper uploaded to arxiv.org on April 2, 2019 jointly authored by researchers from IIIT-Delhi and IBM TJ Watson Research Center. In their paper titled <a href="https://arxiv.org/abs/1904.01219">Deep Learning for Face Recognition: Pride or Prejudiced?</a>, the researchers use a new dataset, called <em>Racial Faces in the Wild</em> (RFW), made entirely from the original images of the MS Celeb dataset. 
To create it, the RFW authors uploaded everyone's image from the MS Celeb dataset to Face++ and used the inferred racial scores to segregate people into four subsets: Caucasian, Asian, Indian, and African each with 3,000 subjects.</p> <p>Face++ is a face recognition product from Megvii Inc. who has been repeatedly linked to the oppressive surveillance of Uighur Muslims in Xinjiang, China. According to posts from the <a href="https://chinai.substack.com/p/chinai-newsletter-11-companies-involved-in-expanding-chinas-public-security-apparatus-in-xinjiang">ChinAI Newsletter</a> and <a href="https://www.buzzfeednews.com/article/ryanmac/us-money-funding-facial-recognition-sensetime-megvii">BuzzFeedNews</a>, Megvii announced in 2017 at the China-Eurasia Security Expo in Ürümqi, Xinjiang, that it would be the official technical support unit of the "Public Security Video Laboratory" in Xinjiang, China.</p> -<p>Megvii also publicly acknowledges using the MS Celeb face dataset in their 2018 research project called <a href="https://arxiv.org/pdf/1808.06210.pdf">GridFace: Face Rectification via Learning Local Homography Transformations</a>. The paper has three authors, all of whom worked for Megvii, indicating that the dataset has been used for commercial purposes. However, on Microsoft's <a href="http://web.archive.org/web/20180218212120/http://www.msceleb.org/download/sampleset">website</a> they state that the dataset was released "for non-commercial research purpose only."</p> -</section><section><div class='columns columns-2'><section class='images'><div class='image'><img src='https://nyc3.digitaloceanspaces.com/megapixels/v1/datasets/msceleb/assets/1812.00194.jpg' alt=' Racial Faces in-the-Wild: Reducing Racial Bias by Deep Unsupervised Domain Adaptation by Beijing University of Posts and Telecommunications and Canon Information Technology Co., Ltd. 2018. 
Source <a href="https://arxiv.org/pdf/1812.00194">arxiv.org</a>'><div class='caption'> Racial Faces in-the-Wild: Reducing Racial Bias by Deep Unsupervised Domain Adaptation by Beijing University of Posts and Telecommunications and Canon Information Technology Co., Ltd. 2018. Source <a href="https://arxiv.org/pdf/1812.00194">arxiv.org</a></div></div></section><section class='images'><div class='image'><img src='https://nyc3.digitaloceanspaces.com/megapixels/v1/datasets/msceleb/assets/1808.06210.jpg' alt=' GridFace: Face Rectification via Learning LocalHomography Transformations by Megvii (Face++) in 2018. Source: <a href="https://arxiv.org/abs/1808.06210">arxiv.org</a>'><div class='caption'> GridFace: Face Rectification via Learning LocalHomography Transformations by Megvii (Face++) in 2018. Source: <a href="https://arxiv.org/abs/1808.06210">arxiv.org</a></div></div></section></div></section><section><p>Considering the multiple examples of commercial use (SenseTime, Megvii, Canon, Hitachi, Microsoft, Microsoft Asia), military use (National University of Defense Technology in China), and the proliferation of subsets it's clear that Microsoft is no longer in control of the MS Celeb dataset nor the biometric data of the 100,000 individuals whose images were distributed in the dataset.</p> +<h2>Commercial Usage</h2> +<p>Megvii publicly acknowledges using the MS Celeb face dataset in their 2018 research project called <a href="https://arxiv.org/pdf/1808.06210.pdf">GridFace: Face Rectification via Learning Local Homography Transformations</a>. The paper has three authors, all of whom were associated with Megvii, indicating that the dataset has been used for research associated with commercial activity. 
However, on Microsoft's <a href="http://web.archive.org/web/20180218212120/http://www.msceleb.org/download/sampleset">website</a> they state that the dataset was released "for non-commercial research purpose only."</p> +<p>A clearer example of commercial use happened in 2017 when Microsoft Research organized a face recognition competition at the International Conference on Computer Vision (ICCV), one of the top 2 computer vision conferences worldwide, where industry and academia compete to achieve the highest performance using their recognition technology. In 2017, the winner of the MS-Celeb-1M challenge was Beijing-based OrionStar Technology Co., Ltd. In their <a href="https://www.prnewswire.com/news-releases/orionstar-wins-challenge-to-recognize-one-million-celebrity-faces-with-artificial-intelligence-300494265.html">press release</a>, OrionStar boasted a 13% increase on the difficult set over last year's winner.</p> +<p>Microsoft Research also ran a similar competition in 2016 with other commercial participants including Beijing Faceall Technology Co., Ltd., a company providing face recognition for "smart city" applications.</p> +<p>On October 28, 2019, the MS Celeb dataset will be used for yet another competition called "<a href="https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/">Lightweight Face Recognition Challenge & Workshop</a>" where the best face recognition entry will be awarded $5,000 from Huawei and $3,000 from DeepGlint. The competition is part of the <a href="http://iccv2019.thecvf.com/program/workshops">ICCV 2019 conference</a>. 
This time the challenge is no longer being organized by Microsoft, who created the dataset, but instead by Imperial College London (UK) and <a href="https://github.com/deepinsight/insightface">InsightFace</a> (CN).</p> +<p>Even though Microsoft has shuttered access to the official distribution website <a href="https://msceleb.org">msceleb.org</a>, the dataset can still be easily downloaded from <a href="https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/">https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/</a> without agreeing to any terms for usage or further distribution.</p> +<p>Considering the multiple citations from commercial organizations (Canon, Hitachi, IBM, Megvii, Microsoft, Microsoft Asia, SenseTime), military use (National University of Defense Technology in China), and the proliferation of subsets being used for new face recognition competitions, it's fairly clear that Microsoft is no longer in control of their MS Celeb dataset nor the biometric data of nearly 10 million images of 100,000 individuals whose images were distributed in the dataset.</p> <p>To provide insight into where these 10 million face images have traveled, we mapped all the publicly available research citations to show who used the dataset and where it was used.</p> </section><section> <h3>Who used Microsoft Celeb?</h3> @@ -274,6 +281,7 @@ </li><li>2 <a name="[^madhu_ft]" class="footnote_shim"></a><span class="backlinks"><a href="#[^madhu_ft]_1">a</a></span>Murgia, Madhumita. Microsoft worked with Chinese military university on artificial intelligence. Financial Times. April 10, 2019. </li><li>3 <a name="[^rfw]" class="footnote_shim"></a><span class="backlinks"></span>Wang, Mei; Deng, Weihong; Hu, Jiani; Peng, Jianteng; Tao, Xunqiang; Huang, Yaohai. Racial Faces in-the-Wild: Reducing Racial Bias by Deep Unsupervised Domain Adaptation. 2018. 
<a href="http://arxiv.org/abs/1812.00194">http://arxiv.org/abs/1812.00194</a> </li><li>4 <a name="[^pride_prejudice]" class="footnote_shim"></a><span class="backlinks"></span>Nagpal, Shruti; Singh, Maneet; Singh, Richa; Vatsa, Mayank; Ratha, Nalini K. Deep Learning for Face Recognition: Pride or Prejudiced? 2019. <a href="http://arxiv.org/abs/1904.01219">http://arxiv.org/abs/1904.01219</a> +</li><li>5 <a name="[^one_shot]" class="footnote_shim"></a><span class="backlinks"><a href="#[^one_shot]_1">a</a></span>Guo, Yandong; Zhang, Lei. One-shot Face Recognition by Promoting Underrepresented Classes. 2017. <a href="https://arxiv.org/abs/1707.05574">https://arxiv.org/abs/1707.05574</a> </li></ul></section></section> </div> |
