Diffstat (limited to 'site/public/datasets/msceleb')
| -rw-r--r-- | site/public/datasets/msceleb/index.html | 29 |
1 files changed, 20 insertions, 9 deletions
diff --git a/site/public/datasets/msceleb/index.html b/site/public/datasets/msceleb/index.html index 42a44571..f0da450f 100644 --- a/site/public/datasets/msceleb/index.html +++ b/site/public/datasets/msceleb/index.html @@ -55,8 +55,8 @@ </header> <div class="content content-dataset"> - <section class='intro_section' style='background-image: url(https://nyc3.digitaloceanspaces.com/megapixels/v1/datasets/msceleb/assets/background.jpg)'><div class='inner'><div class='hero_desc'><span class='bgpad'>MS Celeb is a dataset of 10 million face images harvested from the Internet</span></div><div class='hero_subdesc'><span class='bgpad'>The MS Celeb dataset includes 10 million images of 100,000 people and an additional target list of 1,000,000 individuals -</span></div></div></section><section><h2>Microsoft Celeb Dataset (MS Celeb)</h2> + <section class='intro_section' style='background-image: url(https://nyc3.digitaloceanspaces.com/megapixels/v1/datasets/msceleb/assets/background.jpg)'></section><section><div class='image'><div class='intro-caption caption'>Example images from the MS-Celeb-1M dataset</div></div></section><section><h1>Microsoft Celeb Dataset (MS Celeb)</h1> +<p><em>Update: In response to this report and an <a href="https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e">investigation</a> by the Financial Times, Microsoft has terminated their MS-Celeb website <a href="https://msceleb.org">https://msceleb.org</a>.</em></p> </section><section><div class='right-sidebar'><div class='meta'> <div class='gray'>Published</div> <div>2016</div> @@ -78,7 +78,8 @@ </div><div class='meta'> <div class='gray'>Website</div> <div><a href='http://www.msceleb.org/' target='_blank' rel='nofollow noopener'>msceleb.org</a></div> - </div></div><p>Microsoft Celeb (MS-Celeb-1M) is a dataset of 10 million face images harvested from the Internet for the purpose of developing face recognition technologies. 
According to Microsoft Research, who created and published the <a href="https://www.microsoft.com/en-us/research/publication/ms-celeb-1m-dataset-benchmark-large-scale-face-recognition-2/">dataset</a> in 2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of nearly 100,000 individuals. Microsoft's goal in building this dataset was to distribute an initial training dataset of 100,000 individuals' biometric data to accelerate research into recognizing a larger target list of one million people "using all the possibly collected face images of this individual on the web as training data".<a class="footnote_shim" name="[^msceleb_orig]_1"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 1">1</a></p> + </div><div class='meta'><div class='gray'>Press coverage</div><div><a href="https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e">Financial Times</a>, <a href="https://www.nytimes.com/2019/07/13/technology/databases-faces-facial-recognition-technology.html">New York Times</a>, <a href="https://www.bbc.com/news/technology-48555149">BBC</a>, <a href="https://www.spiegel.de/netzwelt/web/microsoft-gesichtserkennung-datenbank-mit-zehn-millionen-fotos-geloescht-a-1271221.html">Spiegel</a>, <a href="https://www.lesechos.fr/tech-medias/intelligence-artificielle/le-mariage-explosif-de-nos-donnees-et-de-lia-1031813">Les Echos</a>, <a href="https://www.lastampa.it/2019/06/22/tecnologia/microsoft-ha-cancellato-il-suo-database-per-il-riconoscimento-facciale-PWwLGmpO1fKQdykMZVBd9H/pagina.html">La Stampa</a></div></div></div><p>Microsoft Celeb (MS-Celeb-1M) is a dataset of 10 million face images harvested from the Internet for the purpose of developing face recognition technologies.</p> +<p>According to Microsoft Research, who created and published the <a href="https://www.microsoft.com/en-us/research/publication/ms-celeb-1m-dataset-benchmark-large-scale-face-recognition-2/">dataset</a> in 
2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of nearly 100,000 individuals. Microsoft's goal in building this dataset was to distribute an initial training dataset of 100,000 individuals' biometric data to accelerate research into recognizing a larger target list of one million people "using all the possibly collected face images of this individual on the web as training data".<a class="footnote_shim" name="[^msceleb_orig]_1"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 1">1</a></p> <p>While the majority of people in this dataset are American and British actors, the exploitative use of the term "celebrity" extends far beyond Hollywood. Many of the names in the MS Celeb face recognition dataset are merely people who must maintain an online presence for their professional lives: journalists, artists, musicians, activists, policy makers, writers, and academics. Many people in the target list are even vocal critics of the very technology Microsoft is using their name and biometric information to build. It includes digital rights activists like Jillian York; artists critical of surveillance including Trevor Paglen, Jill Magid, and Aram Bartholl; Intercept founders Laura Poitras, Jeremy Scahill, and Glenn Greenwald; Data and Society founder danah boyd; Shoshana Zuboff, author of <em>Surveillance Capitalism</em>; and even Julie Brill, the former FTC commissioner responsible for protecting consumer privacy.</p> <h3>Microsoft's 1 Million Target List</h3> <p>Microsoft Research distributed two main digital assets: a dataset of approximately 10,000,000 images of 100,000 individuals and a target list of exactly 1 million names. 
The 900,000 names without images are the target list, which is used to gather more images for each subject.</p> @@ -219,6 +220,8 @@ <p>In 2017 Microsoft Research organized a face recognition competition at the International Conference on Computer Vision (ICCV), one of the top 2 computer vision conferences worldwide, where industry and academia used the MS Celeb dataset to compete for the highest performance scores. The 2017 winner was Beijing-based OrionStar Technology Co., Ltd.. In their <a href="https://www.prnewswire.com/news-releases/orionstar-wins-challenge-to-recognize-one-million-celebrity-faces-with-artificial-intelligence-300494265.html">press release</a>, OrionStar boasted a 13% increase on the difficult set over last year's winner. The prior year's competitors included Beijing-based Faceall Technology Co., Ltd., a company providing face recognition for "smart city" applications.</p> <p>Considering the multiple citations from commercial organizations (Canon, Hitachi, IBM, Megvii/Face++, Microsoft, Microsoft Asia, SenseTime, OrionStar, Faceall), military use (National University of Defense Technology in China), the proliferation of subset data (Racial Faces in the Wild), and the real-time visible proliferation via Academic Torrents it's fairly clear that Microsoft has lost control of their MS Celeb dataset and the biometric data of nearly 100,000 individuals.</p> <p>To provide insight into where these 10 million faces images have traveled, over 100 research papers have been verified and geolocated to show who used the dataset and where they used it.</p> +<h2>GDPR and MS-Celeb</h2> +<p>[ in progress ]</p> </section><section> <h3>Who used Microsoft Celeb?</h3> @@ -240,10 +243,10 @@ <section> - <h3>Information Supply chain</h3> + <h3>Information Supply Chain</h3> <p> - To help understand how Microsoft Celeb has been used around the world by commercial, military, and academic organizations; existing publicly available research citing Microsoft Celebrity 
Dataset was collected, verified, and geocoded to show the biometric trade routes of people appearing in the images. Click on the markers to reveal research projects at that location. + To help understand how Microsoft Celeb has been used around the world by commercial, military, and academic organizations; existing publicly available research citing Microsoft Celebrity Dataset was collected, verified, and geocoded to show how AI training data has proliferated around the world. Click on the markers to reveal research projects at that location. </p> </section> @@ -279,11 +282,19 @@ <h2>Supplementary Information</h2> -</section><section><h5>FAQs and Fact Check</h5> +</section><section><h3>Age and Gender Distribution</h3> +</section><section><div class='columns columns-2'><section class='applet_container'><div class='applet' data-payload='{"command": "single_pie_chart /datasets/msceleb/assets/age.csv", "fields": ["Caption: MS-Celeb dataset age distribution", "Top: 10", "OtherLabel: Other"]}'></div></section><section class='applet_container'><div class='applet' data-payload='{"command": "single_pie_chart /datasets/helen/assets/gender.csv", "fields": ["Caption: MS-Celeb dataset gender distribution", "Top: 10", "OtherLabel: Other"]}'></div></section></div></section><section><h5>FAQs and Fact Check</h5> <ul> -<li><strong>The MS Celeb images were not derived from Creative Commons sources</strong>. They were obtained by "retriev[ing] approximately 100 images per celebrity from popular search engines"<a class="footnote_shim" name="[^msceleb_orig]_2"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 1">1</a>. The dataset actually includes many copyrighted images. Microsoft doesn't provide any image URLs, but manually reviewing a small portion of images from the dataset shows many images with watermarked "Copyright" text over the image. 
TinEye could be used to more accurately determine the image origins in aggregate</li> -<li><strong>Microsoft did not distribute images of all one million people.</strong> They distributed images for about 100,000 and then encouraged other researchers to download the remaining 900,000 people "by using all the possibly collected face images of this individual on the web as training data."<a class="footnote_shim" name="[^msceleb_orig]_3"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 1">1</a></li> -<li><strong>Microsoft had not deleted or stopped distribution of their MS Celeb at the time of most press reports on June 4.</strong> Until at least June 6, 2019 the Microsoft Research data portal provided the MS Celeb dataset for download: <a href="http://web.archive.org/web/20190606150005/https://msropendata.com/datasets/98fdfc70-85ee-5288-a69f-d859bbe9c737">http://web.archive.org/web/20190606150005/https://msropendata.com/datasets/98fdfc70-85ee-5288-a69f-d859bbe9c737</a></li> +<li><strong>Despite several erroneous reports stating that the MS-Celeb images were derived from Creative Commons licensed media, the MS Celeb images were obtained from web search engines</strong>. The authors state that the images were obtained by "retriev[ing] approximately 100 images per celebrity from popular search engines"<a class="footnote_shim" name="[^msceleb_orig]_2"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 1">1</a>. Many, if not the vast majority, are copyrighted images. Microsoft doesn't provide image URLs, but manually reviewing a small portion of images from the dataset shows images with watermarked "Copyright" text over the image and sources including stock photo agencies such as Getty. TinEye could be used to more accurately determine the image origins in aggregate.</li> +<li><strong>Most reports incorrectly stated that Microsoft distributed images of all one million people. 
As this analysis mentions several times, Microsoft distributed images for 100,000 people and a separate target list of 900,000 more names.</strong> Other researchers were then expected and encouraged to download the remaining 900,000 people "by using all the possibly collected face images of this individual on the web as training data."<a class="footnote_shim" name="[^msceleb_orig]_3"> </a><a href="#[^msceleb_orig]" class="footnote" title="Footnote 1">1</a></li> +<li><strong>Microsoft claimed that they had deleted or stopped distribution of their MS Celeb dataset in April 2019 after the Financial Times investigation. This is false.</strong> Until at least June 6, 2019 the Microsoft Research data portal freely provided the full MS Celeb dataset for download: <a href="http://web.archive.org/web/20190606150005/https://msropendata.com/datasets/98fdfc70-85ee-5288-a69f-d859bbe9c737">http://web.archive.org/web/20190606150005/https://msropendata.com/datasets/98fdfc70-85ee-5288-a69f-d859bbe9c737</a></li> +</ul> +<h3>Press Coverage</h3> +<ul> +<li>Financial Times (original story): <a href="https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e">Who’s using your face? 
The ugly truth about facial recognition</a> </li> +<li>New York Times (front page story): <a href="https://www.nytimes.com/2019/07/13/technology/databases-faces-facial-recognition-technology.html">Facial Recognition Tech Is Growing Stronger, Thanks to Your Face</a></li> +<li>BBC: <a href="https://www.bbc.com/news/technology-48555149">Microsoft deletes massive face recognition database</a></li> +<li>Spiegel: <a href="https://www.spiegel.de/netzwelt/web/microsoft-gesichtserkennung-datenbank-mit-zehn-millionen-fotos-geloescht-a-1271221.html">Microsoft löscht Datenbank mit zehn Millionen Fotos</a></li> </ul> </section><section><h3>References</h3><section><ul class="footnotes"><li>1 <a name="[^msceleb_orig]" class="footnote_shim"></a><span class="backlinks"><a href="#[^msceleb_orig]_1">a</a><a href="#[^msceleb_orig]_2">b</a><a href="#[^msceleb_orig]_3">c</a></span>MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. Accessed April 18, 2019. <a href="http://web.archive.org/web/20190418151913/http://msceleb.org/">http://web.archive.org/web/20190418151913/http://msceleb.org/</a> </li><li>2 <a name="[^madhu_ft]" class="footnote_shim"></a><span class="backlinks"><a href="#[^madhu_ft]_1">a</a></span>Murgia, Madhumita. Microsoft worked with Chinese military university on artificial intelligence. Financial Times. April 10, 2019. |
