| author | Adam Harvey <adam@ahprojects.com> | 2019-05-23 18:37:06 +0200 |
|---|---|---|
| committer | Adam Harvey <adam@ahprojects.com> | 2019-05-23 18:37:06 +0200 |
| commit | b2b2c7d7816baa7d6de36c1de3576a31aa92a209 (patch) | |
| tree | 9105ef39a3bfcd78e9cf4b8c183ee21e7149bf66 /site/content | |
| parent | 4559cf6cccfb6f6d8b8e59e95984044fdf5a5610 (diff) | |
| parent | 84b286e1bd85feba12174a2a480d2be404e7b9c5 (diff) | |
merge
Diffstat (limited to 'site/content')
19 files changed, 218 insertions, 46 deletions
diff --git a/site/content/_drafts_/adience/index.md b/site/content/_drafts_/adience/index.md
new file mode 100644
index 00000000..60a6cd1f
--- /dev/null
+++ b/site/content/_drafts_/adience/index.md
@@ -0,0 +1,32 @@
+------------
+
+status: draft
+title: Adience
+desc: <span class="dataset-name">Adience</span> is a ...
+subdesc: Adience contains ...
+slug: Adience
+cssclass: dataset
+image: assets/background.jpg
+year: 2007
+published: 2019-2-23
+updated: 2019-2-23
+authors: Adam Harvey
+
+------------
+
+## Adience Dataset
+
+### sidebar
+### end sidebar
+
+[ page under development ]
+
+- Deep Age Estimation: From Classification to Ranking
+ - https://verify.megapixels.cc/paper/adience/verify/4f1249369127cc2e2894f6b2f1052d399794919a
+ - funded by FordMotor Company University Reserach Program
+- Unconstrained Age Estimation with Deep Convolutional Neural Networks
+ - https://verify.megapixels.cc/paper/adience/verify/31f1e711fcf82c855f27396f181bf5e565a2f58d
+ - "we augment our data by sampling 1000 images for the age group of 0-20 from Adience [3]"
+ - the work was supported by IARPA and ODNI
+
+{% include 'dashboard.html' %}
\ No newline at end of file
diff --git a/site/content/_drafts_/ibm_dif/index.md b/site/content/_drafts_/ibm_dif/index.md
new file mode 100644
index 00000000..5d72193b
--- /dev/null
+++ b/site/content/_drafts_/ibm_dif/index.md
@@ -0,0 +1,28 @@
+------------
+
+status: draft
+title: IBM Diversity in Faces
+desc: <span class="dataset-name">IBM Diversity in Faces</span> is a person re-identification dataset of images captured at UC Santa Cruz in 2007
+subdesc: IBM Diversity in Faces contains 1,264 images and 632 persons on the UC Santa Cruz campus and is used to train person re-identification algorithms for surveillance
+slug: IBM Diversity in Faces
+cssclass: dataset
+image: assets/background.jpg
+year: 2007
+published: 2019-2-23
+updated: 2019-2-23
+authors: Adam Harvey
+
+------------
+
+## IBM Diversity in Faces Dataset
+
+### sidebar
+### end sidebar
+
+[ page under development ]
+
+in "Understanding Unequal Gender Classification Accuracyfrom Face Images" researcher affilliated with IBM created a new version of PPB so they didn't have to agree to the terms of the original PPB.
+
+>We use an approximation of the PPB dataset for the ex-periments in this paper. This dataset contains images ofparliament members from the six countries identified in[4] and were manually labeled by us into the categoriesdark-skinned and light-skinned.1Our approximation tothe PPB dataset, which we call PPB*, is very similar toPPB and satisfies the relevant characteristics for the study we perform. Table 1 compares the decomposition of theoriginal PPB dataset and our PPB* approximation accord-ing to skin type and gender.
+
+{% include 'dashboard.html' %}
\ No newline at end of file diff --git a/site/content/_drafts_/lfw/index.md b/site/content/_drafts_/lfw/index.md index 5d90e87f..ad43e2dd 100644 --- a/site/content/_drafts_/lfw/index.md +++ b/site/content/_drafts_/lfw/index.md @@ -18,6 +18,15 @@ authors: Adam Harvey ### sidebar ### end sidebar + + +## Research notes + +- Used in https://verify.megapixels.cc/paper/feret/verify/8aff9c8a0e17be91f55328e5be5e94aea5227a35https://verify.megapixels.cc/paper/feret/verify/8aff9c8a0e17be91f55328e5be5e94aea5227a35 by Raythen BBN https://en.wikipedia.org/wiki/BBN_Technologies a military contractor +----- + +## Old content + [ PAGE UNDER DEVELOPMENT ] *Labeled Faces in The Wild* (LFW) is "a database of face photographs designed for studying the problem of unconstrained face recognition[^lfw_www]. It is used to evaluate and improve the performance of facial recognition algorithms in academic, commercial, and government research. According to BiometricUpdate.com[^lfw_pingan], LFW is "the most widely used evaluation set in the field of facial recognition, LFW attracts a few dozen teams from around the globe including Google, Facebook, Microsoft Research Asia, Baidu, Tencent, SenseTime, Face++ and Chinese University of Hong Kong." 
diff --git a/site/content/_drafts_/megaface/index.md b/site/content/_drafts_/megaface/index.md
new file mode 100644
index 00000000..4c7bb309
--- /dev/null
+++ b/site/content/_drafts_/megaface/index.md
@@ -0,0 +1,49 @@
+------------
+
+status: draft
+title: MegaFace
+desc: <span class="dataset-name">MegaFace</span> is a face recognition dataset created by scraping Flickr photo albums
+subdesc: MegaFace contains 1,264 images and 632 persons on the UC Santa Cruz campus and is used to train person re-identification algorithms for surveillance
+slug: MegaFace
+cssclass: dataset
+image: assets/background.jpg
+year: 2007
+published: 2019-2-23
+updated: 2019-2-23
+authors: Adam Harvey
+
+------------
+
+## MegaFace Dataset
+
+### sidebar
+### end sidebar
+
+[ page under development ]
+
+*MegaFace (Viewpoint Invariant Pedestrian Recognition)* is a dataset of pedestrian images captured at University of California Santa Cruz in 2007. Accoriding to the reserachers 2 "cameras were placed in different locations in an academic setting and subjects were notified of the presence of cameras, but were not coached or instructed in any way."
+
+MegaFace is amongst the most widely used publicly available person re-identification datasets. In 2017 the MegaFace dataset was combined into a larger person re-identification created by the Chinese University of Hong Kong called PETA (PEdesTrian Attribute).
+
+{% include 'dashboard.html' %}
+
+
+### Research notes
+
+Dataset was used in research paper funded by SenseTime
+
+- https://verify.megapixels.cc/paper/megaface/verify/380d5138cadccc9b5b91c707ba0a9220b0f39271
+- x
+
+From "On Low-Resolution Face Recognition in the Wild:Comparisons and New Techniques"
+
+- Says 130,154 Flickr accounts, but I got 48,382
+- https://verify.megapixels.cc/paper/megaface/verify/841855205818d3a6d6f85ec17a22515f4f062882
+
+> 2) MegaFace Challenge 2 LR subset:The MegaFace challenge 2 (MF2) training dataset [48] is the largest (in the numberof identities) publicly available facial recognition dataset, with4.7 million face images and over 672,000 identities. The MF2dataset is obtained by running the Dlib [29] face detector onimages from Flickr [68], yielding 40 million unlabeled faces across 130,154 distinct Flickr accounts. Automatic identity labeling is performed using a clustering algorithm. We per-formed a subset selection from the MegaFace Challenge 2training set with tight bounding boxes to generate a LR subsetof this dataset. Faces smaller than 50x50 pixels are gathered for each identity, and then we eliminated identities with fewer thanfive images available. This subset selection approach produced 6,700 identities and 85,344 face images in total. The extractionprocess does yield some non-face images, as does the originaldataset processing. No further data cleaning is conducted onthis subset.
+
+UHDB31: A Dataset for Better Understanding Face Recognitionacross Pose and Illumination Variatio
+
+- http://openaccess.thecvf.com/content_ICCV_2017_workshops/papers/w37/Le_UHDB31_A_Dataset_ICCV_2017_paper.pdf
+- MegaFace 1 used 690,572 and 1,027,060
+- MegaFace 2 used 672,057 and 4,753,320
\ No newline at end of file diff --git a/site/content/pages/about/attribution.md b/site/content/pages/about/attribution.md index 148fe6d1..5060b2d9 100644 --- a/site/content/pages/about/attribution.md +++ b/site/content/pages/about/attribution.md @@ -16,6 +16,7 @@ authors: Adam Harvey <section class="about-menu"> <ul> <li><a href="/about/">About</a></li> +<li><a href="/about/press/">Press</a></li> <li><a class="current" href="/about/attribution/">Attribution</a></li> <li><a href="/about/legal/">Legal / Privacy</a></li> </ul> diff --git a/site/content/pages/about/index.md b/site/content/pages/about/index.md index 3884290b..4cf390fc 100644 --- a/site/content/pages/about/index.md +++ b/site/content/pages/about/index.md @@ -16,6 +16,7 @@ authors: Adam Harvey <section class="about-menu"> <ul> <li><a class="current" href="/about/">About</a></li> +<li><a href="/about/press/">Press</a></li> <li><a href="/about/attribution/">Attribution</a></li> <li><a href="/about/legal/">Legal / Privacy</a></li> </ul> @@ -46,7 +47,7 @@ MegaPixels is an art and research project first launched in 2017 for an [install MegaPixels aims to provide a critical perspective on machine learning image datasets, one that might otherwise escape academia and industry funded artificial intelligence think tanks that are often supported by the several of the same technology companies who have created datasets presented on this site. -MegaPixels is an independent project, designed as a public resource for educators, students, journalists, and researchers. Each dataset presented on this site undergoes a thorough review of its images, intent, and funding sources. Though the goals are similar to publishing an academic paper, MegaPixels is a website-first research project, with a academic publications to follow. +MegaPixels is an independent project, designed as a public resource for educators, students, journalists, and researchers. 
Each dataset presented on this site undergoes a thorough review of its images, intent, and funding sources. Though the goals are similar to publishing an academic paper, MegaPixels is a website-first research project, with an academic publication to follow. One of the main focuses of the dataset investigations presented on this site is to uncover where funding originated. Because of our emphasis on other researcher's funding sources, it is important that we are transparent about our own. This site and the past year of research have been primarily funded by a privacy art grant from Mozilla in 2018. The original MegaPixels installation in 2017 was built as a commission for and with support from Tactical Technology Collective and Mozilla. The research into pedestrian analysis datasets was funded by a commission from Elevate Arts, and continued development in 2019 is supported in part by a 1-year Researcher-in-Residence grant from Karlsruhe HfG, as well as lecture and workshop fees. diff --git a/site/content/pages/about/legal.md b/site/content/pages/about/legal.md index a58fde48..e88fbb17 100644 --- a/site/content/pages/about/legal.md +++ b/site/content/pages/about/legal.md @@ -16,6 +16,7 @@ authors: Adam Harvey <section class="about-menu"> <ul> <li><a href="/about/">About</a></li> +<li><a href="/about/press/">Press</a></li> <li><a href="/about/attribution/">Attribution</a></li> <li><a class="current" href="/about/legal/">Legal / Privacy</a></li> </ul> @@ -36,7 +37,7 @@ In order to provide certain features of the site, some 3rd party services are ne ### Links To Other Web Sites -The MegaPixels.cc contains many links to 3rd party websites, especially in the list of citations that are provided for each dataset. This website has no control over and assumes no responsibility for, the content, privacy policies, or practices of any third party web sites or services. 
You acknowledge and agree that megapixels.cc shall not be responsible or liable, directly or indirectly, for any damage or loss caused or alleged to be caused by or in connection with use of or reliance on any such content, goods or services available on or through any such web sites or services. +The MegaPixels.cc contains many links to 3rd party websites, especially in the list of citations that are provided for each dataset. This website has no control over and assumes no responsibility for the content, privacy policies, or practices of any third party web sites or services. You acknowledge and agree that megapixels.cc (and its creators) shall not be responsible or liable, directly or indirectly, for any damage or loss caused or alleged to be caused by or in connection with use of or reliance on any such content, goods or services available on or through any such web sites or services. We advise you to read the terms and conditions and privacy policies of any third-party web sites or services that you visit. diff --git a/site/content/pages/about/press.md b/site/content/pages/about/press.md index 11194ce2..2839bf20 100644 --- a/site/content/pages/about/press.md +++ b/site/content/pages/about/press.md @@ -16,12 +16,24 @@ authors: Adam Harvey <section class="about-menu"> <ul> <li><a href="/about/">About</a></li> +<li><a class="current" href="/about/press/">Press</a></li> <li><a href="/about/attribution/">Attribution</a></li> <li><a href="/about/legal/">Legal / Privacy</a></li> </ul> </section> -https://megapixels.cc + +##### Features - April 19, 2019: [Who's Using Your Face](https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e) by Madhumita Murgia for FT.com -- Aug 22, 2018: [Transgender YouTubers had their videos grabbed to train facial recognition software](https://www.theverge.com/2017/8/22/16180080/transgender-youtubers-ai-facial-recognition-dataset) by James Vincent
\ No newline at end of file + +##### Cited by + +- April 19, 2019: [Western AI researchers partnered with Chinese surveillance firm](https://www.ft.com/content/41be9878-61d9-11e9-b285-3acd5d43599e) by Madhumita Murgia for FT.com + + +##### Related + +- April 20: Washington Post Editorial Board [Opinion | Microsoft worked with a Chinese military university on AI. Does that make sense?](https://www.washingtonpost.com/opinions/microsoft-worked-with-a-chinese-military-university-on-ai-does-that-make-sense/2019/04/21/a0fb82c6-5d59-11e9-842d-7d3ed7eb3957_story.html) +- April 10, 2019: [Microsoft worked with Chinese military university on artificial intelligence](https://www.ft.com/content/9378e7ee-5ae6-11e9-9dde-7aedca0a081a) see also [MS Celeb](/datasets/msceleb) +- Aug 22, 2018: [Transgender YouTubers had their videos grabbed to train facial recognition software](https://www.theverge.com/2017/8/22/16180080/transgender-youtubers-ai-facial-recognition-dataset) by James Vincent diff --git a/site/content/pages/datasets/brainwash/index.md b/site/content/pages/datasets/brainwash/index.md index 47c41fd7..e6217a18 100644 --- a/site/content/pages/datasets/brainwash/index.md +++ b/site/content/pages/datasets/brainwash/index.md @@ -1,9 +1,9 @@ ------------ status: published -title: Brainwash -desc: Brainwash is a dataset of webcam images taken from the Brainwash Cafe in San Francisco in 2014 -subdesc: The Brainwash dataset includes 11,918 images of "everyday life of a busy downtown cafe" and is used for training head detection surveillance algorithms +title: Brainwash Dataset +desc: Brainwash is a dataset of webcam images taken from the Brainwash Cafe in San Francisco +subdesc: The Brainwash dataset includes 11,917 images of "everyday life of a busy downtown cafe" and is used for training head detection surveillance algorithms slug: brainwash cssclass: dataset image: assets/background.jpg @@ -19,23 +19,25 @@ authors: Adam Harvey ### sidebar ### end sidebar -Brainwash is a dataset 
of livecam images taken from San Francisco's Brainwash Cafe. It includes 11,918 images of "everyday life of a busy downtown cafe"[^readme] captured at 100 second intervals throughout the entire day. The Brainwash dataset includes 3 full days of webcam images taken on October 27, November 13, and November 24 in 2014. According the author's [research paper](https://www.semanticscholar.org/paper/End-to-End-People-Detection-in-Crowded-Scenes-Stewart-Andriluka/1bd1645a629f1b612960ab9bba276afd4cf7c666) introducing the dataset, the images were acquired with the help of Angelcam.com[^end_to_end] +Brainwash is a dataset of livecam images taken from San Francisco's Brainwash Cafe. It includes 11,917 images of "everyday life of a busy downtown cafe"[^readme] captured at 100 second intervals throughout the entire day. The Brainwash dataset includes 3 full days of webcam images taken on October 27, November 13, and November 24 in 2014. According the author's [research paper](https://www.semanticscholar.org/paper/End-to-End-People-Detection-in-Crowded-Scenes-Stewart-Andriluka/1bd1645a629f1b612960ab9bba276afd4cf7c666) introducing the dataset, the images were acquired with the help of Angelcam.com. [^end_to_end] The Brainwash dataset is unique because it uses images from a publicly available webcam that records people inside a privately owned business without any consent. No ordinary cafe customer would ever suspect that their image would end up in dataset used for surveillance research and development, but that is exactly what happened to customers at Brainwash cafe in San Francisco. 
-Although Brainwash appears to be a less popular dataset, it was notably used in 2016 and 2017 by researchers affiliated with the National University of Defense Technology in China for two [research](https://www.semanticscholar.org/paper/Localized-region-context-and-object-feature-fusion-Li-Dou/b02d31c640b0a31fb18c4f170d841d8e21ffb66c) [projects](https://www.semanticscholar.org/paper/A-Replacement-Algorithm-of-Non-Maximum-Suppression-Zhao-Wang/591a4bfa6380c9fcd5f3ae690e3ac5c09b7bf37b) on advancing the capabilities of object detection to more accurately isolate the target region in an image. [^localized_region_context] [^replacement_algorithm]. The [National University of Defense Technology](https://en.wikipedia.org/wiki/National_University_of_Defense_Technology) is controlled by China's top military body, the Central Military Commission. +Although Brainwash appears to be a less popular dataset, it was notably used in 2016 and 2017 by researchers affiliated with the National University of Defense Technology in China for two [research](https://www.semanticscholar.org/paper/Localized-region-context-and-object-feature-fusion-Li-Dou/b02d31c640b0a31fb18c4f170d841d8e21ffb66c) [projects](https://www.semanticscholar.org/paper/A-Replacement-Algorithm-of-Non-Maximum-Suppression-Zhao-Wang/591a4bfa6380c9fcd5f3ae690e3ac5c09b7bf37b) on advancing the capabilities of object detection to more accurately isolate the target region in an image. [^localized_region_context] [^replacement_algorithm] The [National University of Defense Technology](https://en.wikipedia.org/wiki/National_University_of_Defense_Technology) is controlled by China's top military body, the Central Military Commission. The dataset also appears in a 2017 [research paper](https://ieeexplore.ieee.org/document/7877809) from Peking University for the purpose of improving surveillance capabilities for "people detection in the crowded scenes". 
- + + + + {% include 'dashboard.html' %} {% include 'supplementary_header.html' %} - - + {% include 'cite_our_work.html' %} diff --git a/site/content/pages/datasets/duke_mtmc/index.md b/site/content/pages/datasets/duke_mtmc/index.md index 69de167b..928c79fa 100644 --- a/site/content/pages/datasets/duke_mtmc/index.md +++ b/site/content/pages/datasets/duke_mtmc/index.md @@ -1,14 +1,14 @@ ------------ status: published -title: Duke MTMC +title: Duke MTMC Dataset desc: <span class="dataset-name">Duke MTMC</span> is a dataset of surveillance camera footage of students on Duke University campus subdesc: Duke MTMC contains over 2 million video frames and 2,700 unique identities collected from 8 HD cameras at Duke University campus in March 2014 slug: duke_mtmc cssclass: dataset image: assets/background.jpg published: 2019-4-18 -updated: 2019-4-18 +updated: 2019-05-22 authors: Adam Harvey ------------ @@ -20,13 +20,13 @@ authors: Adam Harvey Duke MTMC (Multi-Target, Multi-Camera) is a dataset of surveillance video footage taken on Duke University's campus in 2014 and is used for research and development of video tracking systems, person re-identification, and low-resolution facial recognition. The dataset contains over 14 hours of synchronized surveillance video from 8 cameras at 1080p and 60 FPS, with over 2 million frames of 2,000 students walking to and from classes. The 8 surveillance cameras deployed on campus were specifically setup to capture students "during periods between lectures, when pedestrian traffic is heavy"[^duke_mtmc_orig]. -In this investigation into the Duke MTMC dataset we tracked down over 100 publicly available research papers that explicitly acknowledged using Duke MTMC. Our analysis shows that the dataset has spread far beyond its origins and intentions in academic research projects at Duke University. Since its publication in 2016, more than twice as many research citations originated in China as in the United States. 
Among these citations were papers with explicit and direct links to the Chinese military and several of the companies known to provide Chinese authorities with the oppressive surveillance technology used to monitor millions of Uighur Muslims. +In this investigation into the Duke MTMC dataset we tracked down over 100 publicly available research papers that explicitly acknowledged using Duke MTMC. Our analysis shows that the dataset has spread far beyond its origins and intentions in academic research projects at Duke University. Since its publication in 2016, more than twice as many research citations originated in China as in the United States. Among these citations were papers links to the Chinese military and several of the companies known to provide Chinese authorities with the oppressive surveillance technology used to monitor millions of Uighur Muslims. In one 2018 [paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Xu_Attention-Aware_Compositional_Network_CVPR_2018_paper.pdf) jointly published by researchers from SenseNets and SenseTime (and funded by SenseTime Group Limited) entitled [Attention-Aware Compositional Network for Person Re-identification](https://www.semanticscholar.org/paper/Attention-Aware-Compositional-Network-for-Person-Xu-Zhao/14ce502bc19b225466126b256511f9c05cadcb6e), the Duke MTMC dataset was used for "extensive experiments" on improving person re-identification across multiple surveillance cameras with important applications in "finding missing elderly and children, and suspect tracking, etc." Both SenseNets and SenseTime have been directly linked to the providing surveillance technology to monitor Uighur Muslims in China. 
[^xinjiang_nyt][^sensetime_qz][^sensenets_uyghurs]  -Despite [repeated](https://www.hrw.org/news/2017/11/19/china-police-big-data-systems-violate-privacy-target-dissent) [warnings](https://www.hrw.org/news/2018/02/26/china-big-data-fuels-crackdown-minority-region) by Human Rights Watch that the authoritarian surveillance used in China represents a violation of human rights, researchers at Duke University continued to provide open access to their dataset for anyone to use for any project. As the surveillance crisis in China grew, so did the number of citations with links to organizations complicit in the crisis. In 2018 alone there were over 70 research projects happening in China that publicly acknowledged benefiting from the Duke MTMC dataset. Amongst these were projects from SenseNets, SenseTime, CloudWalk, Megvii, Beihang University, and the PLA's National University of Defense Technology. +Despite [repeated](https://www.hrw.org/news/2017/11/19/china-police-big-data-systems-violate-privacy-target-dissent) [warnings](https://www.hrw.org/news/2018/02/26/china-big-data-fuels-crackdown-minority-region) by Human Rights Watch that the authoritarian surveillance used in China represents humanitarian crisis, researchers at Duke University continued to provide open access to their dataset for anyone to use for any project. As the surveillance crisis in China grew, so did the number of citations with links to organizations complicit in the crisis. In 2018 alone there were over 90 research projects happening in China that publicly acknowledged using and benefiting from the Duke MTMC dataset. Amongst these were projects from CloudWalk, Hikvision, Megvii (Face++), SenseNets, SenseTime, Beihang University, and the PLA's National University of Defense Technology. 
| Organization | Paper | Link | Year | Used Duke MTMC | |---|---|---|---| @@ -34,6 +34,7 @@ Despite [repeated](https://www.hrw.org/news/2017/11/19/china-police-big-data-sys | Beihang University | Online Inter-Camera Trajectory Association Exploiting Person Re-Identification and Camera Topology | [acm.org](https://dl.acm.org/citation.cfm?id=3240663) | 2018 | ✔ | | CloudWalk | CloudWalk re-identification technology extends facial biometric tracking with improved accuracy | [BiometricUpdate.com](https://www.biometricupdate.com/201903/cloudwalk-re-identification-technology-extends-facial-biometric-tracking-with-improved-accuracy) | 2019 | ✔ | |CloudWalk| Horizontal Pyramid Matching for Person Re-identification | [arxiv.org](https://arxiv.org/pdf/1804.05275.pdf) | 2018 | ✔ | +| Hikvision | Learning Incremental Triplet Margin for Person Re-identification | [arxiv.org](https://arxiv.org/abs/1812.06576) | 2018 | ✔ | | Megvii | Person Re-Identification (slides) | [github.io](https://zsc.github.io/megvii-pku-dl-course/slides/Lecture%2011,%20Human%20Understanding_%20ReID%20and%20Pose%20and%20Attributes%20and%20Activity%20.pdf) | 2017 | ✔ | | Megvii | Multi-Target, Multi-Camera Tracking by Hierarchical Clustering: Recent Progress on DukeMTMC Project | [SemanticScholar](https://www.semanticscholar.org/paper/Multi-Target%2C-Multi-Camera-Tracking-by-Hierarchical-Zhang-Wu/10c20cf47d61063032dce4af73a4b8e350bf1128) | 2018 | ✔ | | Megvii | SCPNet: Spatial-Channel Parallelism Network for Joint Holistic and Partial PersonRe-Identification | [arxiv.org](https://arxiv.org/abs/1810.06996) | 2018 | ✔ | diff --git a/site/content/pages/datasets/ijb_c/index.md b/site/content/pages/datasets/ijb_c/index.md index 0671252b..d1ac769b 100644 --- a/site/content/pages/datasets/ijb_c/index.md +++ b/site/content/pages/datasets/ijb_c/index.md @@ -88,6 +88,15 @@ The first 777 are non-alphabetical. 
From 777-3531 is alphabetical  +## Research notes + +From original papers: https://noblis.org/wp-content/uploads/2018/03/icb2018.pdf + +Collection for the dataset began by identifying CreativeCommons subject videos, which are often more scarce thanCreative Commons subject images. Search terms that re-sulted in large quantities of person-centric videos (e.g. “in-terview”) were generated and translated into numerous lan-guages including Arabic, Korean, Swahili, and Hindi to in-crease diversity of the subject pool. Certain YouTube userswho upload well-labeled, person-centric videos, such as the World Economic Forum and the International University Sports Federation were also identified. Titles of videos per-taining to these search terms and usernames were scrapedusing the YouTube Data API and translated into English us-ing the Yandex Translate API4. Pattern matching was per-formed to extract potential names of subjects from the trans-lated titles, and these names were searched using the Wiki-data API to verify the subject’s existence and status as a public figure, and to check for Wikimedia Commons im-agery. Age, gender, and geographic region were collectedusing the Wikipedia API.Using the candidate subject names, Creative Commonsimages were scraped from Google and Wikimedia Com-mons, and Creative Commons videos were scraped fromYouTube. After images and videos of the candidate subjectwere identified, AMT Workers were tasked with validat-ing the subject’s presence throughout the video. 
The AMTWorkers marked segments of the video in which the subjectwas present, and key frames + + +IARPA funds Italian researcher https://www.micc.unifi.it/projects/glaivejanus/ + {% include 'dashboard.html' %} {% include 'supplementary_header.html' %} diff --git a/site/content/pages/datasets/index.md b/site/content/pages/datasets/index.md index 95c96e7f..2c7def38 100644 --- a/site/content/pages/datasets/index.md +++ b/site/content/pages/datasets/index.md @@ -1,11 +1,11 @@ ------------ status: published -title: MegaPixels: Datasets +title: MegaPixels: Face Recognition Datasets desc: Facial Recognition Datasets slug: home published: 2018-12-15 -updated: 2018-12-15 +updated: 2019-04-24 authors: Adam Harvey sync: false @@ -13,4 +13,4 @@ sync: false # Face Recognition Datasets -Explore face recognition datasets contributing the growing crisis of authoritarian biometric surveillance technologies. This first group of datasets focuses usage connected to foreign surveillance companies and defense organizations. +Explore face recognition datasets contributing to the growing crisis of authoritarian biometric surveillance technologies. This first group of 5 datasets focuses on image usage connected to foreign surveillance and defense organizations. diff --git a/site/content/pages/datasets/msceleb/assets/notes.md b/site/content/pages/datasets/msceleb/assets/notes.md new file mode 100644 index 00000000..0d8900d1 --- /dev/null +++ b/site/content/pages/datasets/msceleb/assets/notes.md @@ -0,0 +1,3 @@ +## Derivative Datasets + +- Racial Faces in the Wild http://whdeng.cn/RFW/index.html
\ No newline at end of file diff --git a/site/content/pages/datasets/msceleb/index.md b/site/content/pages/datasets/msceleb/index.md index 3d5c6c59..8623767b 100644 --- a/site/content/pages/datasets/msceleb/index.md +++ b/site/content/pages/datasets/msceleb/index.md @@ -1,9 +1,9 @@ ------------ status: published -title: Microsoft Celeb -desc: Microsoft Celeb 1M is a target list and dataset of web images used for research and development of face recognition -subdesc: The MS Celeb dataset includes over 10 million images of about 100K people and a target list of 1 million individuals +title: Microsoft Celeb Dataset +desc: Microsoft Celeb 1M is a dataset of 10 million face images harvested from the Internet +subdesc: The MS Celeb dataset includes 100,000 people and a target list of 1,000,000 individuals slug: msceleb cssclass: dataset image: assets/background.jpg @@ -19,67 +19,70 @@ authors: Adam Harvey ### sidebar ### end sidebar -Microsoft Celeb (MS Celeb) is a dataset of 10 million face images scraped from the Internet and used for research and development of large-scale biometric recognition systems. According to Microsoft Research, who created and published the [dataset](https://www.microsoft.com/en-us/research/publication/ms-celeb-1m-dataset-benchmark-large-scale-face-recognition-2/) in 2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of nearly 100,000 individuals. Microsoft's goal in building this dataset was to distribute an initial training dataset of 100,000 individuals' images, and to use this dataset to accelerate research into recognizing a larger target list of one million people "using all the possibly collected face images of this individual on the web as training data".[^msceleb_orig] +Microsoft Celeb (MS Celeb) is a dataset of 10 million face images scraped from the Internet and used for research and development of large-scale biometric recognition systems. 
According to Microsoft Research, who created and published the [dataset](https://www.microsoft.com/en-us/research/publication/ms-celeb-1m-dataset-benchmark-large-scale-face-recognition-2/) in 2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of nearly 100,000 individuals. Microsoft's goal in building this dataset was to distribute an initial training dataset of 100,000 individuals' images to accelerate research into recognizing a larger target list of one million people "using all the possibly collected face images of this individual on the web as training data".[^msceleb_orig] -These one million people, defined by Microsoft Research as "celebrities", are often merely people who must maintain an online presence for their professional lives. Microsoft's list of 1 million people is an expansive exploitation of the current reality that for many people, including academics, policy makers, writers, artists, and especially journalists, maintaining an online presence is mandatory. This fact should not allow Microsoft or anyone else to use their biometrics for research and development of surveillance technology. Many names in the target list even include people critical of the very technology Microsoft is using their name and biometric information to build. The list includes digital rights activists like Jillian York; artists critical of surveillance including Trevor Paglen, Jill Magid, and Aram Bartholl; Intercept founders Laura Poitras, Jeremy Scahill, and Glenn Greenwald; Data and Society founder danah boyd; and even Julie Brill, the former FTC commissioner responsible for protecting consumer privacy, to name a few. +These one million people, defined by Microsoft Research as "celebrities", are often merely people who must maintain an online presence for their professional lives. 
Microsoft's list of 1 million people is an expansive exploitation of the current reality that for many people, including academics, policy makers, writers, artists, activists, and journalists, maintaining an online presence is mandatory. This fact should not allow Microsoft or anyone else to use their biometrics for research and development of surveillance technology. Many names in the target list even include people critical of the very technology Microsoft is using their name and biometric information to build. The list includes digital rights activists like Jillian York; artists critical of surveillance including Trevor Paglen, Jill Magid, and Aram Bartholl; Intercept founders Laura Poitras, Jeremy Scahill, and Glenn Greenwald; Data and Society founder danah boyd; and even Julie Brill, the former FTC commissioner responsible for protecting consumer privacy, to name only 9 out of 1 million. ### Microsoft's 1 Million Target List -Below is a selection of names from the full target list, curated to illustrate Microsoft's expansive and exploitative practice of scraping the Internet for biometric training data. The entire name file can be downloaded from [msceleb.org](https://www.msceleb.org). You can email <a href="mailto:msceleb@microsoft.com?subject=MS-Celeb-1M Removal Request&body=Dear%20Microsoft%2C%0A%0AI%20recently%20discovered%20that%20you%20use%20my%20identity%20for%20commercial%20use%20in%20your%20MS-Celeb-1M%20dataset%20used%20for%20research%20and%20development%20of%20face%20recognition.%20I%20do%20not%20wish%20to%20be%20included%20in%20your%20dataset%20in%20any%20format.%20%0A%0APlease%20remove%20my%20name%20and%2For%20any%20associated%20images%20immediately%20and%20send%20a%20confirmation%20once%20you've%20updated%20your%20%22Top1M_MidList.Name.tsv%22%20file.%0A%0AThanks%20for%20promptly%20handing%20this%2C%0A%5B%20your%20name%20%5D">msceleb@microsoft.com</a> to have your name removed. 
Names appearing with * indicate that Microsoft also distributed images. +Below is a selection of 24 names from the full target list, curated to illustrate Microsoft's expansive and exploitative practice of scraping the Internet for biometric training data. The entire name file can be downloaded from [msceleb.org](https://www.msceleb.org). You can email <a href="mailto:msceleb@microsoft.com?subject=MS-Celeb-1M Removal Request&body=Dear%20Microsoft%2C%0A%0AI%20recently%20discovered%20that%20you%20use%20my%20identity%20for%20commercial%20use%20in%20your%20MS-Celeb-1M%20dataset%20used%20for%20research%20and%20development%20of%20face%20recognition.%20I%20do%20not%20wish%20to%20be%20included%20in%20your%20dataset%20in%20any%20format.%20%0A%0APlease%20remove%20my%20name%20and%2For%20any%20associated%20images%20immediately%20and%20send%20a%20confirmation%20once%20you've%20updated%20your%20%22Top1M_MidList.Name.tsv%22%20file.%0A%0AThanks%20for%20promptly%20handing%20this%2C%0A%5B%20your%20name%20%5D">msceleb@microsoft.com</a> to have your name removed. Subjects whose images were distributed by Microsoft are indicated with the total image count. No number indicates the name exists only in the target list. 
=== columns 2 -| Name | Profession | +| Name (images) | Profession | | --- | --- | --- | | Adrian Chen | Journalist | -| Ai Weiwei* | Artist | -| Aram Bartholl | Internet artist | +| Ai Weiwei (220) | Artist, activist | +| Aram Bartholl | Conceptual artist | | Astra Taylor | Author, director, activist | -| Alexander Madrigal | Journalist | -| Bruce Schneier* | Cryptologist | +| Bruce Schneier (107) | Cryptologist | +| Cory Doctorow (104) | Blogger, journalist | | danah boyd | Data & Society founder | | Edward Felten | Former FTC Chief Technologist | -| Evgeny Morozov* | Tech writer, researcher | -| Glenn Greenwald* | Journalist, author | +| Evgeny Morozov (108) | Tech writer, researcher | +| Glenn Greenwald (86) | Journalist, author | | Hito Steyerl | Artist, writer | +| James Risen | Journalist | -=== +==== -| Name | Profession | +| Name (images) | Profession | | --- | --- | --- | -| James Risen | Journalist | -| Jeremy Scahill* | Journalist | +| Jeremy Scahill (200) | Journalist | | Jill Magid | Artist | | Jillian York | Digital rights activist | | Jonathan Zittrain | EFF board member | | Julie Brill | Former FTC Commissioner| | Kim Zetter | Journalist, author | -| Laura Poitras* | Filmmaker | +| Laura Poitras (104) | Filmmaker | | Luke DuBois | Artist | +| Michael Anti | Political blogger | +| Manal al-Sharif (101) | Women's rights activist | | Shoshana Zuboff | Author, academic | | Trevor Paglen | Artist, researcher | === end columns -After publishing this list, researchers from Microsoft Asia then worked with researchers affiliated with China's National University of Defense Technology (controlled by China's Central Military Commission) and used the the MS Celeb dataset for their [research paper](https://www.semanticscholar.org/paper/Faces-as-Lighting-Probes-via-Unsupervised-Deep-Yi-Zhu/b301fd2fc33f24d6f75224e7c0991f4f04b64a65) on using "Faces as Lighting Probes via Unsupervised Deep Highlight Extraction" with potential applications in 3D face recognition. 
-In an [article](https://www.ft.com/content/9378e7ee-5ae6-11e9-9dde-7aedca0a081a) published by Financial Times based on data surfaced during this investigation, Samm Sacks (a senior fellow at the New America think tank) commented that this research raised "red flags because of the nature of the technology, the author's affiliations, combined with what we know about how this technology is being deployed in China right now". Adding, that "the [Chinese] government is using these technologies to build surveillance systems and to detain minorities [in Xinjiang]".[^madhu_ft] +After publishing this list, researchers affiliated with Microsoft Asia then worked with researchers affiliated with China's [National University of Defense Technology](https://en.wikipedia.org/wiki/National_University_of_Defense_Technology) (controlled by China's Central Military Commission) and used the MS Celeb image dataset for their research paper on using "[Faces as Lighting Probes via Unsupervised Deep Highlight Extraction](https://www.semanticscholar.org/paper/Faces-as-Lighting-Probes-via-Unsupervised-Deep-Yi-Zhu/b301fd2fc33f24d6f75224e7c0991f4f04b64a65)" with potential applications in 3D face recognition. + +In an April 10, 2019 [article](https://www.ft.com/content/9378e7ee-5ae6-11e9-9dde-7aedca0a081a) published by Financial Times based on data surfaced during this investigation, Samm Sacks (a senior fellow at the New America think tank) commented that this research raised "red flags because of the nature of the technology, the author's affiliations, combined with what we know about how this technology is being deployed in China right now". Adding, that "the [Chinese] government is using these technologies to build surveillance systems and to detain minorities [in Xinjiang]".[^madhu_ft] -Four more papers published by SenseTime, which also use the MS Celeb dataset, raise similar flags. 
SenseTime is a computer vision surveillance company that until [April 2019](https://uhrp.org/news-commentary/china%E2%80%99s-sensetime-sells-out-xinjiang-security-joint-venture) provided surveillance to Chinese authorities to monitor and track Uighur Muslims in Xinjiang province, and had been [flagged](https://www.nytimes.com/2019/04/14/technology/china-surveillance-artificial-intelligence-racial-profiling.html) numerous times as having potential links to human rights violations. +Four more papers published by SenseTime that also use the MS Celeb dataset raise similar flags. SenseTime is a computer vision surveillance company that until [April 2019](https://uhrp.org/news-commentary/china%E2%80%99s-sensetime-sells-out-xinjiang-security-joint-venture) provided surveillance to Chinese authorities to monitor and track Uighur Muslims in Xinjiang province, and had been [flagged](https://www.nytimes.com/2019/04/14/technology/china-surveillance-artificial-intelligence-racial-profiling.html) numerous times as having potential links to human rights violations. One of the 4 SenseTime papers, "[Exploring Disentangled Feature Representation Beyond Face Identification](https://www.semanticscholar.org/paper/Exploring-Disentangled-Feature-Representation-Face-Liu-Wei/1fd5d08394a3278ef0a89639e9bfec7cb482e0bf)", shows how SenseTime was developing automated face analysis technology to infer race, narrow eyes, nose size, and chin size, all of which could be used to target vulnerable ethnic groups based on their facial appearances. -Earlier in 2019, Microsoft President and Chief Legal Officer [Brad Smith](https://blogs.microsoft.com/on-the-issues/2018/12/06/facial-recognition-its-time-for-action/) called for the governmental regulation of face recognition, citing the potential for misuse, a rare admission that Microsoft's surveillance-driven business model had lost its bearing. 
More recently Smith also [announced](https://www.reuters.com/article/us-microsoft-ai/microsoft-turned-down-facial-recognition-sales-on-human-rights-concerns-idUSKCN1RS2FV) that Microsoft would seemingly take a stand against such potential misuse, and had decided to not sell face recognition to an unnamed United States agency, citing a lack of accuracy. The software was not suitable to be used on minorities, because it was trained mostly on white male faces. +Earlier in 2019, Microsoft President and Chief Legal Officer [Brad Smith](https://blogs.microsoft.com/on-the-issues/2018/12/06/facial-recognition-its-time-for-action/) called for the governmental regulation of face recognition, citing the potential for misuse, a rare admission that Microsoft's surveillance-driven business model had lost its bearing. More recently Smith also [announced](https://www.reuters.com/article/us-microsoft-ai/microsoft-turned-down-facial-recognition-sales-on-human-rights-concerns-idUSKCN1RS2FV) that Microsoft would seemingly take a stand against such potential misuse, and had decided to not sell face recognition to an unnamed United States agency, citing a lack of accuracy. In effect, Microsoft's face recognition software was not suitable to be used on minorities because it was trained mostly on white male faces. What the decision to block the sale announces is not so much that Microsoft had upgraded their ethics, but that Microsoft publicly acknowledged it can't sell a data-driven product without data. In other words, Microsoft can't sell face recognition for faces they can't train on. -Until now, that data has been freely harvested from the Internet and packaged in training sets like MS Celeb, which are overwhelmingly [white](https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html) and [male](https://gendershades.org). Without balanced data, facial recognition contains blind spots. 
And without datasets like MS Celeb, the powerful yet inaccurate facial recognition services like Microsoft's Azure Cognitive Service also would not be able to see at all. +Until now, that data has been freely harvested from the Internet and packaged in training sets like MS Celeb, which are overwhelmingly [white](https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html) and [male](https://gendershades.org). Without balanced data, facial recognition contains blind spots. And without datasets like MS Celeb, the powerful yet inaccurate facial recognition services like Microsoft's Azure Cognitive Service might not exist at all. -Microsoft didn't only create MS Celeb for other researchers to use, they also used it internally. In a publicly available 2017 Microsoft Research project called "[One-shot Face Recognition by Promoting Underrepresented Classes](https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/)," Microsoft leveraged the MS Celeb dataset to analyze their algorithms and advertise the results. Interestingly, Microsoft's [corporate version](https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/) of the paper does not mention they used the MS Celeb datset, but the [open-access version](https://www.semanticscholar.org/paper/One-shot-Face-Recognition-by-Promoting-Classes-Guo/6cacda04a541d251e8221d70ac61fda88fb61a70) published on arxiv.org explicitly mentions that Microsoft Research introspected their algorithms "on the MS-Celeb-1M low-shot learning benchmark task." +Microsoft didn't only create MS Celeb for other researchers to use, they also used it internally. 
In a publicly available 2017 Microsoft Research project called "[One-shot Face Recognition by Promoting Underrepresented Classes](https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/)," Microsoft leveraged the MS Celeb dataset to build their algorithms and advertise the results. Interestingly, Microsoft's [corporate version](https://www.microsoft.com/en-us/research/publication/one-shot-face-recognition-promoting-underrepresented-classes/) of the paper does not mention they used the MS Celeb dataset, but the [open-access version](https://www.semanticscholar.org/paper/One-shot-Face-Recognition-by-Promoting-Classes-Guo/6cacda04a541d251e8221d70ac61fda88fb61a70) published on arxiv.org explicitly mentions that Microsoft Research introspected their algorithms "on the MS-Celeb-1M low-shot learning benchmark task." -We suggest that if Microsoft Research wants to make biometric data publicly available for surveillance research and development, they should start with releasing their researchers' own biometric data, instead of scraping the Internet for journalists, artists, writers, actors, athletes, musicians, and academics. +If Microsoft Research wants to make biometric data publicly available for surveillance research and development, perhaps they should start with releasing their employees' own biometric data instead of scraping the Internet for journalists, artists, writers, actors, athletes, musicians, and academics. A publicly available face recognition dataset of all Microsoft Research employees would be a welcome replacement. 
{% include 'dashboard.html' %} diff --git a/site/content/pages/datasets/oxford_town_centre/index.md b/site/content/pages/datasets/oxford_town_centre/index.md index fbabcce5..bd340113 100644 --- a/site/content/pages/datasets/oxford_town_centre/index.md +++ b/site/content/pages/datasets/oxford_town_centre/index.md @@ -1,7 +1,7 @@ ------------ status: published -title: Oxford Town Centre +title: Oxford Town Centre Dataset desc: Oxford Town Centre is a dataset of surveillance camera footage from Cornmarket St Oxford, England subdesc: The Oxford Town Centre dataset includes approximately 2,200 identities and is used for research and development of face recognition systems slug: oxford_town_centre diff --git a/site/content/pages/datasets/uccs/assets/notes.md b/site/content/pages/datasets/uccs/assets/notes.md new file mode 100644 index 00000000..d248573d --- /dev/null +++ b/site/content/pages/datasets/uccs/assets/notes.md @@ -0,0 +1,5 @@ + +## Additional papers that used UCCS + +- https://verify.megapixels.cc/paper/megaface/verify/841855205818d3a6d6f85ec17a22515f4f062882 +- "we use the database subset that has assigned identities (180 identities total)." 
diff --git a/site/content/pages/datasets/uccs/index.md b/site/content/pages/datasets/uccs/index.md index 0850bd99..55b48a07 100644 --- a/site/content/pages/datasets/uccs/index.md +++ b/site/content/pages/datasets/uccs/index.md @@ -1,7 +1,7 @@ ------------ status: published -title: UnConstrained College Students +title: UnConstrained College Students Dataset slug: uccs desc: <span class="dataset-name">UnConstrained College Students</span> is a dataset of long-range surveillance photos of students on University of Colorado in Colorado Springs campus subdesc: The UnConstrained College Students dataset includes 16,149 images of 1,732 students, faculty, and pedestrians and is used for developing face recognition and face detection algorithms diff --git a/site/content/pages/research/00_introduction/index.md b/site/content/pages/research/00_introduction/index.md index 477679d4..ad8e2200 100644 --- a/site/content/pages/research/00_introduction/index.md +++ b/site/content/pages/research/00_introduction/index.md @@ -32,6 +32,17 @@ There is only biased feature vector clustering and probabilistic thresholding. Yesterday's [decision](https://www.reuters.com/article/us-microsoft-ai/microsoft-turned-down-facial-recognition-sales-on-human-rights-concerns-idUSKCN1RS2FV) by Brad Smith, CEO of Microsoft, to not sell facial recognition to a US law enforcement agency is not an about face by Microsoft to become more humane, it's simply a perfect illustration of the value of training data. Without data, you don't have a product to sell. 
Microsoft realized that it doesn't have enough training data to sell +## Cost of Faces + +Univ Houston paid subjects $20/ea +http://web.archive.org/web/20170925053724/http://cbl.uh.edu/index.php/pages/research/collecting_facial_images_from_multiples_in_texas + +FaceMeta facedataset.com + +- BASIC: 15,000 images for $6,000 USD +- RECOMMENDED: 50,000 images for $12,000 USD +- ADVANCED: 100,000 images for $18,000 USD* + ## Use Your Own Biometrics First diff --git a/site/content/pages/research/01_from_1_to_100_pixels/index.md b/site/content/pages/research/01_from_1_to_100_pixels/index.md index b219dffb..ddffdf91 100644 --- a/site/content/pages/research/01_from_1_to_100_pixels/index.md +++ b/site/content/pages/research/01_from_1_to_100_pixels/index.md @@ -40,6 +40,11 @@ What can you know from a very small amount of information? - 100x100 all you need for medical diagnosis - 100x100 0.5% of one Instagram photo + +Notes: + +- Google FaceNet used images with (face?) sizes: Input sizes range from 96x96 pixels to 224x224pixels in our experiments. FaceNet: A Unified Embedding for Face Recognition and Clustering https://arxiv.org/pdf/1503.03832.pdf + Ideas: - Find specific cases of facial resolution being used in legal cases, forensic investigations, or military footage |
