MegaPixels is an art and research project by Adam Harvey about the origins and ethics of facial analysis datasets. Where do they come from? Who's included? Who created them and for what reason?
-
MegaPixels sets out to answer these questions and reveal the stories behind the millions of images used to train, evaluate, and power the facial recognition surveillance algorithms used today. MegaPixels is authored by Adam Harvey, developed in collaboration with Jules LaPlace, and produced in partnership with Mozilla.
-
MegaPixels sets out to answer these questions and reveal the stories behind the millions of images used to train, evaluate, and power the facial recognition surveillance algorithms used today. MegaPixels is authored by Adam Harvey, developed in collaboration with Jules LaPlace, and produced in partnership with Mozilla.
-
Notes
-
-
critical but informative
-
not anti-dataset
-
pro-sharing, pro-public dataset
-
w/o data
-
not generally anti-researcher, their parent organization should have checks in place to prevent dubious dataset collection methods
-
-
-
Adam Harvey is an American artist and researcher based in Berlin. His previous projects (CV Dazzle, Stealth Wear, and SkyLift) explore the potential for countersurveillance as artwork. He is the founder of VFRAME (visual forensics software for human rights groups), the recipient of 2 PrototypeFund awards, and is currently a researcher in residence at Karlsruhe HfG studying artificial intelligence and datasets.
-
Jules LaPlace is an American artist and technologist also based in Berlin. He was previously the CTO of a NYC digital agency and currently works at VFRAME, developing computer vision for human rights groups, and building creative software for artists.
-
Partnership
-
MegaPixels is produced in partnership with Mozilla, a free software community founded in 1998 by members of Netscape. The Mozilla community uses, develops, spreads and supports Mozilla products, thereby promoting exclusively free software and open standards, with only minor exceptions. The community is supported institutionally by the not-for-profit Mozilla Foundation and its tax-paying subsidiary, the Mozilla Corporation.
-
+
Ever since government agencies began developing face recognition in the early 1960s, datasets of face images have been central to the development and evaluation of their algorithms. Today, these datasets no longer originate in labs, but instead from family photo albums posted on Flickr, CCTV cameras on college campuses, livestreams at cafes, search engine queries for celebrities, or videos on YouTube.
These datasets include many public figures (politicians, athletes, and actors) but also many non-public figures: digital activists, students, and pedestrians. Some of the images were originally published under Creative Commons licenses, but others were taken in unconstrained scenarios without anyone's awareness or consent. During the last year, hundreds of these datasets have been collected to understand how they contribute to a global data supply chain powering surveillance.
+
MegaPixels is art and research by Adam Harvey about publicly available facial recognition datasets that aims to unravel the stories behind these datasets.
+
[Mozilla](https://mozilla.org) has provided the funding to launch this site, research the datasets, and build tools to help you understand the role that these datasets have played in creating surveillance technologies. The MegaPixels site is developed by Jules LaPlace.
+
Team
+
Adam Harvey is a Berlin-based American artist and researcher. His previous projects (CV Dazzle, Stealth Wear, and SkyLift) explore the potential for countersurveillance as artwork. He is the founder of VFRAME (visual forensics software for human rights groups), the recipient of 2 PrototypeFund awards, and a researcher in residence at Karlsruhe HfG.
+
Jules LaPlace is an American creative technologist also based in Berlin. He was previously the CTO of a digital agency in NYC and now also works at VFRAME, developing computer vision for human rights groups. Jules also builds creative software for artists and musicians.
+
+
Who used Brainwash Dataset?
+
+
+ This bar chart presents a ranking of the top countries where citations originated. Mouse over individual columns
+ to see yearly totals. Colors are only assigned to the top 10 overall countries.
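The ranking this chart describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `citations` records, the function names, and the `top_n` parameter are hypothetical and do not come from the site's code. The sketch counts citations per country, sorts them, and marks the top 10 as the ones that would receive a color.

```python
from collections import Counter

# Hypothetical sample: one (country, year) pair per geocoded citation.
citations = [
    ("US", 2016), ("US", 2017), ("US", 2017),
    ("CN", 2017), ("CN", 2018),
    ("DE", 2018),
]

def rank_countries(records, top_n=10):
    """Return (country, total) pairs sorted by total, plus the set of top_n countries."""
    totals = Counter(country for country, _year in records)
    ranked = totals.most_common()
    highlighted = {country for country, _total in ranked[:top_n]}
    return ranked, highlighted

def yearly_totals(records, country):
    """Per-year citation counts for one country (the mouse-over values)."""
    counts = Counter(year for c, year in records if c == country)
    return dict(sorted(counts.items()))

ranked, highlighted = rank_countries(citations)
print(ranked)                          # [('US', 3), ('CN', 2), ('DE', 1)]
print(yearly_totals(citations, "US"))  # {2016: 1, 2017: 2}
```

Countries outside `highlighted` would still appear in the ranking; they would simply share a neutral color.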
+
+
+
+
+
+
+
@@ -77,6 +88,7 @@
Supplementary Information
+
Citations
Citations were collected from Semantic Scholar, a website which aggregates
@@ -87,18 +99,6 @@
-
-
Who used Brainwash Dataset?
-
-
- This bar chart presents a ranking of the top countries where citations originated. Mouse over individual columns
- to see yearly totals. Colors are only assigned to the top 10 overall countries.
-
- To understand how this dataset has been used around the world...
+ To understand how CelebA Dataset has been used around the world...
affected global research on computer vision, surveillance, defense, and consumer technology, and where this dataset has been used, the locations of each organization that used or referenced the dataset
@@ -76,6 +75,7 @@
Supplementary Information
+
Citations
Citations were collected from Semantic Scholar, a website which aggregates
diff --git a/site/public/datasets/cofw/index.html b/site/public/datasets/cofw/index.html
index 72e38eb6..8410559f 100644
--- a/site/public/datasets/cofw/index.html
+++ b/site/public/datasets/cofw/index.html
@@ -21,7 +21,6 @@
@@ -55,7 +54,7 @@ To increase the number of training images, and since COFW has the exact same la
-->
- To understand how this dataset has been used around the world...
+ To understand how COFW Dataset has been used around the world...
affected global research on computer vision, surveillance, defense, and consumer technology, and where this dataset has been used, the locations of each organization that used or referenced the dataset
@@ -86,6 +85,7 @@ To increase the number of training images, and since COFW has the exact same la
Supplementary Information
+
Citations
Citations were collected from Semantic Scholar, a website which aggregates
diff --git a/site/public/datasets/facebook/index.html b/site/public/datasets/facebook/index.html
index a9f1b225..7fb1901a 100644
--- a/site/public/datasets/facebook/index.html
+++ b/site/public/datasets/facebook/index.html
@@ -21,7 +21,6 @@
- To understand how this dataset has been used around the world...
+ To understand how LFW has been used around the world...
affected global research on computer vision, surveillance, defense, and consumer technology, and where this dataset has been used, the locations of each organization that used or referenced the dataset
@@ -80,6 +79,18 @@
The data is generated by collecting all citations for all original research papers associated with the dataset. The PDFs are then converted to text, and the organization names are extracted and geocoded. Because of the automated approach to extracting data, actual use of the dataset cannot yet be confirmed. This visualization is provided to help locate and confirm usage and will be updated as data noise is reduced.
+
Who used LFW?
+
+
+ This bar chart presents a ranking of the top countries where citations originated. Mouse over individual columns
+ to see yearly totals. Colors are only assigned to the top 10 overall countries.
+
+
+
+
+
+
+
@@ -89,6 +100,7 @@
Supplementary Information
+
Citations
Citations were collected from Semantic Scholar, a website which aggregates
@@ -99,18 +111,6 @@
-
-
Who used LFW?
-
-
- This bar chart presents a ranking of the top countries where citations originated. Mouse over individual columns
- to see yearly totals. Colors are only assigned to the top 10 overall countries.
-
-
-
-
-
-
Commercial Use
Add a paragraph about how usage extends far beyond academia into research centers for the largest companies in the world, and even funnels into CIA-funded research in the US and defense industry usage in China.
If you are affected by disclosure of your identity in this dataset please do contact the authors. Many have stated that they are willing to remove images upon request. The authors of the LFW dataset provide the following email for inquiries:
-
You can use the following message to request removal from the dataset:
Subject: Request for Removal from LFW Face Dataset
-
Dear [researcher name],
-
I am writing to you about the "Labeled Faces in The Wild Dataset". Recently I discovered that your dataset includes my identity and I no longer wish to be included in your dataset.
-
The dataset is being used by thousands of companies around the world to improve facial recognition software, including usage by governments for the purpose of law enforcement, national security, tracking consumers in retail environments, and tracking individuals through public spaces.
-
My name as it appears in your dataset is [your name]. Please remove all images from your dataset and inform your newsletter subscribers to likewise update their copies.
- To understand how this dataset has been used around the world...
+ To understand how MARS has been used around the world...
affected global research on computer vision, surveillance, defense, and consumer technology, and where this dataset has been used, the locations of each organization that used or referenced the dataset
@@ -76,6 +75,7 @@
Supplementary Information
+
Citations
Citations were collected from Semantic Scholar, a website which aggregates
diff --git a/site/public/datasets/uccs/index.html b/site/public/datasets/uccs/index.html
index 21d1e6bb..0283bf3b 100644
--- a/site/public/datasets/uccs/index.html
+++ b/site/public/datasets/uccs/index.html
@@ -21,7 +21,6 @@
Regular Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Eighteen of the 5,749 people in the Labeled Faces in the Wild Dataset. The most widely used face dataset for benchmarking commercial face recognition algorithms.
Intro
-
Labeled Faces in The Wild (LFW) is among the most widely used facial recognition training datasets in the world and is the first of its kind to be created entirely from images posted online. The LFW dataset includes 13,233 images of 5,749 people that were collected between 2002 and 2004. Use the tools below to check if you were included in this dataset or scroll down to read the analysis.
-
Three paragraphs describing the LFW dataset in a format that can be easily replicated for the other datasets. Nothing too custom. An analysis of the initial research papers with context relative to all the other dataset papers.
-
From George W. Bush to Jamie Lee Curtis: all 5,749 people in the LFW Dataset sorted from most to least images collected.
LFW by the Numbers
-
-
Was first published in 2007
-
Developed out of a prior dataset from Berkeley called "Faces in the Wild" or "Names and Faces" [^lfw_original_paper]
-
Includes 13,233 images and 5,749 different people [^lfw_website]
-
There are about 3 men for every 1 woman (4,277 men and 1,472 women)[^lfw_website]
-
The person with the most images is George W. Bush with 530
-
Most people (70%) in the dataset have only 1 image
-
There are 1,680 people in the dataset with 2 or more images [^lfw_website]
-
Two of the four original authors received funding from the Office of the Director of National Intelligence and IARPA for their 2016 LFW survey follow-up report
-
The LFW dataset includes over 500 actors, 30 models, 10 presidents, 24 football players, 124 basketball players, 11 kings, and 2 queens
-
In all the LFW publications provided by the authors the words "ethics", "consent", and "privacy" appear 0 times [^lfw_original_paper], [^lfw_survey], [^lfw_tech_report] , [^lfw_website]
-
The word "future" appears 71 times
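The gender ratio and single-image share quoted above follow directly from the published counts. The short check below uses only the figures stated in this list; the variable names are illustrative.

```python
# Figures as stated in the list above.
total_people = 5749
men, women = 4277, 1472
people_with_multiple_images = 1680

# The gender counts should account for everyone in the dataset.
assert men + women == total_people

men_per_woman = men / women
single_image_people = total_people - people_with_multiple_images
single_image_share = single_image_people / total_people

print(f"{men_per_woman:.2f} men per woman")  # 2.91 men per woman
print(f"{single_image_people:,} people ({single_image_share:.1%}) have one image")
# 4,069 people (70.8%) have one image
```

Both results agree with the rounded statements in the list: roughly 3 men for every woman, and about 70% of people represented by a single image.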
-
-
Facts
-
-
Was created for the purpose of improving "unconstrained face recognition" [^lfw_original_paper]
-
All images in LFW were obtained "in the wild" meaning without any consent from the subject or from the photographer
-
The faces were detected using the Viola-Jones haarcascade face detector [^lfw_website] [^lfw_survey]
-
Is considered the "most popular benchmark for face recognition" [^lfw_baidu]
-
Is "the most widely used evaluation set in the field of facial recognition" [^lfw_pingan]
-
Is used by several of the largest tech companies in the world including "Google, Facebook, Microsoft Research Asia, Baidu, Tencent, SenseTime, Face++ and Chinese University of Hong Kong." [^lfw_pingan]
-
-
All images were copied from Yahoo News between 2002 - 2004 [^lfw_original_paper]
-
-
SenseTime, who has relied on LFW for benchmarking their facial recognition performance, is the leading provider of surveillance to the Chinese Government
-
-
former President George W. Bush
-
Colin Powell (236), Tony Blair (144), and Donald Rumsfeld (121)
People and Companies using the LFW Dataset
-
This section describes who is using the dataset and for what purposes. It should include specific examples of people or companies with citations and screenshots. This section is followed up by the graph, the map, and then the supplementary material.
-
The LFW dataset is used by numerous companies for benchmarking algorithms and in some cases training. According to the benchmarking results page [^lfw_results] provided by the authors, over 2 dozen companies have contributed their benchmark results.
-
According to BiometricUpdate.com [^lfw_pingan], LFW is "the most widely used evaluation set in the field of facial recognition, LFW attracts a few dozen teams from around the globe including Google, Facebook, Microsoft Research Asia, Baidu, Tencent, SenseTime, Face++ and Chinese University of Hong Kong."
-
According to researchers at the Baidu Research – Institute of Deep Learning "LFW has been the most popular evaluation benchmark for face recognition, and played a very important role in facilitating the face recognition society to improve algorithm. [^lfw_baidu]."
-
In addition to commercial use as an evaluation tool, all of the faces in the LFW dataset are prepackaged into a popular machine learning code framework called scikit-learn.
-
"PING AN Tech facial recognition receives high score in latest LFW test results"
-
"Face Recognition Performance in LFW benchmark"
-
"The 1st place in face verification challenge, LFW"
In benchmarking, companies use a dataset to evaluate their algorithms which are typically trained on other data. After training, researchers will use LFW as a benchmark to compare results with other algorithms.
-
For example, Baidu (est. net worth $13B) uses LFW to report results in their paper "Targeting Ultimate Accuracy: Face Recognition via Deep Embedding". According to the three Baidu researchers who produced the paper:
-
Citations
-
Overall, LFW has at least 116 citations from 11 countries.
-
Conclusion
-
The LFW face recognition training and evaluation dataset is a historically important face dataset as it was the first popular dataset to be created entirely from Internet images, paving the way for a global trend towards downloading anyone’s face from the Internet and adding it to a dataset. As will be evident with other datasets, LFW’s approach has now become the norm.
-
For all of the more than 5,000 people in this dataset, their face is forever a part of facial recognition history. It would be impossible to remove anyone from the dataset because it is so ubiquitous. For the rest of their lives and forever after, these people will continue to be used for training facial recognition surveillance.
-
Code
-
#!/usr/bin/python
-
-import numpy as np
-from sklearn.datasets import fetch_lfw_people
-import imageio
-import imutils
-
-# download LFW dataset (first run takes a while)
-lfw_people = fetch_lfw_people(min_faces_per_person=1, resize=1, color=True, funneled=False)
-
-# introspect dataset
-n_samples, h, w, c = lfw_people.images.shape
-print(f'{n_samples:,} images at {w}x{h} pixels')
-cols, rows = (176, 76)
-n_ims = cols * rows
-
-# build montages
-im_scale = 0.5
-ims = lfw_people.images[:n_ims]
-montages = imutils.build_montages(ims, (int(w * im_scale), int(h * im_scale)), (cols, rows))
-montage = montages[0]
-
-# save full montage image
-imageio.imwrite('lfw_montage_full.png', montage)
-
-# make a smaller version
-montage_960 = imutils.resize(montage, width=960)
-imageio.imwrite('lfw_montage_960.jpg', montage_960)
-
If you are affected by disclosure of your identity in this dataset please do contact the authors. Many have stated that they are willing to remove images upon request. The authors of the LFW dataset provide the following email for inquiries:
-
You can use the following message to request removal from the dataset:
Subject: Request for Removal from LFW Face Dataset
-
Dear [researcher name],
-
I am writing to you about the "Labeled Faces in The Wild Dataset". Recently I discovered that your dataset includes my identity and I no longer wish to be included in your dataset.
-
The dataset is being used by thousands of companies around the world to improve facial recognition software, including usage by governments for the purpose of law enforcement, national security, tracking consumers in retail environments, and tracking individuals through public spaces.
-
My name as it appears in your dataset is [your name]. Please remove all images from your dataset and inform your newsletter subscribers to likewise update their copies.
VGG Face2 is the updated version of the VGG Face dataset and now includes over 3.3M face images from over 9K people. The identities were selected by taking the top 500K identities in Google's Knowledge Graph of celebrities and then selecting only the names that yielded enough training images. The dataset was created in the UK but funded by the Office of the Director of National Intelligence in the United States.
-
VGG Face2 by the Numbers
-
-
1,331 actresses, 139 presidents
-
3 husbands and 16 wives
-
2 snooker players
-
1 guru
-
1 pornographic actress
-
3 computer programmers
-
-
Names and descriptions
-
-
The original VGGF2 name list has been updated with the results returned from Google Knowledge
-
Names with a similarity score greater than 0.75 were automatically updated. Scores were computed using import difflib; seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower()); score = seq.ratio()
-
The 97 names with a score of 0.75 or lower were manually reviewed. The review included name changes validated using Wikipedia.org results, such as "Bruce Jenner" to "Caitlyn Jenner"; spousal last-name changes; discretionary changes to improve search results, such as combining nicknames with full names when appropriate (for example, changing "Aleksandar Petrović" to "Aleksandar 'Aco' Petrović"); and minor changes such as "Mohammad Ali" to "Muhammad Ali"
-
The 'Description' text was automatically added when the Knowledge Graph score was greater than 250
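The name-matching rule described in this list can be written out as a small runnable sketch. The `difflib.SequenceMatcher` call and the 0.75 threshold come from the notes above; the function names `name_similarity` and `needs_manual_review` are hypothetical wrappers added for illustration, not part of the project's code.

```python
import difflib

# Threshold taken from the notes above: scores above it were auto-updated.
AUTO_UPDATE_THRESHOLD = 0.75

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two names, in the range [0, 1]."""
    seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
    return seq.ratio()

def needs_manual_review(original: str, returned: str) -> bool:
    """True when the score is 0.75 or lower, so the name would not be auto-updated."""
    return name_similarity(original, returned) <= AUTO_UPDATE_THRESHOLD

# Identical names (ignoring case) score 1.0 and would be updated automatically.
print(name_similarity("George W. Bush", "george w. bush"))

# A legal name change shares little text, so it falls to manual review.
print(needs_manual_review("Bruce Jenner", "Caitlyn Jenner"))
```

Lowercasing both strings before comparison keeps capitalization differences from lowering the score.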
-
-
TODO
-
-
create name list, and populate with Knowledge graph information like LFW
-
make list of interesting number stats, by the numbers
-
make list of interesting important facts
-
write intro abstract
-
write analysis of usage
-
find examples, citations, and screenshots of usage
-
find list of companies using it for table
-
create montages of the dataset, like LFW
-
create right to removal information
-
-
-
-
-
-
-
-
-
\ No newline at end of file
diff --git a/site/public/index.html b/site/public/index.html
index cb357e3f..62f78978 100644
--- a/site/public/index.html
+++ b/site/public/index.html
@@ -28,7 +28,7 @@