Labeled Faces in the Wild

Created

2007

Images

13,233

People

5,749

Created From

Yahoo News images

Search available

Searchable

Eighteen of the 5,749 people in the Labeled Faces in the Wild Dataset. The most widely used face dataset for benchmarking commercial face recognition algorithms.

Intro

Labeled Faces in The Wild (LFW) is among the most widely used facial recognition training datasets in the world and is the first of its kind to be created entirely from images posted online. The LFW dataset includes 13,233 images of 5,749 people that were collected between 2002-2004. Use the tools below to check if you were included in this dataset or scroll down to read the analysis.

Three paragraphs describing the LFW dataset in a format that can be easily replicated for the other datasets. Nothing too custom. An analysis of the initial research papers with context relative to all the other dataset papers.

From George W. Bush to Jamie Lee Curtis: all 5,749 people in the LFW Dataset sorted from most to least images collected.

LFW by the Numbers

LFW

Years

2002-2004

Images

13,233

Identities

5,749

Origin

Yahoo News Images

Funding

(Possibly, partially CIA*)

Eighteen of the 5,749 people in the Labeled Faces in the Wild Dataset. The most widely used face dataset for benchmarking commercial face recognition algorithms.

Labeled Faces in The Wild (LFW) is "a database of face photographs designed for studying the problem of unconstrained face recognition"[^lfw_www]. It is used to evaluate and improve the performance of facial recognition algorithms in academic, commercial, and government research. According to BiometricUpdate.com[^lfw_pingan], LFW is "the most widely used evaluation set in the field of facial recognition, LFW attracts a few dozen teams from around the globe including Google, Facebook, Microsoft Research Asia, Baidu, Tencent, SenseTime, Face++ and Chinese University of Hong Kong."

The LFW dataset includes 13,233 images of 5,749 people that were collected between 2002-2004. LFW is a subset of Names and Faces and is part of the first facial recognition training dataset created entirely from images appearing on the Internet. The people appearing in LFW are...

The Names and Faces dataset was the first face recognition dataset created entirely from online photos. However, Names and Faces and LFW are not the first face recognition datasets created entirely "in the wild". That title belongs to the UCD dataset. Images obtained "in the wild" are images used without explicit consent or awareness from the subject or photographer.

Analysis

Was first published in 2007
Developed out of a prior dataset from Berkeley called "Faces in the Wild" or "Names and Faces" [^lfw_original_paper]
Includes 13,233 images and 5,749 different people [^lfw_website]
There are about 3 men for every 1 woman (4,277 men and 1,472 women)[^lfw_website]
The person with the most images is George W. Bush with 530
Most people (70%) in the dataset have only 1 image
There are 1,680 people in the dataset with 2 or more images [^lfw_website]
Two out of 4 of the original authors received funding from the Office of the Director of National Intelligence and IARPA for their 2016 LFW survey follow up report
The LFW dataset includes over 500 actors, 30 models, 10 presidents, 24 football players, 124 basketball players, 11 kings, and 2 queens
In all the LFW publications provided by the authors the words "ethics", "consent", and "privacy" appear 0 times [^lfw_original_paper], [^lfw_survey], [^lfw_tech_report] , [^lfw_website]
There are about 3 men for every 1 woman (4,277 men and 1,472 women) in the LFW dataset[^lfw_www]
The person with the most images is George W. Bush with 530
There are about 3 George W. Bush's for every 1 Tony Blair
70% of people in the dataset have only 1 image and 29% have 2 or more images
The LFW dataset includes over 500 actors, 30 models, 10 presidents, 124 basketball players, 24 football players, 11 kings, 7 queens, and 1 Moby
In all 3 of the LFW publications [^lfw_original_paper], [^lfw_survey], [^lfw_tech_report] the words "ethics", "consent", and "privacy" appear 0 times
The word "future" appears 71 times
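The ratios in the list above can be re-derived from the raw counts quoted on this page. The following is a minimal sketch for checking that arithmetic. The variable names are illustrative, and the Tony Blair count (144) is taken from the image caption elsewhere on this page.

```python
# Raw counts as quoted on this page (not re-measured from the dataset itself).
men, women = 4277, 1472
identities = 5749
people_with_multiple_images = 1680
bush_images, blair_images = 530, 144

# Sanity check: the gender counts should add up to the number of identities.
assert men + women == identities

men_per_woman = men / women
multi_image_share = people_with_multiple_images / identities
single_image_share = 1 - multi_image_share
bush_per_blair = bush_images / blair_images

print(f"men per woman: {men_per_woman:.2f}")
print(f"people with 2 or more images: {multi_image_share:.1%}")
print(f"people with exactly 1 image: {single_image_share:.1%}")
print(f"Bush images per Blair image: {bush_per_blair:.2f}")
```

Run as-is, this gives a quick way to confirm that rounded figures such as "about 3 to 1" and "29%" stay consistent with the underlying counts whenever the list is updated.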

Facts

Synthetic Faces

To visualize the types of photos in the dataset without explicitly publishing individual's identities a generative adversarial network (GAN) was trained on the entire dataset. The images in this video show a neural network learning the visual latent space and then interpolating between archetypical identities within the LFW dataset.

Biometric Trade Routes

To understand how this dataset has been used, its citations have been geocoded to show an approximate geographic digital trade route of the biometric data. Lines indicate an organization (education, commercial, or governmental) that has cited the LFW dataset in their research. Data is compiled from SemanticScholar.

[add map here]

Citations

Browse or download the geocoded citation data collected for the LFW dataset.

[add citations table here]

Additional Information

(tweet-sized snippets go here)

Was created for the purpose of improving "unconstrained face recognition" [^lfw_original_paper]
All images in LFW were obtained "in the wild" meaning without any consent from the subject or from the photographer
The faces were detected using the Viola-Jones haarcascade face detector [^lfw_website] [^lfw_survey]
Is considered the "most popular benchmark for face recognition" [^lfw_baidu]
Is "the most widely used evaluation set in the field of facial recognition" [^lfw_pingan]
Is used by several of the largest tech companies in the world including "Google, Facebook, Microsoft Research Asia, Baidu, Tencent, SenseTime, Face++ and Chinese University of Hong Kong." [^lfw_pingan]
-
All images were copied from Yahoo News between 2002 - 2004 [^lfw_original_paper]
+
The LFW dataset is considered the "most popular benchmark for face recognition" [^lfw_baidu]
The LFW dataset is "the most widely used evaluation set in the field of facial recognition" [^lfw_pingan]
All images in LFW dataset were obtained "in the wild" meaning without any consent from the subject or from the photographer
The faces in the LFW dataset were detected using the Viola-Jones haarcascade face detector [^lfw_website] [^lfw_survey]
The LFW dataset is used by several of the largest tech companies in the world including "Google, Facebook, Microsoft Research Asia, Baidu, Tencent, SenseTime, Face++ and Chinese University of Hong Kong." [^lfw_pingan]
All images in the LFW dataset were copied from Yahoo News between 2002 - 2004
In 2014, two of the four original authors of the LFW dataset received funding from IARPA and ODNI for their follow up paper Labeled Faces in the Wild: Updates and New Reporting Procedures via IARPA contract number 2014-14071600010
The dataset includes 2 images of George Tenet, the former Director of Central Intelligence (DCI) for the Central Intelligence Agency whose facial biometrics were eventually used to help train facial recognition software in China and Russia
SenseTime, who has relied on LFW for benchmarking their facial recognition performance, is the leading provider of surveillance to the Chinese Government
In 2014, 2/4 of the original authors of the LFW dataset received funding from IARPA and ODNI for their follow up paper "Labeled Faces in the Wild: Updates and New Reporting Procedures" via IARPA contract number 2014-14071600010
The LFW dataset was used Center for Intelligent Information Retrieval, the Central Intelligence Agency, the National Security Agency and National

TODO (need citations for the following)

SenseTime, who has relied on LFW for benchmarking their facial recognition performance, is one of the leading providers of surveillance to the Chinese Government [need citation for this fact. is it the most? or is that Tencent?]
Two out of 4 of the original authors received funding from the Office of Director of National Intelligence and IARPA for their 2016 LFW survey follow up report


former President George W. Bush

Colin Powell (236), Tony Blair (144), and Donald Rumsfeld (121)

People and Companies using the LFW Dataset

This section describes who is using the dataset and for what purposes. It should include specific examples of people or companies with citations and screenshots. This section is followed up by the graph, the map, and then the supplementary material.

The LFW dataset is used by numerous companies for benchmarking algorithms and in some cases training. According to the benchmarking results page [^lfw_results] provided by the authors, over 2 dozen companies have contributed their benchmark results.

According to BiometricUpdate.com [^lfw_pingan], LFW is "the most widely used evaluation set in the field of facial recognition, LFW attracts a few dozen teams from around the globe including Google, Facebook, Microsoft Research Asia, Baidu, Tencent, SenseTime, Face++ and Chinese University of Hong Kong."

According to researchers at the Baidu Research – Institute of Deep Learning, "LFW has been the most popular evaluation benchmark for face recognition, and played a very important role in facilitating the face recognition society to improve algorithm."[^lfw_baidu]

In addition to commercial use as an evaluation tool, all of the faces in the LFW dataset are prepackaged into a popular machine learning code framework called scikit-learn.

"PING AN Tech facial recognition receives high score in latest LFW test results"

"Face Recognition Performance in LFW benchmark"

"The 1st place in face verification challenge, LFW"

In benchmarking, companies use a dataset to evaluate their algorithms which are typically trained on other data. After training, researchers will use LFW as a benchmark to compare results with other algorithms.
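As a rough illustration of what benchmarking on a set of face pairs involves, the sketch below scores a hypothetical algorithm on toy data. It assumes the algorithm outputs a similarity score per image pair and that a pair is declared a match when the score reaches a fixed threshold. The function name, scores, labels, and threshold are all made up for illustration, and this is not the official LFW evaluation protocol.

```python
from typing import Sequence


def verification_accuracy(
    scores: Sequence[float],
    same_person: Sequence[bool],
    threshold: float,
) -> float:
    """Return the fraction of pairs classified correctly at a fixed threshold."""
    if len(scores) != len(same_person):
        raise ValueError("scores and same_person must have the same length")
    if not scores:
        raise ValueError("at least one pair is required")
    correct = sum(
        (score >= threshold) == is_same
        for score, is_same in zip(scores, same_person)
    )
    return correct / len(scores)


# Hypothetical similarity scores for six image pairs and their ground truth.
toy_scores = [0.91, 0.15, 0.78, 0.40, 0.66, 0.72]
toy_labels = [True, False, True, False, True, False]

toy_accuracy = verification_accuracy(toy_scores, toy_labels, threshold=0.6)
print(f"accuracy on toy pairs: {toy_accuracy:.3f}")
```

The important point is that the pairs being scored come from the benchmark dataset, while the model producing the scores was typically trained elsewhere.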

For example, Baidu (est. net worth $13B) uses LFW to report results for their paper "Targeting Ultimate Accuracy: Face Recognition via Deep Embedding". According to the three Baidu researchers who produced the paper:

Citations

Overall, LFW has at least 116 citations from 11 countries.

Conclusion

The LFW face recognition training and evaluation dataset is a historically important face dataset as it was the first popular dataset to be created entirely from Internet images, paving the way for a global trend towards downloading anyone’s face from the Internet and adding it to a dataset. As will be evident with other datasets, LFW’s approach has now become the norm.

For the more than 5,000 people in this dataset, their faces are forever a part of facial recognition history. It would be impossible to remove anyone from the dataset because it is so ubiquitous. For the rest of their lives and forever after, these people will continue to be used for training facial recognition surveillance.

Colin Powell (236), Tony Blair (144), and Donald Rumsfeld (121)

All 5,379 faces in the Labeled Faces in The Wild Dataset

Code

The LFW dataset is so widely used that a popular code library called scikit-learn includes a function called fetch_lfw_people to download the faces in the LFW dataset.

#!/usr/bin/python
 
 import numpy as np
@@ -87,26 +93,38 @@ lfw_people = fetch_lfw_people(min_faces_per_person=1, resize=1, color=True, funn
 
 # introspect dataset
 n_samples, h, w, c = lfw_people.images.shape
-print('{:,} images at {}x{}'.format(n_samples, w, h))
+print(f'{n_samples:,} images at {w}x{h} pixels')
 cols, rows = (176, 76)
 n_ims = cols * rows
 
 # build montages
 im_scale = 0.5
-ims = lfw_people.images[:n_ims
-montages = imutils.build_montages(ims, (int(w*im_scale, int(h*im_scale)), (cols, rows))
+ims = lfw_people.images[:n_ims]
+montages = imutils.build_montages(ims, (int(w * im_scale), int(h * im_scale)), (cols, rows))
 montage = montages[0]
 
 # save full montage image
 imageio.imwrite('lfw_montage_full.png', montage)
 
 # make a smaller version
-montage_960 = imutils.resize(montage, width=960)
-imageio.imwrite('lfw_montage_960.jpg', montage_960)
+montage = imutils.resize(montage, width=960)
+imageio.imwrite('lfw_montage_960.jpg', montage)
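The per-person image counts quoted on this page (for example, sorting people from most to least images) could be derived from the same loaded dataset. The sketch below is one possible way to do that, assuming integer identity labels and a list of names like the `target` and `target_names` attributes that scikit-learn's `fetch_lfw_people` returns. The helper names and the toy data are hypothetical, and nothing is downloaded here.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def images_per_person(
    targets: Iterable[int], target_names: Sequence[str]
) -> List[Tuple[str, int]]:
    """Rank identities from most to least images."""
    counts = Counter(int(t) for t in targets)
    return [(str(target_names[idx]), n) for idx, n in counts.most_common()]


def share_with_single_image(ranked: Sequence[Tuple[str, int]]) -> float:
    """Fraction of identities that appear in exactly one image."""
    singles = sum(1 for _, n in ranked if n == 1)
    return singles / len(ranked)


# Toy stand-in for the real labels: three people, six images.
toy_names = ["Person A", "Person B", "Person C"]
toy_targets = [0, 0, 0, 1, 2, 2]

ranked = images_per_person(toy_targets, toy_names)
for name, n_images in ranked:
    print(f"{name}: {n_images}")
print(f"single-image share: {share_with_single_image(ranked):.0%}")
```

With the real data, the call would look like `images_per_person(lfw_people.target, lfw_people.target_names)`, using the `lfw_people` object loaded in the script above.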

Disclaimer

MegaPixels is an educational art project designed to encourage discourse about facial recognition datasets. Any ethical or legal issues should be directed to the researcher's parent organizations. Except where necessary for contact or clarity, the names of researchers have been substituted by their parent organization. In no way does this project aim to vilify researchers who produced the datasets.

Read more about MegaPixels Code of Conduct

Supplementary Material

Text and graphics ©Adam Harvey / megapixels.cc

Ignore text below these lines

Research

"In our experiments, we used 10000 images and associated captions from the Faces in the wilddata set [3]."
"This work was supported in part by the Center for Intelligent Information Retrieval, the Central Intelligence Agency, the National Security Agency and National Science Foundation under CAREER award IIS-0546666 and grant IIS-0326249."
From: "People-LDA: Anchoring Topics to People using Face Recognition" https://www.semanticscholar.org/paper/People-LDA%3A-Anchoring-Topics-to-People-using-Face-Jain-Learned-Miller/10f17534dba06af1ddab96c4188a9c98a020a459 and https://ieeexplore.ieee.org/document/4409055
This paper was presented at IEEE 11th ICCV conference Oct 14-21 and the main LFW paper "Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments" was also published that same year
10f17534dba06af1ddab96c4188a9c98a020a459
+
This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract number 2014-14071600010.
+
From "Labeled Faces in the Wild: Updates and New Reporting Procedures"

diff --git a/site/public/datasets/vgg_face2/index.html b/site/public/datasets/vgg_face2/index.html index b7ba5a4c..08b02cc7 100644 --- a/site/public/datasets/vgg_face2/index.html +++ b/site/public/datasets/vgg_face2/index.html @@ -4,7 +4,7 @@ MegaPixels - + @@ -27,35 +27,10 @@

VGG Faces2

Created

2018

Images

3.3M

People

9,000

Created From

Scraping search engines

Search available

[Searchable](#)

VGG Face2 is the updated version of the VGG Face dataset and now includes over 3.3M face images from over 9K people. The identities were selected by taking the top 500K identities in Google's Knowledge Graph of celebrities and then selecting only the names that yielded enough training images. The dataset was created in the UK but funded by the Office of the Director of National Intelligence in the United States.

VGG Face2 by the Numbers

VGG Face 2

Years

TBD

Images

TBD

Identities

TBD

Origin

TBD

Funding

IARPA

...

Analysis

1,331 actresses, 139 presidents
3 husbands and 16 wives
2 snooker players
1 guru
1 pornographic actress
3 computer programmers

Names and descriptions

The original VGGF2 name list has been updated with the results returned from Google Knowledge Graph
Names with a similarity score greater than 0.75 were automatically updated. Scores were computed using import difflib; seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower()); score = seq.ratio()
The 97 names with a score of 0.75 or lower were manually reviewed. The review included name changes validated using Wikipedia.org results, such as "Bruce Jenner" to "Caitlyn Jenner"; spousal last-name changes; discretionary changes to improve search results, such as combining nicknames with full names when appropriate (for example, changing "Aleksandar Petrović" to "Aleksandar 'Aco' Petrović"); and minor changes such as "Mohammad Ali" to "Muhammad Ali"
The 'Description' text was automatically added when the Knowledge Graph score was greater than 250
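The scoring rule described in the list above can be sketched as a small helper. It wraps the same `difflib.SequenceMatcher` call quoted above and applies the 0.75 cutoff. The function names and the "Jon Smith" pair are hypothetical and only illustrate the rule; the actual pipeline compared each original name against whatever the Knowledge Graph returned.

```python
import difflib

# Cutoff quoted above: scores greater than 0.75 were updated automatically.
AUTO_UPDATE_THRESHOLD = 0.75


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two names, from 0.0 to 1.0."""
    seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
    return seq.ratio()


def needs_manual_review(original: str, returned: str) -> bool:
    """True when the score is at or below the automatic-update cutoff."""
    return name_similarity(original, returned) <= AUTO_UPDATE_THRESHOLD


# The first pair is a made-up near match; the second is quoted above.
for original, returned in [
    ("Jon Smith", "John Smith"),
    ("Bruce Jenner", "Caitlyn Jenner"),
]:
    score = name_similarity(original, returned)
    status = "manual review" if needs_manual_review(original, returned) else "auto update"
    print(f"{original} -> {returned}: {score:.3f} ({status})")
```

A ratio-based cutoff like this handles small spelling differences automatically but sends genuine name changes to a human, which matches the manual review step described above.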

TODO

create name list, and populate with Knowledge graph information like LFW
make list of interesting number stats, by the numbers
make list of interesting important facts
write intro abstract
write analysis of usage
find examples, citations, and screenshots of usage
find list of companies using it for table
create montages of the dataset, like LFW
create right to removal information
The VGG Face 2 dataset includes approximately 1,331 actresses, 139 presidents, 16 wives, 3 husbands, 2 snooker players, and 1 guru

diff --git a/site/public/datasets_v0/index.html b/site/public/datasets_v0/index.html new file mode 100644 index 00000000..71147a64 --- /dev/null +++ b/site/public/datasets_v0/index.html @@ -0,0 +1,53 @@ + + + + MegaPixels + + + + + + + + + + + + +

+ +

MegaPixels

+ +

+ Datasets + Research + About +

+ +

Facial Recognition Datasets

Regular Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Summary

Found

275 datasets

Created between

1993-2018

Smallest dataset

20 images

Largest dataset

10,000,000 images

Highest resolution faces

450x500 (Unconstrained College Students)

Lowest resolution faces

16x20 pixels (QMUL SurvFace)

+ +

+ + + + + \ No newline at end of file diff --git a/site/public/datasets_v0/lfw/index.html b/site/public/datasets_v0/lfw/index.html new file mode 100644 index 00000000..b4ee82a3 --- /dev/null +++ b/site/public/datasets_v0/lfw/index.html @@ -0,0 +1,131 @@ + + + + MegaPixels + + + + + + + + + + + + +

+ +

MegaPixels

+ +

+ Datasets + Research + About +

+ +

Labeled Faces in the Wild

Created

2007

Images

13,233

People

5,749

Created From

Yahoo News images

Search available

Searchable

Eighteen of the 5,749 people in the Labeled Faces in the Wild Dataset. The most widely used face dataset for benchmarking commercial face recognition algorithms.