------------
status: published
title: Labeled Faces in The Wild
desc: LFW: Labeled Faces in The Wild
slug: lfw
published: 2019-2-23
updated: 2019-2-23
authors: Adam Harvey
------------
# LFW
+ Years: 2002-2004
+ Images: 13,233
+ Identities: 5,749
+ Origin: Yahoo News Images
+ Funding: TBD

*Labeled Faces in The Wild* (LFW) is "a database of face photographs designed for studying the problem of unconstrained face recognition"[^lfw_www]. It is used to evaluate and improve the performance of facial recognition algorithms in academic, commercial, and government research. According to BiometricUpdate.com[^lfw_pingan], LFW is "the most widely used evaluation set in the field of facial recognition, LFW attracts a few dozen teams from around the globe including Google, Facebook, Microsoft Research Asia, Baidu, Tencent, SenseTime, Face++ and Chinese University of Hong Kong."
The LFW dataset includes 13,233 images of 5,749 people that were collected between 2002 and 2004. LFW is a subset of *Names and Faces* and is part of the first facial recognition training dataset created entirely from images appearing on the Internet. The people appearing in LFW are...
The *Names and Faces* dataset was the first face recognition dataset created entirely from online photos. However, *Names and Faces* and *LFW* are not the first face recognition datasets created entirely "in the wild". That title belongs to the [UCD dataset](/datasets/ucd_faces/). Images obtained "in the wild" are images used without explicit consent or awareness from the subject or photographer.
### Analysis
- There are about 3 men for every 1 woman (4,277 men and 1,472 women) in the LFW dataset[^lfw_www]
- The person with the most images is [George W. Bush](http://vis-www.cs.umass.edu/lfw/person/George_W_Bush_comp.html) with 530
- There are about 3 George W. Bush's for every 1 [Tony Blair](http://vis-www.cs.umass.edu/lfw/person/Tony_Blair.html)
- 70% of people in the dataset have only 1 image and 29% have 2 or more images
- The LFW dataset includes over 500 actors, 30 models, 10 presidents, 124 basketball players, 24 football players, 11 kings, 7 queens, and 1 [Moby](http://vis-www.cs.umass.edu/lfw/person/Moby.html)
- In all 3 of the LFW publications [^lfw_original_paper], [^lfw_survey], [^lfw_tech_report] the words "ethics", "consent", and "privacy" appear 0 times
- The word "future" appears 71 times
- Funding sources for the dataset were not included in the original paper. The authors' follow-up survey, "Labeled Faces in the Wild: Updates and New Reporting Procedures", was later supported in part by IARPA and ODNI
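
The ratio and share figures in this list are simple arithmetic over per-identity counts. As a minimal sketch (the helper names `gender_ratio` and `singleton_share` and the toy label list are hypothetical and not part of any LFW tooling), the gender ratio can be checked from the counts quoted above, and the share of single-image identities can be computed from any list holding one identity label per image:

```python
from collections import Counter

# Counts quoted in the Analysis list above.
MEN, WOMEN = 4277, 1472


def gender_ratio(men: int, women: int) -> float:
    """Return the number of men per woman."""
    return men / women


def singleton_share(labels):
    """Fraction of identities that appear exactly once in `labels`.

    `labels` holds one identity name per image (assumed input format).
    """
    counts = Counter(labels)
    singles = sum(1 for n in counts.values() if n == 1)
    return singles / len(counts)


if __name__ == "__main__":
    print(f"{gender_ratio(MEN, WOMEN):.2f} men per woman")
    # Toy labels for illustration only, not real LFW data.
    toy = ["a", "a", "b", "c", "d"]
    print(f"{singleton_share(toy):.0%} of identities have one image")
```

Note that the two gender counts sum to 5,749, the number of identities, so the ratio is over people rather than over images.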
### Visualizations
To visualize the types of photos in the dataset without explicitly publishing individuals' identities, we use a Generative Adversarial Network trained on the entire dataset to represent the archetypical looks within the dataset's visual latent space.

### Biometric Trade Routes
To understand how this dataset has been used, its citations have been geocoded to show an approximate geographic digital trade route of the biometric data. Lines indicate an organization (education, commercial, or governmental) that has cited the LFW dataset in their research. Data is compiled from [SemanticScholar](https://www.semanticscholar.org).
[add map here]
### Citations
Browse or download the geocoded citation data collected for the LFW dataset.
[add citations table here]
### Additional Information
(tweet-sized snippets go here)
- The LFW dataset is considered the "most popular benchmark for face recognition" [^lfw_baidu]
- The LFW dataset is "the most widely used evaluation set in the field of facial recognition" [^lfw_pingan]
- All images in the LFW dataset were obtained "in the wild", meaning without any consent from the subject or from the photographer
- The faces in the LFW dataset were detected using the Viola-Jones haarcascade face detector [^lfw_website] [^lfw-survey]
- The LFW dataset is used by several of the largest tech companies in the world including "Google, Facebook, Microsoft Research Asia, Baidu, Tencent, SenseTime, Face++ and Chinese University of Hong Kong." [^lfw_pingan]
- All images in the LFW dataset were copied from Yahoo News between 2002 and 2004
- In 2014, two of the four original authors of the LFW dataset received funding from IARPA and ODNI for their follow-up paper "Labeled Faces in the Wild: Updates and New Reporting Procedures" via IARPA contract number 2014-14071600010
TODO (need citations for the following)
- SenseTime, which has relied on LFW for benchmarking its facial recognition performance, is one of the leading providers of surveillance to the Chinese Government [need citation for this fact. is it the most? or is that Tencent?]
- Two of the four original authors received funding from the Office of the Director of National Intelligence and IARPA for their 2016 LFW survey follow-up report
- The dataset includes one intelligence chief, George Tenet, former Director of Central Intelligence (DCI) for the Central Intelligence Agency



## Code
The LFW dataset is so widely used that scikit-learn, a popular Python machine learning library, includes a function called `fetch_lfw_people` that downloads the faces in the LFW dataset.
```python
#!/usr/bin/python
# ------------------------------------------------------------
#
# Script to generate montage of LFW faces used in scikit-learn
#
# ------------------------------------------------------------
import numpy as np
from sklearn.datasets import fetch_lfw_people
import imageio
import imutils
# download LFW dataset (first run takes a while)
lfw_people = fetch_lfw_people(min_faces_per_person=1, resize=1, color=True, funneled=False)
# introspect dataset
n_samples, h, w, c = lfw_people.images.shape
print(f'{n_samples:,} images at {w}x{h} pixels')
cols, rows = (176, 76)
n_ims = cols * rows
# build montages
im_scale = 0.5
ims = lfw_people.images[:n_ims]
montages = imutils.build_montages(ims, (int(w * im_scale), int(h * im_scale)), (cols, rows))
montage = montages[0]
# save full montage image
imageio.imwrite('lfw_montage_full.png', montage)
# make a smaller version
montage = imutils.resize(montage, width=960)
imageio.imwrite('lfw_montage_960.jpg', montage)
```
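
The `cols, rows = (176, 76)` grid in the script is presumably sized so that a single montage can hold every image: 176 × 76 = 13,376 tiles, just above the 13,233 images in the dataset. A small sketch of that arithmetic, with `grid_rows` as a hypothetical helper name:

```python
import math


def grid_rows(n_images: int, cols: int) -> int:
    """Smallest number of rows so that a cols-wide grid holds n_images tiles."""
    return math.ceil(n_images / cols)


n_images = 13_233  # LFW image count
cols = 176         # column count used in the script above
rows = grid_rows(n_images, cols)
# Print the rows needed and the number of empty tiles left over.
print(rows, cols * rows - n_images)
```

With one fewer row the grid would hold only 13,200 tiles, which is why 76 rows are needed for 176 columns.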
### Supplementary Material
```
load_file assets/lfw_commercial_use.csv
name_display, company_url, example_url, country, description
```
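
The supplementary file is declared with five columns. Assuming it is an ordinary comma-separated file with that header row (the sample row below is a made-up placeholder, not real data, and `read_commercial_use` is a hypothetical helper), it could be read with Python's standard `csv` module like this:

```python
import csv
import io

# Column names as listed above; the data row is a hypothetical placeholder.
SAMPLE = (
    "name_display, company_url, example_url, country, description\n"
    "Example Co, https://example.com, https://example.com/lfw, N/A, placeholder row\n"
)


def read_commercial_use(fileobj):
    """Parse rows into dicts keyed by column name, ignoring spaces after commas."""
    return list(csv.DictReader(fileobj, skipinitialspace=True))


rows = read_commercial_use(io.StringIO(SAMPLE))
print(rows[0]["name_display"], rows[0]["country"])
```

Passing `skipinitialspace=True` means the same code works whether or not the real file puts a space after each comma.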
Text and graphics ©Adam Harvey / megapixels.cc
-------
Ignore text below these lines
-------
Research
> This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract number 2014-14071600010.
"Labeled Faces in the Wild: Updates and New Reporting Procedures"
[^lfw_www]: <http://vis-www.cs.umass.edu/lfw/results.html>
[^lfw_baidu]: Jingtuo Liu, Yafeng Deng, Tao Bai, Zhengping Wei, Chang Huang. Targeting Ultimate Accuracy: Face Recognition via Deep Embedding. <https://arxiv.org/abs/1506.07310>
[^lfw_pingan]: Lee, Justin. "PING AN Tech facial recognition receives high score in latest LFW test results". BiometricUpdate.com. Feb 13, 2017. <https://www.biometricupdate.com/201702/ping-an-tech-facial-recognition-receives-high-score-in-latest-lfw-test-results>