------------

status: draft
title: IJB-C
desc: IARPA Janus Benchmark C is a dataset of web images used for face recognition research and development
subdesc: The IJB-C dataset contains 21,294 images and 11,779 videos of 3,531 identities
slug: ijb_c
cssclass: dataset
image: assets/background.jpg
year: 2017
published: 2019-4-18
updated: 2019-4-18
authors: Adam Harvey

------------

## IARPA Janus Benchmark C (IJB-C)

### sidebar
### end sidebar

[ page under development ]

The IARPA Janus Benchmark C (IJB-C) is a dataset of web images used for face recognition research and development. The IJB-C dataset contains 3,531 people in 21,294 images and 11,779 videos. The list of 3,531 names includes activists, artists, journalists, foreign politicians, and public speakers.

Key Findings:

- Metadata annotations were created by crowd workers on Amazon Mechanical Turk
- The dataset was created by Noblis
- It was made for intelligence analysts
- It is intended to improve the performance of face recognition tools by fusing the rich spatial, temporal, and contextual information available from the multiple views captured by today’s "media in the wild"


The dataset includes Creative Commons images.


The name list includes:

- 2 videos from CCC
	- yq6ZC-YLHZA.png
		- Katharina Nocun: Deine Rechte sind in diesen Freihandelsabkommen nicht verfügbar
	- fF2MxkDzlVg
		- Jillian York: "Technology companies now hold an unprecedented ability to shape the world around us by limiting our ability to access certain content and by crafting proprietary algorithms that bring us our daily streams of content. Matthew Stender, Jillian C. York"
	- Maya Zankoul. She's an old friend, a Lebanese web designer who's put out a couple of books locally and has a Wikipedia page, probably created by a Lebanese Wikipedia editor die-hard. Not famous. How on earth?
	- Melissa Gira Grant (also a journalist)
	- Nadezhda Tolokonnikova (Pussy Riot)
	- Derrick Ashong (activist and journalist)
	- Michael Anti
	- Lina Ben Mhenni
	- Manal al-Sharif
	- Juan Carlos de Martin (not an activist but not really famous either!)
	- Anita Sarkeesian
	- Amal Clooney (lawyer)
	- Anil Dash (startup guy)
	- Bruno Latour (philosopher)
	- Dan Gillmor (tech journalist)
	- Eben Upton (founder of Raspberry Pi)
	- Evgeny Morozov
	- Gabriella Coleman
	- Maria Popova
	- Molly Crabapple
	- Paola Antonelli
	- Seymour Hersh
	- Ta-Nehisi Coates

The first 777 names are not in alphabetical order; from 777 to 3,531 the list is alphabetical.
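An observation like this can be checked by finding where the sorted tail of the list begins. The following is a minimal sketch, not the method used to produce the figure above: the function name `alphabetical_tail_start` and the toy names are hypothetical, and it assumes the name list has already been loaded into a Python list in file order.

```python
from typing import List


def alphabetical_tail_start(names: List[str]) -> int:
    """Return the 0-based index where the longest alphabetically
    sorted run at the end of the list begins."""
    start = len(names) - 1
    # Walk backwards while each name sorts at or before the one after it.
    while start > 0 and names[start - 1].casefold() <= names[start].casefold():
        start -= 1
    return max(start, 0)


# Toy data (hypothetical, not taken from the dataset).
toy_names = ["Zed", "Mia", "Ann", "Bob", "Cat"]
print(alphabetical_tail_start(toy_names))
```

Comparison here is case-insensitive on the full string; a list sorted by surname or by some other key would need a different comparison.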



![caption: A visualization of the IJB-C dataset](assets/ijb_c_montage.jpg)


## Research notes

From the original paper: https://noblis.org/wp-content/uploads/2018/03/icb2018.pdf

Collection for the dataset began by identifying Creative Commons subject videos, which are often more scarce than Creative Commons subject images. Search terms that resulted in large quantities of person-centric videos (e.g. “interview”) were generated and translated into numerous languages including Arabic, Korean, Swahili, and Hindi to increase diversity of the subject pool. Certain YouTube users who upload well-labeled, person-centric videos, such as the World Economic Forum and the International University Sports Federation were also identified. Titles of videos pertaining to these search terms and usernames were scraped using the YouTube Data API and translated into English using the Yandex Translate API. Pattern matching was performed to extract potential names of subjects from the translated titles, and these names were searched using the Wikidata API to verify the subject’s existence and status as a public figure, and to check for Wikimedia Commons imagery. Age, gender, and geographic region were collected using the Wikipedia API.

Using the candidate subject names, Creative Commons images were scraped from Google and Wikimedia Commons, and Creative Commons videos were scraped from YouTube. After images and videos of the candidate subject were identified, AMT Workers were tasked with validating the subject’s presence throughout the video. The AMT Workers marked segments of the video in which the subject was present, and key frames [...]
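The paper excerpt does not say which patterns were used to pull names out of translated video titles. As a rough illustration of that step only, the sketch below assumes a simple regular expression that matches runs of two to four capitalized words. The pattern, the function name `extract_candidate_names`, and the sample titles are all hypothetical.

```python
import re
from typing import List

# Assumed pattern (not from the paper): two to four consecutive
# capitalized words, separated by a space or hyphen.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:[ -][A-Z][a-z]+){1,3}\b")


def extract_candidate_names(title: str) -> List[str]:
    """Return capitalized multi-word spans that might be personal names."""
    return NAME_PATTERN.findall(title)


# Hypothetical translated titles, for illustration only.
titles = [
    "Interview with Jane Example on open data",
    "press conference highlights",
]
for title in titles:
    print(title, "->", extract_candidate_names(title))
```

A pattern this loose will also match capitalized phrases that are not names, which is consistent with the excerpt's description of a follow-up lookup to verify each candidate as a public figure.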


IARPA funds an Italian researcher: https://www.micc.unifi.it/projects/glaivejanus/

{% include 'dashboard.html' %}

{% include 'supplementary_header.html' %}

{% include 'cite_our_work.html' %}

### Footnotes