------------

status: draft
title: IJB-C
desc: IARPA Janus Benchmark C is a dataset of web images used
subdesc: The IJB-C dataset contains 21,294 images and 11,779 videos of 3,531 identities
slug: ijb_c
cssclass: dataset
image: assets/background.jpg
year: 2017
published: 2019-4-18
updated: 2019-4-18
authors: Adam Harvey

------------

## IARPA Janus Benchmark C (IJB-C)

### sidebar
### end sidebar

[ page under development ]

The IARPA Janus Benchmark C is a dataset created by 


![caption: A visualization of the IJB-C dataset](assets/ijb_c_montage.jpg)


## Research notes

From the original paper: https://noblis.org/wp-content/uploads/2018/03/icb2018.pdf

Collection for the dataset began by identifying Creative Commons subject videos, which are often more scarce than Creative Commons subject images. Search terms that resulted in large quantities of person-centric videos (e.g. “interview”) were generated and translated into numerous languages including Arabic, Korean, Swahili, and Hindi to increase diversity of the subject pool. Certain YouTube users who upload well-labeled, person-centric videos, such as the World Economic Forum and the International University Sports Federation were also identified. Titles of videos pertaining to these search terms and usernames were scraped using the YouTube Data API and translated into English using the Yandex Translate API. Pattern matching was performed to extract potential names of subjects from the translated titles, and these names were searched using the Wikidata API to verify the subject’s existence and status as a public figure, and to check for Wikimedia Commons imagery. Age, gender, and geographic region were collected using the Wikipedia API.

Using the candidate subject names, Creative Commons images were scraped from Google and Wikimedia Commons, and Creative Commons videos were scraped from YouTube. After images and videos of the candidate subject were identified, AMT Workers were tasked with validating the subject’s presence throughout the video. The AMT Workers marked segments of the video in which the subject was present, and key frames
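The paper does not publish the patterns used to pull names out of translated video titles, so the following is only a minimal sketch of what that step might look like. The regular expression, the stop-word list, and the function name `extract_candidate_names` are all hypothetical and chosen for illustration; a real pipeline would then pass each candidate to a Wikidata lookup, which is omitted here.

```python
import re
from typing import List

# Hypothetical pattern (not from the paper): a run of two to four
# consecutive capitalized words, e.g. "Angela Merkel".
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,3}\b")

# Hypothetical stop list of capitalized words that commonly appear in
# titles but are not part of a person's name.
NON_NAME_WORDS = {"Interview", "World", "Economic", "Forum", "Official", "Video"}


def extract_candidate_names(title: str) -> List[str]:
    """Return capitalized word runs in a title that may be person names."""
    candidates = []
    for match in NAME_PATTERN.finditer(title):
        # Drop stop words, then keep the run only if it still looks like
        # a multi-word name (at least two words remain).
        words = [w for w in match.group(0).split() if w not in NON_NAME_WORDS]
        if len(words) >= 2:
            candidates.append(" ".join(words))
    return candidates


if __name__ == "__main__":
    print(extract_candidate_names("Interview with Angela Merkel at Davos"))
```

A heuristic like this over-generates (place names and organizations also match) and misses names with non-ASCII letters or particles, which is presumably why the paper follows the extraction with a Wikidata check and manual validation by AMT Workers.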


IARPA funds Italian researcher https://www.micc.unifi.it/projects/glaivejanus/

{% include 'dashboard.html' %}

{% include 'supplementary_header.html' %}

{% include 'cite_our_work.html' %}

### Footnotes