The Microsoft Celeb dataset is a face recognition training set made entirely of images scraped from the Internet. According to Microsoft Research, which created and published the dataset in 2016, MS Celeb is the largest publicly available face recognition dataset in the world, containing over 10 million images of 100,000 individuals.
But Microsoft's ambition was bigger. They wanted to recognize 1 million individuals. As part of their dataset, they released a list of 1 million target identities for researchers to identify. The identities
In 2019, Microsoft President Brad Smith called for the governmental regulation of face recognition, an admission of his own company's inability to control their surveillance-driven business model. Yet since then, and for the last 4 years, Microsoft has willingly and actively played a significant role in accelerating growth in the very same industry they called for the government to regulate. This investigation looks into the MS Celeb dataset and Microsoft Research's role in creating and distributing the largest publicly available face recognition dataset in the world to both.
To spur growth and incentivize researchers, Microsoft released a dataset called MS Celeb, or Microsoft Celeb, in which they developed and published a list of exactly 1 million targeted people whose biometrics would go on to build
This bar chart presents a ranking of the top countries where dataset citations originated. Mouse over individual columns to see yearly totals. These charts show at most the top 10 countries.
To help understand how Microsoft Celeb has been used around the world by commercial, military, and academic organizations, existing publicly available research citing the Microsoft Celeb dataset was collected, verified, and geocoded to show the biometric trade routes of people appearing in the images. Click on the markers to reveal research projects at that location.
The dataset citations used in the visualizations were collected from Semantic Scholar, a website which aggregates and indexes research papers. Each citation was geocoded using names of institutions found in the PDF front matter, or as listed on other resources. These papers have been manually verified to show that researchers downloaded and used the dataset to train or test machine learning algorithms. If you use our data, please cite our work.
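For readers who want to reproduce the country ranking from the bar chart, the aggregation could look roughly like the sketch below. It is only an illustration, not the code behind the visualizations: the `(country, year)` record format, the `top_countries` function name, and the placeholder rows are all assumptions, and the sample data is invented purely to show the shape of the output.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (country, year) pair per verified, geocoded citation.
# These rows are made-up placeholders, not real citation data.
citations = [
    ("Country A", 2017),
    ("Country A", 2018),
    ("Country A", 2018),
    ("Country B", 2018),
    ("Country B", 2019),
    ("Country C", 2019),
]


def top_countries(records, limit=10):
    """Rank countries by total citations, keeping at most `limit` entries.

    Returns a list of (country, total, yearly_totals) tuples, where
    yearly_totals maps each year to the number of citations in that year.
    """
    totals = Counter(country for country, _ in records)
    yearly = defaultdict(Counter)
    for country, year in records:
        yearly[country][year] += 1
    return [
        (country, total, dict(sorted(yearly[country].items())))
        for country, total in totals.most_common(limit)
    ]


if __name__ == "__main__":
    for country, total, by_year in top_countries(citations):
        print(country, total, by_year)
```

The `limit=10` default mirrors the "at most the top 10 countries" cutoff, and the per-year dictionary corresponds to the yearly totals shown when hovering over a column.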