# megapixels dev

## installation

```
conda create -n megapixels python=3.7
pip install urllib3
pip install requests
pip install simplejson
pip install click
pip install pdfminer.six
pip install csvtool
npm install
```

## workflow

```
Paper in spreadsheet -> paper_name
  -> S2 Search API -> paper_id
  -> S2 Paper API -> citations
  -> S2 Dataset -> full records with PDF URLs, authors, more citations
  -> wget -> .pdf files
  -> pdfminer.six -> pdf text
  -> Stanford NER -> named entities (organizations)
  -> Geocoding service -> lat/lngs
```

To begin, export `datasets/citations.csv` from the Google doc.

## Extracting data from S2 / ORC

The Open Research Corpus (ORC) is produced by the Allen Institute / Semantic Scholar (S2) / arXiv people. It may be downloaded here:

http://labs.semanticscholar.org/corpus/

### s2-search.py

Loads titles from the citations file and queries the S2 search API to get paper IDs.

### s2-papers.py

Uses the paper IDs from the search entries to query the S2 papers API for first-degree citations, authors, etc.

### s2-dump-ids.py

Extracts all the paper IDs and citation IDs from the queried papers.

### s2-extract-papers.py

Extracts the papers that have been queried from the API out of the ORC dataset.

### s2-dump-pdf-urls.py

Dumps PDF URLs (and also IEEE URLs etc.) to pdfs.json, ieee.json, ....

### s2-fetch-pdfs.py

Fetches the files listed in pdfs.json and processes them.

### s2-fetch-ieee.py

Fetches the files listed in ieee.json and processes them.

### s2-extract-first-page.py

Runs pdfminer on the first page of each dumped PDF to extract its text.

### s2-extract-entities.py

Extracts named entities from the mined text.

### s2-geocode.py

Geocodes known entities from the database.
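
Going back to the `s2-dump-ids.py` step, here is a minimal sketch of how paper IDs and citation IDs might be collected from already-queried paper records. It is not the actual script. The function name `collect_ids` is hypothetical, and the record layout (a `paperId` field plus a `citations` list whose entries also carry `paperId`) is an assumption about the S2 paper API response that should be checked against the real data.

```python
import json


def collect_ids(papers):
    """Return (paper_ids, citation_ids) as sorted lists of unique IDs.

    Assumes each record looks like
    {"paperId": "...", "citations": [{"paperId": "..."}, ...]},
    which is an assumed shape, not one confirmed by this repo.
    """
    paper_ids = set()
    citation_ids = set()
    for paper in papers:
        paper_id = paper.get("paperId")
        if paper_id:
            paper_ids.add(paper_id)
        # Tolerate records where "citations" is missing or null.
        for citation in paper.get("citations") or []:
            citation_id = citation.get("paperId")
            if citation_id:
                citation_ids.add(citation_id)
    return sorted(paper_ids), sorted(citation_ids)


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"paperId": "aaa", "citations": [{"paperId": "ccc"}, {"paperId": "bbb"}]},
        {"paperId": "ddd", "citations": []},
    ]
    paper_ids, citation_ids = collect_ids(sample)
    print(json.dumps({"papers": paper_ids, "citations": citation_ids}, indent=2))
```

Using sets keeps each ID once even when the same paper is cited by several queried papers, which would keep the later lookup against the ORC dataset smaller.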