# megapixels dev

## installation

```
conda create -n megapixels python=3.7
pip install urllib3
pip install requests
pip install simplejson
pip install click
pip install pdfminer.six
pip install csvtool
npm install
```

## workflow

```
Paper in spreadsheet -> paper_name
  -> S2 Search API -> paper_id
  -> S2 Paper API -> citations
  -> S2 Dataset -> full records with PDF URLs, authors, more citations
  -> wget -> .pdf files
  -> pdfminer.six -> pdf text
  -> text mining -> named entities (organizations)
  -> Geocoding service -> lat/lngs
```

To begin, export `datasets/citations.csv` from the Google doc.

---

## Extracting data from S2 / ORC

The Open Research Corpus (ORC) is produced by the Allen Institute / Semantic Scholar (S2) / arXiv people. It can be downloaded here: http://labs.semanticscholar.org/corpus/

### s2-search.py

Loads titles from the citations file and queries the S2 search API to get paper IDs.

### s2-papers.py

Uses the paper IDs from the search results to query the S2 papers API for first-degree citations, authors, etc.

### s2-dump-ids.py

Extracts all the paper IDs and citation IDs from the queried papers.

### s2-extract-papers.py

Extracts papers from the ORC dataset that have been queried from the API.

### s2-raw-papers.py

Some papers are not in the ORC dataset and must be scraped from S2 directly.

---

## Extracting data from Google Scholar

The content-script folder contains a Chrome extension that scrapes Google Scholar through the browser: it clicks through the result links, extracts PDF links, citation counts, etc., and saves a JSON file when it is done. The output still requires further processing (cross-referencing with S2 and dumping the PDFs).

---

## Scraping Institutions

Once you have the data from S2, you can scrape all the PDFs (and other URLs) you find, then extract institutions from them and geocode those institutions.

### s2-dump-pdf-urls.py

Dumps PDF URLs (and also IEEE URLs, etc.) to CSV files.

### s2-fetch-pdf.py

Fetches the PDFs.

### s2-fetch-doi.py

Fetches the files listed in ieee.json and processes them.

### pdf_dump_first_page.sh

Uses pdfminer.six to extract the first page from each PDF.

### s2-pdf-report.py report_first_pages

Performs the initial extraction of university-like terms, to be geocoded.

### s2-doi-report.py

Extracts named entities from the scraped DOI links (IEEE, ACM, etc.).

### s2-geocode.py

Geocodes lists of entities using Nominatim.

### s2-citation-report.py

For each paper in the citations CSV, finds the corresponding paper in the database and gets all of its citations. For each citation, it tries to find an address, embeds the matching entries from the institutions list, and then renders them on a Leaflet map.

---

## Cleaning the Data

After scraping these universities, we got up to a 47% match rate on papers from the dataset. However, there is still more to solve:

- Fix the geocoding - this must be done manually - we will dedupe the entries in the entities table, then extract specific entities from the dataset.
- Unknown addresses - we have addresses for some places but we need to either a) geocode them again or b) fall back to geocoding just the city.
- Match across multiple lines.
- Empty addresses - some papers may need to be gone through by hand. Maybe we can do bigram/trigram analysis on the headings; just finding common words would help.
- Make a list of bogus papers - ones where pdfminer returned empty results, did not contain the word ABSTRACT, or were too long (a rough filter is sketched below).
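
A minimal sketch of what such a bogus-paper filter could look like, assuming the dumped first pages live as `.txt` files under a `reports/first_pages` folder; that path, the output path, and the length threshold are illustrative guesses, not part of the pipeline above:

```python
# flag_bogus_papers.py -- rough sketch, not one of the scripts listed above.
# Assumes pdf_dump_first_page.sh wrote one .txt file per paper; the folder
# name, output file, and the "too long" cutoff below are assumptions.
import csv
import glob
import os

FIRST_PAGES_DIR = "reports/first_pages"   # hypothetical location
MAX_CHARS = 20000                          # arbitrary "too long" cutoff

rows = []
for path in glob.glob(os.path.join(FIRST_PAGES_DIR, "*.txt")):
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    reasons = []
    if not text.strip():
        reasons.append("empty")            # pdfminer returned nothing
    if "abstract" not in text.lower():
        reasons.append("no_abstract")      # missing the word ABSTRACT
    if len(text) > MAX_CHARS:
        reasons.append("too_long")         # likely a mangled extraction
    if reasons:
        rows.append([os.path.basename(path), ";".join(reasons)])

with open("reports/bogus_papers.csv", "w", newline="") as f:
    csv.writer(f).writerows([["paper", "reasons"]] + rows)
```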
### expand-uni-lookup.py

At this point in the process, I had divided the task of scraping and geocoding between 4 different machines, so I reduced the output of those scripts into the file `reports/all_institutions.csv`. I got better accuracy from my paper classifier using just university names, so I wrote this script to group the rows by the extracted university names and show which address each one geocodes to. This file must be gone through manually. This technique geocoded around 47% of papers.

### s2-pdf-report.py report_geocoded_papers

Performs the initial extraction of university-like terms, to be geocoded.

---

## Useful scripts for batch processing

### split-csv.py

Shuffles and splits a CSV into multiple files.

### merge-csv.py

Merges a folder of CSVs into a single file, deduping based on the first column.
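
For reference, the merge-and-dedupe step amounts to something like the following sketch, assuming the per-machine CSVs sit in a `parts/` folder and have no header row (both assumptions; the actual merge-csv.py may differ):

```python
# merge_sketch.py -- illustrative version of the merge-csv.py behaviour
# described above; the input folder and output filename are assumptions.
import csv
import glob

seen = set()
merged = []
for path in sorted(glob.glob("parts/*.csv")):
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row or row[0] in seen:
                continue                   # dedupe on the first column
            seen.add(row[0])
            merged.append(row)

with open("merged.csv", "w", newline="") as f:
    csv.writer(f).writerows(merged)
```

Keeping only the first occurrence of each key is what you want here, since the same paper ID can show up in the output of several machines.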