Diffstat (limited to 'scraper/README.md')

 -rw-r--r--  scraper/README.md | 34
 1 file changed, 24 insertions(+), 10 deletions(-)
diff --git a/scraper/README.md b/scraper/README.md
index 993dbfa2..e19a6920 100644
--- a/scraper/README.md
+++ b/scraper/README.md
@@ -74,7 +74,7 @@ Included in the content-script folder is a Chrome extension which scrapes Google

 ---

-## Scraping Institutions
+## Mapping papers to locations

 Once you have the data from S2, you can scrape all the PDFs (and other URLs) you find, and then extract institutions from those and geocode them.

@@ -98,22 +98,34 @@ Use pdfminer.six to extract the first page from the PDFs.

 Perform initial extraction of university-like terms, to be geocoded.

+### s2-pdf-report.py
+
+Generates reports of things from the PDFs that were not found.
+
 ### s2-doi-report.py

-Extract named entities from the scraped DOI links (IEEE, ACM, etc).
+Extract named entities from the scraped DOI links (IEEE, ACM, etc), as well as unknown entities. This is technically the cleanest data, since we know 99% of it is institutions, but it's also quite noisy.

 ### s2-geocode.py

-Geocode lists of entities using Nominativ.
+Geocode lists of unknown entities using Google. By default, it tries to geocode everything that was not recognized by the DOI report.
+
+### s2-geocode-spreadsheet.py
+
+To add new institutions, simply list them in the spreadsheet with the lat/lng fields empty. Then run this script and anything missing a lat/lng will get one.

 ### s2-citation-report.py

 For each paper in the citations CSV, find the corresponding paper in the database, and get all the citations. For each of the citations, try to find an address for each one. Embed the appropriate entries from institutions list and then render them on a leaflet map.

+### s2-final-report.py
+
+Generate the final JSON files containing the final, raw Megapixels dataset. Includes data on the papers, with merged citations, as well as the corresponding data from the spreadsheets. Suitable for making custom builds for other people.
+
 ---

-## Cleaning the Data
+## Notes on the geocoding process

 After scraping these universities, we got up to 47% match rate on papers from the dataset. However there is still more to solve:

@@ -123,23 +135,25 @@ After scraping these universities, we got up to 47% match rate on papers from th
 - Empty addresses - some papers need to be gone through by hand? Maybe we can do digram/trigram analysis on the headings. Just finding common words would help.
 - Make a list of bogus papers - ones where PDFminer returned empty results, or which did not contain the word ABSTRACT, or were too long.

+These scripts are files for getting an initial set of universities to dedupe cleanly.
+
 ### expand-uni-lookup.py

 By now I had a list of institutions in `reports/all_institutions.csv` (done by merging the results of the geocoding, as I had done this on 4 computers and thus had 4 files of institutions). This file must be gone through manually. This technique geocoded around 47% of papers. At this point I moved `reports/all_institutions.csv` into the Google Sheets. All further results use the CSV on Google Sheets.

-### s2-pdf-report.py
+---

-Generates reports of things from the PDFs that were not found.
+## Dumping all the PDF text

-### s2-geocode-spreadsheet.py
+### s2-extract-full-pdf-txt.py

-To add new institutions, simply list them in the spreadsheet with the lat/lng fields empty. Then run this script and anything missing a lat/lng will get one.
+Dumps all the PDF text and images to `datasets/s2/txt/*/*/paper.txt` using PDFMiner.

-### s2-citation-report.py
+### rm-txt-images.sh

-Generate the main report with maps and citation lists.
+The images dumped by PDFMiner include `*.img` files which seem to be some sort of raw image file. ImageMagick doesn't recognize them and they take a lot of space so I just delete them and leave the `jpg`/`bmp` files.

 ---
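
The fill-in-the-blanks behaviour described above for `s2-geocode-spreadsheet.py` amounts to: read the institutions CSV, geocode every row whose lat/lng is empty, and write the results back. Below is a minimal sketch of that loop, assuming a `reports/all_institutions.csv` export with `institution`, `lat`, and `lng` columns (the column names are guesses), and using geopy's Nominatim as a stand-in for whichever geocoder the script actually calls:

```python
import csv

from geopy.geocoders import Nominatim  # stand-in; the script may use Google's geocoder instead

geocoder = Nominatim(user_agent="megapixels-scraper")

# Assumed layout: one row per institution with "institution", "lat", "lng" columns.
with open("reports/all_institutions.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    # Only geocode rows where lat/lng were left empty in the spreadsheet.
    if not row.get("lat") or not row.get("lng"):
        location = geocoder.geocode(row["institution"])
        if location is not None:
            row["lat"] = location.latitude
            row["lng"] = location.longitude

# Write the filled-in rows back out.
if rows:
    with open("reports/all_institutions.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```

Note that Nominatim's usage policy allows roughly one request per second, so a real run over a long institution list needs throttling between lookups.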

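The cleanup that `rm-txt-images.sh` performs can be sketched in a few lines as well (shown in Python for consistency with the other scripts; the `datasets/s2/txt` root is taken from the dump path mentioned in the diff):

```python
from pathlib import Path

# Root of the PDFMiner text/image dump, per the path in the README.
txt_root = Path("datasets/s2/txt")

# Delete the unrecognised raw "*.img" dumps; jpg/bmp files are left in place.
for img in txt_root.rglob("*.img"):
    img.unlink()
```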