# megapixels dev
## installation
```
conda create -n megapixels python=3.6
conda activate megapixels
pip install urllib3
pip install requests
pip install simplejson
pip install click
pip install pdfminer.six
pip install csvtool
npm install
```
## simplified workflow
If you are just updating the scrape, run `s2-scrape.sh`, which runs only the scripts you need for that.
## workflow
```
Paper in spreadsheet -> paper_name
-> S2 Search API -> paper_id
-> S2 Paper API -> citations
-> S2 Dataset -> full records with PDF URLs, authors, more citations
-> wget -> .pdf files
-> pdfminer.six -> pdf text
-> text mining -> named entities (organizations)
-> Geocoding service -> lat/lngs
```
To begin, export `datasets/citations.csv` from the Google doc.
---
## Extracting data from S2 / ORC
The Open Research Corpus (ORC) is produced by the Allen Institute / Semantic Scholar (S2) / arXiv people. It may be downloaded here:
http://labs.semanticscholar.org/corpus/
We use a two-stage fetch process, as only about 66% of the papers we need are in this dataset.
### s2-search.py
Loads titles from the citations file and queries the S2 search API to get paper IDs, then uses those IDs to query the S2 papers API for first-degree citations, authors, and other metadata. Note that this overwrites `citations_lookup.csv`, so maybe don't run it again unless you want to regenerate that file.
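As a rough sketch, the two lookups against the current public Semantic Scholar Graph API look something like this (the endpoints and field names the actual script uses may differ):
```
import requests

S2_API = "https://api.semanticscholar.org/graph/v1"

def search_paper_id(title):
    # Look up a paper ID by title via the S2 search endpoint.
    r = requests.get(f"{S2_API}/paper/search", params={"query": title, "limit": 1})
    r.raise_for_status()
    hits = r.json().get("data", [])
    return hits[0]["paperId"] if hits else None

def fetch_paper(paper_id):
    # Fetch authors and first-degree citations for a paper ID.
    fields = "title,authors,citations.paperId,citations.title"
    r = requests.get(f"{S2_API}/paper/{paper_id}", params={"fields": fields})
    r.raise_for_status()
    return r.json()

paper_id = search_paper_id("Labeled Faces in the Wild")
if paper_id:
    paper = fetch_paper(paper_id)
    print(paper["title"], len(paper.get("citations", [])))
```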
### s2-papers.py
Title search is not completely accurate, so run the s2-papers.py script to build a report of all the papers and correct any that did not resolve. It also reports papers without a location.
### s2-dump-ids.py
Dump all the paper IDs and citation IDs from the queried papers.
### s2-extract-papers.py
Extracts, from the ORC dataset, the records for the papers that were queried from the API.
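Assuming the corpus ships as gzipped JSON-lines files with an `id` field per record (as the 2018-era ORC releases did), the extraction step is roughly a scan like this (paths are illustrative):
```
import gzip
import json
from pathlib import Path

def extract_papers(corpus_dir, wanted_ids, out_path):
    # Scan gzipped JSON-lines shards and keep the records whose id we queried.
    wanted = set(wanted_ids)
    with open(out_path, "w") as out:
        for shard in sorted(Path(corpus_dir).glob("*.gz")):
            with gzip.open(shard, "rt", encoding="utf-8") as f:
                for line in f:
                    record = json.loads(line)
                    if record.get("id") in wanted:
                        out.write(line)

# Example (paths are illustrative):
# extract_papers("datasets/orc", open("reports/paper_ids.txt").read().split(), "datasets/s2/orc_papers.jsonl")
```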
### s2-dump-missing-paper-ids.py
Dump the citation IDs that were not found in the ORC dataset.
### s2-raw-papers.py
Some papers are not in the ORC dataset and must be scraped from S2 directly.
---
## Extracting data from Google Scholar
Included in the content-script folder is a Chrome extension that scrapes Google Scholar through the browser, clicking links and extracting PDF links, citation counts, etc., then saving a JSON file when it's done. The output still requires work to process (cross-referencing with S2 and dumping the PDFs).
---
## Mapping papers to locations
Once you have the data from S2, you can scrape all the PDFs (and other URLs) you find, and then extract institutions from those and geocode them.
### s2-dump-db-pdf-urls.py
Dump PDF URLs (and also DOI URLs, etc.) to CSV files.
### s2-fetch-pdf.py
Fetch the PDFs.
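The workflow above uses wget, but a minimal sketch of the same step in Python might look like this (the `paper_id`/`pdf_url` column names are assumptions, not the script's actual schema):
```
import csv
import time
from pathlib import Path

import requests

def fetch_pdfs(csv_path, out_dir, delay=1.0):
    # Download each PDF URL from the CSV, skipping files we already have.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            dest = out / f"{row['paper_id']}.pdf"
            if dest.exists():
                continue
            try:
                r = requests.get(row["pdf_url"], timeout=30)
                r.raise_for_status()
                dest.write_bytes(r.content)
            except requests.RequestException as e:
                print(f"failed {row['pdf_url']}: {e}")
            time.sleep(delay)  # be polite to the hosting servers
```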
### s2-fetch-doi.py
Fetch the files listed in ieee.json and process them.
### s2-extract-pdf-txt.py
Use pdfminer.six to extract the first page from the PDFs.
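A minimal sketch of the first-page extraction with pdfminer.six's high-level API (paths are illustrative):
```
from pathlib import Path

from pdfminer.high_level import extract_text

def dump_first_pages(pdf_dir, txt_dir):
    # Extract the text of page 1 of each PDF into a matching .txt file.
    out = Path(txt_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in Path(pdf_dir).glob("*.pdf"):
        try:
            text = extract_text(str(pdf), page_numbers=[0])  # first page only
        except Exception as e:
            print(f"could not parse {pdf.name}: {e}")
            continue
        (out / f"{pdf.stem}.txt").write_text(text, encoding="utf-8")
```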
### s2-pdf-first-pages.py
Perform initial extraction of university-like terms, to be geocoded.
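The script's actual matching rules aren't documented here; as an illustration, a simple regex pass over the first-page text could look like this:
```
import re

# Illustrative patterns only; the real script's term list is more extensive.
INSTITUTION_RE = re.compile(
    r"\b(?:University of [A-Z][\w.-]+(?: [A-Z][\w.-]+)*"
    r"|[A-Z][\w.-]+(?: [A-Z][\w.-]+)* (?:University|Institute of Technology|Research Institute))\b"
)

def find_institutions(first_page_text):
    # Return the distinct university-like phrases found on a first page.
    return sorted(set(INSTITUTION_RE.findall(first_page_text)))

print(find_institutions("We thank Stanford University and the University of Oxford."))
# -> ['Stanford University', 'University of Oxford']
```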
### s2-pdf-report.py
Generates reports of items that could not be found in the PDFs.
### s2-doi-report.py
Extract named entities from the scraped DOI links (IEEE, ACM, etc.), as well as unknown entities. This is technically the cleanest source, since we know roughly 99% of it refers to institutions, but the extracted text is still quite noisy.
### s2-geocode.py
Geocode lists of unknown entities using Google. By default, it tries to geocode everything that was not recognized by the DOI report.
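A minimal sketch of a single lookup against the Google Geocoding API (you need to supply your own API key):
```
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode(name, api_key):
    # Return (lat, lng) for an institution name, or None if Google finds nothing.
    r = requests.get(GEOCODE_URL, params={"address": name, "key": api_key}, timeout=30)
    r.raise_for_status()
    results = r.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

# Example:
# print(geocode("Carnegie Mellon University", "YOUR_API_KEY"))
```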
### s2-geocode-spreadsheet.py
To add new institutions, simply list them in the spreadsheet with the lat/lng fields empty. Then run this script and anything missing a lat/lng will get one.
### s2-citation-report.py
For each paper in the citations CSV, find the corresponding paper in the database and collect all of its citations.
For each citation, try to find an address, embed the matching entries from the institutions list, and then render them on a Leaflet map.
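The report's exact output format isn't documented here; one way to hand geocoded institutions to a Leaflet map is to emit a GeoJSON FeatureCollection, roughly:
```
import json

def to_geojson(institutions):
    # institutions is assumed to be a list of {"name": ..., "lat": ..., "lng": ...} dicts.
    features = [
        {
            "type": "Feature",
            "properties": {"name": inst["name"]},
            "geometry": {"type": "Point", "coordinates": [inst["lng"], inst["lat"]]},  # GeoJSON order is [lng, lat]
        }
        for inst in institutions
        if inst.get("lat") is not None and inst.get("lng") is not None
    ]
    return {"type": "FeatureCollection", "features": features}

# with open("reports/citation_map.geojson", "w") as f:
#     json.dump(to_geojson(rows), f)
```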
### s2-final-report.py
Generate the JSON files containing the final, raw Megapixels dataset. Includes data on the papers, with merged citations, as well as the corresponding data from the spreadsheets. Suitable for making custom builds for other people.
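A rough sketch of the merge, assuming a hypothetical `paper_id` key in both the papers JSON and the spreadsheet CSV:
```
import csv
import json

def build_final_report(papers_json, spreadsheet_csv, out_path):
    # Join paper records with curated spreadsheet rows, keyed on a paper ID column.
    with open(papers_json) as f:
        papers = {p["paper_id"]: p for p in json.load(f)}
    with open(spreadsheet_csv, newline="") as f:
        for row in csv.DictReader(f):
            paper = papers.get(row.get("paper_id"))
            if paper:
                paper["spreadsheet"] = row  # attach curated metadata to the raw record
    with open(out_path, "w") as f:
        json.dump(list(papers.values()), f, indent=2)
```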
---
## Notes on the geocoding process
After scraping these universities, we got the match rate up to 47% of papers in the dataset. However, there is still more to solve:
- Fix the geocoding: this must be done manually; we will dedupe the entries in the entities table, then extract specific entities from the dataset.
- Unknown addresses: we have addresses for some places, but we need to either a) geocode them again, or b) fall back to geocoding just the city.
- Match across multiple lines.
- Empty addresses: some papers may need to be gone through by hand. We could also try bigram/trigram analysis on the headings; just finding common words would help (see the sketch after this list).
- Make a list of bogus papers: ones where PDFMiner returned empty results, did not contain the word ABSTRACT, or were too long.
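A minimal sketch of that n-gram idea, assuming the heading is simply everything before the word ABSTRACT on each first page:
```
from collections import Counter
from pathlib import Path

def heading_ngrams(txt_dir, n=2):
    # Count word n-grams in the text above the ABSTRACT marker on each first page.
    counts = Counter()
    for txt in Path(txt_dir).glob("*.txt"):
        heading = txt.read_text(encoding="utf-8", errors="ignore").split("ABSTRACT")[0]
        words = heading.split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

# Print the 20 most common bigrams across all first pages (path is illustrative):
# print(heading_ngrams("datasets/s2/firstpages", n=2).most_common(20))
```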
These scripts produce an initial set of universities that can be deduped cleanly.
### expand-uni-lookup.py
By this point I had a list of institutions in `reports/all_institutions.csv` (made by merging the geocoding results; I had run the geocoding on 4 computers and thus had 4 files of institutions). This file must be gone through manually. This technique geocoded around 47% of papers.
At this point I moved `reports/all_institutions.csv` into the Google Sheets. All further results use the CSV on Google Sheets.
---
## Dumping all the PDF text
### s2-extract-full-pdf-txt.py
Dumps all the PDF text and images to `datasets/s2/txt/*/*/paper.txt` using PDFMiner.
### rm-txt-images.sh
The images dumped by PDFMiner include `*.img` files, which seem to be some sort of raw image format. ImageMagick doesn't recognize them and they take up a lot of space, so I just delete them and keep the `jpg`/`bmp` files.
---
## Useful scripts for batch processing
### split-csv.py
Shuffle and split a CSV into multiple files.
### merge-csv.py
Merge a folder of CSVs into a single file, deduping based on the first column.
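A minimal sketch of the merge/dedupe step, assuming each file's first row is a header:
```
import csv
from pathlib import Path

def merge_csvs(in_dir, out_path):
    # Merge every CSV in a folder, keeping the first row seen for each value of column 0.
    seen, header, rows = set(), None, []
    for path in sorted(Path(in_dir).glob("*.csv")):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader, None)
            header = header or file_header
            for row in reader:
                if row and row[0] not in seen:
                    seen.add(row[0])
                    rows.append(row)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        if header:
            writer.writerow(header)
        writer.writerows(rows)

# Example (paths are illustrative):
# merge_csvs("reports/geocoded/", "reports/all_institutions.csv")
```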