author     Jules Laplace <julescarbon@gmail.com>  2020-04-08 13:28:52 +0200
committer  Jules Laplace <julescarbon@gmail.com>  2020-04-08 13:28:52 +0200
commit     577cb8d33e8c190f5f23abf560fdab385a6b745e
tree       d9d9004344530b38a6c0bc492d95962280a5e559
parent     aebb91b47cad8aa70403eb6dec9dbe49ef6267fb

readme

-rw-r--r--  README.md                      72
-rw-r--r--  cli/commands/bridge/naive.py    2
-rw-r--r--  requirements.txt                9
3 files changed, 82 insertions, 1 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9674bf9
--- /dev/null
+++ b/README.md
@@ -0,0 +1,72 @@
+# Thesaurus
+
+This library spiders the University of Glasgow's [Historical Thesaurus of English](https://ht.ac.uk/) to create linkages between words. Give it two words, and it will find a connection of a specified depth using synonyms.
+
+The Historical Thesaurus collects words under categories, which we can use to get from word to word. Many words have multiple meanings, which allows us to traverse vast stretches of linguistic terrain efficiently.
+
+## Installation
+
+First, install [Miniconda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html). Next, clone the repo and set up the Python environment.
+
+```
+git clone https://github.com/julescarbon/thesaurus.git
+cd thesaurus
+conda create -n thesaurus python=3.7
+conda activate thesaurus
+pip install -r requirements.txt
+```
+
+You should then have everything you need to run the scripts. The scripts are all accessible through `cli.py` and are self-documenting:
+
+```
+cd cli
+python cli.py --help
+python cli.py bridge --help
+python cli.py bridge words --help
+```
+
+## Bridging two words
+
+The algorithm starts from each of the two words and builds a tree from each. When the trees overlap, it checks whether the linkage is deep enough (see `--min_depth` below). If it is, it prints the chain and gives you the option to keep searching or to eliminate problematic words.
+
+For example, bridge two words:
+
+```
+python cli.py bridge words --a light --b dark
+```
+
+### Command-line flags
+
+
+| Flag                          | Description                            | Default      |
+| ----------------------------- | -------------------------------------- | ------------ |
+| --a TEXT                      | Starting word                          | (required)   |
+| --b TEXT                      | Ending word                            | (required)   |
+| --include_oe                  | Include OE/archaic words               | (off)        |
+| --include_slang               | Include slang/colloquial words         | (off)        |
+| --words_per_step INTEGER      | Number of words to check per step      | 20           |
+| --categories_per_word INTEGER | Number of categories to check per word | 3            |
+| --min_depth INTEGER           | Minimum depth of matches               | 10           |
+| --use_shortest_path           | Use shortest path between words        | (off)        |
+| --help                        | Show this message and exit.            |              |
+
+
+### Resolution and depth
+
+Using `--min_depth` you can specify how many hops you want. This counts words plus categories, so for 10 hops we will get at least 4 words and 6 categories in between.
+
+You can vary the search resolution by specifying how many words/categories are searched at a time. With each pass, we shuffle the list of unseen words and categories, so you can force a broader or deeper search by tweaking these numbers.
+
+When computing the chain of synonyms, we descend the category tree from word to word, based on when the words were seen. This process can be short-circuited by finding the deepest word in the category, rather than the word adjacent in the traversal. Doing so ignores the `--min_depth` parameter and can make the paths very short, so we prefer finding longer paths instead.
+
+### Breaking the chain
+
+Finally, after printing the chain, you have the option to cut the chain or to keep searching. Each word in the chain will have a number assigned to it, and you just type in this number, or hit enter to find new chains.
+
+Cutting the chain will clear the trees and restart the process. If the combination is found again, it will be ignored. For example, the word "crash" has many categories involving dogs, which you might ignore. Or you might want to eliminate 16th-century words that come up (for example, "swarf" for unconsciousness), but many of these are still in use, so we don't provide an option for skipping them.
+
+All the words and categories are cached in the `data_store` directory, which speeds up the process of finding new synonyms. The data is stored as JSON. This folder can grow to hundreds of megabytes if you've done a lot of searches, but it can be safely removed when you're finished.
+
+### Finally
+
+Please do not use this script to spider the entire Historical Thesaurus, as this violates their terms of service! Take what you need, then delete the data when you're done. This script just does the searches that you could have done on your own, but more efficiently.
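
Note on the "Bridging two words" section of the README added above: the two-tree search it describes can be illustrated with a small, self-contained sketch. This is not the code in `cli/commands/bridge/naive.py`; the `CATEGORIES` data and the `build_graph`, `path_to_root`, and `bridge` helpers are hypothetical names invented for illustration, and the toy categories do not come from the Historical Thesaurus. The sketch assumes words and categories form a graph in which a word is linked to each category that contains it, and it omits the shuffling and the `--words_per_step` / `--categories_per_word` limits.

```python
from collections import deque

# Hypothetical toy data (category -> words); not taken from the real thesaurus.
CATEGORIES = {
    "brightness": ["light", "glow", "shine"],
    "low weight": ["light", "airy"],
    "atmosphere": ["airy", "gloomy"],
    "absence of light": ["gloomy", "dark"],
}


def build_graph(categories):
    """Link every word to each category that contains it, in both directions."""
    graph = {}
    for category, words in categories.items():
        cat_node = ("category", category)
        for word in words:
            word_node = ("word", word)
            graph.setdefault(cat_node, set()).add(word_node)
            graph.setdefault(word_node, set()).add(cat_node)
    return graph


def path_to_root(parents, node):
    """Follow parent pointers from node back to the root of its tree."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path


def bridge(graph, a, b, min_depth=0):
    """Grow one tree from each word; return the first chain of at least
    min_depth hops found where the trees touch, or None."""
    start, goal = ("word", a), ("word", b)
    if start not in graph or goal not in graph:
        return None
    parents_a, parents_b = {start: None}, {goal: None}
    frontier_a, frontier_b = deque([start]), deque([goal])
    while frontier_a and frontier_b:
        # Expand one full level of each tree in turn.
        for frontier, parents, other in (
            (frontier_a, parents_a, parents_b),
            (frontier_b, parents_b, parents_a),
        ):
            for _ in range(len(frontier)):
                node = frontier.popleft()
                for neighbour in sorted(graph[node]):
                    if neighbour in parents:
                        continue
                    parents[neighbour] = node
                    if neighbour in other:
                        # The trees overlap: stitch the two half-paths together.
                        chain = (path_to_root(parents_a, neighbour)[::-1]
                                 + path_to_root(parents_b, neighbour)[1:])
                        hops = len(chain) - 1
                        no_repeats = len(set(chain)) == len(chain)
                        if hops >= min_depth and no_repeats:
                            return chain
                    frontier.append(neighbour)
    return None


graph = build_graph(CATEGORIES)
chain = bridge(graph, "light", "dark")
words = [name for kind, name in chain if kind == "word"]
print(" -> ".join(words))
```

In this sketch a hop is one edge between a word and a category, which is one possible reading of "words plus categories"; the README does not say exactly how the real tool counts depth. Because each tree visits a node only once, the sketch gives up (returns `None`) when the first overlaps are too shallow, whereas the README describes clearing the trees and searching again.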
diff --git a/cli/commands/bridge/naive.py b/cli/commands/bridge/naive.py
index 4014279..96122c2 100644
--- a/cli/commands/bridge/naive.py
+++ b/cli/commands/bridge/naive.py
@@ -1,5 +1,5 @@
"""
-Find connections between two words
+Find connections between two words (naive implementation)
"""
import click
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..0305dab
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,9 @@
+certifi==2019.11.28
+chardet==3.0.4
+click==7.1.1
+colorlog==4.1.0
+idna==2.9
+requests==2.23.0
+simplejson==3.17.0
+tqdm==4.44.1
+urllib3==1.25.8