I made a GloVe embedding model from my game book collection: 7000-odd books, of which 6000 or so made it through a first-pass PDF extraction pipeline.
This framework is quite good and parallelises, which is important for big books: https://github.com/NRCan/geoscience_language_models/tree/main/project_tools
The C version of GloVe is superior:
https://github.com/stanfordnlp/GloVe
With some work you can get a Python version going, but I wouldn't recommend it for large collections.
e.g. https://pypi.org/project/glove-py
and associated hacks.
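Once the model is trained, the vectors are easy to poke at from Python. This is only a rough sketch, assuming the C tool was set to write its text output (one word per line, followed by its vector components separated by spaces). The function names and the toy words below are made up for illustration and are not from the notebook.

```python
import math


def load_glove_text(lines):
    """Parse GloVe text-format lines ("word v1 v2 ...") into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(vectors, word, k=5):
    """Return the k words most similar to `word`, best first."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data standing in for a real vectors file (hypothetical values).
sample = [
    "dragon 0.9 0.1 0.0",
    "wyvern 0.8 0.2 0.1",
    "tavern 0.0 0.1 0.9",
]
vectors = load_glove_text(sample)
print(nearest(vectors, "dragon", k=2))
```

For a real model you would pass an open file instead of the toy list, e.g. `load_glove_text(open("vectors.txt", encoding="utf-8"))`, where the file name is whatever you told the trainer to save to. Pure Python is fine for a quick look; for a large vocabulary a NumPy matrix would be much faster.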
The notebook that goes with this is here: https://github.com/bluetyson/RPG-gloVe-Model
These days Microsoft probably won't let you view something that big online, so I will make a series of post excerpts.