I built a GloVe embedding model from my game book collection: about 7,000 books, of which roughly 6,000 made it through a first-pass PDF extraction pipeline.
This framework is quite good: https://github.com/NRCan/geoscience_language_models/tree/main/project_tools
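For anyone wondering what GloVe actually trains on once the text is out of the PDFs: it fits word vectors to distance-weighted co-occurrence counts taken over a sliding window. Below is a rough sketch of that counting step only. The function name `cooccurrence_counts`, the `window` parameter, and the sample sentence are made up for illustration; this is not my pipeline and not the API of the linked repo.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=5):
    """Symmetric, distance-weighted co-occurrence counts for one token list.

    Hypothetical helper for illustration: each pair of words within `window`
    positions of each other contributes 1/distance to the count, in both
    directions.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # Look back at most `window` tokens from position i.
        start = max(0, i - window)
        for j in range(start, i):
            weight = 1.0 / (i - j)  # nearer neighbours count more
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return dict(counts)


# Toy input standing in for one line of extracted book text.
tokens = "roll a d20 and add your skill modifier".split()
counts = cooccurrence_counts(tokens, window=2)
```

On a real corpus you would run this over every extracted document, sum the counts, and hand the result to a GloVe trainer; the sketch stops before the training step.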