A project I have been meaning to do for ages is to extract the text from all the games that are digital, so they can be searched. That would also allow lots of fun NLP things, of course.
This is a terrible job because of the planet’s love for that presentation format, the PDF. Some are scans, some are a hack combination, and some are a fourth-generation format transfer. Lots of those will not extract very well, so some sort of classification will be needed.
For example, the 1st edition AD&D DMG extracted fine on the first pass but the Player’s Handbook did not. There is that sort of problem, then the OCR problem, and others.
So an interesting place to start, going back the other way, will be games that have actual text versions, whether HTML (e.g. EPUB and websites), Mobipocket, or plain text files because of their age, like FUDGE and others.
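As a rough idea of what that classification could look like, here is a minimal sketch that sorts already-extracted text into three buckets. The function names, the bucket labels and both thresholds are assumptions made up for illustration and would need tuning against real extractions. It also does not do the PDF extraction itself.

```python
def alpha_ratio(text):
    """Fraction of non-whitespace characters that are letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def classify_extraction(text, min_chars=200, min_alpha_ratio=0.8):
    """Label extracted text as 'needs_ocr', 'garbled' or 'ok'.

    Both thresholds are hypothetical starting points, not measured values.
    """
    # Almost no text at all usually means a scanned page with no text layer.
    if len(text.strip()) < min_chars:
        return "needs_ocr"
    # Plenty of characters but few letters suggests a broken text layer.
    if alpha_ratio(text) < min_alpha_ratio:
        return "garbled"
    return "ok"
```

A stat-heavy page full of tables will have a lower letter ratio than running prose, so a single threshold per book is probably too crude. Scoring page by page would be a sensible next step.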
Some that spring to mind – Sine Nomine – Stars Without Number et al., Eclipse Phase, Dungeon World.
There are also SRDs of various games on the web, so those would be interesting too.
On the NLP front, you could end up with a multi-game capability to answer ‘what is the general advice for a GM doing X’.
Thanks to the AD&D Random Dungeon Generator I wrote (https://github.com/bluetyson/ADnD1e-Random-Dungeon-Generator), I can parallelise, so doing this takes less than a minute.
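For anyone wanting to do the same, this is a minimal sketch of running a batch in parallel with the Python standard library. It is not the code from the repository: `generate_dungeon` is a hypothetical stand-in that only rolls dice, and `run_batch`, `checks` and `workers` are names chosen for the example.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def generate_dungeon(seed, checks=10):
    """Hypothetical stand-in for a real generator: one d20 roll per check."""
    rng = random.Random(seed)  # one seeded generator per dungeon, so runs repeat
    rolls = [rng.randint(1, 20) for _ in range(checks)]
    return {"seed": seed, "rolls": rolls}


def run_batch(seeds, workers=None):
    """Generate one dungeon per seed, spread over worker processes.

    workers=1 runs serially, which is handy for debugging.
    workers=None lets ProcessPoolExecutor pick the number of processes.
    """
    seeds = list(seeds)
    if workers == 1:
        return [generate_dungeon(seed) for seed in seeds]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches the seeds so each process gets fewer, larger jobs
        return list(pool.map(generate_dungeon, seeds, chunksize=50))
```

A batch of 1000 would then be `run_batch(range(1000))`, called from under an `if __name__ == "__main__":` guard so the worker processes can import the script safely. Because each dungeon only depends on its seed, the results come out the same whether the batch runs serially or in parallel.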
The basic random walk theory these take is that ‘Ahead’ is the positive y direction, and they always follow exits, e.g. stairs down (and they can go up sometimes too).
The first 1000 dungeons I have done with 10 Periodic Checks, as the DMG table calls them, i.e. rolls on the main table.
Here you can have the situation where an empty room has secret doors, beyond which are more rooms, which can have secret doors of their own if empty. I have it so it follows that stack down, then goes back, because rooms are the interesting thing.
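To make the ‘Ahead’ convention concrete, here is a small sketch of a walk on a grid where the heading starts as positive y and turns are made relative to it. The roll ranges used for turning are placeholders invented for the example, not the DMG table, and stairs are left out entirely.

```python
import random


def turn_left(heading):
    """Rotate a (dx, dy) heading 90 degrees anticlockwise."""
    dx, dy = heading
    return (-dy, dx)


def turn_right(heading):
    """Rotate a (dx, dy) heading 90 degrees clockwise."""
    dx, dy = heading
    return (dy, -dx)


def random_walk(seed, checks=10):
    """Return the list of grid positions visited over `checks` steps."""
    rng = random.Random(seed)
    x, y = 0, 0
    heading = (0, 1)  # 'Ahead' starts as the positive y direction
    path = [(x, y)]
    for _ in range(checks):
        roll = rng.randint(1, 20)
        # Placeholder ranges, NOT the DMG table: mostly carry on ahead.
        if roll >= 17:
            heading = turn_left(heading)
        elif roll >= 13:
            heading = turn_right(heading)
        x, y = x + heading[0], y + heading[1]
        path.append((x, y))
    return path
```

Keeping the heading as a vector means ‘left’ and ‘right’ always stay relative to the way the walk is currently facing, however many turns have come before.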
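The secret door chasing can be sketched with an explicit stack, which gives the ‘follow it down, then go back’ behaviour. This is only an illustration of the idea: the probabilities, the limit of two secret doors per room and the depth cap are all assumptions for the example rather than values from the DMG or from my generator.

```python
import random


def rooms_behind_secret_doors(rng, p_empty=0.6, p_secret=0.25,
                              doors_per_room=2, max_depth=20):
    """Count the rooms reached from one room by following secret doors.

    All parameters are hypothetical. An empty room gets `doors_per_room`
    chances at a secret door, and each secret door leads to one more room.
    """
    rooms_found = 0
    stack = [0]  # depth of each room still waiting to be explored
    while stack:
        depth = stack.pop()  # newest room first, so the search goes deep
        rooms_found += 1
        is_empty = rng.random() < p_empty
        if is_empty and depth < max_depth:
            for _ in range(doors_per_room):
                if rng.random() < p_secret:
                    stack.append(depth + 1)
    return rooms_found


rng = random.Random(42)
print(rooms_behind_secret_doors(rng))
```

The depth cap is there as a safety net, so that an unlucky run of rolls cannot keep the chain of rooms going for ever.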
So with 10 checks, a couple of rooms is likely.
In fact, here are the medians for this batch of 1000: