This is a cool project – attempting to get models to caption and analyse comics – https://github.com/emanuelevivoli/CoMix
https://github.com/emanuelevivoli/CoMix
Visual – Language models should be continually improving.
Try Gemma and Qwen for example and see how they go in a simple chatbot interface.