The context window of the model may be a factor -- every LLM has a limit on how many tokens it can take in at once. A 700 to 1000 word page is not a challenge for that limit, but each model tokenizes and weighs text differently enough that some details can be overlooked or hallucinated. I don't have a white paper to back this up, but I'm sure one on token evaluation would help explain the differences in prompt accuracy. Another factor may be whether RAG (retrieval-augmented generation) is associated with the model, injecting relevant data into the prompt to maintain accuracy. This is a terrific highlight you shared.
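
To put rough numbers on the token-count point: a 700 to 1000 word page usually lands around 900 to 1,400 tokens (the common rule of thumb is about 1.3 tokens per English word), far below current context windows. Here's a minimal sketch, assuming the tiktoken tokenizer and a hypothetical 8,192-token window:

```python
# Rough sketch, not from the original post: count the tokens in a page of text
# with the tiktoken library and compare against an assumed context window size.
import tiktoken

def fits_in_context(text: str, context_window: int = 8192) -> bool:
    """Return True if `text` tokenizes to no more than `context_window` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models
    n_tokens = len(enc.encode(text))
    print(f"~{len(text.split())} words -> {n_tokens} tokens")
    return n_tokens <= context_window

page = "example " * 1000  # placeholder standing in for a 700-1000 word page
print(fits_in_context(page))  # a page of this size fits comfortably
```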