The evaluation illusion of large language models in medicine