espeak's default voices are synthesized without using any real voice samples. That means you don't have to download a huge package for each language, which is convenient in some cases, but the result sounds extremely robotic.
You can use MBROLA as a backend for espeak so that it uses recorded voice samples, and the result should be less jarring (it's still easy to tell it's not a natural voice, but at least it's easier to understand). There's a tutorial on this here: https://github.com/espeak-ng/espeak-ng/blob/master/docs/mbrola.md
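To compare the two by ear, something like the rough sketch below might help. It assumes the espeak-ng binary is on your PATH and that an MBROLA voice is already installed per the tutorial. The voice name mb-en1 is only an example and may differ on your system, and espeak_command and speak are made-up helper names.

```python
import shutil
import subprocess


def espeak_command(text, voice="en", wav_path=None):
    """Build an espeak-ng command line for the given text and voice."""
    cmd = ["espeak-ng", "-v", voice]
    if wav_path is not None:
        cmd += ["-w", wav_path]  # write a WAV file instead of playing the audio
    cmd.append(text)
    return cmd


def speak(text, voice="en", wav_path=None):
    """Run espeak-ng if it is installed; return True if it exited cleanly."""
    if shutil.which("espeak-ng") is None:
        return False  # espeak-ng not found on PATH
    result = subprocess.run(espeak_command(text, voice, wav_path))
    return result.returncode == 0
```

Calling speak("Hello world") should use the default synthesized voice, while speak("Hello world", voice="mb-en1") should go through MBROLA, assuming that voice is installed under that name.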
Or you can try piper (https://github.com/rhasspy/piper), which is one of the most natural-sounding TTS engines (here are some samples).
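If you want to drive piper from a script, a minimal sketch could look like the following. It assumes a piper executable on your PATH that reads text from stdin and accepts --model and --output_file, as in the usage example in the project's README. The model filename is a placeholder for whichever voice you downloaded, and piper_command and synthesize are hypothetical helper names.

```python
import shutil
import subprocess


def piper_command(model_path, wav_path):
    """Build a piper command line; the text itself is passed on stdin."""
    return ["piper", "--model", model_path, "--output_file", wav_path]


def synthesize(text, model_path, wav_path):
    """Run piper if it is installed; return True if it exited cleanly."""
    if shutil.which("piper") is None:
        return False  # piper not found on PATH
    result = subprocess.run(
        piper_command(model_path, wav_path),
        input=text.encode("utf-8"),  # piper reads the text to speak from stdin
    )
    return result.returncode == 0
```

For example, synthesize("Hello world", "en_US-lessac-medium.onnx", "hello.wav") should write hello.wav, provided that model file (a placeholder name here) exists in the current directory.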