Thanks for the feedback and the questions.
The interesting thing is that the stash list theoretically wouldn't need to be shared. Because of how partitions and partition keys work, a consumer will never have an Aggregate ID in its stash list that another consumer might read. However, stash lists must be resilient (they have to survive restarts), so for that reason alone they should be persisted. The overhead would be very light: write to the persisted stash list when we encounter a non-recoverable error, and read from it once at startup. Given that, we might as well use one shared stash list. Consumer A might wind up with Aggregate IDs in its local stash list that "belong" to Consumer B, but there's no harm in that.
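To make the "write on a non-recoverable error, read once at startup" idea concrete, here is a minimal sketch. It assumes a relational store and uses Python's built-in sqlite3 purely as a stand-in for whatever shared database or cache the consumer group would really use; the class name, table name, and column name are all hypothetical.

```python
import sqlite3


class PersistedStashList:
    """In-memory stash list backed by a persisted table (hypothetical schema)."""

    def __init__(self, conn):
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS stash_list (aggregate_id TEXT PRIMARY KEY)"
        )
        self._conn.commit()
        # Read the persisted list once, at startup.
        rows = self._conn.execute("SELECT aggregate_id FROM stash_list")
        self._ids = {row[0] for row in rows}

    def __contains__(self, aggregate_id):
        # Lookups on the hot path only touch the in-memory set.
        return aggregate_id in self._ids

    def add(self, aggregate_id):
        """Call this when a message fails with a non-recoverable error."""
        if aggregate_id in self._ids:
            return  # already stashed, nothing to persist
        self._conn.execute(
            "INSERT OR IGNORE INTO stash_list (aggregate_id) VALUES (?)",
            (aggregate_id,),
        )
        self._conn.commit()
        self._ids.add(aggregate_id)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the shared store
    stash = PersistedStashList(conn)
    stash.add("aggregate-42")

    # Simulate a restart: a new instance reloads the list from the store.
    restarted = PersistedStashList(conn)
    print("aggregate-42" in restarted)
```

The only write happens on the failure path, and the only read happens in the constructor, which is why the overhead should stay small.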
So, long story short: for resilience, we will probably need a shared database or cache for the consumer group.
Now, keep in mind that when we deploy the consumer after fixing it, we want a) the consumer to begin reading directly from the stash topic, and b) its stash list to be empty, so that it will re-consume the messages that had previously failed. That will allow Zoë to be (correctly) consumed from the stash topic before Zoiee is consumed from the main topic.
Regarding ordering in the stash topic, if we partition the stash topic the same way we partition the main topic, then ordering should still be maintained with multiple consumers. That's a miss on my part; I'd made that assumption but did not explicitly state it in the article.
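As a rough illustration of why keying the stash topic by Aggregate ID preserves ordering, here is a small simulation. It is only a sketch: the CRC32-based `partition_for` function is a stand-in for the producer's real key partitioner, and the partition count, the `user-1`/`user-2` IDs, and the helper names are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed to be the same for the main and stash topics


def partition_for(aggregate_id, num_partitions=NUM_PARTITIONS):
    # Stand-in for a key-based partitioner: same key, same partition, every time.
    return zlib.crc32(aggregate_id.encode("utf-8")) % num_partitions


def publish(topic, aggregate_id, payload):
    # Each partition is an append-only list, so order within it is preserved.
    topic[partition_for(aggregate_id)].append((aggregate_id, payload))


def events_for(topic, aggregate_id):
    # Everything for one aggregate lives in a single partition, in publish order.
    partition = topic[partition_for(aggregate_id)]
    return [payload for key, payload in partition if key == aggregate_id]


if __name__ == "__main__":
    stash_topic = defaultdict(list)
    publish(stash_topic, "user-1", "name=Zoë")
    publish(stash_topic, "user-2", "name=Bob")
    publish(stash_topic, "user-1", "name=Zoiee")
    print(events_for(stash_topic, "user-1"))
```

Because a partition is read by only one consumer in the group at a time, every event for a given aggregate is seen by a single consumer in the order it was stashed, no matter how many consumers the group has.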
Again, I really appreciate your feedback, and your challenging the design. It's not simple by any stretch; there are certainly areas for improvement.