Hi Eero, thanks for the response; it’s much appreciated. Determining whether an error is recoverable or non-recoverable can indeed be tricky. A problem with one of many shards is an interesting scenario. I’d probably consider that to be recoverable, but as you point out, treating it that way comes at the expense of other messages whose intended shards are just fine.
Alternatively, assuming we use the same Aggregate ID as the shard key that we use for the partition key, we could route the affected messages to the stash topic and still maintain ordering. If the shard issue is resolved quickly, then presumably our team can “drain” the stash topic soon afterward… we won’t have to wait for hours. But if this all happens in the middle of the night? Then maybe not… unless this shard has caused a site incident and everyone has been paged and woken up!
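To make that a bit more concrete, here is a rough Python sketch of the routing I have in mind. It is only an illustration under my own assumptions: plain in-memory lists stand in for the stash topic and the shards, and the names `StashRouter`, `shard_for`, and `NUM_SHARDS`, as well as the choice of `crc32` as the hash, are all hypothetical. I'm also assuming that once an aggregate has a stashed message, its later messages keep going to the stash until it is drained, since otherwise they could overtake the earlier ones.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count, for illustration only


def shard_for(aggregate_id: str) -> int:
    """Map an aggregate ID to a shard using a stable hash (crc32 is an arbitrary choice)."""
    return zlib.crc32(aggregate_id.encode("utf-8")) % NUM_SHARDS


class StashRouter:
    """In-memory stand-in for a consumer that stashes messages bound for a broken shard."""

    def __init__(self):
        self.unhealthy_shards = set()       # shards currently failing
        self.stashed_ids = set()            # aggregates with messages waiting in the stash
        self.stash_topic = []               # ordered (aggregate_id, payload) pairs
        self.processed = defaultdict(list)  # shard -> messages applied, in order

    def handle(self, aggregate_id: str, payload: str) -> str:
        shard = shard_for(aggregate_id)
        # Stash if the shard is down, or if earlier messages for this aggregate
        # are still stashed (assumption: this is what preserves per-aggregate order).
        if shard in self.unhealthy_shards or aggregate_id in self.stashed_ids:
            self.stashed_ids.add(aggregate_id)
            self.stash_topic.append((aggregate_id, payload))
            return "stashed"
        self.processed[shard].append((aggregate_id, payload))
        return "processed"

    def drain(self) -> None:
        """Replay the stash in order, keeping messages whose shard is still down."""
        remaining = []
        for aggregate_id, payload in self.stash_topic:
            shard = shard_for(aggregate_id)
            if shard in self.unhealthy_shards:
                remaining.append((aggregate_id, payload))
            else:
                self.processed[shard].append((aggregate_id, payload))
        self.stash_topic = remaining
        self.stashed_ids = {aggregate_id for aggregate_id, _ in remaining}
```

Messages for healthy shards flow through untouched, while everything for the affected aggregates lands in the stash in arrival order. In a real system the stash would be an actual topic keyed by the Aggregate ID, and the drain would be whatever process the team runs once the shard is back.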
So indeed, it’s a difficult topic, and no silver bullet exists. Each org needs to craft a solution based on its needs.