Resolved -
We have doubled baseline capacity, and will have 5x more aggressive auto-scaling in place next week.
May 9, 11:25 PDT
Monitoring -
Some Ring Trees failed due to capacity. Autoscale was not able to keep up. We manually scaled the ring tree infrastructure, and errors subsided by 14:27 UTC. We have increased baseline capacity and will continue to determine how to adjust our scaling to better handle quick spikes.
Initial errors were seen at ~14:07, peak was ~3.5 minutes from 14:12:30 - 14:15:50, no errors seen after 14:27.
May 9, 08:18 PDT
Identified -
Some Ring Trees are failing with the "Ring Tree process expired" error.
May 9, 07:41 PDT