Post-mortem — BL08 / GAME-NODE-20 (M1K02)
Summary
Node BL08, hosted in chassis M1K02 at DC2SCALE PAR3, experienced a network outage caused by hardware instability following a recent memory upgrade. Service has now been restored.
Issue 1 — Memory instability (DIMM)
Following a recent RAM addition, the server's management controller reported a critical DIMM error, which led to a system crash and an unexpected reboot. The most likely cause is a faulty memory module or a compatibility issue with the existing configuration.
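While the root cause is still unconfirmed, memory error counters can also be checked from the operating system side. The following is a minimal sketch, assuming the node runs Linux with an EDAC driver loaded for its memory controller, which this report does not confirm. It reads the standard EDAC sysfs counters for corrected and uncorrected errors; the function name read_edac_counts is hypothetical, and the management controller's own event log remains the primary source.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location; present only if an EDAC driver is loaded (assumption).
EDAC_MC = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_MC) -> dict:
    """Map each memory controller (mc0, mc1, ...) to (corrected, uncorrected) error counts."""
    counts = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        ce = int((mc / "ce_count").read_text().strip())  # corrected (ECC-recovered) errors
        ue = int((mc / "ue_count").read_text().strip())  # uncorrected errors
        counts[mc.name] = (ce, ue)
    return counts
```

A corrected-error count that keeps rising, or any non-zero uncorrected count, would support the faulty-module hypothesis, but it would not by itself distinguish a bad DIMM from a compatibility problem.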
Issue 2 — Network link instability (10G mezzanine)
After the memory-induced crash, the blade's 10G mezzanine card entered a degraded state. At boot, the ixgbe driver repeatedly logged link_config FAILED -22 errors (-22 is the Linux errno EINVAL, "invalid argument"), and the 10G uplink to the upstream switch flapped between up and down, preventing stable connectivity. This behavior suggests the NIC was left in a bad state by the abrupt crash, possibly aggravated by the same underlying memory instability.
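Link flapping of this kind can be quantified from the host itself. The sketch below is illustrative only and is not the tooling used during this incident. It assumes a Linux host, where the kernel exposes a cumulative count of link up/down transitions in /sys/class/net/<interface>/carrier_changes. The interface name, the threshold, and the helper names read_carrier_changes and is_flapping are hypothetical.

```python
from pathlib import Path
from typing import Sequence

SYSFS_NET = Path("/sys/class/net")  # standard Linux sysfs location


def read_carrier_changes(iface: str, root: Path = SYSFS_NET) -> int:
    """Return the kernel's cumulative count of link up/down transitions."""
    return int((root / iface / "carrier_changes").read_text().strip())


def is_flapping(samples: Sequence[int], max_changes: int = 2) -> bool:
    """True if the counter grew by more than max_changes across the samples.

    max_changes is an illustrative threshold, not a vendor recommendation.
    """
    if len(samples) < 2:
        return False
    return samples[-1] - samples[0] > max_changes
```

In practice, read_carrier_changes would be called at a fixed interval (for example once a minute) and the most recent readings passed to is_flapping. A stable link should show a counter that stops increasing once the interface is up.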
Resolution
A full power cycle of the blade was performed, allowing the mezzanine card to reinitialize properly and the network link to come back up in a stable state. The node is now back online and serving traffic normally.
Next steps
An on-site maintenance window will be scheduled at DC2SCALE PAR3 to run in-depth diagnostics on the memory modules and replace any DIMM confirmed to be faulty. The network interface will be monitored closely in the meantime; if link instability persists after the memory issue is resolved, the mezzanine card will also be replaced.
We apologize for the inconvenience and thank you for your patience.