Cleanup torch.amp usage to avoid CUDA-specific calls, merge support for Ascend (NPU) devices from MengqingCao that should work now in PyTorch 2.5 w/ new device ...
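
A minimal sketch of what device-agnostic AMP looks like with the `torch.amp` namespace (not the actual timm code, just an illustration under the assumption of a PyTorch >= 2.3 runtime): `torch.amp.autocast` and `torch.amp.GradScaler` take a device type string, so the same training step can run on `cuda`, `npu`, `xpu`, etc. instead of going through the older CUDA-only `torch.cuda.amp.*` calls.

```python
import torch
import torch.nn.functional as F

# Pick whatever accelerator is present; "cuda" here is just for the example,
# the point is that only this string changes per backend (e.g. "npu", "xpu").
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(8, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Device-agnostic autocast dtype and grad scaler (scaler only needed for fp16).
amp_dtype = torch.float16 if device.type == "cuda" else torch.bfloat16
scaler = torch.amp.GradScaler(device.type, enabled=(amp_dtype == torch.float16))

x = torch.randn(4, 8, device=device)
target = torch.randn(4, 2, device=device)

# torch.amp.autocast(device_type=...) replaces torch.cuda.amp.autocast(...)
with torch.amp.autocast(device_type=device.type, dtype=amp_dtype):
    loss = F.mse_loss(model(x), target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```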