[XLA] Improve the region-based analysis algorithm in copy insertion by combining the runtime ordering relations of different instructions in the region more precisely. Also refactor and patch up the implementation to correctly handle corner cases involving concurrent instructions (e.g., CollectivePermute) and aliasing of parameter and root instructions of a computation, among others.