
Euclidean square and accumulate as a single CPU/GPU instruction

Not sure if this is the correct place to ask this, but according to this answer, hardware-related questions go here.

I’m trying to count the number of MACs (multiply-accumulate operations) required by a certain algorithm, since MAC counts have replaced FLOP counts as the standard measure in my field of research. The algorithm computes Euclidean distances, and I’d like to argue that computing the Euclidean distance between two vectors of length n requires n MACs. This isn’t strictly true, since the Euclidean distance requires n subtractions, n multiplications (squares), and n additions. Because the Euclidean distance is the computational bottleneck of the algorithm, the constant actually matters.
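To make the count concrete, here is a minimal sketch of how I am tallying operations. It assumes hardware with an ordinary MAC (fused multiply-add) but no fused subtract-square-accumulate, and it leaves out the final square root. The function name and the `counts` dictionary are just my own illustration, not taken from any library.

```python
def squared_euclidean_with_counts(x, y):
    """Return (squared distance, operation counts) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    acc = 0.0
    counts = {"sub": 0, "mac": 0}
    for a, b in zip(x, y):
        d = a - b          # one subtraction per element
        counts["sub"] += 1
        acc = acc + d * d  # one multiply-accumulate per element (assumed to map to a single MAC)
        counts["mac"] += 1
    return acc, counts


dist2, counts = squared_euclidean_with_counts([1.0, 2.0, 3.0], [4.0, 6.0, 3.0])
print(dist2, counts)  # 25.0 {'sub': 3, 'mac': 3}
```

Under these assumptions the cost is n subtractions plus n MACs, which is why the claim of "n MACs" only holds if the subtraction can be fused into the same instruction.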

I found this website, which lists EDAC among several MAC-style instructions and describes it as performing a subtraction, a square, and an addition as a single operation. However, the website doesn’t specify which CPU/GPU/TPU architectures actually include this instruction in their instruction sets. Is this a standard instruction on modern processing units? If so, is there something I can cite to support this claim (such as an IEEE standard)? If not, how many operations would a modern CPU/GPU actually require to compute a Euclidean distance?