Using loopnz on modern x86 is not faster than a well-written loop that uses a separate decrement and jnz. It may not be slower either, but the generic version is still better, because the compiler can build the loop around ANY register rather than being forced to use ECX. Restricting register usage makes register allocation more complicated (and the compiler would have to know at the start of the loop that it is going to use loopnz, so that the right value is in ECX by the time it reaches the end). In this case, there's no benefit. If something like this made loops MUCH more efficient, compilers would be tuned to do it, but modern x86 processors are very efficient at simple instructions, so in most cases the complex alternative instructions are either slower than or equal to the simple variants.
Perhaps the loops were slower due to a misunderstanding of tabstop's algorithm? In particular, although tabstop says there's nothing wrong with nested loops, his algorithm doesn't actually have any - just a single loop that goes through both containers at the same time, taking advantage of their being sorted and unique.
I myself have previously needed to do this, and not long ago I might add. set_symmetric_difference didn't cut it because it puts both differences in the same set. set_difference works but needs two passes.
The approach I suggest is to go into your <algorithm> header, find the code for set_symmetric_difference, copy it, and modify it to put A-B and B-A into separate sets.
Logically it should be about twice as fast as two set_difference calls, since it walks both containers once instead of twice. If it isn't, you've probably made a mistake.
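For what it's worth, here is a minimal sketch of that idea. It is not the code from any particular standard library header; split_difference is a name I made up, and it assumes both input ranges are sorted with operator< and contain no duplicates (as with std::set).

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical helper (not part of the standard library).
// Assumes both ranges are sorted by operator< and hold unique elements.
// Writes elements found only in the first range to only_in_a and
// elements found only in the second range to only_in_b, in one pass.
template <class InIt1, class InIt2, class OutIt1, class OutIt2>
void split_difference(InIt1 first1, InIt1 last1,
                      InIt2 first2, InIt2 last2,
                      OutIt1 only_in_a, OutIt2 only_in_b)
{
    while (first1 != last1 && first2 != last2) {
        if (*first1 < *first2) {
            *only_in_a++ = *first1++;   // in A but not in B
        } else if (*first2 < *first1) {
            *only_in_b++ = *first2++;   // in B but not in A
        } else {
            ++first1;                   // in both: skip it
            ++first2;
        }
    }
    // Whatever is left in either range has no match in the other.
    std::copy(first1, last1, only_in_a);
    std::copy(first2, last2, only_in_b);
}

// Small usage example with made-up data.
inline void example()
{
    std::set<int> a{1, 2, 3, 5};
    std::set<int> b{2, 4, 5, 6};
    std::vector<int> a_minus_b, b_minus_a;

    split_difference(a.begin(), a.end(), b.begin(), b.end(),
                     std::back_inserter(a_minus_b),
                     std::back_inserter(b_minus_a));

    assert((a_minus_b == std::vector<int>{1, 3}));
    assert((b_minus_a == std::vector<int>{4, 6}));
}
```

Note there is no nested loop here: each iteration advances at least one of the two iterators, so the work is proportional to the combined size of the two containers.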