Well, look at it this way: if they don't conform to the theoretical performance requirements, your STL implementation is not standards-compliant. This covers the part that they must have the same theoretical performance requirements. I simply don't trust an implementation to conform to that when it comes to tests.
If you are capable of making such an evaluation, precisely nothing prevents you from retesting to get your own results. To make sure the implementation is good and not some poor rip-off implementation.
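For what it's worth, retesting does not take much. Here is a rough sketch of the kind of check I mean, assuming you only care about whether std::sort scales roughly the way the standard says it should (O(N log N) comparisons since C++11). The helper names time_sort_seconds and sort_growth_ratio are made up for this post, and the fixed seed and input sizes are arbitrary choices, not anything taken from the tests being discussed.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: seconds taken to std::sort n pseudo-random ints.
double time_sort_seconds(std::size_t n) {
    std::mt19937 rng(12345);  // fixed seed so repeated runs sort the same data
    std::vector<int> data(n);
    for (int& x : data) {
        x = static_cast<int>(rng() & 0x7fffffff);
    }

    const auto start = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end());
    const auto stop = std::chrono::steady_clock::now();

    // Read one element afterwards so the sort cannot be optimized away.
    if (!data.empty()) {
        volatile int sink = data.front();
        (void)sink;
    }

    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical helper: how much slower sorting 2n elements is than n.
// For O(n log n) behaviour the ratio should be a little over 2;
// a ratio near 4 would hint at quadratic behaviour.
double sort_growth_ratio(std::size_t n) {
    const double small = time_sort_seconds(n);
    const double large = time_sort_seconds(2 * n);
    return small > 0.0 ? large / small : 0.0;
}
```

Call sort_growth_ratio with something like a million elements, build with optimizations on, and run it a few times, since a single timing is noisy. It says nothing about constant factors, only about whether the growth looks sane on your implementation.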
If someone's good implementation comes up with wildly different results, you have a point, but I just don't see why that would be. Particularly with the items being tested... why would memory management, which is only different in architecture, be a significant factor in results among STL implementations?