Alright, here's my 2 cents, just for the sake of argument.
While floats and doubles may have (essentially) equal performance at the CPU level, doubles still take twice the memory space. So for each memory access, only half as many doubles as floats make it over the bus to cache. In short, to access the same number of doubles as floats you have to go to memory more times (assuming you access enough data to need more than one bus transfer).
Here's a simple example to clarify what I'm saying:
Let's say the bus is 16 bytes wide. One memory access will fetch 2 doubles (8 bytes each) or 4 floats (4 bytes each). If you read 4 doubles from one array and 4 floats from another, the float read takes one bus transaction while the double read takes 2. More trips to memory means slower performance. If there are little blue dots on your bread or little green ones on your cheese, you shouldn't eat it.
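To put numbers on the example, here is a rough sketch of the arithmetic in C. The helper `bus_transactions` is a made-up name for illustration, and it assumes a very simplified model: the array starts on a bus-width boundary, elements are contiguous, and every transfer moves exactly one bus width of data. Real hardware is more complicated than this, so treat it as a back-of-the-envelope count only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from any library): number of bus-width
 * transfers needed to fetch `count` contiguous elements of `elem_size`
 * bytes each. Assumes the data starts on a bus-width boundary. */
size_t bus_transactions(size_t count, size_t elem_size, size_t bus_width)
{
    assert(bus_width > 0);

    size_t total_bytes = count * elem_size;

    /* Round up: a partially used transfer still costs a full one. */
    return (total_bytes + bus_width - 1) / bus_width;
}
```

With a 16-byte bus, `bus_transactions(4, sizeof(float), 16)` comes out to 1 and `bus_transactions(4, sizeof(double), 16)` comes out to 2, assuming the usual 4-byte float and 8-byte double.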