Ok I think I've come up with a solution.
Instead of thinking in terms of old-school tile maps, I need to break the map down into large primitives. Direct3D is very good at drawing large primitives with a single texture.
No tiles at all
Only use tiles in the editor. When I save the tile maps, don't save them as tile data, but save them as 1024x1024 or 512x512 textures. In other words, save each portion of the map as one texture and stick that in the data file. At run-time I simply render a 1024x1024 quad (or whatever resolution we decide on) and use a quad-tree system to see which quads are on-screen. For those that are within screen limits, draw the whole quad. Hardware is very good at clipping large quads.
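Here is a rough sketch of the on-screen test. It assumes the chunks sit in a uniform grid, in which case a plain integer division stands in for the quad-tree lookup (this is a simplification, not a real quad-tree). The names ChunkIndex and VisibleChunks and all of the parameters are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One pre-rendered map chunk, identified by its column/row in the chunk grid.
struct ChunkIndex {
    int col;
    int row;
};

// Returns the chunks that overlap the screen rectangle.
// scrollX/scrollY: world-space pixel position of the screen's top-left corner.
// chunksAcross/chunksDown: size of the chunk grid (hypothetical parameters).
std::vector<ChunkIndex> VisibleChunks(int scrollX, int scrollY,
                                      int screenW, int screenH,
                                      int chunkSize,
                                      int chunksAcross, int chunksDown)
{
    assert(chunkSize > 0 && screenW > 0 && screenH > 0);
    assert(scrollX >= 0 && scrollY >= 0);

    // First and last chunk touched on each axis, clamped to the grid.
    const int firstCol = scrollX / chunkSize;
    const int firstRow = scrollY / chunkSize;
    const int lastCol = std::min((scrollX + screenW - 1) / chunkSize, chunksAcross - 1);
    const int lastRow = std::min((scrollY + screenH - 1) / chunkSize, chunksDown - 1);

    std::vector<ChunkIndex> visible;
    for (int row = firstRow; row <= lastRow; ++row)
        for (int col = firstCol; col <= lastCol; ++col)
            visible.push_back({col, row});
    return visible;
}
```

Each returned chunk would then get one SetTexture and one DrawPrimitive call, with the quad positioned at (col * chunkSize - scrollX, row * chunkSize - scrollY) on screen.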
So this breaks the DrawPrimitive and SetTexture calls down from roughly (ScreenWidth/TileSize)^2 per frame to 4 at most for any one screen, since the screen can sit at the intersection of 4 of these quads (assuming the screen is no larger than one quad in either direction).
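To put numbers on that, here is a small sketch that counts the worst-case draw calls for a given cell size. It assumes one DrawPrimitive/SetTexture pair per visible cell and a screen that can start at any pixel offset; the function names and the 1024x768 screen / 32-pixel tile figures in the usage note are my own examples, not measurements.

```cpp
#include <cassert>

// Worst-case number of cells of width cellSize that a span of `extent`
// pixels can touch. The worst case is a span that starts on the last
// pixel of a cell.
int MaxCellsTouched(int extent, int cellSize)
{
    assert(extent > 0 && cellSize > 0);
    return (extent + cellSize - 2) / cellSize + 1;
}

// Worst-case number of visible cells (and so draw calls, under the
// one-call-per-cell assumption) for a screen of screenW x screenH pixels.
int MaxDrawCalls(int screenW, int screenH, int cellSize)
{
    return MaxCellsTouched(screenW, cellSize) * MaxCellsTouched(screenH, cellSize);
}
```

With a 1024x768 screen, MaxDrawCalls(1024, 768, 32) gives 825 for 32-pixel tiles, while MaxDrawCalls(1024, 768, 1024) gives 4 for 1024-pixel chunks. A screen wider than one chunk, say 1280x1024, can touch 6 chunks instead of 4.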
Tile-specific effects can still be done with decals, render-to-texture, etc. I will still know where in the world the tiles are by doing a little math based on the world scroll values.
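The "little math" could look something like this. It is only a sketch: it assumes the scroll values are the world-space pixel position of the screen's top-left corner and that tiles are square, and TileCoord / TileAtScreenPos are names I made up.

```cpp
#include <cassert>

// Column/row of a tile in the original editor tile map.
struct TileCoord {
    int col;
    int row;
};

// Maps a screen-space pixel to the tile under it.
// scrollX/scrollY: world-space pixel position of the screen's top-left corner.
TileCoord TileAtScreenPos(int screenX, int screenY,
                          int scrollX, int scrollY,
                          int tileSize)
{
    assert(tileSize > 0);

    // Screen space -> world space is just an offset by the scroll values.
    const int worldX = scrollX + screenX;
    const int worldY = scrollY + screenY;
    assert(worldX >= 0 && worldY >= 0);

    // World space -> tile index is an integer division by the tile size.
    return {worldX / tileSize, worldY / tileSize};
}
```

So even though the tiles are baked into big textures, a decal or effect can still be placed on a specific tile by running this in reverse (col * tileSize - scrollX, row * tileSize - scrollY).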
I think this will be a major speedup and will take less memory than storing one instance of each tile. This will free up a lot of resources. It's an odd way to do pixel-perfect scrolling, but other methods just don't work anymore.
Scrolling using texture transform flags
Using one large texture won't work either, since a 4096x4096 texture (64 MB at 32 bits per pixel) generally won't fit in video memory, and yet you could have a world size that large. The alternative would be to cache in portions of the texture at a time, but again, this requires locking surfaces, which is what I want to avoid. I would love to just transform the texture u,v's to scroll the picture because it would be uber fast, but I cannot do this. I would have to cache in a portion of the data, use D3DXCreateTexture to make it a texture, and then render it. This would happen at 1024x1024 boundaries and I believe it would cause slowdowns. I may even try to store a 4096x4096 texture in a file and then create several smaller textures from it using my tile extraction code. They would just be larger tiles. Then perhaps I may try this image cache scheme and just scroll the u,v's.
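As a sketch of the "larger tiles" idea, this works out which rectangles of the big source image each smaller texture would be cut from. It doesn't touch Direct3D at all; SourceRect and SplitIntoChunks are hypothetical names, and the rectangles would be fed to whatever extraction routine actually creates the textures.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A rectangle inside the large source image, in pixels.
struct SourceRect {
    int left;
    int top;
    int width;
    int height;
};

// Splits an imageW x imageH image into chunkSize x chunkSize rectangles,
// row by row. Chunks on the right and bottom edges are cut short when the
// image size is not a multiple of chunkSize.
std::vector<SourceRect> SplitIntoChunks(int imageW, int imageH, int chunkSize)
{
    assert(imageW > 0 && imageH > 0 && chunkSize > 0);

    std::vector<SourceRect> rects;
    for (int top = 0; top < imageH; top += chunkSize) {
        for (int left = 0; left < imageW; left += chunkSize) {
            const int width = std::min(chunkSize, imageW - left);
            const int height = std::min(chunkSize, imageH - top);
            rects.push_back({left, top, width, height});
        }
    }
    return rects;
}
```

For a 4096x4096 source and 1024x1024 chunks this gives 16 rectangles, so 16 textures of 4 MB each at 32 bits per pixel. The total is the same 64 MB, but only the chunks near the screen need to be resident at once.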
For those of you interested in the tile extraction code:
bool CTextureManager::AddFromFileEx(std::string File,DWORD dwWidth,DWORD dwHeight,
DWORD &dwOutWidth,DWORD &dwOutHeight)
//Create base texture
//Now lock the surface
//Get surface desc
//Pointer to surface (buffer)
DWORD *pSourceBits=(DWORD *)sourceRect.pBits;
//X and Y counters
//Create new texture object
CTexture *pTexture=new CTexture;
//Create texture from buffer
//Add texture to vector
//Update loop variables
} while (dwX<(sourceRect.Pitch/4));
//Move down one cell size in source texture
} while (dwY<sourceDesc.Height);
//Grabs a texture of width,height from pSourceBuffer
void CTexture::CreateFromImageEx(IDirect3DDevice9 *pDevice,
//Create a blank texture
//Lock its surface and get a pointer to it
DWORD *ptrBits=(DWORD *)rect.pBits;
//Start the copying
for (DWORD i=0;i<desc.Height;i++)
for (DWORD j=0;j<desc.Width;j++)
//Copy from source to this texture's surface
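For anyone who just wants the core of the copy step, here is a standalone sketch that does not need Direct3D. It assumes 32-bit pixels and copies one rectangle row by row, using each buffer's own pitch in bytes, since a locked surface's pitch can be larger than width * 4. CopyRect32 and its parameters are hypothetical; in real code the pointers and pitches would come from the pBits and Pitch members of a D3DLOCKED_RECT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies a width x height block of 32-bit pixels out of a larger source
// buffer into a destination buffer.
// srcPitch/dstPitch: bytes per row of each buffer (may include padding).
// srcX/srcY: top-left pixel of the block inside the source buffer.
void CopyRect32(const std::uint8_t* srcBits, int srcPitch, int srcX, int srcY,
                std::uint8_t* dstBits, int dstPitch,
                int width, int height)
{
    assert(srcBits != nullptr && dstBits != nullptr);
    assert(srcX >= 0 && srcY >= 0 && width > 0 && height > 0);

    const std::size_t rowBytes = static_cast<std::size_t>(width) * 4;
    assert(static_cast<std::size_t>(dstPitch) >= rowBytes);
    assert(static_cast<std::size_t>(srcPitch) >= static_cast<std::size_t>(srcX) * 4 + rowBytes);

    for (int row = 0; row < height; ++row) {
        // Step by pitch, not by width, so row padding is skipped correctly.
        const std::uint8_t* srcRow = srcBits
            + static_cast<std::size_t>(srcY + row) * static_cast<std::size_t>(srcPitch)
            + static_cast<std::size_t>(srcX) * 4;
        std::uint8_t* dstRow = dstBits
            + static_cast<std::size_t>(row) * static_cast<std::size_t>(dstPitch);
        std::memcpy(dstRow, srcRow, rowBytes);
    }
}
```

Copying whole rows with memcpy replaces the inner per-pixel loop, and working in bytes avoids having to divide the pitch by 4.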