I'm not sure if this is a Windows-specific problem, but I have an application which creates multiple OpenGL windows (on multiple threads) in the same process (in an Explorer shell extension, actually). There's no issue with multithreading as such, since each window essentially hosts its own individual context, but I'm hitting a roadblock in the number of instances that I can run at any given time before the whole program slows to a complete crawl.
In the window initialization process, we create a standard OpenGL window (using the WGL bindings), and on destruction we make sure to destroy the render/device contexts, shut down GDI, free all textures and display lists, and destroy the ATL window as well (all of this is per window; the process keeps running in the background). Unless WGL is limiting the number of contexts a process can create, or something similar, I am completely confused as to why this slowdown is occurring.
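For reference, our per-window teardown follows roughly the order sketched below (the handle and object variables are this example's own; the Win32/WGL calls are the standard ones). The order matters: GL objects have to be freed while their context is still current, and the context has to be un-bound before it is deleted:

```
// Sketch of a per-window WGL teardown, in the order that matters.
void DestroyGlWindow(HWND hwnd, HDC hdc, HGLRC hglrc,
                     GLuint texture, GLuint displayList)
{
    wglMakeCurrent(hdc, hglrc);     // GL objects can only be freed
    glDeleteTextures(1, &texture);  // while their context is current
    glDeleteLists(displayList, 1);

    wglMakeCurrent(NULL, NULL);     // un-bind before deleting
    wglDeleteContext(hglrc);        // destroy the render context
    ReleaseDC(hwnd, hdc);           // give the device context back
    DestroyWindow(hwnd);            // finally tear down the window
}
```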
Some interesting notes: it appears that the size of the window, as well as the pixel format, affects the number of windows I can create without slowdown. For example, running full screen with multisampling will cause the lag within six window instances (opened and closed sequentially, no instances running in parallel), whereas creating a basic OpenGL window will yield roughly twenty window instances before slowing to a crawl.
Could it be possible that it is somehow falling back to software mode? And if so, how would I tell? I'm not so sure this is the case, because we are only drawing boxes (maybe 100 instances at most, also using display lists) on a Core 2 Duo 2.0 GHz/ATI X1600... which should be more than enough horsepower.
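One way to tell: query the renderer string from the *suspect* context while it is current; Microsoft's generic software implementation reports itself as "GDI Generic". A small sketch (the helper name is ours):

```cpp
#include <cstring>

// Returns true if a GL_RENDERER string looks like Microsoft's generic
// software rasterizer. Call glGetString(GL_RENDERER) with the suspect
// context current and pass the result in -- the string describes the
// context you query, not merely the installed driver.
bool IsSoftwareRenderer(const char* renderer)
{
    if (renderer == nullptr)
        return true;  // glGetString returns NULL with no current context
    return std::strstr(renderer, "GDI Generic") != nullptr;
}
```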
What other limits could I be hitting? Would this be a win32/opengl/wgl/ or hardware limit?
I am also a developer on this project and I would like to share some more details. We are running an OpenGL context inside an Explorer window as a shell extension. Regardless of what it does (since we're both under an NDA), all we can say is that it's just like any other 3D application, except it resides inside the Explorer shell view.
I have a hunch that it might be a garbage collection issue. On average the "slowdown" occurs after you run our shell view 5 times. If the window is much smaller, we can get more instances to run before the "slowdown." Also, the longer we wait in between runs, the more times we can run before the "slowdown."
Suggestions are more than welcome!
Are you disabling buffers that you aren't using?
Double buffering, stencil, etc.
Yes, we are disabling the buffers (wouldn't wglDeleteContext also take care of that for us?),
but the weirdest thing is that glDeleteTextures() appears to be generating GL_INVALID_OPERATION errors (the calls are definitely not being made between glBegin/glEnd).
Getting the GL_RENDERER string also shows the correct ATI renderer ID, but I'm not sure whether that reports the system's installed driver as opposed to the renderer actually in use.
Another interesting thing is that after the 5th/6th window instance, multisampling no longer works, even though the pixel format can still be set correctly (maybe that points to the fallback to software mode?).
Are you sure it's not as simple as "this is as much as the graphics processor can do?".
Running multiple OpenGL instances will load the GPU much more than running one large window, and if you are multisampling (meaning antialiasing?) you would produce, for example, 4x the data for the same window - so that would explain why a "basic" setup allows 20 windows and the multisampled one slows down after about a quarter of that number.
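Back-of-the-envelope arithmetic for that 4x figure (the per-sample byte counts are illustrative; real drivers add their own overhead):

```cpp
#include <cstddef>

// Rough estimate of framebuffer memory for one window: color + depth,
// multiplied by the sample count. Ignores driver overhead/compression.
std::size_t FramebufferBytes(std::size_t width, std::size_t height,
                             std::size_t samples)
{
    const std::size_t bytesPerColorSample = 4;  // RGBA8
    const std::size_t bytesPerDepthSample = 4;  // 24-bit depth + stencil
    return width * height * samples *
           (bytesPerColorSample + bytesPerDepthSample);
}
```

For a 1280x1024 window this gives about 10 MB at 1 sample and 40 MB at 4x multisampling, so six multisampled windows consume roughly the memory of twenty-four plain ones - which lines up with the 6-versus-20 ratio you observed.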
Obviously, you'll also produce more load for the CPU, so profiling your code would perhaps also shed some light on the situation. Using VTune you should be able to see where in the system most of the time is spent.
Another thought: Many graphics processors can only work with one stream of input at a time, and the process of changing from one stream to another usually means a "flush", which involves sending a "flush" or "fence" command to the graphics processor to make sure that what it's currently doing is finished before it can get on with the next task. Doing this every few milliseconds to keep multiple windows running can seriously harm the performance of the graphics processor.
I would agree if we were running the multiple windows simultaneously; however, even when doing this sequentially (i.e. open/close window 1, open/close window 2, etc., in which the contexts, windows, etc. are destroyed on close), it still exhibits this behaviour. Unfortunately, since the windows are created as a result of the Explorer instance, there's no way that we can do this per-process as opposed to per-thread.
Profiling shows our rendering code (for the initial runs) taking less time than the debug output we print (100+ fps)... but once we hit the 5th/6th run (like clockwork), the call (and a complete main loop step) takes up to 100% of one of the CPUs (and produces roughly one frame every 10-15 seconds).
Ah, ok, so I missed that point (obviously).
So either you are not actually cleaning up EVERYTHING you meant to clean up, or the driver is messed up.
I work with graphics drivers, partly OpenGL, so I know a bit about the subject, but I can't say I'm a super-expert.
Have you done a profile run with VTune to see where the CPU is spending its time? Is it in your code or the driver?
Have you tried a different model of graphics card [preferably using a GPU from another vendor, e.g. switch NVIDIA to AMD/ATI or the other way round] to see if the card and/or driver is causing the problem?
I know that at least one OpenGL implementation I've seen keeps some of its resources (surfaces and such) in a linked list - if you have a really long linked list, it can seriously affect performance.
My vote goes to "You've not actually cleaned everything up as you think you have". Sometimes doing things in the right order is important, so check that, and try some different variations of releasing things. Make sure you release EVERYTHING, even if some things are supposed to be released by some other function.
The second choice would be "graphics driver bug".
Best of luck in the bug-hunt.
Multi-threading the core 3D rendering in OGL or D3D is really never a good idea. Far too many things have to be protected and mutexed or critical sectioned.
A critical section normally takes about 10 CPU cycles to enter and exit, but there are times when it takes over 100. As well, WaitForSingleObject/WaitForMultipleObjects requires around 100 CPU instructions (according to MSDN) to execute. If you are multi-threading the renders then you must be using some type of Win32 synchronization object. After you account for the down-time of all your threads, it is actually better to render in one thread and do non-render-specific tasks in other threads. So each window could have its own graphics processing thread whose output is then rendered by one main thread. You may get better performance with this type of approach.
There are several optimizations that, on Windows platforms, are the same for both APIs.
1. Primitive batching - make sure you batch your primitives. If you batch them the video card can draw while you are processing the next batch in the vertex buffer. If you do not do this the video card can only draw when you are completely done with the vertex buffer - essentially the card is blocked until you are done processing. It is waiting on the CPU.
2. Make sure you are only drawing what you need to. Drawing items that are out of the frustum or too far from the camera to matter is just wasteful.
3. Make sure you are doing a minimal amount of texture changes in the render. This should be highly optimized.
4. Make sure you are not making a lot of transformation pipeline changes. In other words, don't set the projection matrix 10 times in one frame. You can also make use of matrix stacks, which help you avoid duplicating matrices and multiplies. They are normally used for models composed of several sub-models, but you can also use them to maximize your reuse of matrices in the render.
5. Make sure you are only doing as many render state changes as you need. All render state changes have significant overhead under the hood.
6. Make sure your core render loop is highly optimized and does not have significant loop overhead.
7. Make sure your textures are all powers of 2. This has more to do with memory usage than speed, but it can affect speed. Most cards will allocate the nearest power of 2 for your texture sizes anyway, so anything that is not a power of 2 is a waste.
8. Make sure your texture resolution and filtering quality is correct. The higher the texture resolution (as in the case of tiled textures) the more the card has to work. The higher the filtering quality the more the card has to do.
Anisotropic looks nice but may be too much for your program.
9. Make sure your anti-aliasing settings are correct. This can be a major slow down especially when coupled with number 8.
10. If not much changes from frame to frame some type of frame coherency system would help a lot. The point is that if the scene does not change then it could be cached and only those items that changed would need to be rendered. This is a big savings.
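As a concrete instance of points 3 and 5 above, sorting a frame's draw calls by texture turns a worst-case bind-per-draw pattern into one bind per distinct texture (the types here are illustrative, not from any real engine):

```cpp
#include <algorithm>
#include <vector>

// Illustrative draw call: just the texture it needs.
struct DrawCall {
    unsigned texture;
    // vertex data omitted for the sketch
};

// How many glBindTexture calls a front-to-back submission would cost.
unsigned CountTextureBinds(const std::vector<DrawCall>& calls)
{
    unsigned binds = 0;
    unsigned current = ~0u;  // sentinel: nothing bound yet
    for (const DrawCall& c : calls) {
        if (c.texture != current) {
            ++binds;
            current = c.texture;
        }
    }
    return binds;
}

// Group draw calls by texture; stable so same-texture order is kept.
void SortByTexture(std::vector<DrawCall>& calls)
{
    std::stable_sort(calls.begin(), calls.end(),
                     [](const DrawCall& a, const DrawCall& b) {
                         return a.texture < b.texture;
                     });
}
```

Four draws alternating between two textures cost four binds submitted naively, but only two once sorted - and the same grouping idea applies to any expensive render state, not just textures.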
What you are describing sounds like you are leaking memory or a hardware resource somewhere. This can cause big slowdowns and eventual crashes. If you are using memory-mapped file I/O, make sure your sizes are correct. You could also be swapping back and forth to disk too much, causing the slowdowns.
Any number of issues could be causing what you are describing.