Thread: non-blocking send/recv

  1. #1
    Registered User
    Join Date
    Sep 2004
    Posts
    80

    non-blocking send/recv

    Hi, I'm having some trouble understanding non-blocking socket operations.

If I have a non-blocking server (select-based) and use a non-blocking send from the client, select will "wake up" and the server will have a chance to check which socket in the set is ready to receive info. But since both the client and the server are non-blocking, wouldn't the client's send() return an error and set the error code to WSAEWOULDBLOCK (or EWOULDBLOCK, depending on the system) during the time it takes for the server to start its recv()?
    Last edited by antex; 05-22-2007 at 02:47 AM.

  2. #2
and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Both send() and recv() have associated buffers behind them, even when both are non-blocking.
    It's not like passing a parameter to a function where the send (aka caller) instantaneously invokes the receiver.

So send() would block when its buffer is full, and recv() would block when its buffer is empty.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Sep 2004
    Posts
    80
So when the buffer is full, send() blocks even when it's in non-blocking mode? What happens in a case like the one I described in my first post if the buffer doesn't get full, for example if I try to send a very limited amount of bytes to the server? Does it fail in such a case? I'm sorry if I look incredibly stupid now, but I don't think I get what you really mean. I would really appreciate a more detailed explanation of non-blocking send/recv.

    Thanks for the reply btw!

  4. #4
and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > So when the buffer is full send() blocks even when it's in non-blocking mode?
    No, it will return EWOULDBLOCK.

    If the send buffer is full when you call send(), then it will either block or return immediately with EWOULDBLOCK (as appropriate).
    Left alone for long enough, the operating system will send all the queued data to the receiver.

    If there is room in the buffer for your data, then it doesn't matter whether it's a blocking or non-blocking socket, the data will be queued, and the call will return immediately.

    If your socket is blocking, then you can use the select() system call to determine when there is room for at least one byte in an otherwise full buffer.

    > I would really appreciate a more detailed explanation about non-blocking send/recv
    It's exactly the same as normal, except they don't block when the respective buffers become full or empty.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Registered User
    Join Date
    Sep 2004
    Posts
    80
Ah okay, thank you very much for this info!

  6. #6
Malum in se abachler
    Join Date
    Apr 2007
    Posts
    3,195
In a multithreaded application, there's usually no reason to use non-blocking unless the same thread is handling both send() and recv(). AFAIK when a call blocks, it releases its time slice to the next thread.
    Last edited by abachler; 05-29-2007 at 04:05 PM.

  7. #7
Officially An Architect brewbuck
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
Quote Originally Posted by abachler
In a multithreaded application, there's usually no reason to use non-blocking unless the same thread is handling both send() and recv(). AFAIK when a call blocks, it releases its time slice to the next thread.
    Yeah, but that only works when you have only a few clients. When you've got 10,000 clients, are you really going to create 10,000 threads? Is it even possible to create that many threads on the OS of your choice?

    Writing truly scalable server software is more difficult than just throwing threads at it.

  8. #8
Malum in se abachler
    Join Date
    Apr 2007
    Posts
    3,195
Yes, you can spawn 10,000 threads; it just takes adjusting the default stack size, otherwise you run up against the 2GB process limit. Reducing the default stack size to 128k lets you spawn up to around 16k threads, although at that point task switching will probably be a major hog on processor cycles. But for games that only have, say, 32 players, with 2 threads per player (one for send(), one for recv()), this model is fine. If you need to service 10k+ clients then yes, async is the only way to do it effectively, but you will still be creating multiple threads, as you will most likely need more than one processor.

    Although a 10k+ client app would fall outside my qualifier of "usually".
    Last edited by abachler; 05-29-2007 at 04:12 PM.

  9. #9
Hurry Slowly vart
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
When you have more threads than processors, thread switching will kill performance... (A context switch on Windows can take up to 10 ms.)
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  10. #10
Malum in se abachler
    Join Date
    Apr 2007
    Posts
    3,195
Quote Originally Posted by vart
When you have more threads than processors, thread switching will kill performance... (A context switch on Windows can take up to 10 ms.)
I find this statement to be out of sync with my experience. My applications routinely use 13+ threads and don't have any problems caused by task switching (dual 3.2GHz Xeons with HT enabled, i.e. 4 logical processors). I'm sure that HT has something to do with it, since I get about 30% more performance with HT enabled than without it. BTW, my application specifically spawns just as many worker threads as there are logical processors, so when I tested it with HT disabled it only spawned 2 worker threads.

I think that network bandwidth is going to be more of an issue, since 10k clients will only get about 1 KB/s over 100baseT (crappy even for dialup), or 10 KB/s over 1000baseT. That of course assumes 100% network efficiency, something I've never seen. Expect about 80-95% in real-world situations.
    Last edited by abachler; 05-30-2007 at 08:59 AM.

  11. #11
Hurry Slowly vart
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
When you have 2 threads, one of which is in the suspended state (has nothing to process), the context switch from this thread to the other thread, which should start processing, will take no time...

When both threads are in the active state, the OS should store the register state, command cache and data cache of the CPU for the current thread in some place, restore the above data for the new thread, and resume it from the point where it was paused... These operations, as I said, can take up to 10 ms on Windows...

So when you have 2 threads per CPU, one of which is working and the second doing IO, you will see no problem (the IO thread will be mostly waiting for the IO operation to finish), so there will be very little overhead due to context switches... As the number of IO threads rises, the chance that an IO thread is in the active state when the OS decides to switch context grows, so the overhead will grow with every new thread you create for the current CPU...

    ----------
    About Hyperthreading...
The physical CPU emulates two logical CPUs that each have their own context (so no context-switch delay arises from switching between the logical CPUs). The switch is done when the active logical CPU has nothing to process (for example, when a wrong branch prediction leaves the command cache holding the wrong sequence of commands; in this case the command cache is cleared and a new sequence of commands is loaded... During this time the CPU is idling, so its power is given to the second logical CPU, which uses the first one's idle cycles to process its own command cache). When the active thread is programmed optimally and uses all the CPU cycles at the maximum possible rate, the second logical CPU gets no chance to do its work... So HT is good only for not-too-well-optimized code execution.
    Last edited by vart; 05-30-2007 at 10:06 PM.
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  12. #12
Malum in se abachler
    Join Date
    Apr 2007
    Posts
    3,195
Quote Originally Posted by vart
When you have 2 threads, one of which is in the suspended state (has nothing to process), the context switch from this thread to the other thread, which should start processing, will take no time...
Well, since my threads are all operating on separate sections of an image pipeline, they are all always 'busy', at least in theory. There are 4 threads (1 per logical CPU) that operate on the bottleneck section of the pipeline, and they are constantly busy; the other threads are probably not as busy, since they perform much more limited processing of the images (like conversion from JPG to BMP, color-plane splitting, etc.).

Quote Originally Posted by vart
When both threads are in the active state, the OS should store the register state, command cache and data cache of the CPU for the current thread in some place, restore the above data for the new thread, and resume it from the point where it was paused... These operations, as I said, can take up to 10 ms on Windows...
This is not how the hardware operates, and how it operates has nothing to do with whether it is running Windows or Linux or some flavor-of-the-week OS. A hardware interrupt from the timer causes the OS to execute the ISR, which pushes the TSS of the current thread. It then loads the new TSS for the next thread. Now, even on the newest processors a TSS is less than 1K in size, so unless you are running my old VIC-20, 10 ms is just plain wrong. I doubt it even takes 10 µs; probably closer to 10 ns. The processor does not restore the command/data cache prior to resuming execution, since it would see no performance increase from doing so, and would possibly waste memory fetches on instructions or data that would never be used.

Quote Originally Posted by vart
So when you have 2 threads per CPU, one of which is working and the second doing IO, you will see no problem (the IO thread will be mostly waiting for the IO operation to finish), so there will be very little overhead due to context switches... As the number of IO threads rises, the chance that an IO thread is in the active state when the OS decides to switch context grows, so the overhead will grow with every new thread you create for the current CPU...
Yes, yes, but so what? I/O, particularly network I/O, is not particularly compute-bound to begin with. An 'idle state' task switch is only free in the sense that you have already paid for it.

Quote Originally Posted by vart
----------
About Hyperthreading...
The physical CPU emulates two logical CPUs that each have their own context (so no context-switch delay arises from switching between the logical CPUs). The switch is done when the active logical CPU has nothing to process (for example, when a wrong branch prediction leaves the command cache holding the wrong sequence of commands; in this case the command cache is cleared and a new sequence of commands is loaded... During this time the CPU is idling, so its power is given to the second logical CPU, which uses the first one's idle cycles to process its own command cache). When the active thread is programmed optimally and uses all the CPU cycles at the maximum possible rate, the second logical CPU gets no chance to do its work... So HT is good only for not-too-well-optimized code execution.
So your point is that VC++ sucks at optimizing; well, that's nothing new. I agree that hand-optimizing the command stream in assembly will almost always produce faster code (unless you aren't good at it). But the problem is that with a single thread per CPU, a problem that locks up the execution of one I/O connection would stall or halt a huge proportion of the overall workload. With thread-per-client, a bad connection can only stall its own thread.
    Last edited by abachler; 05-31-2007 at 10:00 AM.

  13. #13
Cat without Hat CornedBee
    Join Date
    Apr 2003
    Posts
    8,895
    VC++ sucks at optimizing? That is news to me. In the x86 world, it's actually very good, especially for AMD CPUs where Intel's compiler doesn't work as well.

    Context switching takes time. Not only because the context must be switched, but also because kernel code is invoked to determine the next thread to run.
    Having many threads in a CPU-bound operation is not an advantage. If the threads never block, then having more than the number of CPUs only introduces overhead. Context switching in itself is not the only thing to consider. Another is cache coherency: another thread means another stack, and that other thread's stack takes up precious memory in the on-chip caches - possibly pushing out another thread's stack, which will have to be re-fetched from memory on the next context switch.
    In I/O-bound operations, it's a different matter, of course. A blocking thread and an active thread don't disturb each other. As long as the blocking thread still blocks, the active thread has the CPU to itself.
    Of course, having many parallel I/O-blocking threads has its limits, too. First, threads take up resources, whether they're blocking or not. It's just not useful to create more and more threads. Second, just as the CPU has limits, so has I/O. If one thread is blocking on a huge HDD read operation, adding another thread to read a file on the same HDD isn't useful: the limit is the IDE/SATA bandwidth.
    The ideal number of threads for a program is always: as many as the I/O can take plus as many as the CPUs can take. Of course, finding out how large this number is is mostly a matter of experimentation, and highly dependent on the target machine and the program.

    Now, threads that are neither using the CPU nor using up I/O bandwidth, such as threads idling on network connections, those are really just a waste. Async I/O is a better solution there, as it accomplishes the same goal using fewer resources.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  14. #14
Hurry Slowly vart
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
> I doubt it even takes 10 µs, probably closer to 10 ns.
You can believe what you want. I'm talking about what I've seen in real life. Take Thread Profiler and see for yourself...

> VC++ sucks at optimizing? That is news to me. In the x86 world, it's actually very good, especially for AMD CPUs where Intel's compiler doesn't work as well.
On Intel CPUs, the Intel compiler can generate code that is really faster than what VC produces...
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  15. #15
Malum in se abachler
    Join Date
    Apr 2007
    Posts
    3,195
I'm not saying it takes zero time, but it DOES NOT take 10 ms. If you think the profiler is telling you that, then either your profiler is misreporting or you are reading it wrong. 10 ms is like 32 million clock cycles (at 3.2 GHz). I could maybe accept 10 µs (32 thousand cycles), although I doubt even Microsoft could bloat their code that much, particularly since most modern processors average over 2 ops per cycle. Seriously, a context switch shouldn't use more than a few hundred ops at most.

If you don't believe me, check out pages 6-5 and 6-12 to 6-15 of:

    IA-32 Intel® Architecture Software Developer's Manual

    Volume 3: System Programming Guide


In fact, chapter 6 as a whole covers task switching.
    Last edited by abachler; 05-31-2007 at 01:20 PM.
