Thread: winsock splitting msg

  1. #1
    Registered User
    Join Date
    Dec 2008
    Posts
    104

    winsock splitting msg

    Hello,
    In my application, I request a file from an HTTP server in order to parse a certain type of information. The problem is, since I have a char[] buffer of 512 indices, my application sometimes receives only a part of the string I need to parse, while I need the WHOLE string.

    The worst thing is that the server does not always split the string that I need to parse in the same place. Sometimes it splits it in half, sometimes a quarter, etc.
    This prevents me from predicting where the string will be split.

    My current solution is very, very, very nasty. I have a char[] buffer of 80k elements; that way, the server has no need to split the body into parts and I can receive the whole string I need to parse. This makes my application very slow and makes it use a HUGE amount of memory.

    Any solutions that cross your mind?

    Thank you,
    abraham2119

  2. #2
    Registered User carrotcake1029's Avatar
    Join Date
    Apr 2008
    Posts
    404
    You can choose how much you want to receive every time. Just change the argument of recv() where you specify how many bytes you want to receive.

  3. #3
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > The worst thing is that the server does not always split the string that I need to parse in the same place.
    It's nothing to do with the server, it's all about the nature of a TCP/IP connection.

    It is a stream protocol. Messages can be fragmented on transmission as well as reception. And it's your job to deal with that at both ends (depending on whether you're the transmitter and/or receiver).
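
    A minimal sketch of handling this on the receiving side: keep reading until the expected number of bytes has arrived, however the stream happens to be cut up. All names here (read_fn, read_exact, fake_stream, fake_read) are hypothetical and made up for illustration. The reader callback is assumed to behave like recv(): it may return fewer bytes than requested, 0 when the peer closes, or -1 on error. fake_read is only a stand-in that simulates fragmented delivery without a real socket.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reader callback: behaves like recv(), returning the number of
   bytes stored (possibly fewer than asked), 0 on close, or -1 on error. */
typedef int (*read_fn)(void *ctx, char *buf, int len);

/* Keep reading until exactly `want` bytes have arrived.
   Returns the number of bytes actually read (less than `want` only if the
   peer closed the connection early), or -1 on error. */
int read_exact(read_fn rd, void *ctx, char *buf, int want)
{
    int got = 0;
    assert(rd != NULL && buf != NULL && want >= 0);
    while (got < want) {
        int n = rd(ctx, buf + got, want - got);
        if (n < 0) return -1;   /* error */
        if (n == 0) break;      /* connection closed early */
        got += n;
    }
    return got;
}

/* Stand-in for a socket: hands out at most `max_piece` bytes per call,
   the way a TCP stream may deliver one "message" in several fragments. */
struct fake_stream {
    const char *data;
    int len;
    int pos;
    int max_piece;
};

int fake_read(void *ctx, char *buf, int len)
{
    struct fake_stream *s = ctx;
    int n = s->len - s->pos;
    if (n > s->max_piece) n = s->max_piece;
    if (n > len) n = len;
    memcpy(buf, s->data + s->pos, (size_t)n);
    s->pos += n;
    return n;
}
```

    With Winsock, the callback would presumably be a thin wrapper around recv(sock, buf, len, 0); the loop itself does not change.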
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    Registered User
    Join Date
    Dec 2008
    Posts
    104
    Quote Originally Posted by carrotcake1029 View Post
    You can choose how much you want to receive every time. Just change the argument of recv() where you specify how many bytes you want to receive.
    Exactly, when I said that I had a char[] buffer with 80k indices, I also meant that I was calling recv() with 80k bytes.

    However, I don't know how much I need to receive to keep the string from being split, because I want my application to work with various HTTP servers, not just one. Receiving 80k bytes was the only solution I could see.

    Any other solutions?

    EDIT: Salem, I am aware of that, and that is why I came here: to get help finding a correct design for my application.

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    What's special about 80K?

    You have 2 buffers, large enough to contain a single record (say a line) of the information you want to parse.

    One buffer holds a complete line.
    The other buffer holds a fragment of a line. When the record boundary is found, the first part of the buffer becomes a whole line buffer, and the tail end of it becomes a fragment for the next line.
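
    The two-buffer idea could be sketched roughly as follows. This assumes records are lines terminated by '\n' and that no line exceeds LINE_MAX_LEN - 1 bytes; both are assumptions, as are all the names (line_reader, line_reader_feed, line_log, log_line). Each chunk is fed in exactly as it arrives, so it does not matter where the split falls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_MAX_LEN 256   /* assumed upper bound on one record */

struct line_reader {
    char frag[LINE_MAX_LEN];  /* partial line carried over between chunks */
    size_t frag_len;
};

typedef void (*line_cb)(const char *line, void *user);

/* Feed one received chunk. Every completed line (terminated by '\n') is
   copied into a whole-line buffer and passed to `on_line`; whatever follows
   the last '\n' stays in `frag` until the next chunk arrives. */
void line_reader_feed(struct line_reader *lr, const char *chunk, size_t n,
                      line_cb on_line, void *user)
{
    char line[LINE_MAX_LEN];
    size_t i;
    assert(lr != NULL && chunk != NULL && on_line != NULL);
    for (i = 0; i < n; i++) {
        if (chunk[i] == '\n') {
            size_t len = lr->frag_len;
            if (len > 0 && lr->frag[len - 1] == '\r')
                len--;                       /* strip the CR of a CRLF */
            memcpy(line, lr->frag, len);
            line[len] = '\0';
            on_line(line, user);
            lr->frag_len = 0;                /* start a new fragment */
        } else if (lr->frag_len < LINE_MAX_LEN - 1) {
            lr->frag[lr->frag_len++] = chunk[i];
        }
        /* else: line longer than the buffer; the extra bytes are dropped */
    }
}

/* Example callback: remembers the last line and counts how many were seen. */
struct line_log {
    int count;
    char last[LINE_MAX_LEN];
};

void log_line(const char *line, void *user)
{
    struct line_log *log = user;
    log->count++;
    strcpy(log->last, line);   /* safe: line is shorter than LINE_MAX_LEN */
}
```

    Calling line_reader_feed() once per recv() with whatever was received is enough; a line that straddles two reads is only reported after its second half has arrived.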

  6. #6
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    The correct way is to call recv() in a loop until you have received all of the data. The protocol you are using (in this case it appears to be HTTP) should tell you how much data you need to receive, so you just keep calling recv() until you have gotten that amount. In the case of HTTP, the data size is usually stored in the "Content-Length" header. This is not always the case though: if the server is using chunked encoding, there will be no Content-Length header, and you need to decode the chunked encoding instead. Also keep in mind that the Content-Length does not include the HTTP header in the size, so you have to parse that out first (the header should always end with a double CRLF).
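
    As a rough sketch of the header step, assuming the bytes received so far have been accumulated in one buffer: find the double CRLF, then look for Content-Length in the header part. The function names (find_body_start, content_length) are made up for this example, and the parsing is deliberately simplistic (no overflow check, no header line folding).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return the offset of the first body byte (just past "\r\n\r\n"),
   or -1 if the header has not been received completely yet. */
long find_body_start(const char *buf, size_t len)
{
    size_t i;
    assert(buf != NULL);
    for (i = 0; i + 4 <= len; i++) {
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return (long)(i + 4);
    }
    return -1;
}

/* Case-insensitive comparison of the first n bytes of a and b. */
static int same_prefix(const char *a, const char *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Look for a Content-Length header in the first hdr_len bytes.
   Returns its value, or -1 if there is none (e.g. chunked encoding). */
long content_length(const char *hdr, size_t hdr_len)
{
    static const char name[] = "content-length:";
    const size_t nlen = sizeof name - 1;
    size_t i = 0;
    assert(hdr != NULL);
    while (i < hdr_len) {
        /* i is at the start of a header line */
        if (hdr_len - i >= nlen && same_prefix(hdr + i, name, nlen)) {
            size_t j = i + nlen;
            long value = 0;
            int digits = 0;
            while (j < hdr_len && (hdr[j] == ' ' || hdr[j] == '\t'))
                j++;                               /* optional whitespace */
            while (j < hdr_len && isdigit((unsigned char)hdr[j])) {
                value = value * 10 + (hdr[j] - '0');
                j++;
                digits++;
            }
            return digits ? value : -1;
        }
        while (i < hdr_len && hdr[i] != '\n')      /* skip to the next line */
            i++;
        i++;
    }
    return -1;
}
```

    Once find_body_start() returns a non-negative offset, the body bytes already in the buffer count toward the Content-Length total, and the recv() loop only needs to fetch the remainder.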

    None of this is trivial, and that's why people usually will use a library like libcurl to do HTTP transactions.

  7. #7
    Registered User
    Join Date
    Dec 2008
    Posts
    104
    Quote Originally Posted by Salem View Post
    What's special about 80K?

    You have 2 buffers, large enough to contain a single record (say a line) of the information you want to parse.

    One buffer holds a complete line.
    The other buffer holds a fragment of a line. When the record boundary is found, the first part of the buffer becomes a whole line buffer, and the tail end of it becomes a fragment for the next line.
    If I understand what you are saying, you are telling me to 'join' the split lines that I need to parse.

    This would be a good solution IF I knew where the string would split, which I don't.

    Imagine this was the line I am looking to parse:
    Code:
    <Test>Test</Test>
    The line could split ANYWHERE in the string; this means that I can't predict what I have to look for in order to know that the line has been split.

    Because sometimes, the line IS sent together, and sometimes it is not.

    Note: The server sends more than one of the lines I need to parse, so I have to parse several lines separately.

    EDIT: bithub, the server sends the data using chunked encoding. The total size of the data to be received is 80k bytes, which is why I had a buffer that could hold 80k bytes. This is not really the way I wanted to do it, though. The application runs much faster when receiving 512 bytes (for example) at a time than when receiving roughly 80k at once, because I do a lot of string manipulation on the received data, and manipulating a string of about 80k characters is slower than manipulating one of 512.
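
    For reference, chunked encoding can be decoded incrementally, so small reads such as 512 bytes remain possible. The following is only a sketch under simplifying assumptions: chunk extensions after ';' are skipped, trailers after the final zero-size chunk are ignored, and there is no overflow check on the chunk size. All names (chunk_decoder, chunk_decoder_feed, body_buf, append_body) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum chunk_state { CH_SIZE, CH_SIZE_LF, CH_DATA, CH_DATA_CR, CH_DATA_LF,
                   CH_DONE, CH_ERROR };

struct chunk_decoder {
    enum chunk_state state;
    unsigned long remaining;  /* payload bytes left in the current chunk */
    int in_ext;               /* nonzero while skipping a chunk extension */
};

typedef void (*body_cb)(const char *data, size_t n, void *user);

static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Feed the body bytes from one recv(); decoded payload goes to on_body. */
void chunk_decoder_feed(struct chunk_decoder *d, const char *buf, size_t n,
                        body_cb on_body, void *user)
{
    size_t i = 0;
    assert(d != NULL && buf != NULL && on_body != NULL);
    while (i < n && d->state != CH_DONE && d->state != CH_ERROR) {
        char c = buf[i];
        switch (d->state) {
        case CH_SIZE:                      /* hex size line, e.g. "1a3\r\n" */
            if (c == '\r') d->state = CH_SIZE_LF;
            else if (c == ';') d->in_ext = 1;
            else if (!d->in_ext) {
                int h = hex_value(c);
                if (h < 0) d->state = CH_ERROR;
                else d->remaining = d->remaining * 16 + (unsigned long)h;
            }
            i++;
            break;
        case CH_SIZE_LF:
            if (c != '\n') d->state = CH_ERROR;
            else d->state = d->remaining ? CH_DATA : CH_DONE;  /* 0 = last */
            d->in_ext = 0;
            i++;
            break;
        case CH_DATA: {                    /* pass through up to `remaining` */
            size_t take = n - i;
            if (take > d->remaining) take = (size_t)d->remaining;
            on_body(buf + i, take, user);
            d->remaining -= take;
            i += take;
            if (d->remaining == 0) d->state = CH_DATA_CR;
            break;
        }
        case CH_DATA_CR:                   /* CRLF that closes each chunk */
            d->state = (c == '\r') ? CH_DATA_LF : CH_ERROR;
            i++;
            break;
        case CH_DATA_LF:
            d->state = (c == '\n') ? CH_SIZE : CH_ERROR;
            i++;
            break;
        default:
            break;
        }
    }
}

/* Example sink: appends decoded bytes to a fixed buffer, truncating if full. */
struct body_buf { char data[128]; size_t len; };

void append_body(const char *data, size_t n, void *user)
{
    struct body_buf *b = user;
    if (n > sizeof b->data - b->len) n = sizeof b->data - b->len;
    memcpy(b->data + b->len, data, n);
    b->len += n;
}
```

    The callback could just as well hand the decoded bytes straight to a parser instead of storing them, which avoids holding all 80k in memory at once.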
    Last edited by abraham2119; 06-09-2009 at 10:56 AM.

  8. #8
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    So what's the difference between
    Code:
    <Test>Test</Test>
    
    <Test>
    Test</Test>
    
    <Test>
    Test
    </Test>
    It's all valid HTML, and IIRC it means the same thing as well.

  9. #9
    Registered User
    Join Date
    Dec 2008
    Posts
    104
    Quote Originally Posted by Salem View Post
    So what's the difference between
    Code:
    <Test>Test</Test>
    
    <Test>
    Test</Test>
    
    <Test>
    Test
    </Test>
    It's all valid HTML, and IIRC it means the same thing as well.
    Ugh, that is not the point. I need to PARSE something from the HTML. In this case, I'd have to parse whatever is between the <Test> tags. My whole problem is getting the WHOLE string.

  10. #10
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    What about a buffer-less approach, and use a state machine?

    Code:
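    while ( (ch=getNextChar()) ) {  // getNextChar() is assumed to return 0 at end of input
      switch ( ch ) {
        case '<':
            tagStringLen = 0;  // start collecting a new tag
            tagString[tagStringLen++] = ch;
            state = inTag;
            break;
        case '>':
            tagString[tagStringLen++] = ch;
            tagString[tagStringLen] = '\0';
            process( tagString );  // set some state on seeing <test>, clear it on seeing </test>
            state = outTag;
            break;
        // and so on
        default:
            if ( state == inTag )
                tagString[tagStringLen++] = ch;  // tag name and attributes (bounds check omitted)
            // else: text between tags; keep it if the current tag is one you care about
            break;
      }
    }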
    That's a simplified view.
    More generally, you would compare both 'state' and 'ch' to determine what 'newstate' should be, and perform any additional processing along the way.

    Adding detection of say comments is pretty easy.
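
    A fuller sketch of the same idea, written so that it can be fed each received chunk directly. Because the state lives in a struct between calls, it makes no difference where the stream was split. It assumes the element of interest is literally <Test>...</Test> with no attributes and no nested tags, and that tag names and text fit the fixed buffers. All names (tag_parser, tag_parser_feed, TAG_MAX, TEXT_MAX) are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TAG_MAX  64    /* assumed limit on a tag's length */
#define TEXT_MAX 256   /* assumed limit on the text inside <Test> */

enum parse_state { OUT_TAG, IN_TAG };

struct tag_parser {
    enum parse_state state;
    char tag[TAG_MAX];    /* characters between '<' and '>' */
    size_t tag_len;
    char text[TEXT_MAX];  /* text collected since <Test> */
    size_t text_len;
    int capturing;        /* nonzero between <Test> and </Test> */
    int found;            /* number of complete elements seen so far */
    char last[TEXT_MAX];  /* text of the most recent complete element */
};

/* Feed one received chunk; the chunk may end anywhere, even inside a tag. */
void tag_parser_feed(struct tag_parser *p, const char *buf, size_t n)
{
    size_t i;
    assert(p != NULL && buf != NULL);
    for (i = 0; i < n; i++) {
        char ch = buf[i];
        if (p->state == OUT_TAG) {
            if (ch == '<') {
                p->state = IN_TAG;
                p->tag_len = 0;
            } else if (p->capturing && p->text_len < TEXT_MAX - 1) {
                p->text[p->text_len++] = ch;
            }
        } else {                                   /* IN_TAG */
            if (ch == '>') {
                p->tag[p->tag_len] = '\0';
                if (strcmp(p->tag, "Test") == 0) {
                    p->capturing = 1;              /* start collecting text */
                    p->text_len = 0;
                } else if (strcmp(p->tag, "/Test") == 0 && p->capturing) {
                    p->text[p->text_len] = '\0';
                    strcpy(p->last, p->text);      /* element is complete */
                    p->found++;
                    p->capturing = 0;
                }
                p->state = OUT_TAG;
            } else if (p->tag_len < TAG_MAX - 1) {
                p->tag[p->tag_len++] = ch;
            }
        }
    }
}
```

    Zero-initialise the struct (for example with memset) before the first call, then call tag_parser_feed() after every recv() and check `found` to see whether a complete element has arrived.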

  11. #11
    Registered User
    Join Date
    Jul 2009
    Posts
    3
    You can receive as much data as you want.
    So, you could receive the data one byte at a time if you like. It is not very efficient, but you can parse the data as it comes in and stop the receive calls whenever you wish.

    You are not locked into receiving a preset amount of data.

    As a side note (as mentioned before), TCP/IP does not preserve data boundaries, whereas UDP does, but with UDP you are not guaranteed to receive the data at all.

    More information on this can be found here - Winsock
