Thread: Why do I get partial web-pages with recv?

  1. #1
    Registered User
    Join Date
    Jan 2005
    Location
    Estonia
    Posts
    131

    Why do I get partial web-pages with recv?

    OS: Linux Ubuntu 6.10


    I am trying to receive the full HTML page of www.google.com, but I only get a partial page. http://cboard.cprogramming.com/showthread.php?t=86814 describes a similar problem, and the "solution" there was to use curl.

    I don't want to use curl; I want to get the page with the recv() function.

    I am sending this request to google:

    Code:
    GET / HTTP/1.1 <crlf>
    Host: www.google.com <crlf>
    Connection: close <crlf>
    <crlf>

    The <crlf>-s are substituted with \r\n, of course.
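    As a rough sketch of how that request could be assembled in C++ (build_request is a hypothetical helper name, not something from my actual program), with each header line ending in \r\n and an empty line closing the header block:

```cpp
#include <string>

// Hypothetical helper (name is illustrative only): builds the request shown above.
std::string build_request(const std::string& host)
{
    std::string req;
    req += "GET / HTTP/1.1\r\n";
    req += "Host: " + host + "\r\n";
    req += "Connection: close\r\n";
    req += "\r\n"; // the empty line ends the header block
    return req;
}
```

    The result would then be passed to send() in full; send() may accept fewer bytes than requested, so a real program should loop until everything has been written.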

    Here's the preparation code:
    Code:
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    int status = 1;
    ioctl(sock, FIONBIO, &status); //put the socket in non-blocking mode.
    
    sockaddr_in socket_address;
    //I am doing the evaluation here(port, host and addr family).
    //Connect the socket...

    Here's the main loop:
    Code:
    char buffer[1000];
    string whole_page;
    
    while (1)
    {
        int bytes = recv(sock, buffer, 1000);
        if (bytes == -1)
        {
            if (errno == EAGAIN) continue; //would block
            else return;
        }
        if (bytes == 0)
        {   //Google disconnected me?
            return;
        }
        whole_page += buffer;
    }
    So why does this code only get part of the page source?


    Does "bytes == 0" mean that google.com terminated the connection?
    If not, then how can I know when the connection gets terminated?

  2. #2
    Registered User
    Join Date
    Jan 2005
    Location
    Estonia
    Posts
    131
    A quote from http://www.madwizard.org/view.php?pa...pter6&lang=cpp
    Recv too will block if no data is available immediately and return if some has arrived. The return value of recv is either 0, SOCKET_ERROR or the number of bytes read. SOCKET_ERROR of course indicates a socket error, 0 indicates closure of the connection.
    It's a tutorial on Winsock. Does that == 0 rule apply to Linux sockets too?

    A quote from:
    http://www.hmug.org/man/2/recv.php

    These calls return the number of bytes received, or -1 if an error
    occurred.
    It says nothing about connection closure. Can I assume that bytes == 0 means that the connection is terminated?
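    For what it's worth, the Linux recv(2) man page and POSIX both say that recv() returns 0 once a stream peer has performed an orderly shutdown. A small way to see this locally, assuming a POSIX system and using socketpair() as a stand-in for a real server (recv_after_peer_close is a made-up name for this demo):

```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical demo: returns what recv() reports after the peer has closed.
// Returns -2 if the socket pair itself could not be created.
ssize_t recv_after_peer_close()
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) == -1)
        return -2;

    close(fds[1]); // the "server" end closes without sending anything

    char buffer[16];
    ssize_t n = recv(fds[0], buffer, sizeof buffer, 0); // blocking socket
    close(fds[0]);
    return n; // 0 signals end of stream, -1 would signal an error
}
```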

  3. #3
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > Can I assume that bytes == 0 means that the connection is terminated?
    I think getting a return of 0 means you need to go check the value of errno.

    If you're using a non-blocking socket, then you should get EAGAIN indicating that the connection is still alive, but there is no data at the moment.

    A zero return on a blocking socket is end of connection.

    Also, look at select() to help you determine if there is any data to be read, before you read it.
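    A minimal sketch of the select() approach, assuming a connected POSIX stream socket (read_until_closed and its timeout parameter are illustrative names, not an existing API). It also shows the usual convention for return values: -1 with errno set to EAGAIN means "no data yet" on a non-blocking socket, while 0 means the peer closed the connection.

```cpp
#include <cerrno>
#include <cstddef>
#include <string>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: reads from sock until the peer closes the connection.
// Returns true on a clean close, false on error or timeout.
bool read_until_closed(int sock, std::string& out, int timeout_seconds)
{
    char buffer[1000];
    for (;;)
    {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(sock, &readable);

        timeval timeout; // reset every pass; select() may modify it
        timeout.tv_sec = timeout_seconds;
        timeout.tv_usec = 0;

        int ready = select(sock + 1, &readable, nullptr, nullptr, &timeout);
        if (ready == -1)
        {
            if (errno == EINTR) continue; // interrupted by a signal, retry
            return false;
        }
        if (ready == 0) return false; // nothing arrived before the timeout

        ssize_t bytes = recv(sock, buffer, sizeof buffer, 0);
        if (bytes == -1)
        {
            if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) continue;
            return false;
        }
        if (bytes == 0) return true; // peer closed the connection

        out.append(buffer, static_cast<std::size_t>(bytes));
    }
}
```

    Waiting in select() avoids spinning on EAGAIN, so no sleep is needed between reads.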
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    int x = *((int *) NULL); Cactus_Hugger's Avatar
    Join Date
    Jul 2003
    Location
    Banks of the River Styx
    Posts
    902
    I believe that if recv() returns 0, the connection is closed:
    Quote Originally Posted by man recv(2)
    If no messages are available at the socket, the receive calls wait for a message to arrive, unless the socket is nonblocking (see fcntl(2)) in which case the value -1 is returned and the external variable errno set to EAGAIN.
    There are bigger errors, however.
    Code:
    int bytes = recv(sock, buffer, 1000);
    ...
    whole_page += buffer;
    First, recv() takes four arguments, not three. The last one is flags for recv(), usually 0.

    Second, you cannot just append your buffer to a C++ string like that. std::string's += operator appends a C string, and that string must be NUL-terminated. recv() does not append a NUL to the buffer, so you must do so yourself before passing the buffer to anything that expects C strings. (This also means you should pass one byte less than the total size of your buffer to recv(), to leave room for the NUL you will append.) Something like:
    Code:
    ret = recv(my_socket, my_buffer, my_buffers_size - 1, 0);
    // Error checking.
    my_buffer[ret] = 0;
    my_cppstring += my_buffer;
    And be sure you return your string somehow when you're done... Google will disconnect you after sending the data.
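    An alternative sketch that avoids the terminator altogether: std::string::append() has an overload taking a pointer and a byte count, so the return value of recv() can be used directly. This also keeps any zero bytes that appear inside the data, which matters for binary content. append_received is a hypothetical wrapper, shown only to make the call explicit:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends exactly `count` received bytes to `page`.
// `count` would be the (positive) value returned by recv().
void append_received(std::string& page, const char* buffer, std::size_t count)
{
    page.append(buffer, count); // length-based, so no terminator is needed
}
```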
    Last edited by Cactus_Hugger; 12-27-2006 at 08:03 PM.
    long time; /* know C? */
    Unprecedented performance: Nothing ever ran this slow before.
    Any sufficiently advanced bug is indistinguishable from a feature.
    Real Programmers confuse Halloween and Christmas, because dec 25 == oct 31.
    The best way to accelerate an IBM is at 9.8 m/s/s.
    recursion (re - cur' - zhun) n. 1. (see recursion)

  5. #5
    Registered User
    Join Date
    Jan 2005
    Location
    Estonia
    Posts
    131
    I wrote this code here in the forum without compiling, so the 4th argument was accidentally left out. No problem with that.

    But I didn't know that the buffer does not contain a '\0' character at the end of it. Thanks for pointing that out.

  6. #6
    Registered User
    Join Date
    Mar 2005
    Location
    Juneda
    Posts
    291
    With Winsock, when the number of bytes received is 0, or the error is 'WSAECONNRESET', it means that the transfer has ended; I suppose it is similar with Linux sockets. There is also a way to pick up any last unexpected bytes while closing the connection (I don't know if that is your problem, but maybe it will help): see http://tangentsoft.net/wskfaq/exampl...cs/ws-util.cpp and take a look at the function 'ShutdownConnection(socket)'.

    Niara

  7. #7
    Registered User
    Join Date
    Jan 2005
    Location
    Estonia
    Posts
    131
    I don't think that will help, as I specified in the GET request "Connection: Close". Thus, when all the data has been sent, google should terminate the connection, but that doesn't happen.

  8. #8
    Registered User
    Join Date
    Jan 2005
    Location
    Estonia
    Posts
    131
    Ok it's all right now. I got the loop working. Thanks everyone

    But now I have another problem.

    I was able to receive the full source code of www.google.ee, but there are some weird lines that I think should not be there.
    I put the source to http://haxxx.hyena.pri.ee/crap.txt

    As you can see, the 9th and the last line contain respectively "aae" and "0".
    But when I open www.google.ee in my browser, I don't get such lines.


    Here's the loop:
    Code:
    char source[2000];
    string full_source;
    
    ioctl(sock_, FIONBIO, &status);
        while (1)
        {
            usleep(100000); //sleep 100 milliseconds
            int ret = recv(sock_, source, 2000 - 1, 0);
            if (ret == -1)
            {
                if (errno == EAGAIN)
                {   //would block
                    cout << "------Would block------" << endl;
                    continue;
                }
                else
                {
                    cout << "------A fatal error occurred-----" << endl;
                    cout << "Errno = " << errno << " - " << strerror(errno) << endl;
                    break;
                }
            }
            if (ret == 0)
            {   //the connection was shut down
                cout << "------Connection was shut down------" << endl;
                break;
            }
            //We are here if ret > 0, thus we got some data!
            source[ret] = '\0'; //Add a '\0' to the end of the received data.
            cout << "------I got some data------:" << endl;
            cout << source << endl;
            
            
            full_source += source;
        }
    Btw: crap.txt contains the data from full_source, not from the console output.
    Last edited by hardi; 12-28-2006 at 07:47 AM.

  9. #9
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    That would be the chunked transfer encoding. Read the HTTP spec for more information.
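    To make that concrete: in chunked encoding each chunk is preceded by a line giving its size in hexadecimal, and a chunk of size 0 marks the end. So "aae" is a chunk size (0xaae = 2734 bytes) and the final "0" is the terminating chunk. A rough decoder sketch, assuming the response headers have already been stripped and the whole body is in memory (decode_chunked is a hypothetical name, and trailers and chunk extensions are ignored rather than handled properly):

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: decodes a chunked body (headers already removed).
// Returns false if the input is truncated or malformed.
bool decode_chunked(const std::string& body, std::string& out)
{
    std::size_t pos = 0;
    for (;;)
    {
        std::size_t line_end = body.find("\r\n", pos);
        if (line_end == std::string::npos) return false;

        // The size line is hexadecimal; strtoul stops at any ";extension".
        std::string size_line = body.substr(pos, line_end - pos);
        char* end = nullptr;
        unsigned long size = std::strtoul(size_line.c_str(), &end, 16);
        if (end == size_line.c_str()) return false; // no hex digits found

        pos = line_end + 2;         // skip the CRLF after the size line
        if (size == 0) return true; // last chunk; trailers are ignored here

        std::size_t remaining = body.size() - pos;
        if (size > remaining || remaining - size < 2) return false;

        out.append(body, pos, size); // copy the chunk data
        pos += size;
        if (body.compare(pos, 2, "\r\n") != 0) return false;
        pos += 2;                    // skip the CRLF after the chunk data
    }
}
```

    This is lenient about the size line (strtoul accepts leading whitespace, for example), so treat it as a starting point rather than a conforming parser.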
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  10. #10
    int x = *((int *) NULL); Cactus_Hugger's Avatar
    Join Date
    Jul 2003
    Location
    Banks of the River Styx
    Posts
    902
    Chunked transfer encodings gave me a fun time when I first encountered them. (Except I was working with JPEGs, so they completely corrupted the result until decoded.)

    To elaborate, see this section (and perhaps the one above it) of the HTTP Protocol spec.

  11. #11
    Registered User
    Join Date
    Jan 2005
    Location
    Estonia
    Posts
    131
    Isn't there a good tutorial on this?
    Those specifications are so complicated, and there is a tremendous lack of (good) examples.

  12. #12
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    I don't think there is. That's why there are libraries such as cURL.

    To put it bluntly, either you understand specifications, or you have no business trying to implement them.
