Thread: Winsock HTML Page Source Dump...

  1. #31
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    All your attempts so far fail to hit the last byte of the array with a \0. In one way or another, you wrote past the end of the array.

    It is an obvious bug which needs to be fixed before even beginning to discuss what else may (or may not) need to be looked at.
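
    A minimal sketch of the fix, not taken from the thread's code: `fakeRecv` and `readOneChunk` are hypothetical names, with `fakeRecv` standing in for `recv` so the example runs without a socket. The point is to ask for one byte less than the array holds, so the terminator always lands inside it.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for recv(): copies up to `len` bytes of `src`
// into `buf` and returns how many bytes were copied.
int fakeRecv(const std::string& src, char* buf, int len) {
    int n = static_cast<int>(src.size());
    if (n > len) n = len;
    std::memcpy(buf, src.data(), static_cast<std::size_t>(n));
    return n;
}

// Reads one piece of data into a fixed array and terminates it safely.
std::string readOneChunk(const std::string& src) {
    char buf[8];
    // Ask for one byte less than the array holds, so buf[n] is always in range.
    int n = fakeRecv(src, buf, static_cast<int>(sizeof buf) - 1);
    buf[n] = '\0';  // worst case n == 7, the last element of buf[8]
    return std::string(buf);
}
```

    Passing the full `sizeof buf` as the length instead would let `n` reach 8, and `buf[8] = '\0'` would then write one byte past the end of the array.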
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  2. #32
    Registered User
    Join Date
    Nov 2007
    Posts
    17
    okay, it's added as you showed. now what

  3. #33
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    I dunno, what are you seeing in the rest of the code (observations on what is, or isn't happening), that kind of thing.

    Rather than a single massive main(), consider a more layered approach where each function performs one specific task (managing the connection, managing the transfer). A bit like this.
    Code:
    void doTransfer( SOCKET D2JSP, char *request ) {
        char dbuf[513];
        int dRecv;
    
        dSend(D2JSP, request);
        
        ofstream dparse ("d2jsp.txt");
        if (!dparse.is_open()) { cout<<"Failed to open d2jsp.txt...\n"; return; }
        do {
           dRecv = recv(D2JSP, dbuf, sizeof dbuf - 1, 0);
           if (dRecv == SOCKET_ERROR) {
              cout<<"Failed to receive data through D2JSP... " << WSAGetLastError() << endl;
              break;
           }
           dbuf[dRecv] = '\0';
           dparse << dbuf;
        } while (dRecv > 0);
                           
        dparse.close();
    }
    
    void doConnection ( ) {
        SOCKET D2JSP;
        sockaddr_in D2;
        
        hostent* dHost;
        
        char *dIP, *request;
        unsigned short dline = 0, cwrite = 0;
        request = "GET /index.php?showforum=168 HTTP/1.1\r\nHost: forums.d2jsp.org\r\n\r\n";
    
        D2JSP = socket(AF_INET, SOCK_STREAM, 0);
        if (D2JSP == INVALID_SOCKET) {
           cout<<"Failed to make D2JSP socket... " << WSAGetLastError() << endl;
           return;
        }
        
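        dHost = gethostbyname("forums.d2jsp.org");
        if (dHost == NULL) {
           cout<<"Failed to resolve forums.d2jsp.org... " << WSAGetLastError() << endl;
           closesocket(D2JSP);
           return;
        }
        dIP = inet_ntoa (*(in_addr*) dHost->h_addr);
        cout<<"D2JSP IP: " << dIP << endl;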
        
        D2.sin_family = AF_INET;
        D2.sin_addr.s_addr = inet_addr (dIP);
        D2.sin_port = htons (80);
        
        if (connect(D2JSP, (sockaddr*) &D2, sizeof(D2)) == SOCKET_ERROR) {
           cout<<"Failed to connect to D2JSP socket... " << WSAGetLastError() << endl;
           shutdown(D2JSP, 2);
           closesocket(D2JSP);
           return;
        }
    
        doTransfer( D2JSP, request );
    
        shutdown(D2JSP, 2);
        closesocket(D2JSP); 
    }
    
    
    
    int main()
    { 
        WSADATA WsaDat;
        
        if ( WSAStartup(MAKEWORD(2, 0), &WsaDat) != 0 ) {
            cout<<"WSAStartup failed to initialize...\n";
        } else {
            doConnection();
            WSACleanup();
        }
        return 0;
    }

  4. #34
    Registered User
    Join Date
    Nov 2007
    Posts
    17
    Thanks for the info but I'll show you what usually happens when I dump the html source into a text file...

    Original Line: <td><a href="index.php?showuser=387288">Villaloboos</a><br><span class="desc">Mon, Nov 19 2007, 10:04pm</span></td>

    Cut Line:
    Line 1: <td><a href="index.php?showuser=387288">Villaloboos</a><br><span class=
    Line 2: 35e
    Line 3: "desc">Mon, Nov 19 2007, 10:04pm</span></td>

    There is always a short string of seemingly random digits and letters (Line 2) splitting the original line, and the rest of the line (Line 3) ends up on its own line below it. Have you ever seen or encountered this before?

  5. #35
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    I've no idea, but if it's any consolation, I see the same effect.

    If I do this though
    dparse << dbuf << "--\n--\n";
    I can see that the random data has nothing to do with the boundaries of the buffer (it always seems to be in the middle somewhere).

    I've seen the same page in Firefox, and there's no sign of those extra chars.

    Try to trace the communications with Wireshark and compare your code with a standard browser.

  6. #36
    int x = *((int *) NULL); Cactus_Hugger's Avatar
    Join Date
    Jul 2003
    Location
    Banks of the River Styx
    Posts
    902
    Do you know what encoding the server is sending back? If I had to guess, I'd say it's sending back chunked encoding, and that's what you're seeing. (Firefox has already decoded it by the time you hit View Source.) You'll have to look it up in the HTTP RFC, and decode it. (Chunked encoding has to be the #1 reason why I usually opt for libcurl when doing HTTP work.)

    You should be able to verify whether you're getting chunked encoding, since it'll show up in the response headers (as "Transfer-Encoding: chunked\r\n"). See this
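
    A rough sketch of that header check, assuming the whole response (or at least the complete header block) has already been collected into a `std::string`. In HTTP/1.1 the header that signals chunking is `Transfer-Encoding`. The names `toLower` and `isChunked` are made up for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-case an ASCII string; header names and the "chunked" token
// are case-insensitive.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns true if the header block of `response` contains a
// Transfer-Encoding line that mentions "chunked".
bool isChunked(const std::string& response) {
    std::size_t headerEnd = response.find("\r\n\r\n");
    if (headerEnd == std::string::npos) return false;  // headers not complete yet
    std::string headers = toLower(response.substr(0, headerEnd));
    // The status line comes first, so a header line always follows a CRLF.
    std::size_t pos = headers.find("\r\ntransfer-encoding:");
    if (pos == std::string::npos) return false;
    std::size_t lineEnd = headers.find("\r\n", pos + 2);
    std::string line = (lineEnd == std::string::npos)
                           ? headers.substr(pos)
                           : headers.substr(pos, lineEnd - pos);
    return line.find("chunked") != std::string::npos;
}
```

    For example, `isChunked("HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n...")` returns true, while a response that only carries `Content-Length` returns false.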

    As an aside, I think servers can also send back other encodings, such as gzip compression (that one is announced in a "Content-Encoding" header).
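
    For the decoding step, here is a minimal sketch of a chunked-body decoder. It assumes the headers have already been stripped and the full body is in memory, and it ignores chunk extensions and trailers. `decodeChunked` is a hypothetical helper, not part of Winsock or any library. Each chunk is a hexadecimal size line, CRLF, that many bytes of data, and another CRLF; a size of 0 ends the body. The stray `35e` in post #34 is consistent with such a size line (0x35e = 862 bytes).

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Decodes an HTTP/1.1 chunked body (headers already removed).
// Stops at the zero-size chunk or at the first malformed or incomplete chunk.
std::string decodeChunked(const std::string& body) {
    std::string out;
    std::size_t pos = 0;
    while (true) {
        std::size_t lineEnd = body.find("\r\n", pos);
        if (lineEnd == std::string::npos) break;          // size line incomplete
        // The size line is hexadecimal, e.g. "35e" means 862 bytes follow.
        std::string sizeLine = body.substr(pos, lineEnd - pos);
        char* end = nullptr;
        unsigned long size = std::strtoul(sizeLine.c_str(), &end, 16);
        if (end == sizeLine.c_str()) break;               // no hex digits found
        if (size == 0) break;                             // last chunk
        std::size_t dataStart = lineEnd + 2;
        if (size > body.size() - dataStart) break;        // chunk data incomplete
        out.append(body, dataStart, size);
        pos = dataStart + size + 2;                       // skip the CRLF after the data
    }
    return out;
}
```

    Chunk boundaries do not line up with `recv` boundaries, which matches the observation that the stray characters show up in the middle of a buffer. That is why this sketch works on the accumulated body and not on each `recv` result separately.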
    Last edited by Cactus_Hugger; 11-19-2007 at 09:07 PM.
    long time; /* know C? */
    Unprecedented performance: Nothing ever ran this slow before.
    Any sufficiently advanced bug is indistinguishable from a feature.
    Real Programmers confuse Halloween and Christmas, because dec 25 == oct 31.
    The best way to accelerate an IBM is at 9.8 m/s/s.
    recursion (re - cur' - zhun) n. 1. (see recursion)

  7. #37
    Registered User
    Join Date
    Nov 2007
    Posts
    17
    Thanks for the response, Cactus_Hugger. Yes, it is in fact chunked, and your response helps me understand more of the problem I was trying to track down. Thanks to the guys for the help on cleaning up my code also. I'll have to look into this and check it out. Thanks again.
    Last edited by blake_; 11-19-2007 at 11:05 PM.
