Thread: read html page text file

  1. #1
    Registered User
    Join Date
    Nov 2008
    Posts
    222

    read html page text file

    Hi

    How can I read text from a web page asynchronously from within C++? After connecting to the internet by the usual means, you enter a URL string,

    e.g. "http://www.bbc.co.uk/test/run.txt", and the C++ program reads the content, e.g. "<title>BBC Website</title><body>This is the BBC website....", from the internet.


    I want to connect to the website every 60 minutes and fetch only the first 100 lines of the run.txt file mentioned above. How can I do this using C++ and asynchronous Winsock?

    The code I have so far is given below. Please help me.


    Code:
    
    
    #define WIN_OS
    #define _DEBUG_PRINT(X)   /* X */
    
    // Common headers
    #include <iostream>
    #include <string>
    #include <stdio.h>   // sprintf
    #include <string.h>  // strlen
    #include <stdlib.h>
    #include <assert.h>
    
    #ifdef WIN_OS
     #include <Winsock2.h>
    #endif
    
    
    #define SEND_RQ(MSG) \
                    /*cout<<send_str;*/ \
      send(sock,MSG,strlen(MSG),0);
    
    
    using namespace std;
    // Usage: <exe> hostname api parameters
    // The string arguments are only read, so they are taken as const char*.
    int request (const char* hostname, const char* api, const char* parameters, string& message)
    {
    
    	#ifdef WIN_OS
    	{
    		WSADATA	WsaData;
    		WSAStartup (0x0101, &WsaData);
    	}
    	#endif
    
        sockaddr_in       sin;
        int sock = socket (AF_INET, SOCK_STREAM, 0);
        if (sock == -1) {
    		return -100;
    	}
        sin.sin_family = AF_INET;
        sin.sin_port = htons( (unsigned short)80);
    
        struct hostent * host_addr = gethostbyname(hostname);
        if(host_addr==NULL) {
          _DEBUG_PRINT( cout<<"Unable to locate host"<<endl );
          return -103;
        }
        sin.sin_addr.s_addr = *((int*)*host_addr->h_addr_list) ;
        _DEBUG_PRINT( cout<<"Port :"<<sin.sin_port<<", Address : "<< sin.sin_addr.s_addr<<endl);
    
        if( connect (sock,(const struct sockaddr *)&sin, sizeof(sockaddr_in) ) == -1 ) {
         _DEBUG_PRINT( cout<<"connect failed"<<endl ) ;
         return -101;
        }
    
     string send_str;
    
     SEND_RQ("POST ");
     SEND_RQ(api);
     SEND_RQ(" HTTP/1.0\r\n");
     SEND_RQ("Accept: */*\r\n");
     SEND_RQ("User-Agent: Mozilla/4.0\r\n");
    
     char content_header[100];
     sprintf(content_header,"Content-Length: %d\r\n",(int)strlen(parameters));  // strlen returns size_t; cast to match %d
     SEND_RQ(content_header);
     SEND_RQ("Accept-Language: en-us\r\n");
     SEND_RQ("Accept-Encoding: gzip, deflate\r\n");
     SEND_RQ("Host: ");
     SEND_RQ("hostname");
     SEND_RQ("\r\n");
     SEND_RQ("Content-Type: application/x-www-form-urlencoded\r\n");
     
     // If you need to send basic authorization:
     //string Auth        = "username:password";
     // Figure out a way to encode the text as base64!
     //string AuthInfo    = base64_encode(reinterpret_cast<const unsigned char*>(Auth.c_str()),Auth.length());
     //string sPassReq    = "Authorization: Basic " + AuthInfo;
     //SEND_RQ(sPassReq.c_str());
    
     SEND_RQ("\r\n");
     SEND_RQ("\r\n");
     SEND_RQ(parameters);
     SEND_RQ("\r\n");
    
     _DEBUG_PRINT(cout<<"####HEADER####"<<endl);
     char c1[1];
     int l = 0, line_length = 0;  // line_length is tested before it is first assigned, so it must start at 0
     bool loop = true;
     bool bHeader = false;
    
     while(loop) {
       l = recv(sock, c1, 1, 0);
       if(l<0) loop = false;
       if(c1[0]=='\n') {
           if(line_length == 0) loop = false;
    
           line_length = 0;
           if(message.find("200") != string::npos)
    	       bHeader = true;
    
       }
       else if(c1[0]!='\r') line_length++;
       _DEBUG_PRINT( cout<<c1[0]);
       message += c1[0];
     }
    
     message="";
     if(bHeader) {
    
         _DEBUG_PRINT( cout<<"####BODY####"<<endl) ;
         char p[1024];
         while((l = recv(sock,p,1023,0)) > 0)  {
             _DEBUG_PRINT( cout.write(p,l)) ;
    	     p[l] = '\0';
    	     message += p;
         }
    
         _DEBUG_PRINT( cout << message.c_str());
     } else {
    	 return -102;
     }
    
    
     #ifdef WIN_OS
       WSACleanup( );
     #endif
    
     return 0;
    }
    
    
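    int main(){
      string message;
      int result = request ("www.somesite.com", "/post_url.pl", "search=hello&date=todat", message);
      // message contains response!
      cout << "Result of query :" << result << endl;
      cout << "Server returned :" << message << endl;
      return 0;
    }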

  2. #2
    Registered User
    Join Date
    Nov 2008
    Posts
    222
    Do I need to download that .txt file from the HTML page and then read it? Please suggest some ideas.

  3. #3
    Registered User
    Join Date
    Aug 2005
    Location
    Austria
    Posts
    1,990
    Quote Originally Posted by leo2008
    Do I need to download that .txt file from the HTML page and then read it? Please suggest some ideas.
    Yes, you have to download the .txt file; otherwise you don't have anything to read.
    Kurt

  4. #4
    Registered User
    Join Date
    Nov 2008
    Posts
    222

    Quote Originally Posted by ZuK
    Yes, you have to download the .txt file; otherwise you don't have anything to read.
    Kurt
    How do I asynchronously connect to the HTML page, download the .txt file, and then read it? Does some timer need to be set?

    Is there any code for reference? Please suggest.

  5. #5
    Registered User
    Join Date
    Aug 2005
    Location
    Austria
    Posts
    1,990
    You have posted a lot of code. Please explain in what way it fails to do what you want.
    Kurt

  6. #6
    Salem (and the hat of int overfl)
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > I want to connect to the website every 60 minutes and fetch only the first 100 lines of the run.txt file mentioned above. How can I do this using C++ and asynchronous Winsock?
    A better question would be why do you need async winsock?
    It's not like you're writing a high performance program, or have anything else better to do while you're waiting for the web content to arrive.

    Does the code you have fetch the required information just once?
    Because you may as well turn your main into
    Code:
    int main(){
      string message;
      while ( true ) {
        request ("www.somesite.com", "/post_url.pl", "search=hello&date=todat", message);
        process( message );
        Sleep( 60 * 60 * 1000 );  // 1 hour, in milliseconds
      }
    }
    Or even just this, and use your OS task scheduler to run it once every hour.
    Code:
    int main(){
      string message;
      request ("www.somesite.com", "/post_url.pl", "search=hello&date=todat", message);
      process( message );
      return 0;
    }
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.
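
    The process( message ) call above is left undefined in the thread. As a rough sketch of what it might do for the "first 100 lines" requirement, the helper below keeps only the leading lines of the downloaded text. The name first_lines and its parameters are assumptions made for illustration, not something from the original code.

    ```cpp
    #include <cstddef>
    #include <sstream>
    #include <string>

    // Hypothetical helper (not part of the posted code): returns at most
    // max_lines lines from the start of text, each terminated by '\n'.
    std::string first_lines(const std::string& text, std::size_t max_lines)
    {
        std::istringstream in(text);
        std::string line;
        std::string out;
        std::size_t count = 0;
        while (count < max_lines && std::getline(in, line)) {
            // Drop the '\r' left over from "\r\n" line endings.
            if (!line.empty() && line[line.size() - 1] == '\r')
                line.erase(line.size() - 1);
            out += line;
            out += '\n';
            ++count;
        }
        return out;
    }
    ```

    With this in place, process( message ) could simply work on first_lines( message, 100 ). Note that the whole file is still downloaded; only the part that is kept is limited.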

  7. #7
    Registered User
    Join Date
    Nov 2008
    Posts
    222
    I am not able to fetch the text file from the HTML page. What is the reason? Is it because I use POST instead of GET?

  8. #8
    Registered User
    Join Date
    Nov 2008
    Posts
    222
    The output when I run the code is below. Please help me.

    Result of query :-102
    Server returned :

  9. #9
    Salem (and the hat of int overfl)
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > I am not able to fetch the text file from the HTML page. What is the reason? Is it because I use POST instead of GET?
    It might have something to do with blind copy/pasting of someone else's code.

    Wireshark · Go Deep.
    Use this to compare and contrast what your program does with a URL, and what an actual browser does with the same URL.

    For example, you might wonder
    > SEND_RQ("POST ");
    > SEND_RQ(api);
    Where is the \r\n here?


    > SEND_RQ("Host: ");
    > SEND_RQ("hostname");
    Why are you sending a literal string, and not the parameter hostname?
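
    To make the points above concrete, here is a minimal sketch of a request for a plain file such as run.txt. It assumes a GET is enough (there is no form data to post), puts the real host name in the Host header, ends the headers with exactly one blank line, and leaves out Accept-Encoding because the posted code has no way to decompress gzip. The function name build_get_request is hypothetical.

    ```cpp
    #include <string>

    // Hypothetical helper (not part of the posted code): builds a minimal
    // HTTP/1.0 GET request for the given host and path.
    std::string build_get_request(const std::string& hostname, const std::string& path)
    {
        std::string req;
        req += "GET " + path + " HTTP/1.0\r\n";
        req += "Host: " + hostname + "\r\n";   // the parameter, not the literal "hostname"
        req += "Accept: */*\r\n";
        req += "Connection: close\r\n";
        req += "\r\n";                         // exactly one blank line ends the headers
        return req;
    }
    ```

    The result could then be sent in one call, e.g. send(sock, req.c_str(), (int)req.size(), 0), instead of many small SEND_RQ pieces.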

  10. #10
    Registered User
    Join Date
    Nov 2008
    Posts
    222
    After making some modifications, I am still stuck after line 75 while debugging. The only output I get is

    Port :20480, Address : 988743242
    ####HEADER####

    What could be the reason here?

  11. #11
    Registered User
    Join Date
    Nov 2008
    Posts
    222
    There seem to be some problems with the HTTP header here. I now get only this output:
    Port :20480, Address : 988743242
    ####HEADER####
    ╠Result of query :-102
    Server returned :
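
    One possible reading of this output: the header loop in the posted code appends c1[0] to the output even when recv returns 0 or -1, so the stray ╠ may be a buffer byte that was never filled in, and the success check looks for "200" anywhere in the header text rather than in the status line. The sketch below shows a stricter way to interpret a response. It assumes the whole response has already been collected into one string by calling recv until it returns 0 or less; the names http_status and http_body are hypothetical.

    ```cpp
    #include <cstdlib>
    #include <string>

    // Hypothetical helper: returns the status code from the first line of a
    // raw HTTP response (e.g. "HTTP/1.0 200 OK"), or -1 if the text does not
    // start with "HTTP/" followed by a space on the first line. If no number
    // follows the space, atoi yields 0.
    int http_status(const std::string& response)
    {
        if (response.compare(0, 5, "HTTP/") != 0) return -1;
        std::string::size_type line_end = response.find('\n');
        std::string::size_type space = response.find(' ');
        if (space == std::string::npos) return -1;
        if (line_end != std::string::npos && space > line_end) return -1;
        return std::atoi(response.c_str() + space + 1);
    }

    // Hypothetical helper: returns everything after the first blank line,
    // i.e. the body, or an empty string if no blank line was received.
    std::string http_body(const std::string& response)
    {
        std::string::size_type pos = response.find("\r\n\r\n");
        if (pos == std::string::npos) return std::string();
        return response.substr(pos + 4);
    }
    ```

    Checking http_status(response) == 200 avoids a false match on something like "Content-Length: 200" in a 404 reply, and an empty or malformed response is reported as -1 instead of being printed.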
