Thread: Partial web page downloading

  1. #1
    Registered User
    Join Date
    Mar 2006
    Posts
    10

    Partial web page downloading

    I’m trying to download part of the HTML code of a web site. I have attempted to use the libcurl library and have managed to do this for some web pages using its range option. However, I cannot get it to work with the web page I actually need; it seems that the server does not support ranges. Does anyone know a way of getting round this?
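
    This is roughly what I have working on servers that do support ranges (a minimal sketch using libcurl's easy interface; the URL is just a placeholder):

    Code:
    #include <curl/curl.h>
    #include <iostream>

    int main()
    {
    	curl_global_init(CURL_GLOBAL_DEFAULT);
    	CURL *curl = curl_easy_init();
    	if (curl)
    	{
    		curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/index.html");
    		// Ask for the first 1024 bytes only; a server that supports
    		// ranges answers 206 Partial Content with just that slice.
    		curl_easy_setopt(curl, CURLOPT_RANGE, "0-1023");
    		CURLcode res = curl_easy_perform(curl);  // body goes to stdout by default
    		if (res != CURLE_OK)
    			std::cerr << "curl error: " << curl_easy_strerror(res) << std::endl;
    		curl_easy_cleanup(curl);
    	}
    	curl_global_cleanup();
    	return 0;
    }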

    The part of the web page I want to download is at the beginning of the file. If it’s not possible to get round the range issue, is it possible to terminate the download after a specified number of bytes has been received? I have looked through the libcurl documentation but cannot determine how to do this.

    Any help would be much appreciated.

    Regards god_of_war

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > it would seem that the server does not support ranges
    IIRC, ranges appeared in a later spec for HTTP (byte ranges were standardised in HTTP/1.1), so check the version number returned by the server, and check in which RFC (www.rfc-editor.org) ranges were added to the HTTP protocol.
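
    A quick way to check is to send a range request and look at the status code that comes back: 206 Partial Content means the server honoured the range, a plain 200 OK means it ignored it and sent the whole thing. Something along these lines with libcurl (a sketch; the URL is a placeholder):

    Code:
    #include <curl/curl.h>
    #include <iostream>

    // Throw the body away; only the status code matters for this test.
    static size_t discard(char *, size_t size, size_t nmemb, void *)
    {
    	return size * nmemb;
    }

    int main()
    {
    	curl_global_init(CURL_GLOBAL_DEFAULT);
    	CURL *curl = curl_easy_init();
    	if (curl)
    	{
    		long code = 0;
    		curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/");
    		curl_easy_setopt(curl, CURLOPT_RANGE, "0-0");  // request a single byte
    		curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, discard);
    		if (curl_easy_perform(curl) == CURLE_OK)
    		{
    			curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &code);
    			std::cout << (code == 206 ? "server honours ranges"
    			                          : "server ignored the range") << std::endl;
    		}
    		curl_easy_cleanup(curl);
    	}
    	curl_global_cleanup();
    	return 0;
    }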
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    It may also simply be that the server won't do ranges for dynamic pages.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  4. #4
    Registered User
    Join Date
    Mar 2006
    Posts
    10
    Thank you for the replies.

    From further research I have come to the conclusion that the server I am trying to download from doesn't support ranges.

    For this reason I am now focusing on downloading only part of the web page from the beginning, for instance the first 1024 bytes, or downloading one byte at a time, analysing it, and then deciding whether to download the next byte or not. However, I have had no success. If someone could suggest a method, or a library to use, the help would be much appreciated.

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    If the server doesn't do it, then you're stuck. No library change is going to fix what parts of the protocol are implemented by the server.

    I suppose you could hassle the site owner to upgrade their server to support ranges, which might work if you're paying for this information.

    Maybe there's some other field which might interest you, like Last-Modified?
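
    A HEAD request gets you just the headers with no body at all, so you could look at Last-Modified without transferring the page. With libcurl that's CURLOPT_NOBODY (a sketch; the URL is a placeholder):

    Code:
    #include <curl/curl.h>

    int main()
    {
    	curl_global_init(CURL_GLOBAL_DEFAULT);
    	CURL *curl = curl_easy_init();
    	if (curl)
    	{
    		curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/");
    		curl_easy_setopt(curl, CURLOPT_NOBODY, 1L);  // send HEAD instead of GET
    		curl_easy_setopt(curl, CURLOPT_HEADER, 1L);  // write the headers to stdout
    		curl_easy_perform(curl);
    		curl_easy_cleanup(curl);
    	}
    	curl_global_cleanup();
    	return 0;
    }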
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #6
    Registered User
    Join Date
    Mar 2006
    Posts
    10
    I agree I am stuck because the server will not let me control the download from its end. But is there not some way to examine what I've downloaded as it arrives and then kill the connection once I've received the data that I want?

    An example of what I am aiming for: when downloading a file, the data is received sequentially from the beginning. If your connection drops before the whole file has been downloaded, you only get part of it. Is there a way to do this in a controlled manner, i.e. to terminate the connection after a certain number of bytes, or a certain series of characters, has been received?

  7. #7
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    So how big are these files you're downloading, and over what kind of connection?

    A few KB or tens of KB over broadband just isn't worth the effort - just take what you need and trash the rest until the end of the stream. Processing that isn't going to take up much CPU time or network bandwidth.

    A few MB over dialup would be another matter, if all you wanted was the head of the file.

    Whilst I'm sure you can unilaterally decide to close the connection, you have to consider that this might have a negative impact on the server at the other end (say waiting for a connection to timeout after a long period rather than closing it cleanly after a short period). Consider that if you do this a lot that the site may regard this as a denial of service attack on your part.

    The network is robust enough that connections can drop for random network breakages, but to exploit that for your own sense of urgency is probably not on.

    I would suggest you read the RFC in detail to find out what is and is not allowed within the normal HTTP protocol.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  8. #8
    C++ Developer XSquared's Avatar
    Join Date
    Jun 2002
    Location
    Ontario, Canada
    Posts
    2,718
    Obligatory RFC links, since I had them up anyway:

    HTTP/1.0 (RFC 1945)
    HTTP/1.1 (RFC 2616)
    Naturally I didn't feel inspired enough to read all the links for you, since I already slaved away for long hours under a blistering sun pressing the search button after typing four whole words! - Quzah

    You. Fetch me my copy of the Wall Street Journal. You two, fight to the death - Stewie

  9. #9
    Yes, my avatar is stolen anonytmouse's Avatar
    Join Date
    Dec 2002
    Posts
    2,544
    The libcurl documentation says:
    CURLOPT_WRITEFUNCTION

    Function pointer that should match the following prototype: size_t function( void *ptr, size_t size, size_t nmemb, void *stream); This function gets called by libcurl as soon as there is data received that needs to be saved. The size of the data pointed to by ptr is size multiplied with nmemb, it will not be zero terminated. Return the number of bytes actually taken care of. If that amount differs from the amount passed to your function, it'll signal an error to the library and it will abort the transfer and return CURLE_WRITE_ERROR.
    Whilst I'm sure you can unilaterally decide to close the connection, you have to consider that this might have a negative impact on the server at the other end (say waiting for a connection to timeout after a long period rather than closing it cleanly after a short period). Consider that if you do this a lot that the site may regard this as a denial of service attack on your part.
    I press the Stop button all the time. I doubt that any HTTP server is going to have a problem with stopping partially complete downloads, given that this functionality is built into every browser.
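
    In other words, returning a short count from the write callback makes libcurl abort the transfer. A sketch that keeps only the first 1KB (the 1024-byte cutoff is arbitrary, and the URL is a placeholder):

    Code:
    #include <curl/curl.h>
    #include <string>
    #include <iostream>

    // Collect data until we have 1KB, then return 0 so that libcurl
    // aborts the transfer with CURLE_WRITE_ERROR.
    static size_t on_data(char *ptr, size_t size, size_t nmemb, void *userdata)
    {
    	std::string *out = static_cast<std::string *>(userdata);
    	out->append(ptr, size * nmemb);
    	if (out->size() >= 1024)
    		return 0;  // short count => abort the transfer
    	return size * nmemb;
    }

    int main()
    {
    	curl_global_init(CURL_GLOBAL_DEFAULT);
    	CURL *curl = curl_easy_init();
    	if (curl)
    	{
    		std::string body;
    		curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/");
    		curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, on_data);
    		curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    		CURLcode res = curl_easy_perform(curl);
    		// CURLE_WRITE_ERROR is expected when we cut the transfer short.
    		if (res != CURLE_OK && res != CURLE_WRITE_ERROR)
    			std::cerr << curl_easy_strerror(res) << std::endl;
    		std::cout << body.substr(0, 1024) << std::endl;
    		curl_easy_cleanup(curl);
    	}
    	curl_global_cleanup();
    	return 0;
    }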

  10. #10
    Registered User
    Join Date
    Mar 2006
    Posts
    10
    Thank you all for the replies.

    Salem, I understand what you mean about cutting the connection possibly having a negative impact on the server, especially if I did it many times in a row, and I certainly don’t want to do that. However, as anonytmouse says, all web browsers have the ability to stop a transfer built in, and to follow a different link before all the data has been received from the server. In any case, is it not possible to let the server send the data but simply not receive it all at my end while telling the server I did, or to tell it to stop sending part way through the transfer, thus avoiding a timeout?

    Anonytmouse, thank you for the information on the libcurl CURLOPT_WRITEFUNCTION option; however, I could not get it to work. In the end I decided to download the file using sockets directly, to give myself more flexibility, and wrote the following code, but I cannot get it to work either.

    Code:
    // Project linked with libwsock32.a
    
    #include <winsock2.h>
    #include <cstdlib>   // system()
    #include <cstring>   // strlen()
    #include <string>
    #include <iostream>
    
    ////////////////////////////////////////////////////////////
    
    int main()
    {
    	char buff[512];
    
    	WSADATA wsaData;
    	struct hostent *hp;
    	unsigned int addr;
    	struct sockaddr_in server;
    	const char servername[] = "www.google.co.uk";
    //	const char filepath[] = "/index.html";
    	const char filepath_send[] = "GET /index.html \n";
    	
    	int wsaret=WSAStartup(0x101,&wsaData);
    	if(wsaret)	
    		return 0;
    
    	SOCKET conn;
    	conn=socket(AF_INET,SOCK_STREAM,IPPROTO_TCP);
    	if(conn==INVALID_SOCKET)
    	{
            std::cout << "socket() error: " << WSAGetLastError() << std::endl;
            system ("PAUSE");
    		return 0;
        }
    
    	if(inet_addr(servername)==INADDR_NONE)
    	{
    		hp=gethostbyname(servername);
    	}
    	else
    	{
    		addr=inet_addr(servername);
    		hp=gethostbyaddr((char*)&addr,sizeof(addr),AF_INET);
    	}
    	if(hp==NULL)
    	{
            std::cout << "gethostbyname() error: " << WSAGetLastError() << std::endl;
            system ("PAUSE");
    		closesocket(conn);
    		return 0;
    	}
    
    	server.sin_addr.s_addr=*((unsigned long*)hp->h_addr);
    	server.sin_family=AF_INET;
    	server.sin_port=htons(80);
    	
    	int y;
    	y = connect(conn,(struct sockaddr*)&server,sizeof(server));
    	if( y == SOCKET_ERROR)
    	{
    		closesocket(conn);
    		std::cout << "connect() error: " << WSAGetLastError() << std::endl;
    		system ("PAUSE");
    		return 0;	
    	}
    	
    	
    //	std::cout << filepath_send << std::endl;
    	y = send(conn,filepath_send,strlen(filepath_send),0);
    	if( y == SOCKET_ERROR)
    	{
    		closesocket(conn);
    		std::cout << "send() error: " << WSAGetLastError() << std::endl;
    		system ("PAUSE");
    		return 0;	
    	}
    	
    	while(1)
    	{
            std::cout << "GOT HERE" << std::endl;
    		y = recv(conn, buff, sizeof(buff) - 1, 0); // leave room to null-terminate
    		if (y == SOCKET_ERROR)
    		{
    			std::cout << "recv() error: " << WSAGetLastError() << std::endl;
    			break;
    		}
    
    		std::cout << "Bytes recieved: " << y << std::endl;
    
    		if (y == 0)
    			break;
    			
    		std::cout << buff << std::endl;
    
    	}
    
    	closesocket(conn);
    
    	WSACleanup();
    
    
    system ("PAUSE");
    
    return 0;
    }
    If I try www.cprogramming.com/begin.html I get 404 Not Found
    If I try www.google.co.uk/index.html I get an error from the recv() part of the code: 10053 which is “Software caused connection abort. A connection was aborted by the software in your machine, possibly due to a TCP/IP configuration error, data transmission time-out or protocol error.”

    Any help would be much appreciated.

  11. #11
    Yes, my avatar is stolen anonytmouse's Avatar
    Join Date
    Dec 2002
    Posts
    2,544
    The request should look something like this:
    Code:
    	const char filepath_send[] = "GET /index.html HTTP/1.1\r\nHost: www.google.com\r\n\r\n";
    Note that each line is terminated with a CR-LF and that the request is terminated with a blank line. Although a Host field was not required with HTTP/1.0, IP addresses that house multiple web sites will fail without it.

  12. #12
    Registered User
    Join Date
    Mar 2006
    Posts
    10
    Anonytmouse, again thank you for the response; the code now works well and allows me to download a web page.

    However, any help from anybody on achieving my initial aim would be much appreciated: to download part of the beginning of a web page that is on a server which doesn't support ranges.

    From the code it does seem possible to download, say, the first 1KB of a web page by adding the following code to the while loop (with a running total).

    Code:
    total += y;        // y is only the count from this one recv() call
    if (total >= 1024) // total starts at 0 before the loop
    	break;
    However, would this cause the server to time out, as Salem suggested previously? Instead of just closing the connection part way through the transfer, is it possible to send a message telling the server to close the connection before it has sent all the data, or to send a new GET message that overrides the previous one?

    I have been reading about persistent connections and the "Connection: close" header, so it does seem possible to tell a server to close the connection. Though this is not exactly what I am trying to achieve, would it be possible to use this somehow, or something similar?
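
    For instance, I mean something like this in the request (just a sketch of what I have in mind, based on the request line from earlier):

    Code:
    	// Ask the server not to keep the connection alive after this response.
    	const char filepath_send[] =
    		"GET /index.html HTTP/1.1\r\n"
    		"Host: www.google.co.uk\r\n"
    		"Connection: close\r\n"
    		"\r\n";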

  13. #13
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    I agree with anonytmouse. Just close the connection on your side. The server will cope.

    You cannot send any message to the server that will make it stop sending midway through the document. The Connection header is only a hint about what to do with the connection once the current request has completed: whether the server should close it or keep it open, waiting for another request.
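
    With your socket code that just means counting what you've read and closing when you have enough, something like this (a sketch; the 1KB cutoff is arbitrary):

    Code:
    	int total = 0;
    	while (1)
    	{
    		int y = recv(conn, buff, sizeof(buff) - 1, 0);
    		if (y <= 0)          // error, or the server finished sending
    			break;
    		buff[y] = '\0';      // recv() data is not null-terminated
    		std::cout << buff;
    		total += y;
    		if (total >= 1024)   // we have the part we wanted,
    			break;           // so just stop reading...
    	}
    	closesocket(conn);       // ...and close; the server will cope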
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law
