Read HTTP headers
I'm writing a small app that uses part of the HTTP protocol. I've read the HTTP 1.1 RFC and looked at examples, but the RFC doesn't mention a maximum header size (i.e. how large the HTTP headers can be in bytes), and the examples allocate anywhere from 512 bytes to 2K for the headers. So I was wondering: what is the best way to read HTTP headers from a client? Read until you encounter CRLFCRLF? If so, how should you read from the socket to do that efficiently?
Reading char-by-char from a socket seems rather "inefficient".
> Reading char-by-char from a socket seems rather "inefficient".
Perhaps, but then again trying something else might be premature optimisation disease.
I suppose you could create a wrapper class that calls recv() in blocks, splits the data into consecutive CRLF-terminated lines, and keeps any residual data at the end (which would be the first part of the content) in a buffer internal to the class.
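A rough sketch of such a wrapper in plain C, assuming POSIX sockets. The names line_reader, lr_init and lr_readline and the 2048-byte buffer are made up for illustration, not taken from any library:

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical buffered reader; names and sizes are illustrative only. */
struct line_reader {
    int    fd;          /* connected socket */
    size_t len;         /* bytes currently held in buf */
    char   buf[2048];   /* received but not yet consumed data */
};

void lr_init(struct line_reader *lr, int fd) {
    lr->fd = fd;
    lr->len = 0;
}

/* Copies the next CRLF-terminated line (without the CRLF) into out and
 * NUL-terminates it. Returns the line length (0 for the blank line that
 * ends the headers), or -1 on error, EOF, or a line that does not fit. */
long lr_readline(struct line_reader *lr, char *out, size_t outcap) {
    assert(outcap > 0);
    for (;;) {
        /* Look for CRLF in the data we already have. */
        for (size_t i = 0; i + 1 < lr->len; i++) {
            if (lr->buf[i] == '\r' && lr->buf[i + 1] == '\n') {
                if (i >= outcap) return -1;
                memcpy(out, lr->buf, i);
                out[i] = '\0';
                /* Keep the residual data (later lines or body) at the front. */
                lr->len -= i + 2;
                memmove(lr->buf, lr->buf + i + 2, lr->len);
                return (long)i;
            }
        }
        if (lr->len == sizeof lr->buf) return -1;   /* line too long */
        /* No complete line yet: fetch another block from the socket. */
        ssize_t n = recv(lr->fd, lr->buf + lr->len,
                         sizeof lr->buf - lr->len, 0);
        if (n <= 0) return -1;                      /* error or peer closed */
        lr->len += (size_t)n;
    }
}
```

Call lr_readline() until it returns 0. Whatever is left in lr->buf at that point (lr->len bytes) is the start of the content.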
Technically there is no limit on the header size, although in practice 2K is probably plenty. What specifically are you trying to do? If you just want to read a file from a server, look into InternetReadFile() in the Win32 API (WinINet). It handles all the header info and just returns the decoded file data.
It is. But consider that the header is not TOO big. It might be inefficient, but you're only reading the header part of the response character-by-character. Once you see the header terminator CR-LF-CR-LF, you know that the header is complete, and can switch to reading large blocks for the remainder of the response.
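For what it's worth, the byte-at-a-time version is only a few lines. This is a minimal sketch assuming POSIX sockets; read_header_bytewise is a made-up name:

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Reads one byte per recv() call until the header terminator "\r\n\r\n"
 * has been seen. Stores the header block (terminator included) in out,
 * NUL-terminated. Returns its length, or -1 on error, EOF, or overflow. */
long read_header_bytewise(int fd, char *out, size_t cap) {
    assert(cap > 0);
    size_t len = 0;
    while (len + 1 < cap) {              /* leave room for the NUL */
        ssize_t n = recv(fd, out + len, 1, 0);
        if (n <= 0) return -1;           /* error or peer closed */
        len++;
        if (len >= 4 && memcmp(out + len - 4, "\r\n\r\n", 4) == 0) {
            out[len] = '\0';
            return (long)len;
        }
    }
    return -1;                           /* header larger than the buffer */
}
```

One nice property: it never reads past the terminator, so the body is still sitting in the socket and you can switch to large recv() calls for it with no leftover data to juggle.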
A 3 kilobyte header would imply 3000 calls to recv(), but in the grand scheme, that's not so bad. If you had to read an entire 20 megabyte transmission character-by-character we'd have a different story.
Also remember that the network driver already does buffering, so it's not like it's receiving the data byte by byte.
True, but on many systems a call to recv() still implies a transition into and out of kernel mode. As the data comes in, the network stack places it in a buffer, but if the application only accesses it one byte at a time, it still has to ratchet around quite a bit.
I'd say ignore the issue for now -- if you start seeing problems, you'll have to come up with some sort of thin buffering layer.
I see, thanks everyone - very helpful. The fact is I'm only really interested in the header, one part even "Host: " :)
I've decided to read it into a 2K buffer on the stack, find CR-LF-CR-LF and lop the end off. Is that wise?
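Once the header block is in a buffer and NUL-terminated, pulling out "Host: " could look something like the sketch below. header_value is a hypothetical helper, not a library function. It compares names case-insensitively because HTTP field names are case-insensitive, and it relies on the POSIX strncasecmp(). Bear in mind that a single recv() may return only part of the headers, so the buffer has to be filled in a loop before this is called.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper: finds header field `name` in a NUL-terminated
 * header block and copies its value into out. Returns 0 on success,
 * -1 if the field is missing or the value does not fit. Trailing
 * whitespace in the value is not trimmed. */
int header_value(const char *headers, const char *name,
                 char *out, size_t cap) {
    assert(cap > 0);
    size_t nlen = strlen(name);
    const char *line = strstr(headers, "\r\n");   /* skip the request line */
    while (line != NULL) {
        line += 2;                                /* step over the CRLF */
        if (line[0] == '\r' && line[1] == '\n')
            break;                                /* blank line: end of headers */
        if (strncasecmp(line, name, nlen) == 0 && line[nlen] == ':') {
            const char *v = line + nlen + 1;
            while (*v == ' ' || *v == '\t') v++;  /* optional whitespace */
            size_t vlen = strcspn(v, "\r\n");
            if (vlen >= cap) return -1;
            memcpy(out, v, vlen);
            out[vlen] = '\0';
            return 0;
        }
        line = strstr(line, "\r\n");
    }
    return -1;
}
```

Usage would be something like header_value(buf, "Host", host, sizeof host), with buf holding the headers read so far.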
Or I was thinking of something a little more complex,
Good or crappy way?
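int r = 0;
char buf[512];          /* chunk size is arbitrary */
char *header = NULL;    /* grows as data arrives */
/* recv() may return fewer bytes than requested even when more data follows, so loop while r > 0 rather than testing r == sizeof(buf) */
while ((r = recv(sock, buf, sizeof(buf), 0)) > 0) {
    /* append the r bytes in buf to header (realloc and memcpy; the data is not NUL-terminated, so strcat is unsafe) */
    /* search 'header', starting 3 bytes before the newly added data, for \r\n\r\n; break if found */
}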
Seems more trouble than necessary. Read in a buffer of any size, then parse it for headers. If you reach the end of the buffer but not the end of the headers, copy the unprocessed parts to the beginning of the buffer and fill the rest of it with new data. Continue until you reach the end of the headers.
The only problem with this approach is when a single header is larger than the buffer. If you're not interested, you can just discard it, otherwise you must collect (and thus introduce state into your loop).
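To make the pattern concrete, here is a sketch in C assuming POSIX sockets. scan_headers, line_cb, host_capture and capture_host are all hypothetical names. It takes the "just discard it" route for any single line that is larger than the buffer:

```c
#include <assert.h>
#include <string.h>
#include <strings.h>      /* strncasecmp (POSIX) */
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical callback type: invoked once per complete line
 * (request line included). The line is NOT NUL-terminated. */
typedef void (*line_cb)(const char *line, size_t len, void *user);

/* Reads headers through one fixed buffer. Returns the number of body
 * bytes left at the front of buf once the blank line is seen, or -1 on
 * error or EOF. */
long scan_headers(int fd, char *buf, size_t cap, line_cb cb, void *user) {
    assert(cap >= 2);
    size_t len = 0;
    int skipping = 0;                 /* inside an oversized line */
    for (;;) {
        ssize_t n = recv(fd, buf + len, cap - len, 0);
        if (n <= 0) return -1;
        len += (size_t)n;

        size_t start = 0;             /* start of the current line */
        for (size_t i = 0; i + 1 < len; i++) {
            if (buf[i] != '\r' || buf[i + 1] != '\n') continue;
            if (skipping) {
                skipping = 0;         /* end of the discarded line */
            } else if (i == start) {  /* blank line: headers are done */
                len -= i + 2;
                memmove(buf, buf + i + 2, len);
                return (long)len;
            } else {
                cb(buf + start, i - start, user);
            }
            start = i + 2;
            i++;                      /* step over the '\n' */
        }
        /* Copy the unprocessed part to the beginning of the buffer. */
        len -= start;
        memmove(buf, buf + start, len);
        if (len == cap) {
            /* One line fills the whole buffer: discard it, but keep a
             * trailing '\r' in case the '\n' arrives next. */
            skipping = 1;
            buf[0] = buf[cap - 1];
            len = (buf[0] == '\r') ? 1 : 0;
        }
    }
}

/* Example callback: counts lines and remembers the Host value. */
struct host_capture { char host[256]; int lines; };

void capture_host(const char *line, size_t len, void *user) {
    struct host_capture *hc = user;
    hc->lines++;
    if (len > 5 && strncasecmp(line, "Host:", 5) == 0) {
        const char *v = line + 5;
        size_t vlen = len - 5;
        while (vlen > 0 && (*v == ' ' || *v == '\t')) { v++; vlen--; }
        if (vlen < sizeof hc->host) {
            memcpy(hc->host, v, vlen);
            hc->host[vlen] = '\0';
        }
    }
}
```

The only state carried between recv() calls is the unprocessed tail and the skipping flag, so memory use stays fixed no matter how many headers arrive.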
I don't really see how that is more trouble than my suggestion. Is it faster or less resource intensive?
Less resource intensive. You need less memory because you're not collecting all headers at once.
Hmm, thanks for that.
Why didn't they build some sort of header-length field into the HTTP protocol? They have Content-Length for the body, yet nothing for the headers? Bah :(
It would be extremely complicated to account for all the different header fields when computing the header size, since not all of them are necessarily generated by the web server. A CGI or other web application might insert its own headers into the response without the server's knowledge. So in general, the server cannot know how big the header is going to be, although it can usually tell how big the content is, if it's serving up a simple file.
And if it can't, there's the chunked transfer encoding.
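For reference, a chunked body is a series of "hex-size CRLF data CRLF" pieces ending with a zero-size chunk. Here is a rough sketch of decoding one that is already fully in memory. decode_chunked is a made-up name, and the sketch skips chunk extensions and ignores trailers:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Decodes a complete chunked body held in `in` into `out`.
 * Returns the decoded length, or -1 on malformed or truncated input
 * or if out is too small. */
long decode_chunked(const char *in, size_t inlen, char *out, size_t cap) {
    size_t pos = 0, outlen = 0;
    for (;;) {
        /* Chunk-size line: hex digits, optional ";extension", CRLF. */
        size_t size = 0;
        int digits = 0;
        while (pos < inlen && isxdigit((unsigned char)in[pos])) {
            int c = (unsigned char)in[pos++];
            if (++digits > 8) return -1;          /* implausibly large */
            size = size * 16 +
                   (size_t)(isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
        }
        if (digits == 0) return -1;
        /* Skip anything else on the size line. */
        while (pos + 1 < inlen && !(in[pos] == '\r' && in[pos + 1] == '\n'))
            pos++;
        if (pos + 1 >= inlen) return -1;
        pos += 2;
        if (size == 0) return (long)outlen;       /* last chunk */
        if (size > inlen - pos || size > cap - outlen) return -1;
        memcpy(out + outlen, in + pos, size);
        outlen += size;
        pos += size;
        /* Every chunk's data is followed by CRLF. */
        if (pos + 1 >= inlen || in[pos] != '\r' || in[pos + 1] != '\n')
            return -1;
        pos += 2;
    }
}
```

A real client would decode incrementally as data arrives from the socket. The all-in-memory version is only meant to show the framing.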
Headers are even more complicated. Proxies can insert, change and remove headers. Not that that matters that much - if it's already parsing, it might as well insert the new length.
But I think, in essence, it wasn't considered necessary. It's not that complicated. If you use Boost.Asio, for example, there's the read_until call that implements my pattern pretty much exactly.