Integration with Web?

**dluthcke** · 09-29-2009

How would I go about creating a program that will "read" a webpage, find a list on that page, and save it into variables? I'm looking to create a program that does that and I'm not even sure where to start.

**nadroj** · 09-29-2009

the only way to do it is to act like a web browser. basically, a web browser (i'll just call it the "client") is just a program that sends requests and reads responses over a socket, using (in general) the HTTP protocol.

example: you open your web browser (firefox, internet explorer, whatever it may be) and you go to http://www.google.ca. this is what happens in the background:

- your "browser" creates a socket on your computer and connects to the "web server" socket, in this case: www.google.ca:80 (a socket is an IP address or DNS name and a port, separated by ":").
- the web server (www.google.ca:80) sees your request for a connection, accepts it, and waits for you to tell it something
- your web browser uses HTTP's "GET" method to request a resource on that web server. since no path was given it asks for "/", which the server usually maps to a default page such as "index.html" (google.ca =~ google.ca/index.html). it sends this HTTP GET request over the socket
- the web server reads your request from the socket and, among other things, it tries to find the resource "/index.html" and if it exists, it sends it back along with other information (e.g. the HTTP header)
- your web browser reads from this socket until the response is complete: the header ends at the first blank line, and the body ends after the number of bytes given in the header's Content-Length field, or when the server closes the connection. what comes over the socket is the HTTP header, then the resource. so the web browser separates the header from the resource, and displays the resource (the html file).
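To make the steps above concrete, here is a rough sketch of the text that travels over the socket. The function names `build_get_request` and `split_response` are made up for illustration, and the request uses HTTP/1.0 with a `Connection: close` line purely to keep the example small.

```cpp
#include <string>
#include <utility>

// Build the text a client sends for a simple GET.
// Every line ends in CRLF, and an empty line marks the end of the request.
std::string build_get_request(const std::string& host, const std::string& path) {
    return "GET " + path + " HTTP/1.0\r\n"
           "Host: " + host + "\r\n"
           "Connection: close\r\n"
           "\r\n";
}

// Split a raw response into (header, body) at the first blank line.
// If no blank line is found, everything is treated as header.
std::pair<std::string, std::string> split_response(const std::string& raw) {
    const std::string delimiter = "\r\n\r\n";
    const std::string::size_type pos = raw.find(delimiter);
    if (pos == std::string::npos) {
        return std::make_pair(raw, std::string());
    }
    return std::make_pair(raw.substr(0, pos), raw.substr(pos + delimiter.size()));
}
```

For www.google.ca and the path "/", `build_get_request` produces a request line, a `Host:` line, a `Connection:` line, and a terminating empty line. `split_response` is the "separate the header from the resource" step.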

you can do the exact same thing, and it's surprisingly easy (at least to send a request and get the response with the file; rendering its HTML or whatever other language is quite a bit more complicated). also note that the "resource" can be any type of file, an image for example (though it will be binary information and not plain-text, but i don't think you said you are interested in these files).

so, to start: learn the basics of HTTP and basic socket creation and read/write in C++, and you're done. once you get the file from the socket ("web server"), you have a copy of the file, so there's nothing special about working with it (i.e. looking for a list and creating variables for it, whatever).
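As a starting point, a minimal sketch of the socket part might look like the following. It assumes a POSIX system (Linux, macOS, BSD); on Windows the Winsock calls are similar but need extra setup. The name `fetch_raw` is hypothetical, there is no HTTPS support, and HTTP/1.0 is used so the server does not reply with a chunked body.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstddef>
#include <string>

// Connect to host:port, send a GET for `path`, and read the whole reply
// (header plus body) into `out`. Returns false on any failure, in which
// case `out` is left untouched.
bool fetch_raw(const std::string& host, const std::string& port,
               const std::string& path, std::string& out) {
    addrinfo hints{};
    hints.ai_family = AF_UNSPEC;      // IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;  // TCP
    addrinfo* results = nullptr;
    if (getaddrinfo(host.c_str(), port.c_str(), &hints, &results) != 0) {
        return false;
    }

    // Try each resolved address until one accepts the connection.
    int fd = -1;
    for (addrinfo* ai = results; ai != nullptr; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd == -1) continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(results);
    if (fd == -1) return false;

    const std::string request = "GET " + path + " HTTP/1.0\r\nHost: " + host +
                                "\r\nConnection: close\r\n\r\n";
    std::size_t sent = 0;
    while (sent < request.size()) {  // send() may write only part of the buffer
        const ssize_t n = send(fd, request.data() + sent, request.size() - sent, 0);
        if (n <= 0) { close(fd); return false; }
        sent += static_cast<std::size_t>(n);
    }

    // With "Connection: close" the server closes the socket when it is done,
    // so reading until recv() returns 0 collects the complete response.
    std::string response;
    char buffer[4096];
    ssize_t n;
    while ((n = recv(fd, buffer, sizeof buffer, 0)) > 0) {
        response.append(buffer, static_cast<std::size_t>(n));
    }
    close(fd);
    if (n < 0) return false;
    out = response;
    return true;
}
```

A call such as `fetch_raw("www.google.ca", "80", "/", page)` would leave the header and the HTML in `page`, ready to be split at the first blank line.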

best thing to do is to learn the 2 basics above, make an attempt and let us know your specific problems if you run into any.

**dluthcke** · 09-29-2009

Okay, thank you for getting me started on the right foot

**brewbuck** · 09-29-2009

Downloading a HTML page and parsing it, using C++? Sounds like a recipe for intense pain.

**rags_to_riches** · 09-29-2009

Oh GOD, is brewbuck ever right. This here's a job for a higher-level language.

**Zach_the_Lizard** · 09-29-2009

Originally Posted by brewbuck

Downloading a HTML page and parsing it, using C++? Sounds like a recipe for intense pain.

I don't think it'd be so horrible if he's looking for, say, a specific tag, such as <ul>. It's certainly doable, considering Webkit and Gecko are written in C++, and they have to do far more than simply read in what amounts to XML.

**dluthcke** · 09-29-2009

Yeah there is a specific tag. will it really be that intense?

**Zach_the_Lizard** · 09-29-2009

You can probably use a regular expressions library to find a specific tag. The new C++ standard will have a regex library, which is based on the Boost.Regex library. VS 2010 probably has it. GCC 4.3 ships a TR1 regex header, though the implementation behind it is incomplete. If your compiler doesn't have a working one, you can install the Boost library and use it.
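A rough sketch of the regex approach, assuming a compiler with a working `std::regex` (the standardized descendant of Boost.Regex; the Boost version has a very similar interface). The function name `extract_list_items` is made up, and the pattern is deliberately simple: it only matches `<li>` items whose content has no nested tags.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return the text of every <li>...</li> item that contains no nested tags.
// "(?:\s[^>]*)?" allows attributes after "<li" without also matching "<link>".
std::vector<std::string> extract_list_items(const std::string& html) {
    static const std::regex item(R"(<li(?:\s[^>]*)?>([^<]*)</li>)",
                                 std::regex::icase);
    std::vector<std::string> items;
    for (std::sregex_iterator it(html.begin(), html.end(), item), end;
         it != end; ++it) {
        items.push_back((*it)[1].str());  // capture group 1 is the item text
    }
    return items;
}
```

This is fine for a page whose layout you know in advance. It is not a general HTML parser, and it will skip items that contain links or other markup.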

**Sebastiani** · 09-29-2009

Originally Posted by rags_to_riches

Oh GOD, is brewbuck ever right. This here's a job for a higher-level language.

Not really. But a higher level library than raw sockets, for sure. I can download a web page *and* parse it into a tree in less than three lines.

**Dae** · 09-29-2009

Originally Posted by Sebastiani

I can download a web page *and* parse it into a tree in less than three lines.

With which library, Sebastiani?

**Sebastiani** · 09-29-2009

My own.

**Dae** · 09-29-2009

Originally Posted by Sebastiani

My own.

Haha, damn. I was curious if you used an open source library. I'm working on a higher level library for http "browsing" but haven't gotten around to parsing the HTML tree (nodes/etc); and I doubt I will. I'll probably just keep using Boost.Regex from time to time.

Does yours use raw sockets directly or Asio (Boost.Asio)? I ask because a while ago I had trouble getting Asio working with SSL websites, so I gave up and switched to libcurl. If I could find someone (Google didn't really provide much info) using Asio and SSL successfully it would give me the confidence to try again.

**nadroj** · 09-29-2009

Originally Posted by dluthcke

Yeah there is a specific tag. will it really be that intense?

it really isn't that difficult, especially if you're using some regular expression library. if you know exactly the possible characters within the tags then you can use basic string comparisons without any libraries or regex. doing it "manually" like this of course may not be as fast/efficient as using a regular expression.

for example, if you know that the format you are expecting is exactly something like:

Code:

some stuff
you dont care about
<tag>
value1
value2
</tag>
other stuff

then iterate over every line until you see the first (assuming that's what you're looking for, the first) line that is exactly "<tag>\n", then save all lines until the first "</tag>\n". the problem is when you don't know exactly what can be in it, embedding "tag"s within "tag"s or whatever.

if you're restricted to C++ then of course you can't get around it by using a different language. if you aren't restricted to any language, and you think you're more comfortable with something else, then use that. others have suggested it can be done faster and in a more straightforward way using a higher-level language (i'm biased and would suggest perl).

of course all of these implementation details are up to you.

**Dae** · 09-29-2009

Originally Posted by nadroj

if you're restricted to C++ then of course you can't get around it by using a different language.

Interesting to note that many higher level languages are written in C or C++, and their open source interpreters can be embedded and used from within C/C++. I can use Python, PHP, Lua (and probably Perl) to download/parse a website in very few lines from within C/C++. I never have though, since I've never found myself restricted and would simply switch languages instead. Plus that seems cryptic and inefficient. Just a note.

**Sebastiani** · 09-29-2009

Originally Posted by Dae

Haha, damn. I was curious if you used an open source library. I'm working on a higher level library for http "browsing" but haven't gotten around to parsing the HTML tree (nodes/etc); and I doubt I will. I'll probably just keep using Boost.Regex from time to time.

Does yours use raw sockets directly or Asio (Boost.Asio)? I ask because a while ago I had trouble getting Asio working with SSL websites, so I gave up and switched to libcurl. If I could find someone (Google didn't really provide much info) using Asio and SSL successfully it would give me the confidence to try again.

Internally, it uses raw sockets, but unfortunately it doesn't handle SSL either. It's on my TODO list, though...so I should have that up and running sometime within the next decade or so.

I've heard libcurl is pretty good, though. Any drawbacks to it?
