Thread: Integration with Web?

  1. #1
    Registered User
    Join Date
    Sep 2009
    Posts
    4

    Integration with Web?

    How would I go about creating a program that "reads" a webpage, finds a list on that page, and saves its items into variables? I'm looking to create a program that does that, and I'm not even sure where to start.

  2. #2
    Registered User
    Join Date
    Oct 2006
    Location
    Canada
    Posts
    1,243
    the way to do it is to have your program act like a web browser. basically, a web browser (i'll just call it the "client") is just a program that sends requests to, and reads responses from, a socket, using (in general) the HTTP protocol.

    example: you open your web browser (firefox, internet explorer, whatever it may be) and you go to http://www.google.ca. this is what happens in the background:

    - your "browser" creates a socket on your computer and connects to the "web server" socket, in this case: www.google.ca:80 (a socket address is an IP address or DNS name and a port, separated by ":").
    - the web server (www.google.ca:80) sees your request for a connection, accepts it, and waits for you to tell it something
    - your web browser uses HTTP's "GET" command to request a resource on that web server. with no path in the URL it asks for "/", which the server typically maps to a default page such as "index.html" (google.ca =~ google.ca/index.html). it sends this HTTP GET command over the socket
    - the web server reads your request from the socket and, among other things, tries to find the requested resource; if it exists, it sends it back along with other information (i.e. the HTTP header)
    - your web browser reads from this socket until there's nothing else to read (the server closes the connection) or until it has read the number of bytes given by the Content-Length header. what comes over the socket is the HTTP header, then a blank line, then the resource. so the web browser separates the header from the resource, and displays the resource (the html file).
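
    The request and the header/resource split described in the steps above can be sketched in C++ roughly like this. It is only a sketch: the helper names build_get_request and split_response are made up for illustration, and it assumes a plain HTTP/1.1 exchange where the header ends at the first blank line.

```cpp
#include <string>
#include <utility>

// Build a minimal HTTP/1.1 GET request (hypothetical helper).
// "Connection: close" asks the server to close the socket after the
// response, so the client can simply read until end of stream.
std::string build_get_request(const std::string& host, const std::string& path) {
    return "GET " + path + " HTTP/1.1\r\n"
           "Host: " + host + "\r\n"
           "Connection: close\r\n"
           "\r\n";
}

// Split a raw response into (header, resource) at the first blank line.
// If no blank line is present, everything is treated as header.
std::pair<std::string, std::string> split_response(const std::string& raw) {
    const std::string delimiter = "\r\n\r\n";
    const std::string::size_type pos = raw.find(delimiter);
    if (pos == std::string::npos) {
        return std::make_pair(raw, std::string());
    }
    return std::make_pair(raw.substr(0, pos), raw.substr(pos + delimiter.size()));
}
```

    For the google.ca example, build_get_request("www.google.ca", "/") gives the text the browser would send, and split_response(...).second is the part you would go on to search for your list.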

    you can do the exact same thing, and it's surprisingly easy (at least to send a request and get the response with the file; rendering its HTML or whatever other language is quite a bit more complicated). also note that the "resource" can be any type of file, an image for example (though that will be binary data and not plain text, and i don't think you said you are interested in those files).

    so, to start: learn the basics of the HTTP protocol and basic C++ socket creation and read/write, and you're done. once you get the file from the socket ("web server"), you have a copy of the file, so there's nothing special about working with it (i.e. looking for a list and creating variables for it, whatever).
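
    As a rough idea of what the socket side might look like, here is a minimal sketch using the POSIX socket calls (getaddrinfo, socket, connect, send, recv). It assumes a Unix-like system (Windows would need Winsock setup instead), plain HTTP with no SSL, and the function names send_all, read_all and http_get are hypothetical.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <string>

// Write the whole string to a connected socket; returns false on error.
bool send_all(int fd, const std::string& data) {
    std::string::size_type sent = 0;
    while (sent < data.size()) {
        const ssize_t n = send(fd, data.data() + sent, data.size() - sent, 0);
        if (n <= 0) return false;
        sent += static_cast<std::string::size_type>(n);
    }
    return true;
}

// Read from the socket until the peer closes its end (recv returns 0).
std::string read_all(int fd) {
    std::string out;
    char buf[4096];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof buf, 0)) > 0) {
        out.append(buf, static_cast<std::string::size_type>(n));
    }
    return out;
}

// Connect to host:port over TCP, send a GET for path, and return the raw
// response (header + resource). Returns an empty string on failure.
std::string http_get(const std::string& host, const std::string& port,
                     const std::string& path) {
    addrinfo hints = {};
    hints.ai_family = AF_UNSPEC;      // IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;  // TCP
    addrinfo* results = nullptr;
    if (getaddrinfo(host.c_str(), port.c_str(), &hints, &results) != 0) return "";

    std::string response;
    for (addrinfo* ai = results; ai != nullptr; ai = ai->ai_next) {
        const int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            const std::string request = "GET " + path + " HTTP/1.1\r\nHost: " +
                                        host + "\r\nConnection: close\r\n\r\n";
            if (send_all(fd, request)) response = read_all(fd);
            close(fd);
            break;  // stop after the first address that accepts the connection
        }
        close(fd);
    }
    freeaddrinfo(results);
    return response;
}
```

    A call such as http_get("www.google.ca", "80", "/") would then return the header and the page as one string, assuming the server still answers plain HTTP on port 80.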

    best thing to do is to learn the 2 basics above, make an attempt and let us know your specific problems if you run into any.
    Last edited by nadroj; 09-29-2009 at 03:09 PM.

  3. #3
    Registered User
    Join Date
    Sep 2009
    Posts
    4
    Okay, thank you for getting me started on the right foot

  4. #4
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Downloading a HTML page and parsing it, using C++? Sounds like a recipe for intense pain.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  5. #5
    Registered User
    Join Date
    Dec 2007
    Posts
    2,675
    Oh GOD, is brewbuck ever right. This here's a job for a higher-level language.

  6. #6
    Registered User
    Join Date
    Sep 2009
    Posts
    63
    Quote Originally Posted by brewbuck View Post
    Downloading a HTML page and parsing it, using C++? Sounds like a recipe for intense pain.
    I don't think it'd be so horrible if he's looking for, say, a specific tag, such as <ul>. It's certainly doable, considering Webkit and Gecko are written in C++, and they have to do far more than simply read in what amounts to XML.

  7. #7
    Registered User
    Join Date
    Sep 2009
    Posts
    4
    Yeah, there is a specific tag. Will it really be that intense?

  8. #8
    Registered User
    Join Date
    Sep 2009
    Posts
    63
    You can probably use a regular expressions library to find a specific tag. The new C++ standard will have a regex library, which is based on Boost.Regex. VS 2010 probably has it; GCC ships a header for it, but the implementation may be incomplete depending on the version. If your compiler doesn't have a working one, you can install the Boost library and use Boost.Regex.
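
    A minimal sketch of the regex approach, assuming a compiler with a working std::regex (C++11); Boost.Regex offers a very similar interface under the boost:: namespace. The function name extract_tag_contents is made up for illustration, and the pattern assumes the tag is never nested inside itself.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return the text between every <tag ...> and </tag> pair in html.
// \b keeps "li" from matching "<link>"; [\s\S]*? lazily matches any
// characters, including newlines, up to the nearest closing tag.
std::vector<std::string> extract_tag_contents(const std::string& html,
                                              const std::string& tag) {
    const std::regex pattern("<" + tag + "\\b[^>]*>([\\s\\S]*?)</" + tag + ">",
                             std::regex::icase);
    std::vector<std::string> contents;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end;
         it != end; ++it) {
        contents.push_back((*it)[1].str());  // capture group 1: the tag body
    }
    return contents;
}
```

    Calling extract_tag_contents(page, "li") on a downloaded page would give one string per list item, which you can then store however you like.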

  9. #9
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    Quote Originally Posted by rags_to_riches View Post
    Oh GOD, is brewbuck ever right. This here's a job for a higher-level language.
    Not really. But a higher level library than raw sockets, for sure. I can download a web page *and* parse it into a tree in less than three lines.

  10. #10
    Deprecated Dae's Avatar
    Join Date
    Oct 2004
    Location
    Canada
    Posts
    1,034
    Quote Originally Posted by Sebastiani View Post
    I can download a web page *and* parse it into a tree in less than three lines.
    With which library, Sebastiani?
    Warning: Have doubt in anything I post.

    GCC 4.5, Boost 1.40, Code::Blocks 8.02, Ubuntu 9.10 010001000110000101100101

  11. #11
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    My own.

  12. #12
    Deprecated Dae's Avatar
    Join Date
    Oct 2004
    Location
    Canada
    Posts
    1,034
    Quote Originally Posted by Sebastiani View Post
    My own.
    Haha, damn. I was curious if you used an open source library. I'm working on a higher level library for http "browsing" but haven't gotten around to parsing the HTML tree (nodes/etc); and I doubt I will. I'll probably just keep using Boost.Regex from time to time.

    Does yours use raw sockets directly or Asio (Boost.Asio)? I ask because a while ago I had trouble getting Asio working with SSL websites, so I gave up and switched to libcurl. If I could find someone (Google didn't really provide much info) using Asio and SSL successfully it would give me the confidence to try again.

  13. #13
    Registered User
    Join Date
    Oct 2006
    Location
    Canada
    Posts
    1,243
    Quote Originally Posted by dluthcke View Post
    Yeah, there is a specific tag. Will it really be that intense?
    it really isn't that difficult, especially if you're using some regular expression library. if you know exactly which characters can appear within the tags then you can use basic string comparisons without any libraries or regex. doing it "manually" like this of course may not be as fast/efficient as using a regular expression.

    for example, if you know that the format you are expecting is exactly something like:
    Code:
    some stuff
    you dont care about
    <tag>
    value1
    value2
    </tag>
    other stuff
    then iterate over every line until you see the first line (assuming the first one is what you're looking for) that is exactly "<tag>\n", then save all lines until the first "</tag>\n". the problem is when you don't know exactly what can be in it, e.g. "tag"s embedded within "tag"s or whatever.
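
    a minimal sketch of that line-by-line scan, assuming the opening and closing markers each sit alone on their own line exactly as in the format above (the function name lines_between is made up for illustration):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Collect every line strictly between the first line equal to open_line
// and the next line equal to close_line.
std::vector<std::string> lines_between(const std::string& text,
                                       const std::string& open_line,
                                       const std::string& close_line) {
    std::vector<std::string> values;
    std::istringstream in(text);
    std::string line;
    bool inside = false;
    while (std::getline(in, line)) {
        // Tolerate CRLF line endings, which web servers commonly send.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (!inside) {
            if (line == open_line) inside = true;
        } else if (line == close_line) {
            break;  // stop at the first closing marker
        } else {
            values.push_back(line);
        }
    }
    return values;
}
```

    with the sample text above, lines_between(page, "<tag>", "</tag>") would hold "value1" and "value2". it does not handle nested tags or markers that share a line with other text.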

    if you're restricted to C++ then of course you can't get around that by using a different language. if you aren't restricted to any language and you think you're more comfortable with something else, then use that. others have suggested it can be done faster and in a more straightforward way using a higher-level language (i'm biased and would suggest perl).

    of course all of these implementation details are up to you.

  14. #14
    Deprecated Dae's Avatar
    Join Date
    Oct 2004
    Location
    Canada
    Posts
    1,034
    Quote Originally Posted by nadroj View Post
    if you're restricted to C++ then of course you can't get around that by using a different language.
    Interesting to note: many higher-level languages are written in C or C++, and their open source interpreters can be embedded in C/C++ programs. I could use Python, PHP, Lua (and probably Perl) to download/parse a website in very few lines from within C/C++. I never have though, since I've never found myself restricted and would simply switch languages instead. Plus that seems cryptic and inefficient. Just a note.

  15. #15
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    Quote Originally Posted by Dae View Post
    Haha, damn. I was curious if you used an open source library. I'm working on a higher level library for http "browsing" but haven't gotten around to parsing the HTML tree (nodes/etc); and I doubt I will. I'll probably just keep using Boost.Regex from time to time.

    Does yours use raw sockets directly or Asio (Boost.Asio)? I ask because a while ago I had trouble getting Asio working with SSL websites, so I gave up and switched to libcurl. If I could find someone (Google didn't really provide much info) using Asio and SSL successfully it would give me the confidence to try again.
    Internally, it uses raw sockets, but unfortunately it doesn't handle SSL either. It's on my TODO list, though...so I should have that up and running sometime within the next decade or so.

    I've heard libcurl is pretty good, though. Any drawbacks to it?
