Thread: web spider

  1. #1
    Tweaking master Aslaville's Avatar
    Join Date
    Sep 2012
    Location
    Rogueport
    Posts
    528

    web spider

    If one wanted to develop a web spider that runs fast, and the project also had to be finished in the minimum time possible, would C++ be the language of choice? Some languages offer HTTP libraries and web/HTML parsing, and I have a feeling that implementing the same functionality in C++ would prove even more difficult than learning a language such as Perl with the sole aim of developing the web spider. I would like to hear what you think.

  2. #2
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    What language(s) do you know?
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  3. #3
    Tweaking master Aslaville's Avatar
    Join Date
    Sep 2012
    Location
    Rogueport
    Posts
    528
    C++ and python

  4. #4
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    The bandwidth of your internet connection is likely to be the limiting factor. If your DSL connection maxes out at, say, 1 MB/sec, then my guess is that any language would suffice if the only thing to be done is parsing HTML files for text (for indexing) and links (for fetching).

    But if you've got a server farm with multiple gigabit connections, a different analysis would be appropriate.
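
    As a rough back-of-envelope sketch of that argument in Python: the 1 MB/sec figure is the one above, while the 100 KB average page size and the function names are purely assumptions for illustration, not measurements.

```python
def max_pages_per_second(bandwidth_bytes_per_s, avg_page_bytes):
    """Upper bound on the fetch rate if the link is the only bottleneck."""
    return bandwidth_bytes_per_s / avg_page_bytes


def parse_budget_ms_per_page(bandwidth_bytes_per_s, avg_page_bytes):
    """Time the parser may spend on each page before the parser,
    rather than the link, becomes the limiting factor."""
    return 1000.0 / max_pages_per_second(bandwidth_bytes_per_s, avg_page_bytes)


if __name__ == "__main__":
    dsl = 1_000_000    # 1 MB/sec, the figure quoted above
    page = 100_000     # ASSUMPTION: 100 KB per page, for illustration only
    print(max_pages_per_second(dsl, page), "pages/sec at most")
    print(parse_budget_ms_per_page(dsl, page), "ms of parsing per page available")
```

    Under those assumed numbers the link delivers at most ten pages a second, so the parser has on the order of a hundred milliseconds per page before it matters which language it is written in.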
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    EDIT: As for Salem's concern, I'm guessing that since you are posting this question here, a server farm with multiple gigabits of bandwidth is not at your disposal.

    It sounds like you are more concerned with having a solid, effective and efficient end product, instead of this being a vehicle to learn a new language. In that light, here's what I think:

    It is usually much easier to learn to use a library for a language you are already proficient in than it is to learn a new language well enough to tackle larger problems (like a web spider). So in your case, I would only consider C++ or Python as implementation languages.

    The next thing I would do is consider how well I know each of those languages. If you're a C++ guru, and only so-so at Python, then C++ is the way to go, hands down. Any benefits Python may have in offering faster development time will likely be offset by you being slower at development, and making more errors and bad design decisions. If you are a Python guru, and so-so at C++, then develop in Python.

    Now, pick an HTTP library. libcurl comes to mind: it's very efficient, stable, widely used and well documented (and free). It was originally a C library, but it has C++ and Python bindings. I'm sure there are plenty of others if you Google. Python probably even has a native library. EDIT: Just checked, looks like it does.

    The only other thing you might need, for the HTML parsing (libcurl only works at the HTTP level), is some sort of HTML/XML library. Python has its own DOM-based XML library, and a quick Google search will turn up plenty of options for C++.
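
    To give a feel for the Python route, here is a minimal sketch that uses only the standard library (urllib for HTTP, html.parser for the HTML). Note that it swaps in html.parser rather than the DOM-based XML library mentioned above, and the names fetch, LinkExtractor and extract_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in html as absolute URLs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

    Usage would be something like extract_links(fetch(url), url). Tags the parser has no handler for are simply passed over, so CSS and script blocks do not get in the way of collecting links.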

  6. #6
    Tweaking master Aslaville's Avatar
    Join Date
    Sep 2012
    Location
    Rogueport
    Posts
    528
    I think I am going to use C++ because its IDEs are well developed, unlike those for Python and Perl, which in turn offer a very conducive environment for development of a web spider. I am also better versed in C++ than in Python.

  7. #7
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Aslaville
    I think I am going to use C++ because its IDEs are well developed, unlike those for Python and Perl, which in turn offer a very conducive environment for development of a web spider. I am also better versed in C++ than in Python.
    I daresay that the IDEs have less of a role to play concerning a "very conducive environment for development of a web spider" than the availability and ease of use of relevant libraries.[1] In this area, Perl supposedly has an advantage as I have heard it touted for its text processing capabilities. Of course, if you know C++ considerably better than Python, and don't know Perl at all, then for the short term C++ would be the best option even if an experienced developer in those other languages can get started faster with stronger relevant library support.

    [1] That said, there are fairly well developed IDEs for Python, and I presume Perl as well.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  8. #8
    Tweaking master Aslaville's Avatar
    Join Date
    Sep 2012
    Location
    Rogueport
    Posts
    528
    Then C++ must be the answer!

  9. #9
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > Then C++ must be the answer!
    What was the question again?

    There is no end of Perl modules for processing HTML, e.g.
    HTML::Miner - search.cpan.org
    +
    Code:
    #!/usr/bin/perl -w
    use strict;
    use warnings; 
    use HTML::Miner;
    
    my $dig = HTML::Miner->new(CURRENT_URL => "cboard.cprogramming.com");
    my $links = $dig->get_links();
    foreach my $i ( @{$links} ) {
        print ${$i}{"URL"} . "\n";
    }
    =
    Code:
    $ perl foo.pl
    http://cboard.cprogramming.com/register.php
    http://cboard.cprogramming.com/search.php?do=getdaily&contenttype=vBForum_Post
    javascript://
    javascript://
    http://cboard.cprogramming.com/showgroups.php
    http://cboard.cprogramming.com/search.php
    http://cboard.cprogramming.com/
    http://cboard.cprogramming.com/general-programming-boards/
    http://cboard.cprogramming.com/cplusplus-programming/
    http://cboard.cprogramming.com/cplusplus-programming/150980-help-math-algorithm.html
    http://cboard.cprogramming.com/cplusplus-programming/150980-help-math-algorithm-post1124983.html#post1124983
    // and so on.....
    Put that into a loop which feeds itself, and you have a crawler.
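
    For anyone who wants to see the shape of that self-feeding loop, here is a rough sketch in Python rather than Perl. The function crawl and its get_links parameter are hypothetical; get_links stands in for whatever returns the links of one page (the role get_links() plays in the Perl above).

```python
from collections import deque
from urllib.parse import urldefrag, urlparse


def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    get_links(url) must return an iterable of absolute URLs found on
    that page. Returns the pages in the order they were visited.
    """
    start_host = urlparse(start_url).netloc
    queue = deque([start_url])   # pages waiting to be fetched
    seen = {start_url}           # everything ever queued, to avoid loops
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            link, _fragment = urldefrag(link)   # drop "#post1124983" and the like
            parts = urlparse(link)
            if parts.scheme not in ("http", "https"):
                continue                         # skips javascript:// entries
            if parts.netloc != start_host:
                continue                         # stay on the starting site
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

    This leaves out everything a polite crawler needs (robots.txt, rate limiting, error handling), so treat it as an outline of the loop only.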

    Now in C++, if you want to do it all yourself and not use any libraries except the basic socket API, you're looking at hundreds (if not thousands) of lines of code to achieve the same thing. The time to write and debug such a thing is probably measured in weeks.

  10. #10
    Registered User
    Join Date
    Jul 2012
    Location
    Australia
    Posts
    242
    Quote Originally Posted by anduril462

    Now, pick a HTTP library. libcurl comes to mind, it's very efficient, stable, widely used and well documented (and free).
    Only a few minutes ago I started writing a web spider in C, using libcurl, to download entire forums. I still need to parse the pages and search for links. I think there are C libraries (written years ago) for parsing HTML, but webpages nowadays are made up of HTML, CSS, JavaScript and god knows what else. Can anyone recommend a good C library? Otherwise I will probably just write the parsing code myself, which is probably way quicker than learning some new library.

    Code:
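    #include <curl/curl.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Report a libcurl failure. perror() is not suitable here because
       libcurl reports errors through CURLcode, not through errno. */
    static void die_curl(const char *what, CURLcode code)
    {
        fprintf(stderr, "%s: %s\n", what, curl_easy_strerror(code));
        exit(EXIT_FAILURE);
    }

    int main(void)
    {
        CURL *myHandle;
        CURLcode result;
        FILE *file;

        if((result = curl_global_init(CURL_GLOBAL_ALL)) != CURLE_OK)
            die_curl("curl_global_init", result);

        /* fopen() does set errno, so perror() is fine for this one. */
        if((file = fopen("webpage.html", "wb")) == NULL)
        {
            perror("fopen");
            exit(EXIT_FAILURE);
        }

        if((myHandle = curl_easy_init()) == NULL)
        {
            fputs("curl_easy_init failed\n", stderr);
            exit(EXIT_FAILURE);
        }

        if((result = curl_easy_setopt(myHandle, CURLOPT_URL, "http://cboard.cprogramming.com/")) != CURLE_OK)
            die_curl("CURLOPT_URL", result);

        /* With no write callback set, libcurl's default one fwrite()s the
           body to this FILE *. When libcurl is used as a Windows DLL, its
           documentation says to set CURLOPT_WRITEFUNCTION as well. */
        if((result = curl_easy_setopt(myHandle, CURLOPT_WRITEDATA, file)) != CURLE_OK)
            die_curl("CURLOPT_WRITEDATA", result);

        if((result = curl_easy_perform(myHandle)) != CURLE_OK)
            die_curl("curl_easy_perform", result);

        curl_easy_cleanup(myHandle);
        curl_global_cleanup();
        fclose(file);
        puts("Webpage downloaded successfully to webpage.html");

        return 0;
    }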
    IDE: Code::Blocks | Compiler Suite for Windows: TDM-GCC (MingW, gdb)

  11. #11
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    A proper HTML parser (even an "old" one) should skip tags that it doesn't understand.

    A regex library is another way to go. If you don't know regular expressions, it's something you have to learn sometime.

  12. #12
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    Quote Originally Posted by laserlight
    there are fairly well developed IDEs for Python
    Eclipse has an excellent Python plugin called PyDev.

  13. #13
    Registered User
    Join Date
    Jul 2012
    Location
    Australia
    Posts
    242
    Quote Originally Posted by oogabooga
    A proper HTML parser (even an "old" one) should skip tags that it doesn't understand.

    A regex library is another way to go. If you don't know regular expressions, it's something you have to learn sometime.
    Never heard of the term "regex library" or "regular expressions" before. Seems like a very useful concept. Thanks.

  14. #14
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    "regex" is shorthand for "regular expression", so they're the same thing (just in case you were wondering). It's one of the most useful things you'll ever learn. It's for matching complex patterns in strings and extracting substrings.

    E.g., this Perl program
    Code:
    my $line = '<a href="whatever.html">Something</a>';
    my ($href, $text) = $line =~ /<a +href="([^"]+)">([^<]+)<\/a>/;
    print "$line\n";
    print "$href\n$text\n";
    scans this line

    <a href="whatever.html">Something</a>

    and "captures" the href string and the tag contents string. The output is:

    <a href="whatever.html">Something</a>
    whatever.html
    Something

    The regex is:
    /<a +href="([^"]+)">([^<]+)<\/a>/

    which may look a little complicated, but once you learn the language it's extraordinarily useful.

    The latest C++ standard (C++11) has regular expressions in the standard library, in the <regex> header.
    Even JavaScript has regexes.
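
    Since Python came up earlier in the thread, here is roughly the same capture using its re module. This is only a sketch; LINK_RE and capture_link are names invented for the example.

```python
import re

# Same pattern as the Perl regex above; the slash in </a> needs no
# escaping here because Python does not use / as a delimiter.
LINK_RE = re.compile(r'<a +href="([^"]+)">([^<]+)</a>')


def capture_link(line):
    """Return (href, text) for the first matching anchor, or None."""
    match = LINK_RE.search(line)
    return match.groups() if match else None


if __name__ == "__main__":
    print(capture_link('<a href="whatever.html">Something</a>'))
```

    Be aware that the pattern is deliberately narrow: it only matches when href is the first and only attribute, so an anchor such as <a class="x" href="whatever.html"> is not captured.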

  15. #15
    Registered User
    Join Date
    Jul 2012
    Location
    Australia
    Posts
    242
    Will I go mad?

    Coding Horror: Parsing Html The Cthulhu Way

    "Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes."
