web spider

**Aslaville** · 09-27-2012

If one would like to develop a web spider that is fast at run time, and the project should also be completed in as little time as possible, would C++ be a language of choice? Some languages offer HTTP libraries and web/HTML parsing, and I have a feeling that implementing the same functionality in C++ would prove even more difficult than learning a language such as Perl with the sole aim of developing the web spider. I would like to hear what you think.

**oogabooga** · 09-27-2012

What language(s) do you know?

**Aslaville** · 09-27-2012

C++ and Python

**Salem** · 09-27-2012

The bandwidth of your internet connection is likely to be the limiting factor. If your DSL connection maxes out at say 1MB/Sec, then my guess would be any language would suffice if the only thing to be done is parsing HTML files for text (for indexing) and links (for fetching).

But if you've got a server farm with multiple gigabit connections, a different analysis would be appropriate.
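To put rough numbers on the bandwidth argument, here is a back-of-the-envelope sketch in Python. The 100 KB average page size and the 5 ms parse time are assumptions picked for illustration, not measurements; only the 1 MB/s link speed comes from the post above.

```python
# Back-of-the-envelope estimate: is the network or the parser the bottleneck?
# Every figure except the link speed is an illustrative assumption.

LINK_BYTES_PER_SEC = 1_000_000   # the 1 MB/s DSL link mentioned above
AVG_PAGE_BYTES = 100_000         # assumed average HTML page size
PARSE_SECONDS_PER_PAGE = 0.005   # assumed parse cost per page in a "slow" language


def pages_per_second(bytes_per_sec, page_bytes):
    """Upper bound on pages that can be downloaded per second over the link."""
    return bytes_per_sec / page_bytes


def parser_utilisation(fetch_rate, parse_seconds):
    """Fraction of one CPU core spent parsing at the given fetch rate."""
    return fetch_rate * parse_seconds


fetch_rate = pages_per_second(LINK_BYTES_PER_SEC, AVG_PAGE_BYTES)
cpu_share = parser_utilisation(fetch_rate, PARSE_SECONDS_PER_PAGE)

print(f"network limit: {fetch_rate:.0f} pages/s")
print(f"parsing keeps one core {cpu_share:.0%} busy")
```

Under these assumptions the link delivers about 10 pages per second and the parser is idle most of the time, so a faster language would not make the spider finish sooner.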

**anduril462** · 09-27-2012

EDIT: As for Salem's concern, I'm guessing that since you are posting this question here, a server farm with multiple gigabits of bandwidth is not at your disposal.

It sounds like you are more concerned with having a solid, effective and efficient end product, instead of this being a vehicle to learn a new language. In that light, here's what I think:

It is usually much easier to learn to use a library for a language you are already proficient at, than it is to learn a new language well enough to tackle larger problems (like a web spider) well. So in your case, I would only consider C++ or Python for implementation languages.

The next thing I would do is consider how well I know each of those languages. If you're a C++ guru, and only so-so at Python, then C++ is the way to go, hands down. Any benefits Python may have in offering faster development time will likely be offset by you being slower at development, and making more errors and bad design decisions. If you are a Python guru, and so-so at C++, then develop in Python.

Now, pick an HTTP library. libcurl comes to mind; it's very efficient, stable, widely used and well documented (and free). It was originally a C library, but it has C++ and Python bindings. I'm sure there are plenty of others if you Google. Python probably even has a native library. EDIT: Just checked, looks like it does.

The only other thing you might need for the HTML parsing (libcurl only works at the HTTP level) is some sort of HTML/XML library. Python has its own DOM-based XML library, and a quick Google search will turn up plenty of options for C++.
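As a rough illustration of the "HTTP library plus HTML parser" split on the Python side, here is a minimal sketch that uses only the standard library (`urllib.request` for HTTP, `html.parser` for parsing). The names `LinkCollector`, `extract_links` and `fetch` are made up for this example, and the UTF-8 fallback is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; other tags are ignored.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for all links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url, timeout=10.0):
    """Download a page and decode it (UTF-8 is assumed if no charset is sent)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (needs network access):
#   start = "http://cboard.cprogramming.com/"
#   for link in extract_links(fetch(start), start):
#       print(link)
```

The parsing half can be tried offline by passing any HTML string to `extract_links`.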

**Aslaville** · 09-28-2012

I think I am going to consider using C++ because its IDEs are very developed, unlike those for Python and Perl, which in turn offer a very conducive environment for development of a web spider. I am also better versed in C++ than in Python.

**laserlight** · 09-28-2012

Originally Posted by Aslaville

I think I am going to consider using C++ because its IDEs are very developed, unlike those for Python and Perl, which in turn offer a very conducive environment for development of a web spider. I am also better versed in C++ than in Python.

I daresay that the IDEs have less of a role to play concerning a "very conducive environment for development of a web spider" than the availability and ease of use of relevant libraries.[1] In this area, Perl supposedly has an advantage as I have heard it touted for its text processing capabilities. Of course, if you know C++ considerably better than Python, and don't know Perl at all, then for the short term C++ would be the best option even if an experienced developer in those other languages can get started faster with stronger relevant library support.

[1] That said, there are fairly well developed IDEs for Python, and I presume Perl as well.

**Aslaville** · 09-28-2012

Then C++ must be the answer!

**Salem** · 09-28-2012

> Then C++ must be the answer!
What was the question again?

There is no end of Perl modules for processing HTML, e.g.
HTML::Miner - search.cpan.org
+

Code:

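#!/usr/bin/perl
use strict;
use warnings;
use HTML::Miner;

# Point the miner at the page whose links we want.
my $dig = HTML::Miner->new(CURRENT_URL => "cboard.cprogramming.com");
my $links = $dig->get_links();

# $links is a reference to an array of hashes; print each link's URL.
foreach my $link ( @{$links} ) {
    print $link->{URL} . "\n";
}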

=

Code:

$ perl foo.pl
http://cboard.cprogramming.com/register.php
http://cboard.cprogramming.com/search.php?do=getdaily&amp;contenttype=vBForum_Post
javascript://
javascript://
http://cboard.cprogramming.com/showgroups.php
http://cboard.cprogramming.com/search.php
http://cboard.cprogramming.com/
http://cboard.cprogramming.com/general-programming-boards/
http://cboard.cprogramming.com/cplusplus-programming/
http://cboard.cprogramming.com/cplusplus-programming/150980-help-math-algorithm.html
http://cboard.cprogramming.com/cplusplus-programming/150980-help-math-algorithm-post1124983.html#post1124983
// and so on.....

Put that into a loop which feeds itself, and you have a crawler.
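For what it's worth, the "loop which feeds itself" might look something like the following Python sketch. It is not the HTML::Miner approach above: `crawl` is a hypothetical helper that takes the download and link-extraction steps as functions, and the in-memory `FAKE_SITE` stands in for real pages so the loop can be run without a network.

```python
from collections import deque


def crawl(start_url, fetch, extract_links, max_pages=50):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    queue = deque([start_url])
    seen = {start_url}          # every URL ever queued, to avoid revisits
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            page = fetch(url)
        except Exception:
            continue            # skip pages that fail to download
        visited.append(url)
        for link in extract_links(page, url):
            # Only follow web links; this drops things like "javascript://".
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A made-up four-page site: each "page" is simply the list of links on it.
FAKE_SITE = {
    "http://example.test/": ["http://example.test/a",
                             "http://example.test/b",
                             "javascript://"],
    "http://example.test/a": ["http://example.test/", "http://example.test/b"],
    "http://example.test/b": ["http://example.test/c"],
    "http://example.test/c": [],
}


def fake_fetch(url):
    return FAKE_SITE[url]


def fake_extract(page, base_url):
    return page


order = crawl("http://example.test/", fake_fetch, fake_extract)
print(order)
```

The `seen` set is what stops the loop from going round in circles when pages link back to each other, and `max_pages` gives it a hard stop.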

Now in C++, if you want to do it all yourself and not use any libraries except the basic socket API, you're looking at hundreds (if not thousands) of lines of code to achieve the same thing. The time to write and debug such a thing is probably measured in weeks.

**cfanatic** · 10-11-2012

Originally Posted by anduril462

Now, pick a HTTP library. libcurl comes to mind, it's very efficient, stable, widely used and well documented (and free).

Only a few minutes ago I started writing a web spider in C to download entire forums, using libcurl. Still need to parse and search for links. I think there are C libraries (written years ago) for parsing HTML, but webpages nowadays are made up of HTML, CSS, JavaScript and god knows what else. Can anyone recommend a good C library? I am probably just gonna write code to do the parsing, probably way quicker than learning some new library.

Code:

#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    CURL *myHandle;
    CURLcode result;
    FILE *file;

    if (curl_global_init(CURL_GLOBAL_ALL) != CURLE_OK)
    {
        fputs("Error: curl_global_init failed\n", stderr);
        exit(EXIT_FAILURE);
    }

    /* fopen sets errno, so perror is appropriate here. */
    if ((file = fopen("webpage.html", "wb")) == NULL)
    {
        perror("Error");
        curl_global_cleanup();
        exit(EXIT_FAILURE);
    }

    /* libcurl does not set errno, so report its failures directly. */
    if ((myHandle = curl_easy_init()) == NULL)
    {
        fputs("Error: curl_easy_init failed\n", stderr);
        fclose(file);
        curl_global_cleanup();
        exit(EXIT_FAILURE);
    }

    /* With no write callback set, libcurl fwrite()s the body to WRITEDATA. */
    result = curl_easy_setopt(myHandle, CURLOPT_URL, "http://cboard.cprogramming.com/");
    if (result == CURLE_OK)
        result = curl_easy_setopt(myHandle, CURLOPT_WRITEDATA, file);
    if (result == CURLE_OK)
        result = curl_easy_perform(myHandle);

    curl_easy_cleanup(myHandle);
    fclose(file);
    curl_global_cleanup();

    if (result != CURLE_OK)
    {
        fprintf(stderr, "Error: %s\n", curl_easy_strerror(result));
        return EXIT_FAILURE;
    }

    puts("Webpage downloaded successfully to webpage.html");
    return 0;
}

**oogabooga** · 10-11-2012

A proper HTML parser (even an "old" one) should skip tags that it doesn't understand.

A regex library is another way to go. If you don't know regular expressions, it's something you have to learn sometime.

**Elkvis** · 10-11-2012

Originally Posted by laserlight

there are fairly well developed IDEs for Python

Eclipse has an excellent Python plugin called PyDev.

**cfanatic** · 10-11-2012

Originally Posted by oogabooga

A proper HTML parser (even an "old" one) should skip tags that it doesn't understand.

A regex library is another way to go. If you don't know regular expressions, it's something you have to learn sometime.

Never heard of the term "regex library" or "regular expressions" before. Seems like a very useful concept. Thanks.

**oogabooga** · 10-11-2012

"regex" is shorthand for "regular expression", so they're the same thing (just in case you were wondering). It's one of the most useful things you'll ever learn. It's for matching complex patterns in strings and extracting substrings.

E.g., this Perl program

Code:

my $line = '<a href="whatever.html">Something</a>';
my ($href, $text) = $line =~ /<a +href="([^"]+)">([^<]+)<\/a>/;
print "$line\n";
print "$href\n$text\n";

scans this line

<a href="whatever.html">Something</a>

and "captures" the href string and the tag contents string. The output is:

<a href="whatever.html">Something</a>
whatever.html
Something

The regex is:
/<a +href="([^"]+)">([^<]+)<\/a>/

which may look a little complicated, but once you learn the language it's extraordinarily useful.

The latest C++ standard (C++11) has regular expressions in the standard library, in the <regex> header.
Even JavaScript has regexes.
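Python has them too, in the standard `re` module. As a sketch, here is the same pattern applied to the same line; apart from dropping the `\/` escape, which Python does not need, the regex is unchanged. The name `LINK_RE` is just for this example.

```python
import re

# Same pattern as the Perl example: capture the href value and the link text.
LINK_RE = re.compile(r'<a +href="([^"]+)">([^<]+)</a>')

line = '<a href="whatever.html">Something</a>'
match = LINK_RE.search(line)
if match:
    href, text = match.groups()
    print(line)
    print(href)
    print(text)
```

It prints the line followed by `whatever.html` and `Something`, matching the Perl output.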

**cfanatic** · 10-11-2012

Will I go mad?

Coding Horror: Parsing Html The Cthulhu Way

"Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes."

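Joking aside, the practical problem is easy to demonstrate. The sketch below runs the regex from earlier in the thread and Python's standard `html.parser` over three link styles that all turn up on real pages. `HrefParser`, `hrefs_by_regex` and `hrefs_by_parser` are names invented for this comparison, and the sample markup is made up.

```python
import re
from html.parser import HTMLParser

# The regex from earlier in the thread, unchanged apart from the delimiter.
LINK_RE = re.compile(r'<a +href="([^"]+)">([^<]+)</a>')


class HrefParser(HTMLParser):
    """Collects href values from <a> tags, whatever their exact spelling."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def hrefs_by_regex(html):
    return [m.group(1) for m in LINK_RE.finditer(html)]


def hrefs_by_parser(html):
    parser = HrefParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs


samples = [
    '<a href="one.html">One</a>',               # exactly what the regex expects
    "<a class='nav' href='two.html'>Two</a>",   # extra attribute, single quotes
    '<A HREF="three.html"><b>Three</b></A>',    # upper case, nested tag
]

for html in samples:
    print(hrefs_by_regex(html), hrefs_by_parser(html))
```

The regex only finds the first link, while the parser finds all three. A regex can be patched to cope with each variation, but the list of variations on the open web is long, which is the point the quote is making.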