Thread: Extracting data from a webpage

  1. #1
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485

    Extracting data from a webpage

    Hey.

    I am looking to extract some data from a few websites for some statistics work. I know I could do that by looking at the websites in question and then copying and pasting the information I want into a file, but it's not the most interesting thing in the world to do, so I thought this would be a great chance for me to bring out my programming skills.

    But I have never done anything like this before, so I was wondering if anyone knew where to start? I started fooling around with the JavaScript console in Chrome, but I quickly found out that you're not able to save data to a text file, which is essential to me.

    So I was wondering if there is any way I can write a script (the language is not very important, I like learning new things), give it a start webpage, and have it do what I want. All I need is to be able to scan through the source of the webpage looking for keywords and save them to a text file. It would also be nice if I could open new links and have it run as a recursive process. The last thing I would like, though it is not that important, is the ability to save images as well.

    Any ideas where I should start? Just to be clear, I'm not asking for anyone to write any code, I'm just asking for a push in the right direction when it comes to choosing what to use. I would also like to do it myself, even though I am sure there are lots of programs that already do this for you.

    Regards,

  2. #2
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    Well, it's something I'd do in Perl, Bash or Python. You may want to look into (any of) `wget`, `curl` and/or some sort of string parser (perhaps `awk` or `sed`), if you're not using Perl or Python, that is...

    Personally I'd pick Python.
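
    For what it's worth, here is a rough sketch of the "scan the source for keywords and save them to a text file" part in Python. It assumes a current Python 3 and only the standard library; the function names (`fetch`, `find_keywords`, `save_lines`), the URL and the output file name are all made up for illustration, and the page is assumed to be UTF-8.

```python
import re
import urllib.request


def fetch(url):
    """Download a page and return its source as text (assumes UTF-8)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def find_keywords(source, keywords):
    """Return every line of the source that mentions any of the keywords."""
    # Escape the keywords so they are matched literally, ignoring case.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line.strip() for line in source.splitlines() if pattern.search(line)]


def save_lines(lines, path):
    """Write the matching lines to a text file, one per line."""
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")


# Example use (hypothetical URL, keywords and file name):
#   page = fetch("http://example.com/stats.html")
#   save_lines(find_keywords(page, ["goals", "assists"]), "matches.txt")
```

    Matching line by line is the crudest thing that could work; it keeps the surrounding markup, so you would probably want to narrow it down once you know what the pages look like.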

  3. #3
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    Always wanted to use Python, guess this is the time. Thanks.

    Which version of Python is the one with the best support, or should I just get the latest one?
    And what is a good IDE for it? If you don't have any one you really recommend, feel free to ignore the last two questions; I'm sure Google will bring up a few good, long threads arguing about which is best.

  4. #4
    Reverse Engineer maxorator's Avatar
    Join Date
    Aug 2005
    Location
    Estonia
    Posts
    2,318
    I've usually done this in PHP + string functions/regex.
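
    The same regex approach carries over to Python if that is where you end up. A minimal sketch, assuming Python 3; the pattern and the name `extract_links` are just for illustration, and it only catches quoted `href` attributes (including those on `<link>` tags), so it will miss or mangle unusual markup.

```python
import re
from urllib.parse import urljoin

# Deliberately simple: matches href="..." or href='...' and nothing fancier.
HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def extract_links(source, base_url):
    """Return an absolute URL for every quoted href found in the source."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(source)]
```

    `urljoin` is there because most pages use relative links, which are useless until they are resolved against the address of the page they came from.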
    "The Internet treats censorship as damage and routes around it." - John Gilmore

  5. #5
    &TH of undefined behavior Fordy's Avatar
    Join Date
    Aug 2001
    Posts
    5,793
    Quote Originally Posted by h3ro View Post
    Which version of Python is the one with the best support, or should I just get the latest one?
    And what is a good IDE for it? If you don't have any one you really recommend, feel free to ignore the last two questions; I'm sure Google will bring up a few good, long threads arguing about which is best.
    I'd go with 2.5 - it has great OS support. I use 2.6, but it means I end up recompiling some stuff every now and again, and if you can get away with 2.5 then that's the one to use.

    As far as an IDE goes, there are loads. I like Komodo Edit - it's free and pretty lightweight.

  6. #6
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Sounds like wget will do everything you want, including recursive downloading (following links), which would get ugly if you try to implement it yourself.
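
    If you still want to do the link-following yourself, the part that gets ugly is mostly bookkeeping: remembering what you have already visited and knowing when to stop. A small sketch of that, assuming Python 3 and the standard library only. It uses a queue with a depth limit rather than true recursion, and every name here (`LinkParser`, `crawl`, `fetch_page`) is hypothetical. It does not stay on one site, respect robots.txt, or handle network errors.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text (assumes UTF-8; errors are not handled)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth, fetch_page=fetch):
    """Return {url: source} for start_url and pages up to max_depth links away."""
    pages = {}
    queue = deque([(urldefrag(start_url).url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in pages:
            continue  # already visited, so loops between pages are harmless
        source = fetch_page(url)
        pages[url] = source
        if depth == max_depth:
            continue  # keep the page but do not follow its links
        parser = LinkParser()
        parser.feed(source)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in pages:
                queue.append((link, depth + 1))
    return pages
```

    Passing the download function in as `fetch_page` is only there so the crawl logic can be tried against a fake site (a dict of URL to HTML) before pointing it at a real one.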

  7. #7
    Dr Dipshi++ mike_g's Avatar
    Join Date
    Oct 2006
    Location
    On me hyperplane
    Posts
    1,218
    Perhaps this is a bit late, but if you're using Python, then it's definitely worth checking out Beautiful Soup for scraping content.
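
    To get a feel for what a parser buys you before installing anything, here is a sketch of the image-saving part using `html.parser` from the standard library instead. To be clear, this is not Beautiful Soup's API, just the same idea in plain Python 3; `ImageParser`, `image_urls` and `save_image` are made-up names.

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(source, base_url):
    """Return absolute URLs for all images referenced in the page source."""
    parser = ImageParser()
    parser.feed(source)
    return [urljoin(base_url, src) for src in parser.sources]


def save_image(url, folder):
    """Download one image into folder, named after the last part of its path."""
    name = os.path.basename(urlparse(url).path) or "image"
    path = os.path.join(folder, name)
    urllib.request.urlretrieve(url, path)
    return path
```

    Note that `save_image` will overwrite files that happen to share a name, so a real script would need some way to keep them apart.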
