Thread: get keyword and category of a page

  1. #1
    Registered User
    Join Date
    Aug 2008
    Posts
    27

    get keyword and category of a page

    I am going to write a demo that extracts the keywords and category of a web page. Does anyone know of any open-source projects, samples, documents, or books I could start with?

    BTW: I need to deal with some multi-language pages, such as Japanese, which I know nothing about. It would be good if you could recommend something that handles the multi-language problem. Thanks.

  2. #2
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    http://evanjones.ca/unicode-in-c.html < unicode in C
    http://curl.haxx.se/ < curl (for getting the webpage)

  3. #3
    Registered User
    Join Date
    Aug 2008
    Posts
    27
    Sorry, I do not think you answered what I asked. :-)

    I am asking what technologies could be used to extract keywords and category information from a web page (including multi-language pages), not how to get the web page.

    Quote Originally Posted by zacs7 View Post
    http://evanjones.ca/unicode-in-c.html < unicode in C
    http://curl.haxx.se/ < curl (for getting the webpage)

  4. #4
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    You asked for multi-language stuff, which is what zacs was giving you, I would think.

    As for reading the webpage, I would guess you would have to read it. If you're looking for the meta keywords, then you can just look for that tag when you read the page; I'm not sure what you mean by category.
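    For the meta-keywords case, a minimal sketch in C follows. It is not a real HTML parser: it assumes the page is already in memory as a NUL-terminated string, that attribute values are double-quoted, and that no '>' appears inside an attribute value. The names find_ci and extract_meta_keywords are made up for this illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive substring search; returns NULL if needle is absent. */
static const char *find_ci(const char *haystack, const char *needle)
{
    size_t n = strlen(needle);
    for (; *haystack; haystack++) {
        size_t i = 0;
        while (i < n && haystack[i] &&
               tolower((unsigned char)haystack[i]) ==
               tolower((unsigned char)needle[i]))
            i++;
        if (i == n)
            return haystack;
    }
    return NULL;
}

/*
 * Copies the content="..." value of the first <meta name="keywords" ...>
 * tag into out. Returns 1 on success, 0 if no such tag is found, the tag
 * is malformed, or out is too small.
 */
int extract_meta_keywords(const char *html, char *out, size_t out_size)
{
    static const char content_attr[] = "content=\"";
    const char *tag = html;

    while ((tag = find_ci(tag, "<meta")) != NULL) {
        const char *end = strchr(tag, '>');   /* end of this tag */
        if (end == NULL)
            return 0;

        const char *name = find_ci(tag, "name=\"keywords\"");
        if (name != NULL && name < end) {
            const char *content = find_ci(tag, content_attr);
            if (content != NULL && content < end) {
                content += strlen(content_attr);
                const char *close = strchr(content, '"');
                if (close == NULL || close > end)
                    return 0;
                size_t len = (size_t)(close - content);
                if (len + 1 > out_size)
                    return 0;
                memcpy(out, content, len);
                out[len] = '\0';
                return 1;
            }
        }
        tag = end + 1;   /* not the keywords tag; keep scanning */
    }
    return 0;
}
```

    Calling extract_meta_keywords(page, buf, sizeof buf) on a page whose head contains a keywords meta tag leaves the comma-separated list in buf. Many pages have no such tag, in which case the function returns 0 and the keywords have to come from the page text instead.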

  5. #5
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    You also just asked for references to start you off, are you writing the demo or looking for a program to do what you describe?

    More info please

  6. #6
    Registered User
    Join Date
    Aug 2008
    Posts
    27
    Any reference (code, papers, tutorials) is fine. My purpose is just to extract keywords from a web page. Another job is to categorize the web page -- for example, to identify it automatically as a financial web page or a sports web page.

    Quote Originally Posted by zacs7 View Post
    You also just asked for references to start you off, are you writing the demo or looking for a program to do what you describe?

    More info please

  7. #7
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by Checker1977 View Post
    Any reference (code, papers, tutorials) is fine. My purpose is just to extract keywords from a web page. Another job is to categorize the web page -- for example, to identify it automatically as a financial web page or a sports web page.
    The way forums work, in general, and this one is no exception, is that if you ask DETAILED questions, you get good answers. If you ask "open ended" questions that would need dozens of pages to be answered even nearly in full, then you are going to get very short answers with links to basic functionality that does roughly what you want - if you are lucky.

    Apparently you already know how to use libcurl.

    So what exactly is it you want to have help with?

    When you say "keywords" do you mean the content itself? And if so, what constitutes a keyword? Do you have a list of keywords, or is the application supposed to determine what is a keyword by itself?

    Reading any foreign language is pretty difficult if you don't know at least some of the language yourself, and for many non-European languages you will probably need help from someone who knows the language, as the buildup of words is much more of "one symbol -> one word".

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #8
    Registered User
    Join Date
    Aug 2008
    Posts
    27
    I mean the semantic meaning of a web page. For example, when you browse MSN Money pages, you get keywords like financial, stock, debts, layoff, auto industry, and so on.

    It is just like when we set up a site under contract with Google: we need to provide keywords for each page to bid on. Now I just want some automatic way to generate keywords for each page I have.

    Not sure whether my point is clear this time.

    Quote Originally Posted by matsp View Post
    The way forums work, in general, and this one is no exception, is that if you ask DETAILED questions, you get good answers. If you ask "open ended" questions that would need dozens of pages to be answered even nearly in full, then you are going to get very short answers with links to basic functionality that does roughly what you want - if you are lucky.

    Apparently you already know how to use libcurl.

    So what exactly is it you want to have help with?

    When you say "keywords" do you mean the content itself? And if so, what constitutes a keyword? Do you have a list of keywords, or is the application supposed to determine what is a keyword by itself?

    Reading any foreign language is pretty difficult if you don't know at least some of the language yourself, and for many non-European languages you will probably need help from someone who knows the language, as the buildup of words is much more of "one symbol -> one word".

    --
    Mats

  9. #9
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Well, your point is fairly clear - however, the solution is non-trivial, I'm pretty sure. How do you, when you have the content of the MSN Money page, determine that THOSE words are keywords? Yes, we can easily remove "the", "it", "is", "are" and other common words. But the content is still fairly complex.

    And how do you deal with "sport and finance intermingled":
    The Honda Formula1 team leaves the sport, claiming credit crunch and return on investment as reasons
    (By the way, I just made that particular text up - but something similar has been published recently).

    So, what I'm saying is that you either have to make a list of words that YOU think will categorize a page, and then use that list for your categorization.
    Or you need some really clever algorithms to figure out which words are "important" and which aren't.
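
    The first option (a hand-made word list per category) can be sketched in C roughly as below. The word lists, the stop-word list, and the names categorize and N_CATEGORIES are invented for this illustration; the sketch splits on ASCII letters and digits only and matches whole words exactly, so "stocks" does not match "stock", and text without spaces between words (Japanese, for instance) would need a proper tokenizer first.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum { N_CATEGORIES = 2 };

struct category {
    const char *name;
    const char *words[8];   /* NULL-terminated list of lowercase keywords */
};

/* Hypothetical hand-made word lists; a real list would be far longer. */
static const struct category categories[N_CATEGORIES] = {
    { "finance", { "stock", "credit", "investment", "debt", "layoff", NULL } },
    { "sport",   { "team", "race", "sport", "league", "match", NULL } },
};

/* Common words that carry no category information. */
static int is_stop_word(const char *w)
{
    static const char *stop[] = { "the", "it", "is", "are", "and", "on", "as", NULL };
    for (size_t i = 0; stop[i] != NULL; i++)
        if (strcmp(w, stop[i]) == 0)
            return 1;
    return 0;
}

/* Adds one point to every category whose list contains the word. */
static void score_word(const char *w, int scores[N_CATEGORIES])
{
    for (size_t c = 0; c < N_CATEGORIES; c++)
        for (size_t i = 0; categories[c].words[i] != NULL; i++)
            if (strcmp(w, categories[c].words[i]) == 0)
                scores[c]++;
}

/*
 * Fills scores[] with the number of keyword hits per category and returns
 * the name of the highest-scoring category (the first one on a tie), or
 * NULL if no keyword matched at all.
 */
const char *categorize(const char *text, int scores[N_CATEGORIES])
{
    char word[32];
    size_t len = 0;

    for (size_t c = 0; c < N_CATEGORIES; c++)
        scores[c] = 0;

    for (const char *p = text; ; p++) {
        if (*p != '\0' && isalnum((unsigned char)*p)) {
            if (len < sizeof word - 1)          /* overlong words are truncated */
                word[len++] = (char)tolower((unsigned char)*p);
        } else {
            if (len > 0) {                      /* end of a word */
                word[len] = '\0';
                if (!is_stop_word(word))
                    score_word(word, scores);
                len = 0;
            }
            if (*p == '\0')
                break;
        }
    }

    size_t best = 0;
    for (size_t c = 1; c < N_CATEGORIES; c++)
        if (scores[c] > scores[best])
            best = c;
    return scores[best] > 0 ? categories[best].name : NULL;
}
```

    With these particular lists, the Honda sentence scores two hits for finance ("credit", "investment") and two for sport ("team", "sport"), so the result is a tie. That is the intermingling problem in miniature: a plain word count cannot decide, and something smarter (weights, more context, or multiple categories per page) is needed.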

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  10. #10
    Registered User
    Join Date
    Aug 2008
    Posts
    27
    Thanks. I understand the solution is not trivial. I just want to know whether there is any existing solution, such as an open-source project, that could be used to extract keywords from a specific page.

    I have no background in this area, so I am just asking people here for experienced advice. :-)

    Quote Originally Posted by matsp View Post
    Well, your point is fairly clear - however, the solution is non-trivial, I'm pretty sure. How do you, when you have the content of the MSN Money page, determine that THOSE words are keywords? Yes, we can easily remove "the", "it", "is", "are" and other common words. But the content is still fairly complex.

    And how do you deal with "sport and finance intermingled":
    (By the way, I just made that particular text up - but something similar has been published recently).

    So, what I'm saying is that you either have to make a list of words that YOU think will categorize a page, and then use that list for your categorization.
    Or you need some really clever algorithms to figure out which words are "important" and which aren't.

    --
    Mats

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    I personally am not aware of any open source project that solves this problem. I'm sure that if someone else knew about it, they would have posted so.

    I'm sure google is (at least partly) able to do this - but that's their business, and I'm sure they do not publish much of their know-how.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  12. #12
    Registered User
    Join Date
    Aug 2008
    Posts
    27
    Thanks! Let us just see if anyone else has good ideas. :-)

    Quote Originally Posted by matsp View Post
    I personally am not aware of any open source project that solves this problem. I'm sure that if someone else knew about it, they would have posted so.

    I'm sure google is (at least partly) able to do this - but that's their business, and I'm sure they do not publish much of their know-how.

    --
    Mats
