Thread: Support vector machines

  1. #1
    Registered User
    Join Date
    Jun 2007
    Posts
    24

    Support vector machines

    Hi!

    I applied svm-train to a data set with N features and found the
    cross-validation accuracy. Then I applied svm-train again to a data set
    that is a subset with only K features (K < N) and found the
    cross-validation accuracy for that. The accuracies in both cases are
    the same, and in fact for any subset drawn from the original N features,
    the accuracy comes out the same, irrespective of the number of features.
    Shouldn't the accuracy improve as the number of features is reduced?
    Why the anomaly?
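
    Here is roughly what I am doing, sketched in Python with scikit-learn
    instead of the libsvm command line (the data set here is synthetic and
    the subset is just the first 300 columns -- stand-ins, not my actual data):
    Code:
      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from sklearn.svm import SVC

      # stand-in data: 200 samples, N = 1000 features, few informative ones
      X, y = make_classification(n_samples=200, n_features=1000,
                                 n_informative=10, random_state=0)

      # 5-fold cross-validation accuracy on all N features
      acc_full = cross_val_score(SVC(), X, y, cv=5).mean()

      # the same on a K-feature subset (here simply the first 300 columns)
      acc_sub = cross_val_score(SVC(), X[:, :300], y, cv=5).mean()

      print(acc_full, acc_sub)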

  2. #2
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by megastar View Post
    I applied svm-train to a data set with N features and found the
    cross-validation accuracy. Then I applied svm-train again to a data set
    that is a subset with only K features (K < N) and found the
    cross-validation accuracy for that. The accuracies in both cases are
    the same, and in fact for any subset drawn from the original N features,
    the accuracy comes out the same, irrespective of the number of features.
    Shouldn't the accuracy improve as the number of features is reduced?
    Why the anomaly?
    It's not guaranteed that reducing the dimensionality increases accuracy -- that's only true if by doing so you can throw out some noise. If the noise is partially correlated you can reduce it by performing PCA -- not sure if you're doing that to reduce your feature space or just selecting a subset of the original dimensions.
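
    If you go the PCA route, here is a sketch of what I mean
    (Python/scikit-learn; 50 components is an arbitrary guess -- tune it
    for your data):
    Code:
      from sklearn.datasets import make_classification
      from sklearn.decomposition import PCA
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import SVC

      # stand-in data; swap in your own
      X, y = make_classification(n_samples=200, n_features=1000,
                                 n_informative=10, random_state=0)

      # PCA sits inside the pipeline so the projection is re-fit on the
      # training portion of every CV fold (no peeking at validation data)
      clf = make_pipeline(PCA(n_components=50), SVC())
      print(cross_val_score(clf, X, y, cv=5).mean())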

    Are the accuracies EXACTLY the same in both cases? I would suspect a bug in the code...

    EDIT: Either that, or your validation set is small enough that the accuracies are lining up by random chance.
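
    To see how likely that is: cross-validation accuracy is just
    (number correct) / (number of samples), so on a small set two different
    feature subsets can easily land on exactly the same value. A rough
    simulation (plain Python, all numbers made up):
    Code:
      import random

      # two independent classifiers, each with true accuracy 0.8, scored
      # on the same m-sample validation set -- how often do the measured
      # accuracies tie exactly?
      m, trials = 50, 100_000
      random.seed(0)
      ties = sum(
          sum(random.random() < 0.8 for _ in range(m)) ==
          sum(random.random() < 0.8 for _ in range(m))
          for _ in range(trials)
      )
      print(ties / trials)  # on the order of 0.1 for m = 50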
    Last edited by brewbuck; 07-09-2007 at 01:57 PM.

  3. #3
    Registered User
    Join Date
    Jun 2007
    Posts
    24
    Yup! I got that point later... anyway, noise reduction can be done in many ways.
    As to your second statement:
    Quote Originally Posted by brewbuck View Post
    Are the accuracies EXACTLY the same in both cases? I would suspect a bug in the code...

    EDIT: Either that, or your validation set is small enough that the accuracies are lining up by random chance.
    Why do you think exactly the same cross-validation accuracy is not possible for different subsets?
    In fact, I think this tendency will be more prevalent in larger subsets, which carry a lot of noise, than in smaller ones, so that two different 'large' subsets will eventually give the same cross-validation accuracy. I think it is worth looking into what exactly affects the classifier's accuracy: one factor is the number of informative features. Is there anything else?

  4. #4
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by megastar View Post
    Why do you think exactly the same cross-validation accuracy is not possible for different subsets?
    For any two given subsets I wouldn't be surprised, but for it to be the same on ALL subsets is very suspicious, I think.

    In fact, I think this tendency will be more prevalent in larger subsets, which carry a lot of noise, than in smaller ones, so that two different 'large' subsets will eventually give the same cross-validation accuracy...
    Not sure I understand this, can you rephrase?

  5. #5
    Registered User
    Join Date
    Jun 2007
    Posts
    24
    I guess it wasn't very clear... sorry about that.
    Consider the following scenario: there are, let us say, 15000 features
    for each sample, and the subsets we can generate have 4000-8000 features,
    e.g. subset 1: 4500 features, subset 2: 7900 features. Now suppose the
    number of actually informative features, i.e. those the classifier needs
    in order to classify a sample, is very small compared to the total --
    maybe something like 10-15. Then any generated subset will still contain
    lots of irrelevant features, i.e. noisy data. So even when the subset
    sizes differ, the noise reduction is still insignificant compared to the
    number of informative features, and there should not be any improvement
    in accuracy.
    That, I guess, is why my accuracies have been the same: there has to be
    a significant reduction in noise, with retention of the requisite
    features, for the classifier's accuracy to improve. With smaller subsets
    of only a few hundred features, the informative ones may or may not be
    retained, which results in different accuracies.
    This was one of my reasonings, but I still suspect it may not be the only one...
    Would you like to suggest anything?
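
    A toy version of the scenario, in case you want to poke at it
    (Python/scikit-learn; the sample count and subset sizes are made up,
    and I have not verified how the numbers come out):
    Code:
      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from sklearn.svm import SVC

      # 15000 features per sample, only 15 of them informative
      X, y = make_classification(n_samples=100, n_features=15000,
                                 n_informative=15, n_redundant=0,
                                 random_state=0)

      # draw random subsets of very different sizes, compare CV accuracy
      rng = np.random.default_rng(0)
      for size in (4500, 7900):
          cols = rng.choice(X.shape[1], size=size, replace=False)
          acc = cross_val_score(SVC(), X[:, cols], y, cv=5).mean()
          print(size, acc)  # does accuracy move with subset size?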
