Thread: Support vector machines

  1. #1
    Registered User
    Join Date
    Jun 2007
    Posts
    24

    Support vector machines

    Hi!

    I applied svm-train to a data set with N features and found the
    cross-validation accuracy. Then I applied svm-train again to a data set
    that is a subset with only K features (K < N) and found the
    cross-validation accuracy for that. The accuracies in both cases are
    the same, and in fact for any subset drawn from the original N features,
    the accuracy comes out the same, irrespective of the number of features.
    Shouldn't the accuracy improve as the number of features is reduced?
    Why the anomaly?
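
    Here is roughly what I am doing, sketched in Python with scikit-learn
    instead of the libsvm command line (the data set here is synthetic and
    the subset is just the first 300 columns -- stand-ins, not my actual data):
    Code:
      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from sklearn.svm import SVC

      # stand-in data: 200 samples, N = 1000 features, few informative ones
      X, y = make_classification(n_samples=200, n_features=1000,
                                 n_informative=10, random_state=0)

      # 5-fold cross-validation accuracy on all N features
      acc_full = cross_val_score(SVC(), X, y, cv=5).mean()

      # the same on a K-feature subset (here simply the first 300 columns)
      acc_sub = cross_val_score(SVC(), X[:, :300], y, cv=5).mean()

      print(acc_full, acc_sub)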

  2. #2
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by megastar View Post
    I applied svm-train to a data set with N features and found the
    cross-validation accuracy. Then I applied svm-train again to a data set
    that is a subset with only K features (K < N) and found the
    cross-validation accuracy for that. The accuracies in both cases are
    the same, and in fact for any subset drawn from the original N features,
    the accuracy comes out the same, irrespective of the number of features.
    Shouldn't the accuracy improve as the number of features is reduced?
    Why the anomaly?
    It's not guaranteed that reducing the dimensionality increases accuracy -- that's only true if by doing so you can throw out some noise. If the noise is partially correlated you can reduce it by performing PCA -- not sure if you're doing that to reduce your feature space or just selecting a subset of the original dimensions.
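
    If you go the PCA route, here is a sketch of what I mean
    (Python/scikit-learn; 50 components is an arbitrary guess -- tune it
    for your data):
    Code:
      from sklearn.datasets import make_classification
      from sklearn.decomposition import PCA
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import SVC

      # stand-in data; swap in your own
      X, y = make_classification(n_samples=200, n_features=1000,
                                 n_informative=10, random_state=0)

      # PCA sits inside the pipeline so the projection is re-fit on the
      # training portion of every CV fold (no peeking at validation data)
      clf = make_pipeline(PCA(n_components=50), SVC())
      print(cross_val_score(clf, X, y, cv=5).mean())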

    Are the accuracies EXACTLY the same in both cases? I would suspect a bug in the code...

    EDIT: Either that, or your validation set is small enough that the accuracies are lining up by random chance.
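
    To see how likely that is: cross-validation accuracy is just
    (number correct) / (number of samples), so on a small set two different
    feature subsets can easily land on exactly the same value. A rough
    simulation (plain Python, all numbers made up):
    Code:
      import random

      # two independent classifiers, each with true accuracy 0.8, scored
      # on the same m-sample validation set -- how often do the measured
      # accuracies tie exactly?
      m, trials = 50, 100_000
      random.seed(0)
      ties = sum(
          sum(random.random() < 0.8 for _ in range(m)) ==
          sum(random.random() < 0.8 for _ in range(m))
          for _ in range(trials)
      )
      print(ties / trials)  # on the order of 0.1 for m = 50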
    Last edited by brewbuck; 07-09-2007 at 01:57 PM.

  3. #3
    Registered User
    Join Date
    Jun 2007
    Posts
    24
    Yup! I got that point later... anyway, noise reduction can be done in many ways.
    As to your second statement:
    Quote Originally Posted by brewbuck View Post
    Are the accuracies EXACTLY the same in both cases? I would suspect a bug in the code...

    EDIT: Either that, or your validation set is small enough that the accuracies are lining up by random chance.
    Why do you think exactly the same cross-validation accuracy is not possible for different subsets?
    In fact, I think this tendency will be more prevalent in larger subsets, which carry a lot of noise, than in smaller ones, so that two different 'large' subsets will eventually give the same cross-validation accuracy. I think it is worth looking into what exactly affects the classifier's accuracy: one factor is the number of informative features. Is there anything else?

  4. #4
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by megastar View Post
    Why do you think exactly the same cross-validation accuracy is not possible for different subsets?
    For any two given subsets I wouldn't be surprised, but for it to be the same on ALL subsets is very suspicious, I think.

    In fact, I think this tendency will be more prevalent in larger subsets, which carry a lot of noise, than in smaller ones, so that two different 'large' subsets will eventually give the same cross-validation accuracy...
    Not sure I understand this, can you rephrase?

  5. #5
    Registered User
    Join Date
    Jun 2007
    Posts
    24
    I guess it wasn't very clear... sorry about that.
    Consider the following scenario: there are, let us say, 15000 features
    for each sample, and the subsets we can generate have 4000-8000 features,
    e.g. subset 1: 4500 features, subset 2: 7900 features. Now suppose the
    number of actually informative features, i.e. those the classifier needs
    in order to classify a sample, is very small compared to the total --
    maybe something like 10-15. Then any generated subset will still contain
    lots of irrelevant features, i.e. noisy data. So even when the subset
    sizes differ, the noise reduction is still insignificant compared to the
    number of informative features, and there should not be any improvement
    in accuracy.
    That, I guess, is why my accuracies have been the same: there has to be
    a significant reduction in noise, with retention of the requisite
    features, for the classifier's accuracy to improve. With smaller subsets
    of only a few hundred features, the informative ones may or may not be
    retained, which results in different accuracies.
    This was one of my reasonings, but I still suspect it may not be the only one...
    Would you like to suggest anything?
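
    A toy version of the scenario, in case you want to poke at it
    (Python/scikit-learn; the sample count and subset sizes are made up,
    and I have not verified how the numbers come out):
    Code:
      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from sklearn.svm import SVC

      # 15000 features per sample, only 15 of them informative
      X, y = make_classification(n_samples=100, n_features=15000,
                                 n_informative=15, n_redundant=0,
                                 random_state=0)

      # draw random subsets of very different sizes, compare CV accuracy
      rng = np.random.default_rng(0)
      for size in (4500, 7900):
          cols = rng.choice(X.shape[1], size=size, replace=False)
          acc = cross_val_score(SVC(), X[:, cols], y, cv=5).mean()
          print(size, acc)  # does accuracy move with subset size?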
