C Board  

07-03-2007, 02:42 AM   #1
Registered User
 
Join Date: Jun 2007
Posts: 24
Support vector machines

Hi!

I ran svm-train on a data set with N features and recorded
the cross-validation accuracy.
Then I ran svm-train again on a data set
containing only a subset of K features (K < N)
and recorded the cross-validation accuracy again.
The accuracies in both cases are the same.
In fact, for any subset derived from the original N-feature set,
the accuracy comes out the same, irrespective of the number of features.
Shouldn't the accuracy improve when the number of features is reduced?
Why the anomaly?
Posted by megastar
07-09-2007, 01:53 PM   #2
brewbuck
Senior software engineer
Join Date: Mar 2007
Location: Portland, OR
Posts: 5,381
Quote:
Originally Posted by megastar View Post
I ran svm-train on a data set with N features and recorded
the cross-validation accuracy.
Then I ran svm-train again on a data set
containing only a subset of K features (K < N)
and recorded the cross-validation accuracy again.
The accuracies in both cases are the same.
In fact, for any subset derived from the original N-feature set,
the accuracy comes out the same, irrespective of the number of features.
Shouldn't the accuracy improve when the number of features is reduced?
Why the anomaly?
It's not guaranteed that reducing the dimensionality increases accuracy -- that's only true if by doing so you can throw out some noise. If the noise is partially correlated you can reduce it by performing PCA -- not sure if you're doing that to reduce your feature space or just selecting a subset of the original dimensions.

Are the accuracies EXACTLY the same in both cases? I would suspect a bug in the code...

EDIT: Either that, or your validation set is small enough that the accuracies are lining up by random chance.

Last edited by brewbuck; 07-09-2007 at 01:57 PM.
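To put a rough number on the "random chance" explanation above: with n validation samples the accuracy can only take n + 1 distinct values, so exact ties get more likely as n shrinks. The sketch below assumes, purely for illustration, that two classifiers each get any given sample right independently with the same probability p. Real classifiers scored on the same samples are correlated, so this is only a toy model. The function name and the sizes are made up.

```python
from math import exp, lgamma, log

def tie_probability(n, p):
    """Chance that two independent classifiers, each correct on any sample
    with probability p (0 < p < 1), score exactly the same accuracy on n samples."""
    total = 0.0
    for k in range(n + 1):
        # Log of the binomial probability of exactly k correct out of n
        # (log space avoids overflow for large n).
        log_pk = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                  + k * log(p) + (n - k) * log(1 - p))
        total += exp(2 * log_pk)  # both classifiers land on the same k
    return total

if __name__ == "__main__":
    for n in (20, 200, 2000):  # hypothetical validation-set sizes
        print(n, round(tie_probability(n, 0.8), 4))
```

This covers a single pair of runs only. Under the same assumption, identical accuracy across every subset would be far less likely than one tie.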
07-12-2007, 08:16 AM   #3
Registered User
 
Join Date: Jun 2007
Posts: 24
Yup! I got that point later. Anyway, noise can be reduced in many ways.
As to your second statement:
Quote:
Are the accuracies EXACTLY the same in both cases? I would suspect a bug in the code...

EDIT: Either that, or your validation set is small enough that the accuracies are lining up by random chance.
Why do you think exactly identical cross-validation accuracies are not possible for different subsets?
In fact, I think this tendency will be more prevalent in larger subsets, which contain a lot of noise, than in smaller ones, so that a different 'large subset' will eventually give the same cross-validation accuracy. I also think it is necessary to look into what exactly affects the classifier's accuracy. One factor is the number of informative features. Is there anything else?
Posted by megastar
07-13-2007, 11:17 AM   #4
brewbuck
Senior software engineer
Join Date: Mar 2007
Location: Portland, OR
Posts: 5,381
Quote:
Originally Posted by megastar View Post
Why do you think exactly identical cross-validation accuracies are not possible for different subsets?
For any two given subsets I wouldn't be surprised. For it to be the same on ALL subsets is very suspicious I think.

Quote:
In fact, I think this tendency will be more prevalent in larger subsets, which contain a lot of noise, than in smaller ones, so that a different 'large subset' will eventually give the same cross-validation accuracy...
Not sure I understand this, can you rephrase?
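On the point that the same accuracy on ALL subsets is suspicious: one cheap check is to compare the reported figure with the majority-class baseline. A classifier that predicts the most frequent label for every sample scores roughly the same accuracy no matter which features it is given. Nothing in this thread establishes that this is what happened here, so the sketch below only shows the comparison. The labels and the reported accuracy are made-up values, and the function names are illustrative.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def looks_degenerate(reported_accuracy, labels, tol=1e-9):
    """True if the reported accuracy matches the majority-class baseline."""
    return abs(reported_accuracy - majority_baseline(labels)) <= tol

if __name__ == "__main__":
    labels = [1] * 70 + [-1] * 30  # hypothetical class labels
    print(majority_baseline(labels))
    print(looks_degenerate(0.70, labels))
```

If the two numbers match, the classifier is probably ignoring the features altogether, and the model settings are the place to look rather than the feature subsets.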
07-19-2007, 01:30 AM   #5
Registered User
 
Join Date: Jun 2007
Posts: 24
I guess it wasn't very clear. Sorry about that.
Consider the following scenario:
Let us say there are 15000 features for each sample.
The subsets we can generate have 4000-8000 features, e.g.
subset1: 4500, subset2: 7900, and so on. Now suppose the number of actually informative features, i.e. those
the classifier needs in order to classify a sample, is very small compared to the total number of features, maybe something like 10-15. Then the subsets generated will contain lots of irrelevant features (noisy data). So even when the subset sizes differ, SINCE THE NOISE REDUCTION IS STILL INSIGNIFICANT AS COMPARED TO THE NUMBER OF INFORMATIVE FEATURES, there should not be any improvement in accuracy,
which is why I guess my accuracies have been the same. THERE SHOULD BE A SIGNIFICANT REDUCTION IN NOISE WITH RETENTION OF THE REQUISITE FEATURES in order to improve the classifier's accuracy. With smaller subsets of a few hundred features, the informative ones may or may not be retained, which results in different accuracies.
This was one line of reasoning, but I still suspect it may not be the only one.
Would you like to suggest anything?
Posted by megastar
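The scenario in the last post can be made concrete with a little counting. The sketch assumes the subsets are drawn uniformly at random from the 15000 features and that exactly 15 of them are informative. The thread does not say how the subsets were actually chosen, so both are assumptions, and the function names are illustrative.

```python
def expected_informative(n_total, n_informative, subset_size):
    """Expected number of informative features in a uniformly random subset."""
    return n_informative * subset_size / n_total

def prob_all_retained(n_total, n_informative, subset_size):
    """Probability that a uniformly random subset keeps every informative feature."""
    if subset_size < n_informative:
        return 0.0
    p = 1.0
    for i in range(n_informative):
        # Chance the (i+1)-th informative feature is kept, given the first i were.
        p *= (subset_size - i) / (n_total - i)
    return p

if __name__ == "__main__":
    for size in (4500, 7900, 300):  # subset sizes taken from the scenario
        kept = expected_informative(15000, 15, size)
        # size, expected informative count, informative share, P(all kept)
        print(size, kept, kept / size, prob_all_retained(15000, 15, size))
```

Under these assumptions the expected share of informative features is 15/15000 = 0.1% for every subset size, so random subsetting alone does not change the proportion of noise. What does change is how many informative features survive, and for a subset of a few hundred features the expected count drops below one.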