Discussion:
[Scikit-learn-general] SVM stability and LR feature mapping
Richard Cubek
2013-04-28 18:06:11 UTC
Hello everyone,

I'm new to the list, so first of all, thanks a lot for your work on
this lib!

I need libsvm probability estimates as well as Logistic Regression (LR)
for a three-class problem with a training set of about 5,000-6,000
samples and 20-50 features. I am familiar with Python and Octave
(regarding the math, even more with Octave), but I would prefer Python,
since I need all the surrounding programming, which can be tedious in
Octave...

Reading a lot of posts in discussions, scikit-learn seems to offer the
most advanced and best-documented Python bindings for libsvm, but I
also found the following site:
http://fseoane.net/blog/2010/fast-bindings-for-libsvm-in-scikitslearn

He writes that his bindings are implemented in scikit-learn, but also
that the code is in alpha status; that was three years ago. Well, I
started with a simple problem: 65 data points with 2 features each.

Questions:

1) Playing around with SVM probabilities, it "seems" to work nicely
(linked plot). I just wanted to ask how stable the Python binding is,
regarding the website issue mentioned above.
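For illustration, roughly what I did (a minimal sketch; the toy data
just stands in for my 65 points):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.randn(65, 2)            # 65 points with 2 features, as above
    y = rng.randint(0, 3, size=65)  # three classes

    # probability=True makes libsvm fit Platt scaling internally,
    # which enables predict_proba (at some extra training cost)
    clf = SVC(kernel='rbf', C=1.0, probability=True)
    clf.fit(X, y)
    print(clf.predict_proba(X[:5]))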

2) Playing around with LR, the results "look interesting"
(https://dl.dropboxusercontent.com/u/95888530/logreg_1.png), but I was
not able to reproduce a model adapting/"overfitting" to every single
data point, as in the SVM example plot (I tried a very large C). I did
the first ML online class with Andrew Ng, where we implemented LR
ourselves, but the feature creation from the data features was ad hoc
(from x and y to x^2, y^2, x*y, x*y^2 and so on). I followed the same
feature mapping here, in the end getting 28 features out of 2. It takes
about 15-17 seconds to fit the model (on my simple example).
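Roughly, the mapping plus fit looks like this (a sketch; I assume a
scikit-learn version that has PolynomialFeatures, which reproduces the
degree-6 mapping from the course, 2 -> 28 features including the bias
column; otherwise the mapping is a few lines of numpy):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(65, 2)
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # toy labels

    poly = PolynomialFeatures(degree=6)  # all monomials up to degree 6
    X_poly = poly.fit_transform(X)
    print(X_poly.shape)                  # (65, 28)

    clf = LogisticRegression(C=1e4)      # very large C = almost no regularization
    clf.fit(X_poly, y)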

I know feature selection/extraction is itself a big research topic, but
maybe scikit-learn can help me here without the need to read a dozen
papers, or maybe there are some rules of thumb. So is there any method
within scikit-learn that could help me find a feature mapping? I guess
RandomizedLogisticRegression could help me somehow, but I didn't really
get the point. I think here, again, I would have to provide the
features myself, and it would just help me pick the best ones by random
trials? On my real data set, mapping the 20-50 features to
higher-dimensional spaces and trying things out would probably take too
long, considering the 15 seconds needed for a single model on the
simple example (and we are not yet even talking about searching for the
optimal regularization C; a grid-search sketch follows below). Any
suggestions?
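At least the C search can be automated; a minimal sketch with
placeholder grid values (the import location depends on the
scikit-learn version):

    import numpy as np
    from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(200, 28)      # stand-in for the mapped features
    y = rng.randint(0, 3, 200)

    # cross-validated search over the regularization strength
    search = GridSearchCV(LogisticRegression(),
                          param_grid={'C': [0.01, 0.1, 1.0, 10.0, 100.0]},
                          cv=5, n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_)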


Cheers!

Richard
Gael Varoquaux
2013-04-28 21:19:27 UTC
how stable the Python binding is, regarding the website issue mentioned
above.
Fairly stable, I would say. Those remarks applied years ago.
So is there any method within scikit-learn that could help me find a
feature mapping?
I am not sure what you mean by a feature mapping. Do you mean a
non-linear mapping to a feature space in which the classes should be
separable?

You might try totally random trees embedding for this purpose:
http://scikit-learn.org/stable/modules/ensemble.html#totally-random-trees-embedding
and
http://scikit-learn.org/stable/auto_examples/ensemble/plot_random_forest_embedding.html
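A minimal sketch of how the embedding combines with a linear model (the
parameters are placeholders, not recommendations):

    import numpy as np
    from sklearn.ensemble import RandomTreesEmbedding
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(100, 2)
    y = rng.randint(0, 3, 100)

    # each tree partitions the input space; a sample is encoded by the
    # one-hot indices of the leaves it falls into (sparse binary output)
    embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3,
                                    random_state=0)
    X_embedded = embedder.fit_transform(X)

    # a linear classifier trained on the embedding is non-linear in X
    clf = LogisticRegression().fit(X_embedded, y)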

HTH,

Gaël
Andreas Mueller
2013-04-29 08:53:21 UTC
Post by Gael Varoquaux
how stable the Python binding is, regarding the website issue mentioned
above.
Fairly stable, I would say. Those remarks applied years ago.
So is there any method within scikit-learn that could help me find a
feature mapping?
I am not sure what you mean by a feature mapping. Do you mean a
non-linear mapping to a feature space in which the classes should be
separable?
http://scikit-learn.org/stable/modules/ensemble.html#totally-random-trees-embedding
and
http://scikit-learn.org/stable/auto_examples/ensemble/plot_random_forest_embedding.html
Have you tried them?
Gael Varoquaux
2013-04-29 18:57:34 UTC
Post by Andreas Mueller
Post by Gael Varoquaux
http://scikit-learn.org/stable/modules/ensemble.html#totally-random-trees-embedding
and
http://scikit-learn.org/stable/auto_examples/ensemble/plot_random_forest_embedding.html
Have you tried them?
Me? No, not yet. The data I play with tends to be too high-dimensional
with not enough samples, I believe.

G
Richard Cubek
2013-04-29 18:47:05 UTC
Post by Gael Varoquaux
So is there any method within scikit-learn that could help me find a
feature mapping?
I am not sure what you mean by a feature mapping. Do you mean a
non-linear mapping to a feature space in which the classes should be
separable?
Yes, sorry for being imprecise.
Post by Gael Varoquaux
http://scikit-learn.org/stable/modules/ensemble.html#totally-random-trees-embedding
and
http://scikit-learn.org/stable/auto_examples/ensemble/plot_random_forest_embedding.html
Ok, I will have a look. Thanks.

Cheers

Richard
Andreas Mueller
2013-04-29 08:55:51 UTC
Post by Richard Cubek
Hello everyone,
2) Playing around with LR, the results "look interesting"
(https://dl.dropboxusercontent.com/u/95888530/logreg_1.png), but I was
not able to reproduce a model adapting/"overfitting" to every single
data point, as in the SVM example plot (I tried a very large C). I did
the first ML online class with Andrew Ng, where we implemented LR
ourselves, but the feature creation from the data features was ad hoc
(from x and y to x^2, y^2, x*y, x*y^2 and so on). I followed the same
feature mapping here, in the end getting 28 features out of 2. It takes
about 15-17 seconds to fit the model (on my simple example).
Why don't you use a kernel SVM (SVC)?
There is no kernel Logistic Regression in sklearn, but there are some
kernel-approximation methods that you could use together with various
kernels and then apply the standard LogisticRegression.
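A sketch of one such combination, using the Nystroem approximation of
an RBF kernel (gamma, n_components and C are placeholders to be tuned):

    import numpy as np
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.RandomState(0)
    X = rng.randn(100, 2)
    y = rng.randint(0, 3, 100)

    # Nystroem builds an explicit (approximate) kernel feature map,
    # so any linear model can be trained on the transformed data
    model = make_pipeline(Nystroem(kernel='rbf', gamma=1.0, n_components=50),
                          LogisticRegression(C=10.0))
    model.fit(X, y)
    print(model.predict(X[:5]))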

Cheers,
Andy
Richard Cubek
2013-04-29 19:18:28 UTC
Post by Andreas Mueller
Why don't you use a kernel SVM (SVC)?
There is no kernel Logistic Regression in sklearn, but there are some
kernel-approximation methods that you could use together with various
kernels and then apply the standard LogisticRegression.
I don't know how to combine these methods, so I will have to take a
look at some examples. Thanks.

Cheers

Richard
