Discussion:
Google Summer of Code 2014
Manoj Kumar
2014-01-15 18:07:25 UTC
Permalink
Hello,

First of all, thanks to the scikit-learn community for guiding new
developers. I'm thankful for all the help that I've got with my Pull
Requests till now.

I hope this is the right place to discuss GSoC-related ideas (I've idled
in the scikit-learn IRC channel on quite a few occasions, but I could
not catch any core developer). I was browsing through last year's
threads and found this idea related to collaborative filtering (CF)
quite interesting:
http://sourceforge.net/mailarchive/message.php?msg_id=30725712 , though
it was sadly not accepted.

If the scikit-learn community is still enthusiastic about a recsys module
with CF algorithms implemented, I would love for this to be my GSoC
proposal, and we could discuss the algorithms, how they would gel with the
present sklearn API, how much we could realistically fit into a 3-month
period, etc.

Awaiting a reply.

--
Regards,
Manoj Kumar,
Mech Undergrad
http://manojbits.wordpress.com
Kyle Kastner
2014-01-15 18:42:05 UTC
Permalink
I looked into this once upon a time, and one of the key problems (from
talking to Jake IIRC) is how to handle the "missing values" in the input
array. You would either need a mask, or some kind of indexing system for
describing which value goes where in the input matrix. Either way, this
extra argument would be a requirement for CF, but not for the existing
algorithms in sklearn.

Maybe it would only operate on sparse arrays, and infer that the missing
values are the ones to be imputed ("recommended")? But not supporting dense
arrays would be basically the opposite of the other modules in sklearn,
where dense input is the default. Maybe someone can comment on this?
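To make the concern concrete, here is a small sketch (plain NumPy/SciPy, nothing sklearn-specific) of how a sparse ratings matrix encodes "which value goes where" implicitly, and why the unstored entries are ambiguous without an extra mask or convention:

```python
import numpy as np
from scipy import sparse

# 3 users x 4 items; only five ratings are observed.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 2, 1, 0, 3])
ratings = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
X = sparse.coo_matrix((ratings, (rows, cols)), shape=(3, 4)).tocsr()

# The sparsity pattern itself records which value goes where ...
print(X.nnz)        # 5 observed ratings
# ... but unstored entries read back as 0.0, indistinguishable from a
# true rating of zero unless the estimator adopts a convention.
print(X.toarray())
```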

I don't know how well this lines up with the existing API/functionality and
the future directions there, but how to deal with the missing values is
probably the primary concern for implementing CF algorithms in sklearn IMO.


_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Nick Pentreath
2014-01-15 19:24:16 UTC
Permalink
While I think collaborative filtering / recommendations may have a place in
sklearn, it is true that the problem setting is a little different from
most of the sklearn models.

You may want to take a look at mrec (https://github.com/mendeley/mrec),
where many well-established CF approaches are implemented in an
sklearn-API-friendly manner. The package also provides some parallel
training of models.

Furthermore, if you're looking at very large-scale data, Spark's new Python
bindings to MLlib let you use its efficient cluster-parallel ALS
implementation from Python:
https://github.com/apache/incubator-spark/pull/283


n***@masonlive.gmu.edu
2014-01-16 17:17:49 UTC
Permalink
I agree that sparse matrices need to be supported: one of the main properties inherent to the user/item rating matrix in recommender systems is its sparsity, and that sparsity is what has given rise to such a large body of research in the field. It has to be taken advantage of; otherwise, similarity calculations over full matrices would have complexity through the roof. (There are ways to mitigate this, e.g. item-item CF techniques where the similarity calculation is done offline, but it is nevertheless still expensive.)
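As an illustration of that offline item-item step (a sketch with made-up random data, not existing recsys code), sklearn's cosine_similarity already accepts scipy sparse input directly, so the similarity computation can exploit the sparsity:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# 100 users x 50 items, ~95% sparse synthetic ratings matrix.
X = sparse.random(100, 50, density=0.05, format="csr", random_state=0)

# Item-item similarity compares columns, hence the transpose.
S = cosine_similarity(X.T)  # dense (50, 50) item-item similarity matrix
print(S.shape)
```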

Possible solutions, in my opinion:
1> Support dense and sparse matrices, though I am not sure such an implementation can be plugged directly into sklearn (because of the sparse matrix support.)

2> Distributed recommender systems (just provide the ability for people to distribute their similarity calculations.) This can be done using MRJob, a Hadoop-streaming wrapper for Python. This is also an active field of research, and if you look into it you will find quite a lot of literature on the topic.

3> I am also currently looking into a library called scikit-crab, which was started with a similar plan, though I hear the developers are rewriting it and it may not be open to community development at present (not sure about this). I mention it because looking at the code might give you more ideas about what improvements could be made. https://github.com/muricoca/crab

Kyle Kastner
2014-01-16 18:17:32 UTC
Permalink
@Manoj
The predict stage taking 2 parameters is what I was talking about - are
there any other estimators that need more than a single matrix to make a
prediction? I do not recall any; this would be something particular to CF.
Maybe you could recast it as a matrix with alternating rows of item,rating,
but that is still a CF particularity.

Whether that is OK as far as sklearn's API is concerned is not for me to
decide. I would also expect it to be closely tied to DictVectorizer or
something like it to get categorical labels, probably more so than most
other algorithms (though this is not a big deal IMO).

@nmuralid
I totally agree - the last number I saw was that the typical matrix for
something like Amazon is 99% sparse? I don't remember where I read that,
though. Looking at crab, it seems they are building an sklearn-style API
specifically for collaborative filtering. Not sure where the name crab
comes from, but it is definitely worth looking at.

Kyle


Manoj Kumar
2014-01-16 19:09:54 UTC
Permalink
Kyle Kastner
2014-01-16 22:24:05 UTC
Permalink
Joel Nothman
2014-01-16 22:44:37 UTC
Permalink
Olivier Grisel
2014-01-16 22:51:10 UTC
Permalink
2014/1/16 Joel Nothman <***@gmail.com>:
> There are still issues of whether this is in scikit-learn scope. For
> example, does it make sense with sklearn's cross validation? Or will you
> want to cross validate on both axes? Given that there is plenty of work to
> be done that is well within scikit-learn's scope (prominent alternative
> solutions and utilities for problems it already solves), I think this
> extension of scope needs to be argued.

+1

I would first focus on generic matrix factorization / completion
estimators as unsupervised estimators (using the standard model.fit(X)
API with a scipy sparse X). A real CF system could then leverage such
building blocks, with the domain-specific recsys boilerplate living in
third-party libraries built on top of scikit-learn.
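As a rough sketch of what such a building block looks like with today's API (NMF here is only an approximation of completion, since it treats unstored entries as true zeros rather than as missing values):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import NMF

# Tiny 3-users x 4-items matrix with five observed (non-negative) ratings.
rows, cols = [0, 0, 1, 2, 2], [0, 2, 1, 0, 3]
vals = [5.0, 3.0, 4.0, 1.0, 2.0]
X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 4))

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X)  # user factors, shape (3, 2)
H = model.components_       # item factors, shape (2, 4)
X_hat = W @ H               # dense reconstruction: a score for every cell
print(X_hat.shape)
```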

--
Olivier
Manoj Kumar
2014-01-17 03:23:27 UTC
Permalink
Thanks everyone for your quick responses.

1. Could someone point me to a list of GSoC ideas for this year?
2. Is it okay if I take up projects related to ideas that have not yet
been implemented? For example, a quick search tells me "Improving GMM" has
not been implemented.

Thanks.
Alex Companioni
2014-01-15 19:36:53 UTC
Permalink
Not sure how to handle the data representation (masked arrays make sense),
but you probably want to look into matrix completion
(http://en.wikipedia.org/wiki/Matrix_completion).
In particular, a visitor at Knewton recently discussed his experience
implementing singular value projection
(http://books.nips.cc/papers/files/nips23/NIPS2010_0682.pdf).
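For intuition, here is a toy iterative low-rank completion sketch in the same spirit (hypothetical illustration code, not the SVP algorithm from the paper): fill the missing cells with zeros, then alternate rank-k SVD truncation with re-imposing the observed entries.

```python
import numpy as np

def complete_matrix(X, observed, rank=2, n_iter=50):
    """X: dense ratings array; observed: boolean mask of known entries."""
    filled = np.where(observed, X, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep the known ratings fixed; only the unknown cells move.
        filled = np.where(observed, X, low_rank)
    return low_rank

R = np.array([[5.0, 4.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 2.0]])
mask = R != 0  # here 0 marks "unrated", purely for illustration
print(complete_matrix(R, mask).round(2))
```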


Manoj Kumar
2014-01-16 14:29:20 UTC
Permalink
Thanks for your responses.

@Kyle:
At the risk of sounding really naive, I'd like to make the following
comments. I'm referring to this paper that Sukru had posted,
http://www.stat.osu.edu/~dmsl/Sarwar_2001.pdf , which is item-based
collaborative filtering. I don't think there is really any need to mask
the items that are not selected by the target user (the user whose item
ratings you need to predict) here. I believe it would work for dense
cases too. Let's look at a sample session:

from sklearn.recsys import item_cf  # Tentative (hypothetical) names.

clf = item_cf()  # Arguments like the similarity criterion and the number
                 # of recommendations could be given in __init__.
# Say there are n users who have already rated items:
# X is a ragged 2-D array whose first dimension is n; the second varies
# with the number of items each user has rated.
# y is the ratings they have provided, either binary (+1/-1) or continuous.
clf.fit(X, y)
# After clf.fit(X, y), an attribute clf.items_ would hold the total
# number of items.
clf.predict(x)  # Returns the top n recommendations for x.
# For each item in clf.items_ not already in x, similarity is calculated
# by taking the top k most similar items in x.

For user-based CF, yes, we need to provide a mask for the item whose
rating we need to predict, but I suppose that can be provided in __init__
(can't it)?

@Alex and Nick: Thanks for your references, I'll have a look right now.

However, one point I don't intuitively understand is what clf.transform() /
clf.fit_transform() would do in these cases. Any pointers? As for the
mentor problem, I don't think that would be an issue if the community is
genuinely interested in this project. If I do get a +1, I can start
thinking about the timeline, the algorithms I'd like to implement, etc.
I'm really looking forward to extending my (really minor) scikit-learn
work so far as part of GSoC.
Kyle Kastner
2014-01-16 15:26:00 UTC
Permalink
So X is the array of existing ratings, would y be a 2D array then? If not,
how do you map the ratings given back to a single user (since y is
typically, to my knowledge, 1D in sklearn)?

I am still a little confused, but your example helped. Could you go
into a little more detail on X, x, and y?

Let's say for an example of 5 users, 11 total items. That would make X a
5x11 matrix, right? What about y and x?
Manoj Kumar
2014-01-16 16:29:12 UTC
Permalink
Well, y can be 2-D too; there are estimators like MultiTaskElasticNet
meant especially for multi-task y.

I was thinking something along these lines. Let's say
["ham", "spam", "ram", "bam", "tam"] are the five items,

and the first user gives
"ham" - 2
"spam" - 3

and the second user gives
"ram" - 1
"bam" - -3
"tam" - 4

then I was thinking X = [["ham", "spam"], ["ram", "bam", "tam"]] and y =
[[2, 3], []]
On Thu, Jan 16, 2014 at 8:56 PM, Kyle Kastner <***@gmail.com> wrote:

> So X is the array of existing ratings, would y be a 2D array then? If not,
> how do you map the ratings given back to a single user (since y is
> typically, to my knowledge, 1D in sklearn)?
>
> I am still a little confused, but your example helped. Can you could go
> into a little more detail on X, x, and y?
>
> Let's say for an example of 5 users, 11 total items. That would make X a
> 5x11 matrix, right? What about y and x?


--
Regards,
Manoj Kumar,
Mech Undergrad
http://manojbits.wordpress.com
Manoj Kumar
2014-01-16 16:38:52 UTC
Permalink
I'm extremely sorry, that message got sent halfway through (I pressed
Ctrl + Enter by mistake).

X = [["ham", "spam"], ["ram", "bam", "tam"]], and y = [[2, 3], [1, -3, 4]]

and we do clf.fit(X, y).
Suppose we would like to predict what to recommend to a user x who has
already rated "ram" as 1 and "bam" as -3: we do clf.predict(["ram",
"bam"], [1, -3]) and it would give the output. (Both parameters are
required.)

I do not know, however, what clf.transform() or clf.fit_transform() would
do (as of now); the meat of the computation would be done in clf.predict()
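For what it's worth, the ragged (X, y) pairs above can be collapsed into the single sparse user-item matrix discussed earlier in the thread (a sketch using plain scipy; the column ordering here is simply the sorted item labels):

```python
from scipy import sparse

X = [["ham", "spam"], ["ram", "bam", "tam"]]
y = [[2, 3], [1, -3, 4]]

# Map item labels to column indices.
items = sorted({item for row in X for item in row})
col = {item: j for j, item in enumerate(items)}

rows, cols, vals = [], [], []
for i, (named, rated) in enumerate(zip(X, y)):
    for item, rating in zip(named, rated):
        rows.append(i)
        cols.append(col[item])
        vals.append(rating)

R = sparse.coo_matrix((vals, (rows, cols)), shape=(len(X), len(items))).tocsr()
print(items)
print(R.toarray())
```

With this representation, predict would again need only one argument: a sparse row with the same columns, whose stored entries are the new user's known ratings.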

--
Regards,
Manoj Kumar,
Mech Undergrad
http://manojbits.wordpress.com
1970-01-01 00:00:00 UTC
Permalink
--001a113a5fb6efdbcf04f0180a42
Content-Type: text/plain; charset=ISO-8859-1

So X is the array of existing ratings, would y be a 2D array then? If not,
how do you map the ratings given back to a single user (since y is
typically, to my knowledge, 1D in sklearn)?

I am still a little confused, but your example helped. Can you could go
into a little more detail on X, x, and y?

Let's say for an example of 5 users, 11 total items. That would make X a
5x11 matrix, right? What about y and x?


On Thu, Jan 16, 2014 at 8:29 AM, Manoj Kumar <***@gmail.com
> wrote:

> Thanks for your responses.
>
> @Kyle:
> At the risk of sounding really naive, I'd like to make the following
> comments. I'm referring to this paper that Sukru had posted,
> http://www.stat.osu.edu/~dmsl/Sarwar_2001.pdf which is item based
> collaborative filtering. I don't think there is really any need for masking
> the items that are not selected by the target user (or the user for which
> you need to predict the item rating) here. I believe it would work for
> dense cases too. Lets look at a sample session here.
>
> from sklearn.recsys import item_cf # Tentative names.
> clf = item_cf() # Here arguments like similarity criteria, number of
> recommendations can be given in the __init__
> # Lets say there are n users who have have already rated,
> # X is an 2-D array with the first dimension of n, the second can vary
> according to the number of items they have
> # rated.
> # y is the ratings they have provided. This can be either binary like
> +1 or -1 , or continuous.
> clf.fit(X, y)
> # After doing clf.fit(X, y) , an attribute clf.items_ would return the
> total number of items.
> clf.predict(x) # This will return the top n recommendations of x
> # For each item in clf.items_ provided item is not in x, similarity is
> calculated by taking the top k similar items in x.
>
> For user based CF, yes we need to provide a mask for the item for which we
> need to predict the rating, but I suppose that can be provided in the
> __init__ (can't it)?
>
> @Alex and Nick: Thanks for your references, I'll have a look right now.
>
> However a point I don't intutively understand what clf.transform() /
> clf.fit_transform must be doing in these cases. Any pointers? Considering
> the mentor problem, I don't think that would be a problem if the community
> is genuinely interested in this project. If I do get a +1, I can start
> thinking about the timeline, algorithms I'd like to implement etc. I'm
> really looking forward to extending my really minor scikit-learn work right
> now as part of GSoC.
>
>
>
> ------------------------------------------------------------------------------
> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
> Learn Why More Businesses Are Choosing CenturyLink Cloud For
> Critical Workloads, Development Environments & Everything In Between.
> Get a Quote or Start a Free Trial Today.
>
> http://pubads.g.doubleclick.net/gampad/clk?id9420431&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-***@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>

Manoj Kumar
Permalink
Yes indeed, taking two parameters for predict would be specific to CF.
That was the most obvious idea that came to my mind. I would also like to
hear others' opinions on the API, and on the feasibility of such a
project.
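
[Editor's note: one conventional way to avoid a two-argument predict(), sketched here with a hypothetical baseline estimator rather than any actual sklearn class, is to encode (user, item) pairs as the rows of X, so fit(X, y) and predict(X) keep their usual one-matrix signatures. Every name below is invented for illustration.]

```python
import numpy as np

class RatingBaseline:
    """Toy estimator: X is an (n_pairs, 2) array of [user_id, item_id] rows,
    y the known ratings. Predicts the per-item mean rating, falling back to
    the global mean for unseen items. Purely illustrative."""

    def fit(self, X, y):
        self.global_mean_ = float(np.mean(y))
        buckets = {}
        for (u, i), r in zip(X, y):
            buckets.setdefault(int(i), []).append(r)
        self.item_means_ = {i: float(np.mean(v)) for i, v in buckets.items()}
        return self

    def predict(self, X):
        return np.array([self.item_means_.get(int(i), self.global_mean_)
                         for _, i in X])

X = np.array([[0, 0], [0, 1], [1, 0], [2, 1]])
y = np.array([5.0, 3.0, 4.0, 1.0])
pred = RatingBaseline().fit(X, y).predict(np.array([[3, 0], [3, 9]]))
# item 0 was rated 5.0 and 4.0 -> mean 4.5; item 9 is unseen -> global mean 3.25
```

A real CF estimator would replace the per-item mean with a neighborhood or factorization model; the (user, item)-pairs input format is the part that keeps the API sklearn-shaped.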


On Thu, Jan 16, 2014 at 11:47 PM, Kyle Kastner <***@gmail.com> wrote:

> @Manoj
> The predict stage taking 2 parameters is what I was talking about - are
> there any other estimators that need anything more than a single matrix to
> do a prediction? I do not recall any - this would be something particular
> to CF. Maybe you could recast it as a matrix with alternating rows of
> item,rating, but that is still particular to CF.
>
> Whether that is OK as far as sklearn's API is concerned is not for me to
> decide. I would also expect it to be closely tied to DictVectorizer or
> something like it, probably more so than most other algorithms (though this
> is not a big deal IMO), to get categorical labels.
>
> @nmuralid
> I agree totally - last number I saw was that the typical matrix for
> something like Amazon is 99% sparse? I don't remember where I read it
> though. Looking at crab, it seems like they are trying to do sklearn-style
> API specifically for collaborative filtering. Not sure where the name crab
> comes in, but it is definitely worth looking at.
>
> Kyle
>
>
> On Thu, Jan 16, 2014 at 11:17 AM, ***@masonlive.gmu.edu <
> ***@masonlive.gmu.edu> wrote:
>
>> I agree that sparse matrices need to be supported, as one of the main
>> properties inherent to the user/item rating matrix in recommender systems
>> is its sparsity. This sparsity is what has given rise to such a large body
>> of research in the field. The property has to be exploited: since we are
>> dealing with large matrices, similarity calculations would otherwise have
>> prohibitive complexity (although there are ways to overcome this, such as
>> item-item CF techniques where the similarity calculation is done offline,
>> though even that is still expensive).
>>
>> Possible solutions, in my opinion:
>> 1> Support dense and sparse matrices, though I am not sure such an
>> implementation can be plugged directly into sklearn (because of the sparse
>> matrix support).
>>
>> 2> Distributed recommender systems (just provide the ability for people
>> to distribute their similarity calculations). This can be done with MRJob,
>> a Hadoop-streaming wrapper for Python. This is also a current field of
>> research, and if you look into it you will find quite a lot of literature
>> on the topic.
>>
>> 3> I am also trying to look into a library called scikit-crab, which was
>> started with a similar plan, though I hear the developers are currently
>> rewriting it and it may not be open to community development at present
>> (not sure about this). I mention it because a look at the code may give
>> you more ideas about what improvements could be made.
>> https://github.com/muricoca/crab


--
Regards,
Manoj Kumar,
Mech Undergrad
http://manojbits.wordpress.com
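
[Editor's note: the item-based neighborhood scheme from the Sarwar et al. paper linked above can be sketched in plain NumPy. This is only a toy under the assumption that zeros mark missing ratings; all names are made up and there is no sparse-matrix support.]

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two item rating columns; 0.0 if either is all-zero.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b) / (na * nb)

def predict_rating(R, user, item, k=2):
    """Predict R[user, item] (assumed unrated) from the user's k most
    similar rated items, weighting the user's own ratings by similarity."""
    rated = np.flatnonzero(R[user])                      # items this user rated
    sims = np.array([cosine_sim(R[:, item], R[:, j]) for j in rated])
    top = sims.argsort()[::-1][:k]                       # k nearest rated items
    if sims[top].sum() == 0:
        return 0.0
    return float(sims[top] @ R[user, rated[top]] / sims[top].sum())

R = np.array([[5, 3, 0, 1],     # rows: users, columns: items, 0 = missing
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
print(round(predict_rating(R, user=0, item=2, k=2), 2))  # -> 1.73
```

The offline step mentioned above would precompute the full item-item similarity matrix once, so prediction reduces to the weighted average.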

Kyle Kastner
Permalink
The other thing to keep in mind: an ideal solution would be compatible with
Pipeline() - it would be nice to be able to use it there, which is one of
the reasons a different signature for the predict() method is an issue.

Hopefully something can be figured out, as there is a lot of interest in CF
algorithms, and a large majority of the algorithmic work (at least for the
CF algorithm I looked at) is already present in the NMF code.
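
[Editor's note: to make the NMF connection concrete, here is a minimal multiplicative-update factorization (Lee-Seung style) of a toy ratings matrix; the reconstruction W @ H yields values at the unrated entries. All names and data are illustrative, and note the caveat raised earlier in the thread: this vanilla form treats missing entries as literal zeros, so a proper CF variant would restrict the loss to observed entries, e.g. with a mask.]

```python
import numpy as np

def nmf(R, k=2, iters=500, seed=0):
    """Factor a non-negative matrix R (n x m) as W @ H with W: n x k and
    H: k x m, using the classic multiplicative updates for squared error."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    eps = 1e-9                     # avoid division by zero
    for _ in range(iters):
        H *= (W.T @ R) / (W.T @ W @ H + eps)
        W *= (R @ H.T) / (W @ H @ H.T + eps)
    return W, H

R = np.array([[5, 3, 0, 1],      # 0 = unrated (here, wrongly, fit as a real 0)
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
W, H = nmf(R, k=2)
pred = W @ H                     # dense reconstruction, non-negative everywhere
```

Masking the updates to observed entries (weighted NMF, or an ALS-style factorization) is exactly where this stops being plain sklearn NMF and becomes a CF algorithm.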


On Thu, Jan 16, 2014 at 1:09 PM, Manoj Kumar <***@gmail.com> wrote:

> Yes indeed, taking two parameters for predict would be specific to CF.
> That was the most obvious idea that came to my mind. I would also like to
> hear others' opinions on the API, and on the feasibility of such a
> project.

<b>Sent:</b> Wednesday, January 15, 2014 1:42 PM<br>
<b>To:</b> <a href="mailto:scikit-learn-***@lists.sourceforge.net" target="_blank">scikit-learn-***@lists.sourceforge.net</a><br>
<b>Subject:</b> Re: [Scikit-learn-general] Google Summer of Code 2014<br>
</font><br>
</div><div><div>
<div></div>
<div>
<div dir="ltr">
<div>
<div>I looked into this once upon a time, and one of the key problems (from talking to Jake IIRC) is how to handle the &quot;missing values&quot; in the input array. You would either need a mask, or some kind of indexing system for describing which value goes where in
the input matrix. Either way, this extra argument would be a requirement for CF, but not for the existing algorithms in sklearn.<br>
<br>
</div>
Maybe it would only operate on sparse arrays, and infer that the values which are missing are the ones to be imputed (&quot;recommended&quot;)? But not supporting dense arrays would basically be the opposite of other modules in sklearn, where dense input is default.
Maybe someone can comment on this?<br>
<br>
</div>
I don&#39;t know how well this lines up with the existing API/functionality and the future directions there, but how to deal with the missing values is probably the primary concern for implementing CF algorithms in sklearn IMO.<br>



</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Wed, Jan 15, 2014 at 12:07 PM, Manoj Kumar <span dir="ltr">
&lt;<a href="mailto:***@gmail.com" target="_blank">***@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>
<div>Hello,<br>
<br>
</div>
First of all, thanks to the scikit-learn community for guiding new developers. I&#39;m thankful for all the help that I&#39;ve got with my Pull Requests till now.<br>
<br>
</div>
I hope that this is the right place to discuss GSoC related ideas (I&#39;ve idled at the scikit-learn irc channel for quite a few occasions, but I could not meet any core developer). I was browsing through the threads of last year, when I found this idea related
to collaborative filtering (CF) quite interesting, <a href="http://sourceforge.net/mailarchive/message.php?msg_id=30725712" target="_blank">
http://sourceforge.net/mailarchive/message.php?msg_id=30725712</a> , though this was sadly not accepted.<br>
<br>
</div>
If the scikit-learn community is still enthusiastic about a recsys module with CF algorithms implemented, I would love this to be my GSoC proposal and we could discuss more about the algorithms, gelling with the present sklearn API, how much we could possibly
fit in a 3 month period etc.<br>
<br>
</div>
Awaiting a reply.<span><font color="#888888"><br clear="all">
<div>
<div>
<div>
<div>
<div><br>
-- <br>
<div dir="ltr">Regards,<br>
Manoj Kumar,<br>
Mech Undergrad<br>
<a href="http://manojbits.wordpress.com" target="_blank">http://manojbits.wordpress.com</a><br>
</div>
</div>
</div>
</div>
</div>
</div>
</font></span></div>
<br>
------------------------------------------------------------------------------<br>
CenturyLink Cloud: The Leader in Enterprise Cloud Services.<br>
Learn Why More Businesses Are Choosing CenturyLink Cloud For<br>
Critical Workloads, Development Environments &amp; Everything In Between.<br>
Get a Quote or Start a Free Trial Today.<br>
<a href="http://pubads.g.doubleclick.net/gampad/clk?id=119420431&amp;iu=/4140/ostg.clktrk" target="_blank">http://pubads.g.doubleclick.net/gampad/clk?id=119420431&amp;iu=/4140/ostg.clktrk</a><br>
_______________________________________________<br>
Scikit-learn-general mailing list<br>
<a href="mailto:Scikit-learn-***@lists.sourceforge.net" target="_blank">Scikit-learn-***@lists.sourceforge.net</a><br>
<a href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general" target="_blank">https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</a><br>
<br>
</blockquote>
</div>
<br>
</div>
</div>
</div></div></div>
</div>
</div>
</div>

<br>------------------------------------------------------------------------------<br>
CenturyLink Cloud: The Leader in Enterprise Cloud Services.<br>
Learn Why More Businesses Are Choosing CenturyLink Cloud For<br>
Critical Workloads, Development Environments &amp; Everything In Between.<br>
Get a Quote or Start a Free Trial Today.<br>
<a href="http://pubads.g.doubleclick.net/gampad/clk?id=119420431&amp;iu=/4140/ostg.clktrk" target="_blank">http://pubads.g.doubleclick.net/gampad/clk?id=119420431&amp;iu=/4140/ostg.clktrk</a><br>_______________________________________________<br>



Scikit-learn-general mailing list<br>
<a href="mailto:Scikit-learn-***@lists.sourceforge.net" target="_blank">Scikit-learn-***@lists.sourceforge.net</a><br>
<a href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general" target="_blank">https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</a><br>
<br></blockquote></div><br></div>
</div></div><br>------------------------------------------------------------------------------<br>
CenturyLink Cloud: The Leader in Enterprise Cloud Services.<br>
Learn Why More Businesses Are Choosing CenturyLink Cloud For<br>
Critical Workloads, Development Environments &amp; Everything In Between.<br>
Get a Quote or Start a Free Trial Today.<br>
<a href="http://pubads.g.doubleclick.net/gampad/clk?id=119420431&amp;iu=/4140/ostg.clktrk" target="_blank">http://pubads.g.doubleclick.net/gampad/clk?id=119420431&amp;iu=/4140/ostg.clktrk</a><br>_______________________________________________<br>


Scikit-learn-general mailing list<br>
<a href="mailto:Scikit-learn-***@lists.sourceforge.net" target="_blank">Scikit-learn-***@lists.sourceforge.net</a><br>
<a href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general" target="_blank">https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</a><br>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div dir="ltr">Regards,<br>Manoj Kumar,<br>Mech Undergrad<br><a href="http://manojbits.wordpress.com" target="_blank">http://manojbits.wordpress.com</a><br>

</div>
</div>
</div></div><br>------------------------------------------------------------------------------<br>
CenturyLink Cloud: The Leader in Enterprise Cloud Services.<br>
Learn Why More Businesses Are Choosing CenturyLink Cloud For<br>
Critical Workloads, Development Environments &amp; Everything In Between.<br>
Get a Quote or Start a Free Trial Today.<br>
<a href="http://pubads.g.doubleclick.net/gampad/clk?id=119420431&amp;iu=/4140/ostg.clktrk" target="_blank">http://pubads.g.doubleclick.net/gampad/clk?id=119420431&amp;iu=/4140/ostg.clktrk</a><br>_______________________________________________<br>

Scikit-learn-general mailing list<br>
<a href="mailto:Scikit-learn-***@lists.sourceforge.net">Scikit-learn-***@lists.sourceforge.net</a><br>
<a href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general" target="_blank">https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</a><br>
<br></blockquote></div><br></div>

--001a11c1303c252f8a04f01de230--
1970-01-01 00:00:00 UTC
Permalink

`y` is by definition hidden at prediction time for supervised learning, so
I don't think your representation makes sense. But I see this as a
completion problem, not a supervised learning problem: the same data is
observed at training and predict time.

Isn't the following:
X = [["ham", "spam"], ["ram", "bam", "tam"]], and y = [[2, 3], [2, -3, 4]]

equivalent to [{'ham': 2, 'spam': 3}, {'ram': 2, 'bam': -3, 'tam': 4}]?

Via DictVectorizer, this becomes equivalent to a sparse COO matrix with:
col = [0, 1, 2, 3, 4]
row = [0, 0, 1, 1, 1]
data = [2, 3, 2, -3, 4]

As far as I can tell, this is a plain old sparse matrix, without a need for
an extra `y`. (Please convince me otherwise!)
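
To make the equivalence concrete, here is a minimal sketch that builds the
COO matrix by hand from the dicts above, using only SciPy. (DictVectorizer
performs essentially this vectorization, though by default it orders the
columns alphabetically rather than by first appearance.)

```python
from scipy.sparse import coo_matrix

# Ratings as a list of {item: rating} dicts, one row per user --
# the input format DictVectorizer would consume.
ratings = [{'ham': 2, 'spam': 3}, {'ram': 2, 'bam': -3, 'tam': 4}]

# Assign each item a column index in order of first appearance,
# collecting the (row, col, data) triplets shown above.
vocab = {}
row, col, data = [], [], []
for i, user in enumerate(ratings):
    for item, rating in user.items():
        col.append(vocab.setdefault(item, len(vocab)))
        row.append(i)
        data.append(rating)

X = coo_matrix((data, (row, col)), shape=(len(ratings), len(vocab)))
print(row, col, data)  # [0, 0, 1, 1, 1] [0, 1, 2, 3, 4] [2, 3, 2, -3, 4]
print(X.toarray())
```

The unstored (zero) entries of `X` are exactly the unobserved (user, item)
pairs a CF model would impute.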

There are still issues of whether this is in scikit-learn scope. For
example, does it make sense with sklearn's cross validation? Or will you
want to cross validate on both axes? Given that there is plenty of work to
be done that is well within scikit-learn's scope (prominent alternative
solutions and utilities for problems it already solves), I think this
extension of scope needs to be argued.
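
To illustrate why the usual row-wise cross validation is an awkward fit
here, a toy sketch of entry-wise evaluation (the matrix `R`, the 20% split,
and the mean-rating baseline are all hypothetical; only NumPy is assumed):
the held-out data are individual observed *entries*, not whole rows, and a
predictor is scored on those hidden entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense ratings matrix; 0 marks an unobserved (user, item) pair.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

observed = np.argwhere(R > 0)    # (row, col) indices of known ratings
rng.shuffle(observed)
n_test = len(observed) // 5      # hold out ~20% of the *entries*
test, train = observed[:n_test], observed[n_test:]

R_train = R.copy()
R_train[tuple(test.T)] = 0       # hide the held-out ratings

# Trivial baseline "recommender": predict every rating with the mean
# of the ratings still visible in the training matrix.
pred = R_train[R_train > 0].mean()
rmse = np.sqrt(np.mean((R[tuple(test.T)] - pred) ** 2))
print(f"held out {n_test} of {len(observed)} ratings, RMSE = {rmse:.2f}")
```

A real CF model would replace the mean baseline; the point is that the
split logic does not map onto sklearn's row-oriented CV iterators.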


On 17 January 2014 09:24, Kyle Kastner <***@gmail.com> wrote:

> The other thing to keep in mind is that an ideal solution would be
> compatible with Pipeline() - it would be nice to be able to use it there,
> which is one of the reasons a different signature for the predict() method
> is an issue.
>
> Hopefully something can be figured out, as there is a lot of interest in CF
> algorithms, and a large majority of the algorithmic work (at least for the
> CF algorithm I looked at) is already present in the NMF code.
>
>
> On Thu, Jan 16, 2014 at 1:09 PM, Manoj Kumar <
> ***@gmail.com> wrote:
>
>> Yes indeed, getting two parameters for predict would be specific to CF.
>> That was the most obvious idea that came to my mind. I would like to hear
>> other's opinions also regarding the API, and the feasibility of such a
>> project.
>>
>>
>> On Thu, Jan 16, 2014 at 11:47 PM, Kyle Kastner <***@gmail.com>wrote:
>>
>>> @Manoj
>>> The predict stage taking 2 parameters is what I was talking about - are
>>> there any other estimators that need anything more than a single matrix
>>> to do a prediction? I do not recall any - this would be something
>>> particular to CF. Maybe you could recast it as a matrix with alternating
>>> rows of item,rating, but that is still particular to CF.
>>>
>>> Whether that is OK as far as sklearn's API is concerned is not for me to
>>> decide. I would also expect it to be closely tied with DictVectorizer or
>>> something like it, probably more so than most other algorithms (though this
>>> is not a big deal IMO) to get categorical labels.
>>>
>>> @nmuralid
>>> I agree totally - last number I saw was that the typical matrix for
>>> something like Amazon is 99% sparse? I don't remember where I read it
>>> though. Looking at crab, it seems like they are trying to do sklearn-style
>>> API specifically for collaborative filtering. Not sure where the name crab
>>> comes in, but it is definitely worth looking at.
>>>
>>> Kyle
>>>
>>>
>>> On Thu, Jan 16, 2014 at 11:17 AM, ***@masonlive.gmu.edu <
>>> ***@masonlive.gmu.edu> wrote:
>>>
>>>> I agree that sparse matrices need to be supported, as one of the main
>>>> properties inherent to the user/item rating matrix in recommender
>>>> systems is its sparsity. This sparsity is what has given rise to such a
>>>> large body of research in the field. This property would have to be
>>>> taken advantage of, because otherwise similarity calculations over the
>>>> full matrices would have complexity through the roof (there are ways to
>>>> overcome this with item-item CF techniques, where the similarity
>>>> calculation is done offline, but that is still expensive).
>>>>
>>>> Possible solutions, in my opinion:
>>>> 1> Support both dense and sparse matrices, though I am not sure such an
>>>> implementation can be directly plugged into sklearn (because of the
>>>> sparse matrix support).
>>>>
>>>> 2> Distributed recommender systems (just provide the ability for
>>>> people to distribute their similarity calculations). This can be done
>>>> using MRJob, a Hadoop-streaming wrapper for Python. This is also a
>>>> current field of research, and if you look into it you will find quite
>>>> a lot of literature on the topic.
>>>>
>>>> 3> I am also currently looking into a library called scikit-crab,
>>>> which was started with a similar plan, but I heard the developers are
>>>> rewriting it and it might not be open to the community for active
>>>> development at present (not sure about this, though). I mention it
>>>> because if you took a look at the code, you might get some more ideas
>>>> about what improvements could be made.
>>>> https://github.com/muricoca/crab
>>>>
>>>> ------------------------------
>>>> *From:* Kyle Kastner [***@gmail.com]
>>>> *Sent:* Wednesday, January 15, 2014 1:42 PM
>>>> *To:* scikit-learn-***@lists.sourceforge.net
>>>> *Subject:* Re: [Scikit-learn-general] Google Summer of Code 2014
>>>>
>>>> I looked into this once upon a time, and one of the key problems
>>>> (from talking to Jake IIRC) is how to handle the "missing values" in the
>>>> input array. You would either need a mask, or some kind of indexing system
>>>> for describing which value goes where in the input matrix. Either way, this
>>>> extra argument would be a requirement for CF, but not for the existing
>>>> algorithms in sklearn.
>>>>
>>>> Maybe it would only operate on sparse arrays, and infer that the
>>>> values which are missing are the ones to be imputed ("recommended")? But
>>>> not supporting dense arrays would basically be the opposite of other
>>>> modules in sklearn, where dense input is default. Maybe someone can comment
>>>> on this?
>>>>
>>>> I don't know how well this lines up with the existing
>>>> API/functionality and the future directions there, but how to deal with the
>>>> missing values is probably the primary concern for implementing CF
>>>> algorithms in sklearn IMO.
>>>>
>>>>
>>>> On Wed, Jan 15, 2014 at 12:07 PM, Manoj Kumar <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> First of all, thanks to the scikit-learn community for guiding new
>>>>> developers. I'm thankful for all the help that I've got with my Pull
>>>>> Requests till now.
>>>>>
>>>>> I hope that this is the right place to discuss GSoC related ideas
>>>>> (I've idled at the scikit-learn irc channel for quite a few occasions, but
>>>>> I could not meet any core developer). I was browsing through the threads of
>>>>> last year, when I found this idea related to collaborative filtering (CF)
>>>>> quite interesting,
>>>>> http://sourceforge.net/mailarchive/message.php?msg_id=30725712 ,
>>>>> though this was sadly not accepted.
>>>>>
>>>>> If the scikit-learn community is still enthusiastic about a recsys
>>>>> module with CF algorithms implemented, I would love this to be my GSoC
>>>>> proposal and we could discuss more about the algorithms, gelling with the
>>>>> present sklearn API, how much we could possibly fit in a 3 month period etc.
>>>>>
>>>>> Awaiting a reply.
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Manoj Kumar,
>>>>> Mech Undergrad
>>>>> http://manojbits.wordpress.com
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
>>>>> Learn Why More Businesses Are Choosing CenturyLink Cloud For
>>>>> Critical Workloads, Development Environments & Everything In Between.
>>>>> Get a Quote or Start a Free Trial Today.
>>>>>
>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> Scikit-learn-***@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>> Manoj Kumar,
>> Mech Undergrad
>> http://manojbits.wordpress.com
>>
>>
>>
>>
>
>
>
>


Scikit-learn-general mailing list<br>
<a href="mailto:Scikit-learn-***@lists.sourceforge.net">Scikit-learn-***@lists.sourceforge.net</a><br>
<a href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general" target="_blank">https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</a><br>
<br></blockquote></div><br></div>

--001a11c1c6ea8b6a8604f01e2b69--
Gael Varoquaux
2014-01-19 17:49:53 UTC
Permalink
Hi Manoj,

Thanks a lot for your contributions to scikit-learn, and for stepping up
to propose a GSOC. Let me give some high-level answers, as I am now too
busy to get in the details, and we have a fantastic team that does it
very well.

As you have seen from the answers that you got to your email, matrix
factorization and recommender systems can form the basis of a valid
proposal. Other developers have given great feedback on it, and I don't
need to add anything to what has been said.

In terms of setting up a GSOC proposal, a few pieces of advice for you or
any student interested (this is very general, do not take it as something
that specifically applies to you):

* Keep in mind that scikit-learn receives a lot of solicitations, so
  your application should be of high quality, you should be highly
  motivated, and you need to have had patches accepted before the start
  of the GSOC, as the rules of the PSF state.

* Once again, I'd like to remind everyone that the GSOC requires **full-time**
  involvement from the student. Consider it a full-time job. We have
  more than once had problems with students not committing enough to
  their project. We will fail students who are unable to commit enough.

* The success of your project will depend on you, but also on your ability
  to create a tight link with your mentor, your backup mentor, and the
  other developers. I think that an indicator of such a link is whether you
  are able to get a lot of involvement and feedback on your project
  proposal and on your original pull requests. If people are excited
  about your work, it means that they believe in it. Try to engage core
  developers, but do not hassle them. We are all very busy, and
  scikit-learn is something that most of us do on top of other duties,
  let alone supervising a GSOC student.

I have created a wiki page for the 2014 GSOC proposals: It should get
updated as we get more and more organized for this year's GSOC
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2014
I don't know who will be able to pick up the organization of the GSOC
this year; it will depend on the availability of the core devs.

With that, Manoj, my personal advice to you is to choose a project about
which you are enthusiastic, and for which you get good feedback. Working
early with the team is an indication of success, and so is a history of
pull requests.

Cheers,

Gaël

On Wed, Jan 15, 2014 at 11:37:25PM +0530, Manoj Kumar wrote:
> Hello,

> First of all, thanks to the scikit-learn community for guiding new developers.
> I'm thankful for all the help that I've got with my Pull Requests till now.

> I hope that this is the right place to discuss GSoC related ideas (I've idled
> at the scikit-learn irc channel for quite a few occasions, but I could not meet
> any core developer). I was browsing through the threads of last year, when I
> found this idea related to collaborative filtering (CF) quite interesting,
> http://sourceforge.net/mailarchive/message.php?msg_id=30725712 , though this
> was sadly not accepted.

> If the scikit-learn community is still enthusiastic about a recsys module with
> CF algorithms implemented, I would love this to be my GSoC proposal and we
> could discuss more about the algorithms, gelling with the present sklearn API,
> how much we could possibly fit in a 3 month period etc.

> Awaiting a reply.
--
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
Manoj Kumar
2014-01-19 18:16:26 UTC
Permalink
Hi Gael,

Thanks for the reply. I had posted on the list about the Gaussian Mixture
Model project here too:
http://sourceforge.net/mailarchive/message.php?msg_id=31860906 (your
name was listed as a potential mentor). I understand that you are
incredibly busy, but it would be great if you or other core developers
could spend a few minutes replying on the other thread.

Thanks.
Şükrü Bezen
2014-01-20 00:55:04 UTC
Permalink
First of all, hi everyone,

As Manoj mentioned, last year I applied with my collaborative filtering
idea and it was not accepted, mainly because I did not commit to the project.

This year I will apply again, and I have a few project ideas (I won't be
avoiding the commits this time). I am writing my thesis on hybrid
recommendation systems and the ideas I mentioned will be related to it,
but I will discuss them on another thread.

Thanks Gael for the informative mail.


On Sun, Jan 19, 2014 at 8:16 PM, Manoj Kumar <***@gmail.com
> wrote:

> Hi Gael,
>
> Thanks for the reply. I had posted on the list about the Gaussian Mixture
> Model project over here
> http://sourceforge.net/mailarchive/message.php?msg_id=31860906 too. (Your
> name was listed as a potential mentor), . I understand that you are
> incredibly busy, but it would be great if you or other core developers
> could spend a few minutes of time replying on the other thread.
>
> Thanks.


--
--------------------------------------------------
Şükrü BEZEN
Mathieu Blondel
2014-01-20 09:41:51 UTC
Permalink
On Mon, Jan 20, 2014 at 2:49 AM, Gael Varoquaux <
***@normalesup.org> wrote:

>
> In terms of setting up a GSOC proposal, a few pieces of advice for you or
> any student interested (this is very general, do not take it as something
> that specifically applies to you):
>
> * Keep in mind that scikit-learn receives a lot of solicitations, so
>   your application should be of high quality, you should be highly
>   motivated, and you need to have had patches accepted before the start
>   of the GSOC, as the rules of the PSF state.
>
> * Once again, I'd like to remind everyone that the GSOC requires **full-time**
>   involvement from the student. Consider it a full-time job. We have
>   more than once had problems with students not committing enough to
>   their project. We will fail students who are unable to commit enough.
>
> * The success of your project will depend on you, but also on your ability
>   to create a tight link with your mentor, your backup mentor, and the
>   other developers. I think that an indicator of such a link is whether you
>   are able to get a lot of involvement and feedback on your project
>   proposal and on your original pull requests. If people are excited
>   about your work, it means that they believe in it. Try to engage core
>   developers, but do not hassle them. We are all very busy, and
>   scikit-learn is something that most of us do on top of other duties,
>   let alone supervising a GSOC student.
>

I would like to add that being familiar with the subject of the GSOC is a
key factor for success. Too often, we have found that past students lacked
sufficient understanding of the subject they picked. Likewise, a GSOC
project is not likely to succeed (or even be selected) if there's no core
developer with expertise in the domain. So, don't just pick a project from
a list; pick a project that you are familiar with, will enjoy working on
during the summer, and for which there is a good match with a core
developer. If you're not familiar with a subject, you should be willing to
start reading the literature *before* the summer starts.

Mathieu
Mathieu Blondel
2014-01-28 08:59:31 UTC
Permalink
If we have a suitable mentor for it, locality-sensitive hashing (LSH) would
be a great GSOC subject:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing

Mathieu
Nick Pentreath
2014-01-28 09:04:36 UTC
Permalink
This would be a great addition.

Some ideas /code perhaps: http://nearpy.io/


On Tue, Jan 28, 2014 at 10:59 AM, Mathieu Blondel <***@mblondel.org> wrote:

> If we have a suitable mentor for it, locality-sensitive hashing (LSH)
> would be a great GSOC subject:
> http://en.wikipedia.org/wiki/Locality-sensitive_hashing
>
> Mathieu
Vlad Niculae
2014-01-28 09:16:52 UTC
Permalink
I like the locality-sensitive hashing idea!

Vlad

On Tue Jan 28 10:04:36 2014, Nick Pentreath wrote:
> This would be a great addition.
>
> Some ideas /code perhaps: http://nearpy.io/
>
>
> On Tue, Jan 28, 2014 at 10:59 AM, Mathieu Blondel
> <***@mblondel.org <mailto:***@mblondel.org>> wrote:
>
> If we have a suitable mentor for it, locality-sensitive hashing
> (LSH) would be a great GSOC subject:
> http://en.wikipedia.org/wiki/Locality-sensitive_hashing
>
> Mathieu
Alexandre Gramfort
2014-01-28 09:23:43 UTC
Permalink
> I like the locality-sensitive hashing idea!

+1

we need to cleanup the GSOC idea wiki page...

Alex
Robert Layton
2014-01-28 09:48:37 UTC
Permalink
In principle, I'm happy to be a mentor for LSH, as I've used it quite a bit:
I've implemented Nilsimsa in Python and JavaScript, and tested a number of
other algorithms.
I don't know much about GSOC though. What would I need to do?


On 28 January 2014 20:23, Alexandre Gramfort <
***@telecom-paristech.fr> wrote:

> > I like the locality-sensitive hashing idea!
>
> +1
>
> we need to cleanup the GSOC idea wiki page...
>
> Alex
Joel Nothman
2014-01-28 09:59:51 UTC
Permalink
I have previously seen that there is interest in LSH in scikit-learn, but
don't know much about its application to machine learning. Is it basically
used for nearest neighbour methods?


On 28 January 2014 20:48, Robert Layton <***@gmail.com> wrote:

> In principle, I'm happy to be a mentor for LSH, as I've used it quite a
> bit and implemented nilsimsa in python and javascript, as well as tested a
> number of other algorithms.
> I don't know much about GSOC though. What would I need to do?
>
>
> On 28 January 2014 20:23, Alexandre Gramfort <
> ***@telecom-paristech.fr> wrote:
>
>> > I like the locality-sensitive hashing idea!
>>
>> +1
>>
>> we need to cleanup the GSOC idea wiki page...
>>
>> Alex
Robert Layton
2014-01-28 10:28:21 UTC
Permalink
Yes, but it doesn't suffer so much at high dimensions, as compared to
something like the Euclidean distance.


On 28 January 2014 20:59, Joel Nothman <***@gmail.com> wrote:

> I have previously seen that there is interest in LSH in scikit-learn, but
> don't know much about its application to machine learning. Is it basically
> used for nearest neighbour methods?
>
>
> On 28 January 2014 20:48, Robert Layton <***@gmail.com> wrote:
>
>> In principle, I'm happy to be a mentor for LSH, as I've used it quite a
>> bit and implemented nilsimsa in python and javascript, as well as tested a
>> number of other algorithms.
>> I don't know much about GSOC though. What would I need to do?
>>
>>
>> On 28 January 2014 20:23, Alexandre Gramfort <
>> ***@telecom-paristech.fr> wrote:
>>
>>> > I like the locality-sensitive hashing idea!
>>>
>>> +1
>>>
>>> we need to cleanup the GSOC idea wiki page...
>>>
>>> Alex
Nick Pentreath
2014-01-28 10:39:32 UTC
Permalink
Another important and related use case is to reduce the search space. For
example, in recommendation systems one often has to compute the dot product,
or cosine similarity, between two vectors of moderate dimension, but do
this in real time across potentially millions of candidate items. In
this case the search space can be reduced to the candidate vectors that
are estimated to be "nearest" to the user ("query") vector.
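This candidate-pruning idea is essentially what random-hyperplane
("SimHash"-style) LSH gives you for cosine similarity. A rough pure-Python
sketch, with all names purely illustrative:

```python
import random

def signature(v, hyperplanes):
    # One bit per hyperplane: which side of the hyperplane v falls on.
    # Vectors with a small angle between them tend to share many bits.
    return tuple(int(sum(h_i * v_i for h_i, v_i in zip(h, v)) >= 0)
                 for h in hyperplanes)

def build_index(items, n_planes, dim, seed=0):
    # Bucket every item by its bit signature.
    rng = random.Random(seed)
    hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)]
                   for _ in range(n_planes)]
    buckets = {}
    for idx, v in enumerate(items):
        buckets.setdefault(signature(v, hyperplanes), []).append(idx)
    return hyperplanes, buckets

def candidates(query, hyperplanes, buckets):
    # Only items sharing the query's signature get scored exactly,
    # instead of computing cosine similarity against every item.
    return buckets.get(signature(query, hyperplanes), [])
```

With 8 hyperplanes the index has up to 256 buckets, so on average only a
small fraction of the catalogue is scored exactly; in practice one would
build several such tables with different seeds and merge their candidate
sets to keep recall up.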


On Tue, Jan 28, 2014 at 12:28 PM, Robert Layton <***@gmail.com> wrote:

> Yes, but it doesn't suffer so much at high dimensions, as compared to
> something like the Euclidean distance.
>
>
> On 28 January 2014 20:59, Joel Nothman <***@gmail.com> wrote:
>
>> I have previously seen that there is interest in LSH in scikit-learn, but
>> don't know much about its application to machine learning. Is it basically
>> used for nearest neighbour methods?
>>
>>
>> On 28 January 2014 20:48, Robert Layton <***@gmail.com> wrote:
>>
>>> In principle, I'm happy to be a mentor for LSH, as I've used it quite a
>>> bit and implemented nilsimsa in python and javascript, as well as tested a
>>> number of other algorithms.
>>> I don't know much about GSOC though. What would I need to do?
>>>
>>>
>>> On 28 January 2014 20:23, Alexandre Gramfort <
>>> ***@telecom-paristech.fr> wrote:
>>>
>>>> > I like the locality-sensitive hashing idea!
>>>>
>>>> +1
>>>>
>>>> we need to cleanup the GSOC idea wiki page...
>>>>
>>>> Alex
Arnaud Joly
2014-01-28 10:43:57 UTC
Permalink
You can also reduce the dimensionality using random projections.
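A minimal pure-Python sketch of the random projection idea (scikit-learn's
random_projection module provides Gaussian and sparse variants of this; the
names below are just illustrative):

```python
import random

def random_projection_matrix(d_in, d_out, seed=0):
    # Gaussian random matrix, scaled so that squared norms are
    # preserved in expectation (Johnson-Lindenstrauss style).
    rng = random.Random(seed)
    scale = (1.0 / d_out) ** 0.5
    return [[rng.gauss(0, 1) * scale for _ in range(d_in)]
            for _ in range(d_out)]

def project(v, R):
    # Linear map from d_in dimensions down to d_out dimensions.
    return [sum(r_i * v_i for r_i, v_i in zip(row, v)) for row in R]
```

Pairwise distances after projection stay close to the original distances
with high probability, which is what makes this usable as a cheap
preprocessing step before (approximate) nearest-neighbour search.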

Arnaud


On 28 Jan 2014, at 11:39, Nick Pentreath <***@gmail.com> wrote:

> Another important and related use case is to reduce the search space, for example, in recommendation systems one often has to do the dot product, or cosine similarity, between two vectors of moderate dimension. But you have to do this in real-time across potentially millions of candidate items. In this case the search space can be reduced to those candidate vectors that are estimated to be "nearest" to the user ("query") vector.
>
>
> On Tue, Jan 28, 2014 at 12:28 PM, Robert Layton <***@gmail.com> wrote:
> Yes, but it doesn't suffer so much at high dimensions, as compared to something like the Euclidean distance.
>
>
> On 28 January 2014 20:59, Joel Nothman <***@gmail.com> wrote:
> I have previously seen that there is interest in LSH in scikit-learn, but don't know much about its application to machine learning. Is it basically used for nearest neighbour methods?
>
>
> On 28 January 2014 20:48, Robert Layton <***@gmail.com> wrote:
> In principle, I'm happy to be a mentor for LSH, as I've used it quite a bit and implemented nilsimsa in python and javascript, as well as tested a number of other algorithms.
> I don't know much about GSOC though. What would I need to do?
>
>
> On 28 January 2014 20:23, Alexandre Gramfort <***@telecom-paristech.fr> wrote:
> > I like the locality-sensitive hashing idea!
>
> +1
>
> we need to cleanup the GSOC idea wiki page...
>
> Alex
Olivier Grisel
2014-01-28 12:25:41 UTC
Permalink
While vanilla LSH is an interesting baseline for Approximate Nearest
Neighbors (ANN) search, it is often too error-prone to be practically
useful. There exist alternative data-driven ANN methods that can have
much lower error rates (depending on the data). Among the top
implementations are FLANN [1] and Spotify's Annoy [2]. Both are
written in C++ with Python bindings; there is a benchmark by Radim here:
http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/

It would be interesting to implement the baseline vanilla LSH (either
with random projections or min-hash) and / or Cython versions of the
random projection forests from Annoy and / or the Hierarchical k-means
trees and randomized K-D trees from FLANN.

All approaches could probably be implemented quite efficiently by
reusing existing functions and classes from other sklearn modules. For
instance, building 100 FLANN-style randomized kd-trees would look very
similar to:

ExtraTreesRegressor(n_estimators=100, max_features=1).fit(data,
np.zeros(data.shape[0]))

max_features=1 makes the extra trees fully random. Then we would use
the apply method to compute the hashing itself. Implementing a
fit_apply method in the sklearn decision trees and forests would render
initial index building even more efficient.
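To make the tree part concrete, here is a single Annoy-style random
projection tree in miniature, in plain Python. It splits at the median
projection rather than at a hyperplane between two sampled points as Annoy
does, and every name is illustrative, but the structure is the same: fully
random splits, with leaves acting as hash buckets:

```python
import random

def build_rp_tree(points, indices=None, leaf_size=10, rng=None):
    # Recursively split the point set by a random direction; each
    # leaf then acts as one hash bucket of candidate neighbours.
    if rng is None:
        rng = random.Random(0)
    if indices is None:
        indices = list(range(len(points)))
    if len(indices) <= leaf_size:
        return ("leaf", indices)
    dim = len(points[0])
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    proj = {i: sum(d * x for d, x in zip(direction, points[i]))
            for i in indices}
    threshold = sorted(proj.values())[len(indices) // 2]
    left = [i for i in indices if proj[i] < threshold]
    right = [i for i in indices if proj[i] >= threshold]
    if not left or not right:  # degenerate split (ties): stop here
        return ("leaf", indices)
    return ("node", direction, threshold,
            build_rp_tree(points, left, leaf_size, rng),
            build_rp_tree(points, right, leaf_size, rng))

def query_leaf(tree, q):
    # Descend to the leaf whose region contains q; its indices are
    # the approximate-nearest-neighbour candidates for q.
    while tree[0] == "node":
        _, direction, threshold, left, right = tree
        p = sum(d * x for d, x in zip(direction, q))
        tree = left if p < threshold else right
    return tree[1]
```

A forest of such trees, each built with a different seed, is queried by
taking the union of the returned leaves, trading a little extra work for
much better recall.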

For more details, read the FLANN paper (very interesting) [3] and the
source code of the Annoy random projection tree building [4].

[1] https://github.com/mariusmuja/flann
[2] https://github.com/spotify/annoy
[3] http://people.cs.ubc.ca/%7Emariusm/uploads/FLANN/flann_visapp09.pdf
[4] https://github.com/spotify/annoy/blob/master/src/annoylib.cc#L354

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2014-01-28 12:35:32 UTC
Permalink
There is also this very interesting paper, which I read a long time ago,
comparing vanilla LSH with k-means based hashing schemes for ANN:

http://hal.inria.fr/docs/00/56/71/91/PDF/paper.pdf

--
Olivier
Mathieu Blondel
2014-01-28 14:16:04 UTC
Permalink
On Tue, Jan 28, 2014 at 9:25 PM, Olivier Grisel <***@ensta.org> wrote:

> While vanilla LSH is an interesting baseline for Approximate Nearest
> Neighbors search, it is often too error-prone to be practically
> useful. There exists alternative data-driven ANN methods that can have
> a much lower error rates (depending on the data). Among the top
> implementations there are FLANN [1] and Spotify's Annoy [2]. Both are
> written in C++ with Python bindings: there is a bench here by Radim:
>
>
> http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/
>
> It would be interesting to implement the baseline vanilla LSH (either
> with random projections or min-hash) and / or Cython versions of the
> random projection forests from Annoy and / or the Hierarchical k-means
> trees and randomized K-D trees from FLANN.
>

As always, I think the rule of thumb for inclusion in scikit-learn should
be that the algorithm is standard in the field and has a fairly high
citation count. Is this the case for the algorithms you mention? What are
the 2 or 3 most famous LSH algorithms?


> All approaches could be probably be implemented quite effiently by
> reusing existing functions and classes from other sklearn modules. For
> instance building 100 FLANN-style randomized kd-trees would look very
> similar to:
>
> ExtraTreesRegressor(n_estimators=100, max_features=1).fit(data,
> np.zeros(data.shape[0]))
>
That sounds similar to

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomTreesEmbedding.html
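To make the analogy concrete, here is a minimal sketch (illustrative only, not
a proposed sklearn API) of how RandomTreesEmbedding's leaf indicators could
serve as a crude candidate generator for neighbor queries: samples that share
many leaves with the query tend to be close.

```python
# Illustrative sketch: abusing RandomTreesEmbedding as a candidate
# generator for approximate neighbor queries.
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.RandomState(0)
data = rng.randn(200, 10)

embedder = RandomTreesEmbedding(n_estimators=50, random_state=0)
leaves = embedder.fit_transform(data)        # sparse one-hot leaf indicators

query_leaves = embedder.transform(data[:1])  # leaf memberships of one query
# number of trees in which each sample falls in the same leaf as the query
shared = np.asarray(leaves.dot(query_leaves.T).todense()).ravel()
candidates = np.argsort(shared)[::-1][:10]   # top-10 candidate neighbors
```

A real index would replace the brute-force leaf-count scan with per-leaf
inverted lists, but the data structure is the same.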

Mathieu
Olivier Grisel
2014-01-28 14:31:16 UTC
Permalink
2014/1/28 Mathieu Blondel <***@mblondel.org>:
>
>
>
> On Tue, Jan 28, 2014 at 9:25 PM, Olivier Grisel <***@ensta.org>
> wrote:
>>
>> While vanilla LSH is an interesting baseline for Approximate Nearest
>> Neighbors search, it is often too error-prone to be practically
>> useful. There exist alternative data-driven ANN methods that can have
>> much lower error rates (depending on the data). Among the top
>> implementations there are FLANN [1] and Spotify's Annoy [2]. Both are
>> written in C++ with Python bindings: there is a bench here by Radim:
>>
>>
>> http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/
>>
>> It would be interesting to implement the baseline vanilla LSH (either
>> with random projections or min-hash) and / or Cython versions of the
>> random projection forests from Annoy and / or the Hierarchical k-means
>> trees and randomized K-D trees from FLANN.
>
>
> As always, I think the rule of thumb for inclusion in scikit-learn should be
> that the algorithm is standard in the field and has a fairly high citation
> count. Is this the case for the algorithms you mention? What are the 2 or 3
> most famous LSH algorithms?

The original FLANN paper is from 2009 and has 168 citations:

http://citeseer.ist.psu.edu/showciting;jsessionid=444F2769597BBE3617FC3AF812E51FAD?doi=10.1.1.160.1721

The method implemented in Annoy (a forest of trees whose nodes use
rejection-sampled random projections) is an empirical trick that is
apparently common among practitioners, but I do not know the main
reference:

https://twitter.com/ogrisel/status/428137534632099840

The code looks very simple and we can reuse our sparse random
projection matrices to spare some memory and speed up projections.
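As a rough illustration of that reuse (my sketch, not Annoy's code),
sklearn's SparseRandomProjection can generate many candidate split
directions in a single sparse matrix multiply; each projected column could
then serve as a node-splitting coordinate:

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(42)
X = rng.randn(500, 100)

# One sparse matrix multiply yields 20 candidate hyperplane projections.
srp = SparseRandomProjection(n_components=20, random_state=42)
projections = np.asarray(srp.fit_transform(X))   # shape (500, 20)

# Use the first projection as a tree-node split, thresholded at the median.
col = projections[:, 0]
threshold = np.median(col)
left, right = X[col <= threshold], X[col > threshold]
```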

>> All approaches could probably be implemented quite efficiently by
>> reusing existing functions and classes from other sklearn modules. For
>> instance building 100 FLANN-style randomized kd-trees would look very
>> similar to:
>>
>> ExtraTreesRegressor(n_estimators=100, max_features=1).fit(data,
>> np.zeros(data.shape[0]))
>>
> That sounds similar to
>
> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomTreesEmbedding.html

This is indeed the same. What is missing is storing the transformed /
indexed training data and a method to run KNN queries on the output
(e.g. using the same API as the exact nearest neighbors methods in the
sklearn.neighbors package).
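For instance (a hypothetical sketch: the class name and the brute-force
backend are mine, only the fit / kneighbors signature mirrors
sklearn.neighbors.NearestNeighbors):

```python
import numpy as np

class RandomProjectionForestANN:
    """Hypothetical ANN index exposing the sklearn.neighbors query API.

    The brute-force search below is a stand-in for the tree-based index.
    """

    def fit(self, X):
        # a real implementation would build the randomized trees here
        self._fit_X = np.asarray(X)
        return self

    def kneighbors(self, X, n_neighbors=5):
        X = np.asarray(X)
        # pairwise Euclidean distances, shape (n_queries, n_indexed)
        dists = np.linalg.norm(X[:, None, :] - self._fit_X[None, :, :], axis=2)
        ind = np.argsort(dists, axis=1)[:, :n_neighbors]
        return np.take_along_axis(dists, ind, axis=1), ind

rng = np.random.RandomState(0)
X_train = rng.randn(30, 4)
ann = RandomProjectionForestANN().fit(X_train)
dist, ind = ann.kneighbors(X_train[:1], n_neighbors=3)  # query a training point
```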

Anyway, even if we decide not to include the most recent ANN methods in
the main sklearn repo but instead ship them in a side repo that follows
the sklearn API and coding conventions, I think it would be an
interesting topic for a GSoC.

The baseline LSH method could be implemented directly in sklearn though.
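For reference, the vanilla baseline could be as small as the following sketch
(signed random projections for cosine LSH; the helper name and parameters are
illustrative, not a proposed sklearn interface):

```python
import numpy as np

def lsh_signatures(X, n_bits=16, seed=0):
    """Hash each row of X into an n_bits bucket key via random hyperplanes."""
    rng = np.random.RandomState(seed)
    planes = rng.randn(X.shape[1], n_bits)       # one hyperplane per bit
    bits = (X.dot(planes) > 0).astype(np.int64)  # sign pattern of projections
    # pack the bit pattern of each row into a single integer bucket key
    return bits.dot(2 ** np.arange(n_bits, dtype=np.int64))

rng = np.random.RandomState(1)
X = rng.randn(50, 8)
keys = lsh_signatures(X)   # one bucket key per sample
```

Querying then amounts to hashing the query and scanning its bucket; in
practice several hash tables with independent seeds are used to boost recall.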

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2014-01-28 14:33:23 UTC
Permalink
2014/1/28 Olivier Grisel <***@ensta.org>:
>
>> As always, I think the rule of thumb for inclusion in scikit-learn should be
>> that the algorithm is standard in the field and has a fairly high citation
>> count. Is this the case for the algorithms you mention? What are the 2 or 3
>> most famous LSH algorithms?
>
> The original FLANN paper is from 2009 and has 168 citations:
>
> http://citeseer.ist.psu.edu/showciting;jsessionid=444F2769597BBE3617FC3AF812E51FAD?doi=10.1.1.160.1721

Actually Google found a lot more (698):

http://scholar.google.fr/scholar?cites=6315457751882853913

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gilles Louppe
2014-01-28 14:44:17 UTC
Permalink
Given our intent to release 1.0 in the near future, I think we should
also make it clear on the wiki page that adding more and more
algorithms is not exactly the direction we want to go in.
Maybe this is the opportunity to remove some of the old subjects from
2013 and instead add topics focusing on improvements of existing parts
of the project.

On 28 January 2014 15:33, Olivier Grisel <***@ensta.org> wrote:
> 2014/1/28 Olivier Grisel <***@ensta.org>:
>>
>>> As always, I think the rule of thumb for inclusion in scikit-learn should be
>>> that the algorithm is standard in the field and has a fairly high citation
>>> count. Is this the case for the algorithms you mention? What are the 2 or 3
>>> most famous LSH algorithms?
>>
>> The original FLANN paper is from 2009 and has 168 citations:
>>
>> http://citeseer.ist.psu.edu/showciting;jsessionid=444F2769597BBE3617FC3AF812E51FAD?doi=10.1.1.160.1721
>
> Actually Google found a lot more (698):
>
> http://scholar.google.fr/scholar?cites=6315457751882853913
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-***@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Arnaud Joly
2014-01-28 14:35:14 UTC
Permalink
On 28 Jan 2014, at 15:31, Olivier Grisel <***@ensta.org> wrote:

> 2014/1/28 Mathieu Blondel <***@mblondel.org>:
>>
>>
>>
>> On Tue, Jan 28, 2014 at 9:25 PM, Olivier Grisel <***@ensta.org>
>> wrote:
>>>
>>> While vanilla LSH is an interesting baseline for Approximate Nearest
>>> Neighbors search, it is often too error-prone to be practically
>>> useful. There exist alternative data-driven ANN methods that can have
>>> much lower error rates (depending on the data). Among the top
>>> implementations there are FLANN [1] and Spotify's Annoy [2]. Both are
>>> written in C++ with Python bindings: there is a bench here by Radim:
>>>
>>>
>>> http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/
>>>
>>> It would be interesting to implement the baseline vanilla LSH (either
>>> with random projections or min-hash) and / or Cython versions of the
>>> random projection forests from Annoy and / or the Hierarchical k-means
>>> trees and randomized K-D trees from FLANN.
>>
>>
>> As always, I think the rule of thumb for inclusion in scikit-learn should be
>> that the algorithm is standard in the field and has a fairly high citation
>> count. Is this the case for the algorithms you mention? What are the 2 or 3
>> most famous LSH algorithms?
>
> The original FLANN paper is from 2009 and has 168 citations:
>
> http://citeseer.ist.psu.edu/showciting;jsessionid=444F2769597BBE3617FC3AF812E51FAD?doi=10.1.1.160.1721
>
> The method implemented in Annoy (a forest of trees whose nodes use
> rejection-sampled random projections) is an empirical trick that is
> apparently common among practitioners but I do not know the main
> reference:
>
> https://twitter.com/ogrisel/status/428137534632099840
>
> The code looks very simple and we can reuse our sparse random
> projection matrices to spare some memory and speed up projections.
>

I don’t know Annoy, but could it be random projection trees
as in http://cseweb.ucsd.edu/~dasgupta/papers/rptree-stoc.pdf?

Arnaud
Olivier Grisel
2014-01-28 14:45:43 UTC
Permalink
2014/1/28 Arnaud Joly <***@ulg.ac.be>:
>
> The code looks very simple and we can reuse our sparse random
> projection matrices to spare some memory and speed up projections.
>
>
> I don’t know Annoy, but could it be random projection trees
> as in http://cseweb.ucsd.edu/~dasgupta/papers/rptree-stoc.pdf?

This is similar, but instead of finding the RP direction based on a
rule, Annoy chooses the direction at random and rejects the split if
all the data at that node would end up on the same side of the
hyperplane. It tries up to 20 times and does a random shuffling split
instead if the RP attempt budget is consumed.
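In pseudocode form, the trick could look like the following sketch (written
from the description above, not Annoy's actual implementation; hyperplanes
through the origin and the 20-attempt budget are assumptions taken from this
thread):

```python
import numpy as np

def choose_split(X, rng, max_attempts=20):
    """Pick a random hyperplane split, rejecting degenerate one-sided splits."""
    for _ in range(max_attempts):
        direction = rng.randn(X.shape[1])    # random projection direction
        side = X.dot(direction) > 0          # which side of the hyperplane
        if side.any() and not side.all():    # accept: both children non-empty
            return direction, side
    # attempts budget consumed: fall back to a random shuffling split
    side = rng.permutation(len(X)) < len(X) // 2
    return None, side

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
direction, side = choose_split(X, rng)
```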

Also, the random projection trees paper does not really discuss
applications to Approximate Nearest Neighbors search.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel