Discussion:
[Scikit-learn-general] Latent Dirichlet Allocation
Rockenkamm, Christian
2016-01-26 13:21:10 UTC
Hello,

I have a question about Latent Dirichlet Allocation: the results I get from it are a bit confusing.
I use about 3000 documents. For the preparation with CountVectorizer I use the following parameters: max_df=0.95 and min_df=0.05.
For the LDA fit I use the batch learning method. I have tried many different values for the other parameters, but regardless of the configuration I face the same problem: I get topics that are never used in any of the documents, and those topics all show the same topic-word distribution. I even tried gensim with the same configuration as scikit-learn, yet I still encountered the problem. Lowering the number of topics did not lead to the expected results either: with 100 topics, 20-27 were still affected; with 50 topics, 2-8 were, depending on the parameter settings.
Does anybody have an idea what might be causing this problem and how to resolve it?

Best regards,
Christian Rockenkamm
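[Editorial note: one quick way to see which topics are effectively unused is to inspect the document-topic matrix (in scikit-learn, the rows that lda.transform(X) returns): a topic that never receives appreciable weight in any document is a "dead" topic. A minimal stdlib-only sketch with a hypothetical toy matrix standing in for the real output:]

```python
# Toy document-topic matrix: 4 documents x 3 topics.
# In practice this would come from lda.transform(X); each row is a
# per-document topic distribution.
doc_topic = [
    [0.70, 0.25, 0.05],
    [0.60, 0.35, 0.05],
    [0.10, 0.85, 0.05],
    [0.45, 0.50, 0.05],
]

threshold = 0.10  # call a topic "used" if any doc gives it >= 10% weight

n_topics = len(doc_topic[0])
unused = [
    k for k in range(n_topics)
    if all(row[k] < threshold for row in doc_topic)
]
print(unused)  # topic 2 never crosses the threshold in any document
```

The threshold is arbitrary; the symptom described in the thread (many topics with identical word distributions that no document uses) would show up as a large `unused` list.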
Andreas Mueller
2016-01-26 18:13:35 UTC
Hi Christian.
Can you provide the data and code to reproduce?
Best,
Andy

Joel Nothman
2016-01-26 22:35:15 UTC
How many distinct words are in your dataset?

Rockenkamm, Christian
2016-01-26 23:01:32 UTC
I used several datasets, with between 2200 and 3500 distinct words in the term-frequency matrix used for training the LDA. The data are lemmatized before being passed to CountVectorizer.
________________________________
From: Joel Nothman [***@gmail.com]
Sent: Tuesday, 26 January 2016 23:35
To: scikit-learn-general
Subject: Re: [Scikit-learn-general] Latent Dirichlet Allocation

How many distinct words are in your dataset?

Manjush Vundemodalu
2016-02-09 08:05:50 UTC
I think most of your words are being filtered out of the tf matrix by the condition min_df=0.05.

I faced a similar problem while working with chat data; using the absolute count min_df=2 instead of a float value fixed it.

Regards,
Manjush
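[Editorial note: CountVectorizer interprets a float min_df as a fraction of documents and an int as an absolute document count, so min_df=0.05 on ~3000 documents keeps only words appearing in at least ~150 of them. A stdlib-only sketch of that filtering semantics (toy documents; not the actual scikit-learn implementation):]

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
    "the mat was red",
]

# Document frequency: in how many documents each word appears.
df = Counter(word for doc in docs for word in set(doc.split()))
n_docs = len(docs)

def vocabulary(min_df):
    """Mimic CountVectorizer's min_df: float -> fraction, int -> count."""
    cutoff = min_df * n_docs if isinstance(min_df, float) else min_df
    return {w for w, c in df.items() if c >= cutoff}

print(sorted(vocabulary(2)))     # words in at least 2 documents
print(sorted(vocabulary(0.75)))  # words in at least 75% of documents
```

With only 4 documents, min_df=0.75 already reduces the vocabulary to a single word, which illustrates how aggressive a fractional cutoff can be on small or short-text corpora.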

Vlad Niculae
2016-02-09 16:26:15 UTC
I usually use an absolute threshold for min_df and a relative one for max_df. For choosing the latter, I find it very useful to look at the histogram of word document frequencies (dfs); it varies a lot from dataset to dataset. For short texts, like tweets, even words such as "the" can have a df of only 0.1.

It's very easy to look at dfs: take the transformed X from your vectorizer (a scipy sparse matrix) and do:

>>> import numpy as np
>>> df = (X > 0).sum(axis=0)             # number of docs containing each term
>>> df = df.A.ravel().astype(np.double)  # matrix -> flat dense array
>>> df /= X.shape[0]                     # fraction of docs containing each term


My 2c,
Vlad
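[Editorial note: the df fractions computed above can be eyeballed without any plotting library; a stdlib-only sketch that buckets hypothetical df values into deciles to show where a max_df cutoff would bite:]

```python
from collections import Counter

# Hypothetical per-word df fractions, e.g. computed as in the snippet above.
dfs = [0.02, 0.03, 0.05, 0.05, 0.10, 0.12, 0.40, 0.55, 0.90, 0.95]

# Bucket each fraction into one of ten decile bins [0.0,0.1), ..., [0.9,1.0].
hist = Counter(min(int(f * 10), 9) for f in dfs)
for bucket in range(10):
    lo, hi = bucket / 10, (bucket + 1) / 10
    print(f"[{lo:.1f}, {hi:.1f}): {'#' * hist.get(bucket, 0)}")
```

A heavy top bucket suggests near-stopword terms that max_df could drop; a heavy bottom bucket shows how much vocabulary a fractional min_df would remove.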
