Topic modeling is a way to analyze large collections of documents by clustering them into latent topics. In particular, Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling techniques; the papers that introduced the method are:
- D. M. Blei, et al. Latent Dirichlet Allocation. Journal of Machine Learning Research 3, pp. 993-1022, 2003.
- M. D. Hoffman, et al. Online Learning for Latent Dirichlet Allocation. NIPS 2010.
Hivemall enables you to analyze your data, such as, but not limited to, documents, based on LDA. This page gives usage instructions for the feature.
> **Note**
> This feature is supported in Hivemall v0.5-rc.1 or later.
# Prepare document data
Assume that we already have a table `docs` which contains many documents as strings:
| docid | doc |
|---|---|
| 1 | "Fruits and vegetables are healthy." |
| 2 | "I like apples, oranges, and avocados. I do not like the flu or colds." |
| ... | ... |
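If you would like to follow along, such a table can be created with a minimal sketch like the following (this DDL is our assumption; it is not part of the original guide):

```sql
-- Hypothetical setup for the sample table used throughout this page.
create table docs as
select 1 as docid, 'Fruits and vegetables are healthy.' as doc
union all
select 2 as docid, 'I like apples, oranges, and avocados. I do not like the flu or colds.' as doc
;
```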
Hivemall has several functions that are particularly useful for text processing. More specifically, by using `tokenize()` and `is_stopword()`, you can immediately convert the documents to a bag-of-words-like format:
```sql
with word_counts as (
  select
    docid,
    feature(word, count(word)) as word_count
  from
    docs t1
    LATERAL VIEW explode(tokenize(doc, true)) t2 as word
  where
    not is_stopword(word)
  group by
    docid, word
)
select docid, collect_list(word_count) as features
from word_counts
group by docid
;
```
| docid | features |
|---|---|
| 1 | ["fruits:1","healthy:1","vegetables:1"] |
| 2 | ["apples:1","avocados:1","colds:1","flu:1","like:2","oranges:1"] |
> **Note**
> As long as your data can be represented in the feature format, LDA can be applied to arbitrary data as a generic clustering technique.
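For instance, purchase logs in the same `name:value` feature format could be clustered in exactly the same way with the `train_lda()` function introduced in the next section. The table `purchases` and its columns below are purely hypothetical:

```sql
-- Hypothetical: cluster users by purchased items instead of documents by words.
select
  train_lda(features, '-topics 2 -iter 20') as (label, word, lambda)
from (
  select userid, collect_list(feature(item, cnt)) as features
  from purchases -- assumed schema: (userid, item, cnt)
  group by userid
) t
;
```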
# Building Topic Models and Finding Topic Words
Each feature vector is input to the `train_lda()` function:
```sql
with word_counts as (
  select
    docid,
    feature(word, count(word)) as word_count
  from docs t1 LATERAL VIEW explode(tokenize(doc, true)) t2 as word
  where
    not is_stopword(word)
  group by
    docid, word
),
input as (
  select docid, collect_list(word_count) as features
  from word_counts
  group by docid
  order by docid -- forces a single reducer; see the note below
)
select
  train_lda(features, '-topics 2 -iter 20') as (label, word, lambda)
from
  input
;
```
Here, the option `-topics 2` specifies the number of topics we assume to exist in the set of documents.
Notice that `order by docid` ensures that the LDA model is built precisely on a single node. In case you would like to launch `train_lda` in parallel instead, the following query hopefully returns a similar (but possibly slightly approximated) result by averaging the models trained on separate partitions:
```sql
with word_counts as (
  -- same as above
),
input as (
  select docid, collect_list(word_count) as features
  from word_counts
  group by docid
)
select
  label, word, avg(lambda) as lambda
from (
  select
    train_lda(features, '-topics 2 -iter 20') as (label, word, lambda)
  from
    input
) t2
group by label, word
-- order by lambda desc -- ordering is optional
;
```
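The following steps assume the model has been stored in a table named `lda_model`. One way to materialize it (our sketch; the guide does not prescribe a particular statement) is to prepend `create table` to either of the training queries:

```sql
-- Materialize the trained model so it can be joined against later.
create table lda_model as
with word_counts as (
  -- same as in the bag-of-words query above
  select
    docid,
    feature(word, count(word)) as word_count
  from docs t1 LATERAL VIEW explode(tokenize(doc, true)) t2 as word
  where not is_stopword(word)
  group by docid, word
),
input as (
  select docid, collect_list(word_count) as features
  from word_counts
  group by docid
  order by docid
)
select
  train_lda(features, '-topics 2 -iter 20') as (label, word, lambda)
from
  input
;
```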
Eventually, a new table `lda_model` is generated as shown below:
| label | word | lambda |
|---|---|---|
| 0 | fruits | 0.33372128 |
| 0 | vegetables | 0.33272517 |
| 0 | healthy | 0.33246377 |
| 0 | flu | 2.3617347E-4 |
| 0 | apples | 2.1898883E-4 |
| 0 | oranges | 1.8161473E-4 |
| 0 | like | 1.7666373E-4 |
| 0 | avocados | 1.726186E-4 |
| 0 | colds | 1.037139E-4 |
| 1 | colds | 0.16622013 |
| 1 | avocados | 0.16618845 |
| 1 | oranges | 0.1661859 |
| 1 | like | 0.16618414 |
| 1 | apples | 0.16616651 |
| 1 | flu | 0.16615893 |
| 1 | healthy | 0.0012059759 |
| 1 | vegetables | 0.0010818697 |
| 1 | fruits | 6.080827E-4 |
In the table, `label` indicates a topic index, and `lambda` is a value that represents how likely each word is to characterize a topic. That is, in terms of `lambda`, the top-N words are the topic words of a topic; a query that extracts them is sketched below. Clearly, we can observe that topic `0` corresponds to document `1`, and topic `1` represents the words in document `2`.
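As one possible way to list the top-N topic words per topic, here is a sketch using a standard window function (this query is our addition, not part of the original guide):

```sql
-- Top-3 words per topic, ranked by lambda.
select label, word, lambda
from (
  select
    label, word, lambda,
    row_number() over (partition by label order by lambda desc) as rnk
  from lda_model
) t
where rnk <= 3
;
```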
# Predicting Topic Assignments of Documents
Once you have constructed topic models as described above, the `lda_predict()` function allows you to predict topic assignments of documents. For example, if we consider the `docs` table, i.e., exactly the same set of documents as used for training, the probability that a document is assigned to each topic can be computed by:
```sql
with test as (
  select
    docid,
    word,
    count(word) as value
  from
    docs t1
    LATERAL VIEW explode(tokenize(doc, true)) t2 as word
  where
    not is_stopword(word)
  group by
    docid, word
)
select
  t.docid,
  lda_predict(t.word, t.value, m.label, m.lambda, '-topics 2') as probabilities
from
  test t
  JOIN lda_model m ON (t.word = m.word)
group by
  t.docid
;
```
| docid | probabilities (sorted in descending order) |
|---|---|
| 1 | [{"label":0,"probability":0.875},{"label":1,"probability":0.125}] |
| 2 | [{"label":1,"probability":0.9375},{"label":0,"probability":0.0625}] |
Importantly, the option `-topics` is expected to be set to the same value as used for training. Since the probabilities are sorted in descending order, the label of the most likely topic is easily obtained as:
```sql
select docid, probabilities[0].label
from topic;
```
| docid | label |
|---|---|
| 1 | 0 |
| 2 | 1 |
Of course, using a different set of documents for prediction is also possible. Predicting topic assignments of newly observed documents is the more realistic scenario.
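For instance, a single unseen document could be classified against the stored model as follows; the document text here is made up for illustration:

```sql
-- Hypothetical: predict the topic of a new document using the trained model.
-- Words that never appeared during training are dropped by the join.
with test as (
  select
    docid,
    word,
    count(word) as value
  from
    (select 1 as docid, 'Apples and oranges are my favorite fruits.' as doc) t1
    LATERAL VIEW explode(tokenize(doc, true)) t2 as word
  where
    not is_stopword(word)
  group by
    docid, word
)
select
  t.docid,
  lda_predict(t.word, t.value, m.label, m.lambda, '-topics 2') as probabilities
from
  test t
  JOIN lda_model m ON (t.word = m.word)
group by
  t.docid
;
```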