Okapi BM25 is a ranking function for documents for a given query.
It can also be used for a better replacement of TF-IDF and can be used for term-weight for each document.
The ranking function
Given a query , containing keywords , the BM25 score of a document is:
where is 's term frequency in the document , is the length of the document in words, and is the average document length in the text collection from which documents are drawn. and are free parameters, usually chosen, in absence of an advanced optimization, as and .
BM25 can also be applied for term weighing, showing how important a word is to a document in a collection or corpus, as follows:
where is a term appeared in document .
Data preparation
In similar to TF-IDF, you need to prepare a relation consists of (docid,word) tuples to compute BM25 score.
create external table wikipage (
docid int,
page string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
cd ~/tmp
wget https://gist.githubusercontent.com/myui/190b91a3a792ccfceda0/raw/327acd192da4f96da8276dcdff01b19947a4373c/tfidf_test.tsv
LOAD DATA LOCAL INPATH '/home/myui/tmp/tfidf_test.tsv' INTO TABLE wikipage;
create or replace view wikipage_exploded
as
select
docid,
word
from
wikipage LATERAL VIEW explode(tokenize(page,true)) t as word
where
not is_stopword(word);
Define views of term/doc frequency
create or replace view term_frequency
as
select
t1.docid,
t2.word,
t2.freq
from (
select
docid,
tf(word) as word2freq
from
wikipage_exploded
group by
docid
) t1
LATERAL VIEW explode(word2freq) t2 as word, freq;
create or replace view document_frequency
as
select
word,
count(distinct docid) docs
from
wikipage_exploded
group by
word;
create or replace view doc_len
as
select
docid,
count(1) as dl,
avg(count(1)) over () as avgdl,
count(distinct docid) over () as total_docs
from
wikipage_exploded
group by
docid
;
Compute Okapi BM25 score
BM25 (and TF-IDF) score that represents importance of term for each document is useful for feature weight in feature engineering.
create table scores
as
select
tf.docid,
tf.word,
bm25(
tf.freq,
dl.dl,
dl.avgdl,
dl.total_docs,
df.docs
-- , '-k1 1.5 -b 0.75'
) as bm25,
tfidf(tf.freq, df.docs, dl.total_docs) as tfidf
from
term_frequency tf
JOIN document_frequency df ON (tf.word = df.word)
JOIN doc_len dl ON (tf.docid = dl.docid)
;
Hyperparameters
bm25()
's function signature and hyperparameters are as follows:
hive> select bm25();
FAILED: SemanticException Line 1:7 Wrong arguments 'bm25':
#arguments must be greater than or equal to 5: 0
usage: bm25(double termFrequency, int docLength, double avgDocLength, int
numDocs, int numDocsWithTerm [, const string options]) - Return an
Okapi BM25 score in double [-b <arg>] [-d <arg>] [-k1 <arg>]
[-min_idf <arg>]
-b <arg> Hyperparameter with type double in range 0.0
and 1.0 [default: 0.75]
-d,--delta <arg> Hyperparameter delta of BM25+ [default: 0.0]
-k1 <arg> Hyperparameter with type double, usually in
range 1.2 and 2.0 [default: 1.2]
-min_idf,--epsilon <arg> Hyperparameter delta of BM25+ [default: 1e-8]
Show important terms for each document
select
docid,
to_ordered_list(feature(word,bm25), bm25, '-k 10') as bm25_scores,
to_ordered_list(feature(word,tfidf),tfidf, '-k 10') as tfidf_scores
from
scores
group by
docid
limit 10;
Retrive relevant documents for a given search terms
You can retrieve relevant documents for a given search query wisdom, justice, discussion
as follows:
WITH scores as (
select
tf.docid,
tf.word,
bm25(
tf.freq,
dl.dl,
dl.avgdl,
dl.total_docs,
df.docs
-- , '-k1 1.5 -b 0.75'
) as bm25,
tfidf(tf.freq, df.docs, dl.total_docs) as tfidf
from
term_frequency tf
JOIN document_frequency df ON (tf.word = df.word)
JOIN doc_len dl ON (tf.docid = dl.docid)
where
tf.word in ('wisdom', 'justice', 'discussion')
)
select
docid,
sum(bm25) as score
from
scores
group by
docid
order by
score DESC
LIMIT 10
;
docid | score |
---|---|
1 | 0.14190456024682774 |
2 | 0.02197354085722251 |