Okapi BM25 is a ranking function for documents for a given query.

It can also be used for a better replacement of TF-IDF and can be used for term-weight for each document.

The ranking function

Given a query QQ, containing keywords q1,....,qnq1,....,q_n, the BM25 score of a document DD is:

score(Q,D)=i=1nIDF(qi)tf(qi,D)(k1+1)tf(qi,D)+k1(1b+bDavgdl) score(Q, D) = \sum_{i=1}^{n}IDF(q_{i}) \cdot \frac{tf(q_{i},D) \cdot (k_{1}+1)}{tf(q_{i},D) + k_{1} \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}

where tf(qi,D)tf(q_{i}, D) is qiq_{i}'s term frequency in the document DD, D|D| is the length of the document DD in words, and avgdlavgdl is the average document length in the text collection from which documents are drawn. k1k_{1} and bb are free parameters, usually chosen, in absence of an advanced optimization, as k1[1.2,2.0]k_{1} \in [1.2,2.0] and b=0.75b = 0.75.

BM25 can also be applied for term weighing, showing how important a word is to a document in a collection or corpus, as follows:

score(ti,D)=IDF(ti)tf(ti,D)(k1+1)tf(ti,D)+k1(1b+bDavgdl) score(t_{i}, D) = IDF(t_{i}) \cdot \frac{tf(t_{i},D) \cdot (k_{1}+1)}{tf(t_{i},D) + k_{1} \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}

where tit_{i} is a term appeared in document DD.

Data preparation

In similar to TF-IDF, you need to prepare a relation consists of (docid,word) tuples to compute BM25 score.

create external table wikipage (
  docid int,
  page string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

cd ~/tmp
wget https://gist.githubusercontent.com/myui/190b91a3a792ccfceda0/raw/327acd192da4f96da8276dcdff01b19947a4373c/tfidf_test.tsv

LOAD DATA LOCAL INPATH '/home/myui/tmp/tfidf_test.tsv' INTO TABLE wikipage;

create or replace view wikipage_exploded
as
select
  docid, 
  word
from
  wikipage LATERAL VIEW explode(tokenize(page,true)) t as word
where
  not is_stopword(word);

Define views of term/doc frequency

create or replace view term_frequency 
as
select
  t1.docid, 
  t2.word,
  t2.freq
from (
  select
    docid,
    tf(word) as word2freq
  from
    wikipage_exploded
  group by
    docid
) t1 
LATERAL VIEW explode(word2freq) t2 as word, freq;

create or replace view document_frequency
as
select
  word, 
  count(distinct docid) docs
from
  wikipage_exploded
group by
  word;

create or replace view doc_len
as
select 
  docid, 
  count(1) as dl,
  avg(count(1)) over () as avgdl,
  count(distinct docid) over () as total_docs
from
  wikipage_exploded
group by
  docid
;

Compute Okapi BM25 score

BM25 (and TF-IDF) score that represents importance of term for each document is useful for feature weight in feature engineering.

create table scores
as
select
  tf.docid,
  tf.word,
  bm25(
    tf.freq,
    dl.dl,
    dl.avgdl,
    dl.total_docs,
    df.docs
    -- , '-k1 1.5 -b 0.75'
  ) as bm25,
  tfidf(tf.freq, df.docs, dl.total_docs) as tfidf
from
  term_frequency tf
  JOIN document_frequency df ON (tf.word = df.word)
  JOIN doc_len dl ON (tf.docid = dl.docid)
;

Hyperparameters

bm25()'s function signature and hyperparameters are as follows:

hive> select bm25();
FAILED: SemanticException Line 1:7 Wrong arguments 'bm25':

#arguments must be greater than or equal to 5: 0

usage: bm25(double termFrequency, int docLength, double avgDocLength, int
       numDocs, int numDocsWithTerm [, const string options]) - Return an
       Okapi BM25 score in double [-b <arg>] [-d <arg>] [-k1 <arg>]
       [-min_idf <arg>]
 -b <arg>                   Hyperparameter with type double in range 0.0
                            and 1.0 [default: 0.75]
 -d,--delta <arg>           Hyperparameter delta of BM25+ [default: 0.0]
 -k1 <arg>                  Hyperparameter with type double, usually in
                            range 1.2 and 2.0 [default: 1.2]
 -min_idf,--epsilon <arg>   Hyperparameter delta of BM25+ [default: 1e-8]

Show important terms for each document

select
  docid, 
  to_ordered_list(feature(word,bm25), bm25, '-k 10') as bm25_scores,
  to_ordered_list(feature(word,tfidf),tfidf, '-k 10') as tfidf_scores
from 
  scores
group by
  docid
limit 10;

Retrive relevant documents for a given search terms

You can retrieve relevant documents for a given search query wisdom, justice, discussion as follows:

WITH scores as (
  select
    tf.docid,
    tf.word,
    bm25(
      tf.freq,
      dl.dl,
      dl.avgdl,
      dl.total_docs,
      df.docs
      -- , '-k1 1.5 -b 0.75'
    ) as bm25,
    tfidf(tf.freq, df.docs, dl.total_docs) as tfidf
  from
    term_frequency tf
    JOIN document_frequency df ON (tf.word = df.word)
    JOIN doc_len dl ON (tf.docid = dl.docid)
  where
    tf.word in ('wisdom', 'justice', 'discussion')
)
select
  docid,
  sum(bm25) as score 
from
  scores 
group by
  docid
order by
  score DESC 
LIMIT 10
;
docid score
1 0.14190456024682774
2 0.02197354085722251

results matching ""

    No results matching ""