Next:Cover Density Ranking (CDR)Up:Combining the HITS-based AlgorithmsPrevious:Vector Space Model (VSM)

Okapi Similarity Measurement (Okapi)

The Okapi similarity measurement is one of the most popular methods in the traditional IR field. Unlike VSM, the Okapi method considers not only the frequency of the query terms, but also the length of the document under evaluation and the average document length over the whole collection. In the Okapi method, the similarity between a query $q$ and a document $x_i$, $sim_o(q,x_i)$, can also be described as the inner product of the query vector $Q$ and the document vector $X_i$ as follows [13,23]:
$\displaystyle sim_o(q,x_i) = Q \cdot X_i =\sum_{j=1}^{m} v_j \cdot w_{ij}$     (7)

where $m$ is the number of unique terms in the document collection; $v_j$ is the frequency of a term $y_j$ in the query $q$; and $w_{ij}$ is the document weight:

$\displaystyle w_{ij} = \frac{f_{ij} \cdot \log\left(\frac{N-d_j+0.5}{d_j+0.5}\right)}{2 \cdot \left(0.25+0.75\cdot \frac{dl}{avdl}\right)+f_{ij}}$     (8)

where $f_{ij}$ is the frequency of the term $y_j$ in the document $x_i$; $N$ is the total number of documents in the collection; $d_j$ is the number of documents in the collection that contain the query term $y_j$; $dl$ is the length of the document $x_i$ (in bytes); and $avdl$ is the average document length in the collection (in bytes).
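As an illustration, Eqs. (7) and (8) can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in this work: the function names, the token-list representation of queries and documents, and the `doc_freq` dictionary are hypothetical, and the natural logarithm is assumed because the base of the logarithm is not stated above.

```python
import math
from collections import Counter


def okapi_weight(f_ij, d_j, n_docs, dl, avdl):
    """Document weight w_ij of Eq. (8).

    f_ij: frequency of term y_j in document x_i
    d_j: number of documents containing y_j
    n_docs: total number of documents N
    dl, avdl: document length and average document length (same unit)
    """
    # Natural log is an assumption; the text does not give the base.
    idf = math.log((n_docs - d_j + 0.5) / (d_j + 0.5))
    length_norm = 2.0 * (0.25 + 0.75 * dl / avdl)
    return f_ij * idf / (length_norm + f_ij)


def okapi_similarity(query_terms, doc_terms, doc_freq, n_docs, dl, avdl):
    """sim_o(q, x_i) of Eq. (7).

    Terms absent from the query have v_j = 0 and terms absent from the
    document have w_ij = 0, so summing over the query terms that occur
    in the document equals the sum over all m collection terms.
    """
    v = Counter(query_terms)  # v_j: term frequencies in the query
    f = Counter(doc_terms)    # f_ij: term frequencies in the document
    score = 0.0
    for term, v_j in v.items():
        f_ij = f.get(term, 0)
        if f_ij == 0:
            continue
        score += v_j * okapi_weight(f_ij, doc_freq[term], n_docs, dl, avdl)
    return score
```

Note that the logarithmic factor becomes negative when a term occurs in more than half of the documents ($d_j > N/2$), so very common query terms lower the score in this form of the weight.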

For reasons similar to those given for the VSM method, the Okapi similarity measurement cannot be applied directly in evaluating the precision of search engines [20]: it requires values for $N$, $d_j$, and $avdl$. In our research, we estimate the values of $N$ and $d_j$ in the way described in the previous section for VSM. In addition, the average length of a Web document ($avdl$) is estimated to be 10,939 bytes after removing all HTML tags and JavaScript.


2002-02-18