As noted earlier, the historical bifurcation of the retrieval machinery can, in no small part, be attributed to the differences between sparse and dense vectors—in addition to the application domain. For example, sparse vectors are plagued with a much more serious case of the curse of dimensionality. In extremely high-dimensional spaces where one may have thousands to millions of dimensions, the geometrical properties and probabilistic certainty that power clustering start to break down. So does our intuition of the space.
The high dimensionality of sparse vectors poses another challenge: basic operations require far more computation. While optimized implementations of spherical k-means exist for sparse vectors (see, e.g., [38] and references therein), their efficiency nonetheless degrades with the number of dimensions. Standard k-means is even more severely affected: cluster centroids are likely to be high-dimensional dense vectors, leading to orders of magnitude more computation to perform cluster assignments in each iteration of the algorithm.
The reason sketching is appealing is that the mathematics behind it offers guarantees in an oblivious manner: with no further assumptions on the source, nature, or distribution of the vectors themselves. Additionally, computing a sketch is typically fast, a prerequisite for its use in streaming algorithms. Finally, the resulting sketch lives in a (dense and) low-dimensional space, facilitating faster subsequent computation in exchange for a controllable error.
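To make the obliviousness concrete, the following is a minimal sketch of a JL-style random projection in NumPy. The dimensions, the Gaussian construction, and the function name `jl_sketch` are illustrative choices, not the paper's implementation; the point is only that the projection matrix is drawn independently of the data, yet inner products are approximately preserved.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 30_000, 256  # original (sparse) dimension and sketch size; illustrative values

# Dense Gaussian projection, scaled so inner products are preserved in
# expectation. Phi is drawn with no knowledge of the data: the guarantee
# is oblivious to the vectors' distribution.
Phi = rng.normal(scale=1.0 / np.sqrt(n), size=(n, d))

def jl_sketch(x):
    """Map a d-dimensional vector to an n-dimensional dense sketch."""
    return Phi @ x

# Two unit-norm sparse vectors with 100 random non-zero coordinates each.
u = np.zeros(d); u[rng.choice(d, 100, replace=False)] = 1.0; u /= np.linalg.norm(u)
v = np.zeros(d); v[rng.choice(d, 100, replace=False)] = 1.0; v /= np.linalg.norm(v)

exact = float(u @ v)
approx = float(jl_sketch(u) @ jl_sketch(v))  # unbiased estimate of exact
```

The estimate's variance shrinks as the sketch size \(n\) grows, which is the trade-off explored empirically below.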
In the remainder of this section, we review these two algorithms in-depth and compare and contrast their performance. Importantly, we consider the approximation error in isolation: How does sketching affect MIPS if our MIPS algorithm itself were exact? In other words, if we searched exhaustively for the top-\(k\) maximizers of inner product with a query, what accuracy may we expect if that search were performed on sketches of vectors versus the original vectors?
4.3 Empirical Comparison
Our results from the preceding sections shed light on how JL and Weak Sinnamon transformations are expected to behave when applied to sparse vectors. Our main conclusion is that the sparsity rate heavily affects the approximation error. In this section, we design experiments that help us observe the expected behavior in practice and compare the two algorithms on real data.
Given a sparse dataset and a set of queries, we first obtain the exact top-\(1\) document for each query by performing an exhaustive search over the entire collection. We then create a second dataset wherein each vector is a sketch of a vector in the original dataset. We now perform exact search over the sketch dataset to obtain top-\(k^{\prime}\) (\(k^{\prime}\geq 1\)) documents, and report the accuracy of the approximate retrieval.
There are two parameters in the setup above that are of interest to us. First is the sketch size, \(n\). By fixing the dataset (thus its sparsity rate) but increasing the sketch size, we wish to empirically quantify the effect of using larger sketches on the ability of each algorithm to preserve inner product. Note that, because the vectors are non-negative, Weak Sinnamon only uses half the sketch capacity to form the upper-bound sketch—reducing its effective sketch size to \(n/2\).
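To illustrate why only the upper-bound half of the sketch matters for non-negative vectors, here is a deliberately simplified, single-mapping variant of an upper-bound sketch. The single random coordinate-to-cell mapping `h`, the function names, and all sizes are assumptions for illustration and do not reproduce Weak Sinnamon's exact construction; they only show the mechanism by which each cell retains the maximum of the colliding values, so the estimate never undershoots the true inner product.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 30_000, 256  # illustrative original dimension and sketch size

# One random coordinate-to-cell mapping (a simplification; the actual
# algorithm's hashing scheme may differ).
h = rng.integers(0, n, size=d)

def upper_bound_sketch(x):
    """Fold a non-negative sparse vector into n cells, keeping the
    maximum of the values that collide in each cell."""
    sketch = np.zeros(n)
    for i in np.flatnonzero(x):
        sketch[h[i]] = max(sketch[h[i]], x[i])
    return sketch

def estimate_ip(query, sketch):
    """Estimate <query, x> from x's sketch; for non-negative inputs
    this never underestimates the true inner product."""
    idx = np.flatnonzero(query)
    return float(query[idx] @ sketch[h[idx]])

x = np.zeros(d); x[rng.choice(d, 200, replace=False)] = rng.random(200)
q = np.zeros(d); q[rng.choice(d, 30, replace=False)] = rng.random(30)

exact = float(q @ x)
approx = estimate_ip(q, upper_bound_sketch(x))
assert approx >= exact  # one-sided error for non-negative vectors
```

Note how the error here comes entirely from collisions, which is why the \(\psi/n\) ratio discussed below governs the approximation quality.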
The second factor is \(k^{\prime}\) which controls how “hard” a retrieval algorithm must work to compensate for the approximation error. Changing \(k^{\prime}\) helps us understand if the error introduced by a particular sketch size can be attenuated by simply retrieving more candidates and later re-ranking them according to their exact score.
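The evaluation protocol above can be sketched end-to-end as follows. The random toy collection, the JL sketching choice, and the helper `accuracy_at` are stand-ins assumed for illustration; on real data the collection, queries, and sketching algorithm would be swapped in.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, N = 5_000, 128, 500  # dimension, sketch size, collection size (toy values)

# Toy sparse collection and query: ~1% of coordinates are non-zero.
X = np.where(rng.random((N, d)) < 0.01, rng.random((N, d)), 0.0)
q = np.where(rng.random(d) < 0.01, rng.random(d), 0.0)

Phi = rng.normal(scale=1.0 / np.sqrt(n), size=(n, d))  # JL sketching matrix
X_sketch, q_sketch = X @ Phi.T, Phi @ q

true_top1 = int(np.argmax(X @ q))  # exact top-1 via exhaustive search

def accuracy_at(k_prime):
    """1.0 if the exact top-1 survives among the top-k' sketch scores."""
    cand = np.argsort(-(X_sketch @ q_sketch))[:k_prime]
    return float(true_top1 in cand)

# Casting a wider net (k' > 1) can only help: the top-1 candidate set
# is contained in every larger candidate set.
assert accuracy_at(100) >= accuracy_at(1)
```

In the experiments that follow, the retrieved top-\(k^{\prime}\) candidates would additionally be re-ranked by their exact inner product before reporting accuracy.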
The results of our experiments are presented in Figure 1 for select datasets embedded with the Splade model. We chose these datasets because they have very different sizes and sparsity rates, as shown in Table 1, with Quora having the largest sparsity rate and the fewest documents, and NQ the smallest sparsity rate and a medium collection size.
Naturally, our observations are consistent with what the theoretical results predict: sketch quality improves as sketch size increases. This reflects the effect of the parameter \(n\) on the approximation variance of the JL transform and on the concentration of error in Weak Sinnamon sketches.
Another unsurprising finding is that Weak Sinnamon’s sensitivity to the \(\psi/n\) factor becomes evident on NQ: when the ratio between the number of non-zero coordinates and the sketch size (\(\psi/n\)) is large, the variance of the approximation error grows. The reason is twofold: more non-zero coordinates are likely to collide as vectors become more dense; and, additionally, sketches themselves become more dense, thereby increasing the likelihood of error for inactive coordinates. To contextualize Weak Sinnamon and the effects of our modifications to the original algorithm on the approximation error, we also plot the performance of Sinnamon in Figure 1.
While increasing the sketch size is one way to lower the probability of error, casting a wider net (i.e., \(k^{\prime}{\gt}k\)) followed by re-ranking appears to also improve retrieval quality.
Now that we have a better understanding of the effect of the parameters on the quality of the sketching algorithms, let us choose one configuration and repeat the experiments above on all our datasets. One noteworthy adjustment is that we set Weak Sinnamon’s effective sketch size to match that of the JL transform: as we noted, because Weak Sinnamon leaves the lower-bound sketch unused for non-negative vectors, we re-allocate it for the upper-bound sketch, in effect giving Weak Sinnamon’s upper-bound sketch \(n\) dimensions to work with. Another change is that we use a more challenging configuration and perform top-\(10\) retrieval. Finally, we also include Efficient Splade for completeness.
Figure 2 shows the results of these experiments. The general trends observed in these figures are consistent with the findings of Figure 1: obtaining a larger pool of candidates from sketches and re-ranking them according to their exact inner product is a reliable way of countering the approximation error; and Weak Sinnamon generally underperforms the JL transform in preserving inner product between vectors. Additionally, as vectors become more dense, the sketching quality degrades, leading to a higher approximation error.
Another interesting but expected phenomenon is that sketching performs comparatively poorly on Efficient Splade. That is because query vectors generated by the Efficient Splade model are more sparse than those produced by Splade. When a query has few non-zero coordinates, the expected inner product becomes small while the variance of JL transform sketches concentrates around a constant, as predicted by Theorem 4.3. As for Weak Sinnamon, when queries have a large number of non-zero coordinates, the shape of the error distribution becomes less sensitive to the approximation error of individual coordinates; with fewer non-zero coordinates in the query vector, the opposite happens.
As a final observation, we notice that retrieval accuracy is generally higher for the Quora, MS Marco, and NQ datasets. That is easy to explain for Quora, as it is a more sparse dataset with a much smaller \(\psi/n\). On the other hand, the observed trend is rather intriguing for a larger and more dense dataset such as MS Marco. On closer inspection, however, it appears that the stronger performance can be attributed to the probabilities of coordinates being non-zero (i.e., the \(p_{i}\)’s). In Figure 3, we plot the distribution of \(p_{i}\)’s but, to make the illustration cleaner, sort the coordinates by their \(p_{i}\) in descending order. Interestingly, the distribution of \(p_{i}\)’s is closer to uniform for MS Marco and NQ, while it is more heavily skewed for Fever, DBPedia, and HotpotQA.