
One can abstract the similarity between objects to a metric space. The similarity or dissimilarity between objects in a metric space is described by the distance between them.

2.2.1 Metric Search Indexing Problem

In this report, we will use the following definition of a metric search indexing problem [30]:

Definition 1 Metric Search Indexing Problem. Let $D$ be a domain, $d$ a distance measure on $D$, and $(D, d)$ a metric space. As the space is metric, the following holds: positiveness ($d(x, y) \ge 0$ and $d(x, x) = 0$), symmetry ($d(x, y) = d(y, x)$), and the triangle inequality ($d(x, y) + d(y, z) \ge d(x, z)$). Given a set $X \subseteq D$ of $n$ elements, preprocess the data so that similarity queries are answered efficiently.
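Why these axioms matter for indexing can be seen from a standard derivation (not part of the definition in [30]): for a query $q$, an object $x$, and any third object $p$, applying the triangle inequality in both directions and using symmetry gives

\[
|d(q, p) - d(p, x)| \le d(q, x).
\]

If $d(p, x)$ has been precomputed and $d(q, p)$ is known, this bounds the unknown distance $d(q, x)$ from below, so $x$ can be excluded from a range query with radius $r$ whenever $|d(q, p) - d(p, x)| > r$, without computing $d(q, x)$. This is the mechanism behind the pivot-based methods of Section 2.2.3.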

2.2.2 Types of Queries

The queries in a metric search indexing problem can be of different types. The most basic kind is a similarity range query, where all objects within a certain radius of the query object are of interest [15]. An example of a range query is found in Figure 2.1. A potential issue with this type of query is that it returns an unpredictable number of data objects.

Another query type is a k-NN-query, where the k nearest neighbors of a query object are retrieved. An example of a k-NN-query with k = 4 is found in Figure 2.2. If only the closest neighbor is of interest, the query is called an NN-query [7]. Applying k-NN-queries is more complicated than applying range queries, but a k-NN-query can be executed as a series of range queries. First, a range query is executed with the same query object as the k-NN-query but with an infinite radius, so that every object in the index overlaps the query. The first k objects to overlap the query are added to a set of candidates for the result. Next, a range query is applied with the radius set to the distance from the query object to the candidate furthest away. Every object that overlaps this range query is swapped with the candidate furthest from the query, which shrinks the radius further. This process is repeated until there are no new overlaps, at which point the candidate set contains the query object's k nearest neighbors, as sketched below.
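A minimal sketch of this procedure in Python, with a plain list standing in for the index (so each range query degenerates to a linear pass) and all names our own:

import heapq

def knn_by_range_queries(query, objects, k, dist):
    # Max-heap (via negated distances) of the current k candidates;
    # the current query radius is the distance to the furthest one.
    heap = []
    for i, obj in enumerate(objects):
        d = dist(query, obj)
        if len(heap) < k:
            # Initial infinite-radius range query: the first k
            # overlapping objects become the candidate set.
            heapq.heappush(heap, (-d, i, obj))
        elif d < -heap[0][0]:
            # obj overlaps the shrunken range query: swap it with
            # the candidate furthest from the query, which shrinks
            # the radius for the next comparison.
            heapq.heapreplace(heap, (-d, i, obj))
    return sorted(((-nd, obj) for nd, _, obj in heap), key=lambda t: t[0])

In a real index, each pass with a smaller radius would instead be a new, pruned traversal of the index structure.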

A way to simulate a range query that, like a k-NN-query, returns a predictable number of results, without the complexity of performing a k-NN-query, is to sort the data set by distance to the query object and apply a range query with the radius equal to the distance from the query object to the kth element of the sorted list [16]. However, this is not an efficient way to apply a query and is only applicable as a simulation in research, as the sketch below makes clear.
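Concretely, the simulation amounts to the following (a sketch; the function name is ours), which shows why it is impractical outside research: every distance must be computed up front.

def radius_for_k(query, objects, k, dist):
    # Distance from the query to its kth-closest object; a range query
    # with this radius returns (at least, in case of ties) k objects.
    return sorted(dist(query, obj) for obj in objects)[k - 1]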

If one does not depend on retrieving a precise number of objects, a simpler approach is to apply range queries with different radii and select the radius value that returns approximately the desired number of objects. This is what we did for the experiments in this report.


Figure 2.1: An example of a range query. Q is the query object, and the dashed circle marks the query radius. Data objects 3 and 8 will be the result of the range query, as they lie within the query radius.

2.2.3 Metric Indexing Methods

Metric indexing methods can be classified into two categories: pivot-based methods and clustering-based methods. Pivot-based indexing methods calculate and store the distance from each object in the data set to a subset of objects that are chosen as pivots. These pre-calculated distances are used to prune the data set when searching, instead of comparing each object to the query [8]; a sketch of this filtering is given below. AESA is an example of an efficient pivot-based method, but since every object can play the role of a pivot, its table of pre-calculated distances grows quadratically with the size of the data set.
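The pruning rule follows directly from the lower bound derived after Definition 1. The sketch below uses a fixed pivot set in the style of LAESA (the naming is ours, not from [8]); AESA corresponds to letting every object act as a pivot, which is what makes its distance table quadratic.

def pivot_range_query(query, r, objects, pivots, table, dist):
    # table[i][j] holds dist(objects[i], pivots[j]), precomputed at
    # build time.
    qp = [dist(query, p) for p in pivots]  # distances from query to pivots
    results = []
    for i, obj in enumerate(objects):
        # Triangle inequality: |d(q, p_j) - d(x, p_j)| <= d(q, x), so a
        # pivot certifying a lower bound greater than r prunes obj.
        if any(abs(qp[j] - table[i][j]) > r for j in range(len(pivots))):
            continue
        if dist(query, obj) <= r:  # verify the survivors directly
            results.append(obj)
    return results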


Figure 2.2: An example of a 4-NN-query. Q is the query object, and 3, 4, 6, and 8 are the four data objects closest to Q; they will be the result of the 4-NN-query.

Clustering-based indexing methods pre-process the data set into a set of clusters, or regions, and the information about the regions is stored in the index. When searching such an index, one can discard entire regions from consideration, as sketched below. Examples of well-known clustering-based indexing methods are the SSS-tree [5], the M-Tree [10], GNAT [4], and EGNAT [28].
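As an illustration, the sketch below shows how stored region information prunes a range search, assuming the simplest region shape: a ball with a routing object and a covering radius (a simplification of the structures cited above; all names are ours).

from dataclasses import dataclass, field

@dataclass
class Region:
    center: object                                 # routing object
    radius: float                                  # covering radius of the subtree
    objects: list = field(default_factory=list)    # objects stored in this region
    children: list = field(default_factory=list)   # sub-regions

def range_search(node, query, r, dist, results):
    # If the query ball and the region ball cannot overlap, the entire
    # region is discarded without examining any object inside it.
    if dist(query, node.center) > r + node.radius:
        return
    results.extend(obj for obj in node.objects if dist(query, obj) <= r)
    for child in node.children:
        range_search(child, query, r, dist, results)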

2.2.4 The Ambit Region Type

The indices used in this report are clustering-based, which means that the information used for searching is stored in the form of regions.

To generalize the description of regions, the Ambit Region type was introduced by Hetland [17]:

Definition 2 Ambit [17]. An ambit of degree $m$ in a universe $U$ with symmetric comparison function $\delta : U \times U \to K$ is a comparison-based region

$B[\mathbf{p}, r; f] := C[\mathbf{p}, \mathbf{x} : f(\mathbf{x}) \le r]$ (2.1)

as defined by

(i) a tuple $\mathbf{p}$ of sources or foci $p_1, \ldots, p_m \in U$;
(ii) a radius $r \in L$; and
(iii) a remoteness map $f : K^m \to L$, with $L$ partially ordered.

The features $x_i = \delta(p_i, u)$ of an object $u \in U$ are called its radients, and $f(\mathbf{x})$ is its remoteness.

The ambits are linear if the remoteness map f is linear.

The shape of an ambit is determined by its foci and its remoteness map, while the radius decides its size. In figs. 2.3 to 2.5, the remoteness map is the sum of the radients, but the number of foci varies from one to three, which creates a ball, an ellipse, and an egglipse, respectively. The foci are weighted equally in all three figures.

In Figure 2.6, the foci of the ambit are weighted differently, creating an ambit that resembles an egglipse more than an ellipse, even though it has two foci, not three.

The ambit in Figure 2.7 has the remoteness map $x_1 - x_2$ and forms a hyperboloid. Figure 2.8 shows an ambit with a more complex remoteness map, demonstrating that there are vast possibilities for the shape of an ambit if one allows non-linear ambits. When ambits form the regions of an index, one cannot ensure that the search will be correct with an arbitrary $f$ as the remoteness map. In this report, we limit $f$ to be linear, so that we can use the linear ambit overlap check introduced by Hetland [17] when traversing the indices.
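Concretely, membership in a linear ambit only requires the object's radients and the weights of the foci. A sketch (names ours):

def in_linear_ambit(u, foci, weights, r, dist):
    # Radients: the distances from u to each focus.
    radients = [dist(p, u) for p in foci]
    # Remoteness: the linear map f(x) = sum(a_i * x_i).
    remoteness = sum(a * x for a, x in zip(weights, radients))
    return remoteness <= r

With one focus and weight 1 this is a metric ball (Figure 2.3); weights (1, 1) give the ellipse of Figure 2.4; and weights (1, -1) give the hyperboloid of Figure 2.7.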

The linear ambit overlap check is described in Definition 3.

Figure 2.3: Ambit with one focus and remoteness map $x_1$.

Figure 2.4: Ambit with two foci and remoteness map $x_1 + x_2$.

Figure 2.5: Ambit with three foci and remoteness map $x_1 + x_2 + x_3$.

Definition 3 Linear Ambit Overlap Check [17]. Let $R := B[\mathbf{p}, r; \mathbf{a}]$ and $Q := B[\mathbf{q}, s; \mathbf{c}]$ be linear ambits for a quasimetric $\delta$, with either $\mathbf{a}$ or $\mathbf{c}$ non-negative and $\lVert \mathbf{a} \rVert_1 = \lVert \mathbf{c} \rVert_1 = 1$. If $R$ and $Q$ intersect, then

$r + s \ge \mathbf{a} Z \mathbf{c}^\top$, (2.2)

where $z_{ij}$ is the distance $\delta(p_i, q_j)$ between focus $p_i$ of $R$ and focus $q_j$ of $Q$.
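In code, the check amounts to comparing $r + s$ against the bilinear form $\mathbf{a} Z \mathbf{c}^\top$: by the contrapositive, a search can safely discard region $R$ for query ambit $Q$ whenever the inequality fails. A sketch (names ours):

def may_overlap(r, a, s, c, Z):
    # Z[i][j] = delta(p_i, q_j): distances between the foci of R and Q.
    aZc = sum(a[i] * Z[i][j] * c[j]
              for i in range(len(a))
              for j in range(len(c)))
    # If r + s < a Z c^T, the ambits cannot intersect, so R is pruned.
    return r + s >= aZc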

2.2.5 The Goal of Metric Indexing

Measuring the distance in a metric space can be an expensive operation [29].

This is especially true in spaces with complex distance functions, such as quadratic form distance and string edit distance, and in spaces with high dimensionality [2]. Therefore, metric indexing aims to minimize the number of distances that have to be calculated. The upper bound is a linear scan, where the distance from the query to each data object in the data set has to be calculated.

Figure 2.6: Ambit with two foci and remoteness map $x_1 + 4x_2$.

Figure 2.7: Ambit with two foci and remoteness map $x_1 - x_2$.

Although reducing the number of distance calculations is the primary goal of metric indexing, there are also other essential features of the indices. One needs to be able to store the indices efficiently in secondary memory, and one can therefore take into account the number of I/O operations needed to access them [5]. In this report, we do not take this into account, but it can be relevant, especially when working with vector spaces of low dimensionality, as the distance functions in such spaces are often simple [7].

Other features one can consider when making indices are whether they work with discrete or continuous distance functions and whether they allow one to dynamically insert new objects after initially building the index [5].

2.2.6 The Curse of Dimensionality

Metric indexing methods are designed to structure the data so that one can discard objects that are not of interest without having to compare each object to the query.

Figure 2.8: Ambit with two foci and remoteness map $f(\mathbf{x}) = x_1 x_2$ if $x_1 > x_2$, and $x_2^2$ otherwise.

However, for this to be possible, one needs to be able to differentiate the distances between the objects. For example, if all distances in a data set are one, the only information obtained by a comparison is whether the object is the query or not. In such a case, it would make little sense to apply a similarity search, as all objects are equally similar. It would be impossible to avoid a linear scan, meaning that a traditional database scheme would work just as well [7].

The more similar the distances between the objects in a data set are, the harder it is to create an efficient index. One can create a histogram of the distances between the objects in a metric space. If this histogram is tall and narrow, meaning that the mean is high and the variance is low, the distances in the data set are similar. If the mean is low and the variance is high, meaning that the histogram is flat and wide, the distances in the data set are spread over a larger range. The mean $\mu$ and variance $\sigma^2$ of the histogram determine the intrinsic dimensionality of the metric space, defined by Chávez et al. as $\rho = \mu^2 / (2\sigma^2)$ [7]. High intrinsic dimensionality is thus characterized by a high mean and low variance, and low intrinsic dimensionality by a low mean and high variance. Figures 2.9 and 2.10 show two metric spaces and their corresponding distance histograms.

From the histograms, we can see that the metric space in Figure 2.9 has a higher intrinsic dimensionality than the metric space in Figure 2.10. The higher the intrinsic dimensionality of a metric space is, the harder it is to make an efficient index. This phenomenon is referred to as the curse of dimensionality [7].
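With this definition, the intrinsic dimensionality of a data set can be estimated from a sample of pairwise distances, as in the following sketch (the function name is ours):

import random
from statistics import mean, variance

def intrinsic_dimensionality(objects, dist, samples=1000):
    # Estimate rho = mu^2 / (2 * sigma^2) from sampled pairwise distances.
    ds = [dist(*random.sample(objects, 2)) for _ in range(samples)]
    return mean(ds) ** 2 / (2 * variance(ds))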

Intrinsic dimensionality is not the same as representational dimensionality. Nevertheless, the higher the representational dimensionality of a vector space is, the more equidistant the data can appear. Finding efficient solutions for searching high-dimensional metric spaces is one of the most important open problems in metric indexing [6].

(b) Histogram of the distances between the objects in $M_1$.

Figure 2.9: Example of a metric space with a relatively tall and narrow distance histogram, indicating that the intrinsic dimensionality of the space is high.