Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices

Описание

Тип публикации: статья из журнала

Год издания: 2025

Идентификатор DOI: 10.3390/bdcc9100254

Аннотация: <jats:p>In this study, we propose a novel adaptive algorithm for approximate nearest neighbor (ANN) search, based on the inverted file (IVF) index (cluster-based index) and online query complexity classification. The concept of the classical IVF search implemented in vector databases is as follows: all data vectors are divided intoПоказать полностьюclusters, and each cluster is assigned to its central point (centroid). For an ANN search query, the closest centroids are determined, and the further search continues in the corresponding clusters only. In our study, the complexity of each query is assessed and classified with the use of results of an initial trial search in a limited number of clusters. Based on this classification, the algorithm dynamically determines the presumably sufficient number of clusters which is sufficient to achieve the desired Recall value, thereby improving vector search efficiency. Our experiments show that such a complexity classifier can be built with the use of a single feature, and we propose an algorithm for its training. We studied the impact of various features on the query processing and discovered a strong dependence on the number of clusters that contains at least one nearest neighbor (productive clusters). The new algorithm is designed to be implemented on top of the IVF search which is a well-known algorithm for approximate nearest neighbor search and uses existing IVF indexes that are widely used in the most popular vector database management systems, such as pgvector. The results obtained demonstrate a significant increase in the speed of nearest neighbor search (up to 35%) while maintaining a high Recall rate of 0.99. Additionally, the search algorithm is deterministic, which might be extremely important for tasks where the reproducibility of results plays a crucial role. The developed algorithm has been tested on datasets of varying sizes up to one billion data vectors.</jats:p>

Ссылки на полный текст

Издание

Журнал: Big Data and Cognitive Computing

Выпуск журнала: Т. 9, 10

Номера страниц: 254

ISSN журнала: 25042289

Издатель: MDPI AG

Персоны

  • Kazakovtsev Vladimir (Laboratory “Hybrid Methods of Modelling and Optimization in Complex Systems”, Siberian Federal University, ul. Kirenskogo 26A, Krasnoyarsk 660074, Russia)
  • Plekhanov Mikhail (Department of Information Technologies, Novosibirsk State University, ul. Pirogova 1, Novosibirsk 630090, Russia)
  • Naumchev Alexandr (Department of Information Technologies, Novosibirsk State University, ul. Pirogova 1, Novosibirsk 630090, Russia)
  • Shkaberina Guzel (Institute of Informatics and Telecommunications, Reshetnev Siberian State University of Science and Technology, Prosp. Krasnoyarskiy Rabochiy 31, Krasnoyarsk 660037, Russia)
  • Masich Igor (Laboratory “Hybrid Methods of Modelling and Optimization in Complex Systems”, Siberian Federal University, ul. Kirenskogo 26A, Krasnoyarsk 660074, Russia)
  • Egorova Lyudmila (Laboratory “Hybrid Methods of Modelling and Optimization in Complex Systems”, Siberian Federal University, ul. Kirenskogo 26A, Krasnoyarsk 660074, Russia)
  • Stupina Alena (Laboratory “Hybrid Methods of Modelling and Optimization in Complex Systems”, Siberian Federal University, ul. Kirenskogo 26A, Krasnoyarsk 660074, Russia)
  • Popov Aleksey (Institute of Informatics and Telecommunications, Reshetnev Siberian State University of Science and Technology, Prosp. Krasnoyarskiy Rabochiy 31, Krasnoyarsk 660037, Russia)
  • Kazakovtsev Lev (Laboratory “Hybrid Methods of Modelling and Optimization in Complex Systems”, Siberian Federal University, ul. Kirenskogo 26A, Krasnoyarsk 660074, Russia)

Вхождение в базы данных