**Usage for SSD-based indices**
===============================
To generate an SSD-friendly index, use the `apps/build_disk_index` program.
----------------------------------------------------------------------------
The arguments are as follows:
1. **--data_type**: The type of dataset you wish to build an index on. float(32 bit), signed int8 and unsigned uint8 are supported.
2. **--dist_fn**: Three distance functions are supported: cosine distance, minimum Euclidean distance (l2) and maximum inner product (mips).
3. **--data_file**: The input data over which to build an index, in .bin format. The first 4 bytes represent the number of points as an integer. The next 4 bytes represent the dimension of the data as an integer. The following `n*d*sizeof(T)` bytes contain the contents of the data, one data point at a time. `sizeof(T)` is 1 for byte indices and 4 for float indices. This will be read by the program as int8_t for signed indices, uint8_t for unsigned indices or float for float indices.
4. **--index_path_prefix**: the index will span a few files, all beginning with the specified prefix path. For example, if you provide `~/index_test` as the prefix path, build generates files such as `~/index_test_pq_pivots.bin, ~/index_test_pq_compressed.bin, ~/index_test_disk.index, ...`. There may be between 8 and 10 files generated with this prefix depending on how the index is constructed.
5. **-R (--max_degree)** (default is 64): the degree of the graph index, typically between 60 and 150. Larger R will result in larger indices and longer indexing times, but better search quality.
6. **-L (--Lbuild)** (default is 100): the size of the search list during index build. Typical values are between 75 and 200. Larger values take more time to build but result in indices that provide higher recall for the same search complexity. Use an L value that is at least R unless you need to build indices really quickly and can somewhat compromise on quality.
7. **-B (--search_DRAM_budget)**: bound on the memory footprint of the index at search time in GB. Once built, the index will use only the specified RAM limit; the rest will reside on disk. This dictates how aggressively we compress the data vectors to store in memory; a larger budget yields better performance at search time. For an n point index, to use a b byte PQ compressed representation in memory, use `B = ((n * b) / 2^30 + (250000*(4*R + sizeof(T)*ndim)) / 2^30)`. The second term in the summation allows some buffer for caching about 250,000 nodes from the graph in memory while serving. If you are not sure about this term, add 0.25GB to the first term. A worked example of this formula appears after this list.
8. **-M (--build_DRAM_budget)**: Limit on the memory allowed for building the index in GB. If you specify a value less than what is required to build the index in one pass, the index is built using a divide and conquer approach so that sub-graphs fit in the RAM budget. The sub-graphs are overlaid to build the overall index. This approach can be up to 1.5 times slower than building the index in one shot. Allocate as much memory as your RAM allows.
9. **-T (--num_threads)** (defaults to `omp_get_num_procs()`): number of threads used by the index build process. Since the code is highly parallel, indexing time improves almost linearly with the number of threads (subject to the cores available on the machine and DRAM bandwidth).
10. **--PQ_disk_bytes** (default is 0): Use 0 to store uncompressed data on the SSD. This allows the index to asymptote to 100% recall. If your vectors are too large to store on the SSD, this parameter provides the option to compress the vectors using PQ for storing on the SSD, which will trade off recall. You would also want this to be greater than the number of bytes used for the PQ compressed data stored in-memory.
11. **--build_PQ_bytes** (default is 0): Set to a positive value less than the dimensionality of the data to enable faster index build with PQ based distance comparisons.
12. **--use_opq**: use the flag to use OPQ rather than PQ compression. OPQ is more space efficient for some high dimensional datasets, but also needs a bit more build time.
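As a worked example of the `-B` formula in item 7, here is a small shell sketch; all the values (one million float points, 128 dimensions, R = 64, 32-byte PQ codes) are assumptions for illustration, not recommendations.
```bash
# Assumed values: n = 1M float points (sizeof(T)=4), ndim = 128, R = 64,
# b = 32 bytes of PQ code per vector held in RAM.
n=1000000; b=32; R=64; sizeofT=4; ndim=128
awk -v n="$n" -v b="$b" -v R="$R" -v t="$sizeofT" -v d="$ndim" 'BEGIN {
    pq  = n * b / 2^30                   # PQ-compressed vectors held in RAM
    buf = 250000 * (4*R + t*d) / 2^30    # buffer for caching ~250K graph nodes
    printf "search_DRAM_budget (GB): %.4f\n", pq + buf
}'
# For these values: 0.0298 + 0.1788, i.e. roughly 0.21 GB.
```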
To search the SSD-index, use the `apps/search_disk_index` program.
-------------------------------------------------------------------
The arguments are as follows:
1. **--data_type**: The type of dataset the index was built on. float(32 bit), signed int8 and unsigned uint8 are supported. Use the same data type as used in building the index (arg 1 above).
2. **--dist_fn**: There are two distance functions supported: minimum Euclidean distance (l2) and maximum inner product (mips). Use the same distance function as used in building the index (arg 2 above).
3. **--index_path_prefix**: same as the prefix used in building the index (see arg 4 above).
4. **--num_nodes_to_cache** (default is 0): While serving the index, the entire graph is stored on SSD. For faster search performance, you can cache a few frequently accessed nodes in memory.
5. **-T (--num_threads)** (defaults to `omp_get_num_procs()`): The number of threads used for searching. Threads run in parallel and one thread handles one query at a time. More threads will result in higher aggregate query throughput, but will also use more IOs/second across the system, which may lead to higher per-query latency. So find the balance depending on the maximum IOPS the SSD supports.
6. **-W (--beamwidth)** (default is 2): The beamwidth to be used for search. This is the maximum number of IO requests each query will issue per iteration of the search code. A larger beamwidth results in fewer IO round-trips per query, but might result in a slightly higher total number of IO requests to the SSD per query. For the highest query throughput with a fixed SSD IOPS rating, use `W=1`. For best latency, use `W=4,8` or a higher-complexity search. Specifying 0 will optimize the beamwidth depending on the number of threads performing search, but will involve some tuning overhead.
7. **--query_file**: The queries to be searched on, in the same binary file format as the data file in arg (3) above. The query file must be of the same data type as in argument (1).
8. **--gt_file**: The ground truth file for the queries in arg (7) and the data file used in index construction. The binary file must start with *n*, the number of queries (4 bytes), followed by *d*, the number of ground truth elements per query (4 bytes), followed by `n*d` entries per query representing the d closest IDs per query in integer format, followed by `n*d` entries representing the corresponding distances (float). Total file size is `8 + 4*n*d + 4*n*d` bytes; a quick sanity check of this layout is sketched after this list. The ground truth file, if not available, can be computed using the program `apps/utils/compute_groundtruth`. Use "null" if you do not have this file and do not want to compute recall.
9. **K**: search for *K* neighbors and measure *K*-recall@*K*, meaning the intersection between the retrieved top-*K* nearest neighbors and ground truth *K* nearest neighbors.
10. **result_output_prefix**: Search results will be stored in files with specified prefix, in bin format.
11. **-L (--search_list)**: A list of search_list sizes to perform search with. Larger values result in higher latency but better accuracy. Must be at least the value of *K* in arg (9).
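The ground truth layout described in arg (8) can be sanity-checked from the shell. Below is a minimal sketch; the file path is a placeholder.
```bash
# Read n and d (two little-endian 4-byte integers) from the head of a gt file.
gt=data/sift/sift_query_learn_gt100   # placeholder path
n=$(od -An -t d4 -N 4 "$gt" | tr -d ' ')
d=$(od -An -t d4 -j 4 -N 4 "$gt" | tr -d ' ')
echo "queries: $n, ground-truth entries per query: $d"
# Expected size: 8 header bytes + 4*n*d bytes of IDs + 4*n*d bytes of distances.
echo "expected bytes: $((8 + 8 * n * d)), actual bytes: $(wc -c < "$gt")"
```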
Example with BIGANN:
--------------------
This example demonstrates the use of the commands above on a 100K slice of the [BIGANN dataset](http://corpus-texmex.irisa.fr/) with 128 dimensional SIFT descriptors applied to images.
Download the base and query set and convert the data to binary format
```bash
mkdir -p DiskANN/build/data && cd DiskANN/build/data
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xf sift.tar.gz
cd ..
./apps/utils/fvecs_to_bin float data/sift/sift_learn.fvecs data/sift/sift_learn.fbin
./apps/utils/fvecs_to_bin float data/sift/sift_query.fvecs data/sift/sift_query.fbin
```
Now build and search the index, and measure the recall using ground truth computed by brute force.
```bash
./apps/utils/compute_groundtruth --data_type float --dist_fn l2 --base_file data/sift/sift_learn.fbin --query_file data/sift/sift_query.fbin --gt_file data/sift/sift_query_learn_gt100 --K 100
# Using 0.003GB search memory budget for 100K vectors implies 32 byte PQ compression
./apps/build_disk_index --data_type float --dist_fn l2 --data_path data/sift/sift_learn.fbin --index_path_prefix data/sift/disk_index_sift_learn_R32_L50_A1.2 -R 32 -L50 -B 0.003 -M 1
./apps/search_disk_index --data_type float --dist_fn l2 --index_path_prefix data/sift/disk_index_sift_learn_R32_L50_A1.2 --query_file data/sift/sift_query.fbin --gt_file data/sift/sift_query_learn_gt100 -K 10 -L 10 20 30 40 50 100 --result_path data/sift/res --num_nodes_to_cache 10000
```
The search might be slower on machines with remote SSDs. The output lists the query throughput, the mean and 99.9 percentile latency in microseconds, and the mean number of 4KB IOs to disk for each `L` parameter provided.
```
L Beamwidth QPS Mean Latency 99.9 Latency Mean IOs CPU (s) Recall@10
======================================================================================================================
10 2 27723.95 2271.92 4700.00 8.81 40.47 81.79
20 2 15369.23 4121.04 7576.00 15.93 61.60 96.42
30 2 10335.75 6147.14 11424.00 23.30 74.96 98.78
40 2 7684.18 8278.83 14714.00 30.78 94.27 99.40
50 2 6421.66 9913.28 16550.00 38.35 116.86 99.63
100 2 3337.98 19107.81 29292.00 76.59 226.88 99.91
```

<!-- Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license. -->
**Usage for dynamic indices**
================================
A "dynamic" index refers to an index which supports insertion of new points into a (possibly previously built) index as well as deletions of points.
While eager deletes can be supported by DiskANN, `lazy_deletes` are the preferred method.
A sequence of lazy deletions must be followed by an invocation of the `consolidate_deletes` method that frees up slots in the index and edits the graph to maintain good recall.
The program `apps/test_insert_deletes_consolidate` demonstrates this functionality. It allows the user to specify which points from the data file will be used
to initially build the index, which points will be deleted from the index, and which points will be inserted into the index.
Insertions, searches and lazy deletions can be performed concurrently.
Consolidation of lazy deletes can be performed synchronously or concurrently with insertions and deletions.
When modifying the index sequentially, the user has the ability to take *snapshots*, that is, to save the index to memory every *m* insertions or deletions instead of only at the end of the build.
The program `apps/test_streaming_scenario` simulates a scenario where the index actively maintains a sliding window of active points from a larger dataset.
The program starts with an index build over the first `active_window` set of points from a data file.
The program then simultaneously inserts newer points drawn from the file and deletes older points from the index
in chunks of `consolidate_interval` points so that the number of active points in the index is approximately `active_window`.
It terminates when the end of the data file is reached, and the final index has `active_window + consolidate_interval` points.
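As a quick illustration of the termination size just described, using the window sizes from the streaming example later in this document:
```bash
# Sliding-window sizes taken from the example later in this document.
active_window=20000; consolidate_interval=10000
# Number of points in the final index when the end of the data file is reached:
echo $((active_window + consolidate_interval))   # prints 30000
```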
The streaming index also supports filters: use the `insert_point` function overloads to insert points either with or without labels.
Additional options to support this are available in `apps/test_streaming_scenario`; please refer to the program arguments below for more details.
---
> Note
* The index does not support a mix of labeled and unlabeled points: either every point has labels or none does.
* An index built with filters can also be searched without filters.
> WARNING: Deleting points from a filtered build may degrade index quality and affect recall.
---
`apps/test_insert_deletes_consolidate` to try insertions, lazy deletes and `consolidate_deletes`
---------------------------------------------------------------------------------------------
The arguments are as follows:
1. **--data_type**: The type of dataset you wish to build an index on. float(32 bit), signed int8 and unsigned uint8 are supported.
2. **--dist_fn**: There are two distance functions supported: minimum Euclidean distance (l2) and maximum inner product (mips).
3. **--data_file**: The input data over which to build an index, in .bin format. The first 4 bytes represent the number of points as an integer. The next 4 bytes represent the dimension of the data as an integer. The following `n*d*sizeof(T)` bytes contain the contents of the data, one data point at a time. sizeof(T) is 1 for byte indices and 4 for float indices. This will be read by the program as int8_t for signed indices, uint8_t for unsigned indices or float for float indices.
4. **--index_path_prefix**: The constructed index components will be saved to this path prefix.
5. **-R (--max_degree)** (default is 64): the degree of the graph index, typically between 32 and 150. Larger R will result in larger indices and longer indexing times, but might yield better search quality.
6. **-L (--Lbuild)** (default is 100): the size of the search list we maintain during index building. Typical values are between 75 and 400. Larger values take more time to build but result in indices that provide higher recall for the same search complexity. Ensure that L is at least R unless you need to build indices really quickly and can somewhat compromise on quality.
7. **--alpha** (default is 1.2): A float value between 1.0 and 1.5 which determines the diameter of the graph, which will be approximately *log n* to the base alpha. Typical values are between 1 and 1.5; 1 yields the sparsest graph, 1.5 yields denser graphs. A rough diameter estimate is sketched after this list.
8. **-T (--num_threads)** (defaults to `omp_get_num_procs()`): number of threads used by the index build process. Since the code is highly parallel, indexing time improves almost linearly with the number of threads (subject to the cores available on the machine and DRAM bandwidth).
9. **--points_to_skip**: number of points to skip from the beginning of the data file.
10. **--max_points_to_insert**: the maximum size of the index.
11. **--beginning_index_size**: how many points to build the initial index with. The number of points inserted dynamically will be max_points_to_insert - beginning_index_size.
12. **--points_per_checkpoint**: when inserting and deleting sequentially, each update is handled in points_per_checkpoint batches. When updating concurrently, insertions are handled in points_per_checkpoint batches but deletions are always processed in a single batch.
13. **--checkpoints_per_snapshot**: when inserting and deleting sequentially, the graph is saved to memory every checkpoints_per_snapshot checkpoints. This is not currently supported for concurrent updates.
14. **--points_to_delete_from_beginning**: how many points to delete from the index, starting in order of insertion. If deletions are concurrent with insertions, points_to_delete_from_beginning cannot be larger than beginning_index_size.
15. **--start_point_norm**: Set the starting node to a random point on a sphere of this radius. A reasonable choice is to set this to the average norm of the data set. Use when starting an index with zero points.
16. **--do_concurrent** (default false): whether to perform consolidate_deletes and other updates concurrently or sequentially. If concurrent is specified, half the threads are used for insertions and half for processing deletes. Note that insertions are performed before deletions if this flag is set to false, so in this case it is possible to delete more than beginning_index_size points.
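To make the alpha guidance in item 7 concrete, here is a rough estimate of the graph diameter, log_alpha(n), for a hypothetical one-million-point index (the point count is an assumption for illustration):
```bash
# Diameter estimate log_alpha(n) for an assumed n = 1,000,000.
awk 'BEGIN {
    n = 1000000
    split("1.2 1.5", alphas, " ")
    for (i = 1; i <= 2; i++)
        printf "alpha=%s -> diameter ~ %.0f hops\n", alphas[i], log(n) / log(alphas[i])
}'
# alpha=1.2 -> ~76 hops; alpha=1.5 -> ~34 hops.
```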
`apps/test_streaming_scenario` to try insertions, lazy deletes and `consolidate_deletes`
---------------------------------------------------------------------------------------------
The arguments are as follows:
1. **--data_type**: The type of dataset you wish to build an index on. float(32 bit), signed int8 and unsigned uint8 are supported.
2. **--dist_fn**: There are two distance functions supported: minimum Euclidean distance (l2) and maximum inner product (mips).
3. **--data_file**: The input data over which to build an index, in .bin format. The first 4 bytes represent the number of points as an integer. The next 4 bytes represent the dimension of the data as an integer. The following `n*d*sizeof(T)` bytes contain the contents of the data, one data point at a time. sizeof(T) is 1 for byte indices and 4 for float indices. This will be read by the program as int8_t for signed indices, uint8_t for unsigned indices or float for float indices.
4. **--index_path_prefix**: The constructed index components will be saved to this path prefix.
5. **-R (--max_degree)** (default is 64): the degree of the graph index, typically between 32 and 150. Larger R will result in larger indices and longer indexing times, but might yield better search quality.
6. **-L (--Lbuild)** (default is 100): the size of the search list we maintain during index building. Typical values are between 75 and 400. Larger values take more time to build but result in indices that provide higher recall for the same search complexity. Ensure that L is at least R unless you need to build indices really quickly and can somewhat compromise on quality.
7. **--alpha** (default is 1.2): A float value between 1.0 and 1.5 which determines the diameter of the graph, which will be approximately *log n* to the base alpha. Typical values are between 1 and 1.5; 1 yields the sparsest graph, 1.5 yields denser graphs.
8. **--insert_threads**: number of threads used for inserting points into the index.
9. **--consolidate_threads**: number of threads used for consolidating deletes to the index.
10. **--max_points_to_insert**: Maximum number of points from the data file to insert into the index.
11. **--active_window**: Approximate number of points in the index at any time.
12. **--consolidate_interval**: Granularity at which insert and delete functions are called.
13. **--start_point_norm**: Set the starting node to a random point on a sphere of this radius. A reasonable choice is to set this to the average norm of the data stream.
**To build with filters, add the following optional parameters.**
14. **--label_file**: Filter data for each point, in `.txt` format. Line `i` of the file consists of a comma-separated list of labels corresponding to point `i` in the file passed via `--data_file`. An example file is sketched after this list.
15. **--FilteredLbuild**: If building a filtered index, we maintain a separate search list from the one provided by `--Lbuild/-L`.
16. **--num_start_points**: the number of frozen points; in this case, it should be greater than the number of unique labels.
17. **--universal_label**: Optionally, the label data may contain a special "universal" label. A point with the universal label can be matched against a query with any label. Note that if a point has the universal label, then the filter data must contain only the universal label on the corresponding line.
18. **--label_type**: Optionally, the type used to store labels: either `uint` or `short`. Defaults to `uint`.
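For reference, here is a hypothetical four-point label file in the format `--label_file` expects (see item 14 above). Treating `0` as the universal label only takes effect if it is also passed via `--universal_label`.
```bash
# Line i holds the comma-separated labels of point i in the data file.
cat > example_labels.txt <<'EOF'
1,3
2
0
4,1,2
EOF
```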
To search the generated index, use the `apps/search_memory_index` program:
---------------------------------------------------------------------------
The arguments are as follows:
1. **data_type**: The type of dataset the index was built on. float(32 bit), signed int8 and unsigned uint8 are supported. Use the same data type as used in building the index (arg 1 above).
2. **dist_fn**: There are two distance functions supported: l2 and mips. There is an additional *fast_l2* implementation that could provide faster results for small (about a million-sized) indices. Use the same distance function as used in building the index (arg 2 above).
3. **memory_index_path**: index built above in argument (4).
4. **T**: The number of threads used for searching. Threads run in parallel and one thread handles one query at a time. More threads will result in higher aggregate query throughput, but may lead to higher per-query latency, especially if the DRAM bandwidth is a bottleneck. So find the balance depending on throughput and latency required for your application.
5. **query_bin**: The queries to be searched on, in the same binary file format as the data file (arg 3 above). The query file must be of the same data type as in argument (1).
6. **truthset.bin**: The ground truth file for the queries in arg (5) and the data file used in index construction. The binary file must start with *n*, the number of queries (4 bytes), followed by *d*, the number of ground truth elements per query (4 bytes), followed by `n*d` entries per query representing the d closest IDs per query in integer format, followed by `n*d` entries representing the corresponding distances (float). Total file size is `8 + 4*n*d + 4*n*d` bytes. The ground truth file, if not available, can be computed using the program `apps/utils/compute_groundtruth`. Use "null" if you do not have this file and do not want to compute recall.
7. **K**: search for *K* neighbors and measure *K*-recall@*K*, meaning the intersection between the retrieved top-*K* nearest neighbors and ground truth *K* nearest neighbors.
8. **result_output_prefix**: search results will be stored in files, one per L value (see next arg), with specified prefix, in binary format.
9. **-L (--search_list)**: A list of search_list sizes to perform search with. Larger values result in higher latency but better accuracy. Must be at least the value of *K* in (7).
10. **--dynamic** (default false): whether the index being searched is dynamic or not.
11. **--tags** (default false): whether to search with tags. This should be used if point *i* in the ground truth file does not correspond to the point in the *i*th position in the loaded index.
**To search with filters, add the following.**
12. **--filter_label**: Filter for each query. For each query, a search is performed with this filter.
Example with BIGANN:
--------------------
This example demonstrates the use of the commands above on a 100K slice of the [BIGANN dataset](http://corpus-texmex.irisa.fr/) with 128 dimensional SIFT descriptors applied to images.
Download the base and query set and convert the data to binary format
```bash
mkdir -p DiskANN/build/data && cd DiskANN/build/data
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xf sift.tar.gz
cd ..
./apps/utils/fvecs_to_bin float data/sift/sift_learn.fvecs data/sift/sift_learn.fbin
./apps/utils/fvecs_to_bin float data/sift/sift_query.fvecs data/sift/sift_query.fbin
```
The example below tests the following scenario: using a file with 100000 points, the index is incrementally constructed point by point. After the first 50000 points are inserted, a concurrent job deletes the first 25000 points from the index and consolidates the index (editing the graph and cleaning up resources). At the same time, an additional 25000 points (i.e. points 50001 to 75000) are concurrently inserted into the index. Note that the index should be built **before** calculating the ground truth, since the memory index returns the slice of the sift100K dataset that was used to build the final graph (that is, points 25001-75000 in the original data file).
```bash
type='float'
data='data/sift/sift_learn.fbin'
query='data/sift/sift_query.fbin'
index_prefix='data/sift/index'
result='data/sift/res'
deletes=25000
inserts=75000
deletes_after=50000
pts_per_checkpoint=10000
begin=0
thr=64
index=${index_prefix}.after-concurrent-delete-del${deletes}-${inserts}
gt_file=data/sift/gt100_learn-conc-${deletes}-${inserts}
~/DiskANN/build/apps/test_insert_deletes_consolidate --data_type ${type} --dist_fn l2 --data_path ${data} --index_path_prefix ${index_prefix} -R 64 -L 300 --alpha 1.2 -T ${thr} --points_to_skip 0 --max_points_to_insert ${inserts} --beginning_index_size ${begin} --points_per_checkpoint ${pts_per_checkpoint} --checkpoints_per_snapshot 0 --points_to_delete_from_beginning ${deletes} --start_deletes_after ${deletes_after} --do_concurrent true;
~/DiskANN/build/apps/utils/compute_groundtruth --data_type ${type} --dist_fn l2 --base_file ${index}.data --query_file ${query} --K 100 --gt_file ${gt_file} --tags_file ${index}.tags
~/DiskANN/build/apps/search_memory_index --data_type ${type} --dist_fn l2 --index_path_prefix ${index} --result_path ${result} --query_file ${query} --gt_file ${gt_file} -K 10 -L 20 40 60 80 100 -T ${thr} --dynamic true --tags 1
```
The example below tests the following scenario: using a file with 100000 points, insert 10000 points at a time. After the first 40000
are inserted, start deleting the first 10000 points while inserting points 40000--50000. Then delete points 10000--20000 while inserting
points 50000--60000, and so on, until the index is left with points 60000--100000.
Generate labels for a filtered build like this; the following generates 50 unique labels, Zipf-distributed, for a 100K-point dataset.
```bash
~/DiskANN/build/apps/utils/generate_synthetic_labels --num_labels 50 --num_points 100000 --output_file data/zipf_labels_50_100K.txt --distribution_type zipf
```
```bash
type='float'
data='data/sift/sift_learn.fbin'
query='data/sift/sift_query.fbin'
index_prefix='data/sift/idx_learn_str'
result='data/sift/res'
ins_thr=16
cons_thr=16
inserts=100000
active=20000
cons_int=10000
index=${index_prefix}.after-streaming-act${active}-cons${cons_int}-max${inserts}
gt=data/sift/gt100_learn-act${active}-cons${cons_int}-max${inserts}
filter_label=1
## filter options
universal_label='0'
label_file='data/zipf_labels_50_100K.txt'
num_start_points=50
gt_filtered=data/sift/gt100_learn-act${active}-cons${cons_int}-max${inserts}_wlabel_${filter_label}
# Without Filters (build and search)
./apps/test_streaming_scenario --data_type ${type} --dist_fn l2 --data_path ${data} --index_path_prefix ${index_prefix} -R 64 -L 600 --alpha 1.2 --insert_threads ${ins_thr} --consolidate_threads ${cons_thr} --max_points_to_insert ${inserts} --active_window ${active} --consolidate_interval ${cons_int} --start_point_norm 508;
./apps/utils/compute_groundtruth --data_type ${type} --dist_fn l2 --base_file ${index}.data --query_file ${query} --K 100 --gt_file ${gt} --tags_file ${index}.tags
./apps/search_memory_index --data_type ${type} --dist_fn l2 --index_path_prefix ${index} --result_path ${result} --query_file ${query} --gt_file ${gt} -K 10 -L 20 40 60 80 100 -T 64 --dynamic true --tags 1
# With filters (build and search)
./apps/test_streaming_scenario --data_type ${type} --num_start_points ${num_start_points} --label_file ${label_file} --universal_label ${universal_label} --dist_fn l2 --data_path ${data} --index_path_prefix ${index_prefix} -R 64 -L 600 --alpha 1.2 --insert_threads ${ins_thr} --consolidate_threads ${cons_thr} --max_points_to_insert ${inserts} --active_window ${active} --consolidate_interval ${cons_int} --start_point_norm 508;
./apps/utils/compute_groundtruth_for_filters --data_type ${type} --dist_fn l2 --base_file ${index}.data --query_file ${query} --K 100 --gt_file ${gt_filtered} --label_file ${label_file} --universal_label ${universal_label} --filter_label ${filter_label}
./apps/search_memory_index --data_type ${type} --filter_label ${filter_label} --dist_fn l2 --index_path_prefix ${index} --result_path ${result} --query_file ${query} --gt_file ${gt_filtered} -K 10 -L 20 40 60 80 100 -T 64 --dynamic true --tags 1
```

**Usage for filtered indices**
================================
## Building a filtered Index
DiskANN provides two algorithms for building an index with filters support: filtered-vamana and stitched-vamana. Here, we describe the parameters for building both. `apps/build_memory_index.cpp` and `apps/build_stitched_index.cpp` are respectively used to build each kind of index.
### 1. filtered-vamana
1. **`--data_type`**: The type of dataset you wish to build an index on. float(32 bit), signed int8 and unsigned uint8 are supported.
2. **`--dist_fn`**: There are two distance functions supported: minimum Euclidean distance (l2) and maximum inner product (mips).
3. **`--data_file`**: The input data over which to build an index, in .bin format. The first 4 bytes represent the number of points as an integer. The next 4 bytes represent the dimension of the data as an integer. The following `n*d*sizeof(T)` bytes contain the contents of the data, one data point at a time. sizeof(T) is 1 for byte indices and 4 for float indices. This will be read by the program as int8_t for signed indices, uint8_t for unsigned indices or float for float indices.
4. **`--index_path_prefix`**: The constructed index components will be saved to this path prefix.
5. **`-R (--max_degree)`** (default is 64): the degree of the graph index, typically between 32 and 150. Larger R will result in larger indices and longer indexing times, but might yield better search quality.
6. **`-L (--Lbuild)`** (default is 100): the size of the search list we maintain during index building. Typical values are between 75 and 400. Larger values take more time to build but result in indices that provide higher recall for the same search complexity. Ensure that L is at least R unless you need to build indices really quickly and can somewhat compromise on quality. Note that this applies only when building an unfiltered index; the corresponding search list parameter for a filtered index is managed by `--FilteredLbuild`.
7. **`--alpha`** (default is 1.2): A float value between 1.0 and 1.5 which determines the diameter of the graph, which will be approximately *log n* to the base alpha. Typical values are between 1 and 1.5; 1 yields the sparsest graph, 1.5 yields denser graphs.
8. **`-T (--num_threads)`** (defaults to `omp_get_num_procs()`): number of threads used by the index build process. Since the code is highly parallel, indexing time improves almost linearly with the number of threads (subject to the cores available on the machine and DRAM bandwidth).
9. **`--build_PQ_bytes`** (default is 0): Set to a positive value less than the dimensionality of the data to enable faster index build with PQ based distance comparisons. Defaults to using full precision vectors for distance comparisons.
10. **`--use_opq`**: use the flag to use OPQ rather than PQ compression. OPQ is more space efficient for some high dimensional datasets, but also needs a bit more build time.
11. **`--label_file`**: Filter data for each point, in `.txt` format. Line `i` of the file consists of a comma-separated list of filters corresponding to point `i` in the file passed via `--data_file`.
12. **`--universal_label`**: Optionally, the filter data may contain a "wild-card" filter corresponding to all filters. This is referred to as a universal label. Note that if a point has the universal label, then the filter data must only have the universal label on the line corresponding to said point.
13. **`--FilteredLbuild`**: If building a filtered index, we maintain a separate search list from the one provided by `--Lbuild`.
### 2. stitched-vamana
1. **`--data_type`**: The type of dataset you wish to build an index on. float(32 bit), signed int8 and unsigned uint8 are supported.
2. **`--data_path`**: The input data over which to build an index, in .bin format. The first 4 bytes represent the number of points as an integer. The next 4 bytes represent the dimension of the data as an integer. The following `n*d*sizeof(T)` bytes contain the contents of the data, one data point at a time. sizeof(T) is 1 for byte indices and 4 for float indices. This will be read by the program as int8_t for signed indices, uint8_t for unsigned indices or float for float indices.
3. **`--index_path_prefix`**: The constructed index components will be saved to this path prefix.
4. **`-R (--max_degree)`** (default is 64): Recall that stitched-vamana first builds a sub-index for each filter. This parameter sets the max degree for each sub-index.
5. **`-L (--Lbuild)`** (default is 100): the size of the search list we maintain during sub-index building. Typical values are between 75 and 400. Larger values take more time to build but result in indices that provide higher recall for the same search complexity. Ensure that L is at least R unless you need to build indices really quickly and can somewhat compromise on quality.
6. **`--alpha`** (default is 1.2): A float value between 1.0 and 1.5 which determines the diameter of the graph, which will be approximately *log n* to the base alpha. Typical values are between 1 and 1.5; 1 yields the sparsest graph, 1.5 yields denser graphs.
7. **`-T (--num_threads)`** (defaults to `omp_get_num_procs()`): number of threads used by the index build process. Since the code is highly parallel, indexing time improves almost linearly with the number of threads (subject to the cores available on the machine and DRAM bandwidth).
8. **`--label_file`**: Filter data for each point, in `.txt` format. Line `i` of the file consists of a comma-separated list of filters corresponding to point `i` in the file passed via `--data_file`.
9. **`--universal_label`**: Optionally, the filter data may contain a "wild-card" filter corresponding to all filters. This is referred to as a universal label. Note that if a point has the universal label, then the filter data must only have the universal label on the line corresponding to said point.
10. **`--Stitched_R`**: Once all sub-indices are "stitched" together, we prune the resulting graph down to the degree given by this parameter.
## Computing a groundtruth file for a filtered index
In order to evaluate the performance of our algorithms, we can compare their results (i.e. the top `k` neighbors found for each query) against the results found by an exact nearest neighbor search. We provide the program `apps/utils/compute_groundtruth.cpp` to produce the latter:
1. **`--data_type`** The type of dataset you built an index with. float(32 bit), signed int8 and unsigned uint8 are supported.
2. **`--dist_fn`**: There are two distance functions supported: l2 and mips.
3. **`--base_file`**: The input data over which to build an index, in .bin format. Corresponds to the `--data_path` argument from above.
4. **`--query_file`**: The queries to be searched on, which are stored in the same .bin format.
5. **`--label_file`**: Filter data for each point, in `.txt` format. Line `i` of the file consists of a comma-separated list of filters corresponding to point `i` in the file passed via `--data_file`.
6. **`--filter_label`**: Filter for each query. For each query, a search is performed with this filter.
7. **`--universal_label`**: Corresponds to the universal label passed when building an index with filter support.
8. **`--gt_file`**: File to output results to. The binary file starts with `n`, the number of queries (4 bytes), followed by `d`, the number of ground truth elements per query (4 bytes), followed by `n*d` entries per query representing the `d` closest IDs per query in integer format, followed by `n*d` entries representing the corresponding distances (float). Total file size is `8 + 4*n*d + 4*n*d` bytes.
9. **`-K`**: The number of nearest neighbors to compute for each query.
## Searching a Filtered Index
Searching a filtered index uses `apps/search_memory_index.cpp`:
1. **`--data_type`**: The type of dataset the index was built on. float(32 bit), signed int8 and unsigned uint8 are supported. Use the same data type as used in building the index (arg 1 above).
2. **`--dist_fn`**: There are two distance functions supported: l2 and mips. There is an additional *fast_l2* implementation that could provide faster results for small (about a million-sized) indices. Use the same distance function as used in building the index (arg 2 above). Note that stitched-vamana only supports l2.
3. **`--index_path_prefix`**: index built above in argument (4).
4. **`--result_path`**: search results will be stored in files, one per L value (see last arg), with specified prefix, in binary format.
5. **`-T (--num_threads)`**: The number of threads used for searching. Threads run in parallel and one thread handles one query at a time. More threads will result in higher aggregate query throughput, but may lead to higher per-query latency, especially if the DRAM bandwidth is a bottleneck. So find the balance depending on throughput and latency required for your application.
6. **`--query_file`**: The queries to be searched on in same binary file format as the data file (ii) above. The query file must be the same type as in argument (1).
7. **`--filter_label`**: The filter to be used when searching an index with filters. For each query, a search is performed with this filter.
8. **`--gt_file`**: The ground truth file for the queries and data file used in index construction. Use "null" if you do not have this file and if you do not want to compute recall. Note that if building a filtered index, a special groundtruth must be computed, as described above.
9. **`-K`**: search for *K* neighbors and measure *K*-recall@*K*, meaning the intersection between the retrieved top-*K* nearest neighbors and ground truth *K* nearest neighbors.
10. **`-L (--search_list)`**: A list of search_list sizes to perform search with. Larger values result in higher latency but better accuracy. Must be at least the value of *K* in arg (9).
Example with SIFT10K:
--------------------
We demonstrate how to work through this pipeline using the SIFT10K dataset (http://corpus-texmex.irisa.fr/). Before starting, make sure you have compiled DiskANN according to the instructions in the README and can see the following binaries (paths are relative to the repository root):
- `build/apps/utils/compute_groundtruth`
- `build/apps/utils/fvecs_to_bin`
- `build/apps/build_memory_index`
- `build/apps/build_stitched_index`
- `build/apps/search_memory_index`
Now, download the base and query set and convert the data to binary format:
```bash
wget ftp://ftp.irisa.fr/local/texmex/corpus/siftsmall.tar.gz
tar -zxvf siftsmall.tar.gz
build/apps/utils/fvecs_to_bin float siftsmall/siftsmall_base.fvecs siftsmall/siftsmall_base.bin
build/apps/utils/fvecs_to_bin float siftsmall/siftsmall_query.fvecs siftsmall/siftsmall_query.bin
```
We now need to make a label file for our vectors. For convenience, we've included a synthetic label generator with which we can generate a label file as follows:
```bash
build/apps/utils/generate_synthetic_labels --num_labels 50 --num_points 10000 --output_file ./rand_labels_50_10K.txt --distribution_type zipf
```
Note: `distribution_type` can be `rand` or `zipf`.
This will generate a label file for 10000 data points with 50 distinct labels, ranging from 1 to 50, assigned using a Zipf distribution (0 is the universal label).
The count of each unique label in the generated label file can be printed with the following command:
```bash
build/apps/utils/stats_label_data.exe --labels_file ./rand_labels_50_10K.txt --universal_label 0
```
Note that neither approach is designed for use with random synthetic labels, which will lead to unpredictable accuracy at search time.
Now build and search the index and measure the recall using ground truth computed by brute force. We search for results with the filter 35.
```bash
build/apps/utils/compute_groundtruth --data_type float --dist_fn l2 --base_file siftsmall/siftsmall_base.bin --query_file siftsmall/siftsmall_query.bin --gt_file siftsmall/siftsmall_gt_35.bin --K 100 --label_file ./rand_labels_50_10K.txt --filter_label 35 --universal_label 0
build/apps/build_memory_index --data_type float --dist_fn l2 --data_path siftsmall/siftsmall_base.bin --index_path_prefix siftsmall/siftsmall_R32_L50_filtered_index -R 32 --FilteredLbuild 50 --alpha 1.2 --label_file ./rand_labels_50_10K.txt --universal_label 0
build/apps/build_stitched_index --data_type float --data_path siftsmall/siftsmall_base.bin --index_path_prefix siftsmall/siftsmall_R20_L40_SR32_stitched_index -R 20 -L 40 --stitched_R 32 --alpha 1.2 --label_file ./rand_labels_50_10K.txt --universal_label 0
build/apps/search_memory_index --data_type float --dist_fn l2 --index_path_prefix siftsmall/siftsmall_R32_L50_filtered_index --query_file siftsmall/siftsmall_query.bin --gt_file siftsmall/siftsmall_gt_35.bin --filter_label 35 -K 10 -L 10 20 30 40 50 100 --result_path siftsmall/filtered_search_results
build/apps/search_memory_index --data_type float --dist_fn l2 --index_path_prefix siftsmall/siftsmall_R20_L40_SR32_stitched_index --query_file siftsmall/siftsmall_query.bin --gt_file siftsmall/siftsmall_gt_35.bin --filter_label 35 -K 10 -L 10 20 30 40 50 100 --result_path siftsmall/stitched_search_results
```
The output of both searches is listed below: the throughput (queries/sec) as well as the mean and 99.9 percentile latency in microseconds for each `L` parameter provided (measured on a physical machine with an Intel(R) Xeon(R) W-2145 CPU and 64 GB RAM).
```
Stitched Index
Ls QPS Avg dist cmps Mean Latency (mus) 99.9 Latency Recall@10
=================================================================================
10 31324.39 37.33 116.79 311.90 17.80
20 91357.57 44.36 193.06 1042.30 17.90
30 69314.48 49.89 258.09 1398.00 18.20
40 61421.29 60.52 289.08 1515.00 18.60
50 54203.48 70.27 294.26 685.10 19.40
100 52904.45 79.00 336.26 1018.80 19.50
Filtered Index
Ls QPS Avg dist cmps Mean Latency (mus) 99.9 Latency Recall@10
=================================================================================
10 69671.84 21.48 45.25 146.20 11.60
20 168577.20 38.94 100.54 547.90 18.20
30 127129.41 52.95 126.83 768.40 19.70
40 106349.04 62.38 167.23 899.10 20.90
50 89952.33 70.95 189.12 1070.80 22.10
100 56899.00 112.26 304.67 636.60 23.80
```

**Usage for filtered indices**
================================
To generate an SSD-friendly index, use the `apps/build_disk_index` program.
----------------------------------------------------------------------------
## Building an SSD-based filtered Index
### filtered-vamana SSD Index
1. **--data_type**: The type of dataset you wish to build an index on. float(32 bit), signed int8 and unsigned uint8 are supported.
2. **--dist_fn**: There are two distance functions supported: minimum Euclidean distance (l2) and maximum inner product (mips).
3. **--data_file**: The input data over which to build an index, in .bin format. The first 4 bytes represent the number of points as an integer. The next 4 bytes represent the dimension of the data as an integer. The following `n*d*sizeof(T)` bytes contain the contents of the data, one data point at a time. `sizeof(T)` is 1 for byte indices and 4 for float indices. This will be read by the program as int8_t for signed indices, uint8_t for unsigned indices or float for float indices.
4. **--index_path_prefix**: the index will span a few files, all beginning with the specified prefix path. For example, if you provide `~/index_test` as the prefix path, build generates files such as `~/index_test_pq_pivots.bin, ~/index_test_pq_compressed.bin, ~/index_test_disk.index, ...`. There may be between 8 and 10 files generated with this prefix depending on how the index is constructed.
5. **-R (--max_degree)** (default is 64): the degree of the graph index, typically between 60 and 150. Larger R will result in larger indices and longer indexing times, but better search quality.
6. **-L (--Lbuild)** (default is 100): the size of the search list during index build. Typical values are between 75 and 200. Larger values take more time to build but result in indices that provide higher recall for the same search complexity. Use an L value that is at least R unless you need to build indices really quickly and can somewhat compromise on quality. Note that this applies only when building an unfiltered index; the corresponding search list parameter for a filtered index is managed by `--FilteredLbuild`.
7. **-B (--search_DRAM_budget)**: bound on the memory footprint of the index at search time in GB. Once built, the index will use only the specified RAM limit; the rest will reside on disk. This dictates how aggressively we compress the data vectors to store in memory; a larger budget yields better performance at search time. For an n point index, to use a b byte PQ compressed representation in memory, use `B = ((n * b) / 2^30 + (250000*(4*R + sizeof(T)*ndim)) / 2^30)`. The second term in the summation allows some buffer for caching about 250,000 nodes from the graph in memory while serving. If you are not sure about this term, add 0.25GB to the first term.
8. **-M (--build_DRAM_budget)**: Limit on the memory allowed for building the index in GB. If you specify a value less than what is required to build the index in one pass, the index is built using a divide and conquer approach so that sub-graphs fit in the RAM budget. The sub-graphs are overlaid to build the overall index. This approach can be up to 1.5 times slower than building the index in one shot. Allocate as much memory as your RAM allows.
9. **-T (--num_threads)** (defaults to `omp_get_num_procs()`): number of threads used by the index build process. Since the code is highly parallel, indexing time improves almost linearly with the number of threads (subject to the cores available on the machine and DRAM bandwidth).
10. **--PQ_disk_bytes** (default is 0): Use 0 to store uncompressed data on the SSD. This allows the index to asymptote to 100% recall. If your vectors are too large to store on the SSD, this parameter provides the option to compress the vectors using PQ for storing on the SSD, which will trade off recall. You would also want this to be greater than the number of bytes used for the PQ compressed data stored in-memory.
11. **--build_PQ_bytes** (default is 0): Set to a positive value less than the dimensionality of the data to enable faster index build with PQ based distance comparisons.
12. **--use_opq**: use the flag to use OPQ rather than PQ compression. OPQ is more space efficient for some high dimensional datasets, but also needs a bit more build time.
13. **--label_file**: Filter data for each point, in `.txt` format. Line `i` of the file consists of a comma-separated list of filters corresponding to point `i` in the file passed via `--data_file`.
14. **--universal_label**: Optionally, the label data may contain a special "universal" label. A point with the universal label can be matched against a query with any label. Note that if a point has the universal label, then the filter data must contain only the universal label on the corresponding line.
15. **--FilteredLbuild**: If building a filtered index, we maintain a separate search list from the one provided by `--Lbuild`.
16. **--filter_threshold**: Threshold for internally breaking up dense points (points with many labels) when generating the graph, so that each resulting node carries at most F labels. The default value is zero, in which case dense points are not broken up.
## Computing a groundtruth file for a filtered index
In order to evaluate the performance of our algorithms, we can compare their results (i.e. the top `k` neighbors found for each query) against the results found by an exact nearest neighbor search. We provide the program `apps/utils/compute_groundtruth.cpp` to produce the latter:
1. **`--data_type`** The type of dataset you built an index with. float(32 bit), signed int8 and unsigned uint8 are supported.
2. **`--dist_fn`**: There are two distance functions supported: l2 and mips.
3. **`--base_file`**: The input data over which to build an index, in .bin format. Corresponds to the `--data_path` argument from above.
4. **`--query_file`**: The queries to be searched on, which are stored in the same .bin format.
5. **`--label_file`**: Filter data for each point, in `.txt` format. Line `i` of the file consists of a comma-separated list of filters corresponding to point `i` in the file passed via `--data_file`.
6. **`--filter_label`**: Filter for each query. For each query, a search is performed with this filter.
7. **`--universal_label`**: Corresponds to the universal label passed when building an index with filter support.
8. **`--gt_file`**: File to output results to. The binary file starts with `n`, the number of queries (4 bytes), followed by `d`, the number of ground truth elements per query (4 bytes), followed by `n*d` entries per query representing the `d` closest IDs per query in integer format, followed by `n*d` entries representing the corresponding distances (float). Total file size is `8 + 4*n*d + 4*n*d` bytes.
9. **`-K`**: The number of nearest neighbors to compute for each query.
## Searching a Filtered Index
Searching a filtered index uses `apps/search_disk_index.cpp`:
1. **--data_type**: The type of dataset the index was built on. float(32 bit), signed int8 and unsigned uint8 are supported. Use the same data type as used in building the index (arg 1 above).
2. **--dist_fn**: There are two distance functions supported: minimum Euclidean distance (l2) and maximum inner product (mips). Use the same distance function as used in building the index (arg 2 above).
3. **--index_path_prefix**: same as the prefix used in building the index (see arg 4 above).
4. **--num_nodes_to_cache** (default is 0): While serving the index, the entire graph is stored on SSD. For faster search performance, you can cache a few frequently accessed nodes in memory.
5. **-T (--num_threads)** (defaults to `omp_get_num_procs()`): The number of threads used for searching. Threads run in parallel and one thread handles one query at a time. More threads will result in higher aggregate query throughput, but will also use more IOs/second across the system, which may lead to higher per-query latency. So find the balance depending on the maximum IOPS the SSD supports.
6. **-W (--beamwidth)** (default is 2): The beamwidth to be used for search. This is the maximum number of IO requests each query will issue per iteration of the search code. A larger beamwidth results in fewer IO round-trips per query, but might result in a slightly higher total number of IO requests to the SSD per query. For the highest query throughput with a fixed SSD IOPS rating, use `W=1`. For best latency, use `W=4,8` or a higher-complexity search. Specifying 0 will optimize the beamwidth depending on the number of threads performing search, but will involve some tuning overhead.
7. **--query_file**: The queries to be searched on, in the same binary file format as the data file in arg (3) above. The query file must be of the same data type as in argument (1).
8. **--gt_file**: The ground truth file for the queries in arg (7) and data file used in index construction. The binary file must start with *n*, the number of queries (4 bytes), followed by *d*, the number of ground truth elements per query (4 bytes), followed by `n*d` entries per query representing the d closest IDs per query in integer format, followed by `n*d` entries representing the corresponding distances (float). Total file size is `8 + 4*n*d + 4*n*d` bytes. The groundtruth file, if not available, can be calculated using the program `apps/utils/compute_groundtruth`. Use "null" if you do not have this file and if you do not want to compute recall.
9. **-K**: search for *K* neighbors and measure *K*-recall@*K*, meaning the intersection between the retrieved top-*K* nearest neighbors and ground truth *K* nearest neighbors.
10. **--result_path**: Search results will be stored in files with specified prefix, in bin format.
11. **-L (--search_list)**: A list of search_list sizes to perform search with. Larger values result in higher latency but better accuracy. Must be at least the value of *K* in arg (9).
12. **--filter_label**: The filter to be used when searching an index with filters. For each query, a search is performed with this filter.
Example with SIFT10K:
--------------------
We demonstrate how to work through this pipeline using the SIFT10K dataset (http://corpus-texmex.irisa.fr/). Before starting, make sure you have compiled DiskANN according to the instructions in the README and can see the following binaries (paths are relative to the repository root):
- `build/apps/utils/compute_groundtruth`
- `build/apps/utils/fvecs_to_bin`
- `build/apps/build_disk_index`
- `build/apps/search_disk_index`
Now, download the base and query set and convert the data to binary format:
```bash
wget ftp://ftp.irisa.fr/local/texmex/corpus/siftsmall.tar.gz
tar -zxvf siftsmall.tar.gz
build/apps/utils/fvecs_to_bin float siftsmall/siftsmall_base.fvecs siftsmall/siftsmall_base.bin
build/apps/utils/fvecs_to_bin float siftsmall/siftsmall_query.fvecs siftsmall/siftsmall_query.bin
```
We now need to make a label file for our vectors. For convenience, we've included a synthetic label generator with which we can generate a label file as follows:
```bash
build/apps/utils/generate_synthetic_labels --num_labels 50 --num_points 10000 --output_file ./rand_labels_50_10K.txt --distribution_type zipf
```
Note: `distribution_type` can be `rand` or `zipf`.
This will generate a label file for 10000 data points with 50 distinct labels, ranging from 1 to 50, assigned using a Zipf distribution (0 is the universal label).
Now build and search the index and measure the recall using ground truth computed by brute force. We search for results with the filter 35.
```bash
build/apps/utils/compute_groundtruth --data_type float --dist_fn l2 --base_file siftsmall/siftsmall_base.bin --query_file siftsmall/siftsmall_query.bin --gt_file siftsmall_gt_35.bin --K 100 --label_file rand_labels_50_10K.txt --filter_label 35 --universal_label 0
build/apps/build_disk_index --data_type float --dist_fn l2 --data_path siftsmall/siftsmall_base.bin --index_path_prefix data/sift/siftsmall_R32_L50_filtered -R 32 --FilteredLbuild 50 -B 1 -M 1 --label_file rand_labels_50_10K.txt --universal_label 0 -F 0
build/apps/search_disk_index --data_type float --dist_fn l2 --index_path_prefix data/sift/siftsmall_R32_L50_filtered --result_path siftsmall/search_35 --query_file siftsmall/siftsmall_query.bin --gt_file siftsmall_gt_35.bin -K 10 -L 10 20 30 40 50 100 --filter_label 35 -W 4 -T 8
```
The output of the search is listed below: the throughput (queries/sec) as well as the mean and 99.9 percentile latency in microseconds for each `L` parameter provided (measured on a physical machine with an 11th Gen Intel(R) Core(TM) i7-1185G7 CPU and 32 GB RAM).
```
Filtered Disk Index
L Beamwidth QPS Mean Latency 99.9 Latency Mean IOs CPU (s) Recall@10
==================================================================================================================
10 4 1922.02 4062.19 12849.00 15.49 66.19 11.80
20 4 4609.91 1618.68 3438.00 30.66 140.48 17.20
30 4 3377.83 2250.22 4631.00 42.70 202.39 20.70
40 4 2707.77 2817.21 4889.00 51.46 267.03 22.00
50 4 2191.56 3509.43 5943.00 60.80 349.10 23.50
100 4 1257.92 6113.45 7321.00 109.08 609.42 23.90
```

**Usage for in-memory indices**
================================
To generate an index, use the `apps/build_memory_index` program.
--------------------------------------------------------------
The arguments are as follows:
1. **--data_type**: The type of dataset you wish to build an index on. float(32 bit), signed int8 and unsigned uint8 are supported.
2. **--dist_fn**: There are two distance functions supported: minimum Euclidean distance (l2) and maximum inner product (mips).
3. **--data_file**: The input data over which to build an index, in .bin format. The first 4 bytes represent the number of points as an integer. The next 4 bytes represent the dimension of the data as an integer. The following `n*d*sizeof(T)` bytes contain the contents of the data, one data point at a time. sizeof(T) is 1 for byte indices and 4 for float indices. This will be read by the program as int8_t for signed indices, uint8_t for unsigned indices or float for float indices.
4. **--index_path_prefix**: The constructed index components will be saved to this path prefix.
5. **-R (--max_degree)** (default is 64): the degree of the graph index, typically between 32 and 150. Larger R will result in larger indices and longer indexing times, but might yield better search quality.
6. **-L (--Lbuild)** (default is 100): the size of the search list we maintain during index building. Typical values are between 75 and 400. Larger values take more time to build but result in indices that provide higher recall for the same search complexity. Ensure that L is at least R unless you need to build indices really quickly and can somewhat compromise on quality.
7. **--alpha** (default is 1.2): A float value between 1.0 and 1.5 which determines the diameter of the graph, which will be approximately *log n* to the base alpha. Typical values are between 1 to 1.5. 1 will yield the sparsest graph, 1.5 will yield denser graphs.
8. **T (--num_threads)** (default is to get_omp_num_procs()): number of threads used by the index build process. Since the code is highly parallel, the indexing time improves almost linearly with the number of threads (subject to the cores available on the machine and DRAM bandwidth).
9. **--build_PQ_bytes** (default is 0): Set to a positive value less than the dimensionality of the data to enable faster index build with PQ based distance comparisons. Defaults to using full precision vectors for distance comparisons.
10. **--use_opq**: use this flag to apply OPQ rather than PQ compression. OPQ is more space efficient for some high-dimensional datasets, but also needs a bit more build time.
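As a reference for the `.bin` layout described in arg (3), here is a minimal Python writer; it is a sketch of the stated format, and the path and array are placeholders:
```python
import numpy as np

def write_diskann_bin(path: str, data: np.ndarray) -> None:
    # data: an (n, d) array of float32, int8, or uint8
    n, d = data.shape
    with open(path, "wb") as f:
        np.array([n, d], dtype=np.int32).tofile(f)  # 4-byte point count, then 4-byte dimension
        data.tofile(f)                              # n*d*sizeof(T) bytes, one point at a time

write_diskann_bin("example.fbin", np.random.rand(1000, 128).astype(np.float32))
```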
To search the generated index, use the `apps/search_memory_index` program:
---------------------------------------------------------------------------
The arguments are as follows:
1. **data_type**: The type of dataset you built the index on. float(32 bit), signed int8 and unsigned uint8 are supported. Use the same data type as arg (1) used when building the index.
2. **dist_fn**: There are two distance functions supported: l2 and mips. There is an additional *fast_l2* implementation that could provide faster results for small (about a million points) indices. Use the same distance function as arg (2) used when building the index.
3. **memory_index_path**: index built above in argument (4).
4. **T**: The number of threads used for searching. Threads run in parallel and one thread handles one query at a time. More threads will result in higher aggregate query throughput, but may lead to higher per-query latency, especially if the DRAM bandwidth is a bottleneck. So find the balance depending on throughput and latency required for your application.
5. **query_bin**: The queries to be searched on, in the same binary file format as the data file described in arg (3) of the build section above. The query file must be of the same data type as arg (1).
6. **truthset.bin**: The ground truth file for the queries in arg (5) and the data file used in index construction. The binary file must start with *n*, the number of queries (4 bytes), followed by *d*, the number of ground truth elements per query (4 bytes), followed by `n*d` entries per query representing the d closest IDs per query in integer format, followed by `n*d` entries representing the corresponding distances (float). Total file size is `8 + 4*n*d + 4*n*d` bytes. The groundtruth file, if not available, can be computed using the program `apps/utils/compute_groundtruth`; a minimal reader for this layout appears after this list. Use "null" if you do not have this file and do not want to compute recall.
7. **K**: search for *K* neighbors and measure *K*-recall@*K*, meaning the intersection between the retrieved top-*K* nearest neighbors and ground truth *K* nearest neighbors.
8. **result_output_prefix**: search results will be stored in files, one per L value (see next arg), with specified prefix, in binary format.
9. **-L (--search_list)**: A list of search_list sizes to perform search with. Larger values result in higher latencies but better accuracy. Must be at least the value of *K* in arg (7).
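As referenced in arg (6), a minimal Python reader for the ground-truth layout might look like this; whether the IDs are written signed or unsigned is an assumption (both are 4 bytes):
```python
import numpy as np

def read_truthset(path: str):
    with open(path, "rb") as f:
        n, d = np.fromfile(f, dtype=np.int32, count=2)
        ids = np.fromfile(f, dtype=np.uint32, count=n * d).reshape(n, d)
        distances = np.fromfile(f, dtype=np.float32, count=n * d).reshape(n, d)
    return ids, distances
```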
Example with BIGANN:
--------------------
This example demonstrates the use of the commands above on a 100K slice of the [BIGANN dataset](http://corpus-texmex.irisa.fr/) with 128 dimensional SIFT descriptors applied to images.
Download the base and query set and convert the data to binary format
```bash
mkdir -p DiskANN/build/data && cd DiskANN/build/data
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xf sift.tar.gz
cd ..
./apps/utils/fvecs_to_bin float data/sift/sift_learn.fvecs data/sift/sift_learn.fbin
./apps/utils/fvecs_to_bin float data/sift/sift_query.fvecs data/sift/sift_query.fbin
```
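For reference, `.fvecs` stores each vector as a 4-byte dimension followed by that many float32 values. A hedged Python equivalent of the conversion above (use the actual `fvecs_to_bin` tool in practice):
```python
import numpy as np

def fvecs_to_fbin(src: str, dst: str) -> None:
    raw = np.fromfile(src, dtype=np.float32)
    d = raw[:1].view(np.int32)[0]          # each record starts with its dimension, stored as int32
    vecs = raw.reshape(-1, d + 1)[:, 1:]   # drop the per-record dimension column
    n, d = vecs.shape
    with open(dst, "wb") as f:
        np.array([n, d], dtype=np.int32).tofile(f)
        vecs.tofile(f)
```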
Now build and search the index, and measure the recall using ground truth computed by brute-force search.
```bash
./apps/utils/compute_groundtruth --data_type float --dist_fn l2 --base_file data/sift/sift_learn.fbin --query_file data/sift/sift_query.fbin --gt_file data/sift/sift_query_learn_gt100 --K 100
./apps/build_memory_index --data_type float --dist_fn l2 --data_path data/sift/sift_learn.fbin --index_path_prefix data/sift/index_sift_learn_R32_L50_A1.2 -R 32 -L 50 --alpha 1.2
./apps/search_memory_index --data_type float --dist_fn l2 --index_path_prefix data/sift/index_sift_learn_R32_L50_A1.2 --query_file data/sift/sift_query.fbin --gt_file data/sift/sift_query_learn_gt100 -K 10 -L 10 20 30 40 50 100 --result_path data/sift/res
```
The search output lists the throughput (Queries/sec) as well as the mean and 99.9 percentile latency in microseconds for each `L` parameter provided. (Measured on a 32-core, 64-vCPU D-series Azure VM.)
```
Ls QPS Avg dist cmps Mean Latency (mus) 99.9 Latency Recall@10
=================================================================================
10 319901.78 348.93 174.51 4943.35 97.80
20 346572.72 525.85 183.36 376.60 98.93
30 292060.12 688.86 217.73 421.60 99.30
40 248945.22 841.74 255.41 476.80 99.45
50 215888.81 986.67 294.62 542.21 99.56
100 129711.39 1631.94 490.58 848.61 99.88
```
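Recall@10 here is *K*-recall@*K* as defined in arg (7) above: the average overlap between the retrieved top-10 and the ground-truth top-10. A minimal sketch of that computation, assuming `gt_ids` and `result_ids` are `(num_queries, K)` integer arrays:
```python
import numpy as np

def recall_at_k(gt_ids: np.ndarray, result_ids: np.ndarray, k: int) -> float:
    # average fraction of the true top-k neighbors recovered per query
    hits = sum(len(set(gt[:k]) & set(res[:k])) for gt, res in zip(gt_ids, result_ids))
    return hits / (gt_ids.shape[0] * k)
```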

# `diskannpy`
We publish (sporadic) builds of `diskann` with python bindings to `pypi.org`, which you can install via `pip install diskannpy`.
#### Caveats
Native python modules with cffi need to be built for *every* version of Python and *every* OS and *every* native-integration-library.
This makes for a complicated build matrix that only `(ana)conda` is properly fit to solve. However, we do build wheels
for python 3.9-3.11, across Linux, Windows, and macOS (x86_64). These wheels are also built against `numpy` 1.25, which makes for a hard runtime requirement that can be challenging if you are using older or newer versions of numpy.
There *are* instructions for building against other versions of numpy
[documented in this issue response](https://github.com/microsoft/DiskANN/issues/544#issuecomment-2103437976) if you require a different build.
# Basic Usage
`diskannpy` provides access to both building and reading `DiskANN` indices. In all cases, the _lingua franca_ is numpy
ndarrays. Currently, the only supported dtypes are `np.float32`, `np.int8`, and `np.uint8`.
`diskannpy` provides a number of helpful functions, like reading or writing `diskann` style vector binary files via the `vectors_to_file` and `vectors_from_file` functions. For a full suite of python functions and their documentation, please be sure to read the latest documentation at [https://microsoft.github.io/DiskANN/docs/python/latest/diskannpy.html](https://microsoft.github.io/DiskANN/docs/python/latest/diskannpy.html).
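For example, round-tripping vectors through the DiskANN binary format might look like the following; the keyword argument names are assumptions, so check the API docs linked above:
```python
import numpy as np
import diskannpy as dap

vectors = np.random.default_rng(0).standard_normal((10_000, 128)).astype(np.float32)
dap.vectors_to_file(vector_file="vectors.fbin", vectors=vectors)  # write in diskann .bin layout
round_tripped = dap.vectors_from_file(vector_file="vectors.fbin", dtype=np.float32)
assert round_tripped.shape == vectors.shape
```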
## Scenarios
The following scenarios are supported via the `diskannpy` api.
### Commonalities
```python
import numpy as np

my_dtype = np.float32 # or np.uint8 or np.int8 ONLY
my_set_of_vectors: np.typing.NDArray[my_dtype] = ... # your vectors come from somewhere - you need to bring these!
index_to_identifiers_map: np.typing.NDArray[str] = ... # your vectors likely have some kind of external identifier -
# you need to keep track of the external identifier -> index relationship somehow
identifiers_to_index_map: dict[str, np.uint32 | np.uint64] = ... # your map of your external id to the `diskannpy` internal id
# diskannpy `query` responses will contain the _internal id only_, and if you don't have these maps you won't be able to
# know what this relates to
```
### Build Disk Index
A disk index is a memory mapped, [vamana](https://proceedings.neurips.cc/paper_files/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf)
index that heavily leans into the hardware speeds of modern NVMe based solid state storage.
This means you can build performant ANN indices that overflow plausibly available system memory!
```python
import numpy as np
import diskannpy as dap
vecs = my_set_of_vectors / np.linalg.norm(my_set_of_vectors, axis=1, keepdims=True)  # row-wise normalization, useful
# if your intention is to rank by a directionless cosine angle distance
dap.build_disk_index(
    data=vecs,
    distance_metric="l2",  # can also be cosine, especially if you don't normalize your vectors like above
    index_directory="/tmp/my_index",
    complexity=128,  # the larger this is, the more candidate points we consider when ranking
    graph_degree=64,  # the beauty of a vamana index is its ability to shard and transfer long distances across the
    # graph without navigating the whole thing; the larger this value, the higher quality your results, but the
    # longer it will take to build
    search_memory_maximum=16.0,  # how much memory, in GB, we want to optimize for @ query time
    build_memory_maximum=100.0,  # how much memory, in GB, we are allocating for the index building process
    num_threads=0,  # 0 means use all available threads; in a shared environment you may need to restrict how greedy you are
    vector_dtype=my_dtype,  # we specified this in the Commonalities section above
    index_prefix="ann",  # ann is the default anyway. all files generated will have the prefix `ann_`, in the form of `f"{index_prefix}_"`
    pq_disk_bytes=0,  # product quantization of your vectors can still achieve excellent recall at a fraction of the
    # latency, but we'll do it without PQ for now
)
```
### Search Disk Index
Now we want to search our disk index - using a completely different set of vectors that aren't necessarily guaranteed to
be in our index. We will call this set of vectors `q`, and it is *critical* that they are the same dtype and
dimensionality as the disk index we have just built.
**Note**: If you manually normalized your indexed vectors prior to building the index, you will *also* need to normalize
them prior to query!
#### Common index query setup
```python
index = dap.StaticDiskIndex(
    index_directory="/tmp/my_index",
    num_threads=0,
    num_nodes_to_cache=1_000_000,
    index_prefix="ann"
)
```
#### Individual Vectors
```python
some_index: np.uint32 = ... # the index in our `q` array of points that we will be using to query on an individual basis
my_query_vector: np.typing.NDArray[my_dtype] = q[some_index] # make sure this is a 1-d array of the same dimensionality as your index!
# if you normalized your vectors before building the index, normalize the query too:
# my_query_vector /= np.linalg.norm(my_query_vector)
internal_indices, distances = index.search(
    query=my_query_vector,
    k_neighbors=25,
    complexity=50,  # must be as big or bigger than `k_neighbors`
)
```
#### Mapping to our External Ids
The internal IDs that diskann returns via query aren't necessarily directly useful to you, and the onus is on you
to figure out what they actually link to via your `index_to_identifiers_map` map.
```python
actual_identifiers = index_to_identifiers_map[internal_indices] # using np fancy indexing (advanced indexing?) to map them all to ids you actually understand
```
#### Batch Vectors
```python
import multiprocessing
internal_indices, distances = index.batch_search(
    queries=q,
    k_neighbors=25,
    complexity=50,
    num_threads=multiprocessing.cpu_count(),  # there's a current bug where this is not handling the value 0 properly
    beam_width=8  # beamwidth is the parameter that indicates our parallelism of individual searches, whereas
    # num_threads indicates the number of threads *per* query item in the batch
)
# note that in batch_query form, our internal_indices and distances are 2d arrays
```
#### Mapping to our External Ids
Unlike the previous entry, I have yet to get the fancy advanced indexing to work in one shot on the 2-d results, so we will have to do this the not-so-numpy way.
```python
actual_neighbors = np.full(shape=internal_indices.shape, dtype=index_to_identifiers_map.dtype, fill_value="")  # dtype=str would silently truncate; reuse the map's string dtype
for row in range(internal_indices.shape[0]):
actual_neighbors[row] = index_to_identifiers_map[internal_indices[row]]
```
This is only scratching the surface of what `diskannpy` can offer. Please read the API documentation at [https://microsoft.github.io/DiskANN/docs/python/latest/diskannpy.html](https://microsoft.github.io/DiskANN/docs/python/latest/diskannpy.html) for more details.

<!-- Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license. -->
**REST service set up for serving DiskANN indices and query interface**
=======================================================================
Install dependencies on Ubuntu and compile
------------------------------------------
In addition to the common dependencies in the [README](/README.md), install [Microsoft C++ REST SDK](https://github.com/Microsoft/cpprestsdk).
```bash
sudo apt install libcpprest-dev
mkdir -p build && cd build
cmake -DRESTAPI=True -DCMAKE_BUILD_TYPE=Release ..
make -j
```
Starting an index hosting service
---------------------------------
Follow the instructions for [building an in-memory DiskANN index](/workflows/in_memory_index.md) or [building an SSD DiskANN index](/workflows/SSD_index.md). Then start a service bound at the appropriate IP:port. For querying from the local machine, you may want to use `http://127.0.0.1:port`. For serving queries originating from remote machines, you may want to use `http://0.0.0.0:port`.
```bash
# To start serving an in-memory index
./apps/restapi/inmem_server --address <http://ip_addr:port> --data_type <float/int8/uint8> --data_file <data_file> --index_path_prefix <index_file> --num_threads <number of threads> --l_search <Value for L> --tags_file [tags_file]
# To start serving an SSD-based index.
./apps/restapi/ssd_server --address <http://ip_addr:port> --data_type <float/int8/uint8> --index_path_prefix <index_file_prefix> --num_nodes_to_cache <num_nodes_to_cache> --num_threads <num_threads> --tags_file [tags_file]
```
The `data_type` and the `data_file` should be the same as those used in the construction of the index. The server returns the IDs and distances of the closest vectors in the index to the query. The IDs are implicitly defined by the order of the vectors in the data file. If you wish to assign a different numbering, GUID, or URL to the vectors in the index, use the optional `tags_file`. This should be a file that lists a "tag" string for each vector in the index, one string per line: the string on line `n` is the tag corresponding to vector `n` in the index (in the implicit order defined by the `data_file`).
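For instance, a tags file mapping each vector to a URL could be produced like this; the file name, count, and URLs are placeholders:
```python
# one tag per line; line n is the tag for vector n in the data file's implicit order
num_vectors = 1_000_000  # must match the number of vectors in data_file
with open("tags_file.txt", "w") as f:
    for i in range(num_vectors):
        f.write(f"https://example.com/item/{i}\n")
```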
For an SSD-based index, specify the number of nodes to cache in-memory to make queries faster. For large indices with over 100 million vectors, a typical value for `num_nodes_to_cache` could be 500000. Increase or decrease based on DRAM footprint desired.
For an SSD-based index, also specify the number of threads used for search by setting the `num_threads` parameter.
You can also query multiple SSD-based indices by listing the prefix of each index in a file (one prefix per line) and passing that file through the `index_prefix_paths` parameter to the following command.
```bash
multiple_ssdserver --address <ip_addr:port> --data_type <float/int8/uint8> --index_prefix_paths <index_prefix_paths> --num_nodes_to_cache <num_nodes_to_cache> --num_threads <num_threads> --tags_file [tags_file]
```
The service searches each of the indices and aggregates the results based on distance to find the closest neighbors across all indices.
Querying the service
--------------------
Issue a JSON query with the following fields:
- "k" : The number of nearest neighbors needed.
- "query" : The query vector as a list of coordinates.
- "query_id" : An id to track the query. Use a unique number to keep track of queries, or "0" if you do not want to keep track.
- "Ls" : Query complexity. Higher Ls values take more milliseconds to process but offer higher recall. Default to 256 if you don't want to tune this.
**Post a json query using python**
```python
import requests
jsonquery = {"Ls": 256,
             "query_id": 1234,
             "query": [0.00407, 0.01534, 0.02498, ...],
             "k": 10}
response = requests.post('http://ip_addr:port', json=jsonquery)
print(response.text)
```
The response might look like the following. The partition array indicates the ID of the index from which each result was found in a multi-index setup; for a single-index setup, the response does not contain partition information. The response may or may not contain `tags` based on whether the server was started with a `tags_file`.
```json
{"distances":[1.6947,1.6954,1.6972,1.6985,1.6991,1.7003,1.7008,1.7014,1.7021,1.7039],"indices":[8976853,8221762,30909336,13100282,30514543,11537860,7133262,34074869,50512601,17983301],"k":10,"partition":[20,7,20,20,6,6,11,6,6,20],"query_id":1234,"tags":["https://xyz1", "https://xyz2", "https://xyz3", "https://xyz4", "https://xyz5", "https://xyz6", "https://xyz7", "https://xyz8", "https://xyz9", "https://xyz10"],"time_taken_in_us":3245}
```
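Continuing the python example above, the response can be unpacked with `response.json()`; this is a sketch, and the `tags` field is only present when the server was started with a `tags_file`:
```python
result = response.json()
labels = result.get("tags", result["indices"])  # fall back to raw internal ids when tags are absent
for label, distance in zip(labels, result["distances"]):
    print(label, distance)
print(f"took {result['time_taken_in_us']} microseconds")
```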
**Command line interface to issue multiple queries from a file**
To issue `num_queries` queries from `query_file`, run the following command:
```bash
client <ip_addr:port> <data_type: float/int8/uint8> <query_file> <num_queries> <Ls>
```