# Distributed on-disk index for 1T-scale datasets

This is code corresponding to the description in [Indexing 1T vectors](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors).
All the code is in Python 3 (and not compatible with Python 2).
The current code uses the Deep1B dataset for demonstration purposes, but can scale to 1000x larger.
To run it, download the Deep1B dataset as explained [here](../#getting-deep1b), and edit the paths to the dataset in the scripts.

The cluster commands are written for the Slurm batch scheduling system.
Hopefully, changing to another type of scheduler should be quite straightforward.

## Distributed k-means

To cluster 500M vectors to 10M centroids, it is useful to have a distributed k-means implementation.
The distribution simply consists in splitting the training vectors across machines (servers) and having them do the assignment.
The master/client then synthesizes the results and updates the centroids.

The distributed k-means implementation here is based on the following files:

- [`distributed_kmeans.py`](distributed_kmeans.py) contains the k-means implementation.
  The main loop of k-means is re-implemented in Python but closely follows the Faiss C++ implementation, and should not be significantly less efficient.
  It relies on a `DatasetAssign` object that does the assignment to centroids, which is the bulk of the computation (a sketch of one iteration is given after this list).
  The object can be a Faiss CPU index, a GPU index, or a set of remote GPU or CPU indexes.

- [`run_on_cluster.bash`](run_on_cluster.bash) contains the shell code to run the distributed k-means on a cluster.
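
The sketch below shows what one such iteration boils down to: nearest-centroid assignment on each server, aggregation and centroid update on the client. The helper names are illustrative only; the actual logic lives in `distributed_kmeans.py` and its `DatasetAssign` variants, and empty-cluster splitting is omitted here.

```python
import numpy as np
import faiss
from scipy import sparse

def assign_shard(x_shard, centroids):
    """Server side: nearest-centroid assignment for one shard (the bulk of the work)."""
    index = faiss.IndexFlatL2(centroids.shape[1])
    index.add(np.ascontiguousarray(centroids, dtype='float32'))
    distances, assignment = index.search(
        np.ascontiguousarray(x_shard, dtype='float32'), 1)
    return assignment.ravel(), distances.ravel()

def kmeans_iteration(shards, centroids):
    """Client side: aggregate the per-shard assignments and recompute the centroids."""
    k, d = centroids.shape
    sums = np.zeros((k, d), dtype='float64')
    counts = np.zeros(k, dtype='int64')
    objective = 0.0
    for x_shard in shards:
        assignment, distances = assign_shard(x_shard, centroids)
        n = len(assignment)
        # sparse (k, n) assignment matrix so that A @ x sums the shard vectors per centroid
        A = sparse.csr_matrix(
            (np.ones(n), (assignment, np.arange(n))), shape=(k, n))
        sums += A @ x_shard.astype('float64')
        counts += np.bincount(assignment, minlength=k)
        objective += distances.sum()
    nonempty = counts > 0
    sums[nonempty] /= counts[nonempty, None]
    return sums.astype('float32'), objective
```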

The distributed k-means works with a Python install that contains faiss and scipy (for sparse matrices).
It clusters the training data of Deep1B; this can easily be changed to any file in fvecs, bvecs or npy format that contains the training set.
The training vectors may be too large to fit in RAM, but they are memory-mapped so that should not be a problem.
The file is also assumed to be accessible from all server machines, e.g. via a distributed file system.
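
For reference, an fvecs file can be memory-mapped without loading it in RAM along these lines (a small sketch; each fvecs record is an int32 dimension followed by that many float32 values, and the path is illustrative):

```python
import numpy as np

def mmap_fvecs(fname):
    """Memory-map an fvecs file and expose it as an (n, d) float32 array view."""
    x = np.memmap(fname, dtype='int32', mode='r')
    d = int(x[0])
    # reinterpret as float32, then drop the leading dimension field of each record
    return x.view('float32').reshape(-1, d + 1)[:, 1:]

# xt = mmap_fvecs('deep1b/learn.fvecs')
```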

### Local tests

Edit `distributed_kmeans.py` to point `testdata` to your local copy of the dataset.

Then, 4 levels of sanity check can be run:
```bash
# reference Faiss C++ run
python distributed_kmeans.py --test 0
# using the Python implementation
python distributed_kmeans.py --test 1
# use the dispatch object (on local datasets)
python distributed_kmeans.py --test 2
# same, with GPUs
python distributed_kmeans.py --test 3
```
The output should look like [this gist](https://gist.github.com/mdouze/ffa01fe666a9325761266fe55ead72ad).
### Distributed sanity check

To run the distributed k-means, `distributed_kmeans.py` has to be run both on the server side (`--server` option) and on the client side (`--client` option).
Edit the top of `run_on_cluster.bash` to set the path of the data to cluster.

Sanity checks can be run with
```bash
# non-distributed baseline
bash run_on_cluster.bash test_kmeans_0
# using all the machine's GPUs
bash run_on_cluster.bash test_kmeans_1
# distributed run, with one local server per GPU
bash run_on_cluster.bash test_kmeans_2
```
The test `test_kmeans_2` simulates a distributed run on a single machine by starting one server process per GPU and connecting to the servers via the RPC protocol.
The output should look like [this gist](https://gist.github.com/mdouze/5b2dc69b74579ecff04e1686a277d32e).
### Distributed run

The way the script can be distributed depends on the cluster's scheduling system.
Here we use Slurm, but it should be relatively easy to adapt to any scheduler that can allocate a set of machines and start the same executable on all of them.

The command
```bash
bash run_on_cluster.bash slurm_distributed_kmeans
```
asks Slurm for 5 machines with 4 GPUs each via the `srun` command.
All 5 machines run the script with the `slurm_within_kmeans_server` option.
They determine the number of servers and their own server id via the `SLURM_NPROCS` and `SLURM_PROCID` environment variables.

All machines start `distributed_kmeans.py` in server mode for the slice of the dataset they are responsible for.

In addition, machine #0 also starts the client.
The client knows who the other servers are via the variable `SLURM_JOB_NODELIST`.
It connects to all the servers and performs the clustering.
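
As an illustration, this is roughly how a process can work out its role from the Slurm environment (a sketch only; the actual logic is in `run_on_cluster.bash` and `distributed_kmeans.py`):

```python
import os
import subprocess

nproc = int(os.environ['SLURM_NPROCS'])   # total number of k-means server processes
rank = int(os.environ['SLURM_PROCID'])    # id of this process, in 0..nproc-1

# expand the compact Slurm nodelist (e.g. "node[12-16]") into one hostname per line
hostnames = subprocess.check_output(
    ['scontrol', 'show', 'hostnames', os.environ['SLURM_JOB_NODELIST']]
).decode().split()

print(f'process {rank}/{nproc}, servers on {hostnames}')
# every process starts a k-means server for its slice of the data;
# the process with rank == 0 additionally starts the client that connects to all servers
```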

The output should look like [this gist](https://gist.github.com/mdouze/8d25e89fb4af5093057cae0f917da6cd).

### Run used for Deep1B

For the real run, we run the clustering on 50M vectors to 1M centroids.
This is just a matter of using as many machines / GPUs as possible and setting the output centroids file with the `--out filename` option.
Then run
```bash
bash run_on_cluster.bash deep1b_clustering
```

The last lines of output read like:
```bash
Iteration 19 (898.92 s, search 875.71 s): objective=1.33601e+07 imbalance=1.303 nsplit=0
0: writing centroids to /checkpoint/matthijs/ondisk_distributed/1M_centroids.npy
```

This means that the total training time was 899 s, of which 876 s were used for computation.
However, the computation time includes the I/O overhead to the assignment servers.
In this implementation, the overhead of transmitting the data is non-negligible, and so is the centroid computation stage.
This is due to the inefficient Python implementation and to the RPC protocol, which, unlike MPI, is not optimized for broadcast / gather operations.
However, it is a simple implementation that should run on most clusters.
## Making the trained index

After the centroids are obtained, an empty trained index must be constructed.
This is done by:

- applying a pre-processing stage (a random rotation) to balance the dimensions of the vectors; this can be done after clustering, the centroids are simply rotated as well;

- wrapping the centroids into an HNSW index to speed up the CPU-based assignment of vectors;

- training the 6-bit scalar quantizer used to encode the vectors.

This is performed by the script [`make_trained_index.py`](make_trained_index.py).
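
A minimal sketch of how such an empty trained index can be assembled with the Faiss Python API (dimensions, seeds and file names are illustrative; refer to `make_trained_index.py` for the actual code):

```python
import faiss
import numpy as np

d = 96                 # Deep1B vector dimension
nlist = 10**6          # number of centroids from the distributed k-means

# centroids produced by the clustering step (path is illustrative)
centroids = np.load('1M_centroids.npy').astype('float32')

# pre-processing: a random rotation; the k-means centroids are rotated as well
rrot = faiss.RandomRotationMatrix(d, d)
rrot.init(1234)

# HNSW index over the rotated centroids, used as coarse quantizer for fast CPU assignment
quantizer = faiss.IndexHNSWFlat(d, 32)
quantizer.add(rrot.apply_py(centroids))

# IVF index whose codes are produced by a 6-bit scalar quantizer
index_ivf = faiss.IndexIVFScalarQuantizer(
    quantizer, d, nlist, faiss.ScalarQuantizer.QT_6bit)
# index_ivf.train(rrot.apply_py(xt))   # train the scalar quantizer on a sample xt

# chain the rotation and the IVF index, and store the (empty) trained index
index = faiss.IndexPreTransform(rrot, index_ivf)
faiss.write_index(index, 'trained.faissindex')
```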
## Building the index by slices

We call the slices "vslices" as they are vertical slices of the big matrix; see the explanation in the wiki section [Split across database partitions](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors#split-across-database-partitions).

The script [make_index_vslice.py](make_index_vslice.py) makes an index for a subset of the vectors of the input data and stores it as an independent index.
There are 200 slices of 5M vectors each for Deep1B.
It can be run in a brute-force parallel fashion; there is no constraint on ordering.
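
Conceptually, each vslice job does something like the following (a rough sketch with illustrative paths; the real script handles the input formats and threading):

```python
import faiss
import numpy as np

slice_no = 3                     # which vertical slice this job is responsible for
slice_size = 5 * 10**6
i0 = slice_no * slice_size

index = faiss.read_index('trained.faissindex')        # empty trained index
xb = np.load('deep1B_base.npy', mmap_mode='r')        # memory-mapped database vectors

xslice = np.ascontiguousarray(xb[i0:i0 + slice_size], dtype='float32')
ids = np.arange(i0, i0 + slice_size, dtype='int64')   # keep the global ids for later merging
index.add_with_ids(xslice, ids)

faiss.write_index(index, f'vslices/slice{slice_no}.faissindex')
```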
To run the script in parallel on a Slurm cluster, use:
```bash
bash run_on_cluster.bash make_index_vslices
```
For a real dataset, the data would be read from a DBMS.
In that case, reading the data and indexing it in parallel is worthwhile because the reading is very slow.

## Splitting across inverted lists

The 200 slices need to be merged together.
This is done with the script [merge_to_ondisk.py](merge_to_ondisk.py), which memory-maps the 200 vertical slice indexes, extracts a subset of the inverted lists and writes them to a contiguous horizontal slice.
We slice the inverted lists into 50 horizontal slices.
This is run with
```bash
bash run_on_cluster.bash make_index_hslices
```
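
In outline, the merging relies on Faiss's `OnDiskInvertedLists`, along the lines of the sketch below (simplified: it merges entire inverted lists rather than extracting the subset corresponding to one horizontal slice, and the paths are illustrative):

```python
import faiss

nslices = 200
ivfs = []
for i in range(nslices):
    # memory-map each vertical slice index and grab its inverted lists
    index = faiss.read_index(f'vslices/slice{i}.faissindex', faiss.IO_FLAG_MMAP)
    index_ivf = faiss.extract_index_ivf(index)
    ivfs.append(index_ivf.invlists)
    index_ivf.own_invlists = False     # keep the invlists alive after `index` is freed

index0 = faiss.read_index('trained.faissindex')    # empty trained index
index0_ivf = faiss.extract_index_ivf(index0)

# on-disk storage for the merged lists
invlists = faiss.OnDiskInvertedLists(
    index0_ivf.nlist, index0_ivf.code_size, 'merged.ivfdata')
ivf_vector = faiss.InvertedListsPtrVector()
for ivf in ivfs:
    ivf_vector.push_back(ivf)
# in recent Faiss versions this method is called merge_from_multiple
ntotal = invlists.merge_from(ivf_vector.data(), ivf_vector.size())

index0_ivf.replace_invlists(invlists)
index0.ntotal = index0_ivf.ntotal = ntotal
faiss.write_index(index0, 'merged.faissindex')
```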

## Querying the index

At this point the index is ready.
The horizontal slices need to be loaded in the right order and combined into an index to be usable.
This is done in the [combined_index.py](combined_index.py) script.
It provides a `CombinedIndexDeep1B` object that contains an index object that can be searched.
To test, run:
```bash
python combined_index.py
```
The output should look like:
```bash
(faiss_1.5.2) matthijs@devfair0144:~/faiss_versions/faiss_1Tcode/faiss/benchs/distributed_ondisk$ python combined_index.py
reading /checkpoint/matthijs/ondisk_distributed//hslices/slice49.faissindex
loading empty index /checkpoint/matthijs/ondisk_distributed/trained.faissindex
replace invlists
loaded index of size 1000000000
nprobe=1 1-recall@1=0.2904 t=12.35s
nprobe=10 1-recall@1=0.6499 t=17.67s
nprobe=100 1-recall@1=0.8673 t=29.23s
nprobe=1000 1-recall@1=0.9132 t=129.58s
```
i.e. searching is a lot slower than from RAM.
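
For reference, the 1-recall@1 figure reported above is simply the fraction of queries for which the ground-truth nearest neighbor is returned at rank 1, e.g.:

```python
import numpy as np

def recall_at_1(I, gt):
    """I: (nq, k) result ids from the index, gt: (nq, >=1) ground-truth neighbor ids."""
    return float((I[:, 0] == gt[:, 0]).mean())
```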

## Distributed query

To reduce the bandwidth required from the machine that does the queries, it is possible to split the search across several search servers.
This way, only the actual search results are returned to the querying machine.

The search client and server are implemented in [`search_server.py`](search_server.py).
It can be used as a script to start a search server for `CombinedIndexDeep1B` or as a module to load the clients.

The search servers can be started with
```bash
bash run_on_cluster.bash run_search_servers
```
(adjust to the number of servers that can be used).

An example of a search client is [`distributed_query_demo.py`](distributed_query_demo.py).
It connects to the servers and assigns to each of them a subset of inverted lists to visit.
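
Since each server returns its own top-k results for the inverted lists it was assigned, the client only has to merge them, e.g. along these lines (a generic sketch, not the `search_server.py` code):

```python
import numpy as np

def merge_topk(D_parts, I_parts, k):
    """Merge per-server (nq, k) distance/id matrices into a global top-k per query."""
    D = np.hstack(D_parts)                  # (nq, k * nservers)
    I = np.hstack(I_parts)
    order = np.argsort(D, axis=1)[:, :k]    # keep the k smallest distances per query
    rows = np.arange(D.shape[0])[:, None]
    return D[rows, order], I[rows, order]
```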

A typical output is [this gist](https://gist.github.com/mdouze/1585b9854a9a2437d71f2b2c3c05c7c5).
The number in MiB indicates the amount of data that is read from disk to perform the search.
In this case, the scale of the dataset is too small for the distributed search to have much impact, but on datasets > 10x larger, the difference becomes more significant.

## Conclusion

This code contains the core components to make an index that scales up to 1T vectors.
There are a few simplifications with respect to the index that was actually used in [Indexing 1T vectors](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors).