Initial commit

yichuan520030910320
2025-06-30 09:05:05 +00:00
commit 46f6cc100b
1231 changed files with 278432 additions and 0 deletions


@@ -0,0 +1,194 @@
# Distributed on-disk index for 1T-scale datasets
This is the code corresponding to the description in [Indexing 1T vectors](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors).
All the code is in Python 3 (and not compatible with Python 2).
The current code uses the Deep1B dataset for demonstration purposes, but it can scale to datasets 1000x larger.
To run it, download the Deep1B dataset as explained [here](../#getting-deep1b), and edit the paths to the dataset in the scripts.
The cluster commands are written for the Slurm batch scheduling system.
Adapting them to another type of scheduler should be quite straightforward.
## Distributed k-means
To cluster 500M vectors into 10M centroids, it is useful to have a distributed k-means implementation.
The distribution simply consists of splitting the training vectors across machines (servers) and having them do the assignment.
The master/client then synthesizes the results and updates the centroids.
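Schematically, each iteration assigns the shards to the current centroids on the servers and recomputes the centroids on the client. Below is a minimal NumPy sketch of that loop; the `servers` objects stand for the `DatasetAssign` instances described below, and empty-cluster splitting and convergence statistics are omitted.
```python
import numpy as np

def distributed_kmeans_sketch(servers, centroids, niter=20):
    """One client-side k-means loop over sharded data.
    Each server returns, for its shard, the nearest-centroid assignment and
    the per-centroid sum of its vectors; the client accumulates them and
    recomputes the centroids."""
    k, d = centroids.shape
    for _ in range(niter):
        sums = np.zeros((k, d), dtype='float64')
        counts = np.zeros(k, dtype='int64')
        for srv in servers:
            # assignment (the expensive part) runs on the server
            I, D, sum_per_centroid = srv.assign_to(centroids)
            sums += sum_per_centroid
            counts += np.bincount(I, minlength=k)
        nonempty = counts > 0
        centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centroids
```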
The distributed k-means implementation here is based on 3 files:
- [`distributed_kmeans.py`](distributed_kmeans.py) contains the k-means implementation.
The main loop of k-means is re-implemented in Python but closely follows the Faiss C++ implementation, and should not be significantly less efficient.
It relies on a `DatasetAssign` object that does the assignment to centroids, which is the bulk of the computation.
The object can be a Faiss CPU index, a GPU index or a set of remote GPU or CPU indexes.
- [`run_on_cluster.bash`](run_on_cluster.bash) contains the shell code to run the distributed k-means on a cluster.
The distributed k-means works with a Python install that contains faiss and scipy (for sparse matrices).
It clusters the training data of Deep1B; this can easily be changed to any file in fvecs, bvecs or npy format that contains the training set.
The training vectors may be too large to fit in RAM, but they are memory-mapped so that should not be a problem.
The file is also assumed to be accessible from all server machines, e.g. via a distributed file system.
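For reference, fvecs/bvecs files can be memory-mapped with a few lines of NumPy; this is essentially what the `fvecs_mmap` / `bvecs_mmap` helpers of `faiss.contrib.vecs_io` (used by the scripts) do:
```python
import numpy as np

def fvecs_mmap(fname):
    # each fvecs record is: int32 dimension d, then d float32 components
    a = np.memmap(fname, dtype='int32', mode='r')
    d = a[0]
    return a.reshape(-1, d + 1)[:, 1:].view('float32')

def bvecs_mmap(fname):
    # each bvecs record is: int32 dimension d, then d uint8 components
    a = np.memmap(fname, dtype='uint8', mode='r')
    d = a[:4].view('int32')[0]
    return a.reshape(-1, d + 4)[:, 4:]
```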
### Local tests
Edit `distributed_kmeans.py` to point `testdata` to your local copy of the dataset.
Then, 4 levels of sanity check can be run:
```bash
# reference Faiss C++ run
python distributed_kmeans.py --test 0
# using the Python implementation
python distributed_kmeans.py --test 1
# use the dispatch object (on local datasets)
python distributed_kmeans.py --test 2
# same, with GPUs
python distributed_kmeans.py --test 3
```
The output should look like [this gist](https://gist.github.com/mdouze/ffa01fe666a9325761266fe55ead72ad).
### Distributed sanity check
To run the distributed k-means, `distributed_kmeans.py` has to be run both on the server side (`--server` option) and on the client side (`--client` option).
Edit the top of `run_on_cluster.bash` to set the path of the data to cluster.
Sanity checks can be run with
```bash
# non distributed baseline
bash run_on_cluster.bash test_kmeans_0
# using all the machine's GPUs
bash run_on_cluster.bash test_kmeans_1
# distributed run, with one local server per GPU
bash run_on_cluster.bash test_kmeans_2
```
The test `test_kmeans_2` simulates a distributed run on a single machine by starting one server process per GPU and connecting to the servers via the RPC protocol.
The output should look like [this gist](https://gist.github.com/mdouze/5b2dc69b74579ecff04e1686a277d32e).
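On the server side, each process wraps an assignment object for its shard in an RPC server. The following is a condensed sketch of what the `--server` mode of `distributed_kmeans.py` does; the data path, shard bounds and port are placeholders.
```python
from faiss.contrib import rpc
from faiss.contrib.clustering import DatasetAssignGPU
from faiss.contrib.vecs_io import fvecs_mmap

from distributed_kmeans import AssignServer  # defined in this directory

x = fvecs_mmap("deep1b/learn.fvecs")[0:200000]   # shard handled by this server
data = DatasetAssignGPU(x, 0)                    # assignment runs on GPU #0

# serve count() / get_subset() / assign_to() calls coming from the client
rpc.run_server(lambda s: AssignServer(s, data), 12012)
```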
### Distributed run
The way the script can be distributed depends on the cluster's scheduling system.
Here we use Slurm, but it should be relatively easy to adapt to any scheduler that can allocate a set of machines and start the same executable on all of them.
The command
```bash
bash run_on_cluster.bash slurm_distributed_kmeans
```
asks Slurm for 5 machines with 4 GPUs each via the `srun` command.
All 5 machines run the script with the `slurm_within_kmeans_server` option.
They determine the number of servers and their own server id via the `SLURM_NPROCS` and `SLURM_PROCID` environment variables.
All machines start `distributed_kmeans.py` in server mode for the slice of the dataset they are responsible for.
In addition, machine #0 also starts the client.
The client finds out about the other servers via the `SLURM_JOB_NODELIST` environment variable.
It connects to all the servers and performs the clustering.
The output should look like [this gist](https://gist.github.com/mdouze/8d25e89fb4af5093057cae0f917da6cd).
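On the client side, the servers are wrapped as remote assignment objects and handed to the k-means loop. This is a condensed sketch of what the `--client` mode does; the `hostports` list is a placeholder (in the script it is built from `SLURM_JOB_NODELIST`).
```python
from faiss.contrib import rpc
from faiss.contrib.clustering import kmeans

from distributed_kmeans import DatasetAssignDispatch  # defined in this directory

hostports = ["node001:12012", "node002:12013"]   # placeholder server list

def connect(hostport):
    host, port = hostport.split(':')
    # each rpc.Client forwards count() / get_subset() / assign_to() calls
    # to the remote AssignServer
    return rpc.Client(host, int(port), v6=False)

data = DatasetAssignDispatch([connect(hp) for hp in hostports], True)
centroids = kmeans(1000000, data, niter=20)
```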
### Run used for deep1B
For the real run, we cluster 50M vectors into 1M centroids.
This is just a matter of using as many machines / GPUs as possible and setting the output centroids file with the `--out filename` option.
Then run
```bash
bash run_on_cluster.bash deep1b_clustering
```
The last lines of output read like:
```bash
Iteration 19 (898.92 s, search 875.71 s): objective=1.33601e+07 imbalance=1.303 nsplit=0
0: writing centroids to /checkpoint/matthijs/ondisk_distributed/1M_centroids.npy
```
This means that the total training time was 899s, of which 876s were used for computation.
However, the computation includes the I/O overhead to the assignment servers.
In this implementation, the overhead of transmitting the data is non-negligible and so is the centroid computation stage.
This is due to the inefficient Python implementation and to the RPC protocol, which is not optimized for broadcast / gather operations (unlike MPI).
However, it is a simple implementation that should run on most clusters.
## Making the trained index
After the centroids are obtained, an empty trained index must be constructed.
This is done by:
- applying a pre-processing stage (a random rotation) to balance the dimensions of the vectors. This can be done after clustering; the centroids are simply rotated as well.
- wrapping the centroids into an HNSW index to speed up the CPU-based assignment of vectors
- training the 6-bit scalar quantizer used to encode the vectors
This is performed by the script [`make_trained_index.py`](make_trained_index.py).
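The construction itself boils down to the following (a condensed version of `make_trained_index.py`; paths are placeholders):
```python
import faiss
import numpy as np
from faiss.contrib.vecs_io import fvecs_mmap

centroids = np.load("1M_centroids.npy")      # output of the distributed k-means
ncent, d = centroids.shape

# random rotation that balances the dimensions; the centroids are rotated too
rrot = faiss.RandomRotationMatrix(d, d)
rrot.init(1234)
centroids = rrot.apply_py(centroids)

# HNSW graph over the centroids to make the CPU coarse quantization fast
quantizer = faiss.IndexHNSWFlat(d, 32)
quantizer.hnsw.efConstruction = 200
quantizer.add(centroids)

# IVF index with 6-bit scalar quantizer codes, wrapped in the rotation
index = faiss.IndexPreTransform(
    rrot,
    faiss.IndexIVFScalarQuantizer(quantizer, d, ncent, faiss.ScalarQuantizer.QT_6bit)
)

# train the scalar quantizer on a sample of the training vectors
xt = np.ascontiguousarray(fvecs_mmap("deep1b/learn.fvecs")[:256 * 1000], dtype='float32')
index.train(xt)
faiss.write_index(index, "trained.faissindex")
```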
## Building the index by slices
We call the slices "vslices" as they are vertical slices of the big matrix; see the explanation in the wiki section [Split across database partitions](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors#split-across-database-partitions).
The script [make_index_vslice.py](make_index_vslice.py) makes an index for a subset of the vectors of the input data and stores it as an independent index.
There are 200 slices of 5M vectors each for Deep1B.
It can be run in a brute-force parallel fashion; there is no constraint on ordering.
To run the script in parallel on a slurm cluster, use:
```bash
bash run_on_cluster.bash make_index_vslices
```
For a real dataset, the data would be read from a DBMS.
In that case, reading the data and indexing it in parallel is worthwhile because reading is very slow.
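Filling one vertical slice amounts to streaming a range of the database through `add_with_ids` on a copy of the empty trained index. This is a condensed sketch of `make_index_vslice.py`; paths and slice bounds are placeholders, and the prefetching thread of the real script is omitted.
```python
import faiss
import numpy as np
from faiss.contrib.vecs_io import fvecs_mmap

i0, i1, bs = 0, 5_000_000, 2**18                 # placeholder slice bounds / batch size

index = faiss.read_index("trained.faissindex")   # empty trained index
x = fvecs_mmap("deep1b/base.fvecs")              # memory-mapped database
for j0 in range(i0, i1, bs):
    j1 = min(j0 + bs, i1)
    xb = np.ascontiguousarray(x[j0:j1], dtype='float32')
    # the ids are the global vector numbers, so the slices can be merged later
    index.add_with_ids(xb, np.arange(j0, j1))
faiss.write_index(index, "vslices/slice0.faissindex")
```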
## Splitting across inverted lists
The 200 vertical slices then need to be merged together.
This is done with the script [merge_to_ondisk.py](merge_to_ondisk.py), which memory-maps the 200 vertical slice indexes, extracts a subset of the inverted lists and writes them to a contiguous horizontal slice.
We slice the inverted lists into 50 horizontal slices.
This is run with
```bash
bash run_on_cluster.bash make_index_hslices
```
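Internally, the merge memory-maps the vertical slices, crops their inverted lists to the requested range and merges them into a single `OnDiskInvertedLists`. This is a condensed sketch of `merge_to_ondisk.py`; file names and the list range are placeholders, and error handling plus the final index write are omitted.
```python
import faiss

l0, l1 = 0, 20000                       # placeholder range of inverted lists
fnames = ["vslices/slice%d.faissindex" % i for i in range(200)]

ils = faiss.InvertedListsPtrVector()
indexes = []                            # keep the mmapped indexes alive
for fname in fnames:
    index = faiss.read_index(fname, faiss.IO_FLAG_MMAP | faiss.IO_FLAG_READ_ONLY)
    indexes.append(index)
    il = faiss.downcast_InvertedLists(faiss.extract_index_ivf(index).invlists)
    il.crop_invlists(l0, l1)            # keep only inverted lists l0:l1
    ils.push_back(il)

# merge the cropped lists of all vertical slices into one on-disk invfile
il0 = ils.at(0)
ondisk = faiss.OnDiskInvertedLists(il0.nlist, il0.code_size, "hslices/slice0.invlists")
ntotal = ondisk.merge_from(ils.data(), ils.size(), True)
```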
## Querying the index
At this point the index is ready.
The horizontal slices need to be loaded in the right order and combined into an index to be usable.
This is done in the [combined_index.py](combined_index.py) script.
It provides a `CombinedIndexDeep1B` object that contains an index object that can be searched.
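Schematically, the combination stacks the inverted lists of the 50 horizontal slices and grafts them onto the empty trained index. This is a condensed sketch of what `CombinedIndexDeep1B` does; paths are placeholders.
```python
import faiss

fnames = ["hslices/slice%d.faissindex" % i for i in range(50)]

ilv = faiss.InvertedListsPtrVector()
indexes = []                            # keep the slice indexes alive
for fname in fnames:
    index = faiss.read_index(fname)
    indexes.append(index)
    ilv.push_back(faiss.extract_index_ivf(index).invlists)

# stack the horizontal slices back into one big set of inverted lists
big_il = faiss.VStackInvertedLists(ilv.size(), ilv.data())

index = faiss.read_index("trained.faissindex")   # empty index with the quantizer
index_ivf = faiss.extract_index_ivf(index)
index_ivf.replace_invlists(big_il, False)        # False: index does not own them
index_ivf.ntotal = index.ntotal = big_il.compute_ntotal()
```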
To test, run:
```bash
python combined_index.py
```
The output should look like:
```bash
(faiss_1.5.2) matthijs@devfair0144:~/faiss_versions/faiss_1Tcode/faiss/benchs/distributed_ondisk$ python combined_index.py
reading /checkpoint/matthijs/ondisk_distributed//hslices/slice49.faissindex
loading empty index /checkpoint/matthijs/ondisk_distributed/trained.faissindex
replace invlists
loaded index of size 1000000000
nprobe=1 1-recall@1=0.2904 t=12.35s
nprobe=10 1-recall@1=0.6499 t=17.67s
nprobe=100 1-recall@1=0.8673 t=29.23s
nprobe=1000 1-recall@1=0.9132 t=129.58s
```
i.e. searching is a lot slower than from RAM.
## Distributed query
To reduce the bandwidth required from the machine that does the queries, it is possible to split the search across several search servers.
This way, only the actual search results are returned to the main machine.
The search client and server are implemented in [`search_server.py`](search_server.py).
It can be used as a script to start a search server for `CombinedIndexDeep1B` or as a module to load the clients.
The search servers can be started with
```bash
bash run_on_cluster.bash run_search_servers
```
(adjust `nserv` in `run_on_cluster.bash` to the number of servers that can be used).
An example of a search client is [`distributed_query_demo.py`](distributed_query_demo.py).
It connects to the servers and assigns each of them a subset of the inverted lists to visit.
A typical output is [this gist](https://gist.github.com/mdouze/1585b9854a9a2437d71f2b2c3c05c7c5).
The number in MiB indicates the amount of data that is read from disk to perform the search.
In this case, the scale of the dataset is too small for the distributed search to have much impact, but on datasets > 10x larger, the difference becomes more significant.
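The assignment of inverted lists to search servers is a simple greedy load balancing on the list sizes, so that every server scans roughly the same number of codes; the sketch below mirrors the `distribute_weights` helper in `search_server.py`:
```python
import numpy as np

def distribute_weights(weights, nbin):
    """greedily assign weights (here: inverted-list sizes) to nbin bins
    (here: search servers), largest weight first, always to the lightest bin"""
    bins = np.zeros(nbin)
    assign = np.zeros(weights.size, dtype=int)
    for i in weights.argsort()[::-1]:
        b = bins.argmin()
        assign[i] = b
        bins[b] += weights[i]
    return bins, assign

# toy example: balance 10 inverted lists over 3 servers
sizes = np.array([50, 3, 8, 21, 13, 2, 40, 5, 17, 9])
bins, assign = distribute_weights(sizes, 3)
print(bins)     # total number of codes scanned per server
print(assign)   # server id assigned to each inverted list
```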
## Conclusion
This code contains the core components to make an index that scales up to 1T vectors.
There are a few simplifications with respect to the index that was effectively used in [Indexing 1T vectors](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors).


@@ -0,0 +1,193 @@
#!/usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import os
import faiss
import numpy as np
class CombinedIndex:
"""
combines a set of inverted lists into a hstack
masks part of those lists
adds these inverted lists to an empty index that contains
the info on how to perform searches
"""
def __init__(self, invlist_fnames, empty_index_fname,
masked_index_fname=None):
self.indexes = indexes = []
ilv = faiss.InvertedListsPtrVector()
for fname in invlist_fnames:
if os.path.exists(fname):
print('reading', fname, end='\r', flush=True)
index = faiss.read_index(fname)
indexes.append(index)
il = faiss.extract_index_ivf(index).invlists
else:
raise AssertionError
ilv.push_back(il)
print()
self.big_il = faiss.VStackInvertedLists(ilv.size(), ilv.data())
if masked_index_fname:
self.big_il_base = self.big_il
print('loading', masked_index_fname)
self.masked_index = faiss.read_index(
masked_index_fname,
faiss.IO_FLAG_MMAP | faiss.IO_FLAG_READ_ONLY)
self.big_il = faiss.MaskedInvertedLists(
faiss.extract_index_ivf(self.masked_index).invlists,
self.big_il_base)
print('loading empty index', empty_index_fname)
self.index = faiss.read_index(empty_index_fname)
ntotal = self.big_il.compute_ntotal()
print('replace invlists')
index_ivf = faiss.extract_index_ivf(self.index)
index_ivf.replace_invlists(self.big_il, False)
index_ivf.ntotal = self.index.ntotal = ntotal
index_ivf.parallel_mode = 1 # seems reasonable to do this all the time
quantizer = faiss.downcast_index(index_ivf.quantizer)
quantizer.hnsw.efSearch = 1024
############################################################
# Expose fields and functions of the index as methods so that they
# can be called by RPC
def search(self, x, k):
return self.index.search(x, k)
def range_search(self, x, radius):
return self.index.range_search(x, radius)
def transform_and_assign(self, xq):
index = self.index
if isinstance(index, faiss.IndexPreTransform):
assert index.chain.size() == 1
vt = index.chain.at(0)
xq = vt.apply_py(xq)
# perform quantization
index_ivf = faiss.extract_index_ivf(index)
quantizer = index_ivf.quantizer
coarse_dis, list_nos = quantizer.search(xq, index_ivf.nprobe)
return xq, list_nos, coarse_dis
def ivf_search_preassigned(self, xq, list_nos, coarse_dis, k):
index_ivf = faiss.extract_index_ivf(self.index)
n, d = xq.shape
assert d == index_ivf.d
n2, d2 = list_nos.shape
assert list_nos.shape == coarse_dis.shape
assert n2 == n
assert d2 == index_ivf.nprobe
D = np.empty((n, k), dtype='float32')
I = np.empty((n, k), dtype='int64')
index_ivf.search_preassigned(
n, faiss.swig_ptr(xq), k,
faiss.swig_ptr(list_nos), faiss.swig_ptr(coarse_dis),
faiss.swig_ptr(D), faiss.swig_ptr(I), False)
return D, I
def ivf_range_search_preassigned(self, xq, list_nos, coarse_dis, radius):
index_ivf = faiss.extract_index_ivf(self.index)
n, d = xq.shape
assert d == index_ivf.d
n2, d2 = list_nos.shape
assert list_nos.shape == coarse_dis.shape
assert n2 == n
assert d2 == index_ivf.nprobe
res = faiss.RangeSearchResult(n)
index_ivf.range_search_preassigned(
n, faiss.swig_ptr(xq), radius,
faiss.swig_ptr(list_nos), faiss.swig_ptr(coarse_dis),
res)
lims = faiss.rev_swig_ptr(res.lims, n + 1).copy()
nd = int(lims[-1])
D = faiss.rev_swig_ptr(res.distances, nd).copy()
I = faiss.rev_swig_ptr(res.labels, nd).copy()
return lims, D, I
def set_nprobe(self, nprobe):
index_ivf = faiss.extract_index_ivf(self.index)
index_ivf.nprobe = nprobe
def set_parallel_mode(self, pm):
index_ivf = faiss.extract_index_ivf(self.index)
index_ivf.parallel_mode = pm
def get_ntotal(self):
return self.index.ntotal
def set_prefetch_nthread(self, nt):
for idx in self.indexes:
il = faiss.downcast_InvertedLists(
faiss.extract_index_ivf(idx).invlists)
            il.prefetch_nthread = nt
def set_omp_num_threads(self, nt):
faiss.omp_set_num_threads(nt)
class CombinedIndexDeep1B(CombinedIndex):
""" loads a CombinedIndex with the data from the big photodna index """
def __init__(self):
# set some paths
workdir = "/checkpoint/matthijs/ondisk_distributed/"
# empty index with the proper quantizer
indexfname = workdir + 'trained.faissindex'
# index that has some invlists that override the big one
masked_index_fname = None
invlist_fnames = [
'%s/hslices/slice%d.faissindex' % (workdir, i)
for i in range(50)
]
CombinedIndex.__init__(self, invlist_fnames, indexfname, masked_index_fname)
def ivecs_read(fname):
a = np.fromfile(fname, dtype='int32')
d = a[0]
return a.reshape(-1, d + 1)[:, 1:].copy()
def fvecs_read(fname):
return ivecs_read(fname).view('float32')
if __name__ == '__main__':
import time
ci = CombinedIndexDeep1B()
print('loaded index of size ', ci.index.ntotal)
deep1bdir = "/datasets01_101/simsearch/041218/deep1b/"
xq = fvecs_read(deep1bdir + "deep1B_queries.fvecs")
gt_fname = deep1bdir + "deep1B_groundtruth.ivecs"
gt = ivecs_read(gt_fname)
for nprobe in 1, 10, 100, 1000:
ci.set_nprobe(nprobe)
t0 = time.time()
D, I = ci.search(xq, 100)
t1 = time.time()
print('nprobe=%d 1-recall@1=%.4f t=%.2fs' % (
nprobe, (I[:, 0] == gt[:, 0]).sum() / len(xq),
t1 - t0
))


@@ -0,0 +1,239 @@
#! /usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
Simple distributed k-means implementation. It relies on an abstraction
for the training matrix, which can be sharded over several machines.
"""
import os
import sys
import argparse
import numpy as np
import faiss
from multiprocessing.pool import ThreadPool
from faiss.contrib import rpc
from faiss.contrib.datasets import SyntheticDataset
from faiss.contrib.vecs_io import bvecs_mmap, fvecs_mmap
from faiss.contrib.clustering import DatasetAssign, DatasetAssignGPU, kmeans
class DatasetAssignDispatch:
"""dispatches to several other DatasetAssigns and combines the
results"""
def __init__(self, xes, in_parallel):
self.xes = xes
self.d = xes[0].dim()
if not in_parallel:
self.imap = map
else:
self.pool = ThreadPool(len(self.xes))
self.imap = self.pool.imap
self.sizes = list(map(lambda x: x.count(), self.xes))
self.cs = np.cumsum([0] + self.sizes)
def count(self):
return self.cs[-1]
def dim(self):
return self.d
def get_subset(self, indices):
res = np.zeros((len(indices), self.d), dtype='float32')
nos = np.searchsorted(self.cs[1:], indices, side='right')
def handle(i):
mask = nos == i
sub_indices = indices[mask] - self.cs[i]
subset = self.xes[i].get_subset(sub_indices)
res[mask] = subset
list(self.imap(handle, range(len(self.xes))))
return res
def assign_to(self, centroids, weights=None):
src = self.imap(
lambda x: x.assign_to(centroids, weights),
self.xes
)
I = []
D = []
sum_per_centroid = None
for Ii, Di, sum_per_centroid_i in src:
I.append(Ii)
D.append(Di)
if sum_per_centroid is None:
sum_per_centroid = sum_per_centroid_i
else:
sum_per_centroid += sum_per_centroid_i
return np.hstack(I), np.hstack(D), sum_per_centroid
class AssignServer(rpc.Server):
""" Assign version that can be exposed via RPC """
def __init__(self, s, assign, log_prefix=''):
rpc.Server.__init__(self, s, log_prefix=log_prefix)
self.assign = assign
def __getattr__(self, f):
return getattr(self.assign, f)
def do_test(todo):
testdata = '/datasets01_101/simsearch/041218/bigann/bigann_learn.bvecs'
if os.path.exists(testdata):
x = bvecs_mmap(testdata)
else:
print("using synthetic dataset")
ds = SyntheticDataset(128, 100000, 0, 0)
x = ds.get_train()
# bad distribution to stress-test split code
xx = x[:100000].copy()
xx[:50000] = x[0]
if "0" in todo:
# reference C++ run
km = faiss.Kmeans(x.shape[1], 1000, niter=20, verbose=True)
km.train(xx.astype('float32'))
if "1" in todo:
        # using the Python k-means implementation (faiss.contrib.clustering)
data = DatasetAssign(xx)
kmeans(1000, data, 20)
if "2" in todo:
# use the dispatch object (on local datasets)
data = DatasetAssignDispatch([
DatasetAssign(xx[20000 * i : 20000 * (i + 1)])
for i in range(5)
], False
)
kmeans(1000, data, 20)
if "3" in todo:
# same, with GPU
ngpu = faiss.get_num_gpus()
print('using %d GPUs' % ngpu)
data = DatasetAssignDispatch([
DatasetAssignGPU(xx[100000 * i // ngpu: 100000 * (i + 1) // ngpu], i)
for i in range(ngpu)
], True
)
kmeans(1000, data, 20)
def main():
parser = argparse.ArgumentParser()
def aa(*args, **kwargs):
group.add_argument(*args, **kwargs)
group = parser.add_argument_group('general options')
aa('--test', default='', help='perform tests (comma-separated numbers)')
aa('--k', default=0, type=int, help='nb centroids')
aa('--seed', default=1234, type=int, help='random seed')
aa('--niter', default=20, type=int, help='nb iterations')
aa('--gpu', default=-2, type=int, help='GPU to use (-2:none, -1: all)')
group = parser.add_argument_group('I/O options')
    aa('--indata', default='',
       help='data file to load (supported formats: fvecs, bvecs, npy)')
aa('--i0', default=0, type=int, help='first vector to keep')
aa('--i1', default=-1, type=int, help='last vec to keep + 1')
aa('--out', default='', help='file to store centroids')
aa('--store_each_iteration', default=False, action='store_true',
help='store centroid checkpoints')
group = parser.add_argument_group('server options')
aa('--server', action='store_true', default=False, help='run server')
aa('--port', default=12345, type=int, help='server port')
aa('--when_ready', default=None, help='store host:port to this file when ready')
aa('--ipv4', default=False, action='store_true', help='force ipv4')
group = parser.add_argument_group('client options')
aa('--client', action='store_true', default=False, help='run client')
aa('--servers', default='', help='list of server:port separated by spaces')
args = parser.parse_args()
if args.test:
do_test(args.test.split(','))
return
# prepare data matrix (either local or remote)
if args.indata:
print('loading ', args.indata)
if args.indata.endswith('.bvecs'):
x = bvecs_mmap(args.indata)
elif args.indata.endswith('.fvecs'):
x = fvecs_mmap(args.indata)
elif args.indata.endswith('.npy'):
x = np.load(args.indata, mmap_mode='r')
else:
raise AssertionError
if args.i1 == -1:
args.i1 = len(x)
x = x[args.i0:args.i1]
if args.gpu == -2:
data = DatasetAssign(x)
else:
print('moving to GPU')
data = DatasetAssignGPU(x, args.gpu)
elif args.client:
print('connecting to servers')
def connect_client(hostport):
host, port = hostport.split(':')
port = int(port)
print('connecting %s:%d' % (host, port))
client = rpc.Client(host, port, v6=not args.ipv4)
print('client %s:%d ready' % (host, port))
return client
hostports = args.servers.strip().split(' ')
# pool = ThreadPool(len(hostports))
data = DatasetAssignDispatch(
list(map(connect_client, hostports)),
True
)
else:
raise AssertionError
if args.server:
print('starting server')
log_prefix = f"{rpc.socket.gethostname()}:{args.port}"
rpc.run_server(
lambda s: AssignServer(s, data, log_prefix=log_prefix),
args.port, report_to_file=args.when_ready,
v6=not args.ipv4)
else:
print('running kmeans')
centroids = kmeans(args.k, data, niter=args.niter, seed=args.seed,
checkpoint=args.out if args.store_each_iteration else None)
if args.out != '':
print('writing centroids to', args.out)
np.save(args.out, centroids)
if __name__ == '__main__':
main()


@@ -0,0 +1,70 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import os
import faiss
import numpy as np
import time
from faiss.contrib import rpc
import sys
import combined_index
import search_server
hostnames = sys.argv[1:]
print("Load local index")
ci = combined_index.CombinedIndexDeep1B()
print("connect to clients")
clients = []
for host in hostnames:
client = rpc.Client(host, 12012, v6=False)
clients.append(client)
# check if all servers respond
print("sizes seen by servers:", [cl.get_ntotal() for cl in clients])
# aggregate all clients into one object that uses them all for speed
# note that it also requires a local index ci
sindex = search_server.SplitPerListIndex(ci, clients)
sindex.verbose = True
# set reasonable parameters
ci.set_parallel_mode(1)
ci.set_prefetch_nthread(0)
ci.set_omp_num_threads(64)
# initialize params
sindex.set_parallel_mode(1)
sindex.set_prefetch_nthread(0)
sindex.set_omp_num_threads(64)
def ivecs_read(fname):
a = np.fromfile(fname, dtype='int32')
d = a[0]
return a.reshape(-1, d + 1)[:, 1:].copy()
def fvecs_read(fname):
return ivecs_read(fname).view('float32')
deep1bdir = "/datasets01_101/simsearch/041218/deep1b/"
xq = fvecs_read(deep1bdir + "deep1B_queries.fvecs")
gt_fname = deep1bdir + "deep1B_groundtruth.ivecs"
gt = ivecs_read(gt_fname)
for nprobe in 1, 10, 100, 1000:
sindex.set_nprobe(nprobe)
t0 = time.time()
D, I = sindex.search(xq, 100)
t1 = time.time()
print('nprobe=%d 1-recall@1=%.4f t=%.2fs' % (
nprobe, (I[:, 0] == gt[:, 0]).sum() / len(xq),
t1 - t0
))


@@ -0,0 +1,117 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import os
import time
import numpy as np
import faiss
import argparse
from multiprocessing.pool import ThreadPool
def ivecs_mmap(fname):
a = np.memmap(fname, dtype='int32', mode='r')
d = a[0]
return a.reshape(-1, d + 1)[:, 1:]
def fvecs_mmap(fname):
return ivecs_mmap(fname).view('float32')
def produce_batches(args):
x = fvecs_mmap(args.input)
if args.i1 == -1:
args.i1 = len(x)
print("Iterating on vectors %d:%d from %s by batches of size %d" % (
args.i0, args.i1, args.input, args.bs))
for j0 in range(args.i0, args.i1, args.bs):
j1 = min(j0 + args.bs, args.i1)
yield np.arange(j0, j1), x[j0:j1]
def rate_limited_iter(l):
'a thread pre-processes the next element'
pool = ThreadPool(1)
res = None
def next_or_None():
try:
return next(l)
except StopIteration:
return None
while True:
res_next = pool.apply_async(next_or_None)
if res is not None:
res = res.get()
if res is None:
return
yield res
res = res_next
deep1bdir = "/datasets01_101/simsearch/041218/deep1b/"
workdir = "/checkpoint/matthijs/ondisk_distributed/"
def main():
parser = argparse.ArgumentParser(
description='make index for a subset of the data')
def aa(*args, **kwargs):
group.add_argument(*args, **kwargs)
group = parser.add_argument_group('index type')
aa('--inputindex',
default=workdir + 'trained.faissindex',
help='empty input index to fill in')
aa('--nt', default=-1, type=int, help='nb of openmp threads to use')
group = parser.add_argument_group('db options')
aa('--input', default=deep1bdir + "base.fvecs")
aa('--bs', default=2**18, type=int,
help='batch size for db access')
aa('--i0', default=0, type=int, help='lower bound to index')
aa('--i1', default=-1, type=int, help='upper bound of vectors to index')
group = parser.add_argument_group('output')
aa('-o', default='/tmp/x', help='output index')
aa('--keepquantizer', default=False, action='store_true',
help='by default we remove the data from the quantizer to save space')
args = parser.parse_args()
print('args=', args)
print('start accessing data')
src = produce_batches(args)
print('loading index', args.inputindex)
index = faiss.read_index(args.inputindex)
if args.nt != -1:
faiss.omp_set_num_threads(args.nt)
t0 = time.time()
ntot = 0
for ids, x in rate_limited_iter(src):
print('add %d:%d (%.3f s)' % (ntot, ntot + ids.size, time.time() - t0))
index.add_with_ids(np.ascontiguousarray(x, dtype='float32'), ids)
ntot += ids.size
index_ivf = faiss.extract_index_ivf(index)
print('invlists stats: imbalance %.3f' % index_ivf.invlists.imbalance_factor())
index_ivf.invlists.print_stats()
if not args.keepquantizer:
print('resetting quantizer content')
index_ivf = faiss.extract_index_ivf(index)
index_ivf.quantizer.reset()
print('store output', args.o)
faiss.write_index(index, args.o)
if __name__ == '__main__':
main()


@@ -0,0 +1,52 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import numpy as np
import faiss
deep1bdir = "/datasets01_101/simsearch/041218/deep1b/"
workdir = "/checkpoint/matthijs/ondisk_distributed/"
print('Load centroids')
centroids = np.load(workdir + '1M_centroids.npy')
ncent, d = centroids.shape
print('apply random rotation')
rrot = faiss.RandomRotationMatrix(d, d)
rrot.init(1234)
centroids = rrot.apply_py(centroids)
print('make HNSW index as quantizer')
quantizer = faiss.IndexHNSWFlat(d, 32)
quantizer.hnsw.efSearch = 1024
quantizer.hnsw.efConstruction = 200
quantizer.add(centroids)
print('build index')
index = faiss.IndexPreTransform(
rrot,
faiss.IndexIVFScalarQuantizer(
quantizer, d, ncent, faiss.ScalarQuantizer.QT_6bit
)
)
def ivecs_mmap(fname):
a = np.memmap(fname, dtype='int32', mode='r')
d = a[0]
return a.reshape(-1, d + 1)[:, 1:]
def fvecs_mmap(fname):
return ivecs_mmap(fname).view('float32')
print('finish training index')
xt = fvecs_mmap(deep1bdir + 'learn.fvecs')
xt = np.ascontiguousarray(xt[:256 * 1000], dtype='float32')
index.train(xt)
print('write output')
faiss.write_index(index, workdir + 'trained.faissindex')


@@ -0,0 +1,96 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import os
import faiss
import argparse
from multiprocessing.pool import ThreadPool
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--inputs', nargs='*', required=True,
help='input indexes to merge')
parser.add_argument('--l0', type=int, default=0)
parser.add_argument('--l1', type=int, default=-1)
    parser.add_argument('--nt', default=-1, type=int,
                        help='nb threads')
parser.add_argument('--output', required=True,
help='output index filename')
parser.add_argument('--outputIL',
help='output invfile filename')
args = parser.parse_args()
    if args.nt != -1:
        print('set nb of threads to', args.nt)
        faiss.omp_set_num_threads(args.nt)
ils = faiss.InvertedListsPtrVector()
ils_dont_dealloc = []
pool = ThreadPool(20)
def load_index(fname):
print("loading", fname)
try:
index = faiss.read_index(fname, faiss.IO_FLAG_MMAP | faiss.IO_FLAG_READ_ONLY)
except RuntimeError as e:
print('could not load %s: %s' % (fname, e))
return fname, None
print(" %d entries" % index.ntotal)
return fname, index
index0 = None
for _, index in pool.imap(load_index, args.inputs):
if index is None:
continue
index_ivf = faiss.extract_index_ivf(index)
il = faiss.downcast_InvertedLists(index_ivf.invlists)
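        # detach the inverted lists from their index so they are not deleted
        # with it; ils_dont_dealloc keeps a reference so they stay alive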
index_ivf.invlists = None
il.this.own()
ils_dont_dealloc.append(il)
if (args.l0, args.l1) != (0, -1):
print('restricting to lists %d:%d' % (args.l0, args.l1))
# il = faiss.SliceInvertedLists(il, args.l0, args.l1)
il.crop_invlists(args.l0, args.l1)
ils_dont_dealloc.append(il)
ils.push_back(il)
if index0 is None:
index0 = index
print("loaded %d invlists" % ils.size())
if not args.outputIL:
args.outputIL = args.output + '_invlists'
il0 = ils.at(0)
il = faiss.OnDiskInvertedLists(
il0.nlist, il0.code_size,
args.outputIL)
print("perform merge")
ntotal = il.merge_from(ils.data(), ils.size(), True)
print("swap into index0")
index0_ivf = faiss.extract_index_ivf(index0)
index0_ivf.nlist = il0.nlist
index0_ivf.ntotal = index0.ntotal = ntotal
index0_ivf.invlists = il
index0_ivf.own_invlists = False
print("write", args.output)
faiss.write_index(index0, args.output)


@@ -0,0 +1,263 @@
#! /bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
set -e
todo=$1
# other options can be transmitted
shift
# the training data of the Deep1B dataset
deep1bdir=/datasets01_101/simsearch/041218/deep1b
traindata=$deep1bdir/learn.fvecs
# this is for small tests
nvec=1000000
k=4000
# for the real run
# nvec=50000000
# k=1000000
# working directory for the real run
workdir=/checkpoint/matthijs/ondisk_distributed
mkdir -p $workdir/{vslices,hslices}
if [ -z "$todo" ]; then
echo "nothing to do"
exit 1
elif [ $todo == test_kmeans_0 ]; then
# non distributed baseline
python distributed_kmeans.py \
--indata $traindata --i1 $nvec \
--k $k
elif [ $todo == test_kmeans_1 ]; then
# using all the machine's GPUs
python distributed_kmeans.py \
--indata $traindata --i1 $nvec \
--k $k --gpu -1
elif [ $todo == test_kmeans_2 ]; then
    # distributed run, with one local server per GPU
ngpu=$( echo /dev/nvidia? | wc -w )
baseport=12012
    # kill background processes on exit of this script
trap 'kill -HUP 0' 0
hostports=''
for((gpu=0;gpu<ngpu;gpu++)); do
        # range of vectors to assign to each server
i0=$((nvec * gpu / ngpu))
i1=$((nvec * (gpu + 1) / ngpu))
port=$(( baseport + gpu ))
echo "start server $gpu for range $i0:$i1"
python distributed_kmeans.py \
--indata $traindata \
--i0 $i0 --i1 $i1 \
--server --gpu $gpu \
--port $port --ipv4 &
hostports="$hostports localhost:$port"
done
# lame way of making sure all servers are running
sleep 5s
python distributed_kmeans.py \
--client --servers "$hostports" \
--k $k --ipv4
elif [ $todo == slurm_distributed_kmeans ]; then
nserv=5
srun -n$nserv \
--time=48:00:00 \
--cpus-per-task=40 --gres=gpu:4 --mem=100G \
--partition=priority --comment='priority is the only one that works' \
-l bash $( realpath $0 ) slurm_within_kmeans_server
elif [ $todo == slurm_within_kmeans_server ]; then
nserv=$SLURM_NPROCS
[ ! -z "$nserv" ] || (echo "should be run by slurm"; exit 1)
rank=$SLURM_PROCID
baseport=12012
i0=$((nvec * rank / nserv))
i1=$((nvec * (rank + 1) / nserv))
port=$(( baseport + rank ))
echo "host $(hostname) start server $rank for range $i0:$i1 port $port"
if [ $rank != 0 ]; then
python -u distributed_kmeans.py \
--indata $traindata \
--i0 $i0 --i1 $i1 \
--server --gpu -1 \
--port $port --ipv4
else
# master process
        # kill background processes on exit of this script
trap 'kill -HUP 0' 0
python -u distributed_kmeans.py \
--indata $traindata \
--i0 $i0 --i1 $i1 \
--server --gpu -1 \
--port $port --ipv4 &
# Slurm has a somewhat convoluted way of specifying the nodes
# assigned to each task. This is to parse the SLURM_TASKS_PER_NODE variable
function parse_tasks_per_node () {
local blocks=$1
for block in ${blocks//,/ }; do
if [ ${block/x/} != $block ]; then
tpn="${block%(*}"
repeat=${block#*x}
repeat=${repeat%?}
for((i=0;i<repeat;i++)); do
echo $tpn
done
else
echo $block
fi
done
}
hostports=""
port=$baseport
echo VARS $SLURM_TASKS_PER_NODE $SLURM_JOB_NODELIST
tasks_per_node=( $( parse_tasks_per_node $SLURM_TASKS_PER_NODE ) )
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
n=${#nodes[*]}
for((i=0;i<n;i++)); do
hostname=${nodes[i]}
for((j=0;j<tasks_per_node[i];j++)); do
hostports="$hostports $hostname:$port"
((port++))
done
done
echo HOSTPORTS $hostports
sleep 20s
# run client
python distributed_kmeans.py \
--client --servers "$hostports" \
--k $k --ipv4 "$@"
echo "Done, kill the job"
scancel $SLURM_JOBID
fi
elif [ $todo == deep1b_clustering ]; then
    # also set nvec=50000000 and k=1000000 at the top of the file
nserv=20
srun -n$nserv \
--time=48:00:00 \
--cpus-per-task=40 --gres=gpu:4 --mem=100G \
--partition=priority --comment='priority is the only one that works' \
-l bash $( realpath $0 ) slurm_within_kmeans_server \
--out $workdir/1M_centroids.npy
elif [ $todo == make_index_vslices ]; then
# vslice: slice per database shards
nvec=1000000000
nslice=200
for((i=0;i<nslice;i++)); do
i0=$((nvec * i / nslice))
i1=$((nvec * (i + 1) / nslice))
# make the script to be run by sbatch
cat > $workdir/vslices/slice$i.bash <<EOF
#!/bin/bash
srun python -u make_index_vslice.py \
--inputindex $workdir/trained.faissindex \
--input $deep1bdir/base.fvecs \
--nt 40 \
--i0 $i0 --i1 $i1 \
-o $workdir/vslices/slice$i.faissindex
EOF
# specify resources for script and run it
sbatch -n1 \
--time=48:00:00 \
--cpus-per-task=40 --gres=gpu:0 --mem=200G \
--output=$workdir/vslices/slice$i.log \
--job-name=vslice$i.c \
$workdir/vslices/slice$i.bash
echo "logs in $workdir/vslices/slice$i.log"
done
elif [ $todo == make_index_hslices ]; then
# hslice: slice per inverted lists
nlist=1000000
nslice=50
for((i=0;i<nslice;i++)); do
i0=$((nlist * i / nslice))
i1=$((nlist * (i + 1) / nslice))
# make the script to be run by sbatch
cat > $workdir/hslices/slice$i.bash <<EOF
#!/bin/bash
srun python -u merge_to_ondisk.py \
--input $workdir/vslices/slice{0..199}.faissindex \
--nt 20 \
--l0 $i0 --l1 $i1 \
--output $workdir/hslices/slice$i.faissindex \
--outputIL $workdir/hslices/slice$i.invlists
EOF
# specify resources for script and run it
sbatch -n1 \
--time=48:00:00 \
--cpus-per-task=20 --gres=gpu:0 --mem=200G \
--output=$workdir/hslices/slice$i.log \
--job-name=hslice$i.a \
--constraint=pascal \
$workdir/hslices/slice$i.bash
echo "logs in $workdir/hslices/slice$i.log"
done
elif [ $todo == run_search_servers ]; then
nserv=3
srun -n$nserv \
--time=48:00:00 \
--cpus-per-task=64 --gres=gpu:0 --mem=100G \
--constraint=pascal \
--partition=priority --comment='priority is the only one that works' \
-l python -u search_server.py --port 12012
else
echo "unknown todo $todo"
exit 1
fi


@@ -0,0 +1,222 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import time
from faiss.contrib import rpc
import combined_index
import argparse
############################################################
# Server implementation
############################################################
class MyServer(rpc.Server):
""" Assign version that can be exposed via RPC """
def __init__(self, s, index):
rpc.Server.__init__(self, s)
self.index = index
def __getattr__(self, f):
return getattr(self.index, f)
def main():
parser = argparse.ArgumentParser()
def aa(*args, **kwargs):
group.add_argument(*args, **kwargs)
group = parser.add_argument_group('server options')
aa('--port', default=12012, type=int, help='server port')
aa('--when_ready_dir', default=None,
help='store host:port to this file when ready')
aa('--ipv4', default=False, action='store_true', help='force ipv4')
aa('--rank', default=0, type=int,
help='rank used as index in the client table')
args = parser.parse_args()
when_ready = None
if args.when_ready_dir:
when_ready = '%s/%d' % (args.when_ready_dir, args.rank)
print('loading index')
index = combined_index.CombinedIndexDeep1B()
print('starting server')
rpc.run_server(
lambda s: MyServer(s, index),
args.port, report_to_file=when_ready,
v6=not args.ipv4)
if __name__ == '__main__':
main()
############################################################
# Client implementation
############################################################
from multiprocessing.pool import ThreadPool
import faiss
import numpy as np
class ResultHeap:
""" Combine query results from a sliced dataset (for k-nn search) """
def __init__(self, nq, k):
" nq: number of query vectors, k: number of results per query "
self.I = np.zeros((nq, k), dtype='int64')
self.D = np.zeros((nq, k), dtype='float32')
self.nq, self.k = nq, k
heaps = faiss.float_maxheap_array_t()
heaps.k = k
heaps.nh = nq
heaps.val = faiss.swig_ptr(self.D)
heaps.ids = faiss.swig_ptr(self.I)
heaps.heapify()
self.heaps = heaps
def add_batch_result(self, D, I, i0):
assert D.shape == (self.nq, self.k)
assert I.shape == (self.nq, self.k)
I += i0
self.heaps.addn_with_ids(
self.k, faiss.swig_ptr(D),
faiss.swig_ptr(I), self.k)
def finalize(self):
self.heaps.reorder()
def distribute_weights(weights, nbin):
""" assign a set of weights to a smaller set of bins to balance them """
nw = weights.size
o = weights.argsort()
bins = np.zeros(nbin)
assign = np.ones(nw, dtype=int)
for i in o[::-1]:
b = bins.argmin()
assign[i] = b
bins[b] += weights[i]
return bins, assign
class SplitPerListIndex:
"""manages a local index, that does the coarse quantization and a set
of sub_indexes. The sub_indexes search a subset of the inverted
lists. The SplitPerListIndex merges results from the sub-indexes"""
def __init__(self, index, sub_indexes):
self.index = index
self.code_size = faiss.extract_index_ivf(index.index).code_size
self.sub_indexes = sub_indexes
self.ni = len(self.sub_indexes)
# pool of threads. Each thread manages one sub-index.
self.pool = ThreadPool(self.ni)
self.verbose = False
def set_nprobe(self, nprobe):
self.index.set_nprobe(nprobe)
self.pool.map(
lambda i: self.sub_indexes[i].set_nprobe(nprobe),
range(self.ni)
)
def set_omp_num_threads(self, nt):
faiss.omp_set_num_threads(nt)
self.pool.map(
lambda idx: idx.set_omp_num_threads(nt),
self.sub_indexes
)
def set_parallel_mode(self, pm):
self.index.set_parallel_mode(pm)
self.pool.map(
lambda idx: idx.set_parallel_mode(pm),
self.sub_indexes
)
def set_prefetch_nthread(self, nt):
self.index.set_prefetch_nthread(nt)
self.pool.map(
lambda idx: idx.set_prefetch_nthread(nt),
self.sub_indexes
)
def balance_lists(self, list_nos):
big_il = self.index.big_il
weights = np.array([big_il.list_size(int(i))
for i in list_nos.ravel()])
bins, assign = distribute_weights(weights, self.ni)
if self.verbose:
print('bins weight range %d:%d total %d (%.2f MiB)' % (
bins.min(), bins.max(), bins.sum(),
bins.sum() * (self.code_size + 8) / 2 ** 20))
self.nscan = bins.sum()
return assign.reshape(list_nos.shape)
def search(self, x, k):
xqo, list_nos, coarse_dis = self.index.transform_and_assign(x)
assign = self.balance_lists(list_nos)
def do_query(i):
sub_index = self.sub_indexes[i]
list_nos_i = list_nos.copy()
list_nos_i[assign != i] = -1
t0 = time.time()
Di, Ii = sub_index.ivf_search_preassigned(
xqo, list_nos_i, coarse_dis, k)
#print(list_nos_i, Ii)
if self.verbose:
print('client %d: %.3f s' % (i, time.time() - t0))
return Di, Ii
rh = ResultHeap(x.shape[0], k)
for Di, Ii in self.pool.imap(do_query, range(self.ni)):
#print("ADD", Ii, rh.I)
rh.add_batch_result(Di, Ii, 0)
rh.finalize()
return rh.D, rh.I
def range_search(self, x, radius):
xqo, list_nos, coarse_dis = self.index.transform_and_assign(x)
assign = self.balance_lists(list_nos)
nq = len(x)
def do_query(i):
sub_index = self.sub_indexes[i]
list_nos_i = list_nos.copy()
list_nos_i[assign != i] = -1
t0 = time.time()
limi, Di, Ii = sub_index.ivf_range_search_preassigned(
xqo, list_nos_i, coarse_dis, radius)
if self.verbose:
print('slice %d: %.3f s' % (i, time.time() - t0))
return limi, Di, Ii
D = [[] for i in range(nq)]
I = [[] for i in range(nq)]
sizes = np.zeros(nq, dtype=int)
for lims, Di, Ii in self.pool.imap(do_query, range(self.ni)):
for i in range(nq):
l0, l1 = lims[i:i + 2]
D[i].append(Di[l0:l1])
I[i].append(Ii[l0:l1])
sizes[i] += l1 - l0
lims = np.zeros(nq + 1, dtype=int)
lims[1:] = np.cumsum(sizes)
D = np.hstack([j for i in D for j in i])
I = np.hstack([j for i in I for j in i])
return lims, D, I