Commit Graph

34 Commits

Author SHA1 Message Date
Andy Lee
46f6f76fc3 refactor: Unify examples interface with BaseRAGExample
- Create BaseRAGExample base class for all RAG examples
- Refactor 4 examples to use unified interface:
  - document_rag.py (replaces main_cli_example.py)
  - email_rag.py (replaces mail_reader_leann.py)
  - browser_rag.py (replaces google_history_reader_leann.py)
  - wechat_rag.py (replaces wechat_history_reader_leann.py)
- Maintain 100% parameter compatibility with original files
- Add interactive mode support for all examples
- Unify parameter names (--max-items replaces --max-emails/--max-entries)
- Update README.md with new examples usage
- Add PARAMETER_CONSISTENCY.md documenting all parameter mappings
- Keep main_cli_example.py for backward compatibility with migration notice

All default values, LeannBuilder parameters, and chunking settings
remain identical to ensure full compatibility with existing indexes.
2025-07-28 23:11:16 -07:00
yichuan520030910320
8356e3c668 changr to openai main cli 2025-07-28 17:39:14 -07:00
Andy Lee
4671ed9b36 Fix macos ABI by using system default clang (#11)
* fix: auto-detect normalized embeddings and use cosine distance

- Add automatic detection for normalized embedding models (OpenAI, Voyage AI, Cohere)
- Automatically set distance_metric='cosine' for normalized embeddings
- Add warnings when using non-optimal distance metrics
- Implement manual L2 normalization in HNSW backend (custom Faiss build lacks normalize_L2)
- Fix DiskANN zmq_port compatibility with lazy loading strategy
- Add documentation for normalized embeddings feature

This fixes the low accuracy issue when using OpenAI text-embedding-3-small model with default MIPS metric.

* style: format

* feat: add OpenAI embeddings support to google_history_reader_leann.py

- Add --embedding-model and --embedding-mode arguments
- Support automatic detection of normalized embeddings
- Works correctly with cosine distance for OpenAI embeddings

* feat: add --use-existing-index option to google_history_reader_leann.py

- Allow using existing index without rebuilding
- Useful for testing pre-built indices

* fix: Improve OpenAI embeddings handling in HNSW backend

* fix: improve macOS C++ compatibility and add CI tests

* refactor: improve test structure and fix main_cli example

- Move pytest configuration from pytest.ini to pyproject.toml
- Remove unnecessary run_tests.py script (use test extras instead)
- Fix main_cli_example.py to properly use command line arguments for LLM config
- Add test_readme_examples.py to test code examples from README
- Refactor tests to use pytest fixtures and parametrization
- Update test documentation to reflect new structure
- Set proper environment variables in CI for test execution

* fix: add --distance-metric support to DiskANN embedding server and remove obsolete macOS ABI test markers

- Add --distance-metric parameter to diskann_embedding_server.py for consistency with other backends
- Remove pytest.skip and pytest.xfail markers for macOS C++ ABI issues as they have been fixed
- Fix test assertions to handle SearchResult objects correctly
- All tests now pass on macOS with the C++ ABI compatibility fixes

* chore: update lock file with test dependencies

* docs: remove obsolete C++ ABI compatibility warnings

- Remove outdated macOS C++ compatibility warnings from README
- Simplify CI workflow by removing macOS-specific failure handling
- All tests now pass consistently on macOS after ABI fixes

* fix: update macOS deployment target for DiskANN to 13.3

- DiskANN uses sgesdd_ LAPACK function which is only available on macOS 13.3+
- Update MACOSX_DEPLOYMENT_TARGET from 11.0 to 13.3 for DiskANN builds
- This fixes the compilation error on GitHub Actions macOS runners

* fix: align Python version requirements to 3.9

- Update root project to support Python 3.9, matching subpackages
- Restore macOS Python 3.9 support in CI
- This fixes the CI failure for Python 3.9 environments

* fix: handle MPS memory issues in CI tests

- Use smaller MiniLM-L6-v2 model (384 dimensions) for README tests in CI
- Skip other memory-intensive tests in CI environment
- Add minimal CI tests that don't require model loading
- Set CI environment variable and disable MPS fallback
- Ensure README examples always run correctly in CI

* fix: remove Python 3.10+ dependencies for compatibility

- Comment out llama-index-readers-docling and llama-index-node-parser-docling
- These packages require Python >= 3.10 and were causing CI failures on Python 3.9
- Regenerate uv.lock file to resolve dependency conflicts

* fix: use virtual environment in CI instead of system packages

- uv-managed Python environments don't allow --system installs
- Create and activate virtual environment before installing packages
- Update all CI steps to use the virtual environment

* add some env in ci

* fix: use --find-links to install platform-specific wheels

- Let uv automatically select the correct wheel for the current platform
- Fixes error when trying to install macOS wheels on Linux
- Simplifies the installation logic

* fix: disable OpenMP parallelism in CI to avoid libomp crashes

- Set OMP_NUM_THREADS=1 to avoid OpenMP thread synchronization issues
- Set MKL_NUM_THREADS=1 for single-threaded MKL operations
- This prevents segfaults in LayerNorm on macOS CI runners
- Addresses the libomp compatibility issues with PyTorch on Apple Silicon

* skip several macos test because strange issue on ci

---------

Co-authored-by: yichuan520030910320 <yichuan_wang@berkeley.edu>
2025-07-28 17:14:42 -07:00
Andy Lee
5c8921673a fix: auto-detect normalized embeddings and use cosine distance (#8)
* fix: auto-detect normalized embeddings and use cosine distance

- Add automatic detection for normalized embedding models (OpenAI, Voyage AI, Cohere)
- Automatically set distance_metric='cosine' for normalized embeddings
- Add warnings when using non-optimal distance metrics
- Implement manual L2 normalization in HNSW backend (custom Faiss build lacks normalize_L2)
- Fix DiskANN zmq_port compatibility with lazy loading strategy
- Add documentation for normalized embeddings feature

This fixes the low accuracy issue when using OpenAI text-embedding-3-small model with default MIPS metric.

* style: format
2025-07-27 21:19:29 -07:00
Andy Lee
b3e9ee96fa fix: resolve all ruff linting errors and add lint CI check
- Fix ambiguous fullwidth characters (commas, parentheses) in strings and comments
- Replace Chinese comments with English equivalents
- Fix unused imports with proper noqa annotations for intentional imports
- Fix bare except clauses with specific exception types
- Fix redefined variables and undefined names
- Add ruff noqa annotations for generated protobuf files
- Add lint and format check to GitHub Actions CI pipeline
2025-07-26 22:38:13 -07:00
yichuan520030910320
b6d43f5fd9 add gif 2025-07-25 00:12:35 -07:00
yichuan520030910320
0544f96b79 default main cli to openai add data dict as a args 2025-07-22 21:56:30 -07:00
yichuan520030910320
2ebb29de65 default main cli to openai 2025-07-22 21:55:18 -07:00
Andy Lee
43155d2811 fix: supress resources leak logs 2025-07-22 19:53:45 -07:00
Andy Lee
b3970793cf fix: cache the loaded model 2025-07-21 21:20:53 -07:00
yichuan520030910320
90d9f27383 update readme and main example 2025-07-17 15:03:22 -07:00
Andy Lee
f77c4e38cb perf: reuse embedding server for query embed 2025-07-16 16:12:15 -07:00
Andy Lee
e595bbb5fb feat: hint for users about wrong model name 2025-07-15 22:40:40 -07:00
Andy Lee
125c1f6f25 fix: model name 2025-07-15 21:48:45 -07:00
yichuan520030910320
dec3ee85fd fix main cli 2025-07-15 21:19:16 -07:00
Andy Lee
53c58fa755 perf: switch to tranditional pdf reader 2025-07-13 17:04:23 -07:00
Andy Lee
eb6f504789 Datastore reproduce (#3)
* fix: diskann zmq port and passages

* feat: auto discovery of packages and fix passage gen for diskann

* docs: embedding pruning

* refactor: passage structure

* feat: reproducible research datas, rpj_wiki & dpr

* refactor: chat and base searcher

* feat: chat on mps
2025-07-11 23:37:23 -07:00
yichuan520030910320
dafb2aacab update macos env 2025-07-08 14:37:41 -07:00
yichuan520030910320
6497e17671 add gpu chunk embedd and add complexity in hnsw 2025-07-08 18:40:52 +00:00
yichuan520030910320
dfca00c21b add mac support in this repo 2025-07-07 18:22:24 -07:00
yichuan520030910320
637dab379e add workaround code 2025-07-07 23:13:47 +00:00
yichuan520030910320
6fc57eb48e add reuse code 2025-07-07 21:07:00 +00:00
yichuan520030910320
95a653993a rm useless 2025-07-06 06:47:20 +00:00
Andy Lee
cf17c85607 Make DiskANN and HNSW work on main example (#2)
* fix: diskann zmq port and passages

* feat: auto discovery of packages and fix passage gen for diskann
2025-07-05 22:18:12 -07:00
yichuan520030910320
df63526503 merge main 2025-07-06 00:50:58 +00:00
yichuan520030910320
e92deee1e8 fix larger file read and add faq 2025-07-06 00:48:57 +00:00
Andy Lee
0aa84e147b feat: hnsw embedding server and csr format 2025-07-05 23:04:41 +00:00
yichuan520030910320
368474d036 fix larger file read and add faq 2025-07-03 23:25:36 +00:00
yichuan520030910320
a627abe794 fix file path bug still compatiable bug in hnsw search 2025-07-03 02:02:42 +00:00
yichuan520030910320
44815ee7fd add configuable funcname 2025-07-02 05:18:00 +00:00
yichuan520030910320
371e3de04e add configuable funcname 2025-07-01 05:02:01 +00:00
yichuan520030910320
b81b5d0f86 256 cannot work but increase chunk size can 2025-07-01 04:09:18 +00:00
yichuan520030910320
ee507bfe7a Initial commit 2025-06-30 11:01:12 +00:00
yichuan520030910320
46f6cc100b Initial commit 2025-06-30 09:05:05 +00:00