- Test script to reproduce slow search performance issue
- Generates ~90K chunks (~180MB), similar to the user's dataset
- Tests search performance with different complexity values (8, 16, 32, 64)
- Demonstrates that complexity=16-32 achieves ~2s search time
- Validates the performance analysis findings
- Reproduced the slow search performance issue (15-30s vs expected ~2s)
- Identified root cause: default complexity=64 is too high for fast search
- Created test script demonstrating performance with different complexity values
- Test results show complexity=16-32 achieves ~2s search time (matching paper)
- Added comprehensive analysis document with solutions and recommendations
Key findings:
- Default complexity=64 results in ~36s search time
- Reducing complexity to 16-32 achieves ~2s search time
- The beam_width parameter applies mainly to DiskANN, not HNSW
- The paper likely used a smaller embedding model (~100M) and a lower complexity
Solutions provided:
1. Reduce the complexity parameter to 16-32 for faster search
2. Consider the DiskANN backend for better performance on large datasets
3. Use a smaller embedding model if speed is critical
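A minimal sketch of the benchmark loop this kind of test script runs, timing one search per complexity value; the searcher object and its search() signature are illustrative, not the exact LEANN API:

```python
import time

def benchmark_complexity(searcher, query, values=(8, 16, 32, 64), top_k=10):
    """Time one search per complexity value; returns {complexity: seconds}."""
    timings = {}
    for complexity in values:
        start = time.perf_counter()
        searcher.search(query, top_k=top_k, complexity=complexity)  # hypothetical signature
        timings[complexity] = time.perf_counter() - start
    return timings

# Usage (searcher construction omitted):
# for c, t in sorted(benchmark_complexity(searcher, "example query").items()):
#     print(f"complexity={c}: {t:.2f}s")
```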
* feat: enhance token limits with dynamic discovery + AST metadata
Improves upon upstream PR #154 with the following enhancements:
1. **Hybrid Token Limit Discovery**
- Dynamic: Query Ollama /api/show for context limits
- Fallback: Registry for LM Studio/OpenAI
- Zero maintenance for Ollama users
- Respects custom num_ctx settings
2. **AST Metadata Preservation**
- create_ast_chunks() returns dict format with metadata
- Preserves file_path, file_name, timestamps
- Includes astchunk metadata (line numbers, node counts)
- Fixes content extraction bug (checks "content" key)
- Enables --show-metadata flag
3. **Better Token Limits**
- nomic-embed-text: 2048 tokens (vs 512)
- nomic-embed-text-v1.5: 2048 tokens
- Added OpenAI models: 8192 tokens
4. **Comprehensive Tests**
- 11 tests for token truncation
- 545 new lines in test_astchunk_integration.py
- All metadata preservation tests passing
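A minimal sketch of the hybrid token-limit discovery from item 1 above: query Ollama's /api/show endpoint for the model's context length and fall back to a static registry when Ollama is unavailable. The endpoint and the "model_info" response field follow the public Ollama API; the registry contents and default value are illustrative, and a custom num_ctx override is not shown here.

```python
import requests

FALLBACK_LIMITS = {"nomic-embed-text": 2048, "text-embedding-3-small": 8192}  # illustrative values

def get_context_limit(model: str, host: str = "http://localhost:11434") -> int:
    """Prefer the limit reported by Ollama; otherwise use the registry."""
    try:
        resp = requests.post(f"{host}/api/show", json={"model": model}, timeout=5)
        resp.raise_for_status()
        model_info = resp.json().get("model_info", {})
        # Ollama reports "<architecture>.context_length", e.g. "llama.context_length"
        for key, value in model_info.items():
            if key.endswith(".context_length"):
                return int(value)
    except requests.RequestException:
        pass  # Ollama not reachable: fall back to the registry
    return FALLBACK_LIMITS.get(model, 512)  # conservative default
```

And an illustrative example of the dict format returned by create_ast_chunks() from item 2; the exact keys may differ, but the chunk text lives under "content" and the preserved file/AST details under "metadata":

```python
chunk = {
    "content": "def connect(host, port):\n    ...",
    "metadata": {
        "file_path": "src/net/client.py",        # preserved file info
        "file_name": "client.py",
        "last_modified": "2024-05-01T12:00:00Z",
        "start_line": 42,                         # astchunk-provided details
        "end_line": 87,
        "node_count": 5,
    },
}
```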
* fix: merge EMBEDDING_MODEL_LIMITS and remove redundant validation
- Merged upstream's model list with our corrected token limits
- Kept our corrected nomic-embed-text: 2048 (not 512)
- Removed post-chunking validation (redundant with embedding-time truncation)
- All tests passing except 2 pre-existing integration test failures
* style: apply ruff formatting and restore PR #154 version handling
- Remove duplicate truncate_to_token_limit and get_model_token_limit functions
- Restore version handling logic (model:latest -> model) from PR #154
- Restore partial matching fallback for model name variations
- Apply ruff formatting to all modified files
- All 11 token truncation tests passing
* style: sort imports alphabetically (pre-commit auto-fix)
* fix: show AST token limit warning only once per session
- Add module-level flag to track if warning shown
- Prevents spam when processing multiple files
- Add clarifying note that auto-truncation happens at embedding time
- Addresses issue where warning appeared for every code file
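A minimal sketch of the once-per-session pattern described above, assuming a module-level flag (names are illustrative):

```python
import logging

logger = logging.getLogger(__name__)
_ast_limit_warning_shown = False  # module-level state: reset only when the module is reloaded

def warn_ast_token_limit_once(limit: int) -> None:
    """Emit the AST token-limit warning at most once per session."""
    global _ast_limit_warning_shown
    if _ast_limit_warning_shown:
        return
    _ast_limit_warning_shown = True
    logger.warning(
        "Some AST chunks exceed %d tokens; they will be auto-truncated at embedding time.",
        limit,
    )
```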
* enhance: add detailed logging for token truncation
- Track and report truncation statistics (count, tokens removed, max length)
- Show first 3 individual truncations with exact token counts
- Provide comprehensive summary when truncation occurs
- Use WARNING level for data loss visibility
- Silent (DEBUG level only) when no truncation needed
Replaces misleading "truncated where necessary" message that appeared
even when nothing was truncated.
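A sketch of the reporting logic described above: detail the first few truncations, summarize the rest at WARNING level, and stay at DEBUG when nothing was truncated. The function and argument names are illustrative.

```python
import logging

logger = logging.getLogger(__name__)

def report_truncation(original_lengths, truncated_lengths, limit, detail_count=3):
    """original_lengths/truncated_lengths are per-chunk token counts before/after truncation."""
    truncated = [
        (i, before, after)
        for i, (before, after) in enumerate(zip(original_lengths, truncated_lengths))
        if before > after
    ]
    if not truncated:
        logger.debug("No truncation needed (limit=%d tokens).", limit)
        return
    for i, before, after in truncated[:detail_count]:
        logger.warning("Chunk %d truncated from %d to %d tokens.", i, before, after)
    removed = sum(before - after for _, before, after in truncated)
    longest = max(before for _, before, _after in truncated)
    logger.warning(
        "Truncated %d/%d chunks: %d tokens removed, longest original %d tokens (limit %d).",
        len(truncated), len(original_lengths), removed, longest, limit,
    )
```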
* feat: add metadata output to search results
- Add --show-metadata flag to display file paths in search results
- Preserve document metadata (file_path, file_name, timestamps) during chunking
- Update MCP tool schema to support show_metadata parameter
- Enhance CLI search output to display metadata when requested
- Fix pre-existing bug: args.backend -> args.backend_name
Resolves yichuan-w/LEANN#144
* fix: resolve ZMQ linking issues in Python extension
- Use pkg_check_modules IMPORTED_TARGET to create PkgConfig::ZMQ
- Set PKG_CONFIG_PATH to prioritize ARM64 Homebrew on Apple Silicon
- Override macOS -undefined dynamic_lookup to force proper symbol resolution
- Use PUBLIC linkage for ZMQ in faiss library for transitive linking
- Mark cppzmq includes as SYSTEM to suppress warnings
Fixes editable install ZMQ symbol errors while maintaining compatibility
across Linux, macOS Intel, and macOS ARM64 platforms.
* style: apply ruff formatting
* chore: update faiss submodule to use ww2283 fork
Use ww2283/faiss fork with fix/zmq-linking branch to resolve CI checkout
failures. The ZMQ linking fixes are not yet merged upstream.
* feat: implement true batch processing for Ollama embeddings
Migrate from deprecated /api/embeddings to modern /api/embed endpoint
which supports batch inputs. This reduces HTTP overhead by sending
32 texts per request instead of making individual API calls.
Changes:
- Update endpoint from /api/embeddings to /api/embed
- Change parameter from 'prompt' (single) to 'input' (array)
- Update response parsing for batch embeddings array
- Increase timeout to 60s for batch processing
- Improve error handling for batch requests
Performance:
- Reduces API calls by 32x (batch size)
- Eliminates HTTP connection overhead per text
- Note: Ollama still processes batch items sequentially internally
Related: #151
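A minimal sketch of the batched call against /api/embed; the endpoint name, the 'input' array parameter, and the 'embeddings' response field follow the public Ollama API, while the batch size and error handling shown here are illustrative:

```python
import requests

def embed_batch(texts, model, host="http://localhost:11434", batch_size=32):
    """Embed texts with one HTTP request per batch of batch_size."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        resp = requests.post(
            f"{host}/api/embed",
            json={"model": model, "input": batch},  # 'input' accepts a list of texts
            timeout=60,  # batches take longer than single-text requests
        )
        resp.raise_for_status()
        embeddings.extend(resp.json()["embeddings"])
    return embeddings
```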
* Fall back to the original faiss submodule now that the PR is merged
---------
Co-authored-by: yichuan520030910320 <yichuan_wang@berkeley.edu>
Remove paru-bin directory that was incorrectly added as a git submodule.
This directory is an AUR build artifact and should not be tracked.
Co-Authored-By: Claude <noreply@anthropic.com>
* Printing query time
* Adding source name to chunks
Adding source name as metadata to chunks, then printing the sources when searching
* Printing the context provided to the LLM
To check the data transmitted to the LLM: display the relevance, ID, content, and source of each chunk sent.
* Correcting source as metadata for chunks
* Applying ruff format
* Applying Ruff formatting
* Ruff formatting
* fix: Fix Twitter bookmarks anchor link
- Convert Twitter Bookmarks from collapsible details to proper header
- Update internal link to match new anchor format
- Ensures external links to #twitter-bookmarks-your-personal-tweet-library work correctly
Fixes broken link: https://github.com/yichuan-w/LEANN?tab=readme-ov-file#twitter-bookmarks-your-personal-tweet-library
* fix: Fix Slack messages anchor link as well
- Convert Slack Messages from collapsible details to proper header
- Update internal link to match new anchor format
- Ensures external links to #slack-messages-search-your-team-conversations work correctly
Both Twitter and Slack MCP sections now have reliable anchor links.
* fix: Point Slack and Twitter links to main MCP section
- Both Slack and Twitter are subsections under MCP Integration
- Links should point to #mcp-integration-rag-on-live-data-from-any-platform
- Users will land on the MCP section and can find both Slack and Twitter subsections there
This matches the actual document structure where Slack and Twitter are under the MCP Integration section.
* Improve Slack MCP integration with retry logic and comprehensive setup guide
- Add retry mechanism with exponential backoff for cache sync issues
- Handle 'users cache is not ready yet' errors gracefully
- Add max-retries and retry-delay CLI arguments for better control
- Create comprehensive Slack setup guide with troubleshooting
- Update README with link to detailed setup guide
- Improve error messages and user experience
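A sketch of the retry mechanism described above, using exponential backoff for the "users cache is not ready yet" case; the fetch callable and the exception type are illustrative stand-ins for the actual Slack MCP calls:

```python
import time

def fetch_with_retry(fetch, max_retries=5, retry_delay=2.0):
    """Call fetch() until it succeeds or max_retries is exhausted."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RuntimeError as exc:  # e.g. "users cache is not ready yet"
            if attempt == max_retries - 1:
                raise
            wait = retry_delay * (2 ** attempt)  # exponential backoff
            print(f"{exc} (attempt {attempt + 1}/{max_retries}); retrying in {wait:.0f}s")
            time.sleep(wait)
```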
* Fix trailing whitespace in slack setup guide
Pre-commit hooks formatting fixes
* Add comprehensive Slack setup guide with success screenshot
- Create detailed setup guide with step-by-step instructions
- Add troubleshooting section for common issues like cache sync errors
- Include real terminal output example from successful integration
- Add screenshot showing VS Code interface with Slack channel data
- Remove excessive emojis for more professional documentation
- Document retry logic improvements and CLI arguments
* Fix formatting issues in Slack setup guide
- Remove trailing whitespace
- Fix end of file formatting
- Pre-commit hooks formatting fixes
* Add real RAG example showing intelligent Slack query functionality
- Add detailed example of asking 'What is LEANN about?'
- Show retrieved messages from Slack channels
- Demonstrate intelligent answer generation based on context
- Add command example for running real RAG queries
- Explain the 4-step process: retrieve, index, generate, cite
* Update Slack setup guide with bot invitation requirements
- Add important section about inviting bot to channels before RAG queries
- Explain the 'not_in_channel' errors and their meaning
- Provide clear steps for bot invitation process
- Document realistic scenario where bot needs explicit channel access
- Update documentation to be more professional and less cursor-style
* Docs: add real RAG example for Sky Lab #random
- Embed screenshot videos/rag-sky-random.png
- Add step-by-step commands and notes
- Include helper test script tests/test_channel_by_id_or_name.py
- Redact example tokens from docs
* Docs/CI: fix broken image paths and ruff lint
- Move screenshot to docs/videos and update references
- Remove obsolete rag-query-results image
- Rename variable to satisfy ruff
* Docs: fix image path for lychee (use videos/ relative under docs/)
* Docs: finalize Slack setup guide with Sky random RAG example and image path fixes
- Redact example tokens from docs
* Fix Slack MCP integration and update documentation
- Fix SlackMCPReader to use conversations_history instead of channels_list
- Add fallback imports for leann.interactive_utils and leann.settings
- Update slack-setup-guide.md with real screenshots and improved text
- Remove old screenshot files
* Add Slack integration screenshots to docs/videos
- Add slack_integration.png showing RAG query results
- Add slack_integration_2.png showing additional demo functionality
- Fixes lychee link checker errors for missing image files
* Update Slack integration screenshot with latest changes
* Remove test_channel_by_id_or_name.py
- Clean up temporary test file that was used for debugging
- Keep only the main slack_rag.py application for production use
* Update Slack RAG example to show LEANN announcement retrieval
- Change query from 'PUBPOL 290' to 'What is LEANN about?' for more challenging retrieval
- Update command to use python -m apps.slack_rag instead of test script
- Add expected response showing Yichuan Wang's LEANN announcement message
- Emphasize this demonstrates ability to find specific announcements in conversation history
- Update description to highlight challenging query capabilities
* Update Slack RAG integration with improved CSV parsing and new screenshots
- Fixed CSV message parsing in slack_mcp_reader.py to properly handle individual messages
- Updated slack_rag.py to filter empty channel strings
- Enhanced slack-setup-guide.md with two new query examples:
- Advisor Models query: 'train black-box models to adopt to your personal data'
- Barbarians at the Gate query: 'AI-driven research systems ADRS'
- Replaced old screenshots with four new ones showing both query examples
- Updated documentation to use User OAuth Token (xoxp-) instead of Bot Token (xoxb-)
- Added proper command examples with --no-concatenate-conversations and --force-rebuild flags
* Update Slack RAG documentation with Ollama integration and new screenshots
- Updated slack-setup-guide.md with comprehensive Ollama setup instructions
- Added 6 new screenshots showing complete RAG workflow:
- Command setup, search results, and LLM responses for both queries
- Removed simulated LLM references, now uses real Ollama with llama3.2:1b
- Enhanced documentation with step-by-step Ollama installation
- Updated troubleshooting checklist to include Ollama-specific checks
- Fixed command syntax and added proper Ollama configuration
- Demonstrates working Slack RAG with real AI-generated responses
* Remove Key Features section from Slack RAG examples
- Simplified documentation by removing the bullet point list
- Keeps the focus on the actual examples and screenshots
BREAKING CHANGE: trust_remote_code now defaults to False for security
- Set trust_remote_code=False by default in HFChat class
- Add explicit trust_remote_code parameter to HFChat.__init__()
- Add security warning when trust_remote_code=True is used
- Update get_llm() function to support trust_remote_code parameter
- Update benchmark utilities (load_hf_model, load_vllm_model, load_qwen_vl_model)
- Add comprehensive documentation for security implications
Security Benefits:
- Prevents arbitrary code execution from compromised model repositories
- Requires explicit opt-in for models that need remote code execution
- Shows clear warnings when security is reduced
- Follows security-by-default principle
Migration Guide:
- Most users: No changes needed (more secure by default)
- Users with models requiring remote code: Add trust_remote_code=True explicitly
- Config users: Add 'trust_remote_code': true to LLM config if needed
Fixes #136
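A simplified sketch of the security-by-default pattern above; the transformers from_pretrained calls and their trust_remote_code parameter are the standard API, but the HFChat constructor shown here is reduced and illustrative rather than the project's exact code:

```python
import warnings
from transformers import AutoModelForCausalLM, AutoTokenizer

class HFChat:
    def __init__(self, model_name: str, trust_remote_code: bool = False):
        if trust_remote_code:
            warnings.warn(
                "trust_remote_code=True lets the model repository execute arbitrary "
                "code on your machine; only enable it for repositories you trust."
            )
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name, trust_remote_code=trust_remote_code
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, trust_remote_code=trust_remote_code
        )
```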
* feat: Add MCP integration support for Slack and Twitter
- Implement SlackMCPReader for connecting to Slack MCP servers
- Implement TwitterMCPReader for connecting to Twitter MCP servers
- Add SlackRAG and TwitterRAG applications with full CLI support
- Support live data fetching via Model Context Protocol (MCP)
- Add comprehensive documentation and usage examples
- Include connection testing capabilities with --test-connection flag
- Add standalone tests for core functionality
- Update README with detailed MCP integration guide
- Add Aakash Suresh to Active Contributors
Resolves #36
* fix: Resolve linting issues in MCP integration
- Replace deprecated typing.Dict/List with built-in dict/list
- Fix boolean comparisons (== True/False) to direct checks
- Remove unused variables in demo script
- Update type annotations to use modern Python syntax
All pre-commit hooks should now pass.
* fix: Apply final formatting fixes for pre-commit hooks
- Remove unused imports (asyncio, pathlib.Path)
- Remove unused class imports in demo script
- Ensure all files pass ruff format and pre-commit checks
This should resolve all remaining CI linting issues.
* fix: Apply pre-commit formatting changes
- Fix trailing whitespace in all files
- Apply ruff formatting to match project standards
- Ensure consistent code style across all MCP integration files
This commit applies the exact changes that pre-commit hooks expect.
* fix: Apply pre-commit hooks formatting fixes
- Remove trailing whitespace from all files
- Fix ruff formatting issues (2 errors resolved)
- Apply consistent code formatting across 3 files
- Ensure all files pass pre-commit validation
This resolves all CI formatting failures.
* fix: Update MCP RAG classes to match BaseRAGExample signature
- Fix SlackMCPRAG and TwitterMCPRAG __init__ methods to provide required parameters
- Add name, description, and default_index_name to super().__init__ calls
- Resolves test failures: test_slack_rag_initialization and test_twitter_rag_initialization
This fixes the TypeError caused by BaseRAGExample requiring additional parameters.
* style: Apply ruff formatting - add trailing commas
- Add trailing commas to super().__init__ calls in SlackMCPRAG and TwitterMCPRAG
- Fixes ruff format pre-commit hook requirements
* fix: Resolve SentenceTransformer model_kwargs parameter conflict
- Fix local_files_only parameter conflict in embedding_compute.py
- Create separate copies of model_kwargs and tokenizer_kwargs for local vs network loading
- Prevents parameter conflicts when falling back from local to network loading
- Resolves TypeError in test_readme_examples.py tests
This addresses the SentenceTransformer initialization issues in CI tests.
* fix: Add comprehensive SentenceTransformer version compatibility
- Handle both old and new sentence-transformers versions
- Gracefully fallback from advanced parameters to basic initialization
- Catch TypeError for model_kwargs/tokenizer_kwargs and use basic SentenceTransformer init
- Ensures compatibility across different CI environments and local setups
- Maintains optimization benefits where supported while ensuring broad compatibility
This resolves test failures in CI environments with older sentence-transformers versions.
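A sketch combining the two fixes above: pass copies of model_kwargs/tokenizer_kwargs so retry paths never share mutated dicts, and fall back to a plain constructor when an older sentence-transformers release rejects those keywords. The function name and defaults are illustrative.

```python
from sentence_transformers import SentenceTransformer

def load_embedding_model(model_name, model_kwargs=None, tokenizer_kwargs=None):
    """Try the newer keyword arguments first; fall back for older sentence-transformers."""
    # Copies, so the first attempt and any fallback never see each other's mutations.
    mk = dict(model_kwargs or {})
    tk = dict(tokenizer_kwargs or {})
    try:
        return SentenceTransformer(model_name, model_kwargs=mk, tokenizer_kwargs=tk)
    except TypeError:
        # Older sentence-transformers releases don't accept model_kwargs/tokenizer_kwargs.
        return SentenceTransformer(model_name)
```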
* style: Apply ruff formatting to embedding_compute.py
- Break long logger.warning lines for better readability
- Fixes pre-commit hook formatting requirements
* docs: Comprehensive documentation improvements for better user experience
- Add clear step-by-step Getting Started Guide for new users
- Add comprehensive CLI Reference with all commands and options
- Improve installation instructions with clear steps and verification
- Add detailed troubleshooting section for common issues (Ollama, OpenAI, etc.)
- Clarify difference between CLI commands and specialized apps
- Add environment variables documentation
- Improve MCP integration documentation with CLI integration examples
- Address user feedback about confusing installation and setup process
This resolves documentation gaps that made LEANN difficult for non-specialists to use.
* style: Remove trailing whitespace from README.md
- Fix trailing whitespace issues found by pre-commit hooks
- Ensures consistent formatting across documentation
* docs: Simplify README by removing excessive documentation
- Remove overly complex CLI reference and getting started sections (lines 61-334)
- Remove emojis from section headers for cleaner appearance
- Keep README simple and focused as requested
- Maintain essential MCP integration documentation
This addresses feedback to keep documentation minimal and avoid auto-generated content.
* docs: Address maintainer feedback on README improvements
- Restore emojis in section headers (Prerequisites and Quick Install)
- Add MCP live data feature mention in line 23 with links to Slack and Twitter
- Add detailed API credential setup instructions for Slack:
- Step-by-step Slack App creation process
- Required OAuth scopes and permissions
- Clear token identification (xoxb- vs xapp-)
- Add detailed API credential setup instructions for Twitter:
- Twitter Developer Account application process
- API v2 requirements for bookmarks access
- Required permissions and scopes
This addresses maintainer feedback to make API setup more user-friendly.
* Add readline support to interactive command line interfaces
- Implement readline history, navigation, and editing for CLI, API, and RAG chat modes
- Create shared InteractiveSession class to consolidate readline functionality
- Add command history persistence across sessions with separate files per context
- Support built-in commands: help, clear, history, quit/exit
- Enable arrow key navigation and command editing in all interactive modes
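A minimal sketch of the shared readline setup behind an InteractiveSession-style class: a per-context history file, persistence across sessions via atexit, and the arrow-key editing that readline attaches to input() automatically. The history path and class shape are illustrative.

```python
import atexit
import os
import readline

class InteractiveSession:
    def __init__(self, context: str):
        self.history_file = os.path.expanduser(f"~/.leann_history_{context}")  # illustrative path
        if os.path.exists(self.history_file):
            readline.read_history_file(self.history_file)
        readline.set_history_length(1000)
        atexit.register(readline.write_history_file, self.history_file)

    def prompt(self, text: str = "> ") -> str:
        # readline provides history navigation and in-line editing for input()
        return input(text)
```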
* Improvements based on feedback
* system wide semantic file search with temporal awareness
* ruff checking passed
* graceful exit for empty dump
* error thrown for time-only search
* fixes
* feat: finance bench
* docs: results
* chore: ignore data README
* feat: fix financebench
* feat: laion, also required idmaps support
* style: format
* style: format
* fix: resolve ruff linting errors
- Remove unused variables in benchmark scripts
- Rename unused loop variables to follow convention
* feat: enron email bench
* experiments for running DiskANN & BM25 on Arch 4090
* style: format
* chore(ci): remove paru-bin submodule and config to fix checkout --recurse-submodules
* docs: data
* docs: data updated
* fix: as package
* fix(ci): only run pre-commit
* chore: use http url of astchunk; use group for some dev deps
* fix(ci): should check out submodules as well since `uv sync` checks them
* fix(ci): run with lint only
* fix: use find-links to install available wheels
* CI: force local wheels in uv install step
* CI: install local wheels via file paths
* CI: pick wheels matching current Python tag
* CI: handle python tag mismatches for local wheels
* CI: use matrix python venv and set macOS deployment target
* CI: revert install step to match main
* CI: use uv group install with local wheel selection
* CI: rely on setup-uv for Python and tighten group install
* CI: install build deps with uv python interpreter
* CI: use temporary uv venv for build deps
* CI: add build venv scripts path for wheel repair
* feat: Add GitHub PR and issue templates for better contributor experience
* simplify: Make templates more concise and user-friendly
* fix: enable is_compact=False, is_recompute=True
* feat: update when recompute
* test
* fix: real recompute
* refactor
* fix: compare with no-recompute
* fix: test