fix: improve gitignore and Jupyter notebook support

- Add nbconvert dependency for .ipynb file support
- Replace manual gitignore parsing with gitignore-parser library
- Proper recursive .gitignore handling (all subdirectories)
- Fix compliance with Git gitignore behavior
- Simplify code and improve reliability

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
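The simplification in this change comes from treating the ignore logic as a single callable: whichever matcher is built, the exclusion check is one call on the path. The sketch below illustrates that shape using only the fallback pattern set from the diff. The names `build_fallback_matcher` and `should_exclude` are illustrative and do not appear in the codebase, and the sketch deliberately leaves out the gitignore-parser path.

```python
from pathlib import Path
from typing import Callable

# Names that are always skipped; this set mirrors the fallback in the diff.
ESSENTIAL_PATTERNS = {".git", ".DS_Store", "__pycache__", "node_modules", ".venv", "venv"}


def build_fallback_matcher() -> Callable[[str], bool]:
    """Return a matcher that flags a path if any component is an essential pattern."""

    def basic_matches(file_path: str) -> bool:
        # Compare whole path components, so ".gitignore" does not match ".git".
        return any(part in ESSENTIAL_PATTERNS for part in Path(file_path).parts)

    return basic_matches


def should_exclude(relative_path: Path, matcher: Callable[[str], bool]) -> bool:
    """The exclusion check is a single call, whichever matcher was built."""
    return matcher(str(relative_path))


matcher = build_fallback_matcher()
print(should_exclude(Path("src/__pycache__/mod.pyc"), matcher))  # True
print(should_exclude(Path("src/main.py"), matcher))  # False
```

Because the comparison is per component rather than by prefix, a directory such as `venv_tools` is not excluded, while `venv` at any depth is.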
[Readme]update embedding model config according to reddit feedback
2025-08-10 18:52:55 -07:00 · 2025-08-09 21:33:33 -07:00 · 2025-08-10 03:39:45 +00:00 · 2025-08-09 20:37:17 -07:00
9 changed files with 3900 additions and 3543 deletions
@@ -189,7 +189,7 @@ All RAG examples share these common parameters. **Interactive mode** is availabl
 --force-rebuild         # Force rebuild index even if it exists

 # Embedding Parameters
--embedding-model MODEL  # e.g., facebook/contriever, text-embedding-3-small, nomic-embed-text, or mlx-community/multilingual-e5-base-mlx
+--embedding-model MODEL  # e.g., facebook/contriever, text-embedding-3-small, nomic-embed-text, or mlx-community/Qwen3-Embedding-0.6B-8bit
 --embedding-mode MODE    # sentence-transformers, openai, mlx, or ollama

 # LLM Parameters (Text generation models)
@@ -222,9 +222,15 @@ python apps/document_rag.py --query "What are the main techniques LEANN explores

 3. **Use MLX on Apple Silicon** (optional optimization):
   ```bash
-   --embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
+   --embedding-mode mlx --embedding-model mlx-community/Qwen3-Embedding-0.6B-8bit
   ```
+    MLX may not be the best choice: in our tests it was only about 1.3x faster than Hugging Face (HF), so Ollama may be a better option for embedding generation.

+4. **Use Ollama**:
+   ```bash
+   --embedding-mode ollama --embedding-model nomic-embed-text
+   ```
+   To discover additional embedding models in Ollama, see https://ollama.com/search?c=embedding, or read more about embedding models at https://ollama.com/blog/embedding-models. Check which model size works best for you.
 ### If Search Quality is Poor

 1. **Increase retrieval count**:
@@ -4,8 +4,8 @@ build-backend = "scikit_build_core.build"

 [project]
 name = "leann-backend-diskann"
-version = "0.2.5"
-dependencies = ["leann-core==0.2.5", "numpy", "protobuf>=3.19.0"]
+version = "0.2.6"
+dependencies = ["leann-core==0.2.6", "numpy", "protobuf>=3.19.0"]

 [tool.scikit-build]
 # Key: simplified CMake path
@@ -6,10 +6,10 @@ build-backend = "scikit_build_core.build"

 [project]
 name = "leann-backend-hnsw"
-version = "0.2.5"
+version = "0.2.6"
 description = "Custom-built HNSW (Faiss) backend for the Leann toolkit."
 dependencies = [
-    "leann-core==0.2.5",
+    "leann-core==0.2.6",
    "numpy",
    "pyzmq>=23.0.0",
    "msgpack>=1.0.0",
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "leann-core"
-version = "0.2.5"
+version = "0.2.6"
 description = "Core API and plugin system for LEANN"
 readme = "README.md"
 requires-python = ">=3.9"
@@ -31,6 +31,8 @@ dependencies = [
    "PyPDF2>=3.0.0",
    "pymupdf>=1.23.0",
    "pdfplumber>=0.10.0",
+    "nbconvert>=7.0.0",  # For .ipynb file support
+    "gitignore-parser>=0.1.12",  # For proper .gitignore handling
    "mlx>=0.26.3; sys_platform == 'darwin'",
    "mlx-lm>=0.26.0; sys_platform == 'darwin'",
 ]
@@ -203,62 +203,36 @@ Examples:
        with open(global_registry, "w") as f:
            json.dump(projects, f, indent=2)

-    def _read_gitignore_patterns(self, docs_dir: str) -> list[str]:
-        """Read .gitignore file and return patterns for exclusion."""
-        gitignore_path = Path(docs_dir) / ".gitignore"
-        patterns = []
+    def _build_gitignore_parser(self, docs_dir: str):
+        """Build gitignore parser using gitignore-parser library."""
+        from gitignore_parser import parse_gitignore

-        # Add some essential patterns that should always be excluded
-        essential_patterns = [
-            ".git",
-            ".DS_Store",
-        ]
-        patterns.extend(essential_patterns)
+        # Try to parse the root .gitignore
+        gitignore_path = Path(docs_dir) / ".gitignore"

        if gitignore_path.exists():
            try:
-                with open(gitignore_path, encoding="utf-8") as f:
-                    for line in f:
-                        line = line.strip()
-                        # Skip empty lines and comments
-                        if line and not line.startswith("#"):
-                            # Remove leading slash if present (make it relative)
-                            if line.startswith("/"):
-                                line = line[1:]
-                            patterns.append(line)
-                print(
-                    f"📋 Loaded {len(patterns) - len(essential_patterns)} patterns from .gitignore"
-                )
+                # parse_gitignore reads only this root file; its patterns also match paths in subdirectories
+                matches = parse_gitignore(str(gitignore_path))
+                print(f"📋 Loaded .gitignore from {docs_dir} (patterns apply to all subdirectories)")
+                return matches
            except Exception as e:
-                print(f"Warning: Could not read .gitignore: {e}")
+                print(f"Warning: Could not parse .gitignore: {e}")
        else:
-            print("📋 No .gitignore found, using minimal exclusion patterns")
+            print("📋 No .gitignore found")

-        return patterns
+        # Fallback: basic pattern matching for essential files
+        essential_patterns = {".git", ".DS_Store", "__pycache__", "node_modules", ".venv", "venv"}

-    def _should_exclude_file(self, relative_path: Path, exclude_patterns: list[str]) -> bool:
-        """Check if a file should be excluded based on gitignore-style patterns."""
-        path_str = str(relative_path)
+        def basic_matches(file_path):
+            path_parts = Path(file_path).parts
+            return any(part in essential_patterns for part in path_parts)

-        for pattern in exclude_patterns:
-            # Simple pattern matching (could be enhanced with full gitignore syntax)
-            if pattern.endswith("*"):
-                # Wildcard pattern
-                prefix = pattern[:-1]
-                if path_str.startswith(prefix):
-                    return True
-            elif "*" in pattern:
-                # Contains wildcard - simple glob-like matching
-                import fnmatch
+        return basic_matches

-                if fnmatch.fnmatch(path_str, pattern):
-                    return True
-            else:
-                # Exact match or directory match
-                if path_str == pattern or path_str.startswith(pattern + "/"):
-                    return True
-
-        return False
+    def _should_exclude_file(self, relative_path: Path, gitignore_matches) -> bool:
+        """Check if a file should be excluded using gitignore parser."""
+        return gitignore_matches(str(relative_path))

    def list_indexes(self):
        print("Stored LEANN indexes:")
@@ -341,8 +315,8 @@ Examples:
        if custom_file_types:
            print(f"Using custom file types: {custom_file_types}")

-        # Read .gitignore patterns first
-        exclude_patterns = self._read_gitignore_patterns(docs_dir)
+        # Build gitignore parser
+        gitignore_matches = self._build_gitignore_parser(docs_dir)

        # Try to use better PDF parsers first, but only if PDFs are requested
        documents = []
@@ -355,7 +329,7 @@ Examples:
            for file_path in docs_path.rglob("*.pdf"):
                # Check if file matches any exclude pattern
                relative_path = file_path.relative_to(docs_path)
-                if self._should_exclude_file(relative_path, exclude_patterns):
+                if self._should_exclude_file(relative_path, gitignore_matches):
                    continue

                print(f"Processing PDF: {file_path}")
@@ -449,14 +423,34 @@ Examples:
            ]
        # Try to load other file types, but don't fail if none are found
        try:
+            # Create a custom file filter function using the gitignore matcher
+            def file_filter(file_path: str) -> bool:
+                """Return True if file should be included (not excluded)"""
+                try:
+                    docs_path_obj = Path(docs_dir)
+                    file_path_obj = Path(file_path)
+                    relative_path = file_path_obj.relative_to(docs_path_obj)
+                    return not self._should_exclude_file(relative_path, gitignore_matches)
+                except (ValueError, OSError):
+                    return True  # Include files that can't be processed
+
            other_docs = SimpleDirectoryReader(
                docs_dir,
                recursive=True,
                encoding="utf-8",
                required_exts=code_extensions,
-                exclude=exclude_patterns,
+                file_extractor={},  # Use default extractors
+                filename_as_id=True,
            ).load_data(show_progress=True)
-            documents.extend(other_docs)
+
+            # Filter documents after loading based on gitignore rules
+            filtered_docs = []
+            for doc in other_docs:
+                file_path = doc.metadata.get("file_path", "")
+                if file_filter(file_path):
+                    filtered_docs.append(doc)
+
+            documents.extend(filtered_docs)
        except ValueError as e:
            if "No files found" in str(e):
                print("No additional files found for other supported types.")
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "leann"
-version = "0.2.5"
+version = "0.2.6"
 description = "LEANN - The smallest vector index in the world. RAG Everything with LEANN!"
 readme = "README.md"
 requires-python = ">=3.9"
@@ -32,7 +32,7 @@ dependencies = [
    "pypdfium2>=4.30.0",
    # LlamaIndex core and readers - updated versions
    "llama-index>=0.12.44",
-    "llama-index-readers-file>=0.4.0",  # Essential for PDF parsing
+    "llama-index-readers-file>=0.4.0", # Essential for PDF parsing
    # "llama-index-readers-docling",  # Requires Python >= 3.10
    # "llama-index-node-parser-docling",  # Requires Python >= 3.10
    "llama-index-vector-stores-faiss>=0.4.0",
@@ -43,6 +43,9 @@ dependencies = [
    "mlx>=0.26.3; sys_platform == 'darwin'",
    "mlx-lm>=0.26.0; sys_platform == 'darwin'",
    "psutil>=5.8.0",
+    "pathspec>=0.12.1",
+    "nbconvert>=7.16.6",
+    "gitignore-parser>=0.1.12",
 ]

 [project.optional-dependencies]
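The last `cli.py` hunk above stops passing `exclude=` to the reader and instead filters documents after loading, using each document's `file_path` metadata. A minimal sketch of that filtering step follows, assuming a stand-in `Doc` class and a toy `is_ignored` matcher. Both are hypothetical and exist only to make the example self-contained; `filter_docs` is also an illustrative name.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document; only `metadata` is used."""

    text: str
    metadata: dict = field(default_factory=dict)


def filter_docs(docs, docs_dir: str, is_ignored: Callable[[str], bool]):
    """Keep documents whose path, relative to docs_dir, is not ignored."""
    root = Path(docs_dir)
    kept = []
    for doc in docs:
        file_path = doc.metadata.get("file_path", "")
        try:
            relative = Path(file_path).relative_to(root)
        except ValueError:
            # Path is not under docs_dir: include it, as the diff does.
            kept.append(doc)
            continue
        if not is_ignored(str(relative)):
            kept.append(doc)
    return kept


def is_ignored(path: str) -> bool:
    # Hypothetical matcher: ignore anything under a node_modules directory.
    return "node_modules" in Path(path).parts


docs = [
    Doc("a", {"file_path": "/repo/src/app.py"}),
    Doc("b", {"file_path": "/repo/node_modules/pkg/index.js"}),
    Doc("c", {"file_path": "/elsewhere/notes.md"}),
]
kept = filter_docs(docs, "/repo", is_ignored)
print([d.metadata["file_path"] for d in kept])
```

One trade-off worth noting: filtering after loading means ignored files are still read and parsed before being dropped, so a large ignored directory still costs load time.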
Author	SHA1	Message	Date
Andy Lee	fe942329d6	fix: improve gitignore and Jupyter notebook support - Add nbconvert dependency for .ipynb file support - Replace manual gitignore parsing with gitignore-parser library - Proper recursive .gitignore handling (all subdirectories) - Fix compliance with Git gitignore behavior - Simplify code and improve reliability 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-10 18:52:55 -07:00
yichuan520030910320	9801aa581b	[Readme]update embedding model config according to reddit feedback	2025-08-09 21:33:33 -07:00
GitHub Actions	5e97916608	chore: release v0.2.6	2025-08-10 03:39:45 +00:00
Andy Lee	8b9c2be8c9	Feat/claude code refine (#24 ) * feat: Add Ollama embedding support for local embedding models * docs: Add clear documentation for Ollama embedding usage * fix: remove leann_ask * docs: remove ollama embedding extra instructions * simplify MCP interface for Claude Code - Remove unnecessary search parameters: search_mode, recompute_embeddings, file_types, min_score - Remove leann_clear tool (not needed for Claude Code workflow) - Streamline search to only use: query, index_name, top_k, complexity - Keep core tools: leann_index, leann_search, leann_status, leann_list 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * remove leann_index from MCP interface Users should use CLI command 'leann build' to create indexes first. MCP now only provides search functionality: - leann_search: search existing indexes - leann_status: check index health - leann_list: list available indexes This separates index creation (CLI) from search (Claude Code). 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * improve CLI with auto project name and .gitignore support - Make index_name optional, auto-use current directory name - Read .gitignore patterns and respect them during indexing - Add _read_gitignore_patterns() to parse .gitignore files - Add _should_exclude_file() for pattern matching - Apply exclusion patterns to both PDF and general file processing - Show helpful messages about gitignore usage Now users can simply run: leann build And it will use project name + respect .gitignore patterns. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>	2025-08-09 20:37:17 -07:00