# 🧠 PornPics Gallery Importer (Porndex System)
**Version 0.4.2 — Unified Importer & ML Pipeline**
A modular and well-documented gallery importer for [PornPics.com](https://www.pornpics.com) built for the **Porndex** ecosystem.
Supports importing, tagging, metadata enrichment, and machine-learning-ready dataset generation.
---
## 📂 Project Structure
```text
src/                              → Core source
├── importer/                     → Gallery importers, tag tools, and TPDB bridge
│   ├── cli.py                    → Unified CLI (porndex-importer)
│   ├── gallery_importer.py       → Gallery parsing/downloading
│   ├── tag_gallery.py            → Tag management & YAML dictionaries
│   ├── reports/                  → Tag and enrichment summaries
│   ├── db/                       → Cached sources & enrichment data
│   ├── secrets/                  → API keys and credentials (ignored in Git)
│   └── tag_dictionaries/         → YAML-based tag definitions
├── ml/                           → Machine learning modules
│   ├── ml_dataset_builder.py     → Build JSONL dataset
│   ├── ml_embeddings.py          → Generate CLIP+Text embeddings
│   ├── ml_dataset_inspector.py   → Inspect or visualize dataset (planned)
│   └── ml_vision_detector.py     → GroundingDINO + SAM integration (planned)
├── docs/                         → Documentation & changelogs
├── tests/                        → Unit and integration tests
└── assets/                       → Static data or sample media
```
---
## ⚙️ Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Then, from the root of the project:

```bash
export PYTHONPATH=src
```

---
## 🚀 Quick Start
### Import a Gallery

```bash
porndex-importer import "https://www.pornpics.com/galleries/example-gallery-id/"
```

Automatically:

- Downloads images and metadata
- Saves to `Galleries/<timestamp>_<model>_<title>/`
- Creates `metadata.json`
- Runs auto-tagging (`refresh-one`)
- Updates library index
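
To illustrate the `Galleries/<timestamp>_<model>_<title>/` naming pattern, here is a minimal sketch. The helper names `slugify` and `gallery_folder_name`, the timestamp format, and the slug rules are all assumptions made for this example; the importer's actual behavior may differ.

```python
import re
from datetime import datetime


def slugify(text: str) -> str:
    """Lowercase the text and collapse non-alphanumeric runs into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def gallery_folder_name(model: str, title: str, when: datetime) -> str:
    """Build '<timestamp>_<model>_<title>' (hypothetical format)."""
    # The README only shows <timestamp>; this exact format is an assumption.
    stamp = when.strftime("%Y%m%d-%H%M%S")
    return f"{stamp}_{slugify(model)}_{slugify(title)}"


name = gallery_folder_name("Jane Doe", "Summer Shoot!", datetime(2024, 1, 2, 3, 4, 5))
print(name)  # 20240102-030405_jane-doe_summer-shoot
```

Slugging the model and title keeps folder names safe on common filesystems, and a leading timestamp makes them sort chronologically.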

---
## 🧩 Core Features

| Feature | Description |
|---|---|
| Importer | Downloads and parses galleries from PornPics |
| Auto-Tagging | Generates tags based on YAML dictionaries |
| Metadata Refresh | Updates all galleries with new metadata |
| Source Management | Track and bulk-update content sources |
| CLI Tool | Unified command: `porndex-importer` |
| TPDB Bridge | Enrich performers and metadata via ThePornDB API |
| ML Dataset Builder | Generates a unified dataset (JSONL) |
| Hybrid Embeddings | Builds combined CLIP + text vectors for semantic search |
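
As a rough illustration of dictionary-driven auto-tagging, the sketch below matches words in a gallery title against keyword lists. The dictionary layout (tag mapped to keywords), the `auto_tag` function, and the sample entries are hypothetical; the real YAML schema in `tag_dictionaries/` is not documented here, so a plain Python dict stands in for a loaded YAML file.

```python
import re

# Hypothetical in-memory equivalent of one YAML tag dictionary:
# each tag maps to the keywords that should trigger it.
TAG_DICTIONARY = {
    "outdoor": ["beach", "garden", "poolside"],
    "studio": ["studio", "backdrop"],
}


def auto_tag(title: str, dictionary: dict[str, list[str]]) -> list[str]:
    """Return the sorted tags whose keywords appear as whole words in the title."""
    words = set(re.findall(r"[a-z0-9]+", title.lower()))
    return sorted(tag for tag, keywords in dictionary.items() if words & set(keywords))


tags = auto_tag("Garden session with studio lights", TAG_DICTIONARY)
print(tags)  # ['outdoor', 'studio']
```

Matching whole words rather than substrings avoids accidental hits such as a keyword buried inside a longer word.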

---
## 🤖 Machine Learning Pipeline
### 1️⃣ Build Dataset

```bash
python -m ml.ml_dataset_builder
```

Creates:

```bash
ML/porndex_dataset.jsonl
```

Each record includes title, models, tags, and full image paths (no file duplication).
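
For readers unfamiliar with JSONL, the following sketch writes and reads one record per line. The field names (`title`, `models`, `tags`, `images`) follow the description above, but the exact keys and the helpers `write_jsonl` and `read_jsonl` are assumptions for illustration, not the builder's actual schema.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records: list[dict], path: Path) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical record: images are referenced by path, not copied.
record = {
    "title": "Sample Gallery",
    "models": ["Jane Doe"],
    "tags": ["outdoor"],
    "images": ["Galleries/sample/001.jpg", "Galleries/sample/002.jpg"],
}

with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "dataset.jsonl"
    write_jsonl([record], dataset_path)
    loaded = read_jsonl(dataset_path)

print(loaded[0]["images"])
```

Because each line is an independent JSON document, the dataset can be streamed or appended to without reparsing the whole file.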

### 2️⃣ Build Embeddings

```bash
python -m ml.ml_embeddings build --img-samples 8 --device auto
```

Generates:

```bash
ML/embeddings/<gallery_id>.npz
ML/embeddings_index.jsonl
```

Uses:

- SentenceTransformer for text
- OpenCLIP (ViT-B/32) for images

and produces a combined hybrid vector.
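
The README does not say how the text and image vectors are combined. One common approach, shown here purely as an assumption, is to L2-normalize each vector, weight them, concatenate, and normalize again. The function names and the `text_weight` parameter are hypothetical, and plain lists stand in for real embeddings.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def hybrid_vector(text_vec: list[float], image_vec: list[float],
                  text_weight: float = 0.5) -> list[float]:
    """Weighted concatenation of normalized text and image embeddings."""
    text_part = [text_weight * x for x in l2_normalize(text_vec)]
    image_part = [(1.0 - text_weight) * x for x in l2_normalize(image_vec)]
    # Re-normalize so cosine similarity reduces to a dot product.
    return l2_normalize(text_part + image_part)


vec = hybrid_vector([3.0, 4.0], [0.0, 2.0])
print(len(vec))  # 4
```

Normalizing each modality first keeps one embedding from dominating simply because its raw values are larger.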

### 3️⃣ Search Your Library

```bash
# Semantic search (default)
python -m ml.ml_embeddings search "japanese redhead creampie"
# Strict literal search
python -m ml.ml_embeddings search "interracial bbc" --mode strict
```
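
The difference between the two modes can be sketched as follows, assuming semantic search ranks galleries by cosine similarity between vectors and strict search requires every query term to appear literally. Both functions, the toy vectors, and the record fields are hypothetical; the internals of `ml.ml_embeddings` may work differently.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_search(query_vec: list[float], index: dict[str, list[float]],
                    top_k: int = 3) -> list[str]:
    """Rank gallery ids by similarity to the query vector, best first."""
    scored = sorted(((cosine(query_vec, vec), gid) for gid, vec in index.items()),
                    reverse=True)
    return [gid for _, gid in scored[:top_k]]


def strict_search(query: str, records: list[dict]) -> list[str]:
    """Keep only records whose text contains every query term as a word."""
    terms = query.lower().split()
    return [r["id"] for r in records
            if all(term in r["text"].lower().split() for term in terms)]


index = {"g1": [1.0, 0.0], "g2": [0.0, 1.0], "g3": [0.7, 0.7]}
semantic_hits = semantic_search([1.0, 0.1], index, top_k=2)

records = [
    {"id": "g1", "text": "outdoor beach portrait"},
    {"id": "g2", "text": "studio portrait"},
]
strict_hits = strict_search("beach portrait", records)
print(semantic_hits, strict_hits)  # ['g1', 'g3'] ['g1']
```

Semantic ranking always returns the nearest matches, even loose ones, while the strict filter can return nothing when a term is absent.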

### 4️⃣ Verify Integrity

```bash
python -m ml.ml_embeddings verify
```

Displays:

- Total indexed records
- Images sampled
- NPZ validation summary
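
A simplified version of such an integrity check might walk the index and confirm that each referenced embedding file exists. The `npz` and `gallery_id` keys and the `verify_index` helper are assumptions; the real `verify` command also inspects NPZ contents, which this sketch does not attempt.

```python
import json
import tempfile
from pathlib import Path


def verify_index(index_path: Path) -> dict:
    """Count index entries and those whose embedding file is missing."""
    total = 0
    missing = 0
    for line in index_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        total += 1
        entry = json.loads(line)
        # "npz" is a hypothetical key holding a path relative to the index.
        if not (index_path.parent / entry["npz"]).is_file():
            missing += 1
    return {"total": total, "missing": missing}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "embeddings").mkdir()
    (root / "embeddings" / "g1.npz").write_bytes(b"")  # placeholder file
    index_file = root / "embeddings_index.jsonl"
    index_file.write_text(
        '{"gallery_id": "g1", "npz": "embeddings/g1.npz"}\n'
        '{"gallery_id": "g2", "npz": "embeddings/g2.npz"}\n',
        encoding="utf-8",
    )
    summary = verify_index(index_file)

print(summary)  # {'total': 2, 'missing': 1}
```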

---
## 🧠 Development Guidelines

- No emojis in code or commits.
- Use descriptive variable names.
- Commit only verified working features.
- Document all new features in `docs/CHANGELOG.md`.
- Keep docs and CLI output in sync with `docs/CLI_USAGE.md`.

---
## 🗺️ Roadmap (v0.4.x → v0.5.x)

| Stage | Feature | Description |
|---|---|---|
| ✅ | ML Embedding Search | Hybrid text+image similarity |
| ⚙️ | Gender & Ethnicity Detection | Person-level classification |
| ⏳ | GroundingDINO Integration | Object/region localization |
| ⏳ | Grounded SAM + BLIP | Visual attribute extraction (clothing, actions) |
| 🔜 | Active Learning | Re-train from gallery metadata and tags |

---
## 📄 License

MIT (Internal Research Use Only)

Author: Leak Technologies