🧠 PornPics Gallery Importer (Porndex System)

Version 0.4.2 — Unified Importer & ML Pipeline

A modular and well-documented gallery importer for PornPics.com built for the Porndex ecosystem.
It supports importing, tagging, metadata enrichment, and generation of machine-learning-ready datasets.


📂 Project Structure

src/ → Core source
├── importer/ → Gallery importers, tag tools, and TPDB bridge
│   ├── cli.py → Unified CLI (porndex-importer)
│   ├── gallery_importer.py → Gallery parsing/downloading
│   ├── tag_gallery.py → Tag management & YAML dictionaries
│   ├── reports/ → Tag and enrichment summaries
│   ├── db/ → Cached sources & enrichment data
│   ├── secrets/ → API keys and credentials (ignored in Git)
│   └── tag_dictionaries/ → YAML-based tag definitions
│
├── ml/ → Machine learning modules
│   ├── ml_dataset_builder.py → Build JSONL dataset
│   ├── ml_embeddings.py → Generate CLIP+Text embeddings
│   ├── ml_dataset_inspector.py → Inspect or visualize dataset (planned)
│   └── ml_vision_detector.py → GroundingDINO + SAM integration (planned)
│
├── docs/ → Documentation & changelogs
├── tests/ → Unit and integration tests
└── assets/ → Static data or sample media



⚙️ Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Then, from the root of the project:

export PYTHONPATH=src
🚀 Quick Start
Import a Gallery
porndex-importer import "https://www.pornpics.com/galleries/example-gallery-id/"
This command automatically:

Downloads images and metadata

Saves to Galleries/<timestamp>_<model>_<title>/

Creates metadata.json

Runs auto-tagging (refresh-one)

Updates library index
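
The folder naming and metadata step above can be sketched roughly as follows. This is an illustration only, not the importer's actual code: the timestamp format, the slug rules, the metadata field names, and the helper names `slugify`, `gallery_dir_name`, and `write_metadata` are all assumptions.

```python
import json
import re
from datetime import datetime
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase the text and collapse every non-alphanumeric run into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def gallery_dir_name(when: datetime, model: str, title: str) -> str:
    # Assumed layout: <timestamp>_<model>_<title>. The timestamp format is a guess.
    return f"{when:%Y%m%d-%H%M%S}_{slugify(model)}_{slugify(title)}"


def write_metadata(root: Path, when: datetime, model: str, title: str, source_url: str) -> Path:
    """Create the gallery folder under <root>/Galleries and write metadata.json."""
    gallery_dir = root / "Galleries" / gallery_dir_name(when, model, title)
    gallery_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical metadata fields; the real metadata.json may differ.
    metadata = {
        "title": title,
        "models": [model],
        "source_url": source_url,
        "imported_at": when.isoformat(),
    }
    path = gallery_dir / "metadata.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path
```

Slugging the model and title keeps the folder name safe across filesystems; the original strings are still preserved inside metadata.json.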

🧩 Core Features
| Feature | Description |
| --- | --- |
| Importer | Downloads and parses galleries from PornPics |
| Auto-Tagging | Generates tags based on YAML dictionaries |
| Metadata Refresh | Updates all galleries with new metadata |
| Source Management | Track and bulk-update content sources |
| CLI Tool | Unified command: porndex-importer |
| TPDB Bridge | Enrich performers and metadata via ThePornDB API |
| ML Dataset Builder | Generates a unified dataset (JSONL) |
| Hybrid Embeddings | Builds combined CLIP + text vectors for semantic search |
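
To illustrate dictionary-based auto-tagging, here is a minimal sketch. It assumes each YAML dictionary loads into a mapping from tag name to a list of keywords; that layout, the sample entries, and the `auto_tag` helper are hypothetical and not taken from the project.

```python
import re

# Hypothetical dictionary, shaped like what a YAML tag file might load into.
TAG_DICTIONARY = {
    "outdoor": ["beach", "forest", "poolside"],
    "studio": ["studio", "backdrop"],
    "redhead": ["redhead", "red hair"],
}


def auto_tag(text: str, dictionary: dict[str, list[str]]) -> list[str]:
    """Return every tag that has at least one keyword present as a whole word."""
    lowered = text.lower()
    tags = []
    for tag, keywords in dictionary.items():
        # Word boundaries stop "studio" from matching inside "studious".
        if any(re.search(rf"\b{re.escape(kw.lower())}\b", lowered) for kw in keywords):
            tags.append(tag)
    return sorted(tags)
```

Under these assumptions, `auto_tag("Redhead at the beach", TAG_DICTIONARY)` returns `["outdoor", "redhead"]`.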

🤖 Machine Learning Pipeline
1⃣ Build Dataset
python -m ml.ml_dataset_builder
Creates:

ML/porndex_dataset.jsonl
Each record includes title, models, tags, and full image paths (no file duplication).
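
Because the dataset stores paths rather than copies, a quick consumer can stream the file and check that the referenced images still exist. The sketch below assumes one JSON object per line and a field named `images` holding the paths; the field name and both helper names are guesses, not confirmed by the builder.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def missing_images(record: dict) -> list[str]:
    """List image paths in a record that no longer exist on disk."""
    # "images" is an assumed field name for the full image paths.
    return [p for p in record.get("images", []) if not Path(p).exists()]
```

Streaming line by line keeps memory use flat even for large libraries, which is the main reason to prefer JSONL over a single JSON array here.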

2⃣ Build Embeddings
python -m ml.ml_embeddings build --img-samples 8 --device auto
Generates:

ML/embeddings/<gallery_id>.npz
ML/embeddings_index.jsonl
Uses:

SentenceTransformer for text

OpenCLIP (ViT-B/32) for images

The text and image embeddings are then combined into a single hybrid vector.
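
One common way to build such a hybrid vector is to mean-pool the sampled image embeddings, L2-normalize both parts, weight them, and concatenate. The README does not state how ml_embeddings actually combines them, so the sketch below is an assumption throughout, including the `text_weight` parameter. It uses plain Python lists to stay dependency-free.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def hybrid_vector(text_vec: list[float], image_vecs: list[list[float]], text_weight: float = 0.5) -> list[float]:
    """Combine one text embedding with several image embeddings (assumed scheme)."""
    # Mean-pool the sampled image embeddings (e.g. the 8 from --img-samples 8).
    dim = len(image_vecs[0])
    mean_image = [sum(vec[i] for vec in image_vecs) / len(image_vecs) for i in range(dim)]
    text_part = [text_weight * x for x in l2_normalize(text_vec)]
    image_part = [(1.0 - text_weight) * x for x in l2_normalize(mean_image)]
    # Concatenate and renormalize so cosine similarity behaves consistently.
    return l2_normalize(text_part + image_part)
```

Normalizing each part before weighting prevents whichever model emits larger raw magnitudes from dominating the similarity score.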

3⃣ Search Your Library
# Semantic search (default)
python -m ml.ml_embeddings search "japanese redhead creampie"

# Strict literal search
python -m ml.ml_embeddings search "interracial bbc" --mode strict
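
The difference between the two modes can be pictured as follows. This is a guess at the behavior, not the tool's implementation: semantic mode is assumed to rank galleries by cosine similarity to the query embedding, and strict mode is assumed to keep only galleries whose text contains every query term as a whole word. The index layout (`gallery_id`, `vector`, `text`) is hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vec: list[float], index: list[dict], top_k: int = 5) -> list[str]:
    """Rank galleries by similarity to the query embedding, best first."""
    scored = [(cosine(query_vec, entry["vector"]), entry["gallery_id"]) for entry in index]
    scored.sort(reverse=True)
    return [gallery_id for _, gallery_id in scored[:top_k]]


def strict_search(query: str, index: list[dict]) -> list[str]:
    """Keep only galleries whose text contains every query term as a whole word."""
    terms = query.lower().split()
    return [
        entry["gallery_id"]
        for entry in index
        if all(term in entry["text"].lower().split() for term in terms)
    ]
```

Semantic ranking tolerates synonyms and loose phrasing, while the literal filter is useful when an exact tag or name must be present.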
4⃣ Verify Integrity
python -m ml.ml_embeddings verify
Displays:

Total indexed records

Images sampled

NPZ validation summary
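
A lightweight version of such a check can be written with the standard library alone, since an .npz file is a zip archive of .npy members. The sketch below only confirms that each indexed gallery has an archive that opens and passes its CRC checks; it does not load or inspect the arrays. The `gallery_id` field in the index and the `verify_index` helper are assumptions.

```python
import json
import zipfile
from pathlib import Path


def verify_index(index_path: Path, embeddings_dir: Path) -> dict:
    """Check that every indexed gallery has a readable .npz archive."""
    summary = {"records": 0, "valid": 0, "missing": [], "corrupt": []}
    with index_path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            record = json.loads(line)
            summary["records"] += 1
            # Assumed naming: ML/embeddings/<gallery_id>.npz
            npz_path = embeddings_dir / f"{record['gallery_id']}.npz"
            if not npz_path.exists():
                summary["missing"].append(npz_path.name)
                continue
            try:
                with zipfile.ZipFile(npz_path) as archive:
                    # testzip() returns the first member failing its CRC, or None.
                    bad_member = archive.testzip()
            except zipfile.BadZipFile:
                bad_member = npz_path.name
            if bad_member is None:
                summary["valid"] += 1
            else:
                summary["corrupt"].append(npz_path.name)
    return summary
```

A structural pass like this is cheap enough to run on every build; a deeper check would load each archive with NumPy and confirm array shapes.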

🧠 Development Guidelines
No emojis in code or commits.

Use descriptive variable names.

Commit only verified working features.

Document all new features in docs/CHANGELOG.md.

Keep docs and CLI output in sync with docs/CLI_USAGE.md.

🗺️ Roadmap (v0.4.x → v0.5.x)
| Stage | Feature | Description |
| --- | --- | --- |
| ✅ | ML Embedding Search | Hybrid text+image similarity |
| ⚙️ | Gender & Ethnicity Detection | Person-level classification |
| ⏳ | GroundingDINO Integration | Object/region localization |
| ⏳ | Grounded SAM + BLIP | Visual attribute extraction (clothing, actions) |
| 🔜 | Active Learning | Re-train from gallery metadata and tags |

📄 License
MIT — Internal Research Use Only
Author: Leak Technologies