File: docs/README.md
Version: v0.3.4
Last updated: November 2025
Maintainer: Leak Technologies
Project: Goondex
------------------------------------------------------------
Goondex — PornPics Importer & ML Pipeline
------------------------------------------------------------
A modular, documented gallery importer for PornPics.com, forming the foundation of the Goondex ecosystem.
It supports importing, tagging, metadata enrichment, and generation of ML-ready datasets for semantic search and classification.
------------------------------------------------------------
1. Project Overview
------------------------------------------------------------
Goondex automates the process of:
- Downloading and organizing galleries from PornPics.com
- Generating structured metadata and inferring tags
- Enriching galleries via ThePornDB (TPDB) performer API
- Building machine-learning datasets and embeddings
- Enabling semantic, hybrid (text + image) search
All processing runs locally; no cloud services or external databases are required.
The system is modular, transparent, and designed for research and personal archival use.
------------------------------------------------------------
2. Project Structure
------------------------------------------------------------
src/
├── importer/                     → Core importer logic and CLI tools
│   ├── cli.py                    → Unified CLI entrypoint (goondex command)
│   ├── gallery_importer.py       → Gallery parser and downloader
│   ├── tag_gallery.py            → Tag inference and YAML management
│   ├── reports/                  → Auto-generated validation and tag stats
│   ├── db/                       → TPDB performer cache and local databases
│   ├── secrets/                  → Local-only API keys (ignored by Git)
│   └── tag_dictionaries/         → Modular YAML tag dictionaries
├── ml/                           → Machine learning and semantic search
│   ├── ml_dataset_builder.py     → Builds JSONL dataset for embeddings
│   ├── ml_embeddings.py          → Generates CLIP + text hybrid vectors
│   ├── ml_dataset_inspector.py   → (planned) visual dataset viewer
│   └── ml_vision_detector.py     → (planned) DINO + SAM visual tagging
├── docs/                         → Documentation, changelogs, and brand files
├── tests/                        → Unit and integration testing suite
└── assets/                       → Static samples and test assets
------------------------------------------------------------
3. Environment Setup
------------------------------------------------------------
Create a virtual environment and install dependencies:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Set the source path for development:
```bash
export PYTHONPATH=src
```
------------------------------------------------------------
4. Quick Start
------------------------------------------------------------
Import a gallery from PornPics:
```bash
goondex import "https://www.pornpics.com/galleries/example-id/"
```
The import command automatically:
- Downloads images and metadata
- Saves to Galleries/<timestamp>_<model>_<title>/
- Generates metadata.json
- Runs auto-tagging (refresh-one)
- Updates the central gallery index
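The folder naming and metadata steps above can be sketched as follows. This is a minimal illustration, not the importer's actual code: the helper names (`slugify`, `gallery_folder_name`, `write_metadata`), the timestamp format, and the metadata fields are all assumptions made for the example.

```python
import json
import re
import tempfile
from datetime import datetime
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase the text and collapse every non-alphanumeric run into a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def gallery_folder_name(when: datetime, model: str, title: str) -> str:
    """Build <timestamp>_<model>_<title>; the timestamp format is an assumption."""
    return f"{when:%Y%m%d-%H%M%S}_{slugify(model)}_{slugify(title)}"


def write_metadata(root: Path, when: datetime, model: str, title: str, source_url: str) -> Path:
    """Create the gallery folder under root and write a minimal metadata.json."""
    folder = root / gallery_folder_name(when, model, title)
    folder.mkdir(parents=True, exist_ok=True)
    metadata = {  # hypothetical fields, chosen only for illustration
        "title": title,
        "models": [model],
        "source_url": source_url,
        "imported_at": when.isoformat(),
    }
    (folder / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return folder


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        created = write_metadata(
            Path(tmp) / "Galleries",
            datetime(2025, 11, 1, 12, 30, 0),
            "Jane Doe",
            "Example Title",
            "https://example.invalid/galleries/example-id/",
        )
        print(created.name)
```

Slugging the model and title keeps folder names safe across filesystems. The real importer may use a different scheme.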
------------------------------------------------------------
5. CLI Overview
------------------------------------------------------------
All commands are run via:
    goondex <command> [args...]
Examples:
    goondex refresh-all
    goondex refresh-one "<folder>"
    goondex validate-tags
    goondex tag-stats
    goondex list-tags "<folder>"
    goondex add "<folder>" "TagName"
    goondex source bulk set "PornPics"
The CLI automatically detects YAML tag dictionaries and applies them during refresh or import.
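For readers extending the CLI, the `goondex <command> [args...]` layout maps naturally onto `argparse` subcommands. The sketch below is not the contents of `cli.py`. It covers only a few of the commands listed above, and the help strings and argument names are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a `goondex <command> [args...]` parser (hypothetical, not cli.py)."""
    parser = argparse.ArgumentParser(prog="goondex")
    commands = parser.add_subparsers(dest="command", required=True)

    # Command with no arguments.
    commands.add_parser("refresh-all", help="refresh every gallery (assumed meaning)")

    # Commands that take a gallery folder.
    refresh_one = commands.add_parser("refresh-one", help="refresh a single gallery")
    refresh_one.add_argument("folder")

    list_tags = commands.add_parser("list-tags", help="show the tags of a gallery")
    list_tags.add_argument("folder")

    # Command that takes a folder and a tag name.
    add_tag = commands.add_parser("add", help="attach a tag to a gallery")
    add_tag.add_argument("folder")
    add_tag.add_argument("tag")

    return parser


args = build_parser().parse_args(["add", "Galleries/example", "TagName"])
print(args.command, args.folder, args.tag)
```

Each subcommand owns its arguments, so a new command can be added without touching the existing ones.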
------------------------------------------------------------
6. Machine Learning Pipeline
------------------------------------------------------------
Build dataset:
```bash
python -m ml.ml_dataset_builder
```
Output:
ML/porndex_dataset.jsonl
Each record includes:
{
  "gallery_id": "...",
  "title": "...",
  "models": ["..."],
  "tags": ["..."],
  "categories": ["..."],
  "image_paths": ["..."]
}
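Because the dataset is JSONL, each line holds one such record as a standalone JSON object. The sketch below shows one way to write and read records with the six fields listed above. The function names and the missing-field check are illustrative assumptions, not the behaviour of `ml_dataset_builder.py`.

```python
import io
import json

# Field names taken from the record layout above.
FIELDS = ("gallery_id", "title", "models", "tags", "categories", "image_paths")


def write_records(handle, records):
    """Write one JSON object per line, rejecting records with missing fields."""
    for record in records:
        missing = [name for name in FIELDS if name not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(handle):
    """Yield one dict per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Round-trip a placeholder record through an in-memory buffer.
sample = {
    "gallery_id": "example-id",
    "title": "Example Title",
    "models": ["Jane Doe"],
    "tags": ["outdoor"],
    "categories": ["sample"],
    "image_paths": ["Galleries/example/001.jpg"],
}
buffer = io.StringIO()
write_records(buffer, [sample])
buffer.seek(0)
loaded = list(read_records(buffer))
print(loaded[0]["gallery_id"])
```

With one record per line, the file can be streamed without loading the whole dataset into memory.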
Build embeddings:
```bash
python -m ml.ml_embeddings build --img-samples 8 --device auto
```
Output:
ML/embeddings/<gallery_id>.npz
ML/embeddings_index.jsonl
Search:
```bash
python -m ml.ml_embeddings search "asian redhead solo"
```
Modes:
- semantic (default) — hybrid vector cosine similarity
- text — text-only search
- strict — literal keyword matching
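To make the difference between the semantic and strict modes concrete, here is a small pure-Python sketch. It is not the code in `ml_embeddings.py`: the function names, the toy two-dimensional vectors, and the reading of "strict" as "every query word must appear among the gallery's tags" are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vector, index, top_k=3):
    """Rank gallery ids by cosine similarity between the query and stored vectors."""
    scored = [(cosine_similarity(query_vector, vector), gallery_id)
              for gallery_id, vector in index.items()]
    scored.sort(reverse=True)
    return [gallery_id for _, gallery_id in scored[:top_k]]


def strict_search(query, tags_by_gallery):
    """Keep galleries whose tags contain every query word (case-insensitive)."""
    wanted = set(query.lower().split())
    return [gallery_id for gallery_id, tags in tags_by_gallery.items()
            if wanted <= {tag.lower() for tag in tags}]


# Toy data: real embeddings would have hundreds of dimensions.
index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
tags_by_gallery = {"a": ["Outdoor", "Studio"], "b": ["Outdoor"]}

print(semantic_search([1.0, 0.1], index, top_k=2))
print(strict_search("outdoor studio", tags_by_gallery))
```

Semantic ranking always returns the nearest galleries even when nothing matches exactly, whereas the strict filter can return an empty list.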
Verify:
```bash
python -m ml.ml_embeddings verify
```
------------------------------------------------------------
7. Development Guidelines
------------------------------------------------------------
- Use descriptive variable names and structured commits
- Avoid emojis in code and commit messages
- Always document new features in docs/CHANGELOG.md
- Keep CLI text synchronized with docs/CLI_USAGE.md
- Use version tagging for all major commits
------------------------------------------------------------
8. Roadmap Summary
------------------------------------------------------------
Stage       Feature                          Description
----------- -------------------------------- ----------------------------------
✅ v0.3.x   Stable CLI & Tagging             Unified CLI and YAML cleanup
⚙️ v0.4.x   ML Embeddings & Dataset Builder  Build hybrid vectors for search
⏳ v0.5.x   Visual Intelligence              DINO + SAM + attribute detection
🔜 v0.6.x   Local Web UI                     Lightweight gallery browser
🚀 v1.0.0   Full Stable Release              Plugin importers + visual ML tools
------------------------------------------------------------
9. Licensing
------------------------------------------------------------
License: Research-Use MIT Variant
Author: Leak Technologies
Maintainer: Stu Leak
For personal, non-commercial, and research use only.
------------------------------------------------------------
End of File
------------------------------------------------------------