# 🧠 PornPics Gallery Importer (Porndex System)
**Version 0.4.2: Unified Importer & ML Pipeline**

A modular and well-documented gallery importer for [PornPics.com](https://www.pornpics.com), built for the **Porndex** ecosystem.
It supports importing, tagging, metadata enrichment, and machine learning–ready dataset generation.

---

## 📂 Project Structure

    src/                              → Core source
    ├── importer/                     → Gallery importers, tag tools, and TPDB bridge
    │   ├── cli.py                    → Unified CLI (porndex-importer)
    │   ├── gallery_importer.py       → Gallery parsing/downloading
    │   ├── tag_gallery.py            → Tag management & YAML dictionaries
    │   ├── reports/                  → Tag and enrichment summaries
    │   ├── db/                       → Cached sources & enrichment data
    │   ├── secrets/                  → API keys and credentials (ignored in Git)
    │   └── tag_dictionaries/         → YAML-based tag definitions
    │
    ├── ml/                           → Machine learning modules
    │   ├── ml_dataset_builder.py     → Build JSONL dataset
    │   ├── ml_embeddings.py          → Generate CLIP+Text embeddings
    │   ├── ml_dataset_inspector.py   → Inspect or visualize dataset (planned)
    │   └── ml_vision_detector.py     → GroundingDINO + SAM integration (planned)
    │
    ├── docs/                         → Documentation & changelogs
    ├── tests/                        → Unit and integration tests
    └── assets/                       → Static data or sample media

---

## ⚙️ Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Then, from the root of the project:

```bash
export PYTHONPATH=src
```

## 🚀 Quick Start

### Import a Gallery

    porndex-importer import "https://www.pornpics.com/galleries/example-gallery-id/"

This automatically:

- Downloads images and metadata
- Saves to `Galleries/<timestamp>_<model>_<title>/`
- Creates `metadata.json`
- Runs auto-tagging (`refresh-one`)
- Updates the library index

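The folder naming and `metadata.json` steps listed above can be sketched roughly as follows. This is an illustration only, not the importer's actual code: the helper names (`slugify`, `gallery_dir_name`, `write_metadata`), the timestamp format, and the metadata field names are all assumptions.

```python
import json
import re
from datetime import datetime
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase the text and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def gallery_dir_name(timestamp: datetime, model: str, title: str) -> str:
    """Build a <timestamp>_<model>_<title> folder name (timestamp format is assumed)."""
    return f"{timestamp:%Y%m%d-%H%M%S}_{slugify(model)}_{slugify(title)}"


def write_metadata(root: Path, timestamp: datetime, model: str, title: str, source_url: str) -> Path:
    """Create the gallery folder under root and write a minimal metadata.json into it."""
    gallery_dir = root / gallery_dir_name(timestamp, model, title)
    gallery_dir.mkdir(parents=True, exist_ok=True)
    metadata = {  # hypothetical field names, chosen for illustration
        "title": title,
        "models": [model],
        "source_url": source_url,
        "imported_at": timestamp.isoformat(),
    }
    (gallery_dir / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return gallery_dir
```

Slugging the model and title keeps the folder name safe on common filesystems, and the leading timestamp makes folders sort by import time.
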
## 🧩 Core Features

| Feature | Description |
|---|---|
| Importer | Downloads and parses galleries from PornPics |
| Auto-Tagging | Generates tags based on YAML dictionaries |
| Metadata Refresh | Updates all galleries with new metadata |
| Source Management | Tracks and bulk-updates content sources |
| CLI Tool | Unified command: `porndex-importer` |
| TPDB Bridge | Enriches performers and metadata via ThePornDB API |
| ML Dataset Builder | Generates a unified dataset (JSONL) |
| Hybrid Embeddings | Builds combined CLIP + text vectors for semantic search |

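To make the dictionary-driven auto-tagging concrete, here is a minimal sketch. It assumes a tag dictionary shaped as "tag → list of trigger keywords"; the real YAML schema in `tag_dictionaries/` is not shown in this README, so a plain Python dict stands in for a parsed YAML file, and `auto_tag` is a hypothetical helper.

```python
import re

# Stand-in for a parsed YAML tag dictionary (assumed shape: tag -> trigger keywords).
TAG_DICTIONARY = {
    "outdoor": ["beach", "garden", "poolside"],
    "redhead": ["redhead", "red hair"],
    "lingerie": ["lingerie", "stockings"],
}


def auto_tag(text: str, dictionary: dict[str, list[str]]) -> list[str]:
    """Return the sorted tags whose keywords appear as whole words or phrases in text."""
    # Normalize to lowercase words separated by single spaces so phrases match reliably.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    tags = set()
    for tag, keywords in dictionary.items():
        for keyword in keywords:
            if re.search(rf"\b{re.escape(keyword)}\b", normalized):
                tags.add(tag)
                break  # one matching keyword is enough for this tag
    return sorted(tags)
```

Matching on word boundaries avoids false positives such as tagging "beachwear" as `outdoor`.
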
## 🤖 Machine Learning Pipeline

### 1️⃣ Build Dataset

    python -m ml.ml_dataset_builder

Creates `ML/porndex_dataset.jsonl`.

Each record includes the title, models, tags, and full image paths. Images are referenced in place, so no files are duplicated.

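A dataset builder of this kind might look roughly like the sketch below: it walks the gallery folders, reads each `metadata.json`, and writes one JSON object per line with absolute image paths instead of copying files. The function names, the record keys, and the list of image suffixes are assumptions, not the actual `ml_dataset_builder` implementation.

```python
import json
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # assumed set of image types


def build_record(gallery_dir: Path) -> dict:
    """Build one dataset record from a gallery folder containing metadata.json."""
    metadata = json.loads((gallery_dir / "metadata.json").read_text(encoding="utf-8"))
    images = sorted(
        str(path.resolve())  # full path only; the image itself is never copied
        for path in gallery_dir.iterdir()
        if path.suffix.lower() in IMAGE_SUFFIXES
    )
    return {
        "gallery_id": gallery_dir.name,  # hypothetical key
        "title": metadata.get("title", ""),
        "models": metadata.get("models", []),
        "tags": metadata.get("tags", []),
        "images": images,
    }


def write_dataset(galleries_root: Path, output_path: Path) -> int:
    """Write one JSON line per gallery and return the number of records written."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with output_path.open("w", encoding="utf-8") as handle:
        for gallery_dir in sorted(galleries_root.iterdir()):
            if (gallery_dir / "metadata.json").is_file():
                handle.write(json.dumps(build_record(gallery_dir)) + "\n")
                count += 1
    return count
```

JSONL keeps each gallery on its own line, so downstream steps can stream the file without loading the whole dataset into memory.
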
### 2️⃣ Build Embeddings

    python -m ml.ml_embeddings build --img-samples 8 --device auto

Generates:

- `ML/embeddings/<gallery_id>.npz`
- `ML/embeddings_index.jsonl`

Uses:

- SentenceTransformer for text
- OpenCLIP (ViT-B/32) for images

and produces a combined hybrid vector.

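The README does not say how the text and image embeddings are merged, so the following is only one plausible scheme, written in pure Python for clarity: mean-pool the sampled image vectors, L2-normalize both parts, weight them, concatenate, and normalize again. The function names and the `text_weight` parameter are hypothetical.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average several image embeddings (one per sampled image) into one vector."""
    if not vectors:
        raise ValueError("at least one image embedding is required")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hybrid_vector(text_vec: list[float], image_vecs: list[list[float]], text_weight: float = 0.5) -> list[float]:
    """Concatenate weighted, normalized text and image parts into one unit vector."""
    text_part = l2_normalize(text_vec)
    image_part = l2_normalize(mean_pool(image_vecs))
    combined = [text_weight * v for v in text_part]
    combined += [(1.0 - text_weight) * v for v in image_part]
    return l2_normalize(combined)
```

Normalizing each part before weighting keeps one modality from dominating just because its raw embeddings have larger magnitudes.
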
### 3️⃣ Search Your Library

    # Semantic search (default)
    python -m ml.ml_embeddings search "japanese redhead creampie"

    # Strict literal search
    python -m ml.ml_embeddings search "interracial bbc" --mode strict

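The difference between the two modes can be illustrated with a small sketch. It assumes that semantic mode ranks galleries by cosine similarity against an already-embedded query, and that strict mode requires every query term to appear literally in the title or tags. Both behaviours, the function names, and the index entry keys are assumptions; the actual `ml_embeddings` logic may differ.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vec: list[float], index: list[dict], top_k: int = 5) -> list[str]:
    """Return gallery ids ordered by descending similarity to the query vector."""
    scored = [(cosine(query_vec, entry["vector"]), entry["gallery_id"]) for entry in index]
    return [gallery_id for _, gallery_id in sorted(scored, reverse=True)[:top_k]]


def strict_search(query: str, index: list[dict]) -> list[str]:
    """Return gallery ids whose title or tags contain every query term literally."""
    terms = query.lower().split()
    results = []
    for entry in index:
        words = " ".join([entry["title"], *entry["tags"]]).lower().split()
        if all(term in words for term in terms):
            results.append(entry["gallery_id"])
    return results
```

Semantic mode can surface galleries that never mention the query words, while strict mode trades recall for predictable, exact matches.
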
### 4️⃣ Verify Integrity

    python -m ml.ml_embeddings verify

Displays:

- Total indexed records
- Images sampled
- NPZ validation summary

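A verification pass along these lines could be sketched as below. It reads the index, totals the sampled images, and checks that each gallery has an NPZ file that is at least a readable zip archive (NPZ files are zip containers). The record keys `gallery_id` and `images_sampled` and the function name are assumptions; a fuller check would also load the arrays with NumPy.

```python
import json
import zipfile
from pathlib import Path


def verify_index(index_path: Path) -> dict:
    """Summarize an embeddings index and check that each referenced NPZ file is a valid zip."""
    summary = {"records": 0, "images_sampled": 0, "valid_npz": 0, "invalid_npz": []}
    embeddings_dir = index_path.parent / "embeddings"
    with index_path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines in the index
            record = json.loads(line)
            summary["records"] += 1
            summary["images_sampled"] += record.get("images_sampled", 0)
            npz_path = embeddings_dir / f"{record['gallery_id']}.npz"
            # Container-level check only; array contents are not inspected here.
            if npz_path.is_file() and zipfile.is_zipfile(npz_path):
                summary["valid_npz"] += 1
            else:
                summary["invalid_npz"].append(record["gallery_id"])
    return summary
```

Listing the failing gallery ids makes it easy to rebuild only the embeddings that are missing or corrupted.
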
## 🧠 Development Guidelines

- No emojis in code or commits.
- Use descriptive variable names.
- Commit only verified working features.
- Document all new features in `docs/CHANGELOG.md`.
- Keep docs and CLI output in sync with `docs/CLI_USAGE.md`.

## 🗺️ Roadmap (v0.4.x → v0.5.x)

| Stage | Feature | Description |
|---|---|---|
| ✅ | ML Embedding Search | Hybrid text+image similarity |
| ⚙️ | Gender & Ethnicity Detection | Person-level classification |
| ⏳ | GroundingDINO Integration | Object/region localization |
| ⏳ | Grounded SAM + BLIP | Visual attribute extraction (clothing, actions) |
| 🔜 | Active Learning | Re-train from gallery metadata and tags |

## 📄 License

MIT (Internal Research Use Only)

Author: Leak Technologies