# Goondex Tagging System Architecture

## Vision

Enable ML-driven search and tagging, including:

- Multi-tag queries such as "3 black men in a scene where a blonde milf wears pink panties and black heels"
- Image-based scene detection and recommendation
- Auto-tagging from PornPics image imports

## Core Requirements

### 1. Tag Categories (Hierarchical Structure)

Tags need to be organized by category for efficient filtering and ML training:

```
performers/
  └─ [already implemented via performers table]

people/
  ├─ count/ (1, 2, 3, 4, 5+, orgy, etc.)
  ├─ ethnicity/ (black, white, asian, latina, etc.)
  ├─ age_category/ (teen, milf, mature, etc.)
  ├─ body_type/ (slim, athletic, curvy, bbw, etc.)
  └─ hair/
      ├─ color/ (blonde, brunette, redhead, etc.)
      └─ length/ (short, long, bald, etc.)

clothing/
  ├─ type/ (lingerie, uniform, casual, etc.)
  ├─ color/ (pink, black, red, white, etc.)
  ├─ specific/
  ├─ top/ (bra, corset, tank_top, etc.)
  ├─ bottom/ (panties, skirt, jeans, etc.)
  └─ footwear/ (heels, boots, stockings, etc.)

position/
  ├─ category/ (standing, lying, sitting, etc.)
  └─ specific/ (missionary, doggy, cowgirl, etc.)

action/
  ├─ sexual/ (oral, penetration, etc.)
  └─ non_sexual/ (kissing, undressing, etc.)

setting/
  ├─ location/ (bedroom, office, outdoor, etc.)
  └─ time/ (day, night, etc.)

production/
  ├─ quality/ (hd, 4k, vr, etc.)
  └─ style/ (pov, amateur, professional, etc.)
```

### 2. Database Schema Extensions

#### Enhanced Tags Table

```sql
CREATE TABLE IF NOT EXISTS tag_categories (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL UNIQUE,        -- e.g., "clothing/color"
    parent_id INTEGER,                -- for hierarchical categories
    description TEXT,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    FOREIGN KEY (parent_id) REFERENCES tag_categories(id) ON DELETE CASCADE
);

CREATE TABLE IF NOT EXISTS tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,               -- e.g., "pink"
    category_id INTEGER NOT NULL,     -- links to "clothing/color"
    aliases TEXT,                     -- comma-separated: "hot pink,rose"
    description TEXT,
    source TEXT,                      -- tpdb, user, ml
    source_id TEXT,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now')),
    UNIQUE(category_id, name),
    FOREIGN KEY (category_id) REFERENCES tag_categories(id) ON DELETE CASCADE
);

-- Enhanced scene-tag junction with ML confidence
CREATE TABLE IF NOT EXISTS scene_tags (
    scene_id INTEGER NOT NULL,
    tag_id INTEGER NOT NULL,
    confidence REAL DEFAULT 1.0,          -- 0.0-1.0 for ML predictions
    source TEXT NOT NULL DEFAULT 'user',  -- 'user', 'ml', 'tpdb'
    verified BOOLEAN DEFAULT 0,           -- human verification flag
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    PRIMARY KEY (scene_id, tag_id),
    FOREIGN KEY (scene_id) REFERENCES scenes(id) ON DELETE CASCADE,
    FOREIGN KEY (tag_id) REFERENCES tags(id) ON DELETE CASCADE
);

-- Track images associated with scenes (for ML training)
CREATE TABLE IF NOT EXISTS scene_images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    scene_id INTEGER NOT NULL,
    image_url TEXT NOT NULL,
    image_path TEXT,                  -- local storage path
    source TEXT,                      -- pornpics, tpdb, user
    source_id TEXT,
    width INTEGER,
    height INTEGER,
    file_size INTEGER,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    FOREIGN KEY (scene_id) REFERENCES scenes(id) ON DELETE CASCADE
);

-- ML model predictions for future reference
CREATE TABLE IF NOT EXISTS ml_predictions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    scene_id INTEGER,
    image_id INTEGER,
    model_version TEXT NOT NULL,      -- track which ML model made the prediction
    predictions TEXT NOT NULL,        -- JSON: [{"tag_id": 123, "confidence": 0.95}, ...]
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    FOREIGN KEY (scene_id) REFERENCES scenes(id) ON DELETE CASCADE,
    FOREIGN KEY (image_id) REFERENCES scene_images(id) ON DELETE CASCADE
);
```

#### Indexes for ML Performance

```sql
-- Tag search performance
CREATE INDEX IF NOT EXISTS idx_tags_category ON tags(category_id);
CREATE INDEX IF NOT EXISTS idx_tags_name ON tags(name);

-- Scene tag filtering (critical for complex queries)
CREATE INDEX IF NOT EXISTS idx_scene_tags_tag ON scene_tags(tag_id);
CREATE INDEX IF NOT EXISTS idx_scene_tags_confidence ON scene_tags(confidence);
CREATE INDEX IF NOT EXISTS idx_scene_tags_verified ON scene_tags(verified);

-- Image processing
CREATE INDEX IF NOT EXISTS idx_scene_images_scene ON scene_images(scene_id);
CREATE INDEX IF NOT EXISTS idx_scene_images_source ON scene_images(source, source_id);
```

### 3. Complex Query Architecture

For queries like "3 black men + blonde milf + pink panties + black heels":

```sql
-- Step 1: Find scenes with all required tags
WITH required_tags AS (
    SELECT st.scene_id, COUNT(DISTINCT st.tag_id) AS tag_count
    FROM scene_tags st
    JOIN tags t ON st.tag_id = t.id
    WHERE (
        -- The OR group is parenthesized so the quality filter below applies to every branch
        (t.name = 'black' AND t.category_id = (SELECT id FROM tag_categories WHERE name = 'people/ethnicity'))
        OR (t.name = 'blonde' AND t.category_id = (SELECT id FROM tag_categories WHERE name = 'people/hair/color'))
        OR (t.name = 'pink' AND t.category_id = (SELECT id FROM tag_categories WHERE name = 'clothing/color'))
        -- etc.
    )
    -- Accept human-verified tags, or ML predictions at or above the threshold
    AND (st.verified = 1 OR st.confidence >= 0.8)
    GROUP BY st.scene_id
    HAVING COUNT(DISTINCT st.tag_id) >= 4  -- all required tags present (one per tag condition)
)
SELECT s.*
FROM scenes s
JOIN required_tags rt ON s.id = rt.scene_id
-- Additional filtering for performer count, etc.
```

### 4. ML Integration Points

#### Phase 1: Data Collection (Current)

- Import scenes from TPDB with metadata
- Import images from PornPics
- Manual tagging to build training dataset

#### Phase 2: Tag Suggestion (Future)

- ML model suggests tags based on images
- Store predictions with confidence scores
- Human verification workflow

#### Phase 3: Auto-tagging (Future)

- High-confidence predictions auto-applied
- Periodic retraining with verified data
- Confidence thresholds per tag category

### 5. Data Quality Safeguards

**Prevent Tag Spam:**

- Tag category constraints (can't tag "bedroom" as "clothing/color")
- Minimum confidence thresholds
- Rate limiting on ML predictions

**Ensure Consistency:**

- Tag aliases for variations (pink/rose/hot_pink)
- Batch tag operations
- Tag merging/splitting tools

**Human Oversight:**

- Verification workflow for ML tags
- Tag dispute resolution
- Quality metrics per tagger (user/ml)

### 6. API Design (Future)

```go
// TagService interface
type TagService interface {
	// Basic CRUD
	CreateTag(categoryID int64, name string, aliases []string) (*Tag, error)
	GetTagByID(id int64) (*Tag, error)
	SearchTags(query string, categoryID *int64) ([]Tag, error)

	// Scene tagging
	AddTagToScene(sceneID, tagID int64, source string, confidence float64) error
	RemoveTagFromScene(sceneID, tagID int64) error
	GetSceneTags(sceneID int64, verified bool) ([]Tag, error)

	// Complex queries
	SearchScenesByTags(requirements TagRequirements) ([]Scene, error)

	// ML integration
	StorePrediction(sceneID int64, predictions []TagPrediction) error
	VerifyTag(sceneID, tagID int64) error
	BulkVerifyTags(sceneID int64, tagIDs []int64) error
}

type TagRequirements struct {
	Required      []TagFilter // must have ALL
	Optional      []TagFilter // nice to have (scoring)
	Excluded      []TagFilter // must NOT have
	MinConfidence float64
	VerifiedOnly  bool
}

type TagFilter struct {
	CategoryPath string // "clothing/color"
	Value        string // "pink"
	Operator     string // "equals", "contains", "gt", "lt"
}
```

## Implementation Roadmap

### v0.2.0: Enhanced Tagging Foundation

1. ✅ Fix NULL handling (completed)
2. Implement tag_categories table and seed data
3. Update tags table with category_id foreign key
4. Enhance scene_tags with confidence/source/verified
5. Add scene_images table for PornPics integration
6. Create TagService with basic CRUD

### v0.3.0: Advanced Search

1. Implement complex tag query builder
2. Add tag filtering UI/CLI commands
3. Performance optimization with proper indexes
4. Tag statistics and reporting

### v0.4.0: ML Preparation

1. Image import from PornPics
2. ML prediction storage table
3. Tag verification workflow
4. Training dataset export

### v0.5.0: ML Integration

1. Image classification model
2. Auto-tagging pipeline
3. Confidence threshold tuning
4. Retraining automation

## Notes

- **Backwards Compatibility**: The current tags table can be migrated by assigning every existing tag a category_id that points to a "general" category.
- **Storage Consideration**: Images may require significant disk space; consider cloud storage integration.
- **Privacy**: All personal data remains local unless explicitly synced.
- **Performance**: Proper indexing is critical; complex queries with 10+ tags need optimization.

## Example User Flow

1. User imports scene from TPDB → basic metadata populated
2. User uploads/links images from PornPics → scene_images populated
3. ML model scans images → scene_tags created with confidence < 1.0, source = 'ml'
4. User reviews suggestions → verified = 1 for accepted tags
5. User searches "blonde + heels" → query filters by verified tags or confidence > 0.9
6. System returns ranked results based on tag match confidence
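
Steps 5 and 6 of the flow above can be sketched in Go as an in-memory filter and ranking pass. This is only an illustration under stated assumptions, not the planned implementation: `SceneTag`, `RankedScene`, and `rankScenes` are hypothetical names, the rows stand in for `scene_tags` records already loaded from the database, a verified tag is assumed to count as confidence 1.0, and the score is assumed to be the mean confidence of the matched tags. The document does not define the actual ranking formula.

```go
package main

import (
	"fmt"
	"sort"
)

// SceneTag mirrors one row of the scene_tags table (hypothetical Go type).
type SceneTag struct {
	SceneID    int64
	TagID      int64
	Confidence float64
	Verified   bool
}

// RankedScene is a hypothetical result type: a scene ID plus its match score.
type RankedScene struct {
	SceneID int64
	Score   float64
}

// rankScenes keeps rows that are verified or whose confidence exceeds
// minConfidence, requires every tag in required to be present on a scene,
// and orders scenes by the mean confidence of their matched tags.
// Assumption: a verified tag is scored as 1.0.
func rankScenes(rows []SceneTag, required []int64, minConfidence float64) []RankedScene {
	want := make(map[int64]bool, len(required))
	for _, id := range required {
		want[id] = true
	}

	// matched[sceneID][tagID] holds the score of each accepted tag row.
	matched := make(map[int64]map[int64]float64)
	for _, r := range rows {
		if !want[r.TagID] {
			continue // tag is not part of the query
		}
		if !r.Verified && r.Confidence <= minConfidence {
			continue // unverified and below the threshold
		}
		score := r.Confidence
		if r.Verified {
			score = 1.0
		}
		if matched[r.SceneID] == nil {
			matched[r.SceneID] = make(map[int64]float64)
		}
		matched[r.SceneID][r.TagID] = score
	}

	var out []RankedScene
	for sceneID, tags := range matched {
		if len(tags) != len(want) {
			continue // at least one required tag is missing
		}
		sum := 0.0
		for _, c := range tags {
			sum += c
		}
		out = append(out, RankedScene{SceneID: sceneID, Score: sum / float64(len(tags))})
	}

	// Highest score first; ties broken by scene ID for a stable order.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].SceneID < out[j].SceneID
	})
	return out
}

func main() {
	// Tag IDs 10 and 20 are placeholders for "blonde" and "heels".
	rows := []SceneTag{
		{SceneID: 1, TagID: 10, Confidence: 0.7, Verified: true},
		{SceneID: 1, TagID: 20, Confidence: 0.95},
		{SceneID: 2, TagID: 10, Confidence: 1.0, Verified: true},
		{SceneID: 2, TagID: 20, Confidence: 1.0, Verified: true},
		{SceneID: 3, TagID: 10, Confidence: 0.95},
		{SceneID: 3, TagID: 20, Confidence: 0.5}, // rejected: unverified, low confidence
	}
	for _, r := range rankScenes(rows, []int64{10, 20}, 0.9) {
		fmt.Printf("scene %d score %.3f\n", r.SceneID, r.Score)
	}
}
```

With this sample data, scene 2 ranks ahead of scene 1 because both of its tags are verified, and scene 3 is dropped because one required tag falls below the threshold. In a real deployment the filtering would more likely happen in SQL, as in the Complex Query Architecture section, with only the scoring done in Go.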