diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
new file mode 100644
index 0000000..5f14edc
--- /dev/null
+++ b/docs/CHANGELOG.md
@@ -0,0 +1,325 @@
+# 📜 Goondex — Full Changelog
+> **Repository:** Leak Technologies
+> **Branch:** main
+> **Version Line:** v0.3.x Development Cycle
+> _Formerly: Porndex Importer (PornPics Importer Module)_
+
+
+---
+
+## [v0.3.2-rebuild] — Repository Cleanup & Stabilization (2025-11-02)
+
+### ✨ Added
+- Introduced project-wide `.gitignore` to exclude gallery media and model weights.
+- Added `VERSION` file (v0.3.2) for synchronized CLI and metadata versioning.
+- Fixed virtualenv activation under the fish shell.
+- Ensured unified `porndex` CLI entrypoint under `/src/importer/cli.py`.
+
+### 🧹 Maintenance
+- Removed redundant and outdated tags (v0.3.0–v0.4.1) from remote.
+- Normalized repository tree and re-pushed clean 4.6 GiB → base v0.3.2.
+- Prepared groundwork for `--help` and `--version` CLI arguments.
+
+---
+
+## [v0.3.0] — Modular Tagging Framework Foundation (2025-10-18)
+
+### ✨ Added
+- Introduced **YAML-based Tag Dictionaries** stored under `/src/importer/tagging/` for modular, human-readable tag definitions.
+- Implemented initial **`refresh-all`** and **`refresh-one`** commands for reapplying tag inference to galleries.
+- Added **persistent `inferred_tags` field** in `metadata.json` to differentiate between automated and manual tags.
+- Implemented **automatic source inference** for known networks (e.g., Brazzers, FTV Girls, PornPics).
+- Enhanced CLI output with colorized progress indicators and summary totals.
+
+### 🛠 Changed
+- Refactored `tag_gallery.py` for modular tagging architecture.
+- Centralized configuration paths to `/src/importer/config/` for easier project-wide access.
+
+### 🧹 Maintenance
+- Improved exception handling for missing or malformed tag dictionaries.
+- Added a consistent emoji/logging system across CLI commands.
+
+---
+
+## [v0.3.1] — CLI Polishing & Dictionary Improvements (2025-10-19)
+
+### ✨ Added
+- Introduced **CLI argument parsing** with `argparse` for a unified user interface.
+- Added `--verbose` flag for detailed debugging output.
+- Added **metadata validation** to ensure all tag dictionaries contain unique keywords.
+
+### 🛠 Changed
+- Adjusted internal path resolution to work from both installed and development environments.
+- Improved `load_all_tag_maps()` with caching and better error resilience.
+
+### 🧹 Maintenance
+- Cleaned duplicate mappings within YAML files.
+- Improved documentation and inline docstrings throughout importer modules.
+
+---
+
+## [v0.3.2] — TPDB Bridge Integration (2025-10-21)
+
+### ✨ Added
+- Introduced **`tpdb_bridge.py`** for importing performer data from *ThePornDB* API.
+- Added local **SQLite performer database** under `/src/importer/db/performers.db`.
+- Added commands:
+  - `fetch` — Import performers in a single batch.
+  - `fill-index` — Continuously pull until a limit is reached.
+  - `enrich` — Fetch and merge extended performer metadata.
+  - `sync-all` — Hybrid incremental fetch + enrich loop.
+- Introduced **local API key management** using `tpdb_api_key.txt` under `/secrets/`.
+
+### 🧹 Maintenance
+- Verified importer against TPDB rate limits and ensured safe error recovery.
+- Added initial test data exports to `/src/importer/reports/`.
+
+---
+
+## [v0.3.3] — YAML Tag Inference Update (2025-10-20)
+
+### ✨ Added
+- Dynamic **YAML tag dictionary loader** for modular tag categories.
+- Introduced **automatic source inference** for common networks.
+- Added **`refresh-all`** bulk operation to reapply tag inference globally.
+
+### 🛠 Changed
+- Refactored `infer_tags()` to merge results from multiple YAML files dynamically.
+- Enhanced progress and summary reporting for tag inference.
+
+### 🧹 Maintenance
+- Fixed `AttributeError: 'int' object has no attribute 'lower'` when parsing numeric YAML values.
+- Standardized internal naming conventions.
+
+---
+
+## [v0.3.4] — Tag Dictionary Validation & Cleanup (2025-10-20)
+
+### ✨ Added
+- **`validate-tags`** CLI command for verifying YAML tag dictionaries.
+  - Detects duplicates, empty entries, and conflicting keywords.
+  - Outputs detailed summaries with per-keyword conflict listings.
+
+### 🛠 Changed
+- Standardized YAML structure enforcement (consistent key capitalization and layout).
+- Added human-readable validation summaries.
+
+### 🧹 Maintenance
+- General code cleanup and consistent logging system updates.
+
+---
+
+## [v0.3.5] — Tag Statistics & Unified CLI Update (2025-10-20)
+
+### ✨ Added
+- **Tag Statistics System**
+  - Introduced `tag-stats` command to generate frequency analytics across all gallery metadata.
+  - Produces both console summaries and saved reports:
+    - `reports/tag_stats.json` — JSON-formatted tag counts.
+    - `reports/tag_stats_sorted.txt` — human-readable ranked list.
+- **Unified CLI Interface (`cli.py`)**
+  - Consolidated all tagging and maintenance operations into a single entrypoint:
+    - `refresh-all`, `refresh-one`, `validate-tags`, `tag-stats`, `list`, `list-tags`, `add`, `remove`, `add-multi`, `show-metadata`, `source`
+  - Standardized command syntax and output formatting across all operations.
+
+### 🛠 Changed
+- Centralized tag frequency logic into `tag_gallery.py`.
+- Refactored CLI dispatch system for scalability and better error handling.
+- Standardized output style (headers, dividers, alignment).
+
+### 🧹 Maintenance
+- Automatic creation of `/src/importer/reports/` when missing.
+- Verified all tag operations across 60+ galleries.
+- Unified terminology and capitalization across CLI help text and docstrings.
+
+### 🧭 Next Steps
+- Add color-coded CLI output for readability.
+- Implement `--export-csv` flag for `tag-stats` output.
+- Begin roadmap for **v0.4.0** introducing ML-based tag confidence scoring and category weighting.
+
+---
+
+## [v0.3.6] — Enrichment Verification & Freshness Tracking (2025-10-26)
+
+### ✨ Added
+- **`verify-enrichment` command**
+  - Scans performer database for missing metadata (e.g., `url`, `last_updated`).
+  - Reports enriched vs incomplete entries, with preview via `--show-missing`.
+- **Freshness tracking**
+  - Displays oldest and most recent enrichment timestamps.
+  - Warns if data is older than the freshness threshold.
+- **Automatic TPDB key validation**
+  - Checks for valid API key and provides setup help if missing.
+
+### 🛠 Changed
+- Enrichment logic now guarantees `url` and `last_updated` fields for all performers.
+- Improved emoji-based CLI logs for clarity.
+- CLI outputs enrichment stats after each batch during `sync-all`.
+
+### 🧹 Maintenance
+- Cleanup and refactor of `tpdb_bridge.py` for readability and modular design.
+- Verified completeness: **5,087 performers enriched** and up to date.
+- Improved sleep timing and network error recovery during long sync runs.
+
+### 🧭 Next Steps
+- Add `--stale-days` CLI flag for user-defined freshness thresholds.
+- Implement automatic enrichment scheduling via cron or systemd.
+- Add shortcut alias `porndex-importer verify` for database status checks.
+
+---
+
+## [v0.3.7] — Scene-Based Enrichment & Channel Auto-Upgrade (2025-10-26)
+### ✨ Added
+
+- **Scene-based enrichment system**
+
+  - New flag `--use-scenes` enables intelligent inference of performer studios/channels using recent scene data from ThePornDB.
+
+  - Automatically scans `/performers/{id}/scenes` for studio, site, or network fields when direct metadata is missing.
+
+  - Dynamically upgrades performer entries from “Unknown” to valid channel names (e.g., “Desire Room”, “I Want Clips: Princess Chanel”).
+
+- **Enhanced enrichment diagnostics**
+
+  - `--debug-channels` now outputs detailed channel inference logs with origin type (e.g., “via scene” or “via performer metadata”).
+
+  - Emoji-coded output for improved clarity:
+
+    - 🎞 Scene-based upgrades
+
+    - 🎬 Direct metadata
+
+    - ⚫ Missing channel info
+
+- **Progress verification**
+
+  - `verify-enrichment` now reports precise completion percentages and lists the 20 most recently upgraded performers.
+
+### 🛠 Changed
+
+- Enrichment process now performs automatic in-place upgrades of `performer_sources` without overwriting other fields.
+
+- Optimized query logic to prioritize unverified performers and handle large datasets efficiently.
+
+- Added fine-grained sleep control between API requests to stay compliant with TPDB rate limits.
+
+### 🧹 Maintenance
+
+- Refactored enrichment functions for modularity:
+
+  - `_fetch_studio_from_scenes()` introduced for scene scanning.
+
+  - Simplified argument handling and enriched exception tracing.
+
+- Verified enrichment stability across 100 performers with 44% successful channel discovery in a live test.
+
+- Improved timestamp consistency in verification logs and upgraded database schema resilience.
+
+## [v0.4.2] — Unified Importer, ML Pipeline, and Semantic Search (2025-10-27)
+### ✨ Added
+
+- **Unified Importer CLI (`porndex-importer`)**
+
+  - Replaces legacy multi-script workflow with a single command entrypoint.
+
+  - Introduced `import`, `refresh-all`, `refresh-one`, `validate-tags`, `tag-stats`, and `source` subcommands.
+
+  - Includes colorized CLI summaries and consistent emoji headers.
+
+- **Machine Learning Dataset Builder**
+
+  - New module: `ml/ml_dataset_builder.py`
+
+  - Generates structured dataset in `ML/porndex_dataset.jsonl` from all indexed galleries.
+
+  - Each record includes title, models, tags, and image paths for hybrid ML ingestion.
+
+- **Embedding Generation Module**
+
+  - Added `ml/ml_embeddings.py` to create hybrid text + image embeddings.
+
+  - Builds per-gallery NPZ files under `ML/embeddings/` and a consolidated `embeddings_index.jsonl`.
+
+  - Supports configurable `--img-samples` and automatic device detection (`--device auto`).
+
+- **Semantic & Strict Search**
+
+  - `search` command supports three modes:
+
+    - `semantic`: CLIP + text hybrid cosine similarity (default)
+
+    - `text`: text-only vector space search
+
+    - `strict`: literal match filtering before vector ranking
+
+  - Results show top-ranked galleries, confidence scores, and gallery IDs.
+
+- **ML Verification Command**
+
+  - `verify` confirms index consistency, embedding count, and file integrity.
+
+- **Directory Auto-Creation**
+
+  - Automatically creates `ML/` and `ML/embeddings/` if missing.
+
+### 🛠 Changed
+
+- **Importer Pipeline Refactor**
+
+  - Moved all CLI handling into `src/importer/cli.py`.
+
+  - Centralized environment setup and config loading.
+
+  - Replaced direct Python script calls with the `porndex-importer` entrypoint.
+
+- **Tagging System**
+
+  - Unified YAML dictionary loading for clothing, acts, body, and context.
+
+  - Improved tag inference logging and duplicate suppression.
+
+- **Output Formatting**
+
+  - Standardized headers, dividers, and indentation across all CLI commands.
+
+  - Added readable time and path indicators for long-running operations.
+
+### 🧹 Maintenance
+
+- Verified full ML dataset build across 150 test galleries (100% JSONL completion).
+
+- Added fallback for empty or missing image lists in dataset builder.
+
+- Improved error handling for partial downloads and interrupted imports.
+
+- Streamlined path resolution for consistent operation across dev and installed modes.
+
+- Updated documentation:
+
+  - `/docs/CLI_USAGE.md` rewritten for v0.4.2.
+
+  - `/README.md` modernized with full project tree and ML pipeline overview.
+
+### 🧭 Next Steps
+
+- Begin v0.4.3–v0.5.x roadmap:
+
+  - Integrate GroundingDINO + GroundedSAM for visual region detection.
+
+  - Implement attribute extraction (gender → ethnicity → clothing).
+
+  - Build visual verification tool (`ml_dataset_inspector.py`).
+
+  - Add tag-confidence weighting system.
+
+  - Extend TPDB bridge to cross-link enriched performer metadata into ML training records.
+
+## 🧩 Summary of Current State (as of v0.4.2)
+
+- ✅ Fully unified CLI under `porndex-importer`
+- ✅ Stable YAML tagging + validation
+- ✅ Complete ML dataset and embedding generation workflow
+- ✅ Working hybrid semantic search
+- ✅ Verified 150-gallery dataset index
+
+© 2025 Leak Technologies — Porndex Importer Project
\ No newline at end of file