309 lines
11 KiB
Markdown
309 lines
11 KiB
Markdown
# 📜 Porndex Importer — Full Changelog
|
||
> **Project:** Porndex_PornpicsImporter
|
||
> **Repository:** Leak Technologies
|
||
> **Branch:** main
|
||
> **Version Line:** v0.3.x Development Cycle
|
||
|
||
---
|
||
|
||
## [v0.3.0] — Modular Tagging Framework Foundation (2025-10-18)
|
||
|
||
### ✨ Added
|
||
- Introduced **YAML-based Tag Dictionaries** stored under `/src/importer/tagging/` for modular, human-readable tag definitions.
|
||
- Implemented initial **`refresh-all`** and **`refresh-one`** commands for reapplying tag inference to galleries.
|
||
- Added **persistent `inferred_tags` field** in `metadata.json` to differentiate between automated and manual tags.
|
||
- Implemented **automatic source inference** for known networks (e.g., Brazzers, FTV Girls, PornPics).
|
||
- Enhanced CLI output with colorized progress indicators and summary totals.
|
||
|
||
### 🛠 Changed
|
||
- Refactored `tag_gallery.py` for modular tagging architecture.
|
||
- Centralized configuration paths to `/src/importer/config/` for easier project-wide access.
|
||
|
||
### 🧹 Maintenance
|
||
- Improved exception handling for missing or malformed tag dictionaries.
|
||
- Added consistent emoji/logging system across CLI commands.
|
||
|
||
---
|
||
|
||
## [v0.3.1] — CLI Polishing & Dictionary Improvements (2025-10-19)
|
||
|
||
### ✨ Added
|
||
- Introduced **CLI argument parsing** with `argparse` for a unified user interface.
|
||
- Added `--verbose` flag for detailed debugging output.
|
||
- Added **metadata validation** to ensure all tag dictionaries contain unique keywords.
|
||
|
||
### 🛠 Changed
|
||
- Adjusted internal path resolution to work from both installed and development environments.
|
||
- Improved `load_all_tag_maps()` with caching and better error resilience.
|
||
|
||
### 🧹 Maintenance
|
||
- Cleaned duplicate mappings within YAML files.
|
||
- Improved documentation and inline docstrings throughout importer modules.
|
||
|
||
---
|
||
|
||
## [v0.3.2] — TPDB Bridge Integration (2025-10-21)
|
||
|
||
### ✨ Added
|
||
- Introduced **`tpdb_bridge.py`** for importing performer data from *ThePornDB* API.
|
||
- Added local **SQLite performer database** under `/src/importer/db/performers.db`.
|
||
- Added commands:
|
||
- `fetch` — Import performers in a single batch.
|
||
- `fill-index` — Continuously pull until a limit is reached.
|
||
- `enrich` — Fetch and merge extended performer metadata.
|
||
- `sync-all` — Hybrid incremental fetch + enrich loop.
|
||
- Introduced **local API key management** using `tpdb_api_key.txt` under `/secrets/`.
|
||
|
||
### 🧹 Maintenance
|
||
- Verified importer against TPDB rate limits and ensured safe error recovery.
|
||
- Added initial test data exports to `/src/importer/reports/`.
|
||
|
||
---
|
||
|
||
## [v0.3.3] — YAML Tag Inference Update (2025-10-20)
|
||
|
||
### ✨ Added
|
||
- Dynamic **YAML tag dictionary loader** for modular tag categories.
|
||
- Introduced **automatic source inference** for common networks.
|
||
- Added **`refresh-all`** bulk operation to reapply tag inference globally.
|
||
|
||
### 🛠 Changed
|
||
- Refactored `infer_tags()` to merge results from multiple YAML files dynamically.
|
||
- Enhanced progress and summary reporting for tag inference.
|
||
|
||
### 🧹 Maintenance
|
||
- Fixed `AttributeError: 'int' object has no attribute 'lower'` when parsing numeric YAML values.
|
||
- Standardized internal naming conventions.
|
||
|
||
---
|
||
|
||
## [v0.3.4] — Tag Dictionary Validation & Cleanup (2025-10-20)
|
||
|
||
### ✨ Added
|
||
- **`validate-tags`** CLI command for verifying YAML tag dictionaries.
|
||
- Detects duplicates, empty entries, and conflicting keywords.
|
||
- Outputs detailed summaries with per-keyword conflict listings.
|
||
|
||
### 🛠 Changed
|
||
- Standardized YAML structure enforcement (consistent key capitalization and layout).
|
||
- Added human-readable validation summaries.
|
||
|
||
### 🧹 Maintenance
|
||
- General code cleanup and consistent logging system updates.
|
||
|
||
---
|
||
|
||
## [v0.3.5] — Tag Statistics & Unified CLI Update (2025-10-20)
|
||
|
||
### ✨ Added
|
||
- **Tag Statistics System**
|
||
- Introduced `tag-stats` command to generate frequency analytics across all gallery metadata.
|
||
- Produces both console summaries and saved reports:
|
||
- `reports/tag_stats.json` — JSON-formatted tag counts.
|
||
- `reports/tag_stats_sorted.txt` — human-readable ranked list.
|
||
- **Unified CLI Interface (`cli.py`)**
|
||
- Consolidated all tagging and maintenance operations into a single entrypoint:
|
||
- `refresh-all`, `refresh-one`, `validate-tags`, `tag-stats`, `list`, `list-tags`, `add`, `remove`, `add-multi`, `show-metadata`, `source`
|
||
- Standardized command syntax and output formatting across all operations.
|
||
|
||
### 🛠 Changed
|
||
- Centralized tag frequency logic into `tag_gallery.py`.
|
||
- Refactored CLI dispatch system for scalability and better error handling.
|
||
- Standardized output style (headers, dividers, alignment).
|
||
|
||
### 🧹 Maintenance
|
||
- Automatic creation of `/src/importer/reports/` when missing.
|
||
- Verified all tag operations across 60+ galleries.
|
||
- Unified terminology and capitalization across CLI help text and docstrings.
|
||
|
||
### 🧭 Next Steps
|
||
- Add color-coded CLI output for readability.
|
||
- Implement `--export-csv` flag for `tag-stats` output.
|
||
- Begin roadmap for **v0.4.0** introducing ML-based tag confidence scoring and category weighting.
|
||
|
||
---
|
||
|
||
## [v0.3.6] — Enrichment Verification & Freshness Tracking (2025-10-26)
|
||
|
||
### ✨ Added
|
||
- **verify-enrichment command**
|
||
- Scans performer database for missing metadata (e.g., `url`, `last_updated`).
|
||
- Reports enriched vs incomplete entries, with preview via `--show-missing`.
|
||
- **Freshness tracking**
|
||
- Displays oldest and most recent enrichment timestamps.
|
||
- Warns if data is older than the freshness threshold.
|
||
- **Automatic TPDB key validation**
|
||
- Checks for valid API key and provides setup help if missing.
|
||
|
||
### 🛠 Changed
|
||
- Enrichment logic now guarantees `url` and `last_updated` fields for all performers.
|
||
- Improved emoji-based CLI logs for clarity.
|
||
- CLI outputs enrichment stats after each batch during `sync-all`.
|
||
|
||
### 🧹 Maintenance
|
||
- Cleanup and refactor of `tpdb_bridge.py` for readability and modular design.
|
||
- Verified completeness: **5,087 performers enriched** and up to date.
|
||
- Improved sleep timing and network error recovery during long sync runs.
|
||
|
||
### 🧭 Next Steps
|
||
- Add `--stale-days` CLI flag for user-defined freshness thresholds.
|
||
- Implement automatic enrichment scheduling via cron or systemd.
|
||
- Add shortcut alias `porndex-importer verify` for database status checks.
|
||
|
||
---
|
||
|
||
[v0.3.7] — Scene-Based Enrichment & Channel Auto-Upgrade (2025-10-26)
|
||
✨ Added
|
||
|
||
Scene-based enrichment system
|
||
|
||
New flag --use-scenes enables intelligent inference of performer studios/channels using recent scene data from ThePornDB.
|
||
|
||
Automatically scans /performers/{id}/scenes for studio, site, or network fields when direct metadata is missing.
|
||
|
||
Dynamically upgrades performer entries from “Unknown” to valid channel names (e.g., “Desire Room”, “I Want Clips: Princess Chanel”).
|
||
|
||
Enhanced enrichment diagnostics
|
||
|
||
--debug-channels now outputs detailed channel inference logs with origin type (e.g., “via scene” or “via performer metadata”).
|
||
|
||
Emoji-coded output for improved clarity:
|
||
|
||
🎞 Scene-based upgrades
|
||
|
||
🎬 Direct metadata
|
||
|
||
⚫ Missing channel info
|
||
|
||
Progress verification
|
||
|
||
verify-enrichment now reports precise completion percentages and lists the most recent 20 upgraded performers.
|
||
|
||
🛠 Changed
|
||
|
||
Enrichment process now performs automatic in-place upgrades of performer_sources without overwriting other fields.
|
||
|
||
Optimized query logic to prioritize unverified performers and handle large datasets efficiently.
|
||
|
||
Added fine-grained sleep control between API requests to stay compliant with TPDB rate limits.
|
||
|
||
🧹 Maintenance
|
||
|
||
Refactored enrichment functions for modularity:
|
||
|
||
_fetch_studio_from_scenes() introduced for scene scanning.
|
||
|
||
Simplified argument handling and enriched exception tracing.
|
||
|
||
Verified enrichment stability across 100 performers with 44% successful channel discovery in live test.
|
||
|
||
Improved timestamp consistency in verification logs and upgraded database schema resilience.
|
||
|
||
[v0.4.2] — Unified Importer, ML Pipeline, and Semantic Search (2025-10-27)
|
||
✨ Added
|
||
|
||
Unified Importer CLI (porndex-importer)
|
||
|
||
Replaces legacy multi-script workflow with a single command entrypoint.
|
||
|
||
Introduced import, refresh-all, refresh-one, validate-tags, tag-stats, and source subcommands.
|
||
|
||
Includes colorized CLI summaries and consistent emoji headers.
|
||
|
||
Machine Learning Dataset Builder
|
||
|
||
New module: ml/ml_dataset_builder.py
|
||
|
||
Generates structured dataset in ML/porndex_dataset.jsonl from all indexed galleries.
|
||
|
||
Each record includes title, models, tags, and image paths for hybrid ML ingestion.
|
||
|
||
Embedding Generation Module
|
||
|
||
Added ml/ml_embeddings.py to create hybrid text + image embeddings.
|
||
|
||
Builds per-gallery NPZ files under ML/embeddings/ and a consolidated embeddings_index.jsonl.
|
||
|
||
Supports configurable --img-samples and automatic device detection (--device auto).
|
||
|
||
Semantic & Strict Search
|
||
|
||
search command supports three modes:
|
||
|
||
semantic: CLIP + text hybrid cosine similarity (default)
|
||
|
||
text: text-only vector space search
|
||
|
||
strict: literal match filtering before vector ranking
|
||
|
||
Results show top-ranked galleries, confidence scores, and gallery IDs.
|
||
|
||
ML Verification Command
|
||
|
||
verify confirms index consistency, embedding count, and file integrity.
|
||
|
||
Directory Auto-Creation
|
||
|
||
Automatically generates ML/embeddings/ and ML/ if missing.
|
||
|
||
🛠 Changed
|
||
|
||
Importer Pipeline Refactor
|
||
|
||
Moved all CLI handling into src/importer/cli.py.
|
||
|
||
Centralized environment setup and config loading.
|
||
|
||
Replaced direct Python script calls with porndex-importer entrypoint.
|
||
|
||
Tagging System
|
||
|
||
Unified YAML dictionary loading for clothing, acts, body, and context.
|
||
|
||
Improved tag inference logging and duplicate suppression.
|
||
|
||
Output Formatting
|
||
|
||
Standardized headers, dividers, and indentation across all CLI commands.
|
||
|
||
Added readable time and path indicators for long-running operations.
|
||
|
||
🧹 Maintenance
|
||
|
||
Verified full ML dataset build across 150 test galleries (100% JSONL completion).
|
||
|
||
Added fallback for empty or missing image lists in dataset builder.
|
||
|
||
Improved error handling for partial downloads and interrupted imports.
|
||
|
||
Streamlined path resolution for consistent operation across dev and installed modes.
|
||
|
||
Updated documentation:
|
||
|
||
/docs/CLI_USAGE.md rewritten for v0.4.2.
|
||
|
||
/README.md modernized with full project tree and ML pipeline overview.
|
||
|
||
🧭 Next Steps
|
||
|
||
Begin v0.4.3–v0.5.x roadmap:
|
||
|
||
Integrate GroundingDINO + GroundedSAM for visual region detection.
|
||
|
||
Implement attribute extraction (gender → ethnicity → clothing).
|
||
|
||
Build visual verification tool (ml_dataset_inspector.py).
|
||
|
||
Add tag-confidence weighting system.
|
||
|
||
Extend TPDB bridge to cross-link enriched performer metadata into ML training records.
|
||
|
||
🧩 Summary of Current State (as of v0.4.2)
|
||
|
||
✅ Fully unified CLI under porndex-importer
|
||
✅ Stable YAML tagging + validation
|
||
✅ Complete ML dataset and embedding generation workflow
|
||
✅ Working hybrid semantic search
|
||
✅ Verified 150-gallery dataset index
|
||
|
||
© 2025 Leak Technologies — Porndex Importer Project |