📜 Porndex Importer — Full Changelog
Project: Porndex_PornpicsImporter
Repository: Leak Technologies
Branch: main
Version Line: v0.3.x Development Cycle
[v0.3.0] — Modular Tagging Framework Foundation (2025-10-18)
✨ Added
- Introduced YAML-based Tag Dictionaries stored under /src/importer/tagging/ for modular, human-readable tag definitions.
- Implemented initial refresh-all and refresh-one commands for reapplying tag inference to galleries.
- Added persistent inferred_tags field in metadata.json to differentiate between automated and manual tags.
- Implemented automatic source inference for known networks (e.g., Brazzers, FTV Girls, PornPics).
- Enhanced CLI output with colorized progress indicators and summary totals.
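As a rough illustration of how refresh-one might keep automated tags apart from manual ones, the sketch below rewrites only the inferred_tags field of a metadata.json file. The `title` and `tags` field names, the `TAG_MAP` contents, and both function names are assumptions made for this example; only inferred_tags and metadata.json come from the changelog.

```python
import json
from pathlib import Path

# Hypothetical stand-in for a parsed YAML tag dictionary:
# tag name -> keywords that trigger it.
TAG_MAP = {
    "outdoor": ["beach", "garden"],
    "studio": ["studio", "backdrop"],
}


def infer_tags(title, tag_map):
    """Return the sorted tags whose keywords occur in the title (case-insensitive)."""
    lowered = title.lower()
    return sorted(
        tag
        for tag, keywords in tag_map.items()
        if any(str(kw).lower() in lowered for kw in keywords)
    )


def refresh_one(metadata_path, tag_map):
    """Recompute inferred_tags for one gallery without touching manual tags."""
    path = Path(metadata_path)
    meta = json.loads(path.read_text(encoding="utf-8"))
    meta.setdefault("tags", [])  # manual tags are left exactly as they were
    meta["inferred_tags"] = infer_tags(meta.get("title", ""), tag_map)
    path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```

Keeping the two lists separate means a refresh can be rerun at any time without discarding tags a user added by hand.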
🛠 Changed
- Refactored tag_gallery.py for modular tagging architecture.
- Centralized configuration paths to /src/importer/config/ for easier project-wide access.
🧹 Maintenance
- Improved exception handling for missing or malformed tag dictionaries.
- Added consistent emoji/logging system across CLI commands.
[v0.3.1] — CLI Polishing & Dictionary Improvements (2025-10-19)
✨ Added
- Introduced CLI argument parsing with argparse for a unified user interface.
- Added --verbose flag for detailed debugging output.
- Added metadata validation to ensure all tag dictionaries contain unique keywords.
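A minimal sketch of what an argparse-based interface with a global --verbose flag could look like. The program name `importer`, the `gallery_id` positional argument, and the help strings are assumptions for illustration; the real parser layout is not documented here.

```python
import argparse


def build_parser():
    """Build a parser with a global --verbose flag and two subcommands."""
    parser = argparse.ArgumentParser(prog="importer")  # prog name is hypothetical
    parser.add_argument(
        "--verbose", action="store_true", help="print detailed debugging output"
    )
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("refresh-all", help="reapply tag inference to every gallery")
    one = sub.add_parser("refresh-one", help="reapply tag inference to one gallery")
    one.add_argument("gallery_id")  # hypothetical positional argument
    return parser
```

With this layout the flag is given before the subcommand, for example `importer --verbose refresh-one <gallery_id>`.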
🛠 Changed
- Adjusted internal path resolution to work from both installed and development environments.
- Improved load_all_tag_maps() with caching and better error resilience.
🧹 Maintenance
- Cleaned duplicate mappings within YAML files.
- Improved documentation and inline docstrings throughout importer modules.
[v0.3.2] — TPDB Bridge Integration (2025-10-21)
✨ Added
- Introduced tpdb_bridge.py for importing performer data from ThePornDB API.
- Added local SQLite performer database under /src/importer/db/performers.db.
- Added commands:
  - fetch: Import performers in a single batch.
  - fill-index: Continuously pull until a limit is reached.
  - enrich: Fetch and merge extended performer metadata.
  - sync-all: Hybrid incremental fetch + enrich loop.
- Introduced local API key management using tpdb_api_key.txt under /secrets/.
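The difference between a single-batch fetch and a fill-index loop can be sketched with the standard sqlite3 module. The table schema, the function names, and the `fetch_page` callable (a stand-in for the real API call) are all assumptions for this example, not the actual tpdb_bridge.py code.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the performer database and create a minimal, hypothetical table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS performers (id TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    return conn


def count_performers(conn):
    return conn.execute("SELECT COUNT(*) FROM performers").fetchone()[0]


def store_batch(conn, batch):
    """Insert new performers and update existing ones (upsert keyed on id)."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO performers (id, name) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
            [(p["id"], p["name"]) for p in batch],
        )


def fill_index(conn, fetch_page, limit):
    """Pull pages until the table holds at least `limit` rows or the pages run out.

    The final batch may push the count slightly past the limit.
    """
    page = 1
    while count_performers(conn) < limit:
        batch = fetch_page(page)
        if not batch:
            break
        store_batch(conn, batch)
        page += 1
    return count_performers(conn)
```

Because the insert is an upsert, rerunning a fetch over pages that were already imported updates rows instead of failing on duplicate ids.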
🧹 Maintenance
- Verified importer against TPDB rate limits and ensured safe error recovery.
- Added initial test data exports to /src/importer/reports/.
[v0.3.3] — YAML Tag Inference Update (2025-10-20)
✨ Added
- Dynamic YAML tag dictionary loader for modular tag categories.
- Introduced automatic source inference for common networks.
- Added refresh-all bulk operation to reapply tag inference globally.
🛠 Changed
- Refactored infer_tags() to merge results from multiple YAML files dynamically.
- Enhanced progress and summary reporting for tag inference.
🧹 Maintenance
- Fixed AttributeError: 'int' object has no attribute 'lower' when parsing numeric YAML values.
- Standardized internal naming conventions.
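The AttributeError above is typical of YAML loaders returning unquoted numbers as int or float rather than str. One way to guard against it, shown here as a sketch with a hypothetical helper name rather than the project's actual fix, is to coerce every keyword to a string before lowercasing it.

```python
def normalize_keywords(raw):
    """Coerce keyword values parsed from YAML into clean lowercase strings.

    Accepts a single scalar or a list; numeric values are converted with
    str() instead of being lowercased directly.
    """
    if raw is None:
        return []
    if isinstance(raw, (str, int, float)):
        raw = [raw]  # a lone scalar is treated as a one-item list
    cleaned = []
    for item in raw:
        if item is None:
            continue
        text = str(item).strip().lower()
        if text:
            cleaned.append(text)
    return cleaned
```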
[v0.3.4] — Tag Dictionary Validation & Cleanup (2025-10-20)
✨ Added
- validate-tags CLI command for verifying YAML tag dictionaries.
- Detects duplicates, empty entries, and conflicting keywords.
- Outputs detailed summaries with per-keyword conflict listings.
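The three checks named above can be sketched over already-parsed dictionaries. The input shape (file name mapped to a tag-to-keywords mapping), the function name, and the definition of a conflict as one keyword claimed by more than one tag are assumptions for this example.

```python
from collections import defaultdict


def validate_tag_maps(tag_maps):
    """Check parsed tag dictionaries for empty entries, duplicates, and conflicts.

    tag_maps: {file_name: {tag: [keywords]}} (hypothetical shape).
    """
    empty, duplicates = [], []
    owners = defaultdict(set)  # keyword -> tags that claim it
    for file_name, mapping in tag_maps.items():
        for tag, keywords in mapping.items():
            cleaned = [str(k).strip().lower() for k in (keywords or [])]
            cleaned = [k for k in cleaned if k]
            if not cleaned:
                empty.append((file_name, tag))
            seen = set()
            for kw in cleaned:
                if kw in seen:
                    duplicates.append((file_name, tag, kw))
                seen.add(kw)
                owners[kw].add(tag)
    conflicts = {kw: sorted(tags) for kw, tags in owners.items() if len(tags) > 1}
    return {"empty": empty, "duplicates": duplicates, "conflicts": conflicts}
```

The `conflicts` mapping lends itself to a per-keyword listing: each key is a keyword and each value is the list of tags competing for it.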
🛠 Changed
- Standardized YAML structure enforcement (consistent key capitalization and layout).
- Added human-readable validation summaries.
🧹 Maintenance
- General code cleanup and consistent logging system updates.
[v0.3.5] — Tag Statistics & Unified CLI Update (2025-10-20)
✨ Added
- Tag Statistics System
  - Introduced tag-stats command to generate frequency analytics across all gallery metadata.
  - Produces both console summaries and saved reports:
    - reports/tag_stats.json: JSON-formatted tag counts.
    - reports/tag_stats_sorted.txt: human-readable ranked list.
- Unified CLI Interface (cli.py)
  - Consolidated all tagging and maintenance operations into a single entrypoint: refresh-all, refresh-one, validate-tags, tag-stats, list, list-tags, add, remove, add-multi, show-metadata, source
  - Standardized command syntax and output formatting across all operations.
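A frequency report like the one tag-stats produces can be built with collections.Counter. The sketch below assumes one directory per gallery, each holding a metadata.json with `tags` and `inferred_tags` lists; that layout and the function names are assumptions, while the two report file names come from the changelog.

```python
import json
from collections import Counter
from pathlib import Path


def collect_tag_counts(galleries_root):
    """Count in how many galleries each tag appears (manual or inferred)."""
    counts = Counter()
    for meta_path in sorted(Path(galleries_root).glob("*/metadata.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        # A set avoids counting a tag twice when it is both manual and inferred.
        counts.update(set(meta.get("tags", [])) | set(meta.get("inferred_tags", [])))
    return counts


def write_reports(counts, reports_dir):
    """Write tag_stats.json and tag_stats_sorted.txt, creating the directory if needed."""
    reports_dir = Path(reports_dir)
    reports_dir.mkdir(parents=True, exist_ok=True)
    # Highest count first; ties are broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    (reports_dir / "tag_stats.json").write_text(
        json.dumps(dict(ranked), indent=2), encoding="utf-8"
    )
    lines = [f"{rank:>3}. {tag:<24} {n}" for rank, (tag, n) in enumerate(ranked, 1)]
    (reports_dir / "tag_stats_sorted.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return ranked
```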
🛠 Changed
- Centralized tag frequency logic into tag_gallery.py.
- Refactored CLI dispatch system for scalability and better error handling.
- Standardized output style (headers, dividers, alignment).
🧹 Maintenance
- Automatic creation of /src/importer/reports/ when missing.
- Verified all tag operations across 60+ galleries.
- Unified terminology and capitalization across CLI help text and docstrings.
🧭 Next Steps
- Add color-coded CLI output for readability.
- Implement --export-csv flag for tag-stats output.
- Begin roadmap for v0.4.0 introducing ML-based tag confidence scoring and category weighting.
[v0.3.6] — Enrichment Verification & Freshness Tracking (2025-10-26)
✨ Added
- verify-enrichment command
  - Scans performer database for missing metadata (e.g., url, last_updated).
  - Reports enriched vs incomplete entries, with preview via --show-missing.
- Freshness tracking
  - Displays oldest and most recent enrichment timestamps.
  - Warns if data is older than the freshness threshold.
- Automatic TPDB key validation
  - Checks for valid API key and provides setup help if missing.
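A completeness and freshness check of this kind can be sketched against a SQLite table. The `performers` table name, the `name` column, the ISO 8601 timestamp format with a UTC offset, and the 30-day default are assumptions for this example; only the url and last_updated fields are named in the changelog.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def enrichment_report(conn, stale_days=30, now=None):
    """Summarize enrichment completeness and freshness for all performers."""
    now = now or datetime.now(timezone.utc)
    rows = conn.execute("SELECT name, url, last_updated FROM performers").fetchall()
    # An entry counts as incomplete when either required field is missing.
    incomplete = [name for name, url, ts in rows if not url or not ts]
    stamps = sorted(datetime.fromisoformat(ts) for _, _, ts in rows if ts)
    cutoff = now - timedelta(days=stale_days)
    return {
        "total": len(rows),
        "enriched": len(rows) - len(incomplete),
        "incomplete": incomplete,
        "oldest": stamps[0] if stamps else None,
        "newest": stamps[-1] if stamps else None,
        # Stale when the oldest enrichment predates the freshness threshold.
        "stale": bool(stamps) and stamps[0] < cutoff,
    }
```

The `incomplete` list is what a preview option such as --show-missing could print, and the `stale` flag corresponds to the freshness warning.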
🛠 Changed
- Enrichment logic now guarantees url and last_updated fields for all performers.
- Improved emoji-based CLI logs for clarity.
- CLI outputs enrichment stats after each batch during sync-all.
🧹 Maintenance
- Cleanup and refactor of tpdb_bridge.py for readability and modular design.
- Verified completeness: 5,087 performers enriched and up to date.
- Improved sleep timing and network error recovery during long sync runs.
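The changelog does not say which recovery strategy the importer uses during long sync runs. Exponential backoff is one common approach, sketched here with a hypothetical helper; the retry count, the delays, and the set of exceptions caught are illustrative assumptions.

```python
import time


def call_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on network-style errors with exponential backoff.

    Waits base_delay, then 2x, then 4x, and so on between attempts. The last
    error is re-raised once the retries are exhausted.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.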
🧭 Next Steps
- Add --stale-days CLI flag for user-defined freshness thresholds.
- Implement automatic enrichment scheduling via cron or systemd.
- Add shortcut alias porndex-importer verify for database status checks.
[v0.3.7] — Scene-Based Enrichment & Channel Auto-Upgrade (2025-10-26)
✨ Added
- Scene-based enrichment system
  - New flag --use-scenes enables intelligent inference of performer studios/channels using recent scene data from ThePornDB.
  - Automatically scans /performers/{id}/scenes for studio, site, or network fields when direct metadata is missing.
  - Dynamically upgrades performer entries from “Unknown” to valid channel names (e.g., “Desire Room”, “I Want Clips: Princess Chanel”).
- Enhanced enrichment diagnostics
  - --debug-channels now outputs detailed channel inference logs with origin type (e.g., “via scene” or “via performer metadata”).
  - Emoji-coded output for improved clarity:
    - 🎞 Scene-based upgrades
    - 🎬 Direct metadata
    - ⚫ Missing channel info
- Progress verification
  - verify-enrichment now reports precise completion percentages and lists the most recent 20 upgraded performers.
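The fallback from direct metadata to scene data can be sketched as a pure function over already-fetched records. The JSON shapes, the `channel` key, the helper names, and the majority-vote rule across scenes are assumptions for this example; the changelog states only that the studio, site, and network fields are checked and that each result is labeled with its origin.

```python
from collections import Counter

SCENE_FIELDS = ("studio", "site", "network")  # checked in this order per scene


def _name_of(value):
    """Extract a display name from a string or a {'name': ...} object."""
    if isinstance(value, dict):
        value = value.get("name")
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def infer_channel(performer, scenes):
    """Return (channel, origin) using direct metadata first, then scene data."""
    direct = _name_of(performer.get("channel"))
    if direct and direct != "Unknown":
        return direct, "via performer metadata"
    votes = Counter()
    for scene in scenes:
        for field in SCENE_FIELDS:
            name = _name_of(scene.get(field))
            if name:
                votes[name] += 1
                break  # at most one vote per scene
    if votes:
        return votes.most_common(1)[0][0], "via scene"
    return "Unknown", "missing"
```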
🛠 Changed
- Enrichment process now performs automatic in-place upgrades of performer_sources without overwriting other fields.
- Optimized query logic to prioritize unverified performers and handle large datasets efficiently.
- Added fine-grained sleep control between API requests to stay compliant with TPDB rate limits.
🧹 Maintenance
- Refactored enrichment functions for modularity:
  - _fetch_studio_from_scenes() introduced for scene scanning.
  - Simplified argument handling and enriched exception tracing.
- Verified enrichment stability across 100 performers with 44% successful channel discovery in live test.
- Improved timestamp consistency in verification logs and upgraded database schema resilience.
[v0.4.2] — Unified Importer, ML Pipeline, and Semantic Search (2025-10-27)
✨ Added
- Unified Importer CLI (porndex-importer)
  - Replaces legacy multi-script workflow with a single command entrypoint.
  - Introduced import, refresh-all, refresh-one, validate-tags, tag-stats, and source subcommands.
  - Includes colorized CLI summaries and consistent emoji headers.
- Machine Learning Dataset Builder
  - New module: ml/ml_dataset_builder.py
  - Generates structured dataset in ML/porndex_dataset.jsonl from all indexed galleries.
  - Each record includes title, models, tags, and image paths for hybrid ML ingestion.
- Embedding Generation Module
  - Added ml/ml_embeddings.py to create hybrid text + image embeddings.
  - Builds per-gallery NPZ files under ML/embeddings/ and a consolidated embeddings_index.jsonl.
  - Supports configurable --img-samples and automatic device detection (--device auto).
- Semantic & Strict Search
  - search command supports three modes:
    - semantic: CLIP + text hybrid cosine similarity (default)
    - text: text-only vector space search
    - strict: literal match filtering before vector ranking
  - Results show top-ranked galleries, confidence scores, and gallery IDs.
- ML Verification Command
  - verify confirms index consistency, embedding count, and file integrity.
- Directory Auto-Creation
  - Automatically generates ML/embeddings/ and ML/ if missing.
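The three search modes listed above differ in which vectors are compared and whether candidates are filtered first. The sketch below uses tiny hand-made vectors in place of real CLIP and text embeddings; the index entry keys (`id`, `text`, `hybrid_vec`, `text_vec`) and the function names are assumptions for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, query_text, index, mode="semantic", top_k=5):
    """Rank index entries against the query; returns [(gallery_id, score), ...]."""
    if mode not in ("semantic", "text", "strict"):
        raise ValueError(f"unknown mode: {mode}")
    candidates = index
    if mode == "strict":
        # Literal filter first, then vector ranking over the survivors.
        needle = query_text.lower()
        candidates = [e for e in index if needle in e["text"].lower()]
    key = "text_vec" if mode == "text" else "hybrid_vec"
    scored = [(cosine(query_vec, e[key]), e["id"]) for e in candidates]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, id breaks ties
    return [(gid, round(score, 4)) for score, gid in scored[:top_k]]
```

In strict mode a gallery that never mentions the query text is excluded outright, however close its embedding is.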
🛠 Changed
- Importer Pipeline Refactor
  - Moved all CLI handling into src/importer/cli.py.
  - Centralized environment setup and config loading.
  - Replaced direct Python script calls with porndex-importer entrypoint.
- Tagging System
  - Unified YAML dictionary loading for clothing, acts, body, and context.
  - Improved tag inference logging and duplicate suppression.
- Output Formatting
  - Standardized headers, dividers, and indentation across all CLI commands.
  - Added readable time and path indicators for long-running operations.
🧹 Maintenance
- Verified full ML dataset build across 150 test galleries (100% JSONL completion).
- Added fallback for empty or missing image lists in dataset builder.
- Improved error handling for partial downloads and interrupted imports.
- Streamlined path resolution for consistent operation across dev and installed modes.
- Updated documentation:
  - /docs/CLI_USAGE.md rewritten for v0.4.2.
  - /README.md modernized with full project tree and ML pipeline overview.
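The dataset build verified above writes one JSONL record per gallery, with a fallback when a gallery has no images. A minimal sketch of that idea follows; the directory layout, the metadata field names, the `images` key, and the list of image suffixes are assumptions for this example rather than the actual ml_dataset_builder.py behavior.

```python
import json
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # assumed image types


def build_record(gallery_dir):
    """Build one dataset record; missing fields fall back to empty values."""
    gallery_dir = Path(gallery_dir)
    meta = json.loads((gallery_dir / "metadata.json").read_text(encoding="utf-8"))
    images = sorted(
        str(p) for p in gallery_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    return {
        "title": meta.get("title", ""),
        "models": meta.get("models", []),
        "tags": meta.get("tags", []),
        "images": images,  # an empty list when the gallery has no images
    }


def write_dataset(galleries_root, out_path):
    """Write one JSON object per line for every gallery that has metadata."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    galleries = sorted(
        p for p in Path(galleries_root).iterdir() if (p / "metadata.json").is_file()
    )
    with out_path.open("w", encoding="utf-8") as fh:
        for gallery_dir in galleries:
            fh.write(json.dumps(build_record(gallery_dir), ensure_ascii=False) + "\n")
    return len(galleries)
```

Emitting a record with an empty image list, rather than skipping the gallery, keeps the line count of the JSONL file equal to the number of indexed galleries.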
🧭 Next Steps
- Begin v0.4.3–v0.5.x roadmap:
  - Integrate GroundingDINO + GroundedSAM for visual region detection.
  - Implement attribute extraction (gender → ethnicity → clothing).
  - Build visual verification tool (ml_dataset_inspector.py).
  - Add tag-confidence weighting system.
- Extend TPDB bridge to cross-link enriched performer metadata into ML training records.
🧩 Summary of Current State (as of v0.4.2)
✅ Fully unified CLI under porndex-importer
✅ Stable YAML tagging + validation
✅ Complete ML dataset and embedding generation workflow
✅ Working hybrid semantic search
✅ Verified 150-gallery dataset index
© 2025 Leak Technologies — Porndex Importer Project