# Adult Empire Scraper Integration
**Version**: v0.1.0-dev5
**Last Updated**: 2025-11-17
## Overview
Goondex includes an Adult Empire scraper modeled on the Stash app's scraping architecture. It fetches metadata, cover art, and performer information directly from Adult Empire (adultdvdempire.com).
## Features
### ✅ Scene Scraping
- Extract scene title, description, release date
- Download cover art/thumbnails
- Retrieve studio information
- Get performer lists
- Extract tags/categories
- Scene code/SKU
- Director information
### ✅ Performer Scraping
- Extract performer name, aliases
- Download profile images
- Retrieve birthdate, ethnicity, nationality
- Physical attributes (height, measurements, hair/eye color)
- Biography text
### ✅ Search Functionality
- Search scenes by title
- Search performers by name
- Get search results with thumbnails
## Architecture
The Adult Empire scraper is implemented in `/internal/scraper/adultemp/` with the following components:
### Files
1. **`types.go`** - Data structures for scraped content
2. **`client.go`** - HTTP client with cookie/session management
3. **`xpath.go`** - XPath parsing utilities for HTML extraction
4. **`scraper.go`** - Main scraper implementation
### Components
```
┌─────────────────┐
│   Scraper API   │  - ScrapeSceneByURL()
│                 │  - ScrapePerformerByURL()
│                 │  - SearchScenesByName()
│                 │  - SearchPerformersByName()
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   HTTP Client   │  - Cookie jar for sessions
│                 │  - Age verification
│                 │  - Auth token support
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  XPath Parser   │  - Extract data from HTML
│                 │  - Parse dates, heights
│                 │  - Clean text content
└─────────────────┘
```
## Usage
### Authentication (Optional)
For full access to Adult Empire content, you can set an authentication token:
```go
scraper, err := adultemp.NewScraper()
if err != nil {
	log.Fatal(err)
}
// Optional: Set your Adult Empire session token
scraper.SetAuthToken("your-etoken-here")
```
**Getting your etoken:**
1. Log into adultdvdempire.com
2. Open browser DevTools (F12)
3. Go to Application → Cookies → adultdvdempire.com
4. Copy the value of the `etoken` cookie
### Scrape a Scene by URL
```go
ctx := context.Background()
sceneData, err := scraper.ScrapeSceneByURL(ctx, "https://www.adultdvdempire.com/12345/scene-name")
if err != nil {
	log.Fatal(err)
}
// Convert to Goondex model
scene := scraper.ConvertSceneToModel(sceneData)
// Save to database
// db.Scenes.Create(scene)
```
### Search for Scenes
```go
results, err := scraper.SearchScenesByName(ctx, "scene title")
if err != nil {
	log.Fatal(err)
}
for _, result := range results {
	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("URL: %s\n", result.URL)
	fmt.Printf("Image: %s\n", result.Image)
}
```
### Scrape a Performer
```go
performerData, err := scraper.ScrapePerformerByURL(ctx, "https://www.adultdvdempire.com/performer/12345/name")
if err != nil {
	log.Fatal(err)
}
// Convert to Goondex model
performer := scraper.ConvertPerformerToModel(performerData)
```
### Search for Performers
```go
results, err := scraper.SearchPerformersByName(ctx, "performer name")
if err != nil {
	log.Fatal(err)
}
for _, result := range results {
	fmt.Printf("Name: %s\n", result.Title)
	fmt.Printf("URL: %s\n", result.URL)
}
```
## Data Structures
### SceneData
```go
type SceneData struct {
	Title       string   // Scene title
	URL         string   // Adult Empire URL
	Date        string   // Release date
	Studio      string   // Studio name
	Image       string   // Cover image URL
	Description string   // Synopsis/description
	Performers  []string // List of performer names
	Tags        []string // Categories/tags
	Code        string   // Scene code/SKU
	Director    string   // Director name
}
```
### PerformerData
```go
type PerformerData struct {
	Name         string   // Performer name
	URL          string   // Adult Empire URL
	Image        string   // Profile image URL
	Birthdate    string   // Date of birth
	Ethnicity    string   // Ethnicity
	Country      string   // Country of origin
	Height       string   // Height (converted to cm)
	Measurements string   // Body measurements
	HairColor    string   // Hair color
	EyeColor     string   // Eye color
	Biography    string   // Bio text
	Aliases      []string // Alternative names
}
```
## XPath Selectors
The scraper uses XPath to extract data from Adult Empire pages. Key selectors include:
### Scene Selectors
- **Title**: `//h1[@class='title']`
- **Date**: `//div[@class='release-date']/text()`
- **Studio**: `//a[contains(@href, '/studio/')]/text()`
- **Image**: `//div[@class='item-image']//img/@src`
- **Description**: `//div[@class='synopsis']`
- **Performers**: `//a[contains(@href, '/performer/')]/text()`
- **Tags**: `//a[contains(@href, '/category/')]/text()`
### Performer Selectors
- **Name**: `//h1[@class='performer-name']`
- **Image**: `//div[@class='performer-image']//img/@src`
- **Birthdate**: `//span[@class='birthdate']/text()`
- **Height**: `//span[@class='height']/text()`
- **Bio**: `//div[@class='bio']`
**Note**: Adult Empire may change their HTML structure. If scraping fails, XPath selectors in `scraper.go` may need updates.
## Utilities
### Date Parsing
```go
dateStr := ParseDate("Jan 15, 2024") // Handles various formats
```
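To illustrate what "handles various formats" can mean in practice, here is a minimal sketch that tries a list of layouts in order and normalizes to ISO 8601. The function name `parseDateSketch`, the layout list, and the ISO output format are all assumptions; the real `ParseDate` may accept other formats or return something different.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// candidateLayouts is an assumed set of input formats, tried in order.
var candidateLayouts = []string{
	"Jan 2, 2006",
	"January 2, 2006",
	"01/02/2006",
	"2006-01-02",
}

// parseDateSketch returns the date as YYYY-MM-DD, or false if no layout matches.
func parseDateSketch(s string) (string, bool) {
	s = strings.TrimSpace(s)
	for _, layout := range candidateLayouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t.Format("2006-01-02"), true
		}
	}
	return "", false
}

func main() {
	if d, ok := parseDateSketch("Jan 15, 2024"); ok {
		fmt.Println(d) // prints 2024-01-15
	}
}
```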
### Height Conversion
```go
heightCm := ParseHeight("5'6\"") // Converts feet/inches to cm (168)
```
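The conversion itself is simple arithmetic: total inches times 2.54, rounded to the nearest centimeter. The sketch below shows one way to do it; `parseHeightCM` and its regular expression are illustrative assumptions, not the actual `ParseHeight` implementation.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
)

// feetInches matches strings such as 5'6", 5' 6" or 6'. The pattern is an
// assumption about the input format, not taken from the Goondex source.
var feetInches = regexp.MustCompile(`^\s*(\d+)\s*'\s*(\d+)?\s*"?\s*$`)

// parseHeightCM converts a feet/inches string to whole centimeters.
func parseHeightCM(s string) (int, bool) {
	m := feetInches.FindStringSubmatch(s)
	if m == nil {
		return 0, false
	}
	feet, _ := strconv.Atoi(m[1])
	inches := 0
	if m[2] != "" {
		inches, _ = strconv.Atoi(m[2])
	}
	totalInches := float64(feet*12 + inches)
	return int(math.Round(totalInches * 2.54)), true
}

func main() {
	if cm, ok := parseHeightCM(`5'6"`); ok {
		fmt.Println(cm) // 66 in * 2.54 = 167.64, prints 168
	}
}
```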
### Text Cleaning
```go
cleanedText := CleanText(rawHTML) // Removes "Show More/Less" and extra whitespace
```
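A minimal sketch of this kind of cleanup, assuming the expand/collapse labels appear literally as "Show More" and "Show Less" in the extracted text; `cleanTextSketch` is a hypothetical name and the real `CleanText` may do more.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanTextSketch strips the assumed "Show More"/"Show Less" labels and
// collapses every run of whitespace (including newlines) into one space.
func cleanTextSketch(s string) string {
	for _, marker := range []string{"Show More", "Show Less"} {
		s = strings.ReplaceAll(s, marker, " ")
	}
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Printf("%q\n", cleanTextSketch("  A great\n\n scene.  Show More "))
}
```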
### URL Normalization
```go
fullURL := ExtractURL("/path/to/scene", "https://www.adultdvdempire.com")
// Returns: "https://www.adultdvdempire.com/path/to/scene"
```
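The standard library's `net/url` package can already do this kind of resolution. The sketch below is one possible approach, not necessarily how `ExtractURL` is written; `resolveURL` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveURL resolves ref against base. Relative paths are joined to the
// base host, while absolute URLs pass through unchanged.
func resolveURL(ref, base string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	full, err := resolveURL("/path/to/scene", "https://www.adultdvdempire.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(full) // prints https://www.adultdvdempire.com/path/to/scene
}
```

Using `ResolveReference` rather than string concatenation also covers protocol-relative image URLs (`//host/img.jpg`), which inherit the base scheme.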
## Integration with Goondex
The Adult Empire scraper plugs into the existing Goondex architecture in four steps:
1. **Scrape** data from Adult Empire using the scraper
2. **Convert** to Goondex models using converter functions
3. **Save** to the database using existing stores
4. **Display** in the web UI with cover art and metadata
### Example Workflow
```go
// 1. Search for a scene
results, err := scraper.SearchScenesByName(ctx, "scene name")
if err != nil {
	log.Fatal(err)
}
if len(results) == 0 {
	log.Fatal("no results found")
}
// 2. Pick the first result and scrape full details
sceneData, err := scraper.ScrapeSceneByURL(ctx, results[0].URL)
if err != nil {
	log.Fatal(err)
}
// 3. Convert to Goondex model
scene := scraper.ConvertSceneToModel(sceneData)
// 4. Save to database
sceneStore := db.NewSceneStore(database)
sceneStore.Create(scene)
// 5. The scene now appears in the web UI
```
## Future Enhancements
Planned improvements for the Adult Empire scraper:
- **Bulk Import** - Import entire studios or series
- **Auto-Update** - Periodically refresh metadata
- **Image Caching** - Download and cache cover art locally
- **Duplicate Detection** - Avoid importing the same scene twice
- **Advanced Search** - Filter by studio, date range, tags
- **Web UI Integration** - Search and import from the dashboard
## Troubleshooting
### "Failed to parse HTML"
- The Adult Empire page structure may have changed
- Update XPath selectors in `scraper.go`
### "Request failed: 403 Forbidden"
- Adult Empire may be blocking automated requests
- Set a valid `etoken` session cookie with `SetAuthToken()` (see Authentication above)
### "No results found"
- Check that the search query is correct
- Adult Empire may list the title or name under a different spelling
- Try broader search terms
### Scene/Performer data incomplete
- Some fields may not be present on all pages
- XPath selectors may need adjustment
- Check the raw HTML to verify field availability
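When diagnosing incomplete results, it can help to list exactly which fields came back empty before digging into the HTML. The sketch below copies the `SceneData` struct from the Data Structures section so that it compiles on its own; `missingFields` is a hypothetical helper, not part of the scraper package.

```go
package main

import (
	"fmt"
	"strings"
)

// SceneData mirrors the struct documented in the Data Structures section.
type SceneData struct {
	Title       string
	URL         string
	Date        string
	Studio      string
	Image       string
	Description string
	Performers  []string
	Tags        []string
	Code        string
	Director    string
}

// missingFields returns the names of fields that are empty after scraping.
func missingFields(s SceneData) []string {
	var missing []string
	check := func(name, value string) {
		if strings.TrimSpace(value) == "" {
			missing = append(missing, name)
		}
	}
	check("Title", s.Title)
	check("URL", s.URL)
	check("Date", s.Date)
	check("Studio", s.Studio)
	check("Image", s.Image)
	check("Description", s.Description)
	if len(s.Performers) == 0 {
		missing = append(missing, "Performers")
	}
	if len(s.Tags) == 0 {
		missing = append(missing, "Tags")
	}
	check("Code", s.Code)
	check("Director", s.Director)
	return missing
}

func main() {
	scene := SceneData{Title: "Example", URL: "https://www.adultdvdempire.com/12345/scene-name"}
	fmt.Println("missing:", strings.Join(missingFields(scene), ", "))
}
```

A field that is missing on every page points to a stale selector. One that is missing only on some pages is more likely absent from the source HTML.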
## Comparison with TPDB Scraper
| Feature | TPDB | Adult Empire |
|---------|------|--------------|
| **API** | ✅ Official JSON API | ❌ HTML scraping |
| **Auth** | ✅ API key | ⚠️ Session cookie |
| **Rate Limits** | ✅ Documented | ⚠️ Unknown |
| **Stability** | ✅ Stable schema | ⚠️ May change |
| **Coverage** | ✅ Comprehensive | ✅ Comprehensive |
| **Images** | ✅ High quality | ✅ High quality |
**Recommendation**: Use TPDB as the primary source and Adult Empire as a fallback or supplemental source.
## Contributing
To improve Adult Empire scraping:
1. Update XPath selectors if Adult Empire changes their HTML
2. Add support for additional fields
3. Improve date/height parsing
4. Add more robust error handling
## Version History
- **v0.1.0-dev5** (2025-11-17): Documentation refresh for TPDB bulk-import release
  - Updated version metadata and changelog references
  - Clarified rebuild steps for the CLI additions
- **v0.1.0-dev4** (2025-11-16): Initial Adult Empire scraper implementation
  - HTTP client with cookie support
  - XPath parsing utilities
  - Scene and performer scraping
  - Search functionality
  - Model conversion utilities
---
**Maintainer**: Goondex Team