# Adult Empire Scraper Integration

**Version**: v0.1.0-dev5
**Last Updated**: 2025-11-17

## Overview

Goondex includes an Adult Empire scraper modeled on the Stash app's scraping architecture. It fetches scene and performer metadata, cover art, and search results directly from Adult Empire (adultdvdempire.com).

## Features

### ✅ Scene Scraping
- Extract scene title, description, release date
- Download cover art/thumbnails
- Retrieve studio information
- Get performer lists
- Extract tags/categories
- Extract scene code/SKU
- Extract director information

### ✅ Performer Scraping
- Extract performer name, aliases
- Download profile images
- Retrieve birthdate, ethnicity, nationality
- Extract physical attributes (height, measurements, hair/eye color)
- Extract biography text

### ✅ Search Functionality
- Search scenes by title
- Search performers by name
- Get search results with thumbnails

## Architecture

The Adult Empire scraper is implemented in `/internal/scraper/adultemp/` with the following components:

### Files

1. **`types.go`** - Data structures for scraped content
2. **`client.go`** - HTTP client with cookie/session management
3. **`xpath.go`** - XPath parsing utilities for HTML extraction
4. **`scraper.go`** - Main scraper implementation

### Components

```
┌─────────────────┐
│  Scraper API    │  - ScrapeSceneByURL()
│                 │  - ScrapePerformerByURL()
│                 │  - SearchScenesByName()
│                 │  - SearchPerformersByName()
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  HTTP Client    │  - Cookie jar for sessions
│                 │  - Age verification
│                 │  - Auth token support
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  XPath Parser   │  - Extract data from HTML
│                 │  - Parse dates, heights
│                 │  - Clean text content
└─────────────────┘
```

## Usage

### Authentication (Optional)

For full access to Adult Empire content, you can set an authentication token:

```go
scraper, err := adultemp.NewScraper()
if err != nil {
    log.Fatal(err)
}

// Optional: Set your Adult Empire session token
scraper.SetAuthToken("your-etoken-here")
```

**Getting your etoken:**
1. Log into adultdvdempire.com
2. Open browser DevTools (F12)
3. Go to Application → Cookies → adultdvdempire.com
4. Copy the value of the `etoken` cookie
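
To illustrate what setting the token amounts to at the HTTP level, here is a minimal sketch using only the Go standard library. It assumes the token travels as a cookie named `etoken` on the site's host; `newClientWithToken` and `cookieValue` are hypothetical helpers written for this example, not functions from `client.go`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/url"
)

// newClientWithToken is a hypothetical helper (not the Goondex implementation).
// It returns an http.Client whose cookie jar already holds an "etoken" cookie
// for the host in baseURL.
func newClientWithToken(baseURL, token string) (*http.Client, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, err
	}
	// No Domain is set, so the cookie is only sent back to this exact host.
	jar.SetCookies(u, []*http.Cookie{{Name: "etoken", Value: token, Path: "/"}})
	return &http.Client{Jar: jar}, nil
}

// cookieValue reports the value of the named cookie that the client would
// send to rawURL, or "" if there is none.
func cookieValue(client *http.Client, rawURL, name string) string {
	u, err := url.Parse(rawURL)
	if err != nil || client.Jar == nil {
		return ""
	}
	for _, c := range client.Jar.Cookies(u) {
		if c.Name == name {
			return c.Value
		}
	}
	return ""
}

func main() {
	client, err := newClientWithToken("https://www.adultdvdempire.com", "your-etoken-here")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cookieValue(client, "https://www.adultdvdempire.com/12345/scene-name", "etoken"))
}
```

Because the cookie is scoped to the host, requests the same client makes to other domains do not carry the token.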

### Scrape a Scene by URL

```go
ctx := context.Background()
sceneData, err := scraper.ScrapeSceneByURL(ctx, "https://www.adultdvdempire.com/12345/scene-name")
if err != nil {
    log.Fatal(err)
}

// Convert to Goondex model
scene := scraper.ConvertSceneToModel(sceneData)

// Save to database
// db.Scenes.Create(scene)
```

### Search for Scenes

```go
results, err := scraper.SearchScenesByName(ctx, "scene title")
if err != nil {
    log.Fatal(err)
}

for _, result := range results {
    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("URL: %s\n", result.URL)
    fmt.Printf("Image: %s\n", result.Image)
}
```

### Scrape a Performer

```go
performerData, err := scraper.ScrapePerformerByURL(ctx, "https://www.adultdvdempire.com/performer/12345/name")
if err != nil {
    log.Fatal(err)
}

// Convert to Goondex model
performer := scraper.ConvertPerformerToModel(performerData)
```

### Search for Performers

```go
results, err := scraper.SearchPerformersByName(ctx, "performer name")
if err != nil {
    log.Fatal(err)
}

for _, result := range results {
    fmt.Printf("Name: %s\n", result.Title)
    fmt.Printf("URL: %s\n", result.URL)
}
```

## Data Structures

### SceneData

```go
type SceneData struct {
    Title       string      // Scene title
    URL         string      // Adult Empire URL
    Date        string      // Release date
    Studio      string      // Studio name
    Image       string      // Cover image URL
    Description string      // Synopsis/description
    Performers  []string    // List of performer names
    Tags        []string    // Categories/tags
    Code        string      // Scene code/SKU
    Director    string      // Director name
}
```

### PerformerData

```go
type PerformerData struct {
    Name         string      // Performer name
    URL          string      // Adult Empire URL
    Image        string      // Profile image URL
    Birthdate    string      // Date of birth
    Ethnicity    string      // Ethnicity
    Country      string      // Country of origin
    Height       string      // Height (converted to cm)
    Measurements string      // Body measurements
    HairColor    string      // Hair color
    EyeColor     string      // Eye color
    Biography    string      // Bio text
    Aliases      []string    // Alternative names
}
```

## XPath Selectors

The scraper uses XPath to extract data from Adult Empire pages. Key selectors include:

### Scene Selectors
- **Title**: `//h1[@class='title']`
- **Date**: `//div[@class='release-date']/text()`
- **Studio**: `//a[contains(@href, '/studio/')]/text()`
- **Image**: `//div[@class='item-image']//img/@src`
- **Description**: `//div[@class='synopsis']`
- **Performers**: `//a[contains(@href, '/performer/')]/text()`
- **Tags**: `//a[contains(@href, '/category/')]/text()`

### Performer Selectors
- **Name**: `//h1[@class='performer-name']`
- **Image**: `//div[@class='performer-image']//img/@src`
- **Birthdate**: `//span[@class='birthdate']/text()`
- **Height**: `//span[@class='height']/text()`
- **Bio**: `//div[@class='bio']`

**Note**: Adult Empire may change their HTML structure. If scraping fails, XPath selectors in `scraper.go` may need updates.

## Utilities

### Date Parsing

```go
dateStr := ParseDate("Jan 15, 2024")  // Handles various formats
```
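
One common way to handle several date formats is to try a list of layouts in order. The sketch below shows that approach; the layout list, the ISO `YYYY-MM-DD` output, and the name `parseDateSketch` are assumptions for illustration and may differ from what `ParseDate` actually does.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// dateLayouts lists the input formats this sketch tries, in order.
// The list is an assumption; the real scraper may accept other formats.
var dateLayouts = []string{
	"Jan 2, 2006",
	"January 2, 2006",
	"01/02/2006",
	"2006-01-02",
}

// parseDateSketch returns the date as YYYY-MM-DD, or "" if no layout matches.
func parseDateSketch(s string) string {
	s = strings.TrimSpace(s)
	for _, layout := range dateLayouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t.Format("2006-01-02")
		}
	}
	return ""
}

func main() {
	fmt.Println(parseDateSketch("Jan 15, 2024"))
}
```

Returning an empty string for unparseable input lets the caller leave the date field blank rather than fail the whole scrape.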

### Height Conversion

```go
heightCm := ParseHeight("5'6\"")  // Converts feet/inches to cm (168)
```
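
The conversion itself is simple arithmetic: total inches times 2.54, rounded to the nearest centimeter. The following sketch assumes input in feet/inches notation such as `5'6"` and returns an integer; `parseHeightSketch` is a hypothetical name, and the real `ParseHeight` may have a different signature or accept more formats.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
)

// feetInches matches values like 5'6" or 5' 11; the inches part is optional.
var feetInches = regexp.MustCompile(`(\d+)\s*'\s*(\d+)?`)

// parseHeightSketch converts a feet/inches string to whole centimeters.
// The boolean is false when the input does not look like feet/inches.
func parseHeightSketch(s string) (int, bool) {
	m := feetInches.FindStringSubmatch(s)
	if m == nil {
		return 0, false
	}
	feet, _ := strconv.Atoi(m[1]) // the regexp guarantees digits
	inches := 0
	if m[2] != "" {
		inches, _ = strconv.Atoi(m[2])
	}
	totalInches := float64(feet*12 + inches)
	return int(math.Round(totalInches * 2.54)), true
}

func main() {
	cm, ok := parseHeightSketch("5'6\"")
	fmt.Println(cm, ok)
}
```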

### Text Cleaning

```go
cleanedText := CleanText(rawHTML)  // Removes "Show More/Less" and extra whitespace
```
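
A minimal sketch of this kind of cleanup, assuming the toggles appear as the literal strings "Show More" and "Show Less" in the extracted text. `cleanTextSketch` is a hypothetical stand-in; it does not strip HTML tags, which the real `CleanText` may or may not do.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanTextSketch removes the "Show More"/"Show Less" toggle labels and
// collapses every run of whitespace (including newlines) into one space.
func cleanTextSketch(s string) string {
	for _, marker := range []string{"Show More", "Show Less"} {
		s = strings.ReplaceAll(s, marker, "")
	}
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Printf("%q\n", cleanTextSketch("  A  long\n synopsis. Show More "))
}
```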

### URL Normalization

```go
fullURL := ExtractURL("/path/to/scene", "https://www.adultdvdempire.com")
// Returns: "https://www.adultdvdempire.com/path/to/scene"
```
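
Resolving a relative link against a base URL can be done with the standard `net/url` package. The sketch below is one possible approach, not necessarily how `ExtractURL` is written; `extractURLSketch` is a hypothetical name, and it returns an error where the real helper may not.

```go
package main

import (
	"fmt"
	"net/url"
)

// extractURLSketch resolves href against base. Absolute hrefs are returned
// unchanged, and protocol-relative hrefs (//host/path) inherit the base scheme.
func extractURLSketch(href, base string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(ref).String(), nil
}

func main() {
	full, err := extractURLSketch("/path/to/scene", "https://www.adultdvdempire.com")
	fmt.Println(full, err)
}
```

Using `ResolveReference` instead of string concatenation avoids doubled slashes and handles image links that already point at another host.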

## Integration with Goondex

The Adult Empire scraper plugs into the existing Goondex architecture in four steps:

1. **Scrape** data from Adult Empire using the scraper
2. **Convert** to Goondex models using converter functions
3. **Save** to the database using existing stores
4. **Display** in the web UI with cover art and metadata

### Example Workflow

```go
// 1. Search for a scene
results, err := scraper.SearchScenesByName(ctx, "scene name")
if err != nil {
    log.Fatal(err)
}
if len(results) == 0 {
    log.Fatal("no results found")
}

// 2. Pick the first result and scrape full details
sceneData, err := scraper.ScrapeSceneByURL(ctx, results[0].URL)
if err != nil {
    log.Fatal(err)
}

// 3. Convert to Goondex model
scene := scraper.ConvertSceneToModel(sceneData)

// 4. Save to database (handle any error Create returns)
sceneStore := db.NewSceneStore(database)
sceneStore.Create(scene)

// 5. The scene now appears in the web UI
```

## Future Enhancements

Planned improvements for the Adult Empire scraper:

- ⏳ **Bulk Import** - Import entire studios or series
- ⏳ **Auto-Update** - Periodically refresh metadata
- ⏳ **Image Caching** - Download and cache cover art locally
- ⏳ **Duplicate Detection** - Avoid importing the same scene twice
- ⏳ **Advanced Search** - Filter by studio, date range, tags
- ⏳ **Web UI Integration** - Search and import from the dashboard

## Troubleshooting

### "Failed to parse HTML"
- The Adult Empire page structure may have changed
- Update XPath selectors in `scraper.go`

### "Request failed: 403 Forbidden"
- Adult Empire may be blocking automated requests
- Set a valid `etoken` value with `SetAuthToken()` (see Authentication above)

### "No results found"
- Check that the search query is correct
- Adult Empire may list the title or name under a different spelling
- Try broader search terms

### Scene/Performer data incomplete
- Some fields may not be present on all pages
- XPath selectors may need adjustment
- Check the raw HTML to verify field availability
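
When diagnosing incomplete data, it can help to list exactly which fields came back empty before inspecting the HTML. The sketch below repeats the `SceneData` struct from this document so that it compiles on its own; `missingSceneFields` is a hypothetical helper, not part of the scraper package.

```go
package main

import (
	"fmt"
	"strings"
)

// SceneData mirrors the struct documented under "Data Structures".
type SceneData struct {
	Title       string
	URL         string
	Date        string
	Studio      string
	Image       string
	Description string
	Performers  []string
	Tags        []string
	Code        string
	Director    string
}

// missingSceneFields returns the names of fields that are empty or contain
// only whitespace, in struct order.
func missingSceneFields(s SceneData) []string {
	var missing []string
	check := func(name, value string) {
		if strings.TrimSpace(value) == "" {
			missing = append(missing, name)
		}
	}
	check("Title", s.Title)
	check("URL", s.URL)
	check("Date", s.Date)
	check("Studio", s.Studio)
	check("Image", s.Image)
	check("Description", s.Description)
	if len(s.Performers) == 0 {
		missing = append(missing, "Performers")
	}
	if len(s.Tags) == 0 {
		missing = append(missing, "Tags")
	}
	check("Code", s.Code)
	check("Director", s.Director)
	return missing
}

func main() {
	scene := SceneData{Title: "Example", URL: "https://www.adultdvdempire.com/12345/scene-name"}
	fmt.Println(missingSceneFields(scene))
}
```

Each reported name points at the selector to compare against the page's raw HTML.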

## Comparison with TPDB Scraper

| Feature | TPDB | Adult Empire |
|---------|------|--------------|
| **API** | ✅ Official JSON API | ❌ HTML scraping |
| **Auth** | ✅ API key | ⚠️ Session cookie |
| **Rate Limits** | ✅ Documented | ⚠️ Unknown |
| **Stability** | ✅ Stable schema | ⚠️ May change |
| **Coverage** | ✅ Comprehensive | ✅ Comprehensive |
| **Images** | ✅ High quality | ✅ High quality |

**Recommendation**: Use TPDB as the primary source and Adult Empire as a fallback or supplemental source.
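
The primary-then-fallback pattern can be sketched as an ordered list of sources that is tried until one returns a result. Everything in this example is hypothetical: the `SceneSource` interface, the `SceneInfo` type, and the stub sources stand in for the real TPDB and Adult Empire scrapers, whose APIs differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// SceneInfo is a trimmed-down, hypothetical result type for this sketch.
type SceneInfo struct {
	Title  string
	Source string
}

// SceneSource is a hypothetical interface both scrapers could be adapted to.
type SceneSource interface {
	Name() string
	Lookup(ctx context.Context, query string) (*SceneInfo, error)
}

var errNotFound = errors.New("scene not found")

// lookupWithFallback tries each source in order and returns the first hit.
// If every source fails, it returns the last error, tagged with its source.
func lookupWithFallback(ctx context.Context, query string, sources ...SceneSource) (*SceneInfo, error) {
	lastErr := errNotFound
	for _, src := range sources {
		info, err := src.Lookup(ctx, query)
		if err == nil && info != nil {
			return info, nil
		}
		if err != nil {
			lastErr = fmt.Errorf("%s: %w", src.Name(), err)
		}
	}
	return nil, lastErr
}

// stubSource is an in-memory stand-in for a real scraper.
type stubSource struct {
	name   string
	scenes map[string]string
}

func (s stubSource) Name() string { return s.name }

func (s stubSource) Lookup(_ context.Context, query string) (*SceneInfo, error) {
	title, ok := s.scenes[query]
	if !ok {
		return nil, errNotFound
	}
	return &SceneInfo{Title: title, Source: s.name}, nil
}

// demoLookup wires two stubs together, primary first, and reports which
// source answered the query.
func demoLookup(query string) (string, error) {
	tpdb := stubSource{name: "tpdb", scenes: map[string]string{"a": "Scene A"}}
	ae := stubSource{name: "adultempire", scenes: map[string]string{"a": "Scene A", "b": "Scene B"}}
	info, err := lookupWithFallback(context.Background(), query, tpdb, ae)
	if err != nil {
		return "", err
	}
	return info.Source, nil
}

func main() {
	for _, q := range []string{"a", "b", "c"} {
		src, err := demoLookup(q)
		fmt.Println(q, src, err)
	}
}
```

The order of the arguments encodes the priority, so making Adult Empire supplemental rather than primary only requires passing it second.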

## Contributing

To improve Adult Empire scraping:

1. Update XPath selectors if Adult Empire changes their HTML
2. Add support for additional fields
3. Improve date/height parsing
4. Add more robust error handling

## Version History

- **v0.1.0-dev5** (2025-11-17): Documentation refresh for TPDB bulk-import release
  - Updated version metadata and changelog references
  - Clarified rebuild steps for the CLI additions
- **v0.1.0-dev4** (2025-11-16): Initial Adult Empire scraper implementation
  - HTTP client with cookie support
  - XPath parsing utilities
  - Scene and performer scraping
  - Search functionality
  - Model conversion utilities

---

**Maintainer**: Goondex Team