# Adult Empire Scraper Integration

**Version**: v0.1.0-dev5
**Last Updated**: 2025-11-17

## Overview

Goondex includes an Adult Empire scraper modeled on the Stash app's scraping architecture. It fetches scene metadata, cover art, and performer information directly from Adult Empire (adultdvdempire.com).

## Features

### ✅ Scene Scraping
- Extract scene title, description, release date
- Download cover art/thumbnails
- Retrieve studio information
- Get performer lists
- Extract tags/categories
- Scene code/SKU
- Director information

### ✅ Performer Scraping
- Extract performer name, aliases
- Download profile images
- Retrieve birthdate, ethnicity, nationality
- Physical attributes (height, measurements, hair/eye color)
- Biography text

### ✅ Search Functionality
- Search scenes by title
- Search performers by name
- Get search results with thumbnails

## Architecture

The Adult Empire scraper is implemented in `/internal/scraper/adultemp/` with the following components:

### Files

1. **`types.go`** - Data structures for scraped content
2. **`client.go`** - HTTP client with cookie/session management
3. **`xpath.go`** - XPath parsing utilities for HTML extraction
4. **`scraper.go`** - Main scraper implementation

### Components

```
┌─────────────────┐
│   Scraper API   │  - ScrapeSceneByURL()
│                 │  - ScrapePerformerByURL()
│                 │  - SearchScenesByName()
│                 │  - SearchPerformersByName()
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   HTTP Client   │  - Cookie jar for sessions
│                 │  - Age verification
│                 │  - Auth token support
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  XPath Parser   │  - Extract data from HTML
│                 │  - Parse dates, heights
│                 │  - Clean text content
└─────────────────┘
```

## Usage

### Authentication (Optional)

For full access to Adult Empire content, you can set an authentication token:

```go
scraper, err := adultemp.NewScraper()
if err != nil {
	log.Fatal(err)
}

// Optional: Set your Adult Empire session token
scraper.SetAuthToken("your-etoken-here")
```

**Getting your etoken:**
1. Log into adultdvdempire.com
2. Open browser DevTools (F12)
3. Go to Application → Cookies → adultdvdempire.com
4. Copy the value of the `etoken` cookie
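
How `SetAuthToken` stores the token is not shown in this document. One plausible approach, sketched here with Go's standard `net/http/cookiejar`, is to place the `etoken` value in the client's cookie jar so that it is sent with every request to the site. The helper names `newClientWithToken` and `tokenFor` are hypothetical and are not part of the `adultemp` package.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/url"
)

// siteURL is the host the cookie is scoped to.
const siteURL = "https://www.adultdvdempire.com"

// newClientWithToken returns an HTTP client whose cookie jar already holds
// the etoken session cookie. Hypothetical helper, for illustration only.
func newClientWithToken(token string) (*http.Client, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	u, err := url.Parse(siteURL)
	if err != nil {
		return nil, err
	}
	jar.SetCookies(u, []*http.Cookie{{Name: "etoken", Value: token, Path: "/"}})
	return &http.Client{Jar: jar}, nil
}

// tokenFor reports the etoken value the client would send to rawURL,
// or "" if the jar holds no such cookie for that URL.
func tokenFor(client *http.Client, rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	for _, c := range client.Jar.Cookies(u) {
		if c.Name == "etoken" {
			return c.Value
		}
	}
	return ""
}

func main() {
	client, err := newClientWithToken("your-etoken-here")
	if err != nil {
		panic(err)
	}
	fmt.Println(tokenFor(client, siteURL+"/12345/scene-name"))
}
```

Because the cookie is set without a `Domain` attribute, the jar treats it as host-only and does not send it to other hosts.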

### Scrape a Scene by URL

```go
ctx := context.Background()
sceneData, err := scraper.ScrapeSceneByURL(ctx, "https://www.adultdvdempire.com/12345/scene-name")
if err != nil {
	log.Fatal(err)
}

// Convert to Goondex model
scene := scraper.ConvertSceneToModel(sceneData)

// Save to database
// db.Scenes.Create(scene)
```

### Search for Scenes

```go
results, err := scraper.SearchScenesByName(ctx, "scene title")
if err != nil {
	log.Fatal(err)
}

for _, result := range results {
	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("URL: %s\n", result.URL)
	fmt.Printf("Image: %s\n", result.Image)
}
```

### Scrape a Performer

```go
performerData, err := scraper.ScrapePerformerByURL(ctx, "https://www.adultdvdempire.com/performer/12345/name")
if err != nil {
	log.Fatal(err)
}

// Convert to Goondex model
performer := scraper.ConvertPerformerToModel(performerData)
```

### Search for Performers

```go
results, err := scraper.SearchPerformersByName(ctx, "performer name")
if err != nil {
	log.Fatal(err)
}

for _, result := range results {
	fmt.Printf("Name: %s\n", result.Title)
	fmt.Printf("URL: %s\n", result.URL)
}
```

## Data Structures

### SceneData

```go
type SceneData struct {
	Title       string   // Scene title
	URL         string   // Adult Empire URL
	Date        string   // Release date
	Studio      string   // Studio name
	Image       string   // Cover image URL
	Description string   // Synopsis/description
	Performers  []string // List of performer names
	Tags        []string // Categories/tags
	Code        string   // Scene code/SKU
	Director    string   // Director name
}
```

### PerformerData

```go
type PerformerData struct {
	Name         string   // Performer name
	URL          string   // Adult Empire URL
	Image        string   // Profile image URL
	Birthdate    string   // Date of birth
	Ethnicity    string   // Ethnicity
	Country      string   // Country of origin
	Height       string   // Height (converted to cm)
	Measurements string   // Body measurements
	HairColor    string   // Hair color
	EyeColor     string   // Eye color
	Biography    string   // Bio text
	Aliases      []string // Alternative names
}
```

## XPath Selectors

The scraper uses XPath to extract data from Adult Empire pages. Key selectors include:

### Scene Selectors
- **Title**: `//h1[@class='title']`
- **Date**: `//div[@class='release-date']/text()`
- **Studio**: `//a[contains(@href, '/studio/')]/text()`
- **Image**: `//div[@class='item-image']//img/@src`
- **Description**: `//div[@class='synopsis']`
- **Performers**: `//a[contains(@href, '/performer/')]/text()`
- **Tags**: `//a[contains(@href, '/category/')]/text()`

### Performer Selectors
- **Name**: `//h1[@class='performer-name']`
- **Image**: `//div[@class='performer-image']//img/@src`
- **Birthdate**: `//span[@class='birthdate']/text()`
- **Height**: `//span[@class='height']/text()`
- **Bio**: `//div[@class='bio']`

**Note**: Adult Empire may change their HTML structure. If scraping fails, XPath selectors in `scraper.go` may need updates.

## Utilities

### Date Parsing

```go
dateStr := ParseDate("Jan 15, 2024") // Handles various formats
```
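
This document does not list the formats `ParseDate` accepts or the format it returns. A common way to handle several input formats in Go is to try a list of `time.Parse` layouts in order; the sketch here does that and returns an ISO `YYYY-MM-DD` string. The name `parseDateSketch`, the layout list, and the ISO output are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// dateLayouts lists the input formats to try, in order. The set is an
// assumption; the real ParseDate may accept different ones.
var dateLayouts = []string{
	"Jan 2, 2006",     // Jan 15, 2024
	"January 2, 2006", // January 15, 2024
	"01/02/2006",      // 01/15/2024
	"2006-01-02",      // 2024-01-15
}

// parseDateSketch returns the date as YYYY-MM-DD, or "" if no layout matches.
func parseDateSketch(s string) string {
	s = strings.TrimSpace(s)
	for _, layout := range dateLayouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t.Format("2006-01-02")
		}
	}
	return ""
}

func main() {
	fmt.Println(parseDateSketch("Jan 15, 2024"))
	fmt.Println(parseDateSketch(" January 5, 2023 "))
}
```

Returning an empty string for unrecognized input keeps the sketch close to the single-value call shown above; returning an error alongside the string would make failures easier to diagnose.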

### Height Conversion

```go
heightCm := ParseHeight("5'6\"") // Converts feet/inches to cm (168)
```
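
The arithmetic behind the `168` is 5 ft 6 in = 66 in, and 66 × 2.54 = 167.64 cm, which rounds to 168. The sketch here reproduces that calculation with a regular expression; the name `parseHeightSketch`, the accepted input shapes, and the `int` return type are assumptions, since the real `ParseHeight` signature is not shown in this document.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
)

// feetInches matches strings such as 5'6", 5' 6" or 6'.
var feetInches = regexp.MustCompile(`^\s*(\d+)'\s*(\d+)?"?\s*$`)

// parseHeightSketch converts a feet/inches string to whole centimetres.
// It returns 0 when the input does not look like a feet/inches height.
func parseHeightSketch(s string) int {
	m := feetInches.FindStringSubmatch(s)
	if m == nil {
		return 0
	}
	// The regexp guarantees digits, so Atoi can only fail on overflow.
	feet, _ := strconv.Atoi(m[1])
	inches := 0
	if m[2] != "" {
		inches, _ = strconv.Atoi(m[2])
	}
	return int(math.Round(float64(feet*12+inches) * 2.54))
}

func main() {
	fmt.Println(parseHeightSketch(`5'6"`))
	fmt.Println(parseHeightSketch(`6'`))
}
```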

### Text Cleaning

```go
cleanedText := CleanText(rawHTML) // Removes "Show More/Less" and extra whitespace
```
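
A minimal sketch of the two steps named in the comment, assuming the input is text that has already been extracted from the page. It removes the literal strings "Show More" and "Show Less" and collapses runs of whitespace. `cleanTextSketch` is a hypothetical name, and the real `CleanText` may do more, for example stripping HTML tags.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanTextSketch removes expand/collapse widget labels and normalizes
// whitespace to single spaces.
func cleanTextSketch(s string) string {
	for _, marker := range []string{"Show More", "Show Less"} {
		s = strings.ReplaceAll(s, marker, "")
	}
	// strings.Fields splits on any whitespace and drops empty pieces.
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Printf("%q\n", cleanTextSketch("  A short\n\tsynopsis.   Show More "))
}
```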

### URL Normalization

```go
fullURL := ExtractURL("/path/to/scene", "https://www.adultdvdempire.com")
// Returns: "https://www.adultdvdempire.com/path/to/scene"
```
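
One way to get this behavior is `net/url`'s `ResolveReference`, which resolves a relative path against a base URL and leaves absolute URLs unchanged. The sketch here is an assumption about how such a helper could work, not the actual `ExtractURL` code; unlike the call above, `extractURLSketch` (a hypothetical name) also returns an error.

```go
package main

import (
	"fmt"
	"net/url"
)

// extractURLSketch resolves ref against base. Absolute refs are returned
// as they are; relative refs inherit the scheme and host of base.
func extractURLSketch(ref, base string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	full, err := extractURLSketch("/path/to/scene", "https://www.adultdvdempire.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(full)
}
```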

## Integration with Goondex

The Adult Empire scraper plugs into the existing Goondex architecture in four steps:

1. **Scrape** data from Adult Empire using the scraper
2. **Convert** to Goondex models using converter functions
3. **Save** to the database using existing stores
4. **Display** in the web UI with cover art and metadata

### Example Workflow

```go
// 1. Search for a scene
results, err := scraper.SearchScenesByName(ctx, "scene name")
if err != nil {
	log.Fatal(err)
}
if len(results) == 0 {
	log.Fatal("no results found")
}

// 2. Pick the first result and scrape full details
sceneData, err := scraper.ScrapeSceneByURL(ctx, results[0].URL)
if err != nil {
	log.Fatal(err)
}

// 3. Convert to Goondex model
scene := scraper.ConvertSceneToModel(sceneData)

// 4. Save to database
sceneStore := db.NewSceneStore(database)
sceneStore.Create(scene)

// 5. The scene now appears in the web UI
```

## Future Enhancements

Planned improvements for the Adult Empire scraper:

- ⏳ **Bulk Import** - Import entire studios or series
- ⏳ **Auto-Update** - Periodically refresh metadata
- ⏳ **Image Caching** - Download and cache cover art locally
- ⏳ **Duplicate Detection** - Avoid importing the same scene twice
- ⏳ **Advanced Search** - Filter by studio, date range, tags
- ⏳ **Web UI Integration** - Search and import from the dashboard

## Troubleshooting

### "Failed to parse HTML"
- The Adult Empire page structure may have changed
- Update XPath selectors in `scraper.go`

### "Request failed: 403 Forbidden"
- Adult Empire may be blocking automated requests
- Set a valid `etoken` auth token with `SetAuthToken()` if you have not already

### "No results found"
- Check the search query for typos
- Adult Empire may spell the title or name differently
- Try broader search terms

### Scene/Performer data incomplete
- Some fields may not be present on all pages
- XPath selectors may need adjustment
- Check the raw HTML to verify field availability
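
When chasing incomplete data, it can help to list exactly which fields came back empty before opening the raw HTML. The sketch here does that for a copy of the `SceneData` struct from the Data Structures section; `missingSceneFields` is a hypothetical helper and does not exist in the scraper package.

```go
package main

import "fmt"

// SceneData mirrors the struct documented in the Data Structures section.
type SceneData struct {
	Title       string
	URL         string
	Date        string
	Studio      string
	Image       string
	Description string
	Performers  []string
	Tags        []string
	Code        string
	Director    string
}

// missingSceneFields returns the names of fields that are empty:
// string fields first, then the two slice fields.
func missingSceneFields(s SceneData) []string {
	var missing []string
	check := func(name, value string) {
		if value == "" {
			missing = append(missing, name)
		}
	}
	check("Title", s.Title)
	check("URL", s.URL)
	check("Date", s.Date)
	check("Studio", s.Studio)
	check("Image", s.Image)
	check("Description", s.Description)
	check("Code", s.Code)
	check("Director", s.Director)
	if len(s.Performers) == 0 {
		missing = append(missing, "Performers")
	}
	if len(s.Tags) == 0 {
		missing = append(missing, "Tags")
	}
	return missing
}

func main() {
	partial := SceneData{Title: "Example", URL: "https://www.adultdvdempire.com/12345/scene-name"}
	fmt.Println(missingSceneFields(partial))
}
```

Each reported field name points to the matching selector to re-check in `scraper.go`.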

## Comparison with TPDB Scraper

| Feature | TPDB | Adult Empire |
|---------|------|--------------|
| **API** | ✅ Official JSON API | ❌ HTML scraping |
| **Auth** | ✅ API key | ⚠️ Session cookie |
| **Rate Limits** | ✅ Documented | ⚠️ Unknown |
| **Stability** | ✅ Stable schema | ⚠️ May change |
| **Coverage** | ✅ Comprehensive | ✅ Comprehensive |
| **Images** | ✅ High quality | ✅ High quality |

**Recommendation**: Use TPDB as the primary source and Adult Empire as a fallback or supplemental source.
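
The primary-then-fallback order can be expressed as a small loop over sources. The sketch here uses in-memory stand-ins rather than the real scrapers; the `SceneSource` interface, `staticSource`, `findWithFallback`, and `lookup` are all hypothetical names, and neither scraper is documented as implementing such an interface.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// SceneSource is a hypothetical common interface over metadata sources.
type SceneSource interface {
	Name() string
	FindSceneURL(ctx context.Context, title string) (string, error)
}

var errNotFound = errors.New("scene not found")

// staticSource is an in-memory stand-in for a real scraper.
type staticSource struct {
	name   string
	scenes map[string]string // title -> URL
}

func (s staticSource) Name() string { return s.name }

func (s staticSource) FindSceneURL(_ context.Context, title string) (string, error) {
	if u, ok := s.scenes[title]; ok {
		return u, nil
	}
	return "", errNotFound
}

// findWithFallback queries sources in order and returns the name of the
// first source with a match, plus the URL it returned.
func findWithFallback(ctx context.Context, title string, sources ...SceneSource) (string, string, error) {
	err := errNotFound
	for _, s := range sources {
		u, e := s.FindSceneURL(ctx, title)
		if e == nil {
			return s.Name(), u, nil
		}
		err = fmt.Errorf("%s: %w", s.Name(), e)
	}
	return "", "", err
}

// lookup wires two stand-in sources together: TPDB first, Adult Empire second.
func lookup(title string) (string, error) {
	tpdb := staticSource{"TPDB", map[string]string{"Scene A": "https://example.com/tpdb/a"}}
	ae := staticSource{"Adult Empire", map[string]string{"Scene B": "https://example.com/ae/b"}}
	src, _, err := findWithFallback(context.Background(), title, tpdb, ae)
	return src, err
}

func main() {
	for _, title := range []string{"Scene A", "Scene B", "Scene C"} {
		src, err := lookup(title)
		fmt.Printf("%q -> source=%q err=%v\n", title, src, err)
	}
}
```

Only the error from the last source tried is returned, which keeps the sketch short; a fuller version could collect the errors from every source.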

## Contributing

To improve Adult Empire scraping:

1. Update XPath selectors if Adult Empire changes their HTML
2. Add support for additional fields
3. Improve date/height parsing
4. Add more robust error handling

## Version History

- **v0.1.0-dev5** (2025-11-17): Documentation refresh for TPDB bulk-import release
  - Updated version metadata and changelog references
  - Clarified rebuild steps for the CLI additions
- **v0.1.0-dev4** (2025-11-16): Initial Adult Empire scraper implementation
  - HTTP client with cookie support
  - XPath parsing utilities
  - Scene and performer scraping
  - Search functionality
  - Model conversion utilities

---

**Last Updated**: 2025-11-17
**Maintainer**: Goondex Team