# libnovel Project
Go web scraper for novelfire.net with TTS support via Kokoro-FastAPI.
## Architecture

```
scraper/
├── cmd/scraper/main.go              # Entry point: 'run' (one-shot) and 'serve' (HTTP server)
├── internal/
│   ├── orchestrator/orchestrator.go # Coordinates catalogue walk, metadata extraction, chapter scraping
│   ├── browser/                     # Browser client (content/scrape/cdp strategies) via Browserless
│   ├── novelfire/scraper.go         # novelfire.net specific scraping logic
│   ├── server/server.go             # HTTP API (POST /scrape, POST /scrape/book)
│   ├── writer/writer.go             # File writer (metadata.yaml, chapter .md files)
│   └── scraper/interfaces.go        # NovelScraper interface definition
└── static/books/                    # Output directory for scraped content
```
## Key Concepts

- Orchestrator: manages concurrency - catalogue streaming → per-book metadata goroutines → chapter worker pool
- Browser Client: 3 strategies (content/scrape/cdp) via Browserless Chrome container
- Writer: writes `metadata.yaml` and chapter markdown files to `static/books/{slug}/vol-0/1-50/`
- Server: HTTP API with async scrape jobs, UI for browsing books/chapters, chapter-text endpoint for TTS
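The orchestrator's channel + goroutine worker pool can be sketched as below. This is a standalone illustration of the pattern, not the actual `orchestrator.go` code; `scrapeChapter` and `runPool` are made-up names.

```go
package main

import (
	"fmt"
	"sync"
)

// scrapeChapter is a hypothetical stand-in for the real chapter scraper.
func scrapeChapter(url string) string {
	return "scraped:" + url
}

// runPool fans chapter URLs out to n workers over a channel and collects
// results, mirroring the channel + goroutine pattern in the orchestrator.
func runPool(urls []string, n int) []string {
	jobs := make(chan string)
	results := make(chan string)
	var wg sync.WaitGroup

	// Worker pool: each goroutine drains the jobs channel.
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- scrapeChapter(u)
			}
		}()
	}

	// Feed jobs, then close so workers terminate.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(len(runPool([]string{"c1", "c2", "c3"}, 2))) // prints: 3
}
```

In the real orchestrator the pool size comes from `SCRAPER_WORKERS` and each worker also honors context cancellation.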
## Commands

```sh
# Build
cd scraper && go build -o bin/scraper ./cmd/scraper

# One-shot scrape (full catalogue)
./bin/scraper run

# Single book
./bin/scraper run --url https://novelfire.net/book/xxx

# HTTP server
./bin/scraper serve

# Tests
cd scraper && go test ./...
```
## Environment Variables

| Variable | Description | Default |
|---|---|---|
| BROWSERLESS_URL | Browserless Chrome endpoint | http://localhost:3030 |
| BROWSERLESS_STRATEGY | `content` \| `scrape` \| `cdp` | `content` |
| SCRAPER_WORKERS | Chapter goroutines | NumCPU |
| SCRAPER_STATIC_ROOT | Output directory | ./static/books |
| SCRAPER_HTTP_ADDR | HTTP listen address | :8080 |
| KOKORO_URL | Kokoro TTS endpoint | http://localhost:8880 |
| KOKORO_VOICE | Default TTS voice | af_bella |
| LOG_LEVEL | `debug` \| `info` \| `warn` \| `error` | `info` |
## Docker

```sh
docker-compose up -d   # Starts browserless, kokoro, scraper
```
## Code Patterns

- Uses `log/slog` for structured logging
- Context-based cancellation throughout
- Worker pool pattern in orchestrator (channel + goroutines)
- Mutex for single async job (409 on concurrent scrape requests)
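The single-async-job rule can be sketched roughly as follows; `jobGate`, `tryStart`, and `statusFor` are hypothetical names for illustration, not the server's actual identifiers.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// jobGate allows at most one scrape job at a time; a second concurrent
// request is answered with HTTP 409 Conflict.
type jobGate struct {
	mu      sync.Mutex
	running bool
}

// tryStart reports whether the caller may begin a job.
func (g *jobGate) tryStart() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.running {
		return false
	}
	g.running = true
	return true
}

// finish marks the job done so the next request can proceed.
func (g *jobGate) finish() {
	g.mu.Lock()
	g.running = false
	g.mu.Unlock()
}

// statusFor maps the gate's decision to an HTTP status code.
func statusFor(g *jobGate) int {
	if !g.tryStart() {
		return http.StatusConflict // 409
	}
	return http.StatusAccepted // 202: job started asynchronously
}

func main() {
	g := &jobGate{}
	fmt.Println(statusFor(g), statusFor(g)) // prints: 202 409
}
```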
## AI Context Tips

- Primary files to modify: `orchestrator.go`, `server.go`, `scraper.go`, `browser/*.go`
- To add new source: implement the `NovelScraper` interface from `internal/scraper/interfaces.go`
- Skip the `static/` directory - generated content, not source
## Speed Up AI Sessions (Optional)

For faster AI context loading, use Context7 (free, local indexing):

```sh
# Install and index once
npx @context7/cli@latest index --path . --ignore .aiignore

# After first run, AI tools will query the index instead of re-scanning files
```

VSCode extension: https://marketplace.visualstudio.com/items?itemName=context7.context7