- Add Kokoro-FastAPI TTS integration to the chapter reader UI:
  - Browser-side MSE streaming with paragraph-level click-to-start
  - Voice selector, speed slider, auto-next with prefetch of the next chapter
  - New GET /ui/chapter-text endpoint that strips Markdown and serves plain text
- Add ranking page (novelfire /ranking scraper, WriteRanking/ReadRankingItems in writer, GET /ranking + POST /ranking/refresh + GET /ranking/view routes) with local-library annotation and one-click scrape buttons
- Add StrategyDirect (plain HTTP client) as a new browser strategy; the default strategy is now 'direct' for chapter fetching and 'content' for chapter-list URL retrieval (split via BROWSERLESS_URL_STRATEGY)
- Fix chapter numbering bug: numbers are now derived from the URL path (/chapter-N) rather than list position, correcting newest-first ordering
- Add 'refresh <slug>' CLI sub-command to re-scrape a book from its saved source_url without knowing the original URL
- Extend NovelScraper interface with RankingProvider (ScrapeRanking)
- Tune scraper timeouts: wait-for-selector reduced to 5 s, GotoOptions timeout set to 60 s, content/scrape client defaults raised to 90 s
- Fix cover extraction (figure.cover > img rather than bare img.cover)
- Add AGENTS.md and .aiignore for AI tooling context
- Add integration tests for browser client and novelfire scraper (build tag: integration) and unit tests for chapterNumberFromURL and pagination
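The URL-based numbering fix above can be sketched as follows. This is an assumed shape of `chapterNumberFromURL`, not the project's actual helper, which may handle more URL variants:

```go
package main

import (
	"regexp"
	"strconv"
)

// chapterRe matches the trailing /chapter-N segment of a chapter URL.
var chapterRe = regexp.MustCompile(`/chapter-(\d+)`)

// chapterNumberFromURL derives the chapter number from the URL path,
// returning 0 when no /chapter-N segment is present. Deriving the number
// from the URL (rather than list position) is what makes newest-first
// chapter lists order correctly.
func chapterNumberFromURL(u string) int {
	m := chapterRe.FindStringSubmatch(u)
	if m == nil {
		return 0
	}
	n, _ := strconv.Atoi(m[1])
	return n
}
```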
libnovel Project
Go web scraper for novelfire.net with TTS support via Kokoro-FastAPI.
Architecture
scraper/
├── cmd/scraper/main.go # Entry point: 'run' (one-shot) and 'serve' (HTTP server)
├── internal/
│ ├── orchestrator/orchestrator.go # Coordinates catalogue walk, metadata extraction, chapter scraping
│   ├── browser/                     # Browser client (direct/content/scrape/cdp strategies)
│ ├── novelfire/scraper.go # novelfire.net specific scraping logic
│ ├── server/server.go # HTTP API (POST /scrape, POST /scrape/book)
│ ├── writer/writer.go # File writer (metadata.yaml, chapter .md files)
│ └── scraper/interfaces.go # NovelScraper interface definition
└── static/books/ # Output directory for scraped content
Key Concepts
- Orchestrator: Manages concurrency - catalogue streaming → per-book metadata goroutines → chapter worker pool
- Browser Client: 4 strategies (direct/content/scrape/cdp); direct is a plain HTTP client, the others go via the Browserless Chrome container
- Writer: Writes metadata.yaml and chapter markdown files to static/books/{slug}/vol-0/1-50/
- Server: HTTP API with async scrape jobs, UI for browsing books/chapters, chapter-text endpoint for TTS
Commands
# Build
cd scraper && go build -o bin/scraper ./cmd/scraper
# One-shot scrape (full catalogue)
./bin/scraper run
# Single book
./bin/scraper run --url https://novelfire.net/book/xxx
# HTTP server
./bin/scraper serve
# Tests
cd scraper && go test ./...
Environment Variables
| Variable | Description | Default |
|---|---|---|
| BROWSERLESS_URL | Browserless Chrome endpoint | http://localhost:3000 |
| BROWSERLESS_STRATEGY | direct \| content \| scrape \| cdp (chapter fetching) | direct |
| BROWSERLESS_URL_STRATEGY | Strategy for chapter-list URL retrieval | content |
| SCRAPER_WORKERS | Chapter goroutines | NumCPU |
| SCRAPER_STATIC_ROOT | Output directory | ./static/books |
| SCRAPER_HTTP_ADDR | HTTP listen address | :8080 |
| KOKORO_URL | Kokoro TTS endpoint | http://localhost:8880 |
| KOKORO_VOICE | Default TTS voice | af_bella |
| LOG_LEVEL | debug \| info \| warn \| error | info |
Docker
docker-compose up -d # Starts browserless, kokoro, scraper
Code Patterns
- Uses log/slog for structured logging
- Context-based cancellation throughout
- Worker pool pattern in orchestrator (channel + goroutines)
- Mutex for single async job (409 on concurrent scrape requests)
AI Context Tips
- Primary files to modify: orchestrator.go, server.go, scraper.go, browser/*.go
- To add a new source: implement the NovelScraper interface from internal/scraper/interfaces.go
- Skip the static/ directory: generated content, not source
Speed Up AI Sessions (Optional)
For faster AI context loading, use Context7 (free, local indexing):
# Install and index once
npx @context7/cli@latest index --path . --ignore .aiignore
# After first run, AI tools will query the index instead of re-scanning files
VSCode extension: https://marketplace.visualstudio.com/items?itemName=context7.context7