
libnovel Project

Go web scraper for novelfire.net with TTS support via Kokoro-FastAPI.

Architecture

scraper/
├── cmd/scraper/main.go           # Entry point: 'run' (one-shot) and 'serve' (HTTP server)
├── internal/
│   ├── orchestrator/orchestrator.go  # Coordinates catalogue walk, metadata extraction, chapter scraping
│   ├── browser/                       # Browser client (content/scrape/cdp strategies) via Browserless
│   ├── novelfire/scraper.go          # novelfire.net specific scraping logic
│   ├── server/server.go              # HTTP API (POST /scrape, POST /scrape/book)
│   ├── writer/writer.go              # File writer (metadata.yaml, chapter .md files)
│   └── scraper/interfaces.go         # NovelScraper interface definition
└── static/books/                     # Output directory for scraped content

Key Concepts

  • Orchestrator: Manages concurrency (catalogue streaming → per-book metadata goroutines → chapter worker pool)
  • Browser Client: 3 strategies (content/scrape/cdp) via Browserless Chrome container
  • Writer: Writes metadata.yaml and chapter markdown files to static/books/{slug}/vol-0/1-50/
  • Server: HTTP API with async scrape jobs, UI for browsing books/chapters, chapter-text endpoint for TTS
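
The worker-pool flow above can be sketched in a few lines of Go. This is a minimal illustration of the pattern, not the project's actual orchestrator; the name scrapeChapters and the job payload are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// scrapeChapters fans chapter URLs out over a channel to a fixed
// pool of worker goroutines, mirroring the pattern described above.
func scrapeChapters(urls []string, workers int) []string {
	jobs := make(chan string)
	results := make(chan string, len(urls))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				// A real worker would fetch and parse the chapter here.
				results <- "scraped " + u
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	close(results)

	out := make([]string, 0, len(urls))
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	done := scrapeChapters([]string{"chapter-1", "chapter-2", "chapter-3"}, 2)
	fmt.Println(len(done)) // 3
}
```

Closing the jobs channel is what lets the workers' range loops terminate cleanly once the catalogue is drained.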

Commands

# Build
cd scraper && go build -o bin/scraper ./cmd/scraper

# One-shot scrape (full catalogue)
./bin/scraper run

# Single book
./bin/scraper run --url https://novelfire.net/book/xxx

# HTTP server
./bin/scraper serve

# Tests
cd scraper && go test ./...

Environment Variables

Variable              Description                  Default
BROWSERLESS_URL       Browserless Chrome endpoint  http://localhost:3000
BROWSERLESS_STRATEGY  content | scrape | cdp       content
SCRAPER_WORKERS       Chapter worker goroutines    NumCPU
SCRAPER_STATIC_ROOT   Output directory             ./static/books
SCRAPER_HTTP_ADDR     HTTP listen address          :8080
KOKORO_URL            Kokoro TTS endpoint          http://localhost:8880
KOKORO_VOICE          Default TTS voice            af_bella
LOG_LEVEL             debug | info | warn | error  info

Docker

docker-compose up -d  # Starts browserless, kokoro, scraper

Code Patterns

  • Uses log/slog for structured logging
  • Context-based cancellation throughout
  • Worker pool pattern in orchestrator (channel + goroutines)
  • Mutex for single async job (409 on concurrent scrape requests)

AI Context Tips

  • Primary files to modify: orchestrator.go, server.go, scraper.go, browser/*.go
  • To add new source: implement NovelScraper interface from internal/scraper/interfaces.go
  • Skip the static/ directory: it holds generated content, not source

Speed Up AI Sessions (Optional)

For faster AI context loading, use Context7 (free, local indexing):

# Install and index once
npx @context7/cli@latest index --path . --ignore .aiignore

# After first run, AI tools will query the index instead of re-scanning files

VSCode extension: https://marketplace.visualstudio.com/items?itemName=context7.context7