# libnovel Project

Go web scraper for novelfire.net with TTS support via Kokoro-FastAPI.

## Architecture

```
scraper/
├── cmd/scraper/main.go               # Entry point: 'run' (one-shot) and 'serve' (HTTP server)
├── internal/
│   ├── orchestrator/orchestrator.go  # Coordinates catalogue walk, metadata extraction, chapter scraping
│   ├── browser/                      # Browser client (content/scrape/cdp strategies) via Browserless
│   ├── novelfire/scraper.go          # novelfire.net-specific scraping logic
│   ├── server/server.go              # HTTP API (POST /scrape, POST /scrape/book)
│   ├── writer/writer.go              # File writer (metadata.yaml, chapter .md files)
│   └── scraper/interfaces.go         # NovelScraper interface definition
└── static/books/                     # Output directory for scraped content
```

## Key Concepts

- **Orchestrator**: Manages concurrency: catalogue streaming → per-book metadata goroutines → chapter worker pool
- **Browser Client**: Three strategies (content/scrape/cdp) via a Browserless Chrome container
- **Writer**: Writes metadata.yaml and chapter markdown files to `static/books/{slug}/vol-0/1-50/`
- **Server**: HTTP API with async scrape jobs, a UI for browsing books/chapters, and a chapter-text endpoint for TTS

## Commands

```bash
# Build
cd scraper && go build -o bin/scraper ./cmd/scraper

# One-shot scrape (full catalogue)
./bin/scraper run

# Single book
./bin/scraper run --url https://novelfire.net/book/xxx

# HTTP server
./bin/scraper serve

# Tests
cd scraper && go test ./...
```

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| BROWSERLESS_URL | Browserless Chrome endpoint | http://localhost:3030 |
| BROWSERLESS_STRATEGY | content \| scrape \| cdp | content |
| SCRAPER_WORKERS | Chapter goroutines | NumCPU |
| SCRAPER_STATIC_ROOT | Output directory | ./static/books |
| SCRAPER_HTTP_ADDR | HTTP listen address | :8080 |
| KOKORO_URL | Kokoro TTS endpoint | http://localhost:8880 |
| KOKORO_VOICE | Default TTS voice | af_bella |
| LOG_LEVEL | debug \| info \| warn \| error | info |
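
Defaults like these are typically wired up with `os.LookupEnv` fallbacks. A minimal sketch, assuming hypothetical helper names (`envOr`, `workerCount`) rather than the project's actual config code:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// envOr returns the value of key, or fallback if the variable is unset.
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

// workerCount parses SCRAPER_WORKERS, falling back to NumCPU
// as the table above describes.
func workerCount() int {
	if n, err := strconv.Atoi(envOr("SCRAPER_WORKERS", "")); err == nil && n > 0 {
		return n
	}
	return runtime.NumCPU()
}

func main() {
	fmt.Println(envOr("BROWSERLESS_URL", "http://localhost:3030"))
	fmt.Println(workerCount())
}
```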

## Docker

```bash
docker-compose up -d  # Starts browserless, kokoro, scraper
```

## Code Patterns

- Uses `log/slog` for structured logging
- Context-based cancellation throughout
- Worker-pool pattern in the orchestrator (channel + goroutines)
- A mutex guards the single async job (409 on concurrent scrape requests)

## AI Context Tips

- Primary files to modify: `orchestrator.go`, `server.go`, `scraper.go`, `browser/*.go`
- To add a new source: implement the `NovelScraper` interface from `internal/scraper/interfaces.go`
- Skip the `static/` directory: generated content, not source

## Speed Up AI Sessions (Optional)

For faster AI context loading, use **Context7** (free, local indexing):

```bash
# Install and index once
npx @context7/cli@latest index --path . --ignore .aiignore

# After the first run, AI tools will query the index instead of re-scanning files
```

VSCode extension: https://marketplace.visualstudio.com/items?itemName=context7.context7