taoguba-crawler

✓Clean

This skill should be used when the user asks to "crawl taoguba", "crawl tgb", "scrape taoguba articles", "run the crawler", "crawl bbs", "crawl home page", "generate article HTML", or needs to run the Taoguba (tgb.cn) web crawlers.

⭐ 0 stars · 🍴 0 forks · ↓ 0 installs · 📄 MIT

Install Command

npx skills add lisniuse/taoguba-crawler-skill

api data-science

Author

lisniuse

Repository

lisniuse/taoguba-crawler-skill

Discovered via

github topic

Quality score

25/100

Last commit

2/16/2026

SKILL.md

---
name: taoguba-crawler
description: This skill should be used when the user asks to "crawl taoguba", "crawl tgb", "scrape taoguba articles", "run the crawler", "crawl bbs", "crawl home page", "generate article HTML", or needs to run the Taoguba (tgb.cn) web crawlers.
version: 0.1.0
allowed-tools: Bash, Read
---

# Taoguba Crawler

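This skill runs the Taoguba (tgb.cn) article crawlers. The commands in the sections below assume the scripts sit in the project root; bundled copies live in `scripts/` (see the Scripts section).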

## Prerequisites

- Python 3 with `requests`, `beautifulsoup4`, `python-dotenv` installed
- A `.env` file in the project root containing `COOKIE` and optionally `USER_AGENT`
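
As a rough illustration of how these two settings could feed into request headers, here is a minimal sketch. It assumes the values are already present in the environment (in the real scripts, `load_dotenv()` from `python-dotenv` would typically load `.env` first). The function name `build_headers` and the fallback `DEFAULT_USER_AGENT` are hypothetical and not taken from the crawler source.

```python
import os
from typing import Mapping, Optional

# Hypothetical fallback; the real scripts may use a different default.
DEFAULT_USER_AGENT = "Mozilla/5.0"


def build_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build HTTP headers from COOKIE (required) and USER_AGENT (optional)."""
    # Default to the process environment, which load_dotenv() would populate.
    env = os.environ if env is None else env
    cookie = env.get("COOKIE", "").strip()
    if not cookie:
        # Fail early rather than crawl with an anonymous session.
        raise RuntimeError("COOKIE is missing; add it to .env in the project root")
    return {
        "Cookie": cookie,
        "User-Agent": env.get("USER_AGENT", "").strip() or DEFAULT_USER_AGENT,
    }
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise without touching a real `.env` file.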

## Available Crawlers

### 1. BBS Crawler (`crawler_bbs.py`)

Crawls the forum board at `tgb.cn/bbs/1/1` by scraping its HTML.

```bash
python crawler_bbs.py
```

- Extracts the article list by parsing `a.overhide.mw300` elements
- Fetches each article's main post and the author's replies
- Downloads images and embeds them as base64 in the HTML
- Outputs: `output/bbs_YYYY-MM-DD.json` and `output/bbs_YYYY-MM-DD_HHMMSS.html`
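
The selector `a.overhide.mw300` matches `<a>` tags that carry both the `overhide` and `mw300` classes. The real script uses `beautifulsoup4` for this (there, `soup.select("a.overhide.mw300")` would be the usual call); the sketch below swaps in the standard library's `html.parser` so it runs without extra packages. The class `ArticleLinkParser` and function `extract_article_links` are hypothetical names, and the returned `(href, title)` shape is an assumption about what the list step needs.

```python
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collect (href, title) pairs from <a> tags carrying both CSS classes."""

    REQUIRED = {"overhide", "mw300"}

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the matching <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        # Match only anchors that have every required class and an href.
        if self.REQUIRED <= classes and attr.get("href"):
            self._href = attr["href"]
            self._text = []

    def handle_data(self, data):
        # Accumulate text only while inside a matching anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_article_links(html):
    """Return a list of (href, title) tuples for matching article links."""
    parser = ArticleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Anchors that have only one of the two classes are skipped, which mirrors how the compound CSS selector behaves.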

### 2. Home Crawler (`crawler_home.py`)

Crawls the homepage recommendations via the JSON API (`/newIndex/getZh`).

```bash
python crawler_home.py
```

- Fetches articles from the JSON API (2 pages by default)
- Uses the same content extraction and HTML generation as the BBS crawler
- Outputs: `output/home_YYYY-MM-DD.json` and `output/home_YYYY-MM-DD_HHMMSS.html`
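
Both crawlers follow the same naming pattern: a date-only JSON file and a timestamped HTML file. A minimal sketch of that pattern, assuming local time and a hypothetical helper name `output_paths` (the actual scripts may build the names differently):

```python
from datetime import datetime
from pathlib import Path


def output_paths(prefix, now=None, out_dir="output"):
    """Return (json_path, html_path) for a crawler prefix such as 'bbs' or 'home'."""
    now = now or datetime.now()
    day = now.strftime("%Y-%m-%d")             # YYYY-MM-DD
    stamp = now.strftime("%Y-%m-%d_%H%M%S")    # YYYY-MM-DD_HHMMSS
    base = Path(out_dir)
    return base / f"{prefix}_{day}.json", base / f"{prefix}_{stamp}.html"
```

Because the JSON name carries only the date, two runs on the same day resolve to the same JSON path while each run gets its own HTML file. How the scripts treat an existing same-day JSON file is not stated here.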

## Common Workflow

To run both crawlers in sequence (the home crawler starts only if the BBS crawler exits successfully):

```bash
python crawler_bbs.py && python crawler_home.py
```

## Key Implementation Details

- **Authentication**: Both scripts read `COOKIE` from `.env` via `python-dotenv`
- **Rate limiting**: A 0.5-1 s delay between requests reduces the risk of being blocked
- **Image handling**: Images are downloaded and embedded as base64 in the HTML output
- **Article content**: Extracts main post (`#first`) and author replies (`.comment-data` with author badge)
- **Output directory**: All results are saved to the `output/` folder
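
The image-embedding and rate-limiting points above can be sketched as follows. This is an illustration under assumptions, not the scripts' actual code: `to_data_uri` and `polite_sleep` are hypothetical names, the MIME type is guessed from the URL's file extension, and the delay is drawn uniformly from the 0.5-1 s range.

```python
import base64
import mimetypes
import random
import time


def to_data_uri(image_bytes, url):
    """Encode downloaded image bytes as a data: URI usable in an <img src>."""
    mime, _ = mimetypes.guess_type(url)
    mime = mime or "application/octet-stream"  # fallback when the type is unknown
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def polite_sleep(low=0.5, high=1.0, sleep=time.sleep):
    """Pause for a random interval between requests and return the delay used."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

Embedding images this way keeps the generated HTML self-contained, at the cost of a larger file, since base64 adds roughly a third to the size of each image.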

## Scripts

The crawler scripts are bundled in `scripts/`:

- **`scripts/crawler_bbs.py`** - BBS forum crawler (HTML scraping)
- **`scripts/crawler_home.py`** - Homepage crawler (JSON API)

To run the bundled scripts directly:

```bash
python scripts/crawler_bbs.py
python scripts/crawler_home.py
```

## Troubleshooting

- If no articles are returned, check that `.env` contains a valid `COOKIE` value
- If image downloads fail, the HTML will show error messages inline
- Network timeouts default to 10-15 seconds per request
