You already know which newsletters are worth reading. The problem is there are too many of them.

Researchers tracking a space. Marketers monitoring competitors. Solopreneurs doing weekly roundups. Creators mapping what's already been covered before they write. Everyone in this situation does the same thing: open 15 tabs, skim, take notes, close tabs, repeat. It works until it doesn't.

The obvious solution is automation. Scrape the content, summarize with an LLM, pipe the output somewhere useful. Plenty of people have built this. Most of them quietly stopped running it after the first invoice, because per-token billing at scale is brutal. Summarize a few hundred articles and it's fine. Do it weekly across a dozen newsletters and the math stops working.

This project solves that. An open-source scraping and summarization API, built on Crawl4AI and Featherless, that processes any newsletter or blog at a fixed monthly cost. Models like Qwen3, Mistral Small, and MiniMax M2.5 are available from day one. Plans start at $25/month. The code is on GitHub.

Here's how it works and what you can actually build with it.

Why not just use RSS

RSS is the first thing anyone suggests for newsletter monitoring, and it's not wrong exactly. It's just limited in ways that matter.

Most publishers truncate their feeds. You get a title, a date, and a paragraph before it cuts off. The actual content, the thing worth summarizing, is behind a click. And a significant chunk of blogs, newsletters on custom domains, and independent sites don't maintain usable feeds at all. You'd be building two separate pipelines and still missing half your sources.

Crawl4AI scrapes the rendered page. Full article, every time, regardless of how the feed is configured or whether one exists. For Substack specifically, it also handles subscription popups automatically, no session management, no manual intervention. Point it at a URL and the content comes back clean.
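A single-page fetch with Crawl4AI is only a few lines. The sketch below uses the library's AsyncWebCrawler interface; strip_nav_lines is a hypothetical cleanup helper of this sketch's own making, not something Crawl4AI or the repo provides.

```python
# Sketch: fetch one article as rendered markdown with Crawl4AI, then trim
# link-only navigation lines. strip_nav_lines is a hypothetical helper.
import asyncio

def strip_nav_lines(markdown: str) -> str:
    """Drop lines that are nothing but a markdown link (usually nav/footer)."""
    kept = []
    for line in markdown.splitlines():
        bare = line.strip()
        if bare.startswith("[") and bare.endswith(")") and "](" in bare:
            continue
        kept.append(line)
    return "\n".join(kept)

async def fetch_article(url: str) -> str:
    from crawl4ai import AsyncWebCrawler  # pip install crawl4ai
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return strip_nav_lines(result.markdown)

if __name__ == "__main__":
    print(asyncio.run(fetch_article("https://example.substack.com/p/hello-world")))
```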

RSS is fine for counting posts. This is for reading them.

What the API does

Three endpoints, three different levels of automation.

/scrape takes any URL and returns the main content as clean markdown. Auto-detects page type, handles popups, strips the noise. Works on Substack, Ghost, WordPress, custom domains, whatever you throw at it.
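Calling it from your own code is a one-liner once the service is running. The base URL, the "url" query parameter, and the "markdown" response field below are assumptions for illustration; the exact schema is in the repo's README.

```python
# Hypothetical client for the /scrape endpoint. Base URL, query parameter
# name, and JSON response shape are assumptions -- check the README.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "http://localhost:8000"  # wherever you run the service

def scrape_endpoint(article_url: str) -> str:
    """Build the /scrape request URL for a given article."""
    return f"{API_BASE}/scrape?{urlencode({'url': article_url})}"

def scrape(article_url: str) -> str:
    """Fetch clean markdown for one article (assumed 'markdown' field)."""
    with urlopen(scrape_endpoint(article_url)) as resp:
        return json.load(resp)["markdown"]
```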

/scrape-and-summarize does the extraction and then passes the content to a Featherless-hosted model with a prompt you control. Default output is key topics, tools, and links. But the prompt is yours: extract action items, generate a thread, identify named people and companies, compare against a brief you paste in. The model doesn't care.
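Swapping in your own prompt looks roughly like this. The body fields ("url", "prompt") are assumed from the description above, not copied from the repo's schema.

```python
# Sketch of a /scrape-and-summarize request body with a custom prompt.
# Field names are assumptions; the real schema lives in the repo.
import json

def summarize_payload(article_url: str, prompt: str) -> str:
    """Serialize a request body that swaps in your own extraction prompt."""
    return json.dumps({"url": article_url, "prompt": prompt})

payload = summarize_payload(
    "https://example.substack.com/p/weekly-roundup",
    "List every named person and company, one per line, "
    "with a one-line reason they were mentioned.",
)
```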

/scrape-all is the one that changes the workflow. Give it a newsletter homepage and a limit, up to 100 articles, and it crawls the archive, scrapes each piece, and summarizes them in sequence. One call. An entire newsletter back-catalog, processed and ready. Each article goes through the LLM separately so long-form content never hits a context wall.
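The per-article design is the point: each model call sees one post, so a 100-post archive never has to fit in a single context window. A stub version of that loop, with summarize standing in for the real Featherless call:

```python
# Why /scrape-all summarizes per article: N small prompts, never one huge one.
# summarize is a stub standing in for the Featherless-hosted model call.
def summarize(article_text: str) -> str:
    # Stub: the real service sends article_text to the configured model.
    return article_text[:60] + "..."

def summarize_archive(articles: list[str]) -> list[str]:
    """One model call per article, in sequence."""
    return [summarize(a) for a in articles]

digests = summarize_archive(["First long post " * 50, "Second long post " * 50])
# Two digests, and no single call ever saw both posts.
```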

The full code, environment setup, and example requests are in the GitHub repo.

What people are actually building with this

  • Niche intelligence feeds. Pick the 10 newsletters that actually matter in your space. Run /scrape-all on each one weekly. You now have a structured summary of everything published across your niche, without reading any of it. Investors do this with equity research. Marketers do it with competitor blogs. Analysts do it with industry publications. All of them are doing it by hand right now.
  • Automated curation newsletters. Some of the most successful newsletters are just good curation: someone reads everything so their audience doesn't have to. That reading layer is now automatable. Pull summaries from a curated source list, pipe them into a template, edit the interesting ones, ship it. The research goes from four hours to twenty minutes.
  • Research archives. If you are writing a book, a report, or a long-form piece on a topic, someone has probably been publishing about it on Substack for two years. Scrape the back-catalog, summarize by theme, and you have structured research material instead of a folder of URLs you will never get back to.
  • Content strategy. Before you plan your next piece, know what's already been covered thoroughly and what the gaps are. Scrape competitor archives, summarize by topic, and map the territory. Takes an afternoon manually. Takes a few API calls with this.
  • Reading apps and aggregators. If you are building any kind of content product, this is infrastructure you don't have to build. Plug it in behind your own API, connect to your frontend, and you have full-text extraction and LLM summarization without writing a crawler.
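For the curation workflow in particular, the glue code is trivial. A minimal digest renderer that turns per-article summaries into a markdown issue you can edit before shipping; the dict fields here are this sketch's own convention, not the API's response schema:

```python
# Render per-article summaries into an editable markdown digest.
# The {"title", "url", "summary"} fields are this sketch's convention.
def render_digest(title: str, items: list[dict]) -> str:
    lines = [f"# {title}", ""]
    for item in items:
        lines.append(f"## [{item['title']}]({item['url']})")
        lines.append(item["summary"])
        lines.append("")
    return "\n".join(lines)

issue = render_digest(
    "This Week in AI Infra",
    [{"title": "Post A", "url": "https://example.com/a", "summary": "Key points..."}],
)
```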

The cost argument, plainly

Per-token billing punishes volume. That's not a criticism of any particular provider; it's just how the math works. The more you process, the more you pay, which means you are always optimizing for cost instead of coverage. Fewer sources. Smaller batches. Summaries cut short to save tokens.

Featherless charges flat. No per-token meter running underneath, no invoice that scales with how useful the tool becomes. The $25/month Basic plan handles moderate workloads and is enough for most solo use cases. The Premium plan adds heavier models like MiniMax M2.5 and DeepSeek V3.2 for larger batch jobs.

For a batch summarization pipeline, this is the thing that makes it worth running. Fixed cost means you process everything, not just what you can afford to process this month.
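You can put a rough break-even on this. Every number below is an illustrative assumption, not a quote from any provider: pick a hypothetical metered rate and an average article size, and flat pricing wins once your monthly volume crosses the line. A few /scrape-all runs over back-catalogs plus a weekly batch clears it quickly.

```python
# Back-of-envelope break-even between metered and flat billing.
# All numbers are illustrative assumptions, not provider prices.
flat_monthly_usd = 25.00
assumed_price_per_m_tokens = 3.00   # hypothetical metered rate, USD per 1M tokens
tokens_per_article = 4_000          # long-form post, input + summary output

break_even_tokens = flat_monthly_usd / assumed_price_per_m_tokens * 1_000_000
break_even_articles = break_even_tokens / tokens_per_article

print(round(break_even_tokens))     # monthly token volume where flat pulls ahead
print(round(break_even_articles))   # same threshold, in articles
```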

Get started

You need a Featherless API key, Python, and about ten minutes.

  1. Sign up at featherless.ai and grab your API key from the dashboard
  2. Clone the repo and follow the README setup
  3. Add your FEATHERLESS_API_KEY, FEATHERLESS_BASE_URL, and FEATHERLESS_MODEL to your .env
  4. Hit /scrape-all with any newsletter URL and a limit
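A .env for step 3 looks like this. The values are examples: the base URL is Featherless's standard OpenAI-compatible endpoint, and the model name is a placeholder; pick any model from your plan.

```
# .env -- example values only; swap in your own key and model
FEATHERLESS_API_KEY=your-key-here
FEATHERLESS_BASE_URL=https://api.featherless.ai/v1
FEATHERLESS_MODEL=Qwen/Qwen3-32B
```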

GitHub repo: https://github.com/fullatron/crawl4AI-substack-scraper

Built something with it, or adapted it for a different source? Tell us on Discord. We want to see what people make when infrastructure cost stops being the reason not to ship.