Six months ago I had a channel with no videos and a problem: I had plenty of topic ideas but zero time to film, edit, or even think about thumbnails. So I did what any developer would do — I automated the whole thing.
Today that pipeline is open-source. It’s called ShortEngine, it’s on GitHub, and it turns a topic string into a published YouTube Short in 90 seconds or less. Here’s how it works, why I built it this way, and the code behind each step.
Why I Built This
The honest answer: I wanted the output of a content channel without production becoming a second job. Faceless channels are a real business model, but only if the production is actually automated, not just “I batch-record on Sundays.”
I wanted to type one command and have a video appear on YouTube. That’s the bar I set. Everything I built was in service of that single workflow:
npx shortengine generate --topic "5 free tools every developer should know"
Output:
✓ Script generated (1.2s)
✓ Voiceover created (3.4s)
✓ Images generated (8.1s)
✓ Video rendered (22.3s)
✓ Uploaded to YouTube (14.2s)
Done in 49.2s — https://youtube.com/shorts/dQw4w9WgXcQ
That’s the real output from a production run. Under 50 seconds from command to live video.
Architecture Overview
ShortEngine is a TypeScript CLI with five sequential stages, each encapsulated in its own module:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Ollama │ → │ edge-tts │ → │ Flux │ → │ FFmpeg │ → │ YouTube │
│ (scripts) │ │ (voice) │ │ (images) │ │ (render) │ │ (upload) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
Every stage writes its artifacts to a temp directory. If any stage fails, the pipeline stops cleanly with a useful error. The whole thing is idempotent — re-running with the same topic skips completed stages.
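The runner itself is just a loop over those stages. Here is a minimal sketch of the idea, with illustrative names rather than the repo’s exact ones:

// Illustrative orchestrator sketch; Stage, artifact, and runPipeline are
// placeholder names, not the repo's exact API.
import { promises as fs } from "fs";
import path from "path";

interface Stage {
  artifact: string; // file this stage writes into the temp dir
  run: () => Promise<void>;
}

async function runPipeline(stages: Stage[], tmpDir: string): Promise<void> {
  for (const stage of stages) {
    const artifactPath = path.join(tmpDir, stage.artifact);
    // Idempotence: a stage whose artifact already exists is skipped
    try {
      await fs.access(artifactPath);
      continue;
    } catch {
      // artifact missing, so run the stage
    }
    await stage.run(); // a throw here stops the pipeline with the stage's error
  }
}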
Stage 1: Script Generation with Ollama
I use Ollama running llama3.1:8b locally. No API key, no cost, no rate limits. On an M2 Mac it takes 8–15 seconds to generate a 200-word script.
The prompt engineering matters a lot here. A naive prompt produces rambling output. The version I landed on:
// src/stages/generate-script.ts

// Shape matches the JSON format requested in the prompt below
export interface Script {
  hook: string;
  scenes: string[];
  cta: string;
}

export async function generateScript(topic: string): Promise<Script> {
  const prompt = `You are a YouTube Shorts scriptwriter.
Write a 60-second script for this topic: "${topic}"
Format your response as JSON:
{
  "hook": "Opening 5 seconds that stops the scroll",
  "scenes": ["Scene 1 narration", "Scene 2 narration", "Scene 3 narration"],
  "cta": "End card call to action (5 seconds)"
}
Rules: No filler words. Every sentence earns its place. Hook must create curiosity.`;

  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.1:8b",
      prompt,
      stream: false,
      format: "json",
    }),
  });
  if (!res.ok) throw new Error(`Ollama request failed: ${res.status}`);

  const data = (await res.json()) as { response: string };
  return JSON.parse(data.response) as Script;
}
The format: "json" parameter tells Ollama to constrain output to valid JSON. This eliminated about 90% of parsing errors I was seeing with free-form text output.
A real script output looks like this:
{
  "hook": "You've been writing TypeScript wrong for years. Here's what senior devs actually do.",
  "scenes": [
    "First: they never use 'any'. When you need to type an unknown value, use 'unknown' instead — it forces you to narrow before using it.",
    "Second: they use const assertions. Add 'as const' to any literal and TypeScript treats it as a readonly tuple — no more widened types.",
    "Third: they use discriminated unions instead of optional fields. Your state machine becomes self-documenting and exhaustively checked."
  ],
  "cta": "Link in bio for the full TypeScript cheat sheet. Follow for more."
}
Stage 2: Voiceover with edge-tts
The edge-tts Python package is the best free TTS option I’ve found: it taps the same Microsoft neural voices that power Edge’s Read Aloud feature, and the output is genuinely natural-sounding with good pacing.
I wrap it in a shell call from TypeScript:
// src/stages/generate-voice.ts
import { execSync } from "child_process";
import { promises as fs } from "fs";
import path from "path";

export async function generateVoiceover(
  text: string,
  outputDir: string
): Promise<string> {
  const outputPath = path.join(outputDir, "voiceover.mp3");
  const voice = "en-US-AndrewNeural"; // Best voice for tech content

  // Write text to a temp file to avoid shell-escaping issues
  const textPath = path.join(outputDir, "script.txt");
  await fs.writeFile(textPath, text);

  execSync(
    `edge-tts --voice "${voice}" --file "${textPath}" --write-media "${outputPath}"`,
    { stdio: "pipe" }
  );
  return outputPath;
}
The AndrewNeural voice tested best with developer audiences — confident and clear without sounding robotic. For niches targeting younger audiences, en-US-EmmaNeural performs better.
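If you run more than one niche, it’s worth keeping that choice in one lookup. A tiny sketch (the niche keys and helper name are mine, not the repo’s):

// Hypothetical per-niche voice table; keys and helper name are illustrative
const VOICES: Record<string, string> = {
  developer: "en-US-AndrewNeural",
  lifestyle: "en-US-EmmaNeural",
};

function pickVoice(niche: string): string {
  return VOICES[niche] ?? "en-US-AndrewNeural";
}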
Stage 3: Image Generation with Flux via Pollinations
Pollinations.ai gives you free access to Flux image generation through a simple URL API. No account, no key, no rate limit that matters at one-video-per-run throughput.
// src/stages/generate-images.ts
import { promises as fs } from "fs";
import path from "path";

export async function generateImages(
  scenes: string[],
  outputDir: string
): Promise<string[]> {
  const imagePrompts = scenes.map(
    (scene) =>
      `${scene} — cinematic, high quality, dark background, neon accents, 4K`
  );

  const downloads = imagePrompts.map(async (prompt, i) => {
    const url = `https://image.pollinations.ai/prompt/${encodeURIComponent(prompt)}?width=1080&height=1920&model=flux&nologo=true`;
    const res = await fetch(url);
    if (!res.ok) throw new Error(`Image generation failed for scene ${i}`);
    const outputPath = path.join(outputDir, `scene_${i}.jpg`);
    await fs.writeFile(outputPath, Buffer.from(await res.arrayBuffer()));
    return outputPath;
  });

  // Generate all images in parallel
  return Promise.all(downloads);
}
Parallel generation is key: fetching 6 images sequentially takes ~30 seconds; in parallel it’s ~10 seconds.
One gotcha: Pollinations occasionally returns a 429 or a blank image when their servers are busy. ShortEngine has a retry wrapper with exponential backoff around this call. The open-source version on GitHub includes that logic.
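That wrapper is only a few lines. A minimal sketch, with illustrative attempt counts and delays rather than the repo’s exact values:

// Illustrative retry helper; attempt count and backoff base are placeholders
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** i));
    }
  }
  throw lastError;
}

Wrapping the Pollinations fetch then becomes withRetry(() => fetch(url)).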
Stage 4: Rendering with FFmpeg
This is the stage that does the most heavy lifting, but FFmpeg makes it clean:
// src/stages/render-video.ts
import { execSync } from "child_process";
import { promises as fs } from "fs";

export async function renderVideo(
  images: string[],
  audioPath: string,
  outputPath: string
): Promise<void> {
  // Build the concat file: each image holds for its proportional share of the audio
  const audioDuration = await getAudioDuration(audioPath);
  const perImageDuration = audioDuration / images.length;
  const concatContent =
    images
      .map((img) => `file '${img}'\nduration ${perImageDuration.toFixed(2)}`)
      .join("\n") +
    // FFmpeg's concat demuxer ignores the final duration directive unless
    // the last file is listed once more, so repeat it.
    `\nfile '${images[images.length - 1]}'`;
  const concatFile = outputPath.replace(".mp4", "_concat.txt");
  await fs.writeFile(concatFile, concatContent);

  execSync(
    `ffmpeg -y \
      -f concat -safe 0 -i "${concatFile}" \
      -i "${audioPath}" \
      -vf "scale=1080:1920:force_original_aspect_ratio=increase,crop=1080:1920,setsar=1,fps=30" \
      -c:v libx264 -preset fast -crf 22 -pix_fmt yuv420p \
      -c:a aac -b:a 192k \
      -movflags +faststart \
      -shortest \
      "${outputPath}"`,
    { stdio: "pipe" }
  );
}

async function getAudioDuration(audioPath: string): Promise<number> {
  const output = execSync(
    `ffprobe -v quiet -show_entries format=duration -of csv=p=0 "${audioPath}"`
  )
    .toString()
    .trim();
  return parseFloat(output);
}
The -movflags +faststart flag moves the MP4 metadata to the beginning of the file — YouTube’s ingest system processes these faster. The -pix_fmt yuv420p flag ensures compatibility with all players.
Stage 5: YouTube Upload
The upload stage uses the official Google API client. The first-run OAuth flow is manual (open a URL, click approve), but the refresh token persists so every subsequent run is fully automatic:
// src/stages/upload-youtube.ts
import { google } from "googleapis";
import { createReadStream } from "fs";
import type { Script } from "./generate-script";

export async function uploadToYouTube(
  videoPath: string,
  script: Script,
  topic: string
): Promise<string> {
  const auth = await getOAuth2Client(); // Loads cached token or prompts OAuth flow
  const youtube = google.youtube({ version: "v3", auth });

  const description = [
    script.scenes.join("\n\n"),
    "",
    "---",
    "Built with ShortEngine — https://github.com/johnmives/shortengine",
  ].join("\n");

  const response = await youtube.videos.insert({
    part: ["snippet", "status"],
    requestBody: {
      snippet: {
        title: topic,
        description,
        tags: topic.split(" ").concat(["shorts", "developer", "coding"]),
        categoryId: "28", // Science & Technology
        defaultLanguage: "en",
      },
      status: {
        privacyStatus: "public",
        selfDeclaredMadeForKids: false,
        madeForKids: false,
      },
    },
    media: {
      mimeType: "video/mp4",
      body: createReadStream(videoPath),
    },
  });
  return response.data.id!;
}
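getOAuth2Client isn’t shown above. A minimal sketch of what it has to do, assuming the client credentials live in environment variables and the token cache in a local token.json (both names are my placeholders, not necessarily the repo’s):

// Sketch only: env var names and the token.json path are placeholders
import { google } from "googleapis";
import { promises as fs } from "fs";

async function getOAuth2Client() {
  const client = new google.auth.OAuth2(
    process.env.YT_CLIENT_ID,
    process.env.YT_CLIENT_SECRET,
    "http://localhost:3000/oauth2callback"
  );
  // On every run after the first, reuse the cached refresh token;
  // the googleapis client auto-refreshes expired access tokens.
  const cached = JSON.parse(await fs.readFile("token.json", "utf8"));
  client.setCredentials(cached);
  return client;
}

The first-run flow (generateAuthUrl, getToken, then writing token.json) is the one-time manual approval step described above.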
What I Learned
Prompt structure beats model size. A structured JSON prompt on llama3.1:8b outperformed unstructured prompts on bigger models. The JSON constraint forces the model to think in terms of discrete scenes.
Parallel is mandatory. Image generation is the bottleneck. Sequential fetching made the pipeline feel broken. Parallel reduces it from 30s to 10s.
FFmpeg’s -shortest flag is essential. Without it, FFmpeg pads the video to match the longer of audio/video. The result is a video that ends and then hangs on a black frame — YouTube doesn’t like this.
Cache your OAuth token. Google’s access token expires after an hour. Store both the access token and the refresh token, and implement auto-refresh. ShortEngine handles this transparently.
Tempdirs need cleanup. Each run generates ~150MB of image and audio artifacts. Add a cleanup hook at the end of the pipeline or your disk fills up after 50 runs.
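The hook itself is tiny. Assuming the run’s temp directory path is at hand, something like:

import { promises as fs } from "fs";

// Delete this run's intermediate images, audio, and concat files
async function cleanupTempDir(tmpDir: string): Promise<void> {
  await fs.rm(tmpDir, { recursive: true, force: true });
}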
Get ShortEngine
The complete source code is on GitHub: github.com/johnmives/shortengine
The repo includes:
- Full TypeScript source with all 5 stages
- Setup guide (OAuth, Ollama model download, edge-tts install)
- A topics.json example file for batch mode
- A GitHub Actions workflow for running on a schedule
If you want the Pro version — which adds a web UI, a topic scheduler, analytics dashboard, and pre-built niche topic packs — it’s available as a one-time purchase on Gumroad: revxljohn.gumroad.com/l/aibuce
The free version is genuinely complete. Pro is for people who want to skip the setup and run multiple channels without touching code.
What’s Next
I’m currently working on:
- Trend injection — pull from Google Trends to auto-select topics that are peaking
- Subtitle burn-in — word-level highlights synchronized to the voiceover
- Multi-channel mode — run separate topic queues for separate channels from one install
- Analytics feedback loop — retire topics that underperform, double down on what gets views
If you build something with ShortEngine, open an issue or a PR. The best version of this tool gets built by people actually running channels with it.