Building a Reproducible YouTube-to-M4B Audiobook Pipeline

Build a reproducible command-line pipeline using Nix and Task to convert YouTube videos into voice-optimized M4B audiobooks with native chapters and static cover art.

Woojong Koh

Long-form YouTube videos—lectures, philosophy readings, audio essays, and scientific discussions—make for exceptional passive listening. However, using the standard YouTube app or a general-purpose media player on your phone is a suboptimal experience. If you lock your screen, close the app, or lose cellular service, you lose your place.

The gold standard for passive listening is a dedicated mobile audiobook player like BookPlayer or Bound on iOS, or Voice Audiobook Player on Android. These players expect standard .m4b files, featuring native chapter markers, low-bitrate voice-optimized audio, and standard embedded cover artwork.

This guide walks through building a fully automated command-line pipeline that downloads a YouTube video, optimizes the audio for speech, extracts chapters, and embeds native cover art, all inside a reproducible Nix shell environment orchestrated by Task.


1. The Core Architecture

A production-grade audiobook pipeline needs to solve three critical technical problems:

  1. Vocal Optimization: Philosophy lectures do not need 320 kbps stereo audio. We downmix to mono at 48,000 Hz and encode AAC at an efficient 34–48 kbps. This keeps files compact (roughly 15–22 MB per hour of audio) while preserving vocal clarity.
  2. Interactive Chapters: We programmatically harvest the chapter markers YouTube publishes for the video (typically timestamps the uploader lists in the description) and embed them as native QuickTime chapter metadata.
  3. Pristine Cover Art: Dedicated audiobook players are strict about how cover art is stored inside the file. A naive thumbnail download can easily break cover art rendering on mobile devices (more on this below).
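
The size figures above follow from simple arithmetic on the bitrate. As a rough sketch (estimate_mb is a hypothetical helper name, and the estimate ignores container overhead, cover art, and chapter metadata), the audio payload for one hour can be computed like this:

```shell
# Estimate the audio payload in whole megabytes (10^6 bytes), rounded down.
# $1 = bitrate in kbps, $2 = duration in seconds
estimate_mb() {
  kbps=$1
  seconds=$2
  # kbps * 1000 bits/s, divided by 8 bits per byte, times seconds,
  # divided by 1,000,000 bytes per megabyte
  echo $(( kbps * 1000 / 8 * seconds / 1000000 ))
}

estimate_mb 34 3600   # prints 15
estimate_mb 48 3600   # prints 21
```

The second call uses the Taskfile's default bitrate of 48k, so an hour of audio at the default setting lands a little above 21 MB.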

Here is how the automated pipeline flows from a single command:

graph TD
    A[YouTube URL] --> B[yt-dlp Download]
    B --> C[Convert WebP Thumbnail to JPEG]
    B --> D[Extract Mono AAC Audio]
    B --> E[Harvest Chapter Markers]
    C --> F[Native Mutagen/AtomicParsley Metadata Embedding]
    D --> F
    E --> F
    F --> G[Pristine .m4b Audiobook]
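
To make the chapter step in the flow above more concrete: yt-dlp handles chapter embedding itself in this pipeline, but FFmpeg's metadata file format gives a feel for what a single chapter entry contains. The sketch below is illustrative only. chapter_block is a hypothetical helper that is not part of the pipeline, and it assumes titles contain no characters that need escaping in that format.

```shell
# Print one chapter entry in FFmpeg's metadata file format.
# $1 = start time in seconds, $2 = end time in seconds, $3 = chapter title
chapter_block() {
  # TIMEBASE=1/1000 means START and END are expressed in milliseconds
  printf '[CHAPTER]\nTIMEBASE=1/1000\nSTART=%d\nEND=%d\ntitle=%s\n' \
    "$(( $1 * 1000 ))" "$(( $2 * 1000 ))" "$3"
}

chapter_block 0 90 "Introduction"
```

A chapter that runs from 0:00 to 1:30 therefore gets START=0 and END=90000.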

2. Step 1: The Reproducible Nix Shell

To make the pipeline run the same way on any machine without polluting global system paths, we use a Nix flake. The flake provides go-task, yt-dlp, ffmpeg, and atomicparsley inside an isolated development shell, at the nixos-unstable versions pinned in the generated flake.lock.

Create a flake.nix in your workspace directory:

{
  description = "A Nix-flake-based development environment for Audiobook generation";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  };

  outputs = { self, nixpkgs }:
    let
      supportedSystems = [ "x86_64-linux" "aarch64-linux" "x86_64-darwin" "aarch64-darwin" ];
      forEachSupportedSystem = f: nixpkgs.lib.genAttrs supportedSystems (system: f {
        pkgs = import nixpkgs { inherit system; };
      });
    in
    {
      devShells = forEachSupportedSystem ({ pkgs }: {
        default = pkgs.mkShell {
          packages = with pkgs; [
            go-task
            yt-dlp
            ffmpeg
            atomicparsley
          ];
        };
      });
    };
}
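
Once the flake is in place, a quick sanity check can confirm that the shell exposes the tools the pipeline relies on. A minimal sketch, assuming the packages install binaries named task, yt-dlp, ffmpeg, and ffprobe (missing_tools is a hypothetical helper):

```shell
# Print the name of each required tool that is NOT on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Inside `nix develop` this is expected to print nothing.
missing_tools task yt-dlp ffmpeg ffprobe
```

Any name it prints points at a package that is missing from the packages list in flake.nix, or at a shell that was started outside nix develop.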

3. Step 2: Orchestration via Task

Instead of writing a brittle bash script, we use Task (Taskfile.yml) to orchestrate the pipeline. The task validates its arguments, creates the output directory, runs yt-dlp, and fixes up the output file extension afterwards.

Create a Taskfile.yml in your project root:

version: '3'

tasks:
  audiobook:
    desc: Download YouTube audio as an M4B audiobook with chapters, cover, and metadata
    summary: |
      Usage: task audiobook URL="https://www.youtube.com/watch?v=..." [OUT="filename"] [DIR="work/audiobooks"] [BITRATE="48k"] [CHANNELS="1"]
    vars:
      URL: '{{.URL | default ""}}'
      OUT: '{{.OUT | default "%(title)s"}}'
      DIR: '{{.DIR | default "work/audiobooks"}}'
      BITRATE: '{{.BITRATE | default "48k"}}'
      CHANNELS: '{{.CHANNELS | default "1"}}'
    cmds:
      - |
        # Abort on the first failing command so a yt-dlp error fails the task
        set -e
        if [ -z "{{.URL}}" ]; then
          echo "Error: URL variable is required." >&2
          echo "Usage: task audiobook URL=\"https://www.youtube.com/watch?v=...\"" >&2
          exit 1
        fi
        mkdir -p "{{.DIR}}"

        # 1. Dynamically locate ffmpeg in our Nix store path
        FFMPEG_DIR=$(dirname "$(command -v ffmpeg)")
        
        # 2. Run the optimized yt-dlp extraction
        yt-dlp -x --audio-format m4a --audio-quality {{.BITRATE}} --embed-thumbnail --convert-thumbnails jpg --embed-metadata --embed-chapters \
          --ffmpeg-location "$FFMPEG_DIR" \
          --paths "home:{{.DIR}}" --paths "temp:.tmp" \
          --ppa "ThumbnailsConvertor+ffmpeg: -update 1" \
          --ppa "ExtractAudio: -ac {{.CHANNELS}}" \
          --ppa "ffmpeg: -f mp4" \
          -o "{{.OUT}}.m4b" \
          "{{.URL}}"
        
        # 3. Post-process: Rename output from .m4b.m4a to .m4b if yt-dlp appends .m4a
        for f in "{{.DIR}}"/*.m4b.m4a; do
          if [ -f "$f" ]; then
            echo "➔ Renaming $f to ${f%.m4a}"
            mv "$f" "${f%.m4a}"
          fi
        done        
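
The rename step relies on the shell's ${f%.m4a} suffix-removal expansion. The sketch below isolates that logic so it can be tried without downloading anything. final_name is a hypothetical helper, not something the Taskfile defines.

```shell
# Strip a trailing ".m4a" only when the name ends in ".m4b.m4a";
# any other name is returned unchanged.
final_name() {
  case $1 in
    *.m4b.m4a) printf '%s\n' "${1%.m4a}" ;;
    *)         printf '%s\n' "$1" ;;
  esac
}

final_name "work/audiobooks/Lecture.m4b.m4a"   # prints work/audiobooks/Lecture.m4b
final_name "work/audiobooks/Music.m4a"         # prints work/audiobooks/Music.m4a
```

Because the pattern only matches the doubled extension, ordinary .m4a files that happen to live in the same directory are left alone.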

4. Behind the Scenes: The “Aha!” Metadata Workarounds

While the Task configuration looks straightforward, it hides several hard-won workarounds for compatibility issues with mobile audiobook players and recent FFmpeg releases.

The H.264 Video Track Fallback Bug

If you download an audiobook with standard flags on macOS, you might notice that desktop media players (like IINA or VLC) display the cover art perfectly, while mobile players (BookPlayer or Bound) show a generic music-note icon.

Why? YouTube serves thumbnails in the modern .webp format. MP4 cover art metadata (the covr atom) holds JPEG or PNG images, not WebP, so yt-dlp falls back to using ffmpeg to stitch the WebP image directly into the file. This creates a miniature H.264 video track (usually 1 frame long) inside the file.

  • IINA is a full video-capable media player, so it successfully decodes and plays this H.264 track as the “cover.”
  • BookPlayer treats the file strictly as an audio file. To optimize performance and battery, it completely ignores video streams and looks only for the native MP4 metadata tag (the covr atom).

The Fix: We supply --convert-thumbnails jpg to convert the WebP image into a standard JPEG first. This lets yt-dlp skip the video track fallback and use the Python mutagen library to write the cover image directly into the covr metadata atom as a static attached picture.
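
A quick way to see which image format a thumbnail really is, regardless of its file extension, is to inspect its leading bytes. The sketch below is illustrative only: image_type is a hypothetical helper, and it assumes head, od, and tr are available. It relies on the JPEG signature FF D8 FF and on WebP being a RIFF container with the ASCII tag WEBP at byte offset 8.

```shell
# Report the image type of a file from its first 12 bytes.
image_type() {
  # Dump the leading bytes as one continuous lowercase hex string
  hex=$(head -c 12 "$1" | od -An -tx1 | tr -d ' \n')
  case $hex in
    ffd8ff*)                  echo jpeg ;;     # JPEG start-of-image marker
    52494646????????57454250) echo webp ;;     # "RIFF" + 4 size bytes + "WEBP"
    *)                        echo unknown ;;
  esac
}
```

Pointing it at a thumbnail saved next to the download shows whether the conversion to JPEG actually happened before embedding.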

The FFmpeg 8.0 image2 Muxer Gotcha

When running FFmpeg 8.0 inside a Nix environment, the image2 muxer expects the output filename to contain an image sequence pattern (such as %03d) unless it is told to write a single image. If you transcode a single WebP thumbnail to JPEG without one, FFmpeg throws a fatal error:

[image2] The specified filename does not contain an image sequence pattern or a pattern is invalid.

The Fix: We use yt-dlp’s post-processor argument (--ppa) to target the ffmpeg call made by the ThumbnailsConvertor post-processor and append the -update 1 flag:

--ppa "ThumbnailsConvertor+ffmpeg: -update 1"

This tells FFmpeg to treat the output name as a single file rather than a pattern, so it writes the static JPEG frame without requiring a sequence pattern.

Clean Workspace Sandboxing

Downloading multi-hour audiobooks generates large temporary stream fragments (.part files) and intermediate thumbnail images. Writing these files directly into your working directory is messy and clutters git status.

The Fix: We configure yt-dlp to segregate its build steps using the --paths system:

--paths "home:{{.DIR}}" --paths "temp:.tmp"

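This instructs yt-dlp to run all downloads and post-processing inside a hidden .tmp/ folder. Because yt-dlp resolves a relative temp path against the home path, the folder lands inside the output directory (work/audiobooks/.tmp/ with the defaults). Once the audiobook is fully assembled and tagged, yt-dlp moves the final file into the output folder and removes the temporary fragments.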

Add the .tmp directory to your .gitignore so these files never show up in version control:

# Temporary files
**/.tmp/

5. Running the Pipeline

To run the pipeline, enter the Nix dev shell and run the task command, supplying the YouTube URL and your desired output filename:

# 1. Enter the isolated environment
nix develop

# 2. Run the audiobook pipeline
task audiobook URL="https://www.youtube.com/watch?v=NHTc-hrpii0" OUT="Shakyamuni_Buddha_Wisdom_Sayings"

Once the extraction is complete, you can verify that the metadata is clean and correct using ffprobe:

ffprobe -show_streams work/audiobooks/Shakyamuni_Buddha_Wisdom_Sayings.m4b

You should see three separate streams:

  • Stream #0:0: AAC Audio (mono, voice-optimized)
  • Stream #0:1: Chapter text track (bin_data)
  • Stream #0:2: Native Cover Art (mjpeg, attached_pic)
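
For a scripted check, ffprobe can emit just the stream types, one per line, and a small function can assert on them. This is a sketch under assumptions: check_streams is a hypothetical helper, and it expects input in the form produced by ffprobe -v error -show_entries stream=codec_type -of csv=p=0.

```shell
# Read one codec_type per line on stdin. Succeed only when exactly one
# audio stream is present and no stream of an unexpected type appears.
check_streams() {
  audio=0
  while IFS= read -r kind; do
    case $kind in
      audio) audio=$((audio + 1)) ;;
      video|data|'') ;;   # cover art, chapter track, blank lines
      *) echo "unexpected stream type: $kind" >&2; return 1 ;;
    esac
  done
  [ "$audio" -eq 1 ]
}

# Example (requires ffprobe and a finished file):
#   ffprobe -v error -show_entries stream=codec_type -of csv=p=0 \
#     work/audiobooks/Shakyamuni_Buddha_Wisdom_Sayings.m4b | check_streams
```

The function only looks at stream types, so it will not catch a missing cover or missing chapters. It is meant as a first smoke test before loading the file onto a phone.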

Conclusion

By automating this pipeline with Nix and Task, we’ve eliminated the manual transcoding, metadata tagging, and player compatibility headaches. You get a single, reproducible command that converts a spoken-word YouTube video into a mobile-ready .m4b audiobook with native chapters and properly embedded cover art.

Happy listening!