
Teaching an AI Agent to Write Songs: Inside SongSmith

ai-agents · music-generation · llms · orchestration · architecture

I recently finished building SongSmith, an AI agent that writes and produces complete songs from natural language descriptions. What started as a fun idea—"what if an AI could be a songwriter?"—turned into a deep dive into agent orchestration, autonomous decision-making, and multimodal AI pipelines.

Here's how we taught an AI agent to write songs.

The Core Idea: Inline Tool Calling

Most AI agent frameworks use function calling APIs where the LLM explicitly invokes tools. We went a different direction: inline JSON tool calling. The LLM responds with JSON mixed into its text output, and our agent parses and executes the tools.

Why? It's simpler, works with any LLM (even ones without function calling support), and honestly feels more organic—the agent "thinks" by outputting JSON, then we execute it.

Here's the pattern:

# SongAgent orchestrates the entire pipeline
class SongAgent:
    def generate(self, prompt: str) -> SongProject:
        # 1. Ask the LLM to plan and invoke tools
        llm_response = self.ollama_client.generate(
            model="llama3.1:70b",
            prompt=self._build_system_prompt(prompt)
        )

        # 2. Parse JSON tool calls from LLM response
        tool_calls = self._parse_json_tools(llm_response)
        # Looks for: {"tool": "lyrics_writer", "args": {...}}

        # 3. Execute tools in sequence
        for call in tool_calls:
            if call["tool"] == "lyrics_writer":
                lyrics = self.lyrics_writer.write(
                    call["args"]["style"], call["args"]["mood"]
                )
            elif call["tool"] == "beat_generator":
                beat_audio = self.beat_gen.generate(call["args"]["style"])
            elif call["tool"] == "vocal_generator":
                vocals = self.vocal_gen.generate(lyrics, call["args"]["voice"])

The system prompt is crucial—it tells the LLM to be fully autonomous:

"You are SongAgent, an AI composer. Infer the mood, style, and voice from the user's description. Don't ask clarifying questions. Output your tool calls as JSON. Be creative and confident."

This gives the agent permission to make decisions without constantly asking the user.

The Architecture: Four Specialized Tools

SongSmith's pipeline looks like this (ASCII art incoming):

User Prompt
    ↓
┌─────────────────────────────────────────┐
│         SongAgent (Orchestrator)        │
│  Parses prompt, invokes tools in order  │
└─────────────────────────────────────────┘
    ↓
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Lyrics     │→ │    Beat      │→ │   Vocal      │→ │    Audio     │
│   Writer     │  │   Generator  │  │   Generator  │  │    Mixer     │
└──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘
     Ollama           MusicGen          Bark TTS          FFmpeg
   (LLM)         (Style Templates)     (Chunking)    (Blend+Balance)
                                          ↓
                                    Final MP3 Song

Each tool is independent and wraps a specialized model:

1. LyricsWriter — Creative Writing at Scale

Uses an Ollama LLM with temperature 0.8 for creativity. The temperature matters—0.8 is warm enough for poetic license but cool enough to stay coherent.

class LyricsWriter:
    def write(self, style: str, mood: str) -> str:
        prompt = f"""You are a professional songwriter. Write lyrics for a {style} song.
        Mood: {mood}

        Requirements:
        - Include a verse, chorus, and bridge
        - Make it singable (short lines, natural rhythm)
        - Be original and emotional
        """

        lyrics = self.ollama.generate(
            model="llama3.1:70b",
            prompt=prompt,
            temperature=0.8  # Creative but coherent
        )
        return lyrics

2. BeatGenerator — Style Templates as Code

This was the clever insight: instead of asking MusicGen to generate a "pop beat," we map styles to detailed descriptions that MusicGen understands.

from pathlib import Path

class BeatGenerator:
    STYLE_TEMPLATES = {
        "pop": "upbeat pop song, bright synths, 120 BPM, catchy hook",
        "hip-hop": "hip-hop beat, 90 BPM, hard-hitting drums, bass drop",
        "rock": "rock song, electric guitars, heavy drums, 100 BPM, energetic",
        "lullaby": "soft lullaby, gentle piano, string background, 60 BPM, calming",
        "r&b": "smooth R&B, laid-back vibe, groovy bass, 95 BPM, soulful",
        "electronic": "electronic dance, synthesizer, digital drums, 128 BPM, hypnotic",
        "acoustic": "acoustic guitar, minimal drums, organic, 110 BPM, intimate",
        "jazz": "jazz trio, upright bass, improvisation, 100 BPM, sophisticated"
    }

    def generate(self, style: str) -> Path:
        description = self.STYLE_TEMPLATES.get(style, "pop beat")
        beat = self.musicgen_model.generate(
            description=description,
            duration=30  # seconds
        )
        return beat.save_to_file()

The beauty here: we're not hard-coding audio. We're encoding knowledge about styles as language prompts. MusicGen does the heavy lifting.
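
One nice property of the template table is graceful degradation: unknown styles fall back to a generic prompt instead of failing. A quick standalone check of the lookup logic (the `describe` helper and input normalization are my additions for illustration):

```python
STYLE_TEMPLATES = {
    "pop": "upbeat pop song, bright synths, 120 BPM, catchy hook",
    "lullaby": "soft lullaby, gentle piano, string background, 60 BPM, calming",
}

def describe(style: str) -> str:
    # Normalize casing/whitespace; unknown styles degrade to a generic prompt
    return STYLE_TEMPLATES.get(style.lower().strip(), "pop beat")

print(describe("Lullaby"))    # full lullaby description
print(describe("synthwave"))  # "pop beat" fallback
```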

3. VocalGenerator — Smart Chunking Strategy

Bark TTS is powerful but works best on short text. Feeding 100 lines of lyrics at once produces weird pacing and intonation. So we chunk intelligently:

from pathlib import Path
from typing import List

class VocalGenerator:
    CHUNK_SIZE = 200  # characters

    def generate(self, lyrics: str, voice: str = "female_neutral") -> Path:
        chunks = self._smart_chunk(lyrics)
        vocal_segments = []

        for chunk in chunks:
            # Prefix with ♪ to hint "singing mode" to the TTS
            singing_prompt = f"♪ {chunk}"

            audio = self.bark_model.generate_speech(
                text=singing_prompt,
                voice_preset=voice
            )
            vocal_segments.append(audio)

        # Concatenate all segments
        return self._concatenate_audio(vocal_segments)

    def _smart_chunk(self, text: str) -> List[str]:
        """Split on sentence boundaries near CHUNK_SIZE."""
        chunks = []
        current = ""

        for sentence in text.split(". "):
            if len(current) + len(sentence) < self.CHUNK_SIZE:
                current += sentence + ". "
            else:
                if current:
                    chunks.append(current.strip())
                current = sentence + ". "

        if current:
            chunks.append(current.strip())

        return chunks

The ♪ prefix is a neat trick—it hints to the model that this is sung content, subtly affecting prosody.
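
`_concatenate_audio` isn't shown above. A minimal sketch, treating audio as flat lists of samples at 24 kHz (Bark actually returns NumPy arrays; the 150 ms gap is an assumed value), joins chunks with a short silence so phrases don't run together:

```python
SAMPLE_RATE = 24000  # Bark's output sample rate

def concatenate_audio(segments, gap_ms=150):
    """Join vocal chunks, inserting gap_ms of silence between consecutive chunks."""
    gap = [0.0] * (SAMPLE_RATE * gap_ms // 1000)
    out = []
    for i, seg in enumerate(segments):
        if i:  # no leading silence before the first chunk
            out.extend(gap)
        out.extend(seg)
    return out

# Two fake half-second chunks
a, b = [0.1] * 12000, [0.2] * 12000
mixed = concatenate_audio([a, b])
print(len(mixed))  # 12000 + 3600 (150 ms gap) + 12000 = 27600
```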

4. AudioMixer — FFmpeg Magic

Combining beat and vocals requires careful balance. We use FFmpeg's filter graph:

import subprocess
from pathlib import Path

class AudioMixer:
    def mix(self, beat_path: Path, vocals_path: Path) -> Path:
        """Blend beat and vocals with proper levels and crossfade."""

        # FFmpeg filter: loop beat, delay vocals, balance levels
        filter_graph = (
            # aloop's size is in samples; use a value large enough to cover
            # the whole beat clip so the full loop repeats
            "[0:a]aloop=loop=-1:size=2000000000[beat]; "
            "[1:a]adelay=500|500[vocals]; "
            "[beat]volume=0.7[beat_volume]; "       # ~ -3 dB
            "[vocals]volume=1.78[vocals_volume]; "  # ~ +5 dB
            # duration=shortest ends the mix when the finite vocal track does;
            # with the beat looping forever, "longest" would never terminate
            "[beat_volume][vocals_volume]amix=inputs=2:duration=shortest[out]"
        )

        output_path = beat_path.with_name("final_mix.mp3")
        subprocess.run([
            "ffmpeg", "-y",
            "-i", str(beat_path),
            "-i", str(vocals_path),
            "-filter_complex", filter_graph,
            "-map", "[out]",
            "-q:a", "9",  # libmp3lame VBR quality (0 = best, 9 = smallest)
            str(output_path)
        ], check=True)
        return output_path

Key choices:

  • Beat at -3dB (quieter) so vocals shine
  • Vocals at +5dB (louder) so they're clear
  • 500ms delay on vocals for sync
  • aloop filter to extend beat to match vocal length
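
FFmpeg's volume filter takes a linear factor rather than decibels, so targets expressed in dB have to be converted. The mapping is gain = 10^(dB/20):

```python
def db_to_gain(db: float) -> float:
    """Convert a decibel offset to the linear factor FFmpeg's volume filter expects."""
    return 10 ** (db / 20)

print(round(db_to_gain(-3), 2))  # 0.71 -- the beat's volume factor
print(round(db_to_gain(5), 2))   # 1.78 -- roughly +5 dB for the vocals
```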

Session Management: Persistence Matters

Songs are big projects. We persist state to JSON:

import json
from dataclasses import dataclass, asdict
from datetime import datetime
from pathlib import Path

@dataclass
class SongProject:
    title: str
    prompt: str
    lyrics: str
    style: str
    voice: str
    beat_path: Path
    vocals_path: Path
    final_audio_path: Path
    created_at: datetime

class SessionManager:
    def save_project(self, project: SongProject) -> Path:
        data = asdict(project)
        # Convert Path objects to strings for JSON
        data = {k: str(v) if isinstance(v, Path) else v for k, v in data.items()}

        project_file = self.session_dir / f"{project.title}.json"
        with open(project_file, "w") as f:
            json.dump(data, f, indent=2, default=str)

        return project_file

This lets users resume editing, re-mix different beat/vocal combinations, or iterate on lyrics.
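
`SessionManager`'s loading side isn't shown. A plausible counterpart (a sketch; the `PATH_FIELDS` tuple and the round-trip demo are my additions) simply reverses the string conversion:

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from datetime import datetime
from pathlib import Path

@dataclass
class SongProject:
    title: str
    prompt: str
    lyrics: str
    style: str
    voice: str
    beat_path: Path
    vocals_path: Path
    final_audio_path: Path
    created_at: datetime

PATH_FIELDS = ("beat_path", "vocals_path", "final_audio_path")

def load_project(project_file: Path) -> SongProject:
    """Inverse of save_project: restore Path and datetime fields from strings."""
    data = json.loads(project_file.read_text())
    for key in PATH_FIELDS:
        data[key] = Path(data[key])
    data["created_at"] = datetime.fromisoformat(data["created_at"])
    return SongProject(**data)

# Round-trip demo with hypothetical paths
project = SongProject(title="demo", prompt="a pop song", lyrics="la la",
                      style="pop", voice="female_neutral",
                      beat_path=Path("beat.wav"), vocals_path=Path("vocals.wav"),
                      final_audio_path=Path("final.mp3"),
                      created_at=datetime(2024, 1, 1, 12, 0))
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "demo.json"
    data = {k: str(v) if isinstance(v, Path) else v for k, v in asdict(project).items()}
    path.write_text(json.dumps(data, default=str))
    restored = load_project(path)
assert restored == project
```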

The User Interface: Streaming Matters

The Gradio UI is simple but effective. We stream the LLM output so users see the agent "thinking":

def generate_song(prompt: str, voice: str):
    agent = SongAgent(model="llama3.1:70b")

    # Stream LLM reasoning
    for token in agent.stream_generation(prompt):
        yield f"Agent: {token}"

    # Execute tools
    project = agent.generate(prompt)

    yield f"✓ Lyrics written\n✓ Beat generated\n✓ Vocals created\n✓ Mixed!"
    yield f"🎵 Song ready: {project.final_audio_path}"

Streaming transforms a black box (wait for a minute, then get audio) into an engaging interaction (watch the AI reason in real-time).

Example: "Write me a birthday song for my daughter"

Let me walk through what happens end-to-end when a user types a simple prompt.

User input:

"Write a fun birthday song for my 5-year-old daughter Anaya. She loves dinosaurs."

Step 1 — The LLM plans autonomously

The agent doesn't ask "what style?" or "what mood?" — it infers from context:

Agent thinking: Birthday song for a 5-year-old who loves dinosaurs.
               Style: pop/kids. Mood: happy, playful. Voice: female_expressive.

{"tool": "generate_all", "args": {
  "prompt": "fun birthday song for 5-year-old Anaya who loves dinosaurs",
  "style": "pop",
  "mood": "happy",
  "voice_type": "female_expressive"
}}

Step 2 — LyricsWriter generates lyrics

✓ Lyrics written:

[Verse 1]
Happy birthday little Anaya,
The coolest dino girl I know,
T-Rex stomp and Stego sway,
Let's make your birthday glow!

[Chorus]
Roar roar, it's your day,
Blow the candles, hip hooray!
Five years old and growing strong,
Sing along to Anaya's song!

[Verse 2]
Triceratops brought you a cake,
Velociraptor's running late,
But every dino near and far,
Knows today you are the star!

[Chorus] ...

Step 3 — BeatGenerator creates the instrumental

The style "pop" maps to the template: "upbeat pop song, bright synths, 120 BPM, catchy hook"

⏳ Generating beat... (MusicGen, ~30 seconds)
✓ Beat generated: pop_happy_30s.wav

Step 4 — VocalGenerator sings the lyrics

Bark TTS receives the lyrics in chunks, each prefixed with ♪ for singing mode:

⏳ Generating vocals... chunk 1/3
⏳ Generating vocals... chunk 2/3
⏳ Generating vocals... chunk 3/3
✓ Vocals generated: anaya_birthday_vocals.wav

Step 5 — AudioMixer blends everything

FFmpeg combines the beat (at -3dB) with vocals (at +5dB), loops the beat to match vocal length:

✓ Song mixed: anaya_birthday_final.mp3
🎵 Duration: 42 seconds
▶ [Play] [Download]

From prompt to finished song: about 2 minutes on GPU hardware. The user watched the agent reason, saw lyrics appear, and then heard their daughter's personalized birthday song.

What I Learned

1. LLM autonomy requires clear intent. The system prompt determines whether the agent asks endless clarifying questions or confidently makes decisions. "Don't ask, decide" works.

2. Tool descriptions matter more than tool calls. We could use function calling APIs, but encoding style knowledge as natural language (BeatGenerator's templates) was more flexible and easier to debug.

3. Chunking is an art. Naive chunking (split every N chars) breaks prosody. Smart chunking (split on sentence boundaries) makes TTS output more natural.
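
To make that concrete, here's naive fixed-width splitting next to sentence-boundary splitting (a standalone re-implementation of the chunker above, with a small limit so the difference shows):

```python
def naive_chunk(text: str, limit: int = 40) -> list[str]:
    """Split every `limit` characters, ignoring sentence boundaries."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]

def smart_chunk(text: str, limit: int = 40) -> list[str]:
    """Split on sentence boundaries, keeping chunks under `limit` characters."""
    chunks, current = [], ""
    for sentence in text.split(". "):
        if len(current) + len(sentence) < limit:
            current += sentence + ". "
        else:
            if current:
                chunks.append(current.strip())
            current = sentence + ". "
    if current:
        chunks.append(current.strip())
    return chunks

lyrics = "Roar roar, it's your day. Blow the candles, hip hooray. Sing along"
print(naive_chunk(lyrics))  # first chunk ends mid-word: "...Blow the candl"
print(smart_chunk(lyrics))  # each chunk is a complete phrase
```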

4. Audio levels are critical. A beat that's too loud drowns out lyrics. A beat that's too quiet feels thin. -3dB and +5dB are magic numbers we found empirically.

5. Persistence unlocks iteration. Once users can save and resume projects, they stop treating AI song generation as a one-shot demo and start treating it as a creative tool.

Try It

SongSmith is on GitHub (private repo). The full pipeline runs on standard hardware (though GPU significantly speeds up generation), and the modular design means you can swap out any component—use a different TTS model, add a drums-only generator, or fine-tune the beat templates. If you'd like access or want to discuss the project, connect with me on LinkedIn.

The next iteration: I'm thinking about adding LoRA fine-tuning for personalized voice styles, and exploring zero-shot music style transfer to let users remix their songs into different genres.

For now, SongSmith proves a beautiful principle: with the right orchestration, specialized AI models can collaborate to create something more creative than any single model.

And that's kind of magical. 🎵



Thanks for reading. Follow me for more.
