Teaching an AI Agent to Write Songs: Inside SongSmith
I recently finished building SongSmith, an AI agent that writes and produces complete songs from natural language descriptions. What started as a fun idea—"what if an AI could be a songwriter?"—turned into a deep dive into agent orchestration, autonomous decision-making, and multimodal AI pipelines.
Here's how we taught an AI agent to write songs.
The Core Idea: Inline Tool Calling
Most AI agent frameworks use function calling APIs where the LLM explicitly invokes tools. We went a different direction: inline JSON tool calling. The LLM responds with JSON mixed into its text output, and our agent parses and executes the tools.
Why? It's simpler, works with any LLM (even ones without function calling support), and honestly feels more organic—the agent "thinks" by outputting JSON, then we execute it.
Here's the pattern:
# SongAgent orchestrates the entire pipeline
class SongAgent:
    def generate(self, prompt: str) -> SongProject:
        # 1. Ask the LLM to plan and invoke tools
        llm_response = self.ollama_client.generate(
            model="llama3.1:70b",
            prompt=self._build_system_prompt(prompt)
        )

        # 2. Parse JSON tool calls from the LLM response
        # Looks for: {"tool": "lyrics_writer", "args": {...}}
        tool_calls = self._parse_json_tools(llm_response)

        # 3. Execute tools in sequence
        for call in tool_calls:
            if call["tool"] == "lyrics_writer":
                lyrics = self.lyrics_writer.write(call["args"]["style"], call["args"]["mood"])
            elif call["tool"] == "beat_generator":
                beat_audio = self.beat_gen.generate(call["args"]["style"])
            elif call["tool"] == "vocal_generator":
                vocals = self.vocal_gen.generate(lyrics, call["args"]["voice"])
The system prompt is crucial—it tells the LLM to be fully autonomous:
"You are SongAgent, an AI composer. Infer the mood, style, and voice from the user's description. Don't ask clarifying questions. Output your tool calls as JSON. Be creative and confident."
This gives the agent permission to make decisions without constantly asking the user.
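The `_parse_json_tools` helper isn't shown above. Here's a minimal sketch of how such a parser can work: scan the response for brace-delimited candidates and let `json.JSONDecoder.raw_decode` validate each one (the standalone function name and return shape here are illustrative assumptions, not the exact SongSmith implementation):

```python
import json

def parse_json_tools(llm_response: str) -> list[dict]:
    """Extract {"tool": ..., "args": {...}} objects embedded in free text."""
    decoder = json.JSONDecoder()
    calls = []
    idx = 0
    while (start := llm_response.find("{", idx)) != -1:
        try:
            # raw_decode parses one JSON value and returns where it ended
            obj, end = decoder.raw_decode(llm_response, start)
        except json.JSONDecodeError:
            idx = start + 1  # not valid JSON here; keep scanning
            continue
        if isinstance(obj, dict) and "tool" in obj:
            calls.append(obj)
        idx = end  # resume scanning after the parsed object
    return calls
```

Using `raw_decode` rather than a regex handles nested braces in `args` correctly, and stray `{` characters in the prose are simply skipped.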
The Architecture: Four Specialized Tools
SongSmith's pipeline looks like this (ASCII art incoming):
User Prompt
↓
┌─────────────────────────────────────────┐
│ SongAgent (Orchestrator) │
│ Parses prompt, invokes tools in order │
└─────────────────────────────────────────┘
↓
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Lyrics │→ │ Beat │→ │ Vocal │→ │ Audio │
│ Writer │ │ Generator │ │ Generator │ │ Mixer │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
Ollama MusicGen Bark TTS FFmpeg
(LLM) (Style Templates) (Chunking) (Blend+Balance)
↓
Final MP3 Song
Each tool wraps a separate, specialized model:
1. LyricsWriter — Creative Writing at Scale
Uses an Ollama LLM with temperature 0.8 for creativity. The temperature matters—0.8 is warm enough for poetic license but cool enough to stay coherent.
class LyricsWriter:
    def write(self, style: str, mood: str) -> str:
        prompt = f"""You are a professional songwriter. Write lyrics for a {style} song.
Mood: {mood}
Requirements:
- Include a verse, chorus, and bridge
- Make it singable (short lines, natural rhythm)
- Be original and emotional
"""
        lyrics = self.ollama.generate(
            model="llama3.1:70b",
            prompt=prompt,
            temperature=0.8  # Creative but coherent
        )
        return lyrics
2. BeatGenerator — Style Templates as Code
This was the clever insight: instead of asking MusicGen to generate a "pop beat," we map styles to detailed descriptions that MusicGen understands.
class BeatGenerator:
    STYLE_TEMPLATES = {
        "pop": "upbeat pop song, bright synths, 120 BPM, catchy hook",
        "hip-hop": "hip-hop beat, 90 BPM, hard-hitting drums, bass drop",
        "rock": "rock song, electric guitars, heavy drums, 100 BPM, energetic",
        "lullaby": "soft lullaby, gentle piano, string background, 60 BPM, calming",
        "r&b": "smooth R&B, laid-back vibe, groovy bass, 95 BPM, soulful",
        "electronic": "electronic dance, synthesizer, digital drums, 128 BPM, hypnotic",
        "acoustic": "acoustic guitar, minimal drums, organic, 110 BPM, intimate",
        "jazz": "jazz trio, upright bass, improvisation, 100 BPM, sophisticated"
    }

    def generate(self, style: str) -> Path:
        description = self.STYLE_TEMPLATES.get(style, "pop beat")
        beat = self.musicgen_model.generate(
            description=description,
            duration=30  # seconds
        )
        return beat.save_to_file()
The beauty here: we're not hard-coding audio. We're encoding knowledge about styles as language prompts. MusicGen does the heavy lifting.
3. VocalGenerator — Smart Chunking Strategy
Bark TTS is powerful but works best on short text. Feeding 100 lines of lyrics at once produces weird pacing and intonation. So we chunk intelligently:
class VocalGenerator:
    CHUNK_SIZE = 200  # characters

    def generate(self, lyrics: str, voice: str = "female_neutral") -> Path:
        chunks = self._smart_chunk(lyrics)
        vocal_segments = []
        for chunk in chunks:
            # Prefix with ♪ to hint "singing mode" to the TTS
            singing_prompt = f"♪ {chunk}"
            audio = self.bark_model.generate_speech(
                text=singing_prompt,
                voice_preset=voice
            )
            vocal_segments.append(audio)
        # Concatenate all segments
        return self._concatenate_audio(vocal_segments)

    def _smart_chunk(self, text: str) -> List[str]:
        """Split on sentence boundaries near CHUNK_SIZE."""
        chunks = []
        current = ""
        for sentence in text.split(". "):
            if len(current) + len(sentence) < self.CHUNK_SIZE:
                current += sentence + ". "
            else:
                if current:
                    chunks.append(current.strip())
                current = sentence + ". "
        if current:
            chunks.append(current.strip())
        return chunks
The ♪ prefix is a neat trick—it hints to the model that this is sung content, affecting prosody subtly.
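The `_concatenate_audio` helper isn't shown above. A minimal sketch, assuming each segment is a list of float samples at a shared sample rate (the real pipeline would operate on the waveform arrays Bark returns, and the 150 ms gap is an illustrative default):

```python
def concatenate_audio(segments: list[list[float]],
                      sample_rate: int = 24000,
                      gap_ms: int = 150) -> list[float]:
    """Join vocal chunks, inserting a short silence between them."""
    gap = [0.0] * (sample_rate * gap_ms // 1000)
    out: list[float] = []
    for i, seg in enumerate(segments):
        out.extend(seg)
        if i < len(segments) - 1:
            out.extend(gap)  # breathing room between phrases
    return out
```

The inter-chunk silence matters: butting two Bark outputs directly together can clip the tail of one phrase into the head of the next.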
4. AudioMixer — FFmpeg Magic
Combining beat and vocals requires careful balance. We use FFmpeg's filter graph:
class AudioMixer:
    def mix(self, beat_path: Path, vocals_path: Path) -> Path:
        """Blend beat and vocals with level balancing and a vocal delay."""
        output_path = beat_path.with_name("final_mix.mp3")
        # FFmpeg filter: loop beat, delay vocals, balance levels, mix
        filter_graph = (
            "[0:a]aloop=loop=-1:size=2880[beat]; "
            "[1:a]adelay=500|500[vocals]; "
            "[beat]volume=0.7[beat_volume]; "
            "[vocals]volume=1.2[vocals_volume]; "
            "[beat_volume][vocals_volume]amix=inputs=2:duration=longest[out]"
        )
        subprocess.run([
            "ffmpeg",
            "-i", str(beat_path),
            "-i", str(vocals_path),
            "-filter_complex", filter_graph,
            "-map", "[out]",
            "-q:a", "9",  # MP3 VBR quality (0 = best, 9 = smallest)
            str(output_path)
        ], check=True)
        return output_path
Key choices:
- Beat at volume 0.7 (about -3 dB) so vocals shine
- Vocals at volume 1.2 (about +1.6 dB) so they're clear
- 500ms delay on vocals for sync
- aloop filter to extend the beat to match the vocal length
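FFmpeg's volume filter takes a linear multiplier, and the dB figures above follow from the standard conversion 20·log10(gain). A quick sanity check:

```python
import math

def gain_to_db(gain: float) -> float:
    """Convert a linear volume multiplier to decibels."""
    return 20 * math.log10(gain)

print(round(gain_to_db(0.7), 1))  # beat: -3.1 dB
print(round(gain_to_db(1.2), 1))  # vocals: +1.6 dB
```

Note that dropping the beat by 3 dB while nudging vocals up 1.6 dB yields roughly 4.7 dB of separation, enough to keep lyrics intelligible without making the instrumental feel thin.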
Session Management: Persistence Matters
Songs are big projects. We persist state to JSON:
from dataclasses import dataclass, asdict

@dataclass
class SongProject:
    title: str
    prompt: str
    lyrics: str
    style: str
    voice: str
    beat_path: Path
    vocals_path: Path
    final_audio_path: Path
    created_at: datetime

class SessionManager:
    def save_project(self, project: SongProject) -> Path:
        data = asdict(project)
        # Convert Path objects to strings for JSON
        data = {k: str(v) if isinstance(v, Path) else v for k, v in data.items()}
        project_file = self.session_dir / f"{project.title}.json"
        with open(project_file, "w") as f:
            json.dump(data, f, indent=2, default=str)
        return project_file
This lets users resume editing, re-mix different beat/vocal combinations, or iterate on lyrics.
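The matching load path isn't shown above. A minimal sketch of a `load_project` that rehydrates the saved JSON, assuming the field names from the dataclass (a real version would reconstruct a SongProject instead of returning a dict):

```python
import json
from datetime import datetime
from pathlib import Path

# Fields that were serialized from Path objects (per the dataclass above)
PATH_FIELDS = ("beat_path", "vocals_path", "final_audio_path")

def load_project(project_file: Path) -> dict:
    """Rehydrate a saved project: strings back into Path and datetime objects."""
    with open(project_file) as f:
        data = json.load(f)
    for key in PATH_FIELDS:
        data[key] = Path(data[key])
    # json.dump used default=str, so created_at was saved via str(datetime),
    # which fromisoformat can parse back
    data["created_at"] = datetime.fromisoformat(data["created_at"])
    return data
```

The round-trip is the whole point: once paths and timestamps survive serialization, re-mixing an old project is just loading its JSON and re-running the mixer on the stored file paths.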
The User Interface: Streaming Matters
The Gradio UI is simple but effective. We stream the LLM output so users see the agent "thinking":
def generate_song(prompt: str, voice: str):
    agent = SongAgent(model="llama3.1:70b")
    # Stream LLM reasoning
    for token in agent.stream_generation(prompt):
        yield f"Agent: {token}"
    # Execute tools
    project = agent.generate(prompt)
    yield "✓ Lyrics written\n✓ Beat generated\n✓ Vocals created\n✓ Mixed!"
    yield f"🎵 Song ready: {project.final_audio_path}"
Streaming transforms a black box (wait for a minute, then get audio) into an engaging interaction (watch the AI reason in real-time).
Example: "Write me a birthday song for my daughter"
Let me walk through what happens end-to-end when a user types a simple prompt.
User input:
"Write a fun birthday song for my 5-year-old daughter Anaya. She loves dinosaurs."
Step 1 — The LLM plans autonomously
The agent doesn't ask "what style?" or "what mood?" — it infers from context:
Agent thinking: Birthday song for a 5-year-old who loves dinosaurs.
Style: pop/kids. Mood: happy, playful. Voice: female_expressive.
{"tool": "generate_all", "args": {
"prompt": "fun birthday song for 5-year-old Anaya who loves dinosaurs",
"style": "pop",
"mood": "happy",
"voice_type": "female_expressive"
}}
Step 2 — LyricsWriter generates lyrics
✓ Lyrics written:
[Verse 1]
Happy birthday little Anaya,
The coolest dino girl I know,
T-Rex stomp and Stego sway,
Let's make your birthday glow!
[Chorus]
Roar roar, it's your day,
Blow the candles, hip hooray!
Five years old and growing strong,
Sing along to Anaya's song!
[Verse 2]
Triceratops brought you a cake,
Velociraptor's running late,
But every dino near and far,
Knows today you are the star!
[Chorus] ...
Step 3 — BeatGenerator creates the instrumental
The style "pop" maps to the template: "upbeat pop song, bright synths, 120 BPM, catchy hook"
⏳ Generating beat... (MusicGen, ~30 seconds)
✓ Beat generated: pop_happy_30s.wav
Step 4 — VocalGenerator sings the lyrics
Bark TTS receives the lyrics in chunks, each prefixed with ♪ for singing mode:
⏳ Generating vocals... chunk 1/3
⏳ Generating vocals... chunk 2/3
⏳ Generating vocals... chunk 3/3
✓ Vocals generated: anaya_birthday_vocals.wav
Step 5 — AudioMixer blends everything
FFmpeg combines the beat (volume 0.7, about -3 dB) with the vocals (volume 1.2, about +1.6 dB) and loops the beat to match the vocal length:
✓ Song mixed: anaya_birthday_final.mp3
🎵 Duration: 42 seconds
▶ [Play] [Download]
From prompt to finished song: about 2 minutes on GPU hardware. The user watched the agent reason, saw lyrics appear, and then heard their daughter's personalized birthday song.
What I Learned
1. LLM autonomy requires clear intent. The system prompt determines whether the agent asks endless clarifying questions or confidently makes decisions. "Don't ask, decide" works.
2. Tool descriptions matter more than tool calls. We could use function calling APIs, but encoding style knowledge as natural language (BeatGenerator's templates) was more flexible and easier to debug.
3. Chunking is an art. Naive chunking (split every N chars) breaks prosody. Smart chunking (split on sentence boundaries) makes TTS output more natural.
4. Audio levels are critical. A beat that's too loud drowns out lyrics. A beat that's too quiet feels thin. Volume multipliers of 0.7 for the beat and 1.2 for the vocals (roughly -3 dB and +1.6 dB) are magic numbers we found empirically.
5. Persistence unlocks iteration. Once users can save and resume projects, they stop treating AI song generation as a one-shot demo and start treating it as a creative tool.
Try It
SongSmith is on GitHub (private repo). The full pipeline runs on standard hardware (though GPU significantly speeds up generation), and the modular design means you can swap out any component—use a different TTS model, add a drums-only generator, or fine-tune the beat templates. If you'd like access or want to discuss the project, connect with me on LinkedIn.
The next iteration: I'm thinking about adding LoRA fine-tuning for personalized voice styles, and exploring zero-shot music style transfer to let users remix their songs into different genres.
For now, SongSmith proves a beautiful principle: with the right orchestration, specialized AI models can collaborate to create something more creative than any single model.
And that's kind of magical. 🎵
Thanks for reading. Follow me for more.