DSPy and the Rise of Programmatic Prompting

Max P

For a long time, prompt engineering meant writing a long string, tweaking it by hand, and hoping it generalized. That worked while we had one model and one task. It does not work when you are wiring three or four LLM calls together into a pipeline that has to survive model swaps, dataset drift, and a new evaluation every quarter.

DSPy takes a different stance. Treat prompts as programs. Compile them. Optimize them. Re-run the optimizer when something changes. This post is about what programmatic prompting actually is, why it has caught on, and how it sits next to structured-output libraries like Outlines and Guidance.

If you are coming at this from the AI-agent angle, the "How to Build with AI Agents" post is a good companion piece on the broader ecosystem.

What DSPy actually does

DSPy is, per its own README, "the framework for programming, rather than prompting, language models." It is MIT-licensed and originated at Stanford NLP. The pitch is that you should write Python code that describes what you want from the model, and DSPy compiles that into prompts you can run.

Three concepts do most of the work:

  1. Signatures. A signature is a typed description of an input-output mapping, like "question -> answer" or "context, question -> answer." It can be written as a short string in that arrow form or as a Python class with typed fields. The signature is the contract; you do not write the prompt.
  2. Modules. A module is a small, composable program that uses one or more signatures. There is a baseline Predict module, and richer ones like chain-of-thought, ReAct, and program-of-thought. You build pipelines by composing modules.
  3. Optimizers. Once you have a pipeline and a metric, you call an optimizer. The optimizer searches for better prompts and, in some configurations, better few-shot examples or even weight updates. The README emphasizes that DSPy ships "algorithms for optimizing their prompts and weights," whether the system is "simple or sophisticated."
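To make the "signature is the contract" idea concrete, here is a toy sketch in plain Python. It is not DSPy's implementation or API: ToySignature, parse_signature, and render_prompt are hypothetical names I made up, and the prompt layout is an assumption. It only shows how a declared input-output mapping can be turned into a prompt string by code rather than by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySignature:
    """Hypothetical stand-in for a signature: instructions plus named fields."""
    instructions: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def parse_signature(spec: str, instructions: str = "") -> ToySignature:
    """Parse a spec like 'context, question -> answer' into named fields."""
    left, right = spec.split("->")
    inputs = tuple(name.strip() for name in left.split(","))
    outputs = tuple(name.strip() for name in right.split(","))
    return ToySignature(instructions, inputs, outputs)


def render_prompt(sig: ToySignature, **values: str) -> str:
    """Turn the contract plus concrete inputs into a prompt string."""
    missing = [name for name in sig.inputs if name not in values]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    lines = [sig.instructions] if sig.instructions else []
    lines += [f"{name.capitalize()}: {values[name]}" for name in sig.inputs]
    # Leave the first output field open for the model to complete.
    lines.append(f"{sig.outputs[0].capitalize()}:")
    return "\n".join(lines)


sig = parse_signature("context, question -> answer", "Answer concisely.")
prompt = render_prompt(
    sig, context="Apache 2.0 is permissive.", question="Is it copyleft?"
)
print(prompt)
```

The caller only ever touches the field names. How the string is laid out is an implementation detail that a module or an optimizer is free to change.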

The shift in mindset is real. You stop writing prompt strings and start writing typed components, and you let an optimizer figure out the strings.

A minimal example

The flavor of DSPy code is small and unsurprising once you have seen it.

import dspy

# Assumes an OpenAI API key is available in the environment.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# The signature declares the inputs and outputs; the docstring is the task description.
class AnswerQuestion(dspy.Signature):
    """Answer a question concisely using the given context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# The module picks the prompting strategy used to satisfy the signature.
qa = dspy.ChainOfThought(AnswerQuestion)
prediction = qa(context="Apache 2.0 is a permissive license.", question="Is Apache 2.0 copyleft?")
print(prediction.answer)

The interesting part is what is missing. There is no prompt template, no "you are a helpful assistant" preamble, no role plumbing. The signature describes the shape of the call. The module decides how to run it. If you swap ChainOfThought for Predict, the prompt strategy changes underneath you, and your pipeline code does not.
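The swap described above can be illustrated with a toy in plain Python. Nothing here is DSPy code: ToyPredict, ToyChainOfThought, and stub_lm are hypothetical, and the stub "model" just returns canned text. The point is only that the caller depends on the answer field, so the strategy can change underneath it.

```python
from typing import Callable

LM = Callable[[str], str]  # in this toy, a model is just "prompt in, text out"


def stub_lm(prompt: str) -> str:
    """Canned model: replies with reasoning only when the prompt asks for it."""
    if "Reasoning:" in prompt:
        return "Permissive licenses impose no share-alike duty.\nAnswer: No"
    return "No"


class ToyPredict:
    """Asks for the answer directly."""

    def __init__(self, lm: LM) -> None:
        self.lm = lm

    def __call__(self, question: str) -> dict[str, str]:
        text = self.lm(f"Question: {question}\nAnswer:")
        return {"answer": text.strip()}


class ToyChainOfThought(ToyPredict):
    """Same inputs and outputs, but prompts for reasoning before the answer."""

    def __call__(self, question: str) -> dict[str, str]:
        text = self.lm(f"Question: {question}\nReasoning:")
        reasoning, _, answer = text.partition("\nAnswer:")
        return {"reasoning": reasoning.strip(), "answer": answer.strip()}


def pipeline(module: ToyPredict, question: str) -> str:
    """Caller code depends only on the 'answer' field, not on the strategy."""
    return module(question)["answer"]


for module in (ToyPredict(stub_lm), ToyChainOfThought(stub_lm)):
    print(type(module).__name__, "->", pipeline(module, "Is Apache 2.0 copyleft?"))
```

Both strategies send different prompts, and pipeline() never notices.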

Why programmatic prompting got popular now

Three forces converged.

First, models got reliable enough that the hand-tuned prompt is no longer the bottleneck. When models would invent JSON keys at random, you needed bespoke prompts to nudge them. With current frontier and strong open models, the marginal value of a hand-tuned prompt has dropped. The marginal value of a well-structured pipeline has gone up.

Second, evaluation has become non-negotiable. If you are shipping an LLM feature, you are running an offline metric. DSPy's optimizers consume that metric directly. Once you have a metric, calling an optimizer is cheaper than another round of manual prompt edits.
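For a sense of what "a metric" means in practice, here is a minimal sketch. As I understand the DSPy convention, a metric is a plain Python function that takes an example, a prediction, and an optional trace and returns a score; the sketch follows that shape but uses only the standard library. The dev set, the normalize helper, and the stub program are all hypothetical.

```python
from types import SimpleNamespace


def normalize(text: str) -> str:
    """Drop surrounding whitespace and trailing periods, then lowercase."""
    return text.strip().rstrip(".").lower()


def exact_match(example, prediction, trace=None) -> bool:
    """Metric: does the predicted answer match the gold answer after normalizing?"""
    return normalize(example.answer) == normalize(prediction.answer)


def evaluate(program, devset, metric) -> float:
    """Average a metric over a dev set; `program` maps an example to a prediction."""
    scores = [metric(ex, program(ex)) for ex in devset]
    return sum(scores) / len(scores)


# Hypothetical dev set and a stub program standing in for a real pipeline.
devset = [
    SimpleNamespace(question="Is Apache 2.0 copyleft?", answer="No"),
    SimpleNamespace(question="Is GPLv3 copyleft?", answer="Yes"),
]
always_no = lambda ex: SimpleNamespace(answer="no.")

print(evaluate(always_no, devset, exact_match))  # 0.5
```

Once a function like this exists, both a human and an optimizer can compare two versions of a pipeline with a single number.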

Third, model swaps happen constantly. New checkpoints land. Costs change. Vendors deprecate models. A pipeline written in DSPy survives a model swap because the contract is the signature, not the prompt string. Recompile, run the metric, ship.

For teams that are already taking a programming-language view of LLM systems, this fits naturally. For teams that mostly want to call one model with one prompt, DSPy is overkill.

Where DSPy fits next to Outlines and Guidance

This is the question I get most often. DSPy is a programming framework. Outlines and Guidance are structured-output libraries. They overlap, but the center of gravity is different.

Outlines is Apache-2.0 licensed and focused on producing structured outputs from LLMs. Per its README, it supports JSON and Pydantic models, regular expressions, context-free grammars, and multiple-choice constraints. It works across providers including OpenAI, vLLM, Ollama, and transformers. The job is "ensure the model produces output that matches this schema."

Guidance is MIT-licensed and described in its repo as "an efficient programming paradigm for steering language models." It also offers regex constraints, selection from predefined choices, context-free grammars, and JSON-schema validation via Pydantic, plus token fast-forwarding. It supports transformers, llama.cpp, OpenAI, and other providers.

Both are about controlling the next token to fit a structure. DSPy is about composing and optimizing whole pipelines. They are complementary. A common pattern:

  • DSPy declares the modules and signatures of a pipeline.
  • Inside one of those modules, you call an LLM with structured-output constraints from Outlines or Guidance to force the response into a JSON schema.

That gives you both layers: a pipeline you can recompile when the model changes, and a guarantee that each leaf call returns valid structured data.
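"Controlling the next token to fit a structure" is easier to picture with a toy. The sketch below is not how Outlines or Guidance are implemented, and it uses none of their APIs. It decodes one character at a time instead of one token, the scoring function is a stub, and it assumes no allowed choice is a prefix of another. It does show the core move behind a multiple-choice constraint: at each step, only continuations that can still lead to a valid output are considered.

```python
from typing import Callable


def allowed_next_chars(prefix: str, choices: list[str]) -> set[str]:
    """Characters that keep `prefix` a valid prefix of at least one choice."""
    return {
        choice[len(prefix)]
        for choice in choices
        if choice.startswith(prefix) and len(choice) > len(prefix)
    }


def constrained_decode(score: Callable[[str, str], float], choices: list[str]) -> str:
    """Greedy decoding where only allowed characters may be picked.

    Assumes no choice is a prefix of another, so decoding stops as soon as
    the output equals one of the choices.
    """
    out = ""
    while out not in choices:
        candidates = sorted(allowed_next_chars(out, choices))
        out += max(candidates, key=lambda ch: score(out, ch))
    return out


def stub_score(prefix: str, ch: str) -> float:
    """Stub model that would say 'nope' if nothing constrained it."""
    return 1.0 if "nope".startswith(prefix + ch) else 0.0


print(constrained_decode(stub_score, ["yes", "no"]))  # no
```

The stub model "wants" to produce nope, which is not a valid output, and the constraint forces it to land on no instead. Real libraries apply the same idea to token probabilities and to richer structures such as regexes, grammars, and JSON schemas.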

When not to use DSPy

DSPy is not the right tool for every problem.

  • If your system is a single LLM call with a stable prompt and you do not have a metric, you do not need DSPy. Write the call.
  • If you only need structured output from one call, Outlines or Guidance alone is lighter weight.
  • If you are deeply tied to one provider's tool-calling API and you do not want to abstract over it, the indirection cost of DSPy may not pay off.
  • If you are very early in a product and you do not yet know what "good" means, build the v0 in plain code, then port to DSPy once you can write a metric.

The rule of thumb I use: if the pipeline has more than two LLM calls and you have a numeric quality target, DSPy starts to earn its keep.

What to read in the docs first

If you decide to try DSPy, the things that paid off most for me were:

  1. The "Building a system with DSPy" tutorial in the official docs. It walks through signatures, modules, and metrics in one pass.
  2. The optimizer docs, especially BootstrapFewShot and the more recent prompt-search optimizers. Pick one, run it, compare to your hand-tuned baseline.
  3. The integration guides for whatever LLM provider you actually use. DSPy's adapter layer changes how you configure a model; the patterns are simple but they vary by provider.
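If the optimizer docs feel abstract, it may help to see the rough idea behind bootstrapping few-shot examples in a few lines. This is a simplified sketch of the concept as I understand it, not the BootstrapFewShot implementation: run a teacher on training inputs, keep only the outputs that pass the metric, and reuse them as demonstrations. The Example type, the stub teacher, and the prompt layout are all hypothetical.

```python
from typing import Callable, NamedTuple


class Example(NamedTuple):
    question: str
    answer: str


def bootstrap_demos(
    teacher: Callable[[str], str],
    trainset: list[Example],
    metric: Callable[[Example, str], bool],
    max_demos: int = 2,
) -> list[Example]:
    """Run the teacher on training questions; keep outputs that pass the metric."""
    demos: list[Example] = []
    for ex in trainset:
        if len(demos) >= max_demos:
            break
        predicted = teacher(ex.question)
        if metric(ex, predicted):
            demos.append(Example(ex.question, predicted))
    return demos


def build_prompt(demos: list[Example], question: str) -> str:
    """Prepend the kept demonstrations to a new question as few-shot examples."""
    shots = [f"Q: {d.question}\nA: {d.answer}" for d in demos]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])


# Hypothetical training data and a stub teacher that gets one item wrong.
trainset = [Example("2+2", "4"), Example("3+3", "6"), Example("5+5", "10")]
teacher = lambda q: {"2+2": "4", "3+3": "7", "5+5": "10"}[q]
exact = lambda ex, pred: ex.answer == pred

demos = bootstrap_demos(teacher, trainset, exact)
print(build_prompt(demos, "4+4"))
```

The wrong teacher output for 3+3 never makes it into the prompt, which is the whole point of filtering demonstrations through the metric.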

Beyond that, DSPy's repo is active and the issue tracker is a useful read for what production users hit.

The big picture

Programmatic prompting is not a fad. It is the same shift that moved web development from hand-written HTML to templates to component frameworks. We now have enough surface area in LLM systems that "write the prompt by hand" stops scaling, and "write the program and let an optimizer find the prompt" starts to make sense.

DSPy is the most polished expression of that idea. Outlines and Guidance solve a complementary problem, locking individual outputs to a schema. Together they cover most of what production LLM systems actually need.

Tools mentioned in this post

  • DSPy: MIT-licensed Stanford NLP framework for programmatic, optimizable LLM pipelines built around signatures and modules.
  • Outlines: Apache-2.0 Python library for structured outputs with JSON, regex, and grammar-based constraints across multiple model providers.
  • Guidance: MIT-licensed control framework for steering LLM output with regex, choice, grammar, and JSON-schema constraints.
