The SDXL “family” of diffusion models is incredibly powerful for image generation, especially for creators who want flexibility and control. Within this family, Pony—more formally known as PDXL—is one of the most popular and widely used branches. Another major branch is Illustrious (often referred to as ILXL), which has carved out its own fanbase thanks to its versatility and unique stylistic outputs.
Each of these SDXL-based frameworks comes with a wide range of checkpoints, spanning everything from ultra-realistic photography to surreal, pastel-soaked, cartoon-style creations. If you have a specific visual style in mind, chances are good that there’s a checkpoint tailored to help bring it to life.
That said, no checkpoint—no matter how powerful—has been trained on every possible concept. So, if you’re trying to generate something oddly specific and it keeps missing the mark, it’s likely the model simply wasn’t trained on that particular idea. It’s not broken… it just doesn’t “get it.”
Here’s a quick breakdown of how the major SDXL branches differ:
SDXL (Base Model)
Great at producing stunning, hyper-realistic images—but can struggle with fine-grained prompt control and composition. You might not get exactly what you asked for, but the results are usually impressive if your prompt gives it enough direction.
PDXL ("Pony")
Originally built for pony-related content (seriously), Pony has since evolved into a powerhouse with strong scene control and structure handling. While it still allows for some creative variation, Pony tends to follow prompts more closely than base SDXL—especially if your input is clean, well-structured, and free of conflicting tags.
ILXL ("Illustrious")
Designed with a more illustrative and artistic bent, Illustrious sits somewhere between SDXL and Pony. It’s a bit more flexible in composition than SDXL and tends to be less picky about how your prompt is formatted compared to Pony.
All three frameworks—SDXL, PDXL, and ILXL—offer a wide variety of checkpoints with different specialties and strengths. Personally, I tend to stick with Pony. Why? Mostly because I’ve gotten used to it, and I like that when I give it a detailed, well-structured prompt, I usually get something pretty damn close to what I had in mind on the first try. If the result looks goofy, nine times out of ten, it’s because I screwed something up in the prompt—not the model.
This guide will focus on prompting for Pony-based checkpoints. While many of these guidelines will likely work with other SDXL branches, I lean on the Pony style out of habit, and it tends to translate well enough across the board when I'm moving between models like ILXL or base SDXL.
There isn’t a single template that works for every prompt, but Pony does require structure to consistently create clean, coherent results. As far as I can tell, most SDXL-based models follow a general order of importance when interpreting prompts. The trick is to understand what matters most to your vision and order your prompt accordingly.
This guide will focus primarily on character-centric creations (whether people or other distinct entities), but the concepts can be applied to other types of compositions—from landscapes to surreal scenes to whatever your caffeinated imagination cooks up.
Personally, I compare it to the five W's: who, what, when, where, and why — but take that loosely. Pony doesn’t really care about the "why"... until it suddenly does. (We’ll get to that later.)
I always place embeddings and score tags at the start of the prompt. These elements hold a lot of weight and can influence everything that comes after. You can put them at the end, and they'll still work, but their effect tends to weaken the further down they are—especially in long, complex prompts.
Who – Almost always the most important part of a composition.
Pony needs to know who we’re looking at.
Early on, I used to over-describe characters right at the top of my prompts, diving into clothing, actions, and physical features before I ever established the who. That turned out to be a mistake—at least for my goals. Pony works better when you first lay out a basic cast list.
Not like this:
“A woman and a grown man standing in a bank lobby with a child playing in the background.”
That's way too much, too fast. Instead, break it down into something elemental:
1woman, 1man, female focus
This tells Pony how many people are in the shot and who the main subject is. If you're used to danbooru tagging, you might say 1girl, 1boy, but I find Pony handles woman and man more reliably when I want to lean into mature physical traits or adult fashion. Those nuances matter.
At this stage, I also like to include camera angle and basic scene framing. Let’s say we want a wide shot where both characters are fully visible.
So we might add:
wide_shot, full body shot, entire figure visible
So far, the prompt would look like this:
1woman, 1man, female focus, wide_shot, full body shot, entire figure visible,
That’s a solid foundation. From here, I add a [BREAK] to tell Pony we’re shifting to a new concept or section of the image.
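If you assemble prompts in code rather than by hand, this section-by-section structure is easy to mechanize. Here's a toy sketch (the `build_prompt` helper is my own invention, not part of any SD toolkit), assuming your frontend treats [BREAK] as a section separator the way mine does:

```python
def build_prompt(*sections):
    """Join prompt sections with [BREAK] so each concept
    (cast, character details, setting, quality tags) stays distinct."""
    # Drop stray whitespace and trailing commas before joining.
    cleaned = [s.strip().strip(",").strip() for s in sections]
    return " [BREAK] ".join(cleaned)

cast = "1woman, 1man, female focus, wide_shot, full body shot, entire figure visible,"
man = "man, mature male, gray hair, navy blue suit, friendly smile,"
print(build_prompt(cast, man))
```

The same helper scales to however many sections your composition needs, and keeps each one cleanly delimited.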
With the addition of embeddings and score tags, our full prompt (so far) looks like this:
Stable_Yogis_PDXL_Positives2, score_9, score_8_up, score_7_up, 1woman, 1man, female focus, wide_shot, full body shot, entire figure visible
I like AutHuman Pony V4 from Civitai (https://openapi.smxblysq.workers.dev/models/477246) as my go-to. Its outputs have been reliable and pretty realistic, and it’s among the best I’ve tried for prompt adherence. It’s also flexible enough that it doesn’t make everything NSFW, while still having the capability if I really want to go that route. I plugged this prompt into my GUI, and this is the output I got.

Not quite what I’m going after, but we’ll get there.
What – Pony needs to know what we’re looking at.
Now that Pony has been instructed on the overall concept for the picture, we can move on to what will take up the majority of its set of instructions. This section builds on what we established earlier by adding descriptions of each character.
There is no single correct order, but I’ve found it helpful to list minor characters with brief descriptions first. Yes, they’re in the background, but they play a crucial role in setting the scene. Placing them earlier in the token chain helps Pony understand their relevance without overshadowing the main subject.
Let’s begin with the secondary character, the man:
[BREAK] man, mature male, old, gray hair, tan skin, standing, tall, professionally dressed, navy blue suit, black footwear, friendly smile, looking to the side, facing toward the side, bent forward,
We kept his description short to save token space for the main subject—the woman. That "female focus" tag is about to do some heavy lifting.
Now let’s give the woman a more detailed description:
[BREAK] woman, adult, mature female, 40 years old, looking at another, facing away, standing, casual posture, professional attire, knee-length pencil skirt, ivory colored silk blouse with long sleeves, ivory colored footwear, high heels, porcelain white skin, tall, slender, inverted triangle face, fine lines on face, long brown hair, elegant hairstyle, long tassel earrings, v-shaped eyebrows, detailed green eyes, high cheekbones, full lips, light smile, upper teeth only, elegant makeup, eyeshadow, winged eyeliner, ultra long eyelashes, pink lips, long pink fingernails,
By layering details like this, we reinforce what matters most—ensuring Pony captures it clearly.
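If you script your prompts, one way to keep that layering disciplined is to group tags by importance and flatten them in order. A minimal sketch (the grouping scheme is illustrative, not a required schema; Python dicts preserve insertion order, so the flattened tag list does too):

```python
# Order groups from most to least important: identity first,
# fine cosmetic detail last.
subject = {
    "identity": ["woman", "adult", "mature female", "40 years old"],
    "pose": ["looking at another", "facing away", "standing"],
    "clothing": ["knee-length pencil skirt", "silk blouse", "high heels"],
    "face": ["high cheekbones", "full lips", "light smile"],
}
description = ", ".join(tag for group in subject.values() for tag in group)
print(description)
```

Reordering a group then moves all of its tags at once, which makes it easy to experiment with what the model treats as most important.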
Negative Tags
I won’t go into exhaustive detail here because negative tags are pretty self-explanatory: anything you don’t want to see should be explicitly listed.
Here’s the set I used for this generation:
Stable_Yogis_PDXL_Negatives2-neg, score_6, score_5, easy negative, text, watermark, tattoo, arm tattoo, facial tattoo, bad hands, malformed hands, missing fingers, extra fingers, fused fingers, deformed face, deformed teeth, distorted features, bad anatomy, long body, ((no strange anatomy)), ((extra limbs)), no exaggerated expressions, (low quality:1.4), (worst quality:1.4), flat lighting, bland details, empty space, 3d, cgi, (greyscale, monochrome, no humans), (source_furry, source_western, source_pony), NSFW, completely nude
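Notice the (tag:1.4) entries in that list: that's the common attention-weight syntax, where values above 1.0 emphasize a tag and values below 1.0 soften it. A small sketch of generating those programmatically (the `weight` helper is hypothetical, my own convenience function):

```python
def weight(tag, w=1.0):
    """Wrap a tag in the (tag:weight) attention syntax:
    w > 1.0 emphasizes the tag, w < 1.0 de-emphasizes it."""
    return f"({tag}:{w})" if w != 1.0 else tag

negative = ", ".join([
    "text",
    "watermark",
    weight("low quality", 1.4),
    weight("worst quality", 1.4),
])
print(negative)  # text, watermark, (low quality:1.4), (worst quality:1.4)
```

This keeps the weights in one place, so nudging "low quality" from 1.4 to 1.2 is a one-character change instead of a hunt through a wall of tags.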
I plugged the information we have so far into my GUI, and got this in return:

When, Where, and Why – Pony needs to know where we are, when it’s happening, and (sometimes) why.
This is the final polish—the part that turns a technically sound image into something worthy of a gallery wall (or at least a spot in your showcase folder).
When: Not always necessary. Pony defaults to well-lit daytime scenes. But if you want a specific time of day or historical setting, you need to say so. In this case, let’s assume a modern, indoor environment:
[BREAK] interior, inside, nicely decorated bank lobby, city visible through a window on the far wall, small obscured crowd behind them, ornate crystal chandelier,
From here, let's add some quality tags that tell the model: make it good.
[BREAK] photorealistic, hyper-realistic, ultra detailed, exquisite details, cinematic shadows, depth of field, rim lighting, warm lighting, film grain, realistic lighting, high contrast, sharp focus, detailed skin, detailed eyes, 8k uhd, dslr photo, Fujifilm XT3, RAW photo
I plopped this into my GUI and got this in return:

Not bad, but it’s missing a bit of that final layer of detail. Let’s assume we like what we’re seeing and want to refine it further.
We could have used LoRAs earlier to help boost the initial output—and honestly, I usually do. But for the purposes of this guide, I held off to keep things beginner-friendly.
Satisfied with what we have? Then we can either move to img2img for refinement, or press the green circular arrow to reuse the seed and stay thematically consistent, which is what I did within the txt2img tab.
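Reusing the seed works because the seed fully determines the sampler's starting noise: same seed plus same settings means the same starting point, so small prompt tweaks nudge the image rather than replace it. A toy illustration of that determinism (plain `random` standing in for the real latent noise; not actual diffusion code):

```python
import random

def initial_noise(seed, n=4):
    """Toy stand-in for the sampler's starting latents:
    the same seed always reproduces the same values."""
    rng = random.Random(seed)
    return [round(rng.gauss(0, 1), 4) for _ in range(n)]

# Same seed -> identical starting noise, which is why reusing it
# keeps results thematically consistent; a new seed reshuffles everything.
assert initial_noise(1234) == initial_noise(1234)
assert initial_noise(1234) != initial_noise(5678)
```

That's the whole trick behind the green circular arrow: it pins the starting point so only your edits change the outcome.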
I love Jeda's Detailer LoRA (https://openapi.smxblysq.workers.dev/models/1238742) and use it regularly. It’s powerful, adjustable, and doesn’t require a trigger word. I set its power to 1 and let it run.
Here’s what it returned:

The Why
While this may or may not directly impact the way Stable Diffusion sees things, you could loosely interpret the "why" as a way to set an overall tone. For the purposes of AI image creation, the "why" is essentially an emotion, and Stable Diffusion doesn't truly understand emotion. Even so, well-placed poetic phrasing or mood-driven tags can absolutely influence tone, atmosphere, and subtle expression.
Let’s try something polished and evocative as a final touch, dropped in right at the end of the current positive prompt:
"the look of success captured mid-step, when ambition meets style and charisma whispers louder than words"
This brings our current positive prompt to:
Stable_Yogis_PDXL_Positives2, score_9, score_8_up, score_7_up, 1woman, 1man, wide_shot, full body shot, entire figure visible, [BREAK] man, mature male, old, gray hair, tan skin, standing, tall, professionally dressed, navy blue suit, black footwear, friendly smile, looking to the side, facing toward the side, bent forward, [BREAK] woman, adult, mature female, 40 years old, looking at another, facing away, standing, casual posture, professional attire, knee-length pencil skirt, ivory colored silk blouse with long sleeves, ivory colored footwear, high heels, porcelain white skin, tall, slender, inverted triangle face, fine lines on face, long brown hair, elegant hairstyle, long tassel earrings, v-shaped eyebrows, detailed green eyes, high cheekbones, full lips, light smile, upper teeth only, elegant makeup, eyeshadow, winged eyeliner, ultra long eyelashes, pink lips, long pink fingernails, [BREAK] interior, inside, nicely decorated bank lobby, city visible through a window on the far wall, small obscured crowd behind them, ornate crystal chandelier, [BREAK] photorealistic, hyper-realistic, ultra detailed, exquisite details, cinematic shadows, depth of field, rim lighting, warm lighting, film grain, realistic lighting, high contrast, sharp focus, detailed skin, detailed eyes, 8k uhd, dslr photo, Fujifilm XT3, RAW photo, the look of success captured mid-step, when ambition meets style and charisma whispers louder than words, <lora:DetailerIL:1>
And the final output is:

Not bad. The changes were small, but powerful. The detail was slightly enhanced, and it helped Stable Diffusion better understand the mood and the "moment" we were going for.
Keep in mind: there are thousands of different checkpoints and LoRAs you can leverage to customize whatever you’re trying to create. This guide is far from an exhaustive treatise on "How to Pony," but hopefully it sheds some light on how to get more consistent, coherent, and expressive outputs.
And if nothing else, maybe it helps you figure out why Pony just refuses to cooperate sometimes.
Good luck!