Introduction
VACE is currently one of the most powerful tools available for controlling AI video generations, and Phantom is one of the best image-reference-to-video models for WAN. For the last few months I have been working on combining the two. I could have released this earlier, since even my early approaches worked at least some of the time, but I wanted to find an approach that works most of the time.
I have split this article into two sections: a user guide for those of you who just want to use this tool, and a more detailed report on my findings for those who want to try mixing WAN models.
Model Here: https://openapi.smxblysq.workers.dev/models/1849007?modelVersionId=2092479 (Please use the Low Steps Model with this workflow)
Workflow is on the download section of this article (right hand side of this screen under attachments)
User Guide
There are several steps to using Phantom. What follows is not the be-all and end-all, just an approach that seems to work.
Find your images - You can use anything, but the ideal image is one that looks roughly the way it will appear in the video. Phantom likes to copy and paste exactly what is in the images, so if you want a closeup shot, cropping a full-body image to the torso will give you more consistent results. More realistic images tend to do better, as do images generated with WAN.
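If you want to script that crop step, here is a minimal sketch using Pillow; the function name and file names are just illustrative (you can of course do the same crop in any image editor).

```python
# Hypothetical helper: center-crop a reference image to the control video's
# aspect ratio before loading it into ComfyUI, so Phantom sees roughly the
# framing it will be asked to reproduce.
from PIL import Image

def crop_to_aspect(image_path: str, target_w: int, target_h: int) -> Image.Image:
    img = Image.open(image_path)
    target_ratio = target_w / target_h
    w, h = img.size
    if w / h > target_ratio:
        # Image is too wide: trim the sides equally.
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:
        # Image is too tall: keep the upper portion (useful for a torso/closeup crop).
        new_h = int(w / target_ratio)
        img = img.crop((0, 0, w, new_h))
    return img

# Example: prepare a crop for an 832x480 control video (file names are made up).
# crop_to_aspect("character_full_body.png", 832, 480).save("character_cropped.png")
```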
Find the prompt for your image - You trigger your images by prompting for them; basically, you need to caption what is in them. This can be a difficult step to do by hand. Fortunately, the Phantom team has noted that they used Gemini to help with captioning. It is available for free at https://aistudio.google.com, and I have had mostly good luck asking it for a one-sentence description of the image. I am sure any LLM will work fine, though.
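For reference, here is a rough sketch of that captioning step using the google-generativeai Python SDK; the model name and exact SDK calls are my assumptions (the SDK changes often), and there is no need to script this if you would rather paste images into AI Studio by hand.

```python
# Sketch of captioning a reference image with Gemini. Model name and SDK
# details are assumptions; any vision-capable LLM works just as well.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # key from https://aistudio.google.com
model = genai.GenerativeModel("gemini-1.5-flash")

def caption_image(path: str) -> str:
    img = Image.open(path)
    prompt = "Give a one sentence description of the subject in this image."
    response = model.generate_content([prompt, img])
    return response.text.strip()

# print(caption_image("character_cropped.png"))  # hypothetical file name
```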
Add extra info to the prompts - You will want to prompt where the characters are, and a description of the motion will also help with consistency. Anything not in the images needs to be prompted as well, of course.
It should be as simple as that - at least most of the time. The more you push the boundaries of what the model can do, the more tweaking you might need to do.
Here are some nodes you should be aware of:

I use the Load Video node to control the length of the animation and the dimensions, so you will need to change these to suit your needs.
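If it helps, a tiny helper like the sketch below can snap whatever length you want to a WAN-friendly frame count; the 4k + 1 rule is my understanding of how WAN chunks frames, so verify it against your own workflow.

```python
# WAN-family models work in latent chunks of 4 frames, so frame counts of the
# form 4k + 1 (17, 49, 81, ...) are the safe choices (assumption - check your setup).
def nearest_wan_frame_count(frames: int) -> int:
    k = max(1, round((frames - 1) / 4))
    return 4 * k + 1

# print(nearest_wan_frame_count(80))  # -> 81
```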

Remember Phantom takes at most 4 images to embed.
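In other words, if you are scripting the image list, just trim it to four entries (a trivial sketch, with made-up file names):

```python
# Phantom embeds at most four reference images; anything past the fourth is wasted.
reference_images = ["char_a.png", "char_b.png", "object.png", "background.png", "extra.png"]
phantom_refs = reference_images[:4]  # only the first four are used
```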

You will see I have increased VACE strength to 1.5 - this is because otherwise the openpose control gets ignored. The higher it is, the more drift you will likely get from your characters, so you can tune it down, especially if you are using other controls like depth, which are more consistent.

If for some reason you want to add a reference frame to VACE you will have to subtract 4 from each of the simple math nodes.

You can likely get away with as low as 6 steps, but Phantom needs CFG > 1, and more steps usually gives more accurate character consistency. You can try other samplers or noise schedules, but I can't say I found any that made a consistent difference.
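As a reference point only, here is the kind of starting configuration I would try, written out as a plain Python dict; the only hard requirement above is that CFG must be above 1, so the exact numbers, sampler, and scheduler below are assumptions to tune from.

```python
# Illustrative starting point (names mirror typical KSampler inputs; not a recommendation).
sampler_settings = {
    "steps": 6,           # as low as ~6 works; more steps -> better character consistency
    "cfg": 3.0,           # must be > 1 for Phantom; the exact value here is an assumption
    "sampler_name": "euler",   # assumption - no sampler made a consistent difference
    "scheduler": "simple",     # assumption - same for noise schedules
}
```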
Sometimes the results are a bit blurrier than base WAN - this is a result of the merge, and if I find a solution I will update the model.
Troubleshooting
If you are not getting what you want, it usually comes down to one of two things: either the characters are not showing up or the motion is not being followed.
If your character is not showing up:
Have you cropped the image of the character to match the control video?
Have you tried having an LLM caption the image? Have you tried weighting the prompt terms for the things that are coming out wrong?
Have you prompted where the character is in the video?
Try a different seed - some are much better than others.
If worst comes to worst, set VACE strength to 0 and see if your character shows up - this turns off any motion control, so you should get a good baseline for whether the character reference itself is working.
If your motion is not being followed:
Have you prompted where the characters are and the motion that is shown?
Longer videos result in better and stronger VACE control - if you can make your video 81 frames, you will get much better control following.
Try a different seed - some are much better than others.
Consider trying a different controlnet type like depth, which will be much stronger and more consistent.
Technical Report
I am not alone in trying to merge VACE with other models. VACE is well designed in that it adds layers on top of WAN rather than being a fine-tune, which works in our favor here; otherwise we could not mix these models together.
The main issue in this case is the conditioning: Phantom appends a latent for each image to the end of the conditioning. You therefore need to add n + 4 frames to the VACE conditioning (or subtract the same from the Phantom side) to have them be compatible. The type of conditioning here really matters for some reason - grey and white give much worse results than black. There might be a better conditioning method for this, but black seems to work best.
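Here is that bookkeeping written out as a minimal sketch in latent-frame terms; the variable names are mine and not actual node or API names, and the exact accounting in the workflow's simple math nodes may differ from this.

```python
# Illustrative conditioning-length bookkeeping only.
video_frames = 81
phantom_ref_images = 4   # Phantom appends one latent to the conditioning per image

# WAN's VAE compresses time 4x, so an 81-frame video is (81 - 1) / 4 + 1 = 21 latents.
base_latents = (video_frames - 1) // 4 + 1              # 21
phantom_latents = base_latents + phantom_ref_images     # 25: video + one latent per image
vace_latents = phantom_latents                          # pad VACE to match (or trim Phantom instead)
```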
As for merging the models themselves, a simple merge adding only the VACE blocks is what I used for the majority of my testing - but you can blend small amounts into various Phantom layers for a slight improvement. Just be careful.
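For those who want to reproduce the simple merge, the sketch below copies only the VACE-specific tensors into a Phantom checkpoint using safetensors; the file names and the "vace" key prefix are assumptions, so inspect your checkpoints' actual key names (and expect heavy RAM use for 14B models) before relying on it.

```python
# Rough sketch of a "simple merge": Phantom weights plus the extra layers VACE adds to WAN.
from safetensors.torch import load_file, save_file

phantom = load_file("phantom_wan_14b.safetensors")      # hypothetical file name
vace = load_file("wan_2.1_vace_14b.safetensors")        # hypothetical file name

merged = dict(phantom)
for key, tensor in vace.items():
    if "vace" in key:        # assumption: VACE's added blocks are identifiable by key name
        merged[key] = tensor

save_file(merged, "phantom_vace_merged.safetensors")
```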
If you are an advanced user, I suggest you get the regular V2 version and add CausVid V2 at 1.0 strength, as that is equivalent to the low-step model.
I'll try to update this further as things come to mind.
In Closing
I hope you enjoyed this tutorial. If you did, please consider subscribing to my YouTube/Instagram/TikTok/X (https://linktr.ee/Inner_Reflections ).
Thanks to the WAN/VACE and Phantom teams for open-sourcing such powerful models.
Thanks to all those on the Banadoco server always exploring new things. Especially AbleJones.