As explained in the paper, our video generation method doesn't simply copy spatio-temporal chunks, but generates never-before-seen frames that follow the main motions of the input video.
This can be seen in the following video depiction of Fig. 9.
Each row contains four videos. From left to right -
VGPNN [1] Generated video - A video generated by VGPNN. The generation copies large spatio-temporal chunks as-is from the input video.
VGPNN NNF color map - The NNF map corresponding to the VGPNN video. Each large uniformly-colored region marks a chunk that was copied as-is from the original video.
SinFusion Generated video - A video generated by SinFusion (ours). Our video is more diverse and doesn't simply copy chunks from the input video.
SinFusion NNF color map - The NNF map corresponding to our generated video. The varied colors represent diverse offsets to nearest-neighbour patches, indicating that our method doesn't copy large existing chunks from the single input video (see the sketch below for how such a map can be computed).
[Videos: VGPNN Generated Video, VGPNN NNF Map | SinFusion Generated Video, SinFusion NNF Map]
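For concreteness, here is a minimal sketch of how such an NNF color map can be computed. The function name `nnf_color_map`, the patch size and stride, the brute-force nearest-neighbour search, and the color encoding are all illustrative assumptions rather than the exact procedure used for the figures; the idea is simply to find, for each patch of the generated video, its nearest patch in the input video, and to color it by the offset to that patch.

```python
# Minimal sketch of an NNF color map, assuming both videos are numpy arrays
# of shape (T, H, W, 3) in [0, 1]. Patch size, stride, brute-force search and
# color encoding are illustrative choices, not the settings used in the paper.
import numpy as np

def nnf_color_map(generated, source, patch=(3, 5, 5), stride=5):
    pt, ph, pw = patch
    T, H, W, _ = generated.shape
    Ts, Hs, Ws, _ = source.shape

    # Gather all source patches together with their top-left-front coordinates.
    src_coords, src_patches = [], []
    for t in range(0, Ts - pt + 1, stride):
        for y in range(0, Hs - ph + 1, stride):
            for x in range(0, Ws - pw + 1, stride):
                src_coords.append((t, y, x))
                src_patches.append(source[t:t+pt, y:y+ph, x:x+pw].ravel())
    src_coords = np.array(src_coords)
    src_patches = np.stack(src_patches)

    # For every patch of the generated video, find its nearest source patch
    # and store the spatial offset (dy, dx) pointing to it.
    offsets = np.zeros((T, H, W, 2))
    for t in range(0, T - pt + 1, stride):
        for y in range(0, H - ph + 1, stride):
            for x in range(0, W - pw + 1, stride):
                q = generated[t:t+pt, y:y+ph, x:x+pw].ravel()
                nn = np.argmin(((src_patches - q) ** 2).sum(axis=1))
                _t, ys, xs = src_coords[nn]
                offsets[t:t+pt, y:y+ph, x:x+pw] = (ys - y, xs - x)

    # Encode offsets as colors: patches copied from one large chunk share the
    # same offset, hence the same color; diverse offsets give a varied map.
    dy, dx = offsets[..., 0], offsets[..., 1]
    def norm(a):
        return (a - a.min()) / (a.max() - a.min() + 1e-8)
    mag = np.sqrt(dy ** 2 + dx ** 2)
    return np.stack([norm(dx), norm(dy), norm(mag)], axis=-1)  # (T, H, W, 3)
```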
Here we compare several videos generated by our model with and without the Projector model.
This demonstrates the importance of the Projector model in removing the small artifacts produced by our auto-regressive Predictor model (see the sketch below the videos).
Notice how the videos that were generated without the Projector (right column) slowly accumulate visual artifacts and degrade to poor quality.
[Video comparison: Predictor & Projector (left) | No Projector, Only Predictor (right)]
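The following is a minimal sketch of the generation loop being compared, assuming hypothetical `predictor` and `projector` callables that each take and return a single frame tensor; it is not the official SinFusion code. It illustrates why dropping the Projector lets small errors from the auto-regressive Predictor accumulate over frames.

```python
# Minimal sketch with assumed interfaces (not the official implementation):
# autoregressive video generation with an optional Projector cleanup step.
import torch

@torch.no_grad()
def generate_video(predictor, projector, first_frame, num_frames, use_projector=True):
    frames = [first_frame]
    for _ in range(num_frames - 1):
        # Predictor: generate the next frame conditioned on the previous frame.
        next_frame = predictor(frames[-1])
        # Projector: clean up the predicted frame, removing the small
        # artifacts that would otherwise feed back into the next prediction.
        if use_projector:
            next_frame = projector(next_frame)
        frames.append(next_frame)
    return torch.stack(frames)  # (num_frames, C, H, W)
```

With `use_projector=False`, each predicted frame (including its artifacts) becomes the conditioning input for the next step, which is the error-accumulation effect visible in the right column.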
As described in the paper, we show a basic qualitative comparison between our single-video DDPM and a VDM [2] trained on a single video.
Top: generated videos using our method.
Bottom: generated videos using VDM [2].
For further explanation, please see the discussion in the supplementary material details file.