Sparse Video Generation Model for Embodied Navigation conditioned on loose language guidance, 100% real world verification