Temporal learning for video-language understanding and generation