Large language models for multimodal user interaction in a virtual environment