GLM-130B (ICLR'23): an open bilingual (English & Chinese) pre-trained model with 130 billion parameters based on GLM (ACL'22); outperforms GPT-3 175B on LAMBADA and MMLU.
ChatGLM-6B & ChatGLM2-6B & ChatGLM3-6B & GLM-4: a family of open bilingual dialogue language models with over 14,000,000 global downloads.
WebGLM (KDD'23): an efficient web-enhanced question answering system based on GLM-10B, outperforming WebGPT-13B and approaching WebGPT-175B performance in human evaluation.
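For concreteness, the WebGLM pipeline reduces to retrieve, filter, and generate with citations. Below is a minimal runnable sketch with stub components; all helper names are illustrative, not the released API.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    text: str

def search_web(query: str) -> list[Page]:
    """Stub for the search-engine call; swap in a real search API."""
    return [Page("https://example.com", "An example passage about the query.")]

def select_references(query: str, pages: list[Page], k: int = 3) -> list[Page]:
    """Stub reference filter; WebGLM trains a small retriever for this step."""
    return pages[:k]

def answer(query: str, refs: list[Page]) -> str:
    """Stub generator; WebGLM conditions GLM-10B on the cited passages."""
    context = "\n".join(f"[{i + 1}] {p.text}" for i, p in enumerate(refs))
    return f"Answer to {query!r}, grounded in:\n{context}"

query = "Why is the sky blue?"
print(answer(query, select_references(query, search_web(query))))
```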
Foundational Agents for Challenging Real-world Missions
AgentBench (ICLR'24): the first systematic, multi-dimensional benchmark for evaluating LLMs as agents across 8 distinct environments derived from real-world practical missions.
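At the core of AgentBench's LLM-as-Agent evaluation in each environment is a plain interaction loop; here is a schematic sketch with illustrative names (not the benchmark's actual API):

```python
def run_episode(env, llm, max_turns: int = 30) -> bool:
    """Run one task episode and report success."""
    observation = env.reset()          # task instruction + initial state
    history = []
    for _ in range(max_turns):
        # The LLM sees the instruction and interaction history and must
        # emit its next action as text (an OS command, SQL query, etc.).
        action = llm.act(observation, history)
        observation, done, success = env.step(action)
        history.append((action, observation))
        if done:
            return success
    return False                       # truncated without finishing the task
```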
AutoWebGLM (KDD'24): a strong web-navigating agent built on ChatGLM3-6B, outperforming prompted GPT-4 on Mind2Web, WebArena, and our newly constructed dataset AutoWebBench.
VisualAgentBench (ICLR'25): a comprehensive framework to train and test Large Multimodal Models (LMMs) to serve as visual foundation agents.
WebRL (ICLR'25): a self-evolving online curriculum RL framework that transforms open LLMs into web agents, outperforming GPT-4-Turbo on web agent tasks by 160%.
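A schematic of one WebRL iteration as described in the paper; every helper here is a placeholder, not the released code:

```python
def webrl_iteration(policy, tasks, orm, buffer, propose_tasks, kl_update):
    """One self-evolving curriculum step: attempt tasks, judge outcomes,
    grow the curriculum from failures, then update the policy."""
    new_tasks = []
    for task in tasks:
        traj = policy.rollout(task)           # attempt the web task
        traj.success = orm.judge(task, traj)  # outcome-supervised reward model
        buffer.add(traj)                      # keep experience for replay
        if not traj.success:
            # Failed instructions seed new task variants of comparable
            # difficulty, so the curriculum evolves with the policy.
            new_tasks.extend(propose_tasks(task, traj))
    policy = kl_update(policy, buffer)        # KL-constrained policy update
    return policy, new_tasks
```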
AndroidLab (ACL'25): a framework for training and systematically benchmarking Android autonomous agents.
AutoGLM: autonomous foundation agents for GUIs, the first agent family for both Phone Use and Web Browser Use.
Alignment and Scalable Oversight of LLMs and Diffusers
ImageReward (NeurIPS'23): the first general-purpose text-to-image human preference reward model (RM) for RLHF, outperforming CLIP/BLIP/Aesthetic by 30% in terms of human preference prediction.
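ImageReward ships as a pip package (`pip install image-reward`); the snippet below follows the usage documented in the THUDM/ImageReward README at the time of writing, with placeholder image paths — verify the current API against the repo:

```python
import ImageReward as RM

model = RM.load("ImageReward-v1.0")   # downloads the checkpoint on first use
prompt = "a painting of an ocean with clouds and birds, day time"
images = ["generation_1.png", "generation_2.png"]  # placeholder local paths

# Rank candidate generations; each gets a scalar human-preference reward.
ranking, rewards = model.inference_rank(prompt, images)
print(ranking, rewards)

# Or score a single image directly against the prompt.
print(model.score(prompt, images[0]))
```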
BPO (Black-box Prompt Optimization, ACL'24): a novel direction for aligning LLMs via preference-aware prompt optimization, improving the human-preference win rates of ChatGPT, Claude, and LLaMA by 20%+ without training them.
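At inference time BPO only rewrites the user's prompt; the black-box model is untouched. A sketch using the project's Hugging Face release — the model id and chat template are taken from the repo and are assumptions to verify before use:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/BPO"  # the released prompt optimizer (check Hugging Face)
tokenizer = AutoTokenizer.from_pretrained(model_id)
optimizer = AutoModelForCausalLM.from_pretrained(model_id).half().eval().cuda()

# Template adapted from the project repo; treat as an assumption to check.
template = ("[INST] You are an expert prompt engineer. Please help me improve "
            "this prompt to get a more helpful and harmless response:\n{} [/INST]")
user_prompt = "Tell me about Harry Potter"

inputs = tokenizer(template.format(user_prompt), return_tensors="pt").to("cuda")
output = optimizer.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
optimized = tokenizer.decode(output[0], skip_special_tokens=True).split("[/INST]")[-1].strip()
# `optimized` now replaces `user_prompt` in the call to the black-box LLM.
```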
AlignBench (ACL'24): the first comprehensive benchmark for evaluating LLMs' Chinese alignment, derived from ChatGLM's real online scenarios. Adopted by top Chinese LLMs (ChatGLM, Qwen, DeepSeek, Yi, Baichuan, Abab, etc.).
SPaR (ICLR'25): a self-play framework with tree-search refinement that improves instruction following in large language models.
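Schematically, one SPaR round pairs each imperfect response with its tree-search refinement and trains on the resulting preferences; helper names below are illustrative, not the paper's code:

```python
def spar_round(model, judge, prompts, tree_search, preference_update):
    """One self-play round: the model acts, then refines its own failures."""
    pairs = []
    for prompt in prompts:
        response = model.generate(prompt)             # model as actor
        if judge.follows_instruction(prompt, response):
            continue                                  # only refine failures
        # Model as refiner: tree search over candidate edits keeps the
        # refinement minimally different from the original response, so the
        # resulting pair isolates the instruction-following mistake.
        refined = tree_search(model, judge, prompt, response)
        pairs.append((prompt, refined, response))     # (chosen, rejected)
    return preference_update(model, pairs)            # e.g. DPO on the pairs
```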