GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Scaling language models with mixture-of-experts architecture for efficient training

December 29th, 2024
About GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. However, training these large dense models requires vast amounts of computing resources. The paper proposes GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale model capacity while incurring substantially less training cost than dense variants. GLaM achieves strong results on in-context learning tasks across zero-shot, one-shot, and few-shot settings.
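To give a rough sense of how a sparsely activated mixture-of-experts layer keeps per-token compute low while total capacity grows, here is a minimal sketch of a top-k gated MoE feed-forward layer in NumPy. The class name, dimensions, and top-2 routing are illustrative assumptions for this sketch, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SparseMoELayer:
    """Toy sparsely activated mixture-of-experts feed-forward layer (hypothetical sketch)."""
    def __init__(self, d_model, d_hidden, num_experts, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Gating network: one score per expert for each token.
        self.w_gate = rng.normal(0, 0.02, (d_model, num_experts))
        # Each expert is an independent two-layer feed-forward network.
        self.w_in = rng.normal(0, 0.02, (num_experts, d_model, d_hidden))
        self.w_out = rng.normal(0, 0.02, (num_experts, d_hidden, d_model))

    def __call__(self, tokens):
        # tokens: (num_tokens, d_model)
        gate_probs = softmax(tokens @ self.w_gate, axis=-1)      # (T, num_experts)
        # Keep only the top-k experts per token; the rest are skipped entirely,
        # so per-token compute stays roughly constant as num_experts grows.
        top_idx = np.argsort(gate_probs, axis=-1)[:, -self.top_k:]
        out = np.zeros_like(tokens)
        for t in range(tokens.shape[0]):
            chosen = top_idx[t]
            weights = gate_probs[t, chosen]
            weights = weights / weights.sum()                     # renormalize over chosen experts
            for w, e in zip(weights, chosen):
                h = np.maximum(tokens[t] @ self.w_in[e], 0)       # ReLU expert MLP
                out[t] += w * (h @ self.w_out[e])
        return out

# Usage: 8 experts in total, but only 2 are active for any given token.
layer = SparseMoELayer(d_model=16, d_hidden=32, num_experts=8)
x = np.random.default_rng(1).normal(size=(4, 16))
print(layer(x).shape)  # (4, 16)
```

The key design point the sketch illustrates is that adding experts increases the number of parameters (capacity) without increasing the amount of computation each token pays, since only the top-k experts run per token.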
Key Features
- Efficient scaling of language models.
- Mixture-of-experts architecture.
- Scalability to large model capacity.
- Significantly reduced training cost.
Use Cases
- Natural language processing.
- In-context learning tasks.
- Zero-shot, one-shot, and few-shot learning.
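To make the three in-context settings concrete, the snippet below sketches how the prompts differ in each case. The translation task and example pairs are purely illustrative; the model simply completes the final line, with no gradient updates.

```python
# Hypothetical prompts illustrating zero-, one-, and few-shot in-context learning.
zero_shot = "Translate English to French: cheese =>"

one_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)

few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe peluche\n"
    "cheese =>"
)
```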