Silia: A Tiny Transformer Architecture for Sub-10M Parameter Models

Original: Tiny Scale Is All I Can Spare To Play With Transformer

A student paper introduces Silia, a tiny Transformer variant that combines attention and FFN behavior to reduce parameters.

A student from India shared their first paper on r/LocalLLaMA, proposing Silia, a Transformer architecture for extremely small models. The idea is to merge attention-style dynamic mixing with SwiGLU-like nonlinear transformation, aiming to save parameters in models under roughly 10M parameters. The author frames the work as an early, small-scale exploration, limited by old hardware and restricted access to larger compute.

A student researcher shared their first paper on r/LocalLLaMA, introducing Silia, a proposed Transformer architecture aimed at making very small language models more parameter-efficient. The work targets the underexplored regime of tiny models, those with 10 million parameters or fewer, and the abstract specifically emphasizes the challenges of models at or below about 5 million parameters. The author's motivation is that conventional Transformer designs may spend parameters inefficiently at this scale: every block still carries a separate attention component and a separate feed-forward network (FFN), even when the total parameter budget is extremely constrained.
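To make the parameter-budget argument concrete, here is a rough sketch of how weights are distributed in a conventional tiny Transformer with separate attention and SwiGLU FFN sub-layers. It is not taken from the paper and does not describe Silia itself. The function names and the configuration (vocabulary 8192, width 128, FFN width 512, 6 layers) are hypothetical values chosen for illustration, and biases and normalization weights are ignored.

```python
def block_params(d_model: int, d_ff: int) -> dict:
    """Weight counts for one conventional Transformer block.

    Assumptions (illustrative, not from the Silia paper):
    - attention uses four d_model x d_model projections (Q, K, V, output)
    - the FFN is SwiGLU with three d_model x d_ff matrices (gate, up, down)
    - biases and normalization weights are ignored
    """
    attention = 4 * d_model * d_model
    ffn = 3 * d_model * d_ff
    return {"attention": attention, "ffn": ffn, "total": attention + ffn}


def model_params(vocab_size: int, d_model: int, d_ff: int, n_layers: int,
                 tie_embeddings: bool = True) -> dict:
    """Total weight count: token embeddings plus n_layers identical blocks."""
    embeddings = vocab_size * d_model
    if not tie_embeddings:
        embeddings *= 2  # separate input embedding and output projection
    blocks = n_layers * block_params(d_model, d_ff)["total"]
    return {"embeddings": embeddings, "blocks": blocks,
            "total": embeddings + blocks}


# Hypothetical tiny configuration, chosen only to land in the sub-5M regime.
per_block = block_params(d_model=128, d_ff=512)
counts = model_params(vocab_size=8192, d_model=128, d_ff=512, n_layers=6)

print(per_block)  # {'attention': 65536, 'ffn': 196608, 'total': 262144}
print(counts)     # {'embeddings': 1048576, 'blocks': 1572864, 'total': 2621440}
```

Under these assumptions the FFN holds three quarters of each block's weights, and the embedding table alone takes about 40% of the roughly 2.6M total. That is the kind of pressure that makes sharing or merging components attractive at this scale.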


Summaries are AI-generated; the original article is authoritative.