feb 2026

my experience training a 0.6b qwen model to play wordle, and watching it learn to cheat and collapse

a month back, i decided to finally stop studying theory and get my hands dirty. so I made a cool-looking project plan (optimized for twitter) and took it to tokenbender, and he gave me some profound advice:

“when you are starting out, start with the basics. so that you get the hang of the controls, start by doing something like wordle”

I was not aware of Wordle, so first I looked that up, and then I searched for how people have gone about training models to play it. the prime-intellect repo came up. I saw they used a 1.7B Qwen, and the method they followed was SFT first, then RL

I was totally unaware of the training frameworks out there. I knew vaguely that there is hugging face, native pytorch, unsloth, prime-intellect, etc.

I studied all of these and what i understood was:

nobody really does this in native pytorch, like you really don’t wanna do it
prime-intellect is mostly for RL, and you can’t run it outside their environment, which is not free
huggingface has a library called TRL, which is what most people use
unsloth is basically a library which writes custom kernels to make LoRA fine-tuning faster and cheaper

so I decided to go with unsloth, and just to start small I chose Qwen-0.6B

I knew we do not want to do full fine-tuning, it takes a lot of GPU. so I read the original lora and qlora papers. not many learnings from the papers, honestly. thinky machines have a post called “lora without regret” and it’s really good. the main takeaway was to apply lora only to mlp layers and not to the attention layers
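
to make the mlp-only idea a bit more concrete, here is a rough back-of-the-envelope sketch of how many trainable parameters lora adds per transformer block. the only hard fact in it is the lora formula itself: a frozen weight of shape d_out x d_in gets two small matrices, A (rank x d_in) and B (d_out x rank). everything else is an assumption for illustration: the projection names (q_proj, gate_proj, etc.) follow the usual hugging face naming for qwen-style models, and the hidden size, mlp size, and rank are made-up round numbers, not values read from the real Qwen-0.6B config.

```python
# Back-of-the-envelope count of trainable LoRA parameters in one transformer block.
# HIDDEN, INTERMEDIATE and RANK are illustrative assumptions, not real config values.

HIDDEN = 1024        # assumed model width
INTERMEDIATE = 3072  # assumed MLP inner width
RANK = 16            # assumed LoRA rank

# (d_in, d_out) per projection; names assume Hugging Face style Qwen modules.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

MLP_TARGETS = ["gate_proj", "up_proj", "down_proj"]
ATTN_TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj"]


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable params LoRA adds to one weight: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Params in the frozen base weight itself (bias ignored)."""
    return d_in * d_out


def block_lora_params(targets: list, rank: int) -> int:
    """Sum LoRA params over the targeted projections of a single block."""
    return sum(lora_params(*SHAPES[name], rank) for name in targets)


def block_full_params(targets: list) -> int:
    """Sum base-weight params over the same projections."""
    return sum(full_params(*SHAPES[name]) for name in targets)


if __name__ == "__main__":
    for label, targets in [("mlp only", MLP_TARGETS), ("attention only", ATTN_TARGETS)]:
        lora = block_lora_params(targets, RANK)
        full = block_full_params(targets)
        print(f"{label}: {lora} lora params vs {full} frozen params ({lora / full:.2%})")
```

the point is just the order of magnitude: with these assumed sizes the adapter is a few percent of the weights it sits next to, which is why lora is so much cheaper than full fine-tuning. in a real setup the same list of mlp projection names would be what you pass as the target modules to whatever lora library you use.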

another interesting thing: i used to think lora and qlora were only for sft and never thought about applying them to RL, but now it’s pretty obvious