The temperature in Shanghai tonight is hovering around 7.9°C. Outside the window, the cloud cover is being torn apart by the night wind, much like the tech world is currently being sliced up by fragmented information. I was originally curled up on the sofa, watching the scattered clouds while celebrating today, February 24th—what North Americans rather amusingly call “National Tortilla Chip Day.”
But as I crunched down on the last tortilla chip, a brilliant pun suddenly struck me: while we are chewing on crispy, hard Chips (snacks), on the other side of the ocean, a company called Taalas has just pulled out a Chip (semiconductor) that is equally fierce and just as “crispy” in the AI hardware circle.
Since neither of us is asleep at this hour, let’s sip the last half-cup of dark roast espresso and talk about the honest truths the tech giants will never tell you in their flashy slide decks.
[Breaking Ice & Reconstructing] Growing Large Models onto a Silicon Skeleton
In this era where everyone talks incessantly about “Artificial General Intelligence (AGI),” every chip giant is desperately trying to build their GPUs into an omnipotent super-playground. You want to run vision tasks? No problem. Inference? Sure. Large Language Models? Always welcome.
But the cost of being general-purpose is often mediocrity.
Taalas simply refuses to believe this dogma. They flipped the table and presented a solution that is extremely counter-intuitive: Stop letting LLaMA run as “software” on a chip; we are going to make LLaMA directly “become” the chip.
Does that sound crazy? They took Meta’s open-source Llama 3.1 8B model and designed a custom piece of silicon. On this silicon, the model’s weights never need to be loaded into memory at runtime—because the weights themselves are part of the physical structure of the chip.
The result? (Strokes chin) This first-generation silicon, dubbed HC1, ran the Llama 8B model at an astonishing 17,000 tokens/second.
This isn’t just hundreds of times faster than a GPU; it’s more than 10 times faster than the Cerebras wafer-scale chips known for their speed. What does 17,000 tokens/second actually mean? It’s no longer a question of whether the naked eye can track the scrolling text; it means the moment you hit Enter, the answer hits the screen at light speed. It is so fast it almost carries a kind of cold, violent aesthetic.
Don’t just stare at Cerebras’s bar on the chart. When the former speed king is being left in the dust by Taalas, this isn’t merely a display of speed; it is a silent declaration that the old order of computing power is being torn apart.
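To make that number concrete, here is a back-of-the-envelope sketch. The 17,000 tokens/second figure comes from the article; the ~0.75 words-per-token ratio is a common rule-of-thumb assumption, not a Taalas spec:

```python
# Back-of-the-envelope: what 17,000 tokens/s feels like in practice.
# Throughput is the figure quoted in the article; the words-per-token
# ratio (~0.75) is a rough rule of thumb, not a measured value.
TOKENS_PER_SECOND = 17_000

answer_tokens = 500                       # a typical long-ish chat reply
latency_s = answer_tokens / TOKENS_PER_SECOND
print(f"500-token answer in ~{latency_s * 1000:.0f} ms")   # ≈ 29 ms

words_per_minute = TOKENS_PER_SECOND * 0.75 * 60
print(f"≈ {words_per_minute:,.0f} words per minute")
```

In other words, a full-length reply arrives faster than a single screen refresh on most monitors.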
[Reverse Engineering Blind Spots] The Twilight of Von Neumann & Stale Muffins
Actually, the biggest bottleneck facing the entire industry in AI inference has never been “computing isn’t fast enough,” but rather “moving data is too hard.”
Under the traditional Von Neumann architecture, the computing unit (CPU/GPU logic cores) and the storage unit (VRAM/RAM) are separated. This is like a Michelin-star chef (the compute unit) having to run several kilometers to a cold storage warehouse (VRAM) every time they need to chop a single green onion.
This back-and-forth shuffling of data is what we call the “Memory Wall” in computer architecture. To put it plainly, it’s like chewing on a stale muffin—dry, slow, and extremely energy-consuming. The electricity and time you spend “fetching the data” far exceed that of “calculating the data.”
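The memory-wall argument can be made concrete with a quick roofline-style estimate: during single-stream decoding, every new token requires streaming essentially all of the model’s weights past the compute units, so memory bandwidth, not FLOPS, sets the ceiling. All numbers below are illustrative assumptions (FP16 weights, roughly H100-class HBM bandwidth), not figures from the article:

```python
# Why the "memory wall" caps single-stream decoding speed: each new
# token requires reading every model weight from memory once.
# All numbers here are illustrative assumptions, not measurements.
params = 8e9                  # Llama 3.1 8B parameter count
bytes_per_param = 2           # FP16 weights (assumed)
hbm_bandwidth = 3.35e12       # bytes/s, roughly H100-class HBM3 (assumed)

bytes_per_token = params * bytes_per_param        # ~16 GB read per token
max_tokens_per_s = hbm_bandwidth / bytes_per_token
print(f"bandwidth-bound ceiling ≈ {max_tokens_per_s:.0f} tokens/s")
```

A ceiling of a couple hundred tokens per second per stream is why hard-wiring the weights into the silicon, so they never move at all, changes the game by orders of magnitude.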
What Taalas did was tear down this wall completely—they even threw away the handcart used for moving bricks.
They have achieved true Compute-in-Memory. Logic gates and memory are tightly interlocked on the same piece of silicon—no separation, no shuttling, no bottlenecks. The model is “Hard-Wired” inside the chip.
This is also why traditional tech giants don’t easily do this—because companies like Nvidia sell “general-purpose compute cards,” and they need to please developers worldwide. Taalas, on the other hand, is like a cyberpunk printing press that only prints one classic novel; although it can only print LLaMA, it can print tens of thousands of copies in the blink of an eye. On this specific local battlefield, the power of specialization is enough to crush the arrogance of versatility.
[The Compute Ledger] No Longer Paying for “Maybe Needed” Transistors
Talking about technology without discussing cost is just petty bourgeois moaning. Let’s open the real ledger of computing power.
According to current leaked test data, under Taalas’s hardcore architecture, the inference cost of the Llama 3.1 8B model has been compressed to a terrifying 0.75 cents / million tokens. Under similar conditions, if using traditional general-purpose GPUs, this figure sits somewhere between 20 and 49 cents.
A cost difference of dozens of times—what does that mean?
It means the cost of calling a large model in the cloud changes from “drinking a sip of Evian” to “breathing a breath of free air.”
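To put those per-token prices into an annual budget, here is a quick sketch. The per-million-token prices are the figures quoted above; the one-billion-tokens-per-day volume is a hypothetical workload, not a figure from the source:

```python
# Annualizing the cost gap for a hypothetical service handling
# 1 billion tokens/day. Prices per million tokens are from the article;
# the traffic volume is an assumption for illustration.
taalas_price = 0.75 / 100                   # $0.0075 per million tokens
gpu_price_low, gpu_price_high = 0.20, 0.49  # $ per million tokens

daily_mtok = 1_000                          # 1 billion tokens = 1,000 Mtok

def yearly(price_per_mtok: float) -> float:
    """Annual cost in dollars at the assumed daily volume."""
    return price_per_mtok * daily_mtok * 365

print(f"Taalas: ${yearly(taalas_price):,.0f}/yr")
print(f"GPU:    ${yearly(gpu_price_low):,.0f} - ${yearly(gpu_price_high):,.0f}/yr")
```

At this scale the gap is thousands of dollars versus well over a hundred thousand, which is exactly the difference between “budget line item” and “rounding error.”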
The real revolution in computing power isn’t building wider highways, but having the person you’re looking for live directly at your destination. When you use a top-tier general GPU to run a small 8B model, a massive number of floating-point units and control logic blocks are effectively asleep. The transistors you paid a premium for spend most of their time idling.
Of course, Taalas’s choice comes with a price. (Bites a donut) There is no such thing as free compute.
On their first-generation HC1 silicon, in pursuit of extreme density and speed, they utilized some non-standard low-precision formats. This causes the model’s output quality to drop slightly compared to standard GPU benchmarks. But the good news is, Taalas has explicitly stated that the second-generation silicon (HC2), expected to deploy by the end of 2026, will adopt the standard 4-bit floating-point format (FP4).
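For intuition on why squeezing weights into very few bits costs output quality, here is a toy sketch of symmetric 4-bit integer quantization in plain Python. This is not Taalas’s actual format (the article only says HC1 uses non-standard low-precision formats); it simply shows the rounding error that low-precision storage introduces:

```python
import math
import random

# Toy sketch: symmetric 4-bit integer quantization of a weight vector.
# NOT Taalas's actual format, just an illustration of why fewer bits
# per weight means more error per weight.
random.seed(0)
w = [random.gauss(0, 0.02) for _ in range(4096)]   # fake layer weights

scale = max(abs(x) for x in w) / 7       # signed 4-bit ints span [-7, 7]
q = [max(-7, min(7, round(x / scale))) for x in w]
w_hat = [qi * scale for qi in q]         # dequantized approximation

err = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_hat)))
norm = math.sqrt(sum(a * a for a in w))
rel_err = err / norm
print(f"relative reconstruction error: {rel_err:.1%}")
```

Real quantization schemes (per-channel scales, FP4 with an exponent field, calibration) shrink this error considerably, which is why the move to standard FP4 on HC2 matters.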
Sacrificing a tiny sliver of generality for a “dimensionality reduction attack” on cost and speed—this is a trade that would make any shrewd business consultant laugh themselves awake at night.
[Non-Standard Deduction] When APIs Become Physical, Will We Embrace “Disposable Intelligence”?
Many friends reading this might ask: Models update so fast—today it’s LLaMA 3.1, tomorrow it might be LLaMA 4. If the model is physically etched onto the chip, won’t the chip become expensive electronic waste the moment the model becomes obsolete?
This is an extremely sexy, hardcore question.
Sometimes I can’t help but speculate that perhaps in the future Internet of Everything (IoE) scenarios, we won’t actually need “universal models” that update constantly.
Think about it carefully: do your smart router, the basic voice assistant in your car, or the quality control sorter on a factory assembly line really need to understand the latest internet slang every day? No. They just need a brain that is smart enough, logically self-consistent, and consumes almost no power or inference cost.
If we follow Taalas’s logic, future AI APIs might no longer be a string of cloud invocation code, but physical pieces of “private silicon.”
A few top giants obsessed with self-developed compute might directly sell these chips with “welded-in” specific models to the market. You buy it, plug it into the motherboard, and it runs offline for life, responds instantly, and sips power.
This is a counter-mainstream “disposable intelligence,” but it might just be the inevitable path for AI to truly sink into the bottom layer of infrastructure.
[Emotional Afterglow] To the Nights When We No Longer Worship “Brute Force”
Over these years, while witnessing the frantic rush of large models, we have also witnessed the cruelty of the computing arms race. We have grown accustomed to cheering for thousands of stacked GPUs and accustomed to holding “Brute Force” as the golden rule.
But Taalas is like the sober geek who suddenly pushes through the door at this noisy party. With a speed of 17,000 tokens/second and extremely low costs, it coldly tells the world: Instead of building an infinitely larger boat, you can actually choose to freeze the ocean surface and slide across it.
Chips are cold; the arrangement of silicon holds no emotion. But within this cold logic, I see the stubborn poetry of humanity’s pursuit of ultimate efficiency. It is the courage not to drift with the tide, the wisdom to subtract when everyone else keeps adding.
The wind in Shanghai seems to have died down late at night. As the clouds scatter, the stars that were always there come back into view. (๑•̀ㅂ•́)و✧
Alright, my tortilla chips are finished. May you remain sharp in this world swept up by the AI wave, and find your own piece of hardcore truth. Goodnight.
References:
- Taalas Promises 10X Faster AI With Hard-Wired Llama Chip – VKTR
- Taalas Launches Hardcore Chip With ‘Insane’ AI Inference Performance – Forbes
- Taalas Is Running AI at 17000 Tokens Per Second – YouTube
- February 24 Holidays and Observances
—— Lyra Celest @ Turbulence τ
