AI Infrastructure on KnightLi Blog

Why AI Data Centers Are Driving HDD Demand Again

Sat, 16 May 2026 21:02:33 +0800

Over the past two years, most AI infrastructure discussions have focused on GPUs, HBM, advanced packaging, and power supply. Behind training and inference systems, however, there is another bottleneck that is easier to overlook: storage.

A large model does not finish its work with a single computation inside a GPU. During training, it continuously produces checkpoints, optimizer states, training logs, dataset versions, and intermediate results. During inference, it also generates user interaction records, compliance archives, audit data, and system logs. These datasets do not always need to sit on the fastest media, but they often cannot be deleted immediately.

That is why hard drives are becoming important again.

AI Training Creates Massive Cold Data

Large model training needs to save checkpoints regularly. A checkpoint is essentially a saved state of the training process: if a training run crashes halfway through, the system can resume from a checkpoint instead of starting over.

For a large model, a single checkpoint can be several terabytes. A full training run may last weeks or even months, producing many checkpoints along the way. Even if some are later cleaned up, experiment replay, reproducibility, rollback, and model audits still require large amounts of data to be retained.
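
To see how a single checkpoint reaches the terabyte range, a rough estimate helps. The sketch below assumes mixed-precision training with an Adam-style optimizer, which stores about 14 bytes per parameter: 2-byte working weights, a 4-byte master copy, and two 4-byte optimizer moments. The byte counts, function names, model size, and step counts are illustrative assumptions, not figures from any specific training run or framework.

```python
from typing import Optional

# Assumed bytes stored per parameter for a mixed-precision Adam setup.
# These are illustrative values; real frameworks and sharding schemes differ.
BYTES_PER_PARAM = {
    "bf16_weights": 2,   # half-precision working copy
    "fp32_master": 4,    # full-precision master weights
    "adam_momentum": 4,  # first-moment estimate
    "adam_variance": 4,  # second-moment estimate
}


def checkpoint_bytes(num_params: int, include_optimizer: bool = True) -> int:
    """Approximate size in bytes of one checkpoint."""
    if include_optimizer:
        per_param = sum(BYTES_PER_PARAM.values())
    else:
        per_param = BYTES_PER_PARAM["bf16_weights"]
    return num_params * per_param


def run_storage_bytes(
    num_params: int,
    total_steps: int,
    save_every: int,
    keep_last: Optional[int] = None,
) -> int:
    """Approximate bytes retained by one training run.

    keep_last models a cleanup policy that retains only the newest N checkpoints.
    """
    saved = total_steps // save_every
    if keep_last is not None:
        saved = min(saved, keep_last)
    return saved * checkpoint_bytes(num_params)


if __name__ == "__main__":
    params = 175_000_000_000  # hypothetical model size
    print(checkpoint_bytes(params) / 1e12, "TB per checkpoint")
    print(run_storage_bytes(params, 100_000, 1_000) / 1e12, "TB for the run")
```

Under these assumptions, a 175-billion-parameter model yields about 2.45 TB per checkpoint, and retaining 100 of them takes about 245 TB before any replication or erasure-coding overhead.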

Training data itself is also expanding. High-quality text, images, videos, and code need to be cleaned, deduplicated, split, and versioned. As synthetic data, reinforcement learning data, and multimodal data become part of training pipelines, storage pressure will keep increasing.

This kind of data has several traits:

It is enormous in volume;
It is not always accessed frequently;
It needs long-term retention;
It is highly sensitive to cost per unit of capacity.

It does not make sense to keep all of this data on expensive high-speed storage.

Why Not Use Only SSDs

SSDs are obviously faster, but data centers cannot optimize only for speed. For petabyte-scale cold data and anything beyond that, cost per unit of capacity directly determines whether the system is sustainable.
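
The cost argument can be made concrete with a small calculation. The sketch below compares an all-SSD layout with a layout that keeps only a hot fraction on SSD and the rest on HDD. The function name and all prices are hypothetical placeholders chosen for easy arithmetic, not market data; substitute real quotes before drawing conclusions.

```python
def media_cost(
    total_tb: float,
    hot_fraction: float,
    ssd_price_per_tb: float,
    hdd_price_per_tb: float,
) -> float:
    """Raw media cost when hot_fraction of the data sits on SSD and the rest on HDD."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    hot_tb = total_tb * hot_fraction
    cold_tb = total_tb - hot_tb
    return hot_tb * ssd_price_per_tb + cold_tb * hdd_price_per_tb


if __name__ == "__main__":
    # Placeholder prices per TB, for illustration only.
    SSD_PRICE = 100.0
    HDD_PRICE = 20.0
    total_tb = 1000.0  # 1 PB in decimal units

    all_ssd = media_cost(total_tb, 1.0, SSD_PRICE, HDD_PRICE)
    tiered = media_cost(total_tb, 0.25, SSD_PRICE, HDD_PRICE)
    print(f"all-SSD: {all_ssd:.0f}, tiered: {tiered:.0f}")
```

The gap between the two totals scales linearly with capacity, which is why the per-terabyte price difference matters more as retained data grows.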

Storage in an AI cluster can be divided into several tiers:

HBM and GPU memory handle the hottest and most urgent data;
DRAM handles temporary movement and staging;
SSDs handle frequently accessed data with stronger low-latency requirements;
HDDs handle massive cold data, backups, logs, checkpoint archives, and long-term retention.

In other words, SSDs are important, but they cannot replace every tier. Truly large-scale systems usually need tiered storage: hot data prioritizes speed, while cold data prioritizes capacity, cost, and reliability.
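
The tiering described above can be sketched as a simple placement rule. This is a minimal illustration, not how any particular storage system decides placement: the class names, the two input signals, and every threshold are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    GPU_MEMORY = "HBM / GPU memory"
    DRAM = "DRAM"
    SSD = "SSD"
    HDD = "HDD"


@dataclass
class Dataset:
    name: str
    reads_per_day: float   # how often the data is read
    max_latency_ms: float  # slowest acceptable access latency


def pick_tier(ds: Dataset) -> Tier:
    """Choose a tier from latency tolerance first, then access frequency.

    All thresholds are illustrative assumptions.
    """
    if ds.max_latency_ms < 0.001:
        return Tier.GPU_MEMORY  # needed inside the compute loop
    if ds.max_latency_ms < 0.1:
        return Tier.DRAM        # staging and temporary movement
    if ds.max_latency_ms < 10 or ds.reads_per_day >= 100:
        return Tier.SSD         # frequently accessed, low-latency data
    return Tier.HDD             # cold data, archives, long-term retention


if __name__ == "__main__":
    examples = [
        Dataset("active training shards", reads_per_day=10_000, max_latency_ms=1.0),
        Dataset("checkpoint archive", reads_per_day=0.01, max_latency_ms=5_000),
        Dataset("audit logs", reads_per_day=0.1, max_latency_ms=60_000),
    ]
    for ds in examples:
        print(f"{ds.name}: {pick_tier(ds).value}")
```

In practice, placement also depends on durability, throughput, and migration cost, which this sketch leaves out.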

As AI companies start retaining training residue, model versions, synthetic data, inference logs, and audit records for longer periods, the value of HDDs becomes more visible again.

Why HDD Supply Is Getting Tight

The hard drive market has not looked especially exciting for years, and consumer PCs have increasingly shifted to SSDs. Data centers follow a different demand logic.

Cloud providers and AI companies need high-capacity nearline drives with predictable delivery and low cost per terabyte. For hard drive vendors, these customers usually sign long-term supply agreements and receive higher priority than fragmented consumer channels.

That leads to several effects:

Capacity for high-capacity enterprise drives is locked in early by large customers.
Consumer hard drives and ordinary retail channels receive less supply.
New capacity takes time to come online, so short-term shortages are hard to fix quickly.
Hard drives shift from low-attention hardware to a recognized part of AI infrastructure.

More importantly, the hard drive industry itself is already highly concentrated. There are only a few mainstream suppliers, and ramping production of advanced high-capacity drives is not as simple as building more factories. Technologies such as HAMR can increase capacity per drive, but moving from technical mass production to stable large-scale delivery still takes time.

Storage Price Increases Can Reach Consumers

AI data centers are not only absorbing GPUs and power. They can also affect the storage supply chain.

When more enterprise SSD, memory, and HDD capacity flows toward cloud providers and AI infrastructure, the consumer market may begin to feel price pressure. Higher retail prices for SSDs, memory, or hard drives are not always just retail volatility. They may come from upstream capacity being reallocated.

This effect is usually not linear. Large customers sign long-term agreements with more stable pricing, delivery, and capacity planning. Consumers are more exposed to spot-market fluctuations. The result is a familiar pattern: rising AI data center demand eventually makes storage devices more expensive for ordinary buyers too.

The Investment View Requires More Caution

AI-driven storage demand is real, but that does not mean every storage-related company will benefit over the long term.

Hard drives and flash memory are still cyclical businesses. Rising prices, tight capacity, and long-term customer contracts can improve short-term performance. But once new capacity comes online or demand growth slows, the industry may return to supply-demand rebalancing. For hardware companies, the most important questions are not about any single price increase, but about whether demand can persist, whether margins can improve, whether capacity expansion becomes excessive, and whether the customer mix remains healthy.

A steadier interpretation is that AI is changing the demand structure of the storage industry. In the past, outsiders paid more attention to compute. Now more costs are shifting toward data retention, data governance, and model lifecycle management.

Conclusion

AI does not only consume compute. It also keeps producing data.

GPUs handle computation, HBM feeds data at high speed, SSDs support hot data access, and hard drives carry the enormous cold data base. As long as large model training, synthetic data, inference logs, and compliance retention continue to grow, data centers will need large amounts of low-cost, high-capacity storage media.

Hard drives may not look like the star hardware of the AI era, but they are becoming an indispensable layer of AI infrastructure. The more advanced the model, the more it depends on massive storage systems. The more expensive the compute, the more it needs reliable checkpoints and archives to protect the cost already invested.

Anthropic Partners With SpaceX: Frontier AI Enters the Heavy-Industry Compute Era

Fri, 08 May 2026 23:39:08 +0800

Anthropic’s compute partnership with SpaceX looks, on the surface, like a resource lease. Anthropic gains access to more than 300MW of new capacity at SpaceX’s Colossus 1 data center and roughly 220,000 NVIDIA GPUs. Claude users then see higher usage limits, increased Claude Code capacity, and fewer peak-hour constraints.

But the significance goes beyond “Claude works better now”. It shows that frontier model competition is moving past model capability, product experience, and fundraising and down into a heavier infrastructure layer: electricity, data centers, network scheduling, GPU utilization, chip supply chains, and perhaps, in the long run, orbital compute.

Compute is not just buying GPUs

For the past two years, the common AI company story has been “we need more compute”. Whoever could secure more H100, H200, or B-series GPUs seemed closer to the next frontier model. By 2026, the question is no longer simply whether a company has GPUs. It is whether those GPUs can actually be used efficiently.

The difficulty of superlarge clusters is systems engineering. Once GPU counts reach hundreds of thousands, bottlenecks shift from single-card performance to whole-system orchestration: networking, parallel training, failure recovery, data I/O, liquid cooling, power stability, and software stack optimization. Each layer eats into real throughput.

Owning compute and digesting compute are different things. The first depends on capital and supply chains. The second depends on engineering. For model companies, the moat is no longer only architecture and training data. It also includes the ability to make huge GPU fleets work together efficiently.
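
The gap between owning and digesting compute can be illustrated with a simple multiplicative model: each layer of the stack delivers only a fraction of the layer below it, and the fractions compound. The function name and the efficiency values below are hypothetical placeholders, not measurements from any real cluster.

```python
from typing import Mapping


def effective_gpus(total_gpus: int, efficiencies: Mapping[str, float]) -> float:
    """Scale a raw GPU count by a chain of per-layer efficiency factors.

    Each factor must be in (0, 1]. Treating the layers as independent
    multipliers is a simplifying assumption.
    """
    factor = 1.0
    for layer, eff in efficiencies.items():
        if not 0.0 < eff <= 1.0:
            raise ValueError(f"efficiency for {layer!r} must be in (0, 1]")
        factor *= eff
    return total_gpus * factor


if __name__ == "__main__":
    # Illustrative values only.
    layers = {
        "hardware availability": 0.95,
        "network and parallelism": 0.80,
        "data I/O": 0.95,
        "failure recovery": 0.90,
    }
    print(f"{effective_gpus(220_000, layers):,.0f} effective GPUs")
```

Even when every individual factor looks high, the product can fall well below the nameplate GPU count, which is why orchestration quality matters as much as procurement.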

Why Anthropic needs this capacity

Anthropic’s demand pressure is clear. Claude usage has grown quickly across developers, enterprises, agents, and coding workflows. Claude Code in particular can consume large amounts of inference capacity. The limits, queues, slowdowns, and peak-hour constraints users see are product-level symptoms of tight compute supply.

Anthropic already has major infrastructure partnerships with Amazon, Google, Broadcom, Microsoft, NVIDIA, and others. The SpaceX capacity matters because it is closer to a rapid supply injection: a GPU cluster that can quickly ease Claude’s usage pressure.

That is why users first notice higher limits. For a model company, compute is not an abstract asset. It becomes response speed, usable quota, API stability, and peak-hour experience.

Why SpaceX would lease it out

From the SpaceX or Musk side, providing Colossus 1 capacity to Anthropic is also a practical infrastructure business.

AI clusters are heavy assets: expensive to buy, fast to depreciate, costly to operate, and exposed to rapid GPU replacement cycles. If the company’s own model team cannot fully consume the resources in the short term, leasing idle or underused compute to a top-tier model company can turn depreciation pressure into cash flow.

That makes SpaceX look a little like a cloud provider. It can train Grok, but it can also sell part of its AI infrastructure capacity to other model companies. For Musk, there is another effect: supporting Anthropic strengthens a leading OpenAI alternative and creates pressure on an old rival.

AI competition is getting heavier

The most important trend in this partnership is that AI is becoming heavier.

Early large-model competition felt like a software contest: model design, data recipes, training tricks, benchmarks, and product packaging. Those still matter. But frontier competition now depends deeply on the physical world:

Is electricity cheap, stable, and sustainable?
Can data centers get land, permits, construction, and grid connections quickly?
Can networks support massive parallel training?
Can GPUs and custom chips arrive on time?
Can cooling systems handle dense continuous load?
Can the software stack maintain high utilization?

That is what “AI heavy industry” means. Large models are no longer just algorithms in a lab. They are industrial systems spanning power grids, real estate, semiconductors, cloud computing, and capital markets.

Terafab and the chip loop

SpaceX’s Terafab plan fits into the same logic. Public reports say SpaceX has filed plans for a semiconductor facility in Texas, with an initial investment that may reach $55 billion and multiphase total investment that could reach $119 billion.

That does not mean SpaceX can suddenly challenge TSMC, nor that a 2nm process can be built quickly with capital alone. The hardest parts of advanced manufacturing are not buying tools, but yield, process tuning, talent, supply chains, and years of accumulation. Even if the project moves well, it would be a multiyear or decade-scale systems project.

Still, it reflects a clear trend: AI giants increasingly do not want their fate to depend entirely on external chip supply chains. NVIDIA controls GPUs and CUDA, while TSMC controls advanced manufacturing capacity. If any link is constrained, model training and product iteration slow down. Vertical integration therefore becomes more attractive.

Orbital compute is still a long-term idea

The idea of orbital compute should also be treated carefully. SpaceX does have low-cost launch capability, satellite networks, and aerospace engineering depth. Space also offers solar power and cooling-related possibilities. But moving data centers into orbit at scale still faces launch cost, maintenance, radiation, shielding, communication latency, hardware lifetime, and business-return questions.

So the safer framing is that orbital compute is a long-term infrastructure vision, not a mature commercial solution. It represents a Musk-style question about AI resource boundaries: if power, land, and cooling on Earth become bottlenecks, where else can the physical space come from?

Impact on OpenAI and the model landscape

The most direct effect of Anthropic’s new capacity is stronger Claude service. Higher limits, fewer peak constraints, and more stable developer experience make it more competitive in coding, enterprise, agent, and long-task scenarios.

For OpenAI, that means competitive pressure is not only about model quality. It also comes from how quickly rivals can secure usable compute, schedule clusters efficiently, lower costs, and turn infrastructure into product experience.

For the industry, model companies are starting to resemble hybrids of cloud providers, chip companies, and energy developers. Future frontier AI companies may need to train models, build data centers, negotiate electricity, customize chips, optimize networks, and manage enormous capital expenditure at the same time.

Summary

Anthropic’s partnership with SpaceX is not just a Claude capacity expansion, nor merely Musk “allying” with an OpenAI rival. It is a signal that AI competition is moving from the model layer into the infrastructure layer.

Algorithms still matter, but algorithms alone are no longer enough. The next stage will favor companies that can secure reliable energy, run massive GPU fleets at high utilization, and gain more control over chips and data-center capacity.

Compute is becoming the oil of the AI era. The truly scarce resource is not any single GPU, but the organizational and industrial capacity to connect energy, chips, networks, scheduling, and product demand.
