AI Infrastructure: Inference, Chips & RAG Systems

The systems AI runs on: compute, chips, inference, RAG and retrieval — and the economics of serving intelligence at scale. Curated by Turing Post.

AI Concepts & Techniques

10 min read

May 21, 2026

AI 101: From Tokens to Answers: What Actually Happens During LLM Inference

How LLM inference works end-to-end: tokenization, embeddings, prefill, decode, KV cache, batching, retrieval, and modern inference orchestration.

Alyona Vert.

AI 101

12 min read

May 6, 2026

AI 101: Agentic Vector Databases – What Is That?

How vector databases are evolving for AI agents: agentic RAG with Qdrant, memory layers with Weaviate Engram, and the Pinecone Nexus knowledge engine.

Alyona Vert.

AI Concepts & Techniques

15 min read

Apr 22, 2026

AI 101: LLM Token Types: Input, Output, Reasoning, Cached & More

Input, output, reasoning, cached, vision — not all LLM tokens cost the same. A guide to token types and how each one shapes your AI bill.

Ksenia Se

AI 101

12 min read

Mar 18, 2026

Nemotron 3 and the Surprising Coalition Building New AI in the Open

The Nemotron Coalition is NVIDIA's bet on open frontier AI, with Mistral, Cursor, Black Forest Labs and others. How Nemotron 3 works and who holds power.

Alyona Vert., +1

AI Concepts & Techniques

14 min read

Feb 25, 2026

AI 101: The Inference Chip Wars – MatX, Taalas, and the Cracks in the GPU Era

The inference chip landscape in 2026: NVIDIA Vera Rubin, MatX's programmable LLM accelerator, and Taalas' model-as-hardware approach compared on cost per token.

Alyona Vert., +1

AI 101

13 min read

Sep 10, 2025

CPU vs GPU vs TPU vs NPU: AI Hardware Processing Units Explained

CPU for general tasks, GPU for parallel compute, TPU as Google's AI ASIC, NPU for on-device inference. Full guide: ASIC, APU, IPU, FPGA explained.

Alyona Vert.

AI Concepts & Techniques

10 min read

Apr 2, 2025

LLM Inference Explained: Latency, Throughput & How It Works

How to optimize LLM inference latency and throughput: quantization, batching, KV cache, speculative decoding, GPU vs TPU, and hardware accelerators.

Alyona Vert.

AI 101

7 min read

Sep 11, 2024

What is HybridRAG?

We discuss how HybridRAG combines VectorRAG and GraphRAG, its impact on financial document analysis and other application areas, and clarify related terms.

Alyona Vert.

AI 101

8 min read

Aug 14, 2024

What is Speculative RAG?

Speculative RAG uses a small drafter and a large verifier LM to boost RAG speed and accuracy. How it works, where it excels, and key limitations.

Alyona Vert., +1

AI 101

4 min read

Jul 10, 2024

What is the LongRAG framework?

LongRAG uses 4K-token retrieval units instead of 100-word chunks, shrinking the number of retrieval units roughly 30×. How the LongRAG architecture works and how it compares to standard RAG.

Ksenia Se, +1

AI 101

4 min read

Jun 19, 2024

What is FSDP and YaFSDP?

FSDP shards model params across GPUs to reduce memory overhead. YaFSDP by Yandex adds layer sharding — saving 150 GPUs and cutting LLM training time by 26%.

Ksenia Se

AI 101

4 min read

Jun 5, 2024

What is the Graph RAG approach?

GraphRAG uses knowledge graphs instead of vector chunks to give LLMs richer, connected context. How it works, key advantages over RAG, and use cases.

Ksenia Se, +1