The New GC in Go 1.26: Green Tea 🍵 Garbage Collector

Official design document: green tea garbage collector

Differences in high-level design goals

Differences in design philosophy

Original description of the old GC:

Go’s garbage collector implements a classic tri-color parallel marking algorithm. This is, at its core, just a graph flood, where heap objects are nodes in the graph, and pointers are edges. However, this graph flood affords no consideration to the memory location of the objects that are being processed. As a result, it exhibits extremely poor spatial locality—jumping between completely different parts of memory—poor temporal locality—blithely spreading repeated accesses to the same memory across the GC cycle—and no concern for topology.

Original description of the old GC's problem:

As a result, it exhibits extremely poor spatial locality—jumping between completely different parts of memory—poor temporal locality—blithely spreading repeated accesses to the same memory across the GC cycle—and no concern for topology.
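In short: marking jumps between unrelated parts of memory (poor spatial locality), touches the same memory repeatedly at widely separated points in the GC cycle (poor temporal locality), and ignores the topology of the object graph.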

Original description of the new GC:

Green Tea: a parallel marking algorithm that, if not memory-centric, is at least memory-aware, in that it endeavors to process objects close to one another together.

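Analysis:

[Old GC] Treats the heap as a pure graph and marks it with a parallel "flood fill", paying no attention to where objects sit in physical memory.

[New GC] Makes memory awareness the primary goal: objects that are physically close together are processed together, to improve cache/NUMA behavior.

Differences in the core algorithm

Smallest unit of work

Original: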

The core idea behind the new parallel marking algorithm is simple. Instead of scanning individual objects, the garbage collector scans memory in much larger, contiguous blocks. The shared work queue tracks these coarse blocks instead of individual objects, and the individual objects waiting to be scanned in a block are tracked in that block itself.
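In short: the unit tracked by the shared work queue becomes a coarse, contiguous block of memory, and which objects inside that block still need scanning is recorded in the block itself.

Analysis:

[Old GC] The smallest unit of work is a single object;
[New GC] The smallest unit of work is a span (an 8 KiB contiguous block).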
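To get a feel for how much coarser a span is than an object, the following sketch computes how many objects one 8 KiB span can hold. The function name `objectsPerSpan` and the sample sizes are illustrative assumptions, and the calculation ignores any bytes a real runtime reserves inside the span for metadata.

```go
package main

import "fmt"

// spanBytes is the 8 KiB small-object span size mentioned in the design document.
const spanBytes = 8 << 10

// objectsPerSpan returns how many objects of objSize bytes fit in one span.
// Assumption: no span space is reserved for metadata (a real runtime may
// reserve some, so actual counts can be slightly lower).
func objectsPerSpan(objSize int) int {
	return spanBytes / objSize
}

func main() {
	for _, size := range []int{16, 64, 512} {
		fmt.Printf("%4d-byte objects: up to %d per span\n", size, objectsPerSpan(size))
	}
}
```

One queue entry for a span can therefore stand in for anywhere from a handful to hundreds of per-object entries, depending on object size and how many of the objects are actually reachable.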

Enqueue/dequeue actions

Original:

When scanning finds a pointer to a small object, it sets that object’s gray bit … If the gray bit was not already set and the object’s span is not already enqueued … it enqueues the span.
When the scan loop dequeues a span, it computes the difference between the gray bits and the black bits … scans any objects that had their gray bit set but not their black bit.

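Analysis:

[Old GC] Each pointer to a not-yet-marked object causes that individual object to be queued for scanning;
[New GC] A span is enqueued only when a pointer lands in a small-object span, the target's gray bit was not already set, and the span is not already queued; on dequeue, every object in the span that is gray but not yet black is scanned in one batch.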
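The enqueue and dequeue rules quoted above can be sketched as follows. This is a minimal single-threaded model, not the runtime's implementation: the `span` and `marker` types, the 64-slot limit, and the method names are all assumptions made for illustration.

```go
package main

import "fmt"

// objsPerSpan is a hypothetical fixed slot count, chosen so that one
// uint64 can hold a bit per object.
const objsPerSpan = 64

// span is a simplified stand-in for a small-object span.
type span struct {
	gray   uint64 // bit i set: object i has been found reachable
	black  uint64 // bit i set: object i has already been scanned
	queued bool   // span is currently waiting in the work queue
}

// marker owns a FIFO queue of spans waiting to be scanned.
type marker struct {
	queue []*span
}

// markObject is called when scanning finds a pointer to object idx in s.
// The span is enqueued only if the gray bit was newly set and the span
// is not already queued.
func (m *marker) markObject(s *span, idx uint) {
	bit := uint64(1) << idx
	if s.gray&bit != 0 {
		return // already gray: nothing to do
	}
	s.gray |= bit
	if !s.queued {
		s.queued = true
		m.queue = append(m.queue, s)
	}
}

// scanNext dequeues one span and returns the indices of the objects that
// are gray but not black, marking them black as it goes.
func (m *marker) scanNext() []uint {
	if len(m.queue) == 0 {
		return nil
	}
	s := m.queue[0]
	m.queue = m.queue[1:]
	s.queued = false

	pending := s.gray &^ s.black // gray bits minus black bits
	s.black |= pending

	var scanned []uint
	for i := uint(0); i < objsPerSpan; i++ {
		if pending&(1<<i) != 0 {
			scanned = append(scanned, i)
		}
	}
	return scanned
}

func main() {
	m := &marker{}
	s := &span{}
	m.markObject(s, 3)
	m.markObject(s, 7)
	m.markObject(s, 3) // duplicate pointer: no effect
	fmt.Println(len(m.queue), m.scanNext()) // prints: 1 [3 7]
}
```

Three pointers produce a single queue entry, and both pending objects are scanned in the same pass over the span.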

Large-object path

Original:

Larger objects continue to use the old algorithm … The choice of which algorithm to use is made when scanning encounters a pointer.

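Analysis:

[Old GC] All objects are treated the same way;
[New GC] Larger objects still use the old algorithm and small objects use the new one, forming a "hybrid path".

A note on what "large" means here. In Go's allocator, objects above 32 KiB are "large objects" that get a dedicated span, while smaller objects are grouped into size-class spans of 8 KiB, 16 KiB, and so on. Green Tea's boundary is lower than that: as far as the design document and the current runtime source indicate, the new algorithm covers only objects of up to 512 bytes, which is why its small-object spans are always exactly 8 KiB. In detail:

≤ 512 bytes → small object in an 8 KiB span, handled by the new Green Tea algorithm.
> 512 bytes → continues to use the old GC's marking (object-by-object scanning + tri-color marking):
- encounter an object
- scan its fields (pointers)
- turn newly found objects gray
- eventually traverse the whole graph
1 KB = 1000 bytes; 1 KiB = 1024 bytes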

Key implementation-level optimizations

Guarding against single-object degeneration

Original:

If a span has only a single object to scan … we track the object that was marked when the span was enqueued … if the hit flag is not set, then the garbage collector can directly scan the span’s representative.

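Analysis:
[Old GC] Has no such concept;
[New GC] Uses a "representative object + hit flag" so that sparse spans, where only one object needs scanning, do not pay extra for span-level processing.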
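The representative/hit-flag idea can be sketched like this. The field and method names are hypothetical, and the model is single-threaded; it only illustrates the decision made at dequeue time, not how the runtime stores these flags.

```go
package main

import "fmt"

// span is a simplified model; names are illustrative, not the runtime's.
type span struct {
	gray           uint64 // one gray bit per object slot
	queued         bool
	representative uint // object whose marking caused the span to be enqueued
	hit            bool // another object was marked while the span was queued
}

// mark records a pointer to object idx and reports whether the caller
// should enqueue the span.
func (s *span) mark(idx uint) (enqueue bool) {
	bit := uint64(1) << idx
	if s.gray&bit != 0 {
		return false
	}
	s.gray |= bit
	if !s.queued {
		s.queued = true
		s.representative = idx
		s.hit = false
		return true
	}
	s.hit = true // a second object arrived while the span was waiting
	return false
}

// dequeue decides how to scan the span: if nothing else was marked while
// it waited, scanning the representative alone is enough.
func (s *span) dequeue() (onlyRepresentative bool, rep uint) {
	s.queued = false
	return !s.hit, s.representative
}

func main() {
	sparse := &span{}
	sparse.mark(5)
	fmt.Println(sparse.dequeue()) // prints: true 5

	dense := &span{}
	dense.mark(1)
	dense.mark(9)
	fmt.Println(dense.dequeue()) // prints: false 1
}
```

In the sparse case the collector can skip the gray/black bitmap comparison and go straight to one object, which is roughly what the old per-object path would have done.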

Work distribution mechanism

Original:

Go’s current garbage collector … each scanner aggressively checks and populates global lists. This frequent mutation … is a significant source of contention …
The prototype implementation has a separate queue dedicated to spans and based on the distributed work-stealing runqueues … fewer items to queue … inherently lower contention.

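Analysis:
[Old GC] Marking workers frequently check and refill shared global work lists, so many cores contend on them;
[New GC] Spans get their own queue, modeled on the goroutine scheduler's distributed work-stealing run queues; because spans are coarse-grained units, there are fewer items to queue and contention is inherently lower.

Previously, workers contended over individual small objects; now each unit of work is an 8 KiB small-object span.
It is like picking up a batch of jobs and working through them one by one, rather than going back for a new job after finishing each one, which is generally faster.
Also, with or without Green Tea, Go's memory manager already carves the heap into spans built from 8 KiB pages and keeps per-object mark bits for each span. The new algorithm builds on this existing per-span bookkeeping instead of adding a separate per-object structure, so peak memory usage should stay essentially unchanged while cache pressure goes down.

Locating metadata with address arithmetic

Original: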

Since small object spans are always 8 KiB large and 8 KiB aligned … simple address arithmetic to find the object’s metadata within the span, thus avoiding indirections and dependent loads.

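Analysis:
[Old GC] Mark bits are reached indirectly: the collector first looks up the object's span and then follows a pointer to its mark bitmap, a chain of dependent loads;
[New GC] Rounding the pointer down to the 8 KiB boundary locates the span, and the metadata inside it is found with simple arithmetic, which removes a dependent load.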
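A minimal sketch of that address arithmetic, assuming 8 KiB-aligned spans and a fixed element size per span. The helper names `spanBase` and `objIndex` are made up for this example, and where the metadata sits inside the span is deliberately left out because the quote does not specify it.

```go
package main

import "fmt"

// spanSize is the 8 KiB span size (and alignment) stated in the quote.
const spanSize = 8 << 10

// spanBase rounds addr down to the start of its 8 KiB-aligned span by
// clearing the low 13 bits.
func spanBase(addr uintptr) uintptr {
	return addr &^ (spanSize - 1)
}

// objIndex returns the slot index of the object containing addr,
// assuming every object in the span is elemSize bytes.
func objIndex(addr, elemSize uintptr) uintptr {
	return (addr - spanBase(addr)) / elemSize
}

func main() {
	addr := uintptr(0x12345) // arbitrary example address
	fmt.Printf("base=%#x index=%d\n", spanBase(addr), objIndex(addr, 64))
	// prints: base=0x12000 index=13
}
```

No table lookup or pointer chase is needed to go from a raw pointer to "which span, which slot"; both follow from the address itself.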

Queue ordering policy

Original:

FIFO turned out to accumulate the highest average density of objects to scan on a span by the time it was dequeued.

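Analysis:
[Old GC] Has no span queue, so no such policy exists;
[New GC] Several orderings were compared explicitly, and measurements showed that FIFO lets a span accumulate the highest average density of objects to scan by the time it is dequeued.

Performance and scalability

Microbenchmarks

Original: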

In select GC-heavy microbenchmarks … we observed anywhere from a 10–50% reduction in GC CPU costs … cache misses was reduced by half.

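Analysis:
[Old GC] According to the design document, more than 35% of the scan loop's CPU cycles are spent stalled on memory accesses;
[New GC] In select GC-heavy microbenchmarks, GC CPU cost drops by 10–50% and cache misses are roughly halved.

Scaling with core count

Original: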

The improvement generally rose with core count, indicating that the prototype scales better than the existing implementation.

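Analysis:
[Old GC] The shared global work lists become a bottleneck on many-core machines;
[New GC] The improvement generally grows with core count.

Differences between the old and new GC

Where does the old GC actually contend?

Original: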

Go’s current garbage collector … each scanner aggressively checks and populates global lists. This frequent mutation of the global lists is a significant source of contention in Go programs on many-core systems.

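Breakdown:

Marking code runs on some P, but the work it pushes and pops ultimately flows through the same global data structures.
Every scanner checks and refills these global lists aggressively, so all cores keep competing for the same shared state. As the core count grows, failed atomic operations and cache-line ping-pong rise sharply, and scalability hits a wall.

What exactly did Green Tea change?

Original: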

The prototype implementation has a separate queue dedicated to spans and based on the distributed work-stealing runqueues … fewer items to queue … inherently lower contention.

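Implementation details:

Physically separate per-P span queues
Green Tea adds a per-P span queue dedicated to GC marking work. It is modeled on the scheduler's run queues but kept separate from them; the scheduler's per-P queues only hold goroutines.
Why separate? The quoted text does not give a detailed rationale. A plausible reading is that marking work and goroutine scheduling have very different access patterns, so keeping spans in their own queue lets a marker work through spans back to back and preserves better cache locality.
Coarser work units
Switching from single objects to 8 KiB spans shrinks the number of queued items, to roughly 1/16 or less of the former count when spans are densely populated. Coarser units mean fewer steal operations, shorter queues, and more contiguous memory access during scanning.
Mirroring GMP's work-stealing strategy
A marking goroutine first consumes its own P's local queue and steals from another P only when that queue is empty, so shared global state is rarely touched. Contention changes from "everyone fighting over one shared structure" to "everyone minds their own queue and steals occasionally", the same approach the Go scheduler uses for its M:N model.

Green Tea's work-unit design clearly leans toward the original GMP design: each worker uses a queue to control how much work it holds, steals when it runs short, and only then falls back to the shared area.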
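The "own queue first, steal only when empty" policy can be sketched as below. This is a simplified illustration, not the runtime's queue: a mutex-protected slice stands in for what would normally be a lock-free structure, and `spanQueue` and `getWork` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// spanQueue is a per-worker FIFO of span IDs. A mutex keeps the sketch
// short; a production work-stealing queue would typically be lock-free.
type spanQueue struct {
	mu    sync.Mutex
	spans []int
}

// push appends a span ID to the tail of the queue.
func (q *spanQueue) push(id int) {
	q.mu.Lock()
	q.spans = append(q.spans, id)
	q.mu.Unlock()
}

// pop removes and returns the oldest span ID, if any.
func (q *spanQueue) pop() (int, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.spans) == 0 {
		return 0, false
	}
	id := q.spans[0]
	q.spans = q.spans[1:]
	return id, true
}

// getWork tries the worker's own queue first and steals from the other
// queues only when its own queue is empty.
func getWork(self int, queues []*spanQueue) (id int, stolen, ok bool) {
	if id, ok := queues[self].pop(); ok {
		return id, false, true
	}
	for i := range queues {
		if i == self {
			continue
		}
		if id, ok := queues[i].pop(); ok {
			return id, true, true
		}
	}
	return 0, false, false
}

func main() {
	queues := []*spanQueue{{}, {}}
	queues[0].push(10)
	queues[0].push(11)

	fmt.Println(getWork(0, queues)) // own queue: 10 false true
	fmt.Println(getWork(1, queues)) // empty, so it steals: 11 true true
	fmt.Println(getWork(1, queues)) // nothing left: 0 false false
}
```

As long as each worker's own queue has spans in it, workers never touch each other's queues, which is where the lower contention comes from.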

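Summary

The new Green Tea GC keeps the original tri-color marking and adds a hybrid path that splits objects by size. Large numbers of small objects used to drag down GC resource usage and efficiency, so the small-object case is separated out and handled by the new algorithm, while larger objects are processed by the original algorithm. As far as the design document and runtime source indicate, the boundary is 512 bytes rather than the allocator's 32 KiB small/large boundary.

Green Tea's basic unit is an 8 KiB small-object span rather than a single object. It is built around batching, which reduces contention on the shared global structures. Objects that are physically adjacent are processed together, which cuts down on scattered memory accesses during marking.

Because each P now has its own span queue, and every entry is an 8 KiB span, the rule for which span to dequeue first when several are waiting matters a lot. In testing, FIFO came out best. The old GC had nothing comparable.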

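🔵 Old GC (object-by-object scanning)

Previously it looked like this:

for each object:
    scan that object's pointer fields

Problems:

There are a great many small objects (hundreds of thousands, even millions)
For every object the GC has to:
- find its metadata
- decode its pointer bitmap
👉 CPU cache misses are severe

Cache misses are severe because objects may be laid out contiguously in memory, but the order in which the GC visits them is not.

The old GC's flow for finding the objects to process is: object → metadata → bitmap → next object → metadata → bitmap. Consecutive objects, along with their metadata and bitmaps, can live anywhere in memory, so the CPU keeps pulling in new cache lines and evicting old ones, and lines are often evicted before they are needed again.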

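🟢 New GC (span batch scanning)

Now it is closer to this:

for each span:
    process all pending objects in that span in one pass

More specifically:

It uses information the span already has:
- size class (all objects are the same size)
- pointer layout (for small objects, kept with the span)
This makes it possible to scan in batches, cut repeated metadata lookups, and access memory sequentially (cache-friendly).

Within a span, every object has the same size (its size class). Pointer layouts can still differ between objects, because spans group objects by size class rather than by type, but in recent Go versions the pointer bitmap for small objects is stored inside the span itself. One span-level lookup therefore serves every object scanned in the batch.

This metadata is close at hand because Go's memory allocator has already grouped objects of the same size class into spans.

The new GC also has fewer CPU cache misses because it works span by span: the objects are physically contiguous and adjacent, so the CPU cache mostly holds data from the span currently being scanned.

That is the difference between the two. Under the old GC the objects were just as contiguous and adjacent physically, but traversing them in graph order made the accesses non-contiguous. The new GC groups pending work by span, so physically adjacent objects are visited together and the access pattern becomes contiguous, which addresses the problem directly.

Because the cache is mostly filled with span data that is about to be scanned, misses become much rarer and the CPU spends far less time evicting and refilling cache lines. GC work becomes more efficient overall and CPU utilization improves.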
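As a rough illustration of why visiting order matters, the sketch below counts how often consecutive visits land in a different 8 KiB span. The addresses are invented, `spanSwitches` and `groupBySpan` are hypothetical helpers, and sorting by address is only a stand-in for Green Tea's span queue; it is not how the collector orders its work.

```go
package main

import (
	"fmt"
	"sort"
)

// spanSize is the 8 KiB span size used throughout this post.
const spanSize = 8 << 10

// spanSwitches counts how often two consecutive addresses fall in
// different spans, a crude proxy for jumping around memory.
func spanSwitches(addrs []uintptr) int {
	n := 0
	for i := 1; i < len(addrs); i++ {
		if addrs[i]/spanSize != addrs[i-1]/spanSize {
			n++
		}
	}
	return n
}

// groupBySpan returns a copy of addrs ordered so that objects in the
// same span are visited together (here simply sorted by address).
func groupBySpan(addrs []uintptr) []uintptr {
	out := append([]uintptr(nil), addrs...)
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	// Hypothetical discovery order that keeps hopping between three spans.
	order := []uintptr{0x0040, 0x4080, 0x2100, 0x0100, 0x4000, 0x2040, 0x0200, 0x2200}
	fmt.Println(spanSwitches(order), spanSwitches(groupBySpan(order))) // prints: 7 2
}
```

With the same eight objects, graph order changes span on every step, while the grouped order changes span only twice.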

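Old vs. new GC, side by side

A cache line is ≈ 64 B:

Old GC: may use only 8 B of a 64 B line, wasting much of the CPU cache's already scarce space.
New GC: scans contiguously, so nearly all the data in each cache line gets used and utilization is very high.

In essence, Green Tea changes the GC from "hopping all over memory to process objects one at a time" to "processing contiguous memory blocks in order and in batches", which fits how CPUs work and significantly reduces cache misses and contention.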
