From "Ten Lessons From Three Generations Shaped Google's TPUv4i"

Evolution of ML DSA

for TPUv1, see TPUv1: Single-Chip Inference DL DSA (maxzcl's blog, CSDN)

for TPUv2/3 see https://blog.csdn.net/maxzcl/article/details/121399583

for TPUv1 to TPUv2, see TPUv2/v3 Design Process (maxzcl's blog, CSDN)

The Ten Lessons

General

① Logic, wires, SRAM, & DRAM improve unequally

② Leverage prior compiler optimizations

The fortunes of a new architecture have been bound to the quality of its compilers.

Indeed, compiler problems likely sank the Itanium’s VLIW architecture [25], yet many DSAs rely on VLIW (see §6), including TPUs. Architects wish for great compilers to be developed on simulators, yet much of the progress occurs after hardware is available, since compiler writers can measure the actual time taken by code. Thus, reaching an architecture’s full potential quickly is much easier if it can leverage prior compiler optimizations (==> which requires the hardware design to carry forward the structures the existing compiler already targets) rather than starting from scratch.

③ Design for performance per TCO vs per CapEx

Capital Expense (CapEx) is the price for an item

Operation Expense (OpEx) is the cost of operation, including electricity consumed and power provisioning

TCO: Total Cost of Ownership

TCO = CapEx + 3 ✕ annual OpEx //accounting amortizes computer CapEx over 3-5 years

most companies care more about the performance/TCO of production apps (Perf/TCO) than about performance/CapEx

A DSA should aim for good Perf/TCO over its full lifetime, and not only at its birth.
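A minimal sketch of the TCO arithmetic above, assuming the 3-year amortization from the formula; the chip figures are made-up numbers for illustration only:

```python
def tco(capex, annual_opex, years=3):
    """Total Cost of Ownership: purchase price plus operating cost
    (electricity, power provisioning) over the amortization period."""
    return capex + years * annual_opex

# Hypothetical chips: A is cheaper to buy, B is cheaper to run.
chip_a = tco(capex=2000, annual_opex=400)
chip_b = tco(capex=2500, annual_opex=200)
print(chip_a, chip_b)  # 3200 vs 3100: at equal performance, B is the
                       # better buy despite its higher CapEx
```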

ML DSA

④ Support Backwards ML Compatibility

Backwards compatibility is handled largely by the compiler, but it must be underpinned by consistency in the TPU's hardware structures across generations.

⑤ Inference DSAs need air cooling for global scale

==> optimality is not just "the best possible" but the most suitable.

⑥ Some inference apps need floating point arithmetic

DSAs may offer quantization, but unlike TPUv1, they should not require it.

Quantized arithmetic yields area and power savings, but those savings can come at the cost of reduced quality and delayed deployment, and some apps simply don’t work well when quantized (see Figure 4 and NMT from MLPerf Inference 0.5 in §4).

Early in TPUv1 development, application developers said a 1% drop in quality was acceptable, but they changed their minds by the time the hardware arrived, perhaps because DNN overall quality improved so that 1% added to a 40% error was relatively small but 1% added to a 12% error was relatively large.
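A minimal int8 symmetric-quantization sketch (an illustration, not the TPU's actual scheme) showing the rounding error a model must absorb, which is why some apps lose too much quality when quantized:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric int8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256).astype(np.float32)
q, s = quantize_int8(w)
# Worst-case per-weight rounding error; the model's accuracy budget
# must tolerate this everywhere quantization is applied.
print(np.abs(w - dequantize(q, s)).max())
```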

DNN Applications

⑦ Production inference normally needs multi-tenancy

Definition of Multitenancy - IT Glossary | Gartner

Multitenancy is a reference to the mode of operation of software where multiple independent instances of one or multiple applications operate in a shared environment. The instances (tenants) are logically isolated, but physically integrated.

for more on multitenancy and SaaS, see What is multitenancy?

multitenancy here refers to the fact that most applications require the execution of multiple models/agents sharing the same accelerator

⑧ DNNs grow ~1.5x/year in memory and compute

architects should provide headroom so that DSAs can remain useful over their full lifetimes.
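A one-line illustration of why headroom matters: at the ~1.5x/year growth rate from this lesson, demand compounds quickly over a typical 3-5 year deployment (the loop is just arithmetic):

```python
# Memory/compute demand relative to launch, growing ~1.5x per year.
for years in range(1, 6):
    print(years, round(1.5 ** years, 1))  # 1.5, 2.2, 3.4, 5.1, 7.6
```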

⑨ DNN workloads evolve with DNN breakthroughs

programmability and flexibility are crucial for inference DSAs to track DNN progress.

⑩ Inference SLO limit is P99 latency, not batch size

SLO: Service Level Objectives

==> P99 (99th percentile) latency is what the user-end application cares about

====> the DSA should exploit its specialization advantage to deliver low latency even at large batch sizes

====> and perform no worse than general-purpose devices at small batch sizes.
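A small sketch of the SLO check; the synthetic lognormal latencies are an assumption for illustration:

```python
import numpy as np

# Synthetic request latencies in milliseconds.
latencies_ms = np.random.lognormal(mean=1.0, sigma=0.5, size=10_000)

p99 = np.percentile(latencies_ms, 99)
print(f"mean={latencies_ms.mean():.1f} ms  p99={p99:.1f} ms")

# A larger batch raises throughput, but it only helps if P99 still
# meets the SLO; the tail, not the average, is the binding constraint.
SLO_MS = 10.0
print("meets SLO:", p99 <= SLO_MS)
```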

The 4th-Gen TPU

What The Lessons Keep

Given the importance of leveraging prior compiler optimizations ② and backwards ML compatibility ④—plus the benefits of reusing earlier hardware designs—TPUv4i was going to follow TPUv3:

1 or 2 brawny cores per chip,

a large systolic MXU array and vector unit per core,

compiler-controlled vector memory, and compiler-controlled DMA access to HBM.

TPUv4 and TPUv4i

the concurrent development of the two chips was enabled by a key realization:

==> a single TensorCore design, deployed in two arrangements: two cores per chip for TPUv4 (training) and one core per chip for TPUv4i (inference)

==> The core development guideline for the 4th generation is truly inspired ==> do not let past mistakes or regrets go to waste

Schematics

Figure 5. TPUv4i chip block diagram. Architectural memories are HBM, Common Memory (CMEM), Vector Memory (VMEM), Scalar Memory (SMEM), and Instruction Memory (IMEM). The data path is the Matrix Multiply Unit (MXU), Vector Processing Unit (VPU), Cross-Lane Unit (XLU), and TensorCore Sequencer (TCS). The uncore (everything not in blue) includes the On-Chip Interconnect (OCI), ICI Router (ICR), ICI Link Stack (LST), HBM Controller (HBMC), Unified Host Interface (UHI), and Chip Manager (MGR).

Figure 6. TPUv4i chip floorplan. The die is <400 mm² (see Table 1). CMEM is 28% of the area. OCI blocks are stretched to fill space in the abutted floorplan because the die dimensions and overall layout are dominated by the TensorCore, CMEM, and SerDes locations. The TensorCore and CMEM block arrangements are derived from the TPUv4 floorplan.

Compiler compatibility, not binary compatibility

Increased on-chip SRAM storage with common memory (CMEM)

TPUv4i adds 128 MB of Common Memory (CMEM). This expanded memory hierarchy reduces the number of accesses to HBM, the slowest and least energy-efficient memory.

We picked 128 MB as the knee of the curve between good performance and a reasonable chip size, as the amortized chip cost is a significant fraction of TCO ③.

Four-dimensional tensor DMA

1. TPUv4i contains tensor DMA engines that are distributed throughout the chip’s uncore to mitigate the impact of interconnect latency and wire scaling challenges ①. The tensor DMA engines function as coprocessors that fully decode and execute TensorCore DMA instructions.

2. To maximize predictable performance and simplify hardware and software, TPUv4i unifies the DMA architecture across local (on-chip), remote (chip-to-chip), and host (host-to-chip and chip-to-host) transfers to simplify scaling of applications from a single chip to a complete system.
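A software model of a 4D strided DMA copy, to make the idea concrete; the descriptor fields and names are assumptions for illustration, not the real TPUv4i interface:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DmaDescriptor4D:
    sizes: tuple        # elements per dimension (d0, d1, d2, d3)
    src_strides: tuple  # flat-index stride per dimension at the source
    dst_strides: tuple  # flat-index stride per dimension at the destination

def run_dma(desc, src, dst):
    """What a tensor DMA engine does in hardware: walk four nested
    loops of strided reads and writes, with no core involvement."""
    for idx in np.ndindex(*desc.sizes):
        s = sum(i * st for i, st in zip(idx, desc.src_strides))
        d = sum(i * st for i, st in zip(idx, desc.dst_strides))
        dst[d] = src[s]

# Example: transpose a 4x4 tile by swapping the last two strides.
src = np.arange(16, dtype=np.float32)
dst = np.zeros(16, dtype=np.float32)
run_dma(DmaDescriptor4D((1, 1, 4, 4), (16, 16, 4, 1), (16, 16, 1, 4)), src, dst)
```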

Custom on-chip interconnect (OCI)

Rapidly growing and evolving DNN workloads ⑧, ⑨ have driven the TPU uncore towards greater flexibility each generation. In past TPU designs, each component was connected point-to-point (Figure 1). As memory bandwidth increases and the number of components grows, a point-to-point approach becomes too expensive, requiring significant routing resources and die area. It also requires up-front choices about which communication patterns to support. For example, in TPUv3, a TensorCore can only access half of HBM as a local memory [30]: it must go through the ICI to access the other half of HBM. This split imposes limits on how software can use the chip in the future ⑧.
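A back-of-the-envelope view of why point-to-point wiring stops scaling (simple graph arithmetic, not data from the paper):

```python
# Full point-to-point connectivity among n uncore components needs
# n*(n-1)/2 dedicated links; a shared interconnect needs ~one port each.
for n in (4, 8, 16):
    print(f"{n} components: {n * (n - 1) // 2} point-to-point links")
```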

NUMA https://en.wikipedia.org/wiki/Non-uniform_memory_access

Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor.

see more at: NUMA Collections (maxzcl's blog, CSDN)

Arithmetic Improvements

1. Deciding which datatypes to support

Another big decision is the arithmetic unit. The danger of requiring quantization ⑥ and the importance of backwards ML compatibility ④ meant retaining bfloat16 and fp32 from TPUv3 despite aiming at inference. As we also wanted applications quantized for TPUv1 to port easily to TPUv4i, TPUv4i also supports int8.

2. Improvements from XLA, the introduction of CMEM, and the choice of compiler compatibility

Our XLA colleagues suggested that they could handle twice as many MXUs in TPUv4i as they did for TPUv3 ②.

Logic improved the most in the more advanced technology node ①, so we could afford more MXUs. Equally important, the new CMEM could feed them (§5 and §7.A).

3. Reducing the critical path through the MXU

We also wanted to reduce the latency through the systolic array of the MXU while minimizing area and power. Rather than sequentially adding each floating-point multiplication result to the previous partial sum with a series of 128 two-input adders, TPUv4i first sums groups of four multiplication results together, and then adds them to the previous partial sum with a series of 32 two-input adders. This optimized addition cuts the critical path through the systolic array to ¼ the latency of the baseline approach.

Once we decided to adopt a four-input sum, we recognized the opportunity to optimize that component by building a custom four-input floating point adder that eliminates the rounding and normalization logic for the intermediate results. Although the new results are not numerically equivalent, eliminating rounding steps increases accuracy over the old summation logic. Fortunately, the differences from a four- versus two-input adder are small enough to not affect ML results meaningfully ④. Moreover, the four-input adder saved 40% area and 25% power relative to a series of 128 two-input adders. It also reduced overall MXU peak power by 12%, which directly impacts the TDP and cooling system design ⑤ because the MXUs are the most power-dense components of the chip.
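A sketch of the adder restructuring described above: latency tracks the number of dependent additions, and grouping products four at a time shrinks the dependent chain from 128 to 32 (plus a shallow in-group sum). Plain Python floats stand in for the hardware's floating-point units, so this does not model the dropped intermediate rounding:

```python
def chain_128(products, acc=0.0):
    # Baseline: 128 dependent two-input adds down the systolic column.
    for p in products:
        acc += p
    return acc

def chain_32_grouped(products, acc=0.0):
    # TPUv4i-style: sum each group of four first (in hardware, a custom
    # four-input adder with no intermediate rounding/normalization),
    # then 32 dependent two-input adds to the running partial sum.
    for i in range(0, len(products), 4):
        group = products[i] + products[i+1] + products[i+2] + products[i+3]
        acc += group
    return acc

vals = [0.001 * i for i in range(128)]
assert abs(chain_128(vals) - chain_32_grouped(vals)) < 1e-9
```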

Scaling

Workload Analysis

TPUv4i includes extensive tracing and performance-counter hardware features, particularly in the uncore. They are used by the software stack to measure and analyze system-level bottlenecks in user workloads and to guide continuous compiler-level and application-level optimizations (Figure 2). These features increase design time and area, but are worthwhile because we aim for Perf/TCO, not Perf/CapEx ③. The features enable significant system-level performance improvements and boost developer productivity over the lifetime of the product as DNN workloads grow and evolve (see Table 4) ⑦, ⑧, ⑨.

TPUv4/4i Performance

see the article for the various datasets and parameter-tweaking explorations that show how to find an optimal configuration for the TPU.
