GAMES104-GPU-Driven Geometry Pipeline-Nanite

王希-GAMES104-现代游戏引擎:从入门到实践。

资源

课程

“Long” Pipeline of Traditional Rendering

传统渲染的“长”管道

  • Compute unit works with graphics processor and rasterizer

    计算单元与图形处理器和光栅化器配合使用

  • It's a series of data processing units arranged in a chain like manner

    它是一系列以链式方式排列的数据处理单元

  • Difficult to fully fill the GPU

    很难完全填满 GPU

webp

Jungle of Direct Draw Graphics API

直接绘制图形的丛林 API

Explosion of DrawCalls:

DrawCalls 激增:

  • Meshes x RenderStates x LoDs x Materials x Animations

    网格 x 渲染状态 x LoD x 材质 x 动画

webp

Problem 1: A traditional DrawIndexedInstanced command requires 5 arguments assigned on CPU

问题 1:传统的 DrawIndexed Instanced 命令需要在 CPU 上分配 5 个参数

webp

Problem 2: Driver state switching overhead between amount of draw commands

问题 2:在绘制命令数量之间切换驱动程序状态的开销

Bottleneck of Traditional Rendering Pipeline

传统渲染管道的瓶颈

webp

When rendering complicated scene with dense geometries and many materials

渲染具有密集几何体和许多材质的复杂场景时

  • High CPU overload

    CPU 过载严重

  • Frustum/Occlusion Culling

    颅骨/闭塞切除术

  • Prepare drawcall

    准备图纸

  • GPU idle time

    GPU 空闲时

  • CPU can not follow up GPU

    CPU 无法跟踪 GPU

  • High driver overhead

    驾驶员头顶高度

  • GPU state exchange overhead when solving large amount of drawcalls

    解决大量 drawcall 时的 GPU 状态交换开销

Compute Shader - General Computation on GPU

计算着色器-GPU 上的通用计算

  • High-speed general purpose computing and takes advantage of the large numbers of parallel processors on the GPU

    高速通用计算,并利用 GPU 上的大量并行处理器

  • Less overhead to graphics pipeline

    减少图形管道的开销

  • Just one stage in pipeline

    只是管道中的一个阶段

webp

Draw-Indirect Graphics API

绘图-直接图形 API

Advantage:

优点:

  • Allow you to specify parameters to draw commands from a GPU buffer, or via GPU compute program

    允许您指定参数以从 GPU 缓冲区或通过 GPU 计算程序绘制命令

  • "Draw-Indirect" command can merge a lot of draw calls into one single draw call, even with different mesh topology

“绘制间接”命令可以将许多绘制调用合并到一个绘制调用中,即使使用不同的网格拓扑

Notice:

注意:

  • The actual name of "Draw-Indirect" is different in each graphics platform, but they act as the same role. (e.g. vkCmdDrawIndexedIndirect(Vulkan), ExecuteIndirect(D3D12), ...)

    “绘制间接”的实际名称在每个图形平台中都不同,但它们扮演着相同的角色。

webp

GPU Driven Render Pipeline – DrawPrimitive vs. DrawScene

GPU 驱动的渲染管道——DrawPrimitive 与 DrawScene

  • GPU controls what objects are actually rendered

    GPU 控制实际渲染的对象

    • Lod selection, visibility culling on GPU

      Lod 选择,GPU 上的可见性剔除

  • No CPU/GPU roundtrip

    无 CPU/GPU 往返

    • CPU do not touch any GPU data

      CPU 不接触任何 GPU 数据

  • N viewports/frustums

    N 个视口/视锥

Frees up the CPU to be used on other things, ie. AI

释放 CPU 以用于其他事情

webp

GPU Driven Pipeline in Assassins Creed

《刺客信条》中的 GPU 驱动流水线

Motivation

动机

  • Massive amounts of geometry: architecture, seamless interiors, crowds

    大量的几何形状:建筑、无缝的内部、人群

Use mesh cluster rendering to

使用网格簇渲染

  • Allow much more aggressive batching and culling granularity

    允许更严格的批处理和剔除粒度

  • Render different meshes efficiently with a single indirect draw command

    使用单个间接绘制命令高效渲染不同的网格

webp

Mesh Cluster Rendering

网格簇渲染

  • Require

    需要

    • Fixed cluster topology (E.g. 64 triangles in Assassin Creed or 128 triangles in Nanite)

      固定集群拓扑(例如,刺客信条中的 64 个三角形或 Nanite 中的 128 个三角形)

    • Split & rearrange all meshes to fit fixed topology (insert degenerate triangles)

      拆分并重新排列所有网格以适应固定拓扑(插入退化三角形)

    • Fetch vertices manually in VS

      在 VS 中手动获取顶点

  • Key Implementation

    关键实施

    • Cull clusters by their bounding on GPU (usually by compute shader)

      通过 GPU 上的边界(通常通过计算着色器)剔除集群

    • GPU outputs culled cluster list & drawcall args

      GPU 输出精选集群列表和绘图调用参数

    • Draw arbitrary number of visible clusters in single drawcall

      在单个 drawcall 中绘制任意数量的可见簇

webp

GPU Driven Pipeline in Assassins Creed

《刺客信条》中的 GPU 驱动流水线

  • Overview

    概述

    • Offload more work from CPU to GPU

      将更多工作从 CPU 转移到 GPU

    • But not perfectly "draw scene" command, can only draw objects with the same material together

      但不是完美的“绘制场景”命令,只能将具有相同材质的对象绘制在一起

webp

Works on CPU side

webp

在 CPU 端工作

  • Perform very coarse frustum culling and then batch all unculled objects together by material

    执行非常粗略的截头体剔除,然后按材质将所有未剔除的对象批处理在一起

    • CPU quad tree culling

      CPU 四叉树剔除

    • Drawcalls merged based on hash that build on noninstanced data:(e.g. material, renderstate, …).

      基于非实例化数据构建的哈希合并的绘图调用:(例如材质、渲染状态等)。

  • Update per instance data(e.g. transform, LOD factor...),static instances are persistent

    更新每个实例的数据(例如转换、LOD 因子等),静态实例是持久的

GPU Instance Culling

GPU 实例剔除

  • Output cluster chunks after instance culling

    实例剔除后输出集群块

  • Use the cluster chunk expansion (64 cluster in a chunk) to balance GPU threads within a wavefront.

    使用集群块扩展(一个块中有 64 个集群)来平衡波阵面内的 GPU 线程。

webp

GPU Cluster Culling

GPU 集群剔除

  • Cluster culling by cluster bounding box

    通过聚类边界框进行聚类剔除

    • output cluster list

      输出集群列表

  • Triangle backface culling

    三角形背面剔除

    • output triangle visibility result and write offsets in new index buffer
webp

Index Buffer Compaction

索引缓冲区压缩

  • Prepare a empty index buffer(8Mb) and per-assign space for each mesh instance

    为每个网格实例准备一个空的索引缓冲区(8Mb)和每个分配的空间

  • Parallel copy the visible triangles index into the new index buffer

    将可见三角形索引并行复制到新索引缓冲区中

  • Index buffer compaction and multi-draw rendering can be interleaved because of fixed size of new index buffer (8Mb)

    由于新索引缓冲区的大小固定(8Mb),索引缓冲区压缩和多绘制渲染可以交错进行

webp

Codec Triangle Visibility in Cube: Backface Culling

立方体中的编解码器三角形可见性:背面消隐

  • Bake triangle visibility for pixel frustums of cluster centered cubemap

    簇中心立方体贴图像素平截头体的烘焙三角形可见性

  • Cubemap lookup based on camera

    基于相机的立方体贴图查找

  • Fetch 64 bits for visibility of all triangles in cluster

    获取 64 位以查看集群中所有三角形的可见性

webp webp

Occlusion Culling for Camera and Shadow

相机和阴影的遮挡抑制

Occlusion Depth Generation

遮挡深度生成

  • Depth pre-pass with best occluders in full resolution

    全分辨率下使用最佳封堵器进行深度预处理

    • Choose best occluders by artist or heuristic (e.g. 300 occluders)

      按艺术家或启发式方法选择最佳封堵器(例如 300 个封堵器)

    • Holes can be from rejected occluder (bad occluder selection or alpha-tested geometry)

      孔可能来自被拒绝的封堵器(封堵器选择不当或阿尔法测试几何形状)

  • Downsampled best occluders depth to 512x256

    下采样最佳封堵器深度为 512x256

  • Then combined with reprojection of 1/16 low resolution version of last frame’s depth

    然后结合最后一帧深度的 1/16 低分辨率版本的重新投影

    • Last frame’s depth can helped to filled with holes.

      最后一帧的深度有助于填补孔洞。

    • False occlusion is possible from large moving objects in the last frame’s depth, but works in most cases.

      在最后一帧的深度中,大型移动对象可能会出现假遮挡,但在大多数情况下都是有效的。

  • Generate hierarchy-Z buffer for GPU culling

    生成用于 GPU 剔除的分层 Z-缓冲区

webp

Two-Phase Occlusion Culling

两相闭塞消隐

1st phase

第一阶段

Cull objects & clusters using last frame’s depth pyramid

使用上一帧的深度金字塔剔除对象和簇

Render visible objects

渲染可见对象

2nd phase

第二阶段

Refresh depth pyramid

刷新深度金字塔

Test culled objects & clusters

测试剔除的物体和集群

Render false negatives

渲染假阴性

webp

Crazy Stressing Cases

疯狂压力案例

  • “Torture” unit test scene 250,000separate moving objects

    “酷刑”单元测试场景 250000 个独立移动物体

  • 1GB of mesh data (10k+ meshes)

    1GB 网格数据(10k+ 网格)

  • 8k2 texture cache atlas

    8k2 纹理缓存图集

  • DirectX 11 code path

    DirectX 11 代码路径

  • 64 vertex clusters (strips)

    64 个顶点簇(条带)

  • No ExecuteIndirect / MultiDrawIndirect

    不执行间接/多画间接

  • Only two DrawInstancedIndirect calls

    只有两个 DrawInstancendirect 调用

webp webp

Fast Occlusion for Shadow

快速遮挡阴影

For each cascade

对于每个级联

  • Generate camera depth reprojection (64x64 pixel)

    生成相机深度重投影(64x64 像素)

  • Then combine with last frame's shadow depth reprojection

    然后结合上一帧的阴影深度重投影

  • Generate hierarchy-Z buffer for GPU culling

    生成用于 GPU 剔除的分层 Z-缓冲区

webp

Camera Depth Reprojection for Shadow Culling

用于阴影消隐的相机深度重投影

Motivation

动机

  • It is essential to cull objects in light view, which does not cast a visible shadow

    在光线视角下剔除物体至关重要,因为光线视角不会投射出可见的阴影

Implementation

实施

  • Get camera visible areas that may appear shadow

    获取可能出现阴影的相机可见区域

  • For each 16*16screen tile, construct a cube (each yellow frustum) according to min/max depth in this tile.

    对于每个 16*16 的屏幕图块,根据该图块中的最小/最大深度构造一个立方体(每个黄色平截头体)。

  • Render max depth of these cubes in the light view

    在灯光视图中渲染这些立方体的最大深度

  • All objects that far from depth can be culled because they certainly do not contribute to visible shadow

    所有远离深度的物体都可以被剔除,因为它们肯定不会产生可见的阴影

webp

Best Cases of Camera Depth Reprojection

相机深度重投影的最佳案例

webp

Visibility Buffer

可见性缓冲区

Recap - Deferred Shading, G-Buffer

回顾-延迟着色,G-缓冲区

  • Forward rendering shades all fragments in triangle- submission order

    正向渲染为三角形中的所有片段着色-提交顺序

  • Wastes rendering power on pixels that don’t contribute to the final image

    在对最终图像没有贡献的像素上浪费渲染能力

  • Deferred shading solves this problem in 2steps:

    延迟着色通过两个步骤解决了这个问题:

  • First, surface attributes are stored in screen buffers -> G-Buffer

    首先,曲面属性存储在屏幕缓冲区 -> G-Buffer 中

  • Second, shading is computed for visible fragments only

    其次,仅对可见片段计算着色

Deferred Shading

延迟渲染

webp

Fat G-Buffer of Deferred Shading

延迟遮光的 Fat G-Buffer

  • However, deferred shading increases memory bandwidth consumption:

    但是,延迟着色会增加内存带宽消耗:

    • Screen buffers for: normal, depth, albedo, material ID,…

      屏幕缓冲区用于:正常、深度、反照率、材质 ID,…

    • G-Buffer size becomes challenging at high resolutions

      G-缓冲区大小在高分辨率下变得具有挑战性

webp

Challenges of Complex Scene

复杂场景的挑战

webp webp

Visibility Buffer - Filling

可见性缓冲区-填充

  • Visibility Buffer generation step

    可见性缓冲区生成步骤

  • For each pixel in screen:

    对于屏幕中的每个像素:

    • Pack (alpha masked bit, drawID, primitiveID) into one 32-bit UINT

    • 将(alpha 掩码位、drawID、primitiveID)打包成一个 32 位的 UINT

    • Write that into a screen-sized buffer

      将其写入屏幕大小的缓冲区

  • The tuple (alpha masked bit, drawID, primitiveID) will allow a shader to access the triangle data in the shading step

    元组(alpha 掩码位、drawID、primitiveID)将允许着色器在着色步骤中访问三角形数据

webp

Visibility Buffer - Shading

可见性缓冲区-着色

  • For each pixel in screen-space we do:

    对于屏幕空间中的每个像素,我们做:

  • Get drawID/triangleID at pixel pos

    在像素位置获取 drawID/trangleID

  • Load data for the 3 vertices from the VB

    从 VB 加载 3 个顶点的数据

  • Compute triangle gradients

    计算三角形梯度

  • Interpolate vertex attributes at pixel pos using gradients

    使用渐变在像素位置插值顶点属性

    • Attribs use w from position to compute perspective correct interpolation

      属性使用 w 从位置计算透视校正插值

    • MVP matrix is applied to position

      MVP 矩阵应用于定位

  • We have all data ready: shade and calculate final color

    我们已经准备好所有数据:阴影和计算最终颜色

Pipeline of Visibility Buffer

可见性缓冲管道

webp

Visibility Buffer + Deferred Shading

可见性缓冲区+延迟着色

webp

Correct Texture Mipmap with Gradient Without

无渐变的正确纹理 Mipmap

webp

Results

Total

  • 8 Million Triangles
  • 5 Million Vertices
webp

Visibility Buffer

GPU AMD RADEON R9 3801080p1440p2160p
No MSAA8.5710.7215.19
No MSAA – No Culling14.5215.8620.45
2x MSAA11.4416.3825.87
4x MSAA15.2720.8237.86

Deferred Shading

GPU AMD RADEON R9 3801080p1440p2160p
No MSAA9.7512.3020.19
No MSAA – No Culling14.1616.624.06
2x MSAA16.1623.0942.68
4x MSAA24.9036.3769.64

Virtual Geometry - Nanite

虚拟几何-Nanite

Challenges of Realistic Rendering

webp

Nanite Overview

Nanite 概述

  • Overview

    概述

  • Geometry Representation

    几何表示法

    • Cluster-based LoD

      基于集群的 LoD

    • BVH and runtime LoD

      BVH 和运行时间 LoD

  • Rendering

    渲染图

    • Software and Hardware Rasterization

      软件和硬件光栅化

    • Visibility Buffer

      可见性缓冲区

    • Deferred Materials

      递延材质

    • Tile-based Acceleration

      基于瓷砖的加速

  • Virtual Shadow Map

    虚拟阴影贴图

  • Streaming and Compression

    流媒体和压缩

Virtual Texture

虚拟纹理

  • Build a virtual indexed texture to represent all blended terrain materials for whole scene

    构建一个虚拟索引纹理,以表示整个场景的所有混合地形材质

  • Only load materials data of tiles based on view depend LOD

    仅加载基于视图的 LOD 的瓷砖材质数据

  • Pre-bake materials blending into tile and store them into physical textures

    预烘烤材质混合到瓷砖中,并将其储存成物理纹理

webp

The Dream

理想

  • Virtualize geometry like we did textures

    像纹理一样虚拟化几何体

    • No more budgets

      没有更多预算

      • Poly count

        多边形计数

      • Draw calls

      • Memory

        记忆

  • Directly use film quality source art

    直接使用电影质量的源艺术

    • No manual optimization required

      无需手动优化

  • No loss in quality

    质量无损失

Reality

现实

  • MUCH harder than virtual texturing

    比虚拟纹理硬得多

    • Not just memory management

      不仅仅是内存管理

    • Geometry detail directly impacts rendering cost

      几何体细节直接影响渲染成本

    • Geometry is not trivially filterable (SDF, Voxels, Point Clouds)

      几何体不能轻易过滤(SDF、体素、点云)

Voxels?

体素?

  • Spatially uniform data distribution

    空间均匀的数据分布

  • Big memory consumption

    内存消耗大

  • Attribute leaking

    属性泄漏

webp
  • Not interested in completely changing all CG workflow

    对完全改变所有 CG 工作流程不感兴趣

    • Support importing meshes authored anywhere

      支持导入在任何地方编写的网格

    • Still have UVs and tiling detail maps

      仍然有 UV 和平铺细节贴图

    • Only replacing meshes, not textures, not materials, not tools

      仅替换网格,不替换纹理,不替换材质,不替换工具

  • Never ending list of hard problems

    永无止境的难题清单

Subdivision Surfaces?

细分曲面?

  • Subdivision by definition is amplification only

    根据定义,细分只是放大

  • Great for up close but doesn't get simpler than base mesh

    非常适合近距离拍摄,但不会比基础网眼更简单

  • Sometimes produces an excessive number of triangles

    有时会产生过多的三角形

webp

Maps-based Method?

基于地图的方法?

  • Works well for organic surfaces that already are uniformly sampled

    适用于已经均匀取样的有机表面

  • Difficult to control hard surface features

    难以控制的硬表面特征

  • Sometimes object surface is not connected

    有时物体表面不连接

webp

Point Cloud?

点云?

  • Massive amounts of overdraw

    大量透支

  • Requires hole filling

    需要补孔

webp

https://highperformancegraphics.org/slides22/Software_Rasterization_of_2_Billion_Points_in_Real_Time.pptx

webp

Foundation of Computer Graphics

计算机图形学基础

  • The most elemental, atomic unit of surface area in 3D space

    三维空间中最基本的原子表面积单位

  • Every surface can be turned into triangles

    每个曲面都可以变成三角形

webp

Nanite Geometry Representation

Nanite 几何表示法

Screen Pixels and Triangles

屏幕像素和三角形

  • Linear scaling in instances can be ok

    实例中的线性缩放是可以的

  • Linear scaling in triangles is not ok

    三角形中的线性缩放是不合适的

Why should we draw more triangles than screen pixels?

为什么我们应该绘制比屏幕像素更多的三角形?

webp

Represent Geometry by Clusters

按簇表示几何体

webp

View Dependent LOD Transitions – Better than AC Solutions

视图相关 LOD 转换——优于 AC 解决方案

webp

Similar Visual Apperance with 1/30 Rendering Cost!

相似的视觉效果,渲染成本为 1/30!

webp

Naïve Solution - Cluster LoD Hierarchy

幼稚的解决方案-集群 LoD 层次结构

  • Decide LOD on a cluster basis

    基于集群确定 LOD

  • Build a hierarchy of LODs

    构建 LOD 层次结构

    • Simplest is tree of clusters

      最简单的是集群树

    • Parents are the simplified versions of their children

      父母是孩子的简化版本

webp

Naïve Solution - Decide Cluster LOD Run-time

天真的解决方案-确定集群 LOD 运行时间

  • Find cut of the tree for desired LOD

    找到所需 LOD 的树木切割

  • View dependent based on perceptual difference

    基于感知差异的视图依赖

webp

Naïve Solution – Simple Streaming Idea

天真的解决方案——简单的流媒体创意

  • Entire tree doesn't need to be in memory at once

    整个树不需要一次出现在内存中

  • Can mark any cut of the tree as leaves and toss the rest

    可以将树上的任何切口标记为叶子,然后扔掉剩下的

  • Request data on demand during rendering

    渲染过程中按需请求数据

    • Like virtual texturing

      类似于虚拟纹理

webp

But, How to Handle LOD Cracks

但是,如何处理 LOD 裂缝

  • If each cluster decides LOD independent from neighbors, cracks!

    如果每个集群独立于邻居决定 LOD,那么就会破裂!

  • Naive solution:

    天真的解决方案:

    • Lock shared boundary edges during simplification

      在简化过程中锁定共享边界边

    • Independent clusters will always match at boundaries

      独立集群将始终在边界处匹配

webp

Locked Boundaries? Bad Results

锁定边界?糟糕的结果

  • Collects dense cruft

    收集稠密的原油

  • Especially between deep subtrees

    尤其是在深子树之间

webp webp

Nanite Solution - Cluster Group

Nanite 解决方案-集群集团

  • Can detect these cases during build

    可以在构建过程中检测到这些情况

  • Group clusters

    集团集群

  • Force them to make the same LOD decision

    迫使他们做出相同的 LOD 决定

  • Now free to unlock shared edges and collapse them

    现在可以自由解锁共享边并折叠它们

webp

Build Operations

构建操作

  • Pick grouped these 4 adjacent clusters

    将这 4 个相邻的集群进行分组

  • Merge and Simplify the clusters to half the number of triangles

    将簇合并并简化为三角形数量的一半

  • Split simplified triangle list back into 2 new clusters

    将简化的三角形列表拆分回 2 个新集群

  • We now have reduced 4 4-triangle clusters to 2 4-triangle clusters

    我们现在已经将 4 个 4 三角聚类减少到 2 个 4 三角群集

webp
  • Cluster original triangles

    对原始三角形进行聚类

  • While NumClusters > 1

    当 NumClusters > 1 时

    • Group clusters to clean their shared boundary

      将集群分组以清理其共享边界

    • Merge triangles from group into shared list

      将组中的三角形合并到共享列表中

    • Simplify to 50% the number of triangles

      将三角形的数量简化到 50%

    • Split simplified triangle list into clusters (128 tris)

      将简化的三角形列表拆分为簇(128 个 tris)

Build Clusters

构建集群

webp

Simplification on Cluster Group

集群组的简化

webp webp

Alternate Group Boundaries between Levels

级别之间的备用组边界

  • The key idea is to alternate group boundaries from level to level by grouping different clusters.

    关键思想是通过对不同集群进行分组,在不同级别之间交替设置组边界。

  • A boundary in one level becomes the interior in the next level

    一层中的边界成为下一层的内部

  • Locked one level, unlocked the next

    锁定一个级别,解锁下一个级别

webp

Cluster group boundaries for LoD0

LoD0 的集群组边界

Cluster group boundaries for LoD1

LoD1 的集群组边界

Cluster group boundaries for LoD2

LoD2 的集群组边界

DAG for Cluster Groups

集群组的 DAG

  • Merge and split makes this a DAG instead of a tree

    合并和拆分使其成为 DAG 而不是树

    • This is a good thing in that you can’t draw a line from LOD0 all the way to the root without crossing an edge

      这是一件好事,因为你不能在不穿过边的情况下从 LOD0 一直画到根部

    • Meaning there can’t be locked edges that stay locked and collect cruft

      这意味着不可能有锁定的边缘保持锁定并收集碎屑

webp

Why DAG, not Tree (Trap!)

为什么是 DAG,而不是树(陷阱!)

Jungle of clusters, group and their links

集群、群体及其联系的丛林

webp

Let’s Chop the Lovely Bunny

webp

Detail of Simplification - QEM

简化细节-QEM

webp webp

Runtime LoD Selection

运行时 LoD 选择

View-Dependent LoD Selection on DAG?

DAG 上的视图相关 LoD 选择?

Group is faster than cluster, but DAG is still very complicated

组比簇快,但 DAG 仍然非常复杂

webp

LOD Selection for Cluster Group

聚类组的 LOD 选择

  • Two submeshes with same boundary, but different LOD

    具有相同边界但 LOD 不同的两个子板

  • Choose between them based on screen-space error

    根据屏幕空间错误在它们之间进行选择

    • Error calculated by simplifier projected to screen

      投影到屏幕上的简化器计算误差

    • Corrected for distance and angle distortion at worst-case point in sphere bounds

      针对球体边界中最坏情况点的距离和角度失真进行了校正

  • All clusters in group must make same LOD decision

    组中的所有集群都必须做出相同的 LOD 决策

    • How? Communicate? No!

      怎么办?沟通?不!

    • Same input => same output

      相同的输入 => 相同的输出

webp

LOD Selection in Parallel

并行 LOD 选择

  • LOD selection corresponds to cutting the DAG

    LOD 选择对应于切割 DAG

    • How to compute in parallel?

      如何并行计算?

    • Don’t want to traverse the DAG at run-time

      不想在运行时遍历 DAG

  • What defines the cut?

    切割的定义是什么?

    • Difference between parent and child

      父母和孩子的区别

  • Draw a cluster when:

    在以下情况下绘制集群:

    • Parent error is too high && Our error is small enough

      父错误太高 & 我们的错误足够小

    • Can be evaluated in parallel!

      可以并行评估!

webp
  • Only if there is one unique cut

    只有当有一个独特的切割

    • Force error to be monotonic

      力误差为单调

  • Parent view error >= child view error

    父视图错误 >= 子视图错误

  • Careful implementation to make sure runtime correction is also monotonic

    仔细实施以确保运行时校正也是单调的

webp

Core Equation of Parallel LoD Selection for Cluster Groups

集群并行 LoD 选择的核心方程

  • When can we LOD cull a cluster?

    我们什么时候可以 LOD 剔除集群?

    • Render: ParentError > threshold && ClusterError <= threshold

      渲染:父错误 > 阈值 && ClusterError <= 阈值

    • Cull: ParentError <= threshold || ClusterError > threshold

  • Parent is already precise enough. No need to check child

    家长已经足够精确了。无需检查孩子

    • ParentError <= threshold

      父错误 <= 阈值

    • Tree based on ParentError, not ClusterError!

      基于 ParentError 的树,而不是 ClusterError!

Isolated LoD Selection for Each Cluster Group

每个集群组的独立 LoD 选择

  • Render: ParentError > threshold && ClusterError <= threshold
  • Cull: ParentError <= threshold || ClusterError > threshold
webp

BVH Acceleration for LoD Selection

用于 LoD 选择的 BVH 加速

Really Bad Explanation of Why and How about BVH

关于 BVH 的原因和原因的糟糕解释

  • BVH4

    • Max of children’s ParentError

      儿童父母最大错误

    • Internal node: 4 children nodes

      内部节点:4 个子节点

    • Leaf node: List of clusters in group

      叶子节点:组中的簇列表

Build BVH for Acceleration of LoD Selection

构建 BVH 以加速 LoD 选择

  • 7,000,000 triangles will create 110,000 clusters

    7000000 个三角形将创建 110000 个簇

  • Iterating all cluster/cluster groups is too slow

    迭代所有集群/集群组太慢

  • Let’s build BVH for each LoD cluster groups

    让我们为每个 LoD 集群组构建 BVH

webp

Balance BVH for 4 Nodes

平衡 4 个节点的 BVH

webp

Detail of BVH Acceleration

  • total 110437 clusters,

    总共 110437 个簇,

  • check bvh node = 107, check cluster = 4240,

    检查 bvh 节点 = 107、检查簇 = 4240

  • select cluster = 2175

    选择群集 = 2175

webp

Hierarchical Culling - Naive Approach

分层剔除-朴素方法

  • Dependent DispatchIndirects

    依赖视差间接

    • One per level

      每层一个

  • Global synchronization

    全局同步

    • Wait for idle between every level

      在每个级别之间等待空闲

  • Worst case # of levels

    最坏情况下的级别数量

    • Empty dispatches at the end

      末尾为空调度

  • Can be mitigated by higher fanout

    可以通过更高的扇出来缓解

    • Wasteful for small/distant objects

      对小/远距离物体浪费

webp

Persistent Threads

持久线程

  • Ideally

    理想情况下

    • Start on child as soon as parent finished

      父母一完成,就从孩子开始

    • Spawn child threads directly from compute

      直接从计算中生成子线程

  • Persistent threads model instead

    改为持久线程模型

    • Can't spawn new threads. Reuse them instead!

      无法生成新线程。重复使用它们!

    • Manage our own job queue

      管理我们自己的作业队列

    • Single dispatch with enough worker threads to fill GPU

      单分派,具有足够的工作线程来填充 GPU

    • Use simple multi-producer multi-consumer (MPMC) job-queue to communicate between threads

      使用简单的多生产者多消费者(MPMC)作业队列在线程之间进行通信

webp

Nanite Rasterization

Nanite 光栅化

Pixel Scale Detail

像素比例细节

  • Can we hit pixel scale detail with triangles > 1 pixel?

    我们可以用大于 1 像素的三角形来达到像素级的细节吗?

  • Depends how smooth

    取决于平滑程度

  • In general no

    一般来说,没有

  • Need to draw pixel sized triangles

    需要绘制像素大小的三角形

webp

Hardware Rasterization

硬件光栅化

  • HW Rasterization unit is quad (2x2 pixels) for ddx and ddy

    对于 ddx 和 ddy,HW 光栅化单元为四边形(2x2 像素)

  • Need help pixels (yellow) to form quads

    需要帮助像素(黄色)来形成四边形

webp
  • Use 4x4 tiled traversal to accelerate

    使用 4x4 平铺遍历来加速

webp
  • A lot of wasting for small triangle

    小三角形浪费很多

  • tiled traversal stage is useless

    平铺遍历阶段毫无用处

  • quad generate 4x pixels than its really covered

    四边形产生的像素比实际覆盖的像素多 4 倍

webp

Software Rasterization for Tiny Triangles

微小三角形的软件光栅化

  • Terrible for typical rasterizer

    对于典型的光栅化器来说很糟糕

  • Typical rasterizer:

    典型光栅化器:

    • Macro tile binning

      宏平铺

    • Micro tile 4x4

      微型瓷砖 4x4

    • Output 2x2 pixel quads

      输出 2x2 像素四边形

    • Highly parallel in pixels not triangles

      像素高度平行,而非三角形

  • Modern GPUs setup 4 tris/clock max

    现代 GPU 设置最大 4 tris / 时钟

    • Outputting SV_PrimitiveID makes it even worse

      输出 SV_PrimitiveID 会使情况变得更糟

  • Can we beat the HW rasterizer in SW?

    我们能在软件中击败硬件光栅化器吗?

3x faster!

webp

Nanite – Rasterization

Nanite-光栅化

  • Only rasterize 1 pixel when the triangle size smaller than 1 pixel in Shader function

    当着色器函数中的三角形尺寸小于 1 像素时,仅光栅化 1 像素

  • We will save 3 pixels compute resources if the triangle only covered in 1 pixel

    如果三角形只覆盖 1 个像素,我们将节省 3 个像素的计算资源

  • Reconstruct derivatives for ddx/ddy

    重建 ddx/ddy 的导数

webp

Scanline Software Rasterizer

扫描线软件光栅化器

  • Per-cluster based rasterization selection

    基于每个集群的光栅化选择

    • All edges of cluster <18 pixels are SW rasterized

      所有小于 18 像素的簇边缘都进行了 SW 光栅化

  • Iterate over the rect tests a lot of pixels

    迭代 rect 测试大量像素

  • Best case half are covered

    最好的一半都包括在内

  • Worst case none are

    最坏的情况是没有

  • Scanline method is a choice

    扫描线方法是一种选择

webp

How To Do Depth Test?

如何进行深度测试?

  • Don't have ROP or depth test hardware

    没有 ROP 或深度测试硬件

  • Need Z-buffering

    需要 Z-缓冲

    • Can't serialize at tiles

      无法在图块上序列化

    • Many tris may be in parallel for single tile or even single pixel

      对于单个图块甚至单个像素,许多 tris 可能是并行的

  • Use 64 bit atomics!

    使用 64 位原子!

32257
Depth
深度
Visible cluster index
可见集群索引
Triangle index
三角形索引
  • InterlockedMax

    联锁 Max

    • Visibility buffer shows its true power

      可见性缓冲显示其真正的力量

Nanite Visibility Buffer

Nanite 可见性缓冲区

NumberBits32257
Type
类型
Depth
深度
Visible cluster index
可见集群索引
Triangle index
三角形索引
webp
  • Write geometry data to screen

    将几何数据写入屏幕

    • Depth : InstanceID : TriangleID

      深度:实例 ID:三角形 ID

  • Material shader per pixel:

    每像素材质着色器:

    • Load VisBuffer

      加载 VisBuffer

    • Load instance transform

      加载实例转换

    • Load 3 vert indexes

      加载 3 个涵洞索引

    • Load 3 positions

      加载 3 个位置

    • Transform positions to screen

      将位置转换到屏幕

    • Derive barycentric coordinates for pixel

      推导像素的重心坐标

    • Load and lerp attributes

      加载和 lerp 属性

  • Sounds crazy? Not as slow as it seems

    听起来很疯狂?没有看起来那么慢

    • Lots of cache hits

      大量缓存命中

    • No overdraw or pixel quad inefficiencies

      没有过度绘制或像素四边形效率低下

  • Material pass writes GBuffer

    物料传递写入 GBuffer

    • Integrates with rest of our deferred shading renderer

      与我们的其他延迟着色渲染器集成

  • Draw all opaque geometry with 1 draw

    用 1 次绘制绘制所有不透明几何体

    • Completely GPU driven

      完全由 GPU 驱动

    • Not just depth prepass

      不仅仅是深度预付

    • Rasterize triangles once per view

      每个视图对三角形进行一次栅格化

Hardware Rasterization

硬件光栅化

  • What about big triangles?

    大三角形呢?

  • Use HW rasterizer

    使用硬件光栅化器

  • Choose SW or HW per cluster

    为每个集群选择软件或硬件

  • Also uses 64b atomic writes to UAV

    还使用 64b 原子写入无人机

webp

Imposters for Tiny Instances

微小实例的冒名顶替者

  • 12 x 12 view directions in atlas

    图集中 12 x 12 个视图方向

    • XY atlas location octahedral mapped to view direction

      XY 图集位置八面体映射到视图方向

    • Dithered direction quantization

      离散方向量化

  • 12 x 12 pixels per direction

    每个方向 12 x 12 像素

    • Orthogonal projection

      正交投影

    • Minimal extents fit to mesh AABB

      最小范围适合网眼 AABB

    • 8:8 Depth, TriangleID

      8:8 深度,三角形 ID

    • 40.5KB per mesh always resident

      每个网格始终驻留 40.5KB

  • Ray march to adjust parallax between directions

    光线行进以调整方向之间的视差

    • Few steps needed due to small parallax

      由于视差小,需要很少的步骤

  • Drawn directly from instance culling pass

    直接从实例剔除过程中提取

    • Bypassing visible instances list

      绕过可见实例列表

  • Would like to replace with something better

    想换个更好的

webp

Rasterizer Overdraw

光栅过冲

  • No per triangle culling

    无每个三角形的剔除

  • No hardware HiZ culling pixels

    无硬件 HiZ 剔除像素

  • Our software HZB is from previous frame

    我们的软件 HZB 来自上一帧

    • Culls clusters not pixels

      剔除聚类而非像素

    • Resolution based on cluster screen size

      基于集群屏幕大小的分辨率

  • Excessive overdraw from:

    过度透支来自:

    • Large clusters

      大型集群

    • Overlapping clusters

      重叠集群

    • Aggregates

      骨料

    • Fast motion

      快速运动

  • Overdraw expense

    超支费用

    • Small tris: Vertex transform and triangle setup bound

      小三角:顶点变换和三角形设置边界

    • Medium tris: Pixel coverage test bound

      中等分辨率:像素覆盖率测试范围

    • Large tris: Atomic bound

      大三体:原子束缚

webp

Nanite Deferred Material

Nanite 递延材质

Deferred Material

递延材质

  • Nanite want to support full artist created pixel shaders

    Nanite 希望支持完全由艺术家创建的像素着色器

  • In theory, all materials could be applied in a single pass, but there are complexities and inefficiencies there

    理论上,所有材质都可以一次性使用,但存在复杂性和效率低下的问题

webp

Material Shading

材质着色

  • Common method

    常用方法

    • Draw a full screen quad per unique material

      为每种独特材质绘制全屏四边形

    • Skip pixels not matching this material

      跳过与此材质不匹配的像素

  • Disadvantages

    缺点

    • CPU unaware if some materials have no visible pixels (unfortunate side effect of GPU driven)

      CPU 不知道某些材质是否没有可见像素(GPU 驱动的不幸副作用)

    • So unnecessary drawing instructions will be committed

      因此,将提交不必要的图纸说明

Shading Efficiency

遮光效率

  • Hardware depth test!

    硬件深度测试!

    • Convert material ID to depth value

      将材质 ID 转换为深度值

webp

Shading

  • Then draw a full screen quad and set depth test function to "equal", so unmatched pixels will be discarded

    然后绘制一个全屏四边形,并将深度测试功能设置为“相等”,这样不匹配的像素将被丢弃

  • But full screen quad is not necessary and can be improved!

    但全屏四屏不是必需的,可以改进!

Material Sorting with Tile-Based Rendering

基于平铺渲染的材质排序

  • We can do a screen tile material classification

    我们可以做一个筛网材质分类

  • For a certain material, exclude tiles that do not contain this material

    对于某种材质,排除不包含此材质的瓷砖

webp

Material Classify

材质分类

webp

Material Classify - Material Tile Remap Table

材质分类-材质瓷砖重绘表

  • Finally forms a material and tile remap table

    最后形成材质和瓷砖重映射表

  • Get the number of tiles based on the screen resolution and pack 32 tiles into a group

    根据屏幕分辨率获取图块数量,并将 32 个图块打包成一组

  • 'MaterialRemapCount' means the number of groups

    “MaterialRemapCount” 是指组的数量

  • Record the tiles in which a material is located by marking it by bit

    通过逐点标记来记录材质所在的瓷砖

  • This table can be used to calculate the tile position to render to

    此表可用于计算要渲染的图块位置

webp

Deferred Material Overall Process

递延材质整体流程

  • Generate material resolve texture

    生成材质解析纹理

  • Generate material depth texture

    生成材质深度纹理

  • Classify screen tile materials

    对筛网材质进行分类

  • Generate G-Buffer

    生成 G-缓冲区

    • This will be output to the g-buffer to match with the rest of the pipeline

      这将被输出到 g-buffer,以与管道的其余部分相匹配

    • Commit drawing commands per material

      按材质提交绘图命令

webp
C++
void UnpackMaterialResolve(uint Packed,
	out bool IsNanitePixel,
	out bool IsDecalReceiver,
	out uint MaterialSlot)
{
	IsNanitePixel = BitFieldExtractU32(Packed,10) != 0;
    MaterialSlot = BitFieldExtractU32(Packed, 14, 1);
    IsDecalReceiver = BitFieldExtractU32(Packed, 1, 15) != 0:

Shadows

Micropoly Level Detail for Shadows

阴影的微多层细节

webp

Nanite Shadows - Ray Trace?

Nanite 阴影-射线追踪?

  • Ray trace?

    射线追踪?

  • There are more shadow rays than primary since there are on average more than 1 light per pixel

    由于每个像素平均有 1 个以上的光,因此阴影光线比主光线多

  • Custom triangle encoding

    自定义三角形编码

  • No partial BVH updates

    无部分 BVH 更新

  • HW triangle formats + BLAS (bottom level acceleration structure) currently are 3-7x the size of Nanite data

    HW 三角形格式 + BLAS(底层加速结构)目前是 Nanite 数据大小的 3-7x

webp

RTX 40XX,50XX? Radeon RX 70XX...?

Recap Cascaded Shadow Map

回顾级联阴影图

  • Relatively coarse LOD control

    LOD 控制相对粗糙

  • If better shadow detail is desired, there is still significant memory consumption

    如果需要更好的阴影细节,仍然会消耗大量内存

webp

Sample Distribution Shadow Maps

示例分布阴影图

  • Gives a better cascaded map coverage by analysing the range of screen pixel depths

    通过分析屏幕像素深度范围,提供更好的级联地图覆盖率

  • An optimized cascaded shadow map but still has coarse LOD control

    优化的级联阴影贴图,但仍具有粗略的 LOD 控制

webp webp

Virtual Shadow Map - A Cached Shadow System!

虚拟阴影地图-缓存的阴影系统!

  • Most lights don't move, should be cached as much as possible

    大多数灯光不会移动,应尽可能缓存

webp

Virtual Shadow Maps

虚拟阴影地图

  • 16k x 16k virtual shadow map for each light (exception, point light with 6 VSMs)

    每个灯光的 16k x 16k 虚拟阴影贴图(具有 6 个 VSM 的点光源除外)

webp

Different Light Type Shadow Maps

不同的灯光类型阴影贴图

webp

Shadow Page Allocation

影子页面分配

  • Only visible shadow pixels need to be cached

    只需要缓存可见的阴影像素

    • For each pixel on screen

      对于屏幕上的每个像素

    • For all lights affecting this pixel

      对于影响此像素的所有灯光

    • Project the position into shadow map space

      将位置投影到阴影贴图空间

    • Pick the mip level where 1 texel matches the size of 1screen pixel

      选择 1 个纹理像素与 1 个屏幕像素大小匹配的 mip 级别

    • Mark the page as needed

      根据需要标记页面

    • Allocate physical page space for uncached pages

      为未缓存的页面分配物理页面空间

Shadow Page Table and Physical Pages Pool

影子页表和物理页池

webp

Shadow Page Cache Invalidation

卷影页缓存无效

  • Camera movement, if the movement is relatively smooth, there will not be many pages to update

    相机移动,如果移动相对平稳,就不会有很多页面需要更新

  • Any light movement or rotation will invalidate all cached pages for that light

    任何灯光移动或旋转都会使该灯光的所有缓存页面无效

  • Geometry that casts shadows moving, or being added or removed from the scene will invalidate any pages that overlap its bounding box from the light's perspective

    投射阴影的几何体在场景中移动、添加或删除,将使从灯光角度与其边界框重叠的任何页面无效。

  • Geometry using materials that may modify mesh positions

    使用可能修改网格位置的材质的几何体

  • ...

Shadow Demo

webp

Conclusions

结论

  • Number of shadow pages proportional to screen pixels

    与屏幕像素成比例的阴影页数

  • Shadow cost scales with resolution and number of lights per pixel

    阴影成本随分辨率和每像素的灯光数量而变化

webp

Streaming and Compression

流媒体和压缩

Streaming

流媒体

  • Virtualized geometry

    虚拟几何体

    • Unlimited geometry at fixed memory budget

      固定内存预算下的无限几何图形

  • Conceptually similar to virtual texturing

    概念上类似于虚拟纹理

    • GPU requests needed data then CPU fulfills them.

      GPU 请求所需的数据,然后 CPU 完成它们。

    • Unique challenges: must no cracks in the geometry

      独特的挑战:几何体中不得有裂纹

  • Cut DAG at runtime to only loaded geometry

    在运行时将 DAG 剪切为仅加载的几何体

    • Needs to always be a valid cut of full DAG

      需要始终是完整 DAG 的有效切割

    • Similar to LOD cutting. No cracks

      类似于 LOD 切割。无裂纹

webp

Paging

分页

  • Fill fixed-sized pages with groups

    用组填充固定大小的页面

  • Based on spatial locality to minimize pages needed at runtime

    基于空间局部性,以最小化运行时所需的页面

    • Sort groups by mip and spatial locality

      按 mip 和空间位置对组进行排序

  • Root page (64k)

    根页面(64k)

    • First page contains top lod level(s) of DAG

      第一页包含 DAG 的顶级 lod 级别

    • Always resident on GPU so we always have something to render

      始终驻留在 GPU 上,所以我们总是有东西要渲染

  • Streaming Page (128k)

    流媒体页面(128k)

    • Other lod levels of cluster groups

      集群组的其他 lod 水平

    • Life time is managed by LRU on CPU

      寿命由 CPU 上的 LRU 管理

  • Page contents:

    页面内容:

    • Index data,Vertex data, Bounds, LOD info, Material tables, etc.

      索引数据、顶点数据、边界、LOD 信息、材质表等。

webp

Memory representation

内存表示

Vertex quantization and encoding

顶点量化和编码

  • Global quantization

    全局量化

    • A combination of artist control and heuristics

      艺术家控制和启发式的结合

    • Clusters store values in local coordinates that is relative to value min/max range

      集群将值存储在相对于值最小/最大范围的局部坐标中

  • Per-cluster custom vertex format

    每簇自定义顶点格式

    • Uses minimum number of bits per component: ceil(log2(range))

      使用每个组件的最小位数:ceil(log2(range))

    • Just a string of bits, not even byte aligned

      只是一串比特,甚至没有字节对齐

  • Decoded using GPU bit-stream reader because of divergent encode format between clusters

    由于集群之间的编码格式不同,使用 GPU 比特流读取器进行解码

webp

Disk Representation

磁盘表示法

  • Hardware LZ decompression

    硬件 LZ 解压

    • In consoles now and on its way to PC with DirectStorage

      现在在控制台中,并正在通过 DirectStorage 进入 PC

    • Unbeatably fast, but general purpose

      速度无与伦比,但用途广泛

    • String deduplication and entropy coding

      字符串重复数据删除和熵编码

  • For better compression

    为了更好的压缩

    • Domain-specific transforms

      特定于域的转换

    • Focus on redundancies not already captured by LZ and massaging the data to better fit how LZ compression

      关注 LZ 尚未捕获的冗余,并对数据进行处理,以更好地适应 LZ 压缩方式

  • Transcode on the GPU

    GPU 上的转码

    • High throughput for parallel transforms, currently runs at ~50GB/s with fairly unoptimized code on PS5

      并行转换的高吞吐量,目前在 PS5 上以约 50GB/s 的速度运行,代码相当未优化

    • Powerful in combination with hardware LZ

      与硬件 LZ 结合使用功能强大

    • Eventually stream data directly to GPU memory

      最终将数据直接流式传输到 GPU 内存

Results: Lumen in the Land of Nanite

结果:Nanite 土地上的管腔

  • 433M Input triangles, 882M Nanite triangles

    433M 输入三角形,882M Nanite 三角形

  • Raw data: 25.90GB Memory format: 7.67GB

    原始数据:25.90GB 内存格式:7.67GB

  • Compressed: 6.77GB Compressed disk format: 4.61GB

    压缩:6.77GB 压缩磁盘格式:4.61GB

  • ~20% improvement since Early Access

    自早期访问以来提高了约 20%

  • 5.6 bytes per Nanite triangle, 11.4 bytes per input triangle

    每个 Nanite 三角形 5.6 个字节,每个输入三角形 11.4 个字节

  • 1M triangles = ~10.9MB on disk

    1M 三角形 = 磁盘上约 10.9MB

webp

Welcome to Billions of Triangles World

欢迎来到亿万三角形世界

Jungle of Nanite Geometries

Nanite 几何丛林

webp

References