GAMES104-Data-Oriented Programming and Job System

王希-GAMES104-现代游戏引擎:从入门到实践。

资源

课程

第二十节:现代游戏引擎架构:面向数据编程与任务系统

Code Execution Is Not As Simple As It Looks

代码执行并不像看起来那么简单

Code is executed on top of specific hardware and operating system

代码在特定硬件和操作系统上执行

  • Hardware and OS must be considered if we want to write a high performance program

    如果我们想编写高性能程序,就必须考虑硬件和操作系统

webp

Basics of Parallel Programming

并行程序设计基础

Ceiling of Moore's Law and Multi-Cores

摩尔定律的上限与多核

  • The number of transistors in a dense integrated circuit (IC) doubles about every two years

    密集集成电路(IC)中的晶体管数量大约每两年翻一番

  • In these years, chip densities are no longerdoubling every two years

    这些年来,芯片密度不再每两年翻一番

  • Multi-coreprocessorbecomes the new industry trend

    多核处理器成为新的行业趋势

webp

处理器的精度和频率的提升达到了一个瓶颈,添加更多核心成为提升处理器性能的一个趋势。

Process and Thread

进程和线程

{% div subfields %}

{% div subfield %}

Process

进程

  • The instance of an application (or program)

    应用程序(或程序)的实例

  • Has its own individual region of memory

    有自己的内存空间

{% enddiv %}

{% div subfield %}

Thread

线程

  • Preemptive multitasking

    先发制人的多任务处理

  • The smallest unit of task that can be scheduled by OS

    操作系统可以调度的最小任务单元

  • Must reside in a process

    必须驻留在进程中

  • Threads in the same process share the same region of memory

    同一进程中的线程共享同一内存区域(处理不好容易出 bug)

{% enddiv %}

{% enddiv %}

webp

Types of Multitasking

多任务处理的类型

{% div subfields %}

{% div subfield %}

webp

Preemptive Multitasking

先发制人的多任务处理

  • Currently executing task can be interrupted at a time decided by the scheduler

    当前正在执行的任务可以在调度程序决定的时间中断

  • Scheduler determines which task to be executed next

    调度器确定下一个要执行的任务

  • Applied in most operating systems

    适用于大多数操作系统

{% enddiv %}

{% div subfield %}

webp

Non-preemptive Multitasking

非抢占式多任务处理

  • Tasks must be explicitly programmed to yield control

    任务必须明确编程以实现控制

  • Tasksmust cooperate for the scheduling scheme to work

    任务必须配合调度方案才能发挥作用

  • Currently many real-time operating systems (RTOS) also support this kind of scheduling

    目前,许多实时操作系统(RTOS)也支持这种调度

{% enddiv %}

{% enddiv %}

Thread Context Switch

线程上下文切换

Store the state of a thread and resume the execution at a later point

存储线程的状态,并在稍后恢复执行

  • State including registers, stack and other OS required data

    状态,包括寄存器、堆栈和其他操作系统所需的数据

  • Thread context switch implies extra user-kernel mode switch

    线程上下文切换意味着额外的用户内核模式切换

  • Cache invalidation after context switch has even more cost

    上下文切换后缓存失效的成本更高

webp

Cache invalidation may takes 10,000~1,000,000 cycles

缓存失效可能需要 10000~1000000 个周期

Parallel Problems in Parallel Computing

并行计算中的并行问题

Embarrassingly Parallel Problem (or Perfectly Parallel)

易并行计算问题(或完全平行问题)

  • Little or no dependency or need for communication between parallel tasks

    并行任务之间很少或不需要通信

webp

Monte Carlo algorithm is a typical example of embarrassingly parallel

蒙特卡洛算法是易并行计算问题的典型例子(使用多线程计算蒙特卡洛积分)

Non-embarrassingly Parallel Problem

非尴尬的平行问题

  • Communication is needed between parallel tasks

    并行任务之间需要通信

webp

Data Race in Parallel Programming

并行编程中的数据竞争

Multiple threads in a single process access the same memory location concurrently

单个进程中的多个线程同时访问同一内存位置

  • Atleast one of the accesses is for writing

    至少有一个入口是用于写作的

C++
while(job_count < 100) {
    doSomething();
    job_count++;
}
  • Read job_count

    读取 job_count

  • Compute job_count +1

    计算任务计数 +1

  • Write new job_count

    任务计数

webp

Blocking Algorithm -Locking Primitives

阻塞算法-上锁

Lock

  • Only one thread can acquire the lock at a time

    一次只能有一个线程获取锁

  • Make a critical sectionfor shared resource access

    创建共享资源访问的关键部分

webp

Only one thread can do this part of code in the same time

只有一个线程可以同时执行这部分代码

Other Issues with Locks

锁的其他问题

  • Thread suspending and resuming will bring performance overhead

    线程挂起和恢复将带来性能开销

  • Suspending threads never get resumed if the thread that acquires the lock exits abnormally

    如果获取锁的线程异常退出,则挂起的线程永远不会恢复

  • Priority Inversion

    优先级反转

    • A higher priority task attempts to acquire the lock that is already acquired by a lower priority task

      较高优先级的任务试图获取较低优先级任务已经获取的锁

webp

Atomic Operation : Lock-free Programming

原子操作:无锁编程

Atomic Loads and Stores

原子负载和存储

  • Load: Load data from shared memory to either a register or thread-specific memory

    加载:将数据从共享内存加载到寄存器或线程特定内存

  • Store: Move data into shared memory

    存储:将数据移动到共享内存中

Atomic Read-Modify-Write (RMW)

原子读修改写(RMW)

  • Test and Set: Set 1 to shared memory and return the previous value

    测试和设置:将 1 设置为共享内存并返回上一个值

  • CompareandSwap (CAS): Update the data in shared memory if it equals an expected value

    比较和交换(CAS):如果共享内存中的数据等于预期值,则更新该数据

  • Fetch and Add: Add a value to the data in shared memory and return the previous value

    提取并添加:向共享内存中的数据添加一个值,并返回上一个值

  • ...

webp

C++11 和 C++20 提供了原子操作的语句。

Lock Free vs. Wait Free

webp
  • wait-free:不管 OS 如何调度线程,每个线程始终在做有用的事情。

  • lock-free:不管 OS 如何调度线程,至少有一个线程在做有用的事情。

Compiler Reordering Optimizations

编译器重新排序优化

webp

编译器将高级语言汇编成汇编语言时,有可能进行优化操作修改代码的执行顺序,这在单线程中不会有变化,但在多线程中可能出现奇怪情况。

Problem of Memory Reordering

内存重新排序问题

  • Compilers and CPUs often modify the execution order of instructions to optimize performance

    编译器和 CPU 经常修改指令的执行顺序以优化性能

  • It's the hard part of parallel programming

    这是并行编程的难点

webp

assert 是 C/C++ 中的一个宏,用于在运行时检查一个条件是否为真。如果条件不满足,则运行时将终止程序的执行并输出一条错误信息。

如果 Thread 1 进行了编译器优化,则 Thread 2 可能不会在 assert(a==2); 处抛出错误。

Out-of-order Execution by CPUs

CPU 执行顺序错误

For different CPU

适用于不同的 CPU

  • The optimization strategy are significantly different

    优化策略明显不同

  • Provides different types of memory order guarantees

    提供不同类型的内存顺序保证

  • Parallel programs require different processing

    并行程序需要不同的处理

webp

有些 CPU 不保证执行顺序。导致有些代码在平台 A 运行正常,但在平台 B 运行异常。

Parallel Framework of Game Engine

游戏引擎的并行框架

Fixed Multi-thread

固定多线程

One fixed thread for each part of the game engine logic

游戏引擎逻辑的每个部分都有一个固定线程

  • Render, Simulation, Logic, Network, and etc.

    渲染、模拟、逻辑、网络等

  • Easy to implement

    易于实施

webp

Issues with Fixed Multi-thread

已修复的多线程问题

  • The workload is unbalanced among threads (cores)

    线程(核心)之间的工作负载不平衡

  • Unscalable while there are more processor cores

    当有更多处理器内核时,无法缩放

webp

还是容易出现其它核心陪跑的情况。

Thread Fork-Join

螺纹叉连接

Fork-join for data-parallelizable work (based on fixed multi-thread)

用于数据并行工作的分叉连接(基于固定多线程)

  • Use a thread pool to prevent frequent thread creation/destruction

    使用线程池来防止频繁的线程创建/销毁

webp

Problems with Thread Fork-Join

螺纹叉连接问题

  • Not easy for logic programmers (work split, work threads count)

    逻辑程序员不容易(工作拆分、工作线程计数)

  • Too many threads can bring performance overhead on context switch

    太多的线程会给上下文切换带来性能开销

  • The workload is still unbalanced among threads (cores)

    线程(核心)之间的工作负载仍然不平衡

webp

Unreal Parallel framework

虚幻并行框架

Two types of threads

两种类型的螺纹

  • Named Thread

    命名线程

    • Created by other systems and attached to parallel framework

      由其他系统创建并附加到并行框架

  • Worker Thread

    工人线程

    • Three priorities: high, middle and low

      三个优先事项:高、中、低

    • The number is determined by the number of CPU cores

      该数量由 CPU 内核的数量决定

webp

Taskgraph

任务图

A directed acyclic graph

有向无环图

  • Node→Task

    节点→任务

  • Edge→Dependency

    边缘→依赖

webp

通过链接构建任务图

webp

Job System

任务系统

Coroutine

协程

Allows for multitaskingby creating jobs that run inside of coroutines

通过创建在协程内运行的任务,允许多任务处理

  • Coroutine is a lightweight execution context (include a user provided stack, registers...)

    协程是一个轻量级的执行上下文(包括用户提供的堆栈、寄存器等)

    • Execution is collaborative, means a coroutine can switch to another interactively

      执行是协作的,意味着协程可以交互式地切换到另一个协程

webp

Coroutine vs. Thread

webp

{% div subfields %}

{% div subfield %}

Coroutine

协程

  • Scheduled by programmers

    由程序员安排

  • To be executed within a thread

    在线程内执行

  • Context switch is faster without kernel switch

    无需内核切换,上下文切换速度更快

{% enddiv %}

{% div subfield %}

Thread

线程

  • Scheduled by operating system

    按操作系统计划

  • Resides in a process

    过程中的残留物

  • Context switch is costly with kernel switch

    内核切换的上下文切换成本很高

{% enddiv %}

{% enddiv %}

Stackful Coroutine

堆叠式协程

Coroutine owns an independent runtime stack which is reserved after yield

Coroutine 拥有一个独立的运行时堆栈,该堆栈在 yield 后保留

  • Enable to yield from within a nested stackframe

    允许在嵌套的堆栈框架内屈服

  • Use local variables just like normal functions

    像普通函数一样使用局部变量

webp

在 C++ 中,yield() 通常用于多线程编程中,来让出当前线程的执行权。具体来说,它的作用取决于你使用的线程库或框架。以下是几种常见的使用情况:

  1. std::this_thread::yield()(在 C++11 及以后的版本中):

    • 这是 C++ 标准库中的一个函数,声明在 <thread> 头文件中。

    • std::this_thread::yield() 使当前线程暂停执行,并允许其他同等优先级的线程有机会运行。它通常用于在多线程环境中优化资源使用,尤其是在忙等待的情况下。

    • 示例代码:

      cppCopy Code#include <iostream>
      #include <thread>
              
      void worker() {
          for (int i = 0; i < 10; ++i) {
              std::cout << "Working...\n";
              std::this_thread::yield(); // 让出线程执行权
          }
      }
              
      int main() {
          std::thread t(worker);
          t.join();
          return 0;
      }
      
  2. std::experimental::coroutine_handle::yield_value()(在 C++20 协程中):

    • 在 C++20 中,协程引入了更复杂的机制,如 std::experimental::coroutine_handle::yield_value(),用于协程的挂起和恢复。
    • 这允许在协程中挂起执行,并在之后恢复,提供了更高级的控制流功能。

在使用 yield() 时,需要注意它不会阻塞当前线程,而只是提示调度器当前线程愿意暂停执行。调度器将决定是否实际暂停当前线程以及何时恢复它。yield() 是一种建议,具体的调度行为取决于底层操作系统的线程调度策略。

总的来说,yield() 用于提高多线程程序的效率,避免在某些情况下导致不必要的资源占用或提高线程的响应性。

Stackless Coroutine

无堆栈协程

Coroutine has no independent runtime stack to be reserved when yield

Coroutineyield 时没有要保留的独立运行时堆栈

  • Only the top-levelroutine may yield (subroutines have no idea where to return without stack)

    只有顶层例程可能会产生结果(子程序不知道在没有堆栈的情况下返回哪里)

  • The data that is required to resume execution should be stored separately from the stack

    恢复执行所需的数据应与堆栈分开存储

webp

Stackful vs. Stackless Coroutine

堆叠式与无堆叠式协程

{% div subfields %}

{% div subfield %}

Stackful Coroutine

堆叠式协程

  • More powerful with enable to yield from within a nested stackframe

    更强大,支持从嵌套堆栈框架中屈服

  • Needs more memory to reserve stacks for each coroutine

    需要更多内存为每个协程保留堆栈

  • Coroutine context switch takes more time

    子程序上下文切换需要更多时间

{% enddiv %}

{% div subfield %}

Stackless Coroutine

无堆栈协程

  • Unable to yield from within a subroutine

    无法从子例程中屈服

  • More difficult to use without a stack to reserve data

    如果没有堆栈来保留数据,则更难使用

  • No extra memory needed for coroutine's stack

    协程堆栈不需要额外的内存

  • Faster context switch

    更快的上下文切换

{% enddiv %}

{% enddiv %}

Fiber-based Job System

基于光纤的任务系统

Allows for multitasking by creating jobs instead of threads

通过创建任务而不是线程来实现多任务处理

  • Fiber is like coroutine except that fiber is scheduled by a scheduler

    光纤类似于协程,只是光纤是由调度器调度的

  • Thread is the execution unit while fiber is the context

    线程是执行单元,光纤是上下文

    • One thread for each processor core to minimize the context switch overhead

      每个处理器核心一个线程,以最大限度地减少上下文切换开销

  • Job is executed within the context of a fiber

    任务在光纤上下文中执行

webp

One Work Thread for One Core

一个核心对应一个工作线程

To minimize the overhead of thread context switch

尽量减少线程上下文切换的开销

  • Multiple work threads for a single core still suffers from context switch

    单个核心的多个工作线程仍然受到上下文切换的影响

  • One work thread for each core eliminates context switch

    每个核心都有一个工作线程,消除了上下文切换

webp

Fiber-based Job System

基于光纤的任务系统

  • Thread is the execution unit while fiber is the context

    线程是执行单元,光纤是上下文

  • Job is executed within a fiber

    任务在光纤内执行

webp

Job Scheduler -Global Job

LIFO and FIFO Mode

LIFO 和 FIFO 模式

  • Schedule Model

    计划模型

    • First In First Out (FIFO)

      先进先出(FIFO)

    • Last In First Out (LIFO)

      后进先出(LIFO)

  • LIFO Mode

    • In most case, job dependency is tree like

      在大多数情况下,工作依赖性是树状的

    • Some system add jobs occasionally but wait them immediately

      有些系统偶尔会添加任务,但会立即等待

Job Scheduler -Job Dependency

Job Scheduler -Job Stealing

Pros and Cons of Job System

{% div subfields %}

{% div subfield %}

Pros

  • Easy to implement task schedule

    易于实施的任务计划

  • Easy to handle task dependency

    易于处理任务依赖关系

  • Job stack is isolated

    任务堆栈是隔离的

  • Avoid frequency context switch

    避免频率上下文切换

{% enddiv %}

{% div subfield %}

Cons

  • C++ does not natively support fiber

    C++ 本身不支持光纤

  • Implementation is different between OS

    操作系统之间的实现不同

  • Has some restrictions (thread_local invalid)

    有一些限制(thread_local 无效)

{% enddiv %}

{% enddiv %}

Programming Paradigms

编程范式

Procedure-oriented Programming

面向过程的程序设计

Object-oriented Programming

面向对象程序设计

Programming Paradigm of Game Engine

游戏引擎的编程范式

  • There are many different programming paradigms

    有许多不同的编程范式

  • Inpractice, some paradigms are widely used

    在实践中,一些范式被广泛使用

  • Programming languages aren't always tied to a specific paradigm

    编程语言并不总是与特定的范式联系在一起

webp

Procedural Oriented Programming (POP)

面向过程编程(POP)

  • Follows a step-by-step approach to break down a task into a collection of variables and routines (or subroutines) through a sequence of instructions

    遵循循序渐进的方法,通过一系列指令将任务分解为变量和例程(或子程序)的集合

  • Impossible to write a game engine in this way

    无法以这种方式编写游戏引擎

    • Data is not well maintained.

      数据没有得到很好的维护。

    • A co-relation with real-world objects is difficult

      与现实世界中的物体建立关联是困难的

Object-Oriented Programming (OOP)

面向对象程序设计(OOP)

  • Based on the concept of "objects", which can contain data and code

    基于“对象”的概念,可以包含数据代码

  • It's natural for human to abstract from real world in an object-oriented way

    人类以面向对象的方式从现实世界中抽象出来是很自然的

webp

Problems of OOP : Where to Put Codes?

面向对象编程的问题:把代码放在哪里?

webp

当攻击者击中对象时,是执行 Attacker.doDamageTo() 还是 Victim.receiveDamage()

当玩家接触敌人时,是执行 Player.attachTo() 还是 Enemy.isAttached()

Problems of OOP:Method Scattering in Inheritance Tree

面向对象程序的问题:继承树中的方法分散

  • Hard to know which parent class has the method implementation

    很难知道哪个父类有方法实现

webp

为了找到玩家攻击敌人的代码逻辑,有可能从继承树中套好几层才找到代码。

Problems of OOP : Messy Based Class

面向对象编程的问题:基于混乱的类

C++
classENGINE_API AActor: publicUObject
{
    ...
    constFTransform&GetTransform() const;
    constFTransform&ActorToWorld() const;
    FVectorGetActorForwardVector() const;
    FVectorGetActorUpVector() const;
    FVectorGetActorRightVector() const;
    virtualvoidGetActorBounds(...) const;
    virtualFVectorGetVelocity() const;
    floatGetDistanceTo(constAActor*OtherActor) const;
    virtualvoidSetActorHiddenInGame(boolbNewHidden);
    boolGetActorEnableCollision() const;
    boolHasAuthority() const;
    UActorComponent*AddComponent(...);
    voidAttachToActor(...);
    voidDetachFromActor(constFDetachmentTransformRules&DetachmentRules);
    boolGetTickableWhenPaused();
    boolIsActorInitialized() const;
    voidReceiveAnyDamage(...);
    voidGetOverlappingActors(...) const;
    virtualvoidSetLifeSpan(floatInLifespan);
    virtualvoidSerialize(FArchive&Ar) override;
    virtualvoidPostLoad() override;
    ...
}

Parts of methods of a "messy base class"

“混乱的基层”的部分方法(代码可读性差)

Find some methods in common? Put it to the base class! → We get a messy base class

找到一些共同的方法吗?把它放在基层阶级!→ 我们得到了一个混乱的基础阶级

This is not the best OO design, and it certainly is possible to make a better one. But also, often code ends up being like this, even if no one wanted it that way.

这不是最好的 OO 设计,当然有可能做出更好的 OO 设计。 但是,即使没有人希望这样,代码最终也往往是这样的。

Problems of OOP : Performance

面向对象编程的问题:性能

  • Memory scattering

    记忆散射

  • Jungle of virtual functions

    虚拟功能丛林

webp

面向对象编程可能导致运行时地址分散,Cache 命中率低,性能差。

Problems of OOP : Testability

面向对象编程的问题:可测试性

  • Unit Testing

    单元测试

    • OO designs often need a lot of setup to test

      OO 设计通常需要大量的设置来测试

webp

Data-Oriented Programming (DOP)

面向数据编程(DOP)

Processor-Memory Performance Gap

处理器内存性能差距

  • Performance of memory grows much slowly than processor

    内存的性能增长速度比处理器慢

  • The gap is even larger which make memory becomes the main bottleneck of performance CPU

    差距更大,使内存成为性能 CPU 的主要瓶颈

webp

The Evolution of Memory -Cache

内存缓存的演变

Add cache to speed up data reading

添加缓存以加快数据读取速度

  • L1: Ranges between 256KB to no more than 1MB, but even that is sufficient.

    L1:范围在 256KB 到不超过 1MB 之间,但即使这样也足够了。

  • L2: Usually a few megabytes and can go up to 10MB.

    L2:通常为几兆字节,最高可达 10MB。

  • L3: Larger than L1 and L2, varies from 16MB to 64MB, shared between all cores.

    L3:大于 L1 和 L2,大小从 16MB 到 64MB 不等,在所有内核之间共享。

webp

Principle of Locality

地方性原则

the tendency of a processor to access the same set of memory locations repetitively over a short period of time

处理器在短时间内重复访问同一组存储位置的趋势

Spatial Locality

空间位置

  • The use of data elements within relatively close storage locations

    在相对较近的存储位置内使用数据元素

webp

Single instruction multiple data (SIMD)

单指令多数据(SIMD)

webp

LRU (Least Recently Used)

LRU(最近最少使用)

  • When cache is full, discards the least recently usedcache-linefirst.

    当缓存已满时,首先丢弃最近最少使用的缓存行。

    • Record the "used time" of each cache line

      记录每个缓存行的“使用时间”

    • Discard the most "oldest used" cache lineeach time

      每次丢弃“最旧的已使用”缓存行

    • Update "used time" when access data of cache line

      访问缓存行数据时更新“已用时间”

webp

Cache Line

缓存行

  • Data is transferred between memory and cache inblocks of fixed size (typically 64 bytes), called cache lines or cache blocks.

    数据在固定大小(通常为 64 字节)的内存和缓存块之间传输,称为缓存行或缓存块。

  • A cache can only hold a limited number of lines, determined by the cache size. For example, a 64 kilobyte cache with 64-byte lines has 1024 cache lines.

    缓存只能容纳有限数量的行,由缓存大小决定。例如,具有 64 字节行的 64KB 缓存有 1024 条缓存行。

  • Every time you load any memory at all, you are loading in a full cache line of bytes

    每次加载任何内存时,都会加载一个完整的字节缓存行

Cache Miss

未命中缓存

webp
  • When cahce is full (loaded 4 rows), new rows will replace the oldest one

    当 cahce 已满(加载 4 行)时,新行将替换最旧的行

  • When a elements not in cache, a whole row will be loaded

    当一个元素不在缓存中时,将加载整行

Data-Oriented Programming (DOP)

面向数据编程(DOP)

1. Datais all we have

我们只有数据

webp

示例:假设你正在开发一个游戏,并需要处理大量的游戏实体(例如玩家、敌人、NPC 等)。

在面向对象编程(OOP)中,你可能会为每种实体创建一个类,并将属性和方法打包在一起:

python
pythonCopy Codeclass Player:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    
    def move(self, dx, dy):
        self.x += dx
        self.y += dy

而在面向数据编程(DOP)中,你会将实体的属性分开存储,并编写独立的函数来操作这些属性:

python
pythonCopy Code  # 使用列表存储所有玩家的位置数据
player_positions = [(0, 0), (5, 5), (10, 10)]
 
def move_player(index, dx, dy):
    x, y = player_positions[index]
    player_positions[index] = (x + dx, y + dy)
 
# 移动第一个玩家
move_player(0, 1, 1)

在这个例子中,player_positions 列表存储所有玩家的位置数据。move_player 函数则直接操作这些数据,不涉及复杂的对象方法。这种方式可以提高性能,尤其是在处理大量数据时,因为它减少了对象的开销,并且更容易进行批量操作和并行处理。

Instructions are Data Too

指令也是数据

webp

Code 和 Data 都是内存中的数据。

Keep Code and Data Tight in Memory

保持代码和数据在内存中的紧密性

  • Keep both code and data small and process in bursts when you can

    尽可能保持代码和数据的小规模,并以突发的方式处理

webp

Performance-Sensitive Programming

性能敏感编程

Reducing Order Dependency

减少订单依赖

  • The work being done because of a misprediction will have to be undone

    由于预测失误而完成的任务将不得不撤销

  • Never modify variables once they are initially assigned

    一旦变量最初被赋值,就永远不要修改它们

{% div subfields %}

{% div subfield %}

webp

These 2 parts of code will not be excuted in parallel

这两部分代码不会并行执行

because variables a & b is used before

因为之前使用了变量 a 和 b

{% enddiv %}

{% div subfield %}

webp

Compiler allow these 2 parts of code to execute in parallel

编译器允许这两部分代码并行执行

Actually, compiler use static single-assignment (SSA) to deal with simple situation like this

实际上,编译器使用静态单赋值(SSA)来处理这样的简单情况

{% enddiv %}

{% enddiv %}

False Sharing in Cache Line

缓存行中的错误共享

  • Ensuring any rapidly updated variables are kept local to the thread

    确保任何快速更新的变量都保持在线程的本地

  • Cache contension

    缓存争议

webp

Branch prediction

分支预测

  • CPU will prefetch instructions and data ahead

    CPU 将提前预取指令和数据

  • Use branch prediction technics to decide what to prefetch

    使用分支预测技术来决定预取什么

webp
  • To avoid branch mis-prediction

    避免分支预测错误

C++
int a[10] = {2, 5, 8, 11, 3, 12, 9, 22, 5, 13};
for(int i = 0; i < 10; i++)
{
    if(a[i] > 10)
	{
    	doFunc1();
	}
	else
	{
   		doFunc2();
	}
}
webp

如果代码跳着执行,则 Cache 容易未命中,降低运行速度。

C++
int a[10] = {2, 3, 5, 5, 8, 9, 11, 12, 13, 22};
for(int i = 0; i< 10; i++)
{
	if(a[i] > 10)
    {
    	doFunc1();
    }
    else
    {
    	doFunc2();
    }
}
webp

如果事先将序列排序好,代码就不会跳着执行。

Existential Processing

存在加工

{% div subfields %}

{% div subfield %}

vb
for actor in actor_array do
    if actor is alive then
		aliveFunc(actor)
	else
		deadFunc(actor)
	end
end

This code also faces branch prediction problems

此代码还面临分支预测问题

Unlike the example before, actor_array changes every tick

与前面的示例不同,actor_array 每一刻都会发生变化

{% enddiv %}

{% div subfield %}

vb
for actor in alive_actor_array do
	aliveFunc(actor)
end
 
for actor in dead_actor_array do
	deadFunc(actor)
end

Completely avoid "if-else"

完全避免“if-else”

By maintaining 2 lists of different actors, we could avoid branch mis-precondition

通过维护 2 个不同参与者的列表,我们可以避免分支预处理错误

{% enddiv %}

{% enddiv %}

Performance-Sensitive Data Arrangements

性能敏感数据安排

Reducing Memory Dependency

减少内存依赖

  • (chained memory lookups/accesses by pointers)

(通过指针进行链式内存查找 / 访问)

webp
  • Load the first cache line 1

    加载第一个缓存行 1

  • Get the next node address

    获取下一个节点地址

  • Cache miss

    缓存未命中

  • Unload the old one, and load another cahce line 2

    卸下旧的,再加载另一个 Cache 线 2

  • Repeating

    重复

Array of Structure vs. Structure of Array

数组的结构与结构的数组

{% div subfields %}

{% div subfield %}

webp

{% enddiv %}

{% div subfield %}

webp

{% enddiv %}

{% enddiv %}

SOA 的地址排列要比 AOS 更连续。

If we want to read the position of all particles, SOA has better performance

如果我们想读取所有 particles 的位置,SOA 的性能更好。

Entity Component System (ECS)

实体组件系统(ECS)

Recap: Component-based Design

回顾:基于组件的设计

webp
  • Code example

    代码示例

webp

Entity Component System (ECS)

实体组件系统(ECS)

A pattern to structure game code in a data-oriented way for maximum performance

一种以面向数据的方式构建游戏代码以获得最大性能的模式

  • Entity: an ID refer to a set of components

    实体:ID 指一组组件

  • Component: the data to be processed by systems, no logic at all

    组件:系统要处理的数据,根本没有逻辑

  • System: where the logic happens, read/write component data

    系统:逻辑发生的地方,读/写组件数据

webp

Unity Data-Oriented Tech Stack (DOTS)

面向 Unity 数据的技术栈(DOTS)

A combination of technologies that work together to deliver a data-oriented approach to coding

结合多种技术,共同提供面向数据的编码方法

  • The Entity Component System (ECS) provides data-oriented programming framework

    **实体组件系统(ECS)**提供面向数据的编程框架

  • The C# Job System provides a simple method of generating multithreaded code

    C# 任务系统提供了一种生成多线程代码的简单方法

  • The Burst Compiler generates fast and optimized native code

    Burst 编译器生成快速且优化的本机代码

webp

Unity ECS – Archetype

A specific combination of components

Archetype 是组件的特定组合

  • Entities are grouped into archetypes

    实体被分组到 archetype 中

webp

Unity ECS –Data Layout in Archetype

Unity ECS——Archetype 中的数据布局

Same components in an archetype are packed tightly into chunks for cache friendliness

原型中的相同组件被紧密地打包成块,以便于缓存

  • A chunk is a block of memory with fixed size, i.e. 16KB

    块是固定大小的内存块,即 16KB

webp

Unity ECS –System

Unity ECS——系统

webp
C#
public class MoveSystem: SystemBase
{
	protected override void OnUpdate()
	{
	// For each entity which has Translation and Velocity
	Entities.ForEach(
		// Write to Displacement (ref), read Velocity (in)
		(refTranslationtrans, inVelocityvelocity) =>
		{
			// Execute for each selected entity
			trans = newTranslation()
			{
				// dT is a captured variable
				Value = trans.Value + velocity.Value * dT};
			}
	).ScheduleParallel(); // Schedule as a parallel job
	}
}

Unity C# Job System

Unity C# 任务系统

Make it easier forusers to write correct multithreaded code

使用户更容易编写正确的多线程代码

  • A job is a small unit of work that performs a specific task

    作业是执行特定任务的小工作单元

  • Jobs can depend on other jobs to complete before they run

    作业在运行之前可以依赖其他作业来完成

{% div subfields %}

{% div subfield %}

C#
public struct FirstJob: IJob
{
	publicvoidExecute()
    {
    	...
    }
}
 
public struct SecondJob: IJob
{
	public void Execute()
    {
    	...
    }
}

{% enddiv %}

{% div subfield %}

C#
varfirst_job = newFirstJob();
varsecond_job = newSecondJob();
 
// execute first_job
varfirst_job_handle = first_job.Schedule();
 
// second_job depends on first_job to complete
second_job.Schedule(first_job_handle);

{% enddiv %}

{% enddiv %}

Unity C# Job System–Native Container

Unity C# 作业系统——原生容器

A type of shared memory that can be accessed inside jobs

一种可以在作业内部访问的共享内存

  • Job cannot output result without native container (all data is a copy)

    没有本机容器,作业无法输出结果(所有数据都是副本)

  • Native containers support all safety checks

    本地容器支持所有安全检查

  • Native containers need to be disposed manually

    本地容器需要手动处理

C#
// Allocate one float with "TempJob" policy
// Allocator.Temp: Fastest allocation, lifespan is 1 frame or fewer
// Allocator.TempJob: Slower than Temp, lifespan is 4 frames
// Allocator.Persistent: Slowest allocation, can last as long as needed
NativeArray<float> a= newNativeArray<float>(1, Allocator.TempJob);
...
// Need to dispose manually for unmanaged memory
a.Dispose();

Unity C# Job System –Safety System

Unity C# 作业系统-安全系统

Support safety checks (out of bounds checks, deallocation checks, race condition checks) for jobs

支持作业的安全检查(越界检查、取消分配检查、竞争条件检查)

  • Send each job a copy of data it needs to operate on to eliminate the race condition

    向每个作业发送一份它需要操作的数据副本,以消除竞争条件

    • Job can only access blittable data types (reference is invalid)

      作业只能访问 blitable 数据类型(引用无效)

C#
public struct Job: IJob
{
    public float a;
    public float b;
    public void Execute()
    {
    	...
    }
}
webp

High-Performance C# and Burst Compiler

高性能 C# 和 Burst 编译器

High-Performance C# (HPC#) is a subset of C#

高性能 C#(HPC#)是 C# 语言的一个子集

  • Give up on most of the standard library (StringFormatter, List, Dictionary, and etc.)

    放弃大部分标准库(StringFormatter、List、Dictionary 等)

  • Disallow allocations, reflection, the garbage collector and virtual calls

    不允许分配、反射、垃圾收集器和虚拟调用

Burst Compiler translates from IL/.NET bytecode to highly optimized native code using LLVM

Burst 编译器从 IL/ 转换而来。使用 LLVM 将 NET 字节码转换为高度优化的本机代码

  • Generate expected machine code for specific platforms

    为特定平台生成预期的机器代码

webp

Unreal Mass Framework

webp

MassEntity–Entity

MassEntity–实体

  • FMassEntityHandle is pure ID as ECS Entity

    FMassEntityHandle 是 ECS 实体的纯 ID

  • Index indicates the index in Entities array in FMassEntityManager

    索引表示 FMassEntityManager 中实体数组中的索引

  • SerialNumberas salt to Index

    序列号作为索引的盐

    • Release an old entity

      释放旧实体

    • Create a new entity with the same Index

      使用相同的索引创建新实体

    • SerialNumberis increased so the ID will be different

      序列号增加,因此 ID 将不同

{% div subfields %}

{% div subfield %}

C++
struct FMassEntityHand1e
{
    ...
    int32 Index = 0;
    int32 SerialNumber = 0;
    ...
}

{% enddiv %}

{% div subfield %}

C++
struct MASSENTITY_API FMassEntityManager
{
    ...
    TChunkedArray<FEntityData> Entities;
    TArray<int32> EntityFreeIndexList;
    ...
}

{% enddiv %}

{% enddiv %}

MassEntity–Component

MassEntity——组件

  • Same as Unity, each type of entity has an Archetype

    与 Unity 相同,每种类型的实体都有一个原型

  • Fragments and tagsare componentsfor entities

    片段和标记是实体的组成部分

  • Tags are constant Boolean components to filter unnecessary processing

    标签是用于过滤不必要处理的常量布尔组件

C++
struct FMassArchetypeCompositionDescriptor
{
    ...
    FMassFragmentBitSetFragments;
    FMassTagBitSetTags;
    FMassChunkFragmentBitSetChunkFragments;
    FMassSharedFragmentBitSetSharedFragments;
}
webp

MassEntity–Systems

MassEntity——系统

  • ECS Systems in MassEntity are Processor sderived from UMassProcessor

    MassEntity 中的 ECS 系统是从 UMassProcessor 派生的处理器

  • Two important interface: ConfigureQueries() and Execute(...)

    两个重要的接口:ConfigureQueries()Execute(…)

C#
class MASSENTITY_API UMassProcessor: publicUObject
{
	...
protected:
    virtual void ConfigureQueries() PURE_VIRTUAL(UMassProcessor::ConfigureQueries);
    virtual void PostInitProperties() override;
    virtual void Execute(
        FMassEntityManager&EntityManager,
        FMassExecutionContext&Context) PURE_VIRTUAL(UMassProcessor::Execute);
	...
}

MassEntity–Fragment Query

MassEntity——片段查询

  • Interface ConfigureQueries() runs when the processor is initialized

    处理器初始化时,接口 ConfigureQueries() 运行

  • Use FMassEntityQuery to filter archetypes of entities meeting systems requirements

    使用 FMassEntityQuery 筛选满足系统要求的实体原型

  • FMassEntityQuery caches filtered archetypes to accelerate future executions

    FMassEntityQuery 缓存经过筛选的原型以加速未来的执行

webp
C#
void UMassApplyMovementProcessor::ConfigureQueries()
{
    EntityQuery.AddRequirement<FMassVelocityFragment>(EMassFragmentAccess::ReadWrite);
    EntityQuery.AddRequirement<FTransformFragment>(EMassFragmentAccess::ReadWrite);
    EntityQuery.AddRequirement<FMassForceFragment>(EMassFragmentAccess::ReadWrite);
    EntityQuery.AddTagRequirement<FMassOffLODTag>(EMassFragmentPresence::None);
    EntityQuery.AddConstSharedRequirement<FMassMovementParameters>(EMassFragmentPresence::All);
}

MassEntity–Execute

MassEntity——执行

webp

Conclusions

结论

Everything You Need Know About Performance

关于性能,你需要知道的一切

webp

这张图展示了 CPU 各种操作所耗的时间。

References

Cache

Parallel Programming

Parallel Frameworks in Game Engine

DOP

Unity DOTS

Unreal Engine Mass Architecture

Multimedia Material List