SIMD

康凯森大约 2 分钟

注意事项

AVX-512 is not always faster than AVX2. 原因是一些 CPU 可能会降频，还有编译器会 prefer 256-bit SIMD operations
内存对齐：直接调用 SIMD 指令编程时需要注意内存对齐

SIMD 基础知识

What SIMD

为什么向量化可以提升性能

SIMD Registers

Many Ways to Vectorize

Compile Auto Vectorize

Countable loop
Function call should be inline or simple math function
No data dependencies
No complex conditions

Compile Hint Vectorize: Restrict

Restrict Tells The Compile The Arrays In Distinct Locations In Memory

How to Ensure SIMD Instructions Used

Vector Intrinsics

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.htmlopen in new window

SIMD 优化案例

SIMD text parsing

SIMD time stamps parsing

Parsing time stamps faster with SIMD instructionsopen in new window

SIMD data filtering

SIMD string operations

SIMD json

SIMD join runtime filter

SIMD hash table probe

There are three opportunities to apply SIMD for the hash tables: hash computation and read or write values into the table.

SIMD Join

SIMD memcpy

SIMD memcmp

SIMD memequal

SIMD case when

SIMD Gather

往往内存会成为瓶颈，导致 simd gather 加速不明显。

SIMD branchless

核心思想是每个分支都计算，然后利用 mask 选择不同分支的结果

https://www.youtube.com/watch?v=g-WPhYREFjkopen in new window

SIMD filter

Starrocks optimize filter_range when avx512f availableopen in new window

SIMD aggregate

SIMD ascii Substring

SIMD is sorted 判断

标量代码：

bool is_sorted(const int32_t* input, size_t n) {
    if (n < 2) {
        return true;
    }

    for (size_t i=0; i < n - 1; i++) {
        if (input[i] > input[i + 1])
            return false;
    }

    return true;
}

avx512 代码：

bool is_sorted_avx512(int32_t* a, size_t n) {

    const __m512i shuffle_pattern = _mm512_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15);

    size_t i = 0;
    while (i < n - 16) {
        const __m512i curr = _mm512_loadu_si512(reinterpret_cast<const __m512i*>(a + i));
        const __m512i next = _mm512_permutexvar_epi32(shuffle_pattern, curr);

        const __mmask16 mask = _mm512_cmpgt_epi32_mask(curr, next);
        if (mask != 0) {
            return false;
        }

        i += 15;
    }

    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }

    return true;
}

如何更好地触发向量化

循环外提

https://github.com/StarRocks/starrocks/pull/14510open in new window

循环展开 Unrolled Loop

Why are unrolled loops faster?open in new window

SIMD 收益

当数据都在 CPU Cache 中时，SIMD 指令的收益最明显
当一次要处理的数据集超过 Cache 容量时，Cache Miss 和内存访问就会成为最大的 Cost
所以当你已经按列处理，发现 SIMD 指令收益不明显时，就应该首先考虑是不是 Cache Miss 的原因

多平台多SIMD指令兼容

Automatic SIMD-instruction selectionopen in new window

参考资料

https://15721.courses.cs.cmu.edu/spring2023/slides/08-vectorization.pdfopen in new window

SIMD

# 注意事项

# SIMD 基础知识

# What SIMD

# 为什么向量化可以提升性能

# SIMD Registers

# Many Ways to Vectorize

# Compile Auto Vectorize

# Compile Hint Vectorize: Restrict

# How to Ensure SIMD Instructions Used

# Vector Intrinsics

# SIMD 优化案例

# SIMD text parsing

# SIMD time stamps parsing

# SIMD data filtering

# SIMD string operations

# SIMD json

# SIMD join runtime filter

# SIMD hash table probe

# SIMD Join

# SIMD memcpy

# SIMD memcmp

# SIMD memequal

# SIMD case when

# SIMD Gather

# SIMD branchless

# SIMD filter

# SIMD aggregate

# SIMD string to integer parser

# SIMD std::find & memchr

# SIMD bitshuffle

# SIMD ascii Substring

# SIMD is sorted 判断

# 如何更好地触发向量化

# 循环外提

# 循环展开 Unrolled Loop

# SIMD 收益

# 多平台多SIMD指令兼容

# 参考资料