SageAttention

Quantized attention that achieves speedups of 2-5x over FlashAttention and 3-11x over xformers, without losing end-to-end metrics across language, image, and video models.
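As a rough illustration of what "quantized attention" means here, the following is a minimal, dependency-free Python sketch. It assumes symmetric per-tensor INT8 quantization of Q and K, with the softmax and the P·V product kept in floating point. The helper names (`quantize_int8`, `attention`, `max_abs_diff`, `random_matrix`) are hypothetical, and this is not the project's actual GPU kernel, which is not shown on this page; it only demonstrates why integer Q·K^T products can stay close to the full-precision result.

```python
import math
import random


def quantize_int8(rows):
    """Symmetric per-tensor INT8 quantization: returns (integer rows, scale)."""
    max_abs = max(abs(x) for row in rows for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(x / scale))) for x in row] for row in rows]
    return q, scale


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, quantized=False):
    """Single-head attention; optionally computes Q K^T from INT8 values."""
    d = len(q[0])
    if quantized:
        q_src, q_scale = quantize_int8(q)
        k_src, k_scale = quantize_int8(k)
        dequant = q_scale * k_scale  # one multiply maps integer dots back to real units
    else:
        q_src, k_src, dequant = q, k, 1.0
    out = []
    for q_row in q_src:
        logits = [dequant * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k_src]
        probs = softmax([x / math.sqrt(d) for x in logits])
        # P V stays in floating point in this sketch (assumption).
        out.append([sum(p * v_row[c] for p, v_row in zip(probs, v))
                    for c in range(len(v[0]))])
    return out


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def random_matrix(rng, rows, cols):
    """Matrix of standard-normal samples (toy data for the comparison)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, head_dim = 16, 32
q = random_matrix(rng, seq_len, head_dim)
k = random_matrix(rng, seq_len, head_dim)
v = random_matrix(rng, seq_len, head_dim)

reference = attention(q, k, v)
approx = attention(q, k, v, quantized=True)
print(f"max abs difference: {max_abs_diff(reference, approx):.4f}")
```

On toy inputs like these the quantized output should differ from the reference only slightly. A real implementation would typically use finer-grained scales and fused GPU kernels, which this sketch does not attempt.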

Cuda · Emerging
Contributors: 8
Last push: 10mo ago

Recent commits

Latest commits.

  • Fix setup.py for machines without GPUs
    9053b45 · Dmitry Nedospasov · 10mo ago
  • Merge pull request #231 from BobQC/main
    798c791 · whx1003 · 11mo ago
  • fix bug: no skip when mask_dtype is float
    34ea987 · BobQC · 11mo ago
  • Merge pull request #227 from BobQC/main
    3c8f3b1 · whx1003 · 11mo ago
  • fix qk_mask when mask_dtype=float
    f78d412 · BobQC · 11mo ago
  • add attn_mask to sageattn_qk_int8_pv_fp16_triton
    499828b · BobQC · 11mo ago
  • Merge pull request #224 from thu-ml/fix/attention-mask
    f75e969 · whx1003 · 11mo ago
  • fix mask computation for better precision
    3917283 · whx1003 · 11mo ago

Top contributors

Builders behind this project.

  • jt-zhang: 69 commits
  • jason-huang03: 34 commits
  • XiaomingXu1995: 9 commits
  • whx1003: 4 commits
  • BobQC: 3 commits
  • DefTruth: 1 commit
  • Panchovix: 1 commit
  • oraluben: 1 commit