submitted 1 day ago by Gork@lemm.ee to c/technology@lemmy.world

Contemporary high-level programming languages and advanced compilers greatly simplify software development and lower its costs. However, this way of programming can hide the performance capabilities of modern hardware, partly due to inefficiencies of application programming interfaces (APIs). Apparently, a good old assembly code path can improve performance by between three and 94 times, depending on the workload, according to FFmpeg. The hardware this multiplied performance was achieved on was not disclosed.

FFmpeg is an open-source multimedia processing project developed by volunteers who contribute to its codebase, fix bugs, and add new features. The project is led by a small group of core developers and maintainers who oversee its direction and ensure that contributions meet certain standards. They coordinate the project's development and release cycles, merging contributions from other developers. This group tried implementing a handwritten AVX-512 assembly code path, something that has rarely been done before, at least in the video industry.

The developers have created an optimized code path using the AVX-512 instruction set to accelerate specific functions within the FFmpeg multimedia processing library. By leveraging AVX-512, they were able to achieve significant performance improvements (from three to 94 times faster) compared to standard implementations. AVX-512 enables processing large chunks of data in parallel using 512-bit registers, each of which can hold up to 16 single-precision or 8 double-precision floating-point values and operate on all of them in one instruction. This kind of optimization suits compute-heavy tasks in general, and video and image processing in particular.
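The register-width arithmetic in the paragraph above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `lanes` is made up for this example and has nothing to do with FFmpeg's code. It shows how many elements of a given width fit into one SIMD register, including 8-bit values, which is a common size for pixel samples.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one SIMD register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


# Register widths of the instruction sets named in the article.
for name, bits in (("SSE", 128), ("AVX2", 256), ("AVX-512", 512)):
    print(
        f"{name} ({bits}-bit): "
        f"{lanes(bits, 32)} x float32, "
        f"{lanes(bits, 64)} x float64, "
        f"{lanes(bits, 8)} x 8-bit"
    )
```

A wider register only sets the upper bound on work per instruction; how much of that bound a real function reaches depends on the algorithm and the CPU.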

The benchmarking results show that the new handwritten AVX-512 code path performs considerably faster than other implementations, including baseline C code and narrower SIMD instruction sets like AVX2 and SSE3. In some cases, the revamped AVX-512 code path achieves a speedup of nearly 94 times over the baseline C code, highlighting the efficiency of hand-optimized assembly code for AVX-512.

This development is particularly valuable for users running on high-performance, AVX-512-capable hardware, enabling them to process media content far more efficiently. There is an issue, though: Intel disabled AVX-512 on its 12th, 13th, and 14th Generation Core processors, leaving owners of these CPUs without it. On the other hand, AMD's Ryzen 9000-series CPUs feature a fully-enabled AVX-512 FPU, so the owners of these processors can take advantage of the FFmpeg achievement.

Unfortunately, due to the complexity and specialized nature of AVX-512, such optimizations are typically reserved for performance-critical applications and require expertise in low-level programming and processor microarchitecture.

top 9 comments
[-] avidamoeba@lemmy.ca 81 points 22 hours ago* (last edited 22 hours ago)

This is the right way to optimize performance. Write everything in a decent higher-level language to achieve good maintainability. Then profile for hotspots, separate them into well-defined modules and optimize the shit out of them, even if it takes assembly inlining. The ugly stays in its own box and you don't spend time optimizing stuff that doesn't need optimization.
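The "profile for hotspots" step can be sketched with Python's standard `cProfile` and `pstats` modules. Everything here is a toy: `cheap_setup`, `hot_loop`, `work` and `find_hotspot` are hypothetical names invented for this example, and the sketch reads the `stats` dictionary that `pstats.Stats` builds, which is an implementation detail and not a formally documented interface.

```python
import cProfile
import pstats


def cheap_setup() -> int:
    """Stand-in for code that is not worth optimizing."""
    return 42


def hot_loop(n: int = 200_000) -> int:
    """Stand-in for the inner loop that dominates run time."""
    total = 0
    for i in range(n):
        total += (i * i) % 7
    return total


def work() -> int:
    return cheap_setup() + hot_loop()


def find_hotspot(func) -> str:
    """Profile func() and return the name of the function with the most self time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stats = pstats.Stats(profiler).stats
    # Keys are (filename, line number, function name);
    # index 2 of each value is the time spent in the function itself.
    (_, _, name), _ = max(stats.items(), key=lambda item: item[1][2])
    return name


print("hotspot:", find_hotspot(work))
```

Only the function this reports would be a candidate for a hand-tuned rewrite; the rest of the program stays in the high-level language.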

[-] andyburke@fedia.io 28 points 21 hours ago

This person programs. ☝️ 🤝

[-] chellomere@lemmy.world 10 points 17 hours ago* (last edited 17 hours ago)

This is great, but the context is that this is for specific inner loops, and it is compared to the C version of that specific inner loop. Typically what was used before this on a computer with avx512 was the avx2 version of the inner loop, and the speedup compared to that version appears to be up to 60%: https://x.com/FFmpeg/status/1852542388851601913 . Then, as a specific inner loop isn't run all the time, the overall speedup is probably much less than 60%. This is still sizeable, but the actual speedup in practice with this implementation is far, far from 94x.
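The point about inner loops is Amdahl's law: if a loop takes a fraction f of total run time and becomes s times faster, the whole program speeds up by 1 / ((1 - f) + f / s). A rough sketch follows. The run-time fractions are made-up assumptions for illustration, not measurements of FFmpeg; 94x and 1.6x stand for the figures quoted against the C and AVX2 versions.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only `fraction` of run time is accelerated."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical shares of total run time spent in the optimized loop.
for fraction in (0.05, 0.25, 0.50):
    vs_c = overall_speedup(fraction, 94.0)    # loop is 94x faster than the C version
    vs_avx2 = overall_speedup(fraction, 1.6)  # loop is 60% faster than the AVX2 version
    print(f"loop = {fraction:.0%} of run time: {vs_c:.2f}x vs C, {vs_avx2:.2f}x vs AVX2")
```

Even a loop that accounts for half of the run time cannot make the whole program more than 2x faster, however fast the loop itself becomes.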

[-] lol@discuss.tchncs.de 25 points 1 day ago

AMD’s Ryzen 9000-series CPUs feature a fully-enabled AVX-512 FPU so the owners of these processors can take advantage of the FFmpeg achievement.

I've got a Ryzen 7800X3D and can see a bunch of AVX-512 feature flags in /proc/cpuinfo: avx512f avx512dq avx512ifma avx512cd avx512bw avx512vl avx512_bf16 avx512vbmi avx512_vbmi2 avx512_vnni avx512_bitalg avx512_vpopcntdq.

Does that mean it would improve performance for me as well or is some more specific feature required?
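For anyone who wants to run the same check, here is a small sketch that pulls the AVX-512 flags out of /proc/cpuinfo text. It assumes an x86 Linux system, where the line is labelled `flags`; the function names are made up for this example. It only reports what the kernel advertises and says nothing about which of these flags a particular FFmpeg code path needs.

```python
from pathlib import Path


def avx512_flags(cpuinfo_text: str) -> list[str]:
    """Return the sorted, de-duplicated avx512* flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Every core repeats the same line, so the first one is enough.
            return sorted({flag for flag in value.split() if flag.startswith("avx512")})
    return []


def local_avx512_flags(path: str = "/proc/cpuinfo") -> list[str]:
    """Read the local CPU flags; returns an empty list if the file does not exist."""
    cpuinfo = Path(path)
    return avx512_flags(cpuinfo.read_text()) if cpuinfo.exists() else []


# Shortened, made-up sample so the sketch also runs on non-Linux machines.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx2 avx512f avx512dq avx512bw avx512vl\n"
print(avx512_flags(sample))
```

Calling `local_avx512_flags()` on the machine in question gives the real list; an empty result means no AVX-512 flags are advertised there.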

[-] MorphiusFaydal@lemmy.world 18 points 1 day ago

7000 series run AVX512 as two 256 bit data paths, while the 9000 series has a native 512 bit data path for AVX512.

[-] Decipher0771@lemmy.ca 4 points 21 hours ago

Yes, but it’ll likely still be faster, just not as dramatically. Half of 3-94x is still 1.5-47x faster.

[-] InverseParallax@lemmy.world 2 points 22 hours ago

I mean why not, that worked out perfectly fine for bulldozer...

[-] chellomere@lemmy.world 6 points 19 hours ago

Yeah 7000-series Ryzen benefits from the avx512 code paths in ffmpeg. I've benchmarked a 5900x vs a 7900x specifically for software H.265 decoding and there was a sizeable difference.

[-] 0x0@programming.dev 7 points 22 hours ago

Unsung heroes.

this post was submitted on 05 Nov 2024
195 points (97.1% liked)
