Performance evaluation of macroblock-level parallelization of H.264 decoding on a cc-NUMA multiprocessor architecture
- Authors:
- Alvarez, Mauricio
- Ramirez, Alex
- Valero, Mateo
- Azevedo, Arnaldo
- Meenderinck, Cor
- Juurlink, Ben
- Resource type:
- Journal article
- Publication date:
- 2009
- Institution:
- Universidad Nacional de Colombia
- Repository:
- Universidad Nacional de Colombia
- Language:
- spa
- OAI Identifier:
- oai:repositorio.unal.edu.co:unal/28590
- Online access:
- https://repositorio.unal.edu.co/handle/unal/28590
- http://bdigital.unal.edu.co/18638/
- Keywords:
- Video codec parallelization
- Multicore architectures
- Synchronization
- H.264
- Multiprocessor architectures
- Rights:
- openAccess
- License:
- Attribution-NonCommercial 4.0 International
Summary: This paper presents a study of the performance scalability of a macroblock-level parallelization of the H.264 decoder for High Definition (HD) applications on a multiprocessor architecture. We implemented this parallelization on a cache-coherent Non-Uniform Memory Access (cc-NUMA) shared-memory multiprocessor (SMP) and compared the results with the theoretical expectations. The study evaluates three scheduling techniques: static, dynamic, and dynamic with tail-submit. The dynamic scheduling approach with the tail-submit optimization delivers the best performance, reaching a maximum speedup of 9.5 on 24 processors. A detailed profiling analysis showed that thread synchronization is one of the main factors limiting scalability. The paper also evaluates the impact of using blocking synchronization APIs such as POSIX threads and the POSIX real-time extensions. Results showed that macroblock-level parallelism, as a very fine-grained form of Thread-Level Parallelism (TLP), is highly affected by the thread synchronization overhead these APIs generate. Other synchronization methods, possibly with hardware support, are required to make MB-level parallelization more scalable.