The bound copy of my thesis is due in 3 weeks, final draft 3/31 or 4/1, don't remember which :).
I've only had Nvidia hardware to test on since my Radeon 4770 doesn't support the byte_addressable_store extension (5000-series and up only), but it runs on my GF9400m and a GTX 480 in current Ubuntu just fine. It also works fine on AMD Stream CPU-based OpenCL. I've gotten it working in Mac OS using CPU CL, but there's a bug in the Mac GPU-based acceleration that kills it every time and I haven't had time to track it down yet.
Like I said, I'm hoping to keep working on this after graduation, either as a hobby, or professionally if someone's willing to pay. I've gotten the OpenCL initialization framework in place, have all of the memory management taken care of, and have most of the major parts of the decoding available as CL kernels.
The next step is increasing the parallelism, as I'm currently capping out at 336 threads max, and the common case is only a few dozen threads, not enough to even approach performance parity with the CPU-only paths. I've figured out a few ways to do that, especially in the loop filter (which accounts for 50% or so of the CPU-only execution time on a few of the 1080p videos I've profiled). The sub-pixel prediction/motion compensation and dequantization/IDCT will take a bit more work to thread effectively, but I think it can be done.
I'm sick of using the binary Nvidia drivers on my desktop/laptop, and I'd love to be able to switch back to the OSS drivers.
If anyone is interested, or picks up this GSoC project, I do have some very early vaapi state_tracker code. I've just had more important things to do, so I haven't touched it for a while. But whoever does the GSoC project can have it if he/she wants it.
then someone might make reference to it and encourage uptake, and of course then there's always an off-site backup if you lose your local hard drive with all that work on it :cool:
By the way, although it's of no direct use for the gfx code side, I noticed that on one of Jason Garrett-Glaser's latest ffmpeg VP8 optimization patches, Diego Elio Pettenò (flameeyes) mentioned that the pahole utility from acmel's dwarves is designed to find the cacheline boundaries in structures. Don't know if it's any good for the CPU side, but worth mentioning anyway just in case.
So now my desktop is running hardware RAID 1 with git checkouts in both Linux and Windows partitions, and my laptop has git checkouts of my stuff on all 3 of its operating systems (Win7, Mac, Linux). Both laptop and desktop are periodically backed up to external drives (separate drives for each system). Eventually, I'll probably store those drives in my desk at work, but for now they're on a shelf.
I've got a co-located server in another state, the github master repository, and a checkout on my work computer. My HTPC has a copy as well (also RAID 1), just to provide another machine to test on.
I know it's excessive, but I really don't want to try to use the "hard drive ate my homework" excuse. I knew people in undergrad who used that one, and it sounded lame even then.
As far as the cache-line software goes, it could come in handy for profiling the CPU decoder. The reference VP8 decoder does force alignment to certain boundaries on many of its structures, but I haven't seen any work on cache line boundary detection (it may have happened, I just haven't seen it).
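For anyone who hasn't looked at structure layout before, here's a rough sketch in C of the kind of thing a tool like pahole reports: padding holes between members and fields that cross a cache-line boundary. The struct names, the HOLE_AFTER macro, the straddles_line helper, and the 64-byte line size are all made up for illustration; none of this is taken from the reference VP8 decoder, and the exact padding depends on your compiler and ABI.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof, size_t */
#include <stdint.h>  /* uint8_t, uint32_t */

/* Assumed cache-line size; 64 bytes is common but not universal. */
#define CACHE_LINE 64

/* Hypothetical struct, NOT from the VP8 decoder. With a 4-byte-aligned
 * uint32_t, the compiler typically inserts 3 bytes of padding after
 * `mode` and 3 more after `ref_frame`. */
struct mb_info_loose {
    uint8_t  mode;
    uint32_t flags;
    uint8_t  ref_frame;
};

/* Same members, largest first, so the two small fields share one slot. */
struct mb_info_packed {
    uint32_t flags;
    uint8_t  mode;
    uint8_t  ref_frame;
};

/* Bytes of padding between the end of `member` and the start of `next`. */
#define HOLE_AFTER(type, member, next) \
    (offsetof(type, next) - \
     (offsetof(type, member) + sizeof(((type *)0)->member)))

/* Returns 1 if a field occupying [offset, offset + size) crosses a
 * cache-line boundary, assuming the struct starts on a line boundary. */
static int straddles_line(size_t offset, size_t size)
{
    return size > 0 &&
           (offset / CACHE_LINE) != ((offset + size - 1) / CACHE_LINE);
}

/* Reordering never makes this particular struct bigger. */
static_assert(sizeof(struct mb_info_packed) <= sizeof(struct mb_info_loose),
              "reordered struct should not grow");
```

Calling HOLE_AFTER(struct mb_info_loose, mode, flags) gives the size of the first hole, and straddles_line(offsetof(...), sizeof(...)) flags a field that would cost two cache-line fetches. pahole gets the same information from the DWARF debug info, so you don't have to annotate anything by hand.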