How to choose card for GPU acceleration?

AndCycle · Post by **AndCycle** » Wed Mar 14, 2012 1:17 pm

That's all.
I don't think NeatVideo gets any benefit from double-precision calculation,
so if there are no further suggestions I will go for a 590.

CUDA-Z Report
=============
Version: 0.5.95
http://cuda-z.sourceforge.net/
OS Version: Windows AMD64 6.1.7601 Service Pack 1

Core Information
----------------
	Name: GeForce GTS 450
	Compute Capability: 2.1
	Clock Rate: 1566 MHz
	Multiprocessors: 4
	Warp Size: 32
	Regs Per Block: 32768
	Threads Per Block: 1024
	Watchdog Enabled: Yes
	Threads Dimentions: 1024 x 1024 x 64
	Grid Dimentions: 65535 x 65535 x 65535

Memory Information
------------------
	Total Global: -2048 MB
	Shared Per Block: 48 KB
	Pitch: 2.09715e+06 KB
	Total Constant: 64 KB
	Texture Alignment: 512
	GPU Overlap: Yes

Performance Information
-----------------------
Memory Copy
	Host Pinned to Device: 5646.34 MB/s
	Host Pageable to Device: 2622.25 MB/s
	Device to Host Pinned: 5682.86 MB/s
	Device to Host Pageable: 3574.45 MB/s
	Device to Device: 7036.69 MB/s
GPU Core Performance
	Single-precision Float: 398876 Mflop/s
	Double-precision Float: 50088.9 Mflop/s
	32-bit Integer: 200119 Miop/s
	24-bit Integer: 198782 Miop/s

Generated: Wed Mar 14 20:05:38 2012

CUDA-Z Report
=============
Version: 0.5.95
http://cuda-z.sourceforge.net/
OS Version: Windows AMD64 6.1.7601 Service Pack 1

Core Information
----------------
	Name: GeForce GTX 550 Ti
	Compute Capability: 2.1
	Clock Rate: 1820 MHz
	Multiprocessors: 4
	Warp Size: 32
	Regs Per Block: 32768
	Threads Per Block: 1024
	Watchdog Enabled: Yes
	Threads Dimentions: 1024 x 1024 x 64
	Grid Dimentions: 65535 x 65535 x 65535

Memory Information
------------------
	Total Global: 1024 MB
	Shared Per Block: 48 KB
	Pitch: 2.09715e+06 KB
	Total Constant: 64 KB
	Texture Alignment: 512
	GPU Overlap: Yes

Performance Information
-----------------------
Memory Copy
	Host Pinned to Device: 5690.54 MB/s
	Host Pageable to Device: 2580.25 MB/s
	Device to Host Pinned: 5694.54 MB/s
	Device to Host Pageable: 3559.29 MB/s
	Device to Device: 35197.9 MB/s
GPU Core Performance
	Single-precision Float: 464256 Mflop/s
	Double-precision Float: 58291.8 Mflop/s
	32-bit Integer: 232898 Miop/s
	24-bit Integer: 231269 Miop/s

Generated: Wed Mar 14 20:40:30 2012

CUDA-Z Report
=============
Version: 0.5.95
http://cuda-z.sourceforge.net/
OS Version: Windows AMD64 6.1.7601 Service Pack 1

Core Information
----------------
	Name: Tesla C2050
	Compute Capability: 2.0
	Clock Rate: 1147 MHz
	Multiprocessors: 14
	Warp Size: 32
	Regs Per Block: 32768
	Threads Per Block: 1024
	Watchdog Enabled: Yes
	Threads Dimentions: 1024 x 1024 x 64
	Grid Dimentions: 65535 x 65535 x 65535

Memory Information
------------------
	Total Global: -1408 MB
	Shared Per Block: 48 KB
	Pitch: 2.09715e+06 KB
	Total Constant: 64 KB
	Texture Alignment: 512
	GPU Overlap: Yes

Performance Information
-----------------------
Memory Copy
	Host Pinned to Device: 5759.45 MB/s
	Host Pageable to Device: 2524.57 MB/s
	Device to Host Pinned: 5689.37 MB/s
	Device to Host Pageable: 3452.75 MB/s
	Device to Device: 50980.1 MB/s
GPU Core Performance
	Single-precision Float: 1.02294e+06 Mflop/s
	Double-precision Float: 512848 Mflop/s
	32-bit Integer: 513210 Miop/s
	24-bit Integer: 512631 Miop/s

Generated: Wed Mar 14 20:53:44 2012
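
The negative "Total Global" values in two of the reports above (-2048 MB and -1408 MB) look like a display artifact rather than real memory sizes. A minimal sketch of one possible explanation, assuming the tool keeps the byte count in a signed 32-bit integer (this is an assumption; the thread does not confirm it, and the "true" sizes of 2048 MB and 2688 MB used below are inferred, not reported):

```python
MIB = 1 << 20  # bytes per MB as such tools usually count them

def wrap_int32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a signed 32-bit integer."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n

def reported_mb(true_mb: int) -> int:
    """MB value printed if the byte count were squeezed into an int32 (assumed)."""
    wrapped_bytes = wrap_int32(true_mb * MIB)
    return wrapped_bytes // MIB  # exact here because inputs are whole MB

# Hypothetical true sizes: 1024 MB fits, the larger two wrap around.
for true_mb in (1024, 2048, 2688):
    print(f"{true_mb} MB would be shown as {reported_mb(true_mb)} MB")
```

Under that assumption the 1024 MB card is shown correctly, while cards with 2 GB or more would show negative numbers like the ones in the reports.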

Post by **NVTeam** » Wed Mar 14, 2012 4:42 pm

I would also consider the new generation of NVidia cards - GTX 670 and perhaps 680 (google these names). They are likely to be released in a few days and may be able to offer even better results.

Vlad

AndCycle · Post by **AndCycle** » Wed Apr 11, 2012 9:27 am

Here is the report for the 680; I just borrowed one for testing.

It throws this error:

GPU error at GeForce GTX 680 (0): Invalid source (300) in GPUCUDAApi::createContext(.)

and lots of other issues.

here is the screenshot

http://imgur.com/h4noi,TxHCe#0

With this error it only got 0.988 frames/sec at radius 5.
The new design of the 600 series may not work well with CUDA due to a lack of registers.

Post by **NVTeam** » Wed Apr 11, 2012 9:38 am

Strange. I expected it to work correctly.

Which version of video driver is it? Which version of CUDA does that driver report?

Thank you,
Vlad

AndCycle · Post by **AndCycle** » Wed Apr 11, 2012 11:40 am

The NVIDIA driver is 301.10. The card has been returned, so I can't give you further information.

Post by **NVTeam** » Wed Apr 11, 2012 11:59 am

Thank you, we will investigate further based on the available data.

Vlad

Post by **NVTeam** » Thu Apr 12, 2012 11:51 pm

Yes, we were able to identify and resolve the problem with NV running on the GTX 680. The fix will be included in the next regular update of the plug-in. If you need it earlier, please contact support [at] neatvideo.com and we will prepare a special build for you.

Thank you,
Vlad

AndCycle · Post by **AndCycle** » Mon Apr 16, 2012 5:22 am

Thanks for the fast response, but I already bought a 580.
It would be nice if you could provide some numbers on how the 680 performs :p

Post by **NVTeam** » Mon Apr 16, 2012 11:45 am

The 680 with its current drivers shows about the same performance as the 580. It seems the 680 is optimized for games, while computing applications do not gain much on the 680 compared with the 580, except that the 680 consumes less energy.

Vlad

cdmikelis · Post by **cdmikelis** » Thu May 17, 2012 6:48 pm

I came to NeatVideo with a Q9550 and 8GB of RAM.
Then I gave that system a GTX460 (the improvement was not worth mentioning). So I upgraded to an i7-2600K and overclocked it to 4.5 GHz. I added 16GB of HyperX 1600 RAM and invested in SSDs (Crucial M4 128GB + Intel 320 120GB, on top of the existing RAID0 of 2x750 WD RE4).

The same GPU was now able to do much more. Unfortunately I do not have any numbers written down. Then I estimated FPS for several cards from the figures posted here, their CUDA core counts and their clock frequencies, and worked out that a 560Ti should give me a great boost over the 460. I sold the 460 and bought a 560Ti (the 900 MHz OC version from Gigabyte). I was badly disappointed, with only 10% more FPS than the 460.
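
A rough way to reproduce that kind of paper estimate is to multiply CUDA core count by core clock. The sketch below assumes NVIDIA's reference GTX 460 figures (336 cores, 675 MHz core clock), which are not stated in this post; the 560 Ti figures (384 cores, 900 MHz) are the ones quoted in this thread. It only illustrates why the estimate looked promising and says nothing about how Neat Video actually scales.

```python
def naive_throughput(cuda_cores: int, core_mhz: float) -> float:
    """Naive 'cores x clock' score; ignores memory, PCIe and host overhead."""
    return cuda_cores * core_mhz

gtx460 = naive_throughput(336, 675)    # assumed reference spec, not from the post
gtx560ti = naive_throughput(384, 900)  # figures quoted in this thread

predicted_gain = gtx560ti / gtx460 - 1.0
observed_gain = 0.10  # "only 10% more FPS" as reported in the post

print(f"predicted: +{predicted_gain:.0%}, observed: +{observed_gain:.0%}")
```

The large gap between the naive prediction and the observed gain is what the rest of the post tries to explain.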

I changed the GPU again, to a GTX570. It should have given me another 30% improvement, but again I got only 10%. I started to overclock this card, since ASUS cards go a long way over stock clock. Again, there was no significant improvement. If I test the GPU only, the gain can be seen, but together with the CPU it delivers just the same FPS regardless of how much I overclock the GPU.

I post some tests below. There must be some "handbrake" pulled in my system, other than the GPU. RAM??

i7-2600K (4500Mhz) / 16GB-1600Mhz / Crucial M4 128GB
ASUS GTX570 - stock clock (742/3800)

Frame: 1920x1080 progressive, 8 bits per channel, Radius: 1 frame
Running the test data set on up to 8 CPU cores and on up to 1 GPU

CPU only (6 cores): 7.04 frames/sec
GPU only (GeForce GTX 570): 8.13 frames/sec
CPU (5 cores) and GPU (GeForce GTX 570): 12.8 frames/sec

************************
i7-2600K (4500Mhz) / 16GB-1600Mhz
ASUS GTX570 - stock clock (742/3800)

Frame: 720x576 progressive, 8 bits per channel, Radius: 1 frame
Running the test data set on up to 8 CPU cores and on up to 1 GPU

CPU only (4 cores): 32.3 frames/sec
GPU only (GeForce GTX 570): 33.3 frames/sec
CPU (6 cores) and GPU (GeForce GTX 570): 47.6 frames/sec

************************
i7-2600K (4500Mhz) / 16GB-1600Mhz
ASUS GTX570 - OVER clock (900/4000)

Frame: 720x576 progressive, 8 bits per channel, Radius: 1 frame
Running the test data set on up to 8 CPU cores and on up to 1 GPU

CPU only (4 cores): 32.3 frames/sec
GPU only (GeForce GTX 570): 37 frames/sec
CPU (4 cores) and GPU (GeForce GTX 570): 47.6 frames/sec

************************
i7-2600K (4500Mhz) / 16GB-1600Mhz
ASUS GTX570 - OVER clock (900/4000)

Frame: 1920x1080 progressive, 8 bits per channel, Radius: 1 frame
Running the test data set on up to 8 CPU cores and on up to 1 GPU

CPU only (5 cores): 7.14 frames/sec
GPU only (GeForce GTX 570): 8.85 frames/sec
CPU (6 cores) and GPU (GeForce GTX 570): 13.7 frames/sec

************************
i7-2600K (4500Mhz) / 16GB-1600Mhz
ASUS GTX570 - OVER clock (900/4000)
Frame: 720x576 progressive, 8 bits per channel, Radius: 5 frames
Running the test data set on up to 8 CPU cores and on up to 1 GPU

CPU only (4 cores): 19.2 frames/sec
GPU only (GeForce GTX 570): 16.1 frames/sec
CPU (5 cores) and GPU (GeForce GTX 570): 25 frames/sec

************************
i7-2600K (4500Mhz) / 16GB-1600Mhz
ASUS GTX570 - OVER clock (900/4000)
Frame: 1920x1080 progressive, 8 bits per channel, Radius: 5 frames
Running the test data set on up to 8 CPU cores and on up to 1 GPU

CPU only (6 cores): 4.33 frames/sec
GPU only (GeForce GTX 570): 4.35 frames/sec
CPU (7 cores) and GPU (GeForce GTX 570): 7.14 frames/sec

Best combination: CPU (7 cores) and GPU (GeForce GTX 570)

************************
i7-2600K (4500Mhz) / 16GB-1600Mhz
ASUS GTX570 - stock clock (742/3800)
Frame: 1920x1080 progressive, 8 bits per channel, Radius: 5 frames
Running the test data set on up to 8 CPU cores and on up to 1 GPU

CPU only (6 cores): 4.31 frames/sec
GPU only (GeForce GTX 570): 3.98 frames/sec
CPU (6 cores) and GPU (GeForce GTX 570): 6.71 frames/sec

************************
i7-2600K (4500Mhz) / 16GB-1600Mhz
ASUS GTX570 - stock clock (742/3800)
Frame: 720x576 progressive, 8 bits per channel, Radius: 5 frames
Running the test data set on up to 8 CPU cores and on up to 1 GPU

CPU only (4 cores): 19.2 frames/sec
GPU only (GeForce GTX 570): 15.2 frames/sec
CPU (5 cores) and GPU (GeForce GTX 570): 25 frames/sec

************************
i7-2600K (4500Mhz) / 16GB-1600Mhz
ASUS GTX570 - OVER clock (900/4000)
Frame: 720x576 progressive, 8 bits per channel, Radius: 0 frames
Running the test data set on up to 8 CPU cores and on up to 1 GPU

CPU only (4 cores): 50 frames/sec
GPU only (GeForce GTX 570): 66.7 frames/sec
CPU (7 cores) and GPU (GeForce GTX 570): 83.3 frames/sec

************************
i7-2600K (4500Mhz) / 16GB-1600Mhz
ASUS GTX570 - stock clock (742/3800)
Frame: 720x576 progressive, 8 bits per channel, Radius: 0 frames
Running the test data set on up to 8 CPU cores and on up to 1 GPU

CPU only (4 cores): 50 frames/sec
GPU only (GeForce GTX 570): 55.6 frames/sec
CPU (6 cores) and GPU (GeForce GTX 570): 83.3 frames/sec

**************************
**************************
SO: as Vlad said, my RAM is obviously the bottleneck, since overclocking the GPU does not seem to bring any significant improvement. At different "radius" and "resolution" settings, overclocking the GPU does show a difference, but together with the CPU it comes out at the same speed one way or another. Previously I had Gigabyte's 900 MHz GTX560Ti (384 CUDA cores), and before that an ASUS GTX460. On my system each card is only some 10% apart from the next. Does that justify the price difference between them (in my case 40% between each model)? I can see clearly now that even 590 SLI would not bring me as many more fps as the CUDA core counts of the cards would suggest. Even going from the Q9550 to the 2600K didn't give the same improvement in NeatVideo as it did in other rendering tasks (like transcoding). Maybe a 4-channel LGA 2011 platform would deliver more fps?

Best regards to all.

Post by **NVTeam** » Thu May 17, 2012 10:46 pm

With small frames, the overhead of organizing GPU computations (for example, data transfers from RAM to GPU and back) dominates processing; overclocking the GPU cannot make the whole process faster than the slowest part. With larger frames, actual image processing takes more time than preparing data for processing, and therefore GPU overclocking is starting to make a difference, as visible in your tests.
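
The effect described above can be sketched with a simple two-part model: a fixed per-frame overhead that GPU clock does not touch, plus a compute part that shrinks with GPU clock. All millisecond values below are made-up illustrations, not measurements from this thread or from Neat Video, and the function names are hypothetical.

```python
def frames_per_sec(overhead_ms: float, compute_ms: float,
                   gpu_clock_scale: float = 1.0) -> float:
    """Frame rate when only the compute part speeds up with GPU clock."""
    return 1000.0 / (overhead_ms + compute_ms / gpu_clock_scale)

def overclock_gain(overhead_ms: float, compute_ms: float,
                   gpu_clock_scale: float) -> float:
    """Relative fps gain from overclocking under this model."""
    base = frames_per_sec(overhead_ms, compute_ms)
    fast = frames_per_sec(overhead_ms, compute_ms, gpu_clock_scale)
    return fast / base - 1.0

# Hypothetical small frame: overhead dominates. Hypothetical large frame: compute dominates.
small = overclock_gain(overhead_ms=20.0, compute_ms=10.0, gpu_clock_scale=1.2)
large = overclock_gain(overhead_ms=40.0, compute_ms=160.0, gpu_clock_scale=1.2)

print(f"20% GPU overclock: small frame +{small:.1%}, large frame +{large:.1%}")
```

In this model the gain can never reach the full 20% as long as any overhead remains, and it gets closer to it as the compute share grows.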

Generally, your figures look normal for that CPU and GPU. Faster RAM may help a bit. Also, make sure you run the video card in an x16 PCIE slot.

Hope this helps,
Vlad

cdmikelis · Post by **cdmikelis** » Fri May 18, 2012 12:41 am

Hi Vlad.

Thanks for reading my post. During the day I read the posts from others more closely. I can see I should stop moaning about low speed. My results are even better than those of some 6-core Sandy Bridge-E CPUs posted before :)

But during the day I saw an interesting thing:
- my GPU is only 35-45% utilised when rendering in Premiere (CPU+GPU). If I select GPU only, it rises to 40-60%. If I overclock the GPU by 30%, GPU utilisation drops. It's obvious the GPU is starved of data.
- if I do some heavy 3D rendering other than NV, or some game benchmarking, GPU utilisation is then 100% or near. In that case a 30% OC results in a near 30% performance gain (whereas with Neat it results in 10%, and only if the GPU is used alone).
- Even my CPU does not go over 60% (alone). If I render CPU+GPU, it's around 40% (GPU 40 + CPU 60). So both are starved of data to render.
- When the CPU renders some other effects in Premiere it goes to 100%.

I already ruled out the drive: I have a fast SSD and I even tried a RAM disk (4GB of my 16). Whether I have 8 or 16GB does not matter much for Neat on my system.

I found an interesting use of the GPU now. I put Neat on GPU only. Instead of a clip with an unrendered NV effect not playing at all, both processing units now do their own work: the CPU handles the clip, the GPU handles Neat. Now playback (though a bit jumpy) is possible with an unrendered Neat effect on the clip.

QUESTION: How does this generic benchmark work in NV? Why does it always show a somewhat better result than the real-life rendering speed I get when I measure rendering time and calculate back? I do much of my testing by repeatedly rendering (or exporting) the same 200-frame 1080p file. Overclocking the 560Ti was measurable in real life by several seconds; with the 470 there was no real difference. I really think I have an overkill GPU installed. I borrowed the 560Ti back for comparison. I'll post results later.

Regards,
MIHAEL

Post by **NVTeam** » Fri May 18, 2012 9:48 am

Mihael,

Please try the latest version, 3.2. It may provide somewhat better rendering speed in Premiere, because we have added more optimizations for Premiere. It is also likely to show higher CPU and GPU utilization.

When you run Optimize in Neat Video, only the filter itself is running, so Optimize measures the speed of the filter alone. When you do the actual render in Premiere, both Premiere (reading, decoding, transforming, encoding, writing) and Neat Video (noise reduction) work, one after another, in turns. So rendering is more than noise reduction alone, which is why rendering is slower than the results you see in Optimize. That is normal.
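
If the host and the filter really do take turns on each frame, the per-frame times add up. A minimal sketch of that arithmetic, assuming strictly serial stages; the 14.7 frames/sec filter speed is a figure posted in this thread, while the host speed and the 25-second render time are hypothetical values chosen only for illustration:

```python
def render_fps(filter_fps: float, host_fps: float) -> float:
    """Overall fps when host work and filtering run one after another."""
    return 1.0 / (1.0 / filter_fps + 1.0 / host_fps)

def host_seconds_per_frame(frames: int, render_seconds: float,
                           filter_fps: float) -> float:
    """Back out the host's share from a timed render (serial model assumed)."""
    return render_seconds / frames - 1.0 / filter_fps

# Filter alone as measured by Optimize; host speed is a made-up number.
print(f"overall: {render_fps(filter_fps=14.7, host_fps=30.0):.2f} frames/sec")

# Hypothetical timed export of a 200-frame clip taking 25 seconds.
extra = host_seconds_per_frame(frames=200, render_seconds=25.0, filter_fps=14.7)
print(f"host share: {extra * 1000:.1f} ms per frame")
```

Under this model the rendered frame rate is always below the Optimize figure, however fast the filter becomes.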

One more thing to check: try to disable GPU acceleration in Mercury and let NV alone use the GPU. In some cases, rendering this way may be faster than with GPU enabled in Premiere.

Hope this helps,
Vlad

cdmikelis · Post by **cdmikelis** » Fri May 18, 2012 2:13 pm

Hi Vlad.

Thank you for posting the info about the new version. I had 3.1.0 before.

During the past day I overclocked the RAM from XMP 9-9-9-27@1600 to 9-10-9-24@1866, and that brought up to +13% rendering speed in NeatVideo. CPU usage now ranges from 60-85% and the GPU has higher usage too (up to 80% if only the GPU is selected). So even faster RAM would help more. Even overclocking the GPU now makes more sense.
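
For a sense of scale, theoretical peak DRAM bandwidth can be estimated from the transfer rate. This is a sketch assuming a 64-bit (8-byte) bus per channel and the dual-channel memory controller of the i7-2600K; sustained real-world bandwidth is lower, and the function names are only illustrative.

```python
BYTES_PER_TRANSFER = 8  # one 64-bit DDR3 channel moves 8 bytes per transfer

def peak_bandwidth_gbs(mega_transfers: float, channels: int) -> float:
    """Theoretical peak in GB/s (10^9 bytes) for a given MT/s and channel count."""
    return mega_transfers * BYTES_PER_TRANSFER * channels / 1000.0

def cas_latency_ns(cl: int, mega_transfers: float) -> float:
    """CAS latency in ns; the I/O clock runs at half the transfer rate."""
    return cl * 2000.0 / mega_transfers

stock = peak_bandwidth_gbs(1600, channels=2)
oc = peak_bandwidth_gbs(1866, channels=2)
quad = peak_bandwidth_gbs(1600, channels=4)  # hypothetical quad-channel platform

print(f"dual 1600: {stock:.1f} GB/s, dual 1866: {oc:.1f} GB/s "
      f"(+{oc / stock - 1:.1%}), quad 1600: {quad:.1f} GB/s")
print(f"CL9 latency: {cas_latency_ns(9, 1600):.2f} ns -> "
      f"{cas_latency_ns(9, 1866):.2f} ns")
```

The reported gain of up to 13% is below the roughly 17% increase in peak bandwidth, which would fit a workload that is partly, but not entirely, limited by memory.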

With version 3.2.0 I get 1 fps more in the benchmark than with 3.1.0, but in real life there is no difference in rendering time on a 200-frame 1080p file.

Then I tried rendering a 4K file. As you said, bigger frames benefit more.
Yes, at 4K everything matters:
- OC GPU/CPU
- OC RAM
- New Version of NV as well :)
-> CPU & GPU go to 90%

My current benchmark:
Frame: 1920x1080 progressive, 8 bits per channel, Radius: 1 frame
Running the test data set on up to 8 CPU cores and on up to 1 GPU

CPU only (1 core): 2.1 frames/sec
CPU only (2 cores): 4.22 frames/sec
CPU only (3 cores): 6.02 frames/sec
CPU only (4 cores): 7.58 frames/sec
CPU only (5 cores): 7.75 frames/sec
CPU only (6 cores): 7.63 frames/sec
CPU only (7 cores): 7.52 frames/sec
CPU only (8 cores): 7.19 frames/sec
GPU only (GeForce GTX 570): 10 frames/sec
CPU (1 core) and GPU (GeForce GTX 570): 8.85 frames/sec
CPU (2 cores) and GPU (GeForce GTX 570): 10.6 frames/sec
CPU (3 cores) and GPU (GeForce GTX 570): 12.7 frames/sec
CPU (4 cores) and GPU (GeForce GTX 570): 13.9 frames/sec
CPU (5 cores) and GPU (GeForce GTX 570): 14.1 frames/sec
CPU (6 cores) and GPU (GeForce GTX 570): 14.3 frames/sec
CPU (7 cores) and GPU (GeForce GTX 570): 14.7 frames/sec
CPU (8 cores) and GPU (GeForce GTX 570): 14.7 frames/sec

Best combination: CPU (7 cores) and GPU (GeForce GTX 570)
=> 14.7 frames/sec
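
One way to read these numbers is to compare the combined result with the sum of the best CPU-only and GPU-only results. The sketch below uses only figures posted in this thread (both at 1920x1080, radius 1); the "efficiency" measure itself is just an illustrative convention, not something Neat Video reports.

```python
def combined_efficiency(cpu_fps: float, gpu_fps: float,
                        combined_fps: float) -> float:
    """Combined fps as a fraction of the ideal sum of CPU-only and GPU-only fps."""
    return combined_fps / (cpu_fps + gpu_fps)

# (best CPU-only, GPU-only, best CPU+GPU) as posted in the thread
runs = {
    "this benchmark": (7.75, 10.0, 14.7),
    "earlier run, stock GPU clock": (7.04, 8.13, 12.8),
}

for label, (cpu, gpu, both) in runs.items():
    eff = combined_efficiency(cpu, gpu, both)
    print(f"{label}: {both} of {cpu + gpu:.2f} ideal fps ({eff:.0%})")
```

Some loss against the ideal sum is expected, since CPU and GPU share the same RAM and the CPU also has to feed the GPU.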

I need to compare to 560ti...

PS: About Mercury Playback Engine: with the 560Ti I switched it off regularly, but here, with 1280MB of video RAM, it makes no difference. Maybe this alone pays for the new GPU :)

MPE helps when rendering native MPEG formats with MPE-accelerated effects. But in my case I use Cineform with non-MPE-accelerated effects (like Neat), and MPE actually slows exporting. Often I switch MPE to software-only and use the "use previews" option. But it depends... I'm happy with my whole night of PC tuning. The new Neat version is now the cherry on top of the cake :)

Regards,
MIHAEL

PS: I need to try CS6 to see if they changed the pipeline somehow. I miss FAST/LIQUID NLE rendering, where different cores/CPUs start to render their own clips/fx, unlike PPro, where all cores/CPUs struggle over the same clip and more time is wasted distributing frames than rendering.

VHSobsessive · Post by **VHSobsessive** » Sun May 20, 2012 7:15 pm

With regard to memory bandwidth, I asked in another thread whether it was worth shelling out for a setup with quad-channel memory. An Intel i7-3820 quad core cpu (with quad memory controller) and socket 2011 motherboard can be had for a semi-reasonable price.

At what point does extra memory bandwidth stop delivering speed improvements for a Sandy Bridge quad core CPU ? Is it worth doubling memory bandwidth by getting a quad channel system ?
