MedicineBaby

Advice on upgrade for data analysis and management


Hi,

 

For the past year, I've easily got by using my PC for data analysis, as the datasets were quite small. Now I'm working with very large datasets, and when my PC isn't freezing, it's taking about 12-24 hours to do things like spatial prediction and cluster analysis. The programs I'm currently running include VESPER, QGIS, Management Zone Analyst, and R. What I'd like to know is: if I upgraded my G3258 to an i5 or i7, would I see any improvement in performance? I don't expect to be able to complete my analyses in 10 minutes, but if I could shave off a good couple of hours and gain some stability, that'd be a plus.

 

Current specs:

 

CPU: G3258 3.20GHz

MOTHERBOARD: ASRock Z97 Anniversary

RAM: 8GB DDR3-1600 Kingston ValueRAM

GPU: GeForce GTX 750 Ti

PSU: FSP Raider 450W

O/S: Windows 10

Edited by MedicineBaby


The Pentiums are generally cut-down versions of the mainstream Core CPUs: in this case, less cache (usually) and no Hyper-Threading.

 

Seems yours is LGA1150, which is superseded, though new CPUs are still generally available. A quad-core would go well, of course, assuming your software isn't single-threaded. An easy way to find out is to do a run and bring up Task Manager: all cores should be reasonably busy.
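
If you want to sanity-check the R side of things specifically, a minimal sketch along these lines (the workload below is a made-up CPU-bound dummy, not your actual analysis) will show whether spreading work across both cores actually pays off:

# Compare serial vs parallel timing of a dummy CPU-bound task.
# If the parallel run isn't much faster, more cores won't help the R work.
library(parallel)

detectCores()                        # logical cores R can see (2 on a G3258)

f <- function(i) sum(sqrt(1:5e6))    # arbitrary number-crunching stand-in

system.time(lapply(1:8, f))          # serial baseline
cl <- makeCluster(2)                 # one worker per core
system.time(parLapply(cl, 1:8, f))   # roughly half the serial time if it scales
stopCluster(cl)

For VESPER and MZA you're stuck with watching the per-core graphs in Task Manager instead.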

 

For number-crunching, the RAM setup also matters: you want it running at good speed/latencies and in dual channel. You'd also want memory utilization low enough that paging isn't occurring to slow your runs down.

 

Seems MSY have the i5-4460 in stock @ $257; no doubt there are other suitable CPUs around. That CPU is 3.2 GHz base, quad-core (no Hyper-Threading on the i5s), turbo bins 0/1/2/2, 6 MB cache.

It also has integrated graphics, but I wouldn't use it, as it would probably impact anything that wants high memory bandwidth.

 

What sort of improvement you'd get, I'm not sure. My guess: worst case around 50% faster, best case 100% or slightly more.

A good idea might be to generate a run that takes about 15-30 minutes and quarantine the data so you can do repeated benches with it. Then use that to gauge speed improvements for any upgrades you do.
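
For the R parts of the workflow, a minimal version of that harness might look like this (the file name and the function body are placeholders, not the real data or analysis):

# Repeatable benchmark: keep a frozen copy of the data and time the exact
# same run before and after each hardware change.
run_once <- function() {
  dat <- read.csv("bench_copy.csv")   # quarantined copy, never modified
  # ... the 15-30 minute analysis being timed goes here ...
  invisible(dat)
}

print(system.time(run_once()))        # compare the "elapsed" figure across runs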

 

Additionally, if your analysis does lots of disk I/O, consider an SSD or at least a separate dedicated HDD. Though if it only comes down to several GB worth of I/O spread over 24 hours, the drive type would barely matter.


More RAM might also help, but only if the current setup is near maxed out, or if the disk I/O involves repeated sweeps over the same data that gets flushed from the cache by the time subsequent passes occur.
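
For the R session at least (Task Manager is the place to watch VESPER and MZA), a quick way to check how close you are to the 8 GB; 'dat' here stands in for whatever your main dataset is called:

gc()                                   # garbage-collect and report Mb used / max used
print(object.size(dat), units = "Mb")  # size of the main dataset object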


It looks like the software is running across the cores. I'll look into the 4460, maybe even the 4590 from PCCaseGear, and the RAM.

Edited by MedicineBaby


Severely? I kinda doubt it. You lose efficiency by multithreading too far beyond the number of logical cores available. Any decent software that uses multithreading would allow configuration or test for available resources.

 

Another suggestion, if it's a job that does lots of I/O to different files, would be to use multiple HDDs. Using separate physical HDDs in such cases can be more efficient than, say, RAID 0, since reducing simultaneous file operations per physical drive means fewer seeks; seek delay is the killer on mechanical HDDs.


I've actually got an SSD and an HDD. I run the programs from the SSD and read and write the data on the HDD.

 

I should've said that it's VESPER and Management Zone Analyst I'm using for spatial prediction and cluster analysis. QGIS and R, while slow to load my data, are only used to map and manage it. It'd be great to be able to analyse large datasets in R, though.
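
From what I've read, MZA's zoning is essentially fuzzy c-means clustering, so in principle R could do something similar. A rough sketch, assuming a flat CSV of numeric variables (the file and column names below are invented):

# Sketch: MZA-style fuzzy clustering in R. Illustrative only; substitute
# the real file and column names.
library(data.table)    # fread() loads big flat files far faster than read.csv()
library(e1071)         # cmeans() does fuzzy c-means, similar in spirit to MZA

dat <- fread("yield_points.csv", select = c("yield", "ec", "elevation"))
fit <- cmeans(scale(dat), centers = 3, m = 2)   # 3 zones, fuzziness exponent 2
table(fit$cluster)                              # points assigned to each zone

fread() alone might fix the slow loading, at least.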

 

Also, I don't know if this is helpful, but my CPU and memory usage when running VESPER are around 50% and 3,320 K respectively. There's a similar usage pattern for Management Zone Analyst, except its memory usage is 382,500 K.

Edited by MedicineBaby


Don't trust a simple CPU% figure. It's best to pull up a live view with a moving graph for each logical core. You don't want to go blowing money on a better CPU if the app is only single-threaded.

Also, for reference: with Hyper-Threading enabled you'll usually get CPU figures that aren't necessarily accurate. E.g. cores 0-3 might be near 100% with the others barely active, and overall utilization will read something like 65% when in reality the CPU is effectively maxed out, as Hyper-Threading in no way doubles the available grunt.


I just overclocked my CPU to 3.8 GHz and it flew through a benchmark I'm using in VESPER. It took 1,117 seconds to do 596,569 interpolations at stock speed. With the overclock it only took 556 seconds.

 

Also, while I was looking at the graph in task manager, it looked like one core was doing more work than the other.

 

You're absolutely right that I don't want to spend money upgrading my CPU unnecessarily. The G3258, while not spectacular, has been the perfect workhorse for my needs.

 

I'll keep testing to see if I get similar results with MZA. If I do, then the solution may be as simple as temporarily overclocking my CPU.


Generally Core 0 will be busier than the others. That overclock has given a 100% improvement (1,117 s down to 556 s), which is a bit strange given that 3.8 GHz is only about a 19% OC over 3.2 GHz. Normally, if you're only bumping up the CPU speed, or even increasing memory speed along with it, the expectation is diminishing returns, not double the throughput.

 

 

Edit: possibly the second run is benefiting from a lot of the required data already being cached and available. Probably a good idea to power down between tests.

Edited by Rybags


Wouldn't be surprised if it wasn't my overclock solving the problem ;)

 

However, my system started showing signs of instability. MZA also crashed when I was putting it through another run. I'll have another run tomorrow and see if I get the same results.

 

Cheers.


Drop the clock back slightly for stability, and consider an SSD or a cheap RAID 0 for the data 'in use'. Back up somewhere secure.

 

I'll be genuinely surprised if your CPU is bottlenecking your conventional HDD...

 

It's rare to find a 'broken' CPU. Consider a 'real quad' second-hand; keep an eye on Gumtree.

 

Also consider a Bulldozer-core AMD. Programs and uses like this are EXACTLY what a genuine 6-8 core CPU (not hyperthreaded) was designed for.



 

I'm guessing core 0 is the one the OS uses the most.


Errm... the per-clock throughput of AMD's so-called 8-core CPUs is somewhat less than that of Intel's 4-cores, whether HTT is enabled or not.

 

Really, they're still playing catchup with Core 2.


The cache config for both is fairly similar: per-core/module L1 and L2, with a shared L3. A shared L3 is what you'd want anyway, given that program threads don't usually run with logical-processor affinity set.

 

Looking at a variety of the benches at Tom's Hardware, even the 8-core FX-8350 gets matched or beaten by "lesser" competitors running as much as 0.8 GHz slower at 4c/4t.


Just an update.

 

I did another run last night with VESPER, this time with my processor overclocked to 4.2 GHz. Overall, barring a stuff-up on my part with one of the configuration files, it took about 12 hours to complete spatial predictions for 11 different variables.

 

So I saved about 2 or 3 hours compared to last time, which was pretty good.

 

Now I'm doing some cluster analysis on a couple of variables using MZA. I'm hoping this works out as well. Otherwise, I'll punch my screen :D


Are there any application-side optimisations you can do?

 

Like having the data pre-sorted, or placed on different physical drives. Or maybe there's some screen echoing, status updates, or the like that could be disabled to speed it up.


I've disabled plotting of the predictions and other visualisations, placed the data on different physical drives, and, for the cluster analysis I'm currently doing, stripped unnecessary data (GPS information) from the files to reduce their size.
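
In R, that kind of stripping is quick with data.table; something like this (the file and column names are invented for illustration):

# Drop unneeded columns (e.g. GPS metadata) before handing files to MZA.
library(data.table)

dat <- fread("raw_points.csv", drop = c("gps_fix", "satellites", "hdop"))
fwrite(dat, "points_trimmed.csv")     # smaller file for the analysis run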

 

EDIT: Out of interest, if I were able to get my hands on a 4790K, would I be able to satisfy the power requirements? I think my power draw at load would be around 350W.

Edited by MedicineBaby

