Looking for some help putting one together for work. Our ML department are currently using AWS servers but don't have any standalone desktops of their own to loan us across departments. As the data we're using is highly confidential, the company wants to keep it in house and build something ourselves. I have very little ML experience, but I know how to build a pretty solid gaming computer. Can't be that hard, right??
Looking at two specs, both AMD, but I'm also putting together an Intel build based around the i9-9990XE. Are there any other alternatives to these commercial chips?
1st one comes in a smidge under £3600:
CPU: AMD - Threadripper 2990WX 3 GHz 32-Core Processor
CPU Cooler: Corsair - H100i PRO 75 CFM Liquid CPU Cooler
Motherboard: MSI - MEG X399 CREATION EATX TR4 Motherboard
Memory: G.Skill - Ripjaws V Series 64 GB (8 x 16 GB)
Storage: Samsung - 850 EVO-Series 500 GB 2.5" Solid State Drive
Video Card: EVGA - GeForce RTX 2070 8 GB XC
Case: Corsair - Air 540 ATX Mid Tower Case
Power Supply: SeaSonic - FOCUS Plus Gold 850 W 80+ Gold Certified Fully Modular
2nd one comes in at £2600:
CPU: AMD - Threadripper 2950X 3.5 GHz 16-Core Processor
CPU Cooler: Noctua - NH-U14S TR4-SP3 82.52 CFM CPU Cooler
Motherboard: Asus - PRIME X399-A EATX TR4 Motherboard
Memory: G.Skill - Ripjaws V Series 64 GB (8 x 16 GB) DDR4-2666 Memory
Storage: Samsung - 850 EVO-Series 500 GB 2.5" Solid State Drive
Video Card: EVGA - GeForce RTX 2070 8 GB XC ULTRA
Case: Corsair - Air 540 ATX Mid Tower Case
Power Supply: SeaSonic - FOCUS Plus Gold 850 W 80+ Gold Certified Fully Modular ATX
I know the 2990WX/2970WX has issues with stability and performance on Windows 10, so it's a bit of a curveball to blow the extra money...
If the budget was unlimited I'd blow it on an Nvidia DGX box, but unfortunately it's not.
Thanks!
What does the workload look like? Is it GPU heavy, CPU heavy, highly threaded or not? Is memory bandwidth critical, I/O, etc.?
Also, I wouldn't recommend the 850 Evo; I had firmware problems with mine. That may be solved by now, but I'd suggest an 860 would be a worthwhile upgrade.
So your company thinks its in-house security is better than AWS's? Time to review that thinking.
@retro83 - I'll be getting all that info today, hopefully. Currently we're running most algorithms in RStudio and through Python on laptops left on overnight or for days on end... not ideal. We're essentially running modelling and assumptions on data that is anywhere between 10 and 150 million entries. That's an assumption on my part though; I've only seen my own data sets, and the last one capped out at 53 million rows. I know we're not doing any image processing or image recognition, not in our team anyway; it's probably mainly data based, if that makes sense. Good thing is I now have a budget of anything under six figures, so I'll be changing the SSD for sure 🙂 Just need those requirements from the people using it! Thank you for the info
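A rough way to size the RAM for datasets in that range is a back-of-envelope footprint calculation. The column count, value width, and overhead factor below are illustrative assumptions, not figures from this thread:

```python
# Rough sketch: estimate the in-memory size of a tabular dataset.
# 50 columns and 8 bytes/value (float64) are assumptions; the real
# footprint depends on dtypes and the framework's own overheads.
def estimate_ram_gb(rows, cols=50, bytes_per_value=8, overhead=2.0):
    """Approximate working-set size in GB.
    `overhead` allows for copies made during model fitting."""
    return rows * cols * bytes_per_value * overhead / 1024**3

for rows in (10_000_000, 53_000_000, 150_000_000):
    print(f"{rows:>12,} rows -> ~{estimate_ram_gb(rows):.0f} GB")
```

Under these assumptions a 150-million-row table wants on the order of 100 GB of working memory, which is why the 128 GB RAM figure matters more than it might first appear.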
@Shinton - If our Risk department thought we were sending clients' information to a 3rd-party data centre to be processed, we'd get shot. It's just company policy and client confidentiality; yes, the data can be anonymised, but that's irrelevant. The data the ML teams use is completely different to ours, so theirs is acceptable to go through AWS.
Shinton's got a perfectly reasonable point – you should be able to sort out a cloud-based provider with confidentiality. We manage it here in a major bank; I can't imagine your company data is any more confidential.
@Mogrim - I have asked that question, and so have others; the IT dept have just said no, full stop. AWS has already been explored. It's an easier approach, yes, everyone agrees, which is irritating, and the usage is cheap as chips in comparison, but that's not the solution we're heading towards. If only it were that simple, but I have rules to follow and they don't involve AWS. Our IT and Data Science depts do not see eye to eye; trust me when I say we've had many, many painful conversations with IT about getting a cloud-based approach going, but it's a backwards fight. Which is why we have to go around them...
This wasn't supposed to turn into a pissing contest about whose data is more confidential or whose company has better in house security btw.
What're you attempting to achieve with the machine learning? Are you trying to develop optimisation strategies for/from the data, or are you simply trying to analyse the data, perhaps predicting error bounds, correlation etc.? Also, what percentage of your dataset are you using for training, and what for testing?
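On the train/test question, the split itself is cheap to do in plain Python; a minimal sketch (the 80/20 ratio and fixed seed are common defaults, not recommendations from this thread):

```python
import random

# Minimal train/test split sketch -- shuffles a copy of the data and
# cuts it at the requested fraction. Real pipelines would use a
# library routine, but the idea is just this.
def train_test_split(data, test_fraction=0.2, seed=42):
    rng = random.Random(seed)   # fixed seed for reproducibility
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(1000)))
print(len(train), len(test))  # 800 200
```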
If you're attempting to optimise patterns and methods running simulations and computations based on the data, I'd say CPU would be more important, but I'd be tempted to go Xeon over AMD. If you're merely analysing, I'd be tempted to go for multiple graphics cards with large onboard RAM.
In the above specs, your text indicates 64 GB of RAM but your numbers indicate 128 GB.
As above, without knowing the code setup – what resources it has been set up to use (OpenCL, for example), where your bottlenecks are – it's hard to say. Sounds like you are working on a relational DB?
RAID may help, dual Xeon processors may help, multiple graphics cards may help. They could also all be wastes of money.
@Daffy - Thank you. Regarding the RAM, it's intended to be 128 GB; I manually changed the 8 x 16 GB but forgot to change the 64 GB to 128 GB.
The machine will be used by a number of the team, all of whom have different agendas based on their current projects, but without asking them (still not in the office...) I'd say probably all of the above. I don't know the training/testing split either. I guess it just needs to be an all-round machine for starters.
I'll look into the Xeon chips cheers 🙂
Sadly, in a budget constrained application, requirements are going to be everything.
If your budget allows, I'd be looking at dual-CPU and dual-GPU setups, both water cooled. I'd also be significantly upping your storage. The swap space required for some ML and DL datasets is huge, and you don't want to be running off down the LAN (especially if it's gigabit) every time you want to store or retrieve data. If simulations are required, your RAM requirements may need to shift.
We run a couple of high-performance workstations here, which are dual 22-core Xeons and dual GeForce 2070s with 512 GB DDR and a RAID 10 array of 4 TB (4 × 2 TB SSDs, mirrored and striped). They're adequate and can be run hard. The CPUs can be significantly overclocked without affecting stability, and the on-board cache is significant. The GPUs are less stable... I'd be tempted to use Quadro cards next time and accept the loss of outright speed.
Still, for optimisation tasks we have a small (400-core) HPC cluster with a couple of TB of RAM, as the workstations get overloaded.
As said above, I wouldn't worry overly about data security for cloud-based solutions. Some of our data is defence related and it can be handled on Amazon's cloud. Though it does cost us.
Don't forget the USB ports so users can put in a stick and steal the confidential data 😉
Perfect, thanks Daffy. I'll take the advice from Shintons too and make sure there are spare USBs lying around and adequate ports for data sharing...
This wasn’t supposed to turn into a pissing contest about whose data is more confidential or whose company has better in house security btw.
Didn't mean it to be, just thought Shinton had a decent point!
Anyway, a quick Google suggests Nvidia are the industry leader for ML, which may make a difference. I'd definitely be looking into whether your codebase/compilers are supported by the hardware before making any hard choices.
Interesting article here: https://timdettmers.com/2019/04/03/which-gpu-for-deep-learning/
@Mogrim - He did yes and I was quick to judge which is why I edited my post.
I've seen that link; he has another very decent post which I read last night, and I was still none the wiser! I think, as others have posted, without truly knowing what the machine will be used for I may end up building the incorrect type of machine.
Have you seen their DGX-1 box? I showed that to my manager and he just laughed and said good luck.
It seems one of the future users isn't fussed about the GPU after I asked him his requirements, which doesn't make it any easier! Also, I'm not happy about chucking in a £20 GPU just to boot...
• The machine will be running R/Python and fitting machine learning models such as neural networks and tree models
• We expect to fit these models using the CPU for now, so there is no need to add in a decent GPU. If a GPU is required to boot, then an ultra-cheap £20 one for 2D graphics will do.
• Might as well ensure the motherboard has slots for multiple GPUs for potential future upgrade.
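The "tree models on CPU" point in the list above is easy to demonstrate: even a pure-Python decision stump (the one-split building block of tree models) trains instantly on a CPU. This is a toy sketch only; real work would use scikit-learn or similar:

```python
# Tiny pure-Python decision stump: find the threshold on a single
# feature that best separates binary labels. Illustrates that simple
# tree-style fitting is plain CPU work, no GPU involved.
def fit_stump(xs, ys):
    best = (None, -1)  # (threshold, number of correct predictions)
    for t in sorted(set(xs)):
        # rule: predict class 1 when x >= t, class 0 otherwise
        correct = sum((x >= t) == bool(y) for x, y in zip(xs, ys))
        if correct > best[1]:
            best = (t, correct)
    return best[0]

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
print("learned threshold:", fit_stump(xs, ys))  # 10 splits the classes perfectly
```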
I think as others are suggesting, you need to know more about both the algorithms and the implementations of those algorithms, especially for GPU choice.
- not all ML is deep learning
- ML algorithms other than deep learning can run faster on GPU
- not all ML algorithms run faster on GPU
- does your implementation environment (you mentioned R) support GPU accelerated implementations and if so for which algorithms? Do you want to use those?
- other environments can help. We use h2o via R for a lot of our ML. GPU support for sure, and supports distributed data structures and algorithms. So can work in-memory on large data sets when that memory is spread across a cluster of machines.
Key things:
- It's pretty unusual to be doing ML without a GPU these days. Spare slots aren't the only consideration; cooling and power are too.
-- if it's just old-skool NN and no deep learning, or inference only, it could just be CPU.
- Double check the assumption on GPU use. If they are using one, then memory size is important. E.g. if your model needs 12 GB to train, smaller GPUs won't cut it.
- Scan are a good supplier for this sort of custom workstation thing. Speak to them on the phone.
- OS is important, most modern ML is under linux, so you need good support.
- If it's NOT deep learning, memory size might be important, in which case Xeons. Which Xeons and which sockets depend on requirements. If you need large memory, then you need 2 sockets to go beyond 1.5 TB, but that's way out of your budget.
- Just say no to AMD kids. There've been some issues in linux in the past.
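On the GPU memory point in the list above, a quick back-of-envelope check is enough to rule cards in or out. The 4× multiplier (weights + gradients + two Adam optimizer moments, all float32) is a common rule of thumb, not a precise figure, and activations add more on top:

```python
# Back-of-envelope GPU memory estimate for training a dense network.
# n_params is the model's parameter count; the multiplier assumes
# float32 weights, gradients, and two Adam moment buffers.
def training_mem_gb(n_params, bytes_per_param=4, multiplier=4):
    return n_params * bytes_per_param * multiplier / 1024**3

# e.g. a hypothetical 500M-parameter model:
need = training_mem_gb(500_000_000)
print(f"~{need:.1f} GB before activations")
print("fits on an 8 GB card:", need < 8)  # tight even before activations
```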
An obvious starting point - which EC2 instances do they use just now? Spec similar?
Shinton's point may be valid from a security process / governance perspective but that doesn't trump the data retention and sovereignty requirements the OP's data owners and customers might put on that data. Sure there may be alternate cloud providers who can help and maybe AWS can if OP does the right amount of due diligence but this isn't purely about who has the 'best' security.
I'm gonna be quoting you all from here as this knowledge is invaluable and will certainly help spec something together, once I have all the required input from my end 🙂
Just got a spec from our main ML team, they do use AWS for 90% of things but have this sat in the basement that they can use:
Dual Intel Xeon Processor E5-2699 v4
512GB (16x32GB) 2400MHz DDR4 RDIMM ECC
2.5" 1TB SATA Class 20 Solid State Drive
Dual SLI NVIDIA Quadro P6000
Cost them £18k back in 2017.
@IA - Linux is something we want but something IT won't support. Not sure if it's because they don't know how, but we're hemmed in quite a bit; lots of red lines/tape/issues etc.
Sounds like your IT department need reminding that their purpose is to support the business not be a PITA :D.
@Nixie - So very true. They run things in an extremely old-fashioned way, or so I'm told. It's us vs them when it comes to tech. It's hard to uphold a mandate of pushing the boundaries of tech and data science when your own IT dept refuses to move forward with the times.
You can get used servers from eBay for buttons. Buy ten, job done 🙂
Re cloud - you can have private instances of services and infrastructure located in the UK. I may or may not have a professional interest in this area.
You can also get on-premise AWS installations, no idea about performance but would no doubt speed up deployment if they're already using AWS elsewhere. Assuming your IT department supports it, of course 🙂
We also do on-premise cloud, but I don't think that's going to help the OP with his single machine requirements.
I have dealt with dev vs IT before. The best thing with a development machine such as you are building is to classify it out of their range: classify it as another tool. It's not part of their IT infrastructure. Much as, if your department needed a TV with a network connection, it would not run Windows; your tool doesn't run Windows.
You probably will not be able to have it attached to "their" domain, so have it placed behind a local firewall and switch that they can do whatever they want with, while it sits on your own private development network. This way you should be able to get whatever it is you want onto it, but you will have to administer it yourself.
We have done this when working for a very big multinational and managed to get unapproved items that we needed onto site.
Obviously get your manager onside, and get him to go over the head of the IT dept if needed. If that's not possible, then when the issue is raised, break down each one of their arguments, but don't undermine them, especially in front of their manager: "I understand your security concerns, any lapse in this area could be catastrophic etc., so why don't we do xyz?" Assume they have already given the go-ahead and you are discussing how to solve their problems, even when they are saying no, rather than arguing the "no". Eventually you will win.
When talking to higher management create business arguments not technical arguments, again don't undermine the IT dept as they will get defensive.
Can you profile your workloads first? It's easy to max out your CPUs with some stuff before hitting the GPU limits. +1 for considering power supplies and cabling etc. when looking at multi-GPU solutions. Rack servers will sometimes not support full-size GPU cards, btw, so along with being out of your budget, they're not so useful (irritatingly, I have access to some massive machines that can't fit GPUs in). I would always recommend Xeon® CPUs... because I work for Intel® 😀
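For the profiling suggestion above, Python's standard library is enough for a first look at where a workload spends its time. A minimal sketch, where `toy_model_fit` stands in for your real training call (it is a placeholder, not anything from this thread):

```python
import cProfile
import io
import pstats

# Placeholder for the real workload -- swap in your actual fit call.
def toy_model_fit(n=200_000):
    return sum(i * i for i in range(n))

# Profile the call and print the top entries by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
toy_model_fit()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

If the hot spots are CPU-bound maths, that argues for cores and clocks; if the time is in I/O or data loading, faster storage matters more than any GPU.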
You still don't seem to know much about the workloads required, unfortunately I know nothing about ML workloads (although part of my job is spec'ing servers for business workloads).
If ML is CPU-intensive you might want to at least spec a dual-socket motherboard, although the latest AMD CPUs are extremely good. A dual-socket motherboard usually carries a fair price premium, but it would end up a lot cheaper than a second server if you hit a CPU bottleneck.
What's the GPU for? A 2070's a pretty high spec, but unless the ML is using it, is it just to drive a big monitor to display the data? If so, it might be overkill (then again, you have enough budget that I guess it's not worth worrying about).
Storage concerns me. Firstly, you want to separate the operating system from the application and data. I would spec a pair of smaller SSDs for the OS (250 GB should be fine; our standard Windows 2012-2019 OS build is a 60 GB OS volume). Using an enterprise-quality RAID card, set them up as a RAID 1 mirror.
For the app you could use another set of SSDs and RAID them (depending on how much storage it needs). Ideally you'd keep the data sets separate again; whether they could be HDD or need to be SSD depends on how the data is used (if it caches a load in memory before processing it, then HDDs might be enough). If the data is several TB, then look at external DAS storage (or NAS if it needs to be shared with other servers, now or in the future).
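When comparing the RAID options above, it helps to sanity-check usable capacity against raw drive count; a small sketch (drive counts and sizes are illustrative):

```python
# Usable capacity for common RAID levels -- a rough planning aid.
# Ignores filesystem overhead and hot spares.
def usable_tb(drives, size_tb, level):
    if level == 0:    # striped, no redundancy
        return drives * size_tb
    if level == 1:    # mirrored, half the raw capacity
        return drives * size_tb / 2
    if level == 10:   # mirrored + striped, needs an even count >= 4
        return drives * size_tb / 2
    if level == 5:    # one drive's worth of parity
        return (drives - 1) * size_tb
    raise ValueError(f"unhandled RAID level: {level}")

print(usable_tb(4, 2, 10))  # 4 x 2 TB in RAID 10 -> 4.0 TB usable
print(usable_tb(4, 2, 5))   # same drives in RAID 5 -> 6 TB usable
```

RAID 5 gives more usable space from the same drives, at the cost of slower rebuilds and worse write performance than RAID 10.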
Also what about backups?
Where is it going to be located? (Noise levels)?
Oh and:
I have dealt with dev vs IT before. The best thing with a development machine such as you are building is to classify it out of their range: classify it as another tool. It's not part of their IT infrastructure. Much as, if your department needed a TV with a network connection, it would not run Windows; your tool doesn't run Windows.
You probably will not be able to have it attached to "their" domain, so have it placed behind a local firewall and switch that they can do whatever they want with, while it sits on your own private development network. This way you should be able to get whatever it is you want onto it, but you will have to administer it yourself.
We have done this when working for a very big multinational and managed to get unapproved items that we needed onto site.
Please don't do this. Shadow IT is a pain and is often the route in when company systems are compromised. Yes, your IT department sounds a nightmare and should be more flexible, but devs/scientists running their own IT in a little bubble, without a clue about the wider picture and little understanding of security, is a recipe for disaster, especially as your company places a high value on your data. If AWS isn't secure enough for it, then neither is shadow IT.
I keep getting my data scientists coming with requests for bigger PCs. I am pushing back, as there will never be enough for them unless you go all out for some crazy machine.
One option would be to get a load of cheap servers off eBay and configure a Spark cluster, going for scale-out rather than scale-up, but Linux seems like a problem for you.
We also run MS extensively, and I have some of my team running the Windows Subsystem for Linux to be able to use R and Python as they want, since those are mainly written for Linux.
Windows Server 2019 also allows WSL, so might be an option.
We have had similar issues with cloud, but it has been compliance that has been the issue. We are finally getting over this bottleneck (just waiting on final rubber stamp), and we are heading for Azure (with some restrictions like no US, BYOK ....)
It's not shadow IT; the point is that a development environment/system is not a normal computer, it's a different tool. As I said, if it needs to be on a separate network or behind an isolated firewall, then fine. IT still have control of the gate, but devs need to have control of their development system. With proper isolation this is not an extra security risk.
As I said, if it needs to be on a separate network or behind an isolated firewall, then fine. IT still have control of the gate, but devs need to have control of their development system. With proper isolation this is not an extra security risk.
But unless it runs completely isolated, with no data ingest, then even if air-gapped you still need a second level of AV, anti-malware systems etc. What if there's a requirement for Internet access (how are you patching or updating AV otherwise)? That's another set of proxies and content-inspection devices. As the data seems to be valuable, where's the central control to prevent its exfiltration? Where's the separate systems monitoring to audit and alert to insider threats?
The IT department may seem like an unnecessary, inflexible overhead, but a decent IT department is doing a lot behind the scenes that the average dev or user has no clue about. Sure, I need to get with the times a bit myself; I'm trying to keep an open mind with DevOps and am working on a couple of projects that utilise it. However, I'll never believe the end user is the best-placed person to manage the infrastructure (knowledge-wise, or in being able to think rationally when balancing conflicting requirements).
Couple of people have mentioned clusters to scale out vs scale up.
This can work (we do it that way) but there are limitations. You are basically limited to the algorithms implemented in Spark and/or h2o which is a much smaller subset of the larger potential. It was enough for us, but it needs to be checked.
I don't know anything about the scale out/clustering of GPU resources, we targeted CPU.
You can write your own distributed algorithms on top of Spark/h2o but that is serious development effort.
A backend Spark/h2o cluster running on Linux can be accessed via R, Python, other tools by users on other OSs.
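One caveat on the scale-out discussion above: speedup across a cluster is capped by whatever fraction of the job can't be parallelised. A quick Amdahl's-law sketch makes the point (the 90% parallel fraction is an assumed figure for illustration):

```python
# Amdahl's law: overall speedup from running the parallelisable
# fraction of a job across n nodes. Ignores network and shuffle
# overheads, so real cluster numbers will be worse.
def speedup(parallel_fraction, n_nodes):
    return 1 / ((1 - parallel_fraction) + parallel_fraction / n_nodes)

for n in (2, 4, 8, 16):
    print(f"{n:>2} nodes: {speedup(0.9, n):.2f}x")
```

Even with 90% of the work parallelisable, 16 nodes only buy a 6.4x speedup, so it's worth profiling the serial parts (data loading, shuffles) before buying a rack of eBay servers.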
I suspect that a solution like Wee Archie might not work for you. I'm just throwing it in here, as with the new, faster Pi 4 you get a lot of bang for your buck, provided the workload can be distributed across the (number of Pis × 4) cores you'd be getting.
Personally I have absolutely no use for a machine like this but I still really want to build one...
Depends on data types, but compiling your code with CUDA for GPU will give you MUCH faster training times than relying on CPU. Works better on Linux than Windows, though...
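Before assuming CUDA builds will work on a given box, it's worth a framework-free sanity check that an NVIDIA driver is even present. A minimal sketch using only the standard library (frameworks have their own checks, e.g. `torch.cuda.is_available()` in PyTorch):

```python
import shutil

# Cheap preflight check: the nvidia-smi tool ships with the NVIDIA
# driver, so its presence on PATH is a reasonable proxy for a
# working driver install (not a guarantee of a usable GPU).
def has_nvidia_gpu():
    return shutil.which("nvidia-smi") is not None

print("NVIDIA driver found:", has_nvidia_gpu())
```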
Finally an updated response from the guy who will probably be using it 99% of the time.
I don’t think that the machine needs to be so specialist that we should need to define the processes we will run. The machine needs to be good enough that we can get some models built, and if we find that they are useful and it would be beneficial to upgrade to something very expensive and bespoke then we can. We will be running models based on decision trees, neural networks, and other predictive modelling frameworks (GLMs, SVMs, etc.) in Python and in R. I understand that graphics cards boost neural network training times quite significantly, but I don’t want us to fall into the class of “all the gear but no idea” having spent a fortune on graphics cards. Neural networks are notoriously difficult to work with, and we might (will probably) find that tree-based models are better for us anyway and these run fine on processors.
Seems to me that it won't be such an intense build and I can reduce the spec level quite a bit.
Thanks for all the help and advice so far, IT vs DevOps continues.
In that case then, assuming you're budget sensitive and not in a "use it or lose it" situation, the high clock speed 6 or 8 core i7/i9s give a nice balance of raw CPU speed for your money, and can still host a reasonable amount of RAM.
E.g. I've a "cheap" 8700K-based machine beside me, 6 cores up to 4.7 GHz; a 9700K gets you 8 cores. Still "consumer" parts, so limited to 128 GB (or 64 depending on the machine), but pretty cheap and cheerful. £1500 would easily get you something decent without a GPU, based on a 9700K + 64 GB, but with space to take GPUs in the future. So if a fast 8-core plus "some" RAM will do, if it were me I'd phone up Scan and ask them to quote for a 9700K, 64/128 GB RAM, a decent boot SSD, Win 10 Pro (or whatever you use), a data HDD, and ask them to spec the rest of the machine.
If you need more RAM/CPU than that, price jump to Xeon workstations.