Machine learning development environment

Submitted by Anonymous (not verified) on Fri, 08/19/2022 - 19:53

If you're new to machine learning, you may have a development environment you can access through your school or your workplace. If you're on your own, you can take an online course (see below) that can point you in the right direction. Barring that, below are some tips to get you started.

Google Colab: For a zero-cost, powerful machine learning environment, including access to GPUs and access-everywhere cloud storage, you can't really beat Google Colab notebooks. You can follow an intro tutorial like the one in the resources below, or just browse over to colab.research.google.com, sign in, and get started, no credit card needed. Shopping for a GPU computer for DIY modelling? Consider a Colab Pro account instead, currently priced at ~$10/month: you don't have to maintain a GPU, and you get access to recent hardware without buying it. Plus, Colab gives you context-sensitive help, and you can share your notebooks. It may not be the best solution for your most sensitive IP, though. If you're more of a tinkerer and really want your own GPU server, read on.

Other cloud providers: For a small cost, Microsoft, Amazon, IBM and others offer robust in-the-cloud machine learning environments. These are usually best for larger-scale deployments or for organizations that already have business associate agreements (BAAs) with those companies and can absorb open-ended billing. For smaller, individual consumers, be warned: you will be required to give a "blank check" for billing compute, much like a utility, and it may be difficult to track expenses. Imagine receiving a disproportionately high water bill for your beach house, only to discover that a pipe had burst. Similarly, run-away servers can easily rack up cloud costs in the five- to six-figure range for GPU processors. Caveat emptor.
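To get a feel for how quickly a forgotten server adds up, here is a rough back-of-the-envelope sketch. The hourly rates and instance counts are made-up placeholders, not quotes from any provider, and `runaway_cost` is just an illustrative helper name; check your provider's pricing page for real numbers.

```python
# Rough estimate of what a forgotten ("run-away") GPU server costs.
# All rates below are hypothetical placeholders, not real provider prices.

HOURS_PER_DAY = 24


def runaway_cost(hourly_rate_usd: float, instances: int, days: float) -> float:
    """Return the bill for `instances` servers left running for `days` days."""
    return hourly_rate_usd * instances * HOURS_PER_DAY * days


if __name__ == "__main__":
    # One assumed $3/hour GPU instance left on for a 30-day billing cycle.
    print(f"1 instance:  ${runaway_cost(3.00, 1, 30):,.2f}")
    # Eight assumed $30/hour multi-GPU instances left on for the same period.
    print(f"8 instances: ${runaway_cost(30.00, 8, 30):,.2f}")
```

With these assumed numbers, the first case comes to $2,160 and the second to $172,800, which is how a small experiment turns into a six-figure surprise. Most providers offer budget alerts; it may be worth setting one before you launch anything.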

macOS and MacBooks: Apple is known for its proprietary hardware that works together seamlessly. This means you may have to pay top dollar for a GPU-enabled machine to do machine learning, but maybe you have the money and just don't want to fuss with drivers. A Google search will help you find everything you need; for an example, see below.

PCs, Ryzen (and Linux): The nice thing about PCs is that a well-appointed laptop with an NVIDIA GeForce GPU, a 15" touch screen and 14 cores can be had for under $2k as of this writing. For even more fun, try building your own desktop from builds at pcpartpicker.com, or design your own! That website has been around for over a decade, and the community is robust and helpful. Supplement with a trip to Best Buy for your case to get just-in-time advice from a local enthusiast, or join a meet-up. Importantly, the number of cores on your CPU (AMD Ryzen, Intel Core i7, etc.) matters much less than you might think, because the heavy lifting of training happens on your GPU, which can give you up to a 100x speed-up.

Once you have a rig that's working, here are some tips to get you started on setting up the software using Linux:

  • Yes, Python is slow, but Python is fast enough for machine learning because you'll just use it to talk to a PyTorch or TensorFlow backend, which is implemented in C++ and CUDA rather than in Python.
  • Install WSL2 on Windows to get your Linux environment. As you advance your machine learning skills, you'll want to have access to a Linux shell.
  • Get Ubuntu: as of this writing, 22.04 is the latest release that works well with the NVIDIA CUDA toolkit version 11.6 and TensorFlow 2.6. This is less important if you're using PyTorch, which came out shortly after TensorFlow and is a little less touchy about versions. You can also try various terminals (ConEmu, MobaXterm, ...) and different Linux distributions from the Microsoft Store, depending on your preferences.
  • X Server: This will allow you to "show" any visualizations (e.g., from matplotlib.pyplot or R's ggplot2) right on your PC, in a separate window from your terminal. Xming is a very common server, but VcXsrv is also free, newer, and has a few more features.
  • zsh: Ubuntu comes with a bash shell, which is fine, but if you haven't upgraded to zsh and you're not enamored with tcsh or something similar, there's no time like the present to give zsh a try! Oh My Zsh with the Powerlevel10k theme is a killer combination. Give it a go if you're feeling adventurous and want your mind blown.
  • Editors: if you use vi, you're good to go since it's standard on most Linux distros. Otherwise, `sudo apt-get install emacs`
  • git: If you hope to publish/share your work (I hope you will!), now is a good time to set up your authentication keys (for example, SSH keys). Now that zsh and emacs are to your liking, you may also like to create a private GitHub repo for your dot-files.
  • docker: Install Docker Desktop for containerizing your applications. This becomes more critical when you want to deploy your apps without worrying about portability issues.
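Once you've worked through the list above, a quick sanity check can tell you what is actually installed. The following is a minimal sketch using only the Python standard library. The tool list, the helper names, and the WSL heuristic (looking for "microsoft" in the kernel release string) are assumptions for illustration, so adjust them to your own setup.

```python
# Quick sanity check of the tools from the list above.
# Uses only the standard library; the tool list is an illustrative assumption.
import os
import platform
import shutil

TOOLS = ["python3", "git", "zsh", "emacs", "vi", "docker"]


def running_under_wsl() -> bool:
    """Heuristic: WSL kernels usually report 'microsoft' in their release string."""
    return "microsoft" in platform.uname().release.lower()


def check_environment(tools=TOOLS) -> dict:
    """Map each tool name to its location on PATH, or None if it is missing."""
    report = {tool: shutil.which(tool) for tool in tools}
    # DISPLAY normally needs to be set before plots can open on an X server.
    report["DISPLAY"] = os.environ.get("DISPLAY")
    report["WSL"] = running_under_wsl()
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name:>8}: {value}")
```

Anything that prints `None` is either not installed or not on your PATH. If `DISPLAY` is `None`, matplotlib windows will most likely not appear until your X server is configured.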

If you're still here, congratulations! Now you're ready to install the real workhorses, Conda and TensorFlow. Alternatively, you could install Conda and PyTorch, or R. When deciding on R vs. Python, some say R has more hardcore statistical packages, while Python is easier to use and has a broader community of contributors. When deciding on TensorFlow vs. PyTorch, this article may help. For Conda and TensorFlow, here's a bit more help:

  • conda & TensorFlow: If you're writing in Python and not R, you're in luck: if your environment becomes really bloated or munged up, you can blow away just the Python part to get a clean start without having to reinstall the operating system. Do this by installing conda. Better still, install Miniconda and install Python libraries only as needed. If conda is too slow, check out mamba, which will make quick work of installing TensorFlow.

For detailed instructions on how to set up conda and TensorFlow for analyzing GTEx, a public gene expression dataset, check out tiscla.

Resources
