There are over 3 billion base pairs (molecular pieces of information) in the human genome and, on average, 3 million differences between any two individual humans. These differences are transcribed into RNA, which is further sliced and diced before being translated into proteins, which drive cellular processes through enzymatic interactions and cell-cell signaling. All of these differences and complex processes must be unravelled in order to better understand biological organisms and to improve human healthcare. The complexity of this landscape has made this a nearly intractable puzzle, but with powerful computational platforms and machine learning techniques, scientists are starting to unravel some of life’s greatest secrets.
At the 2013 ASHG meeting, Bill Gates stated that “we have finally achieved almost infinite compute and bioinformaticians are among the few who can fill it.” Bioinformaticians are hard-pressed to analyze and organize this plethora of data with manual or even traditional analytical techniques. Machine learning lets the computer learn in a data-driven way, allowing the data itself to drive pattern recognition and prediction. Even so, human training and skill remain essential to prevent over-fitting a model or introducing bias, both of which are easily overlooked and difficult to detect downstream.
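To make the over-fitting risk concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption chosen for illustration: the features and labels are random noise, and a 1-nearest-neighbor classifier stands in for any model flexible enough to memorize its training set. It is not drawn from a real genomic dataset or a particular library.

```python
import random

def make_data(n, rng, n_features=20):
    """Random features with coin-flip labels, so there is no real signal to learn."""
    X = [[rng.random() for _ in range(n_features)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    return X, y

def predict_1nn(train_X, train_y, x):
    """Return the label of the closest training point (pure memorization)."""
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[nearest]

def accuracy(train_X, train_y, X, y):
    """Fraction of points in (X, y) that the memorized training set predicts correctly."""
    hits = sum(predict_1nn(train_X, train_y, x) == label for x, label in zip(X, y))
    return hits / len(y)

rng = random.Random(0)  # fixed seed so the run is repeatable
train_X, train_y = make_data(200, rng)
test_X, test_y = make_data(200, rng)

train_acc = accuracy(train_X, train_y, train_X, train_y)
test_acc = accuracy(train_X, train_y, test_X, test_y)

print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

Because each training point is its own nearest neighbor, the model scores perfectly on the data it has already seen, while accuracy on the held-out set should hover near chance. Evaluating only on training data would hide the problem, which is why a held-out set (or cross-validation) is the usual first safeguard.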
For these reasons, a solid education in bioinformatic techniques and machine learning is a rewarding investment. After learning the foundations of biology, genomics, immunology, oncology, virology, proteomics, systems, computer science, statistics, data science, and machine learning, expect to spend a lifetime specializing and deepening these skills. Clearly, none of this can be done in a vacuum, and the best bioinformaticians build a network of collaborators whose deep expertise in all these areas they can draw upon. Perhaps the most important area of focus for a modern bioinformatician is machine learning and artificial intelligence, given the breadth of data required to find insights and the importance of collecting, structuring, and analyzing that data efficiently, effectively, and in a way that does not introduce bias.
With this in mind, a student might focus on the following before apprenticing to more established practitioners: statistics (probability, Bayesian methods, linear regression), data structures and databases (SQL, NoSQL, GraphQL), cloud computing (AWS, GCP, Azure), further machine learning (random forests, neural networks), and programming languages and frameworks (Python, R, MATLAB, TensorFlow, Keras).