Making it through the machine learning madness

Mahalakshmi Adabala
Nov 5, 2022
Updated: Dec 20, 2022

Is the whole domain of machine learning an ordeal too tough for people to master or understand? There's a lot of scientific reasoning and statistical knowledge behind the code and numbers, but as the entire world gets swept up in the data revolution, it doesn’t hurt to know a little more than those around you.

Data is all around you and plays such a large role in your life that you probably don't even notice it most of the time. Machine learning is one small drop in that ocean, making some of those roles possible. Everything from a smart chatbot to the recommendation system on your favourite online store is driven by algorithms designed to make the user experience simpler and more seamless.

Getting straight to the basics:

Picture a typical user, just like you, wading through websites, generating content, sharing opinions and creating enormous quantities of data. This data may seem like just a sequence of strings and numbers, but when fed into a computing system it can yield some magnificent results.

In the case of machine learning, a team of engineers and analysts first feeds the data into software such as SPSS or MATLAB. The objective is to give the system both the inputs and the desired outputs so that the machine can work through the data and ultimately ‘learn’ how to arrive at the output.

From a technical standpoint, this is done through a system of layers and nodes that pass the input forward, with each connection scaling it by a different weight. These weights affect the accuracy of the final model. The primary objective of machine learning thus becomes to test models by tweaking these parameters and comparing the produced results with the actual results.
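
As a rough illustration of layers, nodes and weights, here is a minimal Python sketch of a single pass through a tiny network. The layer sizes, the weight and bias values, and the helper names `layer` and `forward` are all made up for illustration and do not come from any particular tool.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One layer: every node takes a weighted sum of the inputs plus its bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def forward(inputs):
    """Pass two inputs through a hidden layer of 2 nodes, then 1 output node."""
    # Hypothetical weights and biases; in practice these are learned from data.
    hidden = layer(inputs, weights=[[0.5, -0.2], [0.1, 0.4]], biases=[0.0, 0.1])
    output = layer(hidden, weights=[[0.7, -0.3]], biases=[0.05])
    return output[0]

prediction = forward([1.0, 2.0])
```

Changing any one of the weights changes `prediction`, which is exactly the kind of tweaking the training process automates.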

Time to brush up on some statistical analysis. Remember all those classes spent learning about standard deviation in school? A closely related idea, the typical size of the gap between predicted and actual values, becomes a tool to assess how closely the model's values match the actual ones.
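
One common way to put a number on how closely predictions match the actual values is the root mean squared error (RMSE), which is calculated much like a standard deviation. The sketch below uses invented numbers, and RMSE is only one of several measures an analyst might pick.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: the typical size of the prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [3.0, 5.0, 7.0]      # hypothetical true values
predicted = [2.0, 5.0, 9.0]   # hypothetical model outputs
error = rmse(actual, predicted)
```

The smaller the value of `error`, the closer the model's outputs sit to the real ones; a perfect model would score zero.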

You might be wondering how this is done. Analysts split the data into two portions, one for ‘training’ the model and one for ‘testing’ it. In this context, training means letting the system learn from the training data, and testing means comparing the model's predictions against the held-out testing data to check accuracy.
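
The split itself is simple to sketch. The function name `train_test_split`, the 20% test fraction and the seed below are assumptions chosen for illustration; real projects usually rely on a library routine for this step.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then hold back a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))          # stand-in for ten rows of a dataset
train, test = train_test_split(data)
```

Shuffling before splitting matters: if the rows are sorted in some way, an unshuffled split would test the model on a different kind of data than it was trained on.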

Components of a typical network:

You’ll hear the words ‘neural networks’ repeated many times in any machine learning text. That's because a popular family of machine learning models is loosely modelled on an organic neural network, where signals are relayed from point to point. But for the analyst, there are some other components to take care of:-

Predictor:- A fancy way of referring to the parameters, variables or inputs that affect the output and are used in the model. Not all predictors play an obvious role in the output, which is why analysts spend some time determining the correlations and coefficients between them.

Unsupervised Data:- Unsupervised data refers to datasets where the output is not stated, so the system works by understanding what makes one input different from another and how the parameters relate to each other. A good example of an unsupervised dataset would be a set of images that have to be grouped by type when the type isn’t explicitly labelled.

Unsupervised data is usually worked on using clustering and association techniques.
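
To give a feel for clustering, here is a minimal sketch of the k-means idea on one-dimensional data. The points, the starting centroids and the function name `k_means_1d` are invented for illustration; no labels are supplied, yet the points still fall into two groups.

```python
def k_means_1d(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid to its group's mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centroid to avoid dividing by zero.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]   # hypothetical unlabelled measurements
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
```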

Supervised Data:- Contrary to unsupervised datasets, output variables are well understood in supervised datasets and can be modelled against the inputs. Supervised datasets can have multiple outputs and multiple inputs which can make machine learning more complex. Such a machine can be said to be undergoing supervised learning.

Think about a dataset that records retention rates for students in a class against information like hours spent studying, test marks, and class engagement. Now imagine a machine working on this data and producing a model that predicts how well a new student will perform based on the earlier data.

Classification:- Classification is a machine learning technique used to distinguish separate subsets within the data and tell analysts what conditions lead to each output. The output, in this case, is not numeric but belongs to a class.

Common classification methods include decision trees, decision tables, Bayes classifiers, and many other algorithms. Using classification methods, you can determine things like what atmospheric conditions lead to rain, what concentration of different elements will produce a particular type of sand, or even what conditions lead to a person being rejected for a loan.
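
The simplest possible tree is a single yes/no question, often called a decision stump. The sketch below learns one threshold on one feature; the humidity readings, the labels and the names `fit_stump` and `predict` are hypothetical and chosen only to echo the rain example.

```python
def fit_stump(values, labels):
    """Find the threshold on one feature that misclassifies the fewest examples."""
    best_threshold, best_errors = None, len(values) + 1
    for threshold in sorted(set(values)):
        # Rule being tested: predict "rain" when the value is at or above the threshold.
        errors = sum(
            (v >= threshold) != (label == "rain")
            for v, label in zip(values, labels)
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

humidity = [30, 45, 50, 70, 80, 90]                       # invented readings
outcome = ["dry", "dry", "dry", "rain", "rain", "rain"]   # invented labels
threshold = fit_stump(humidity, outcome)

def predict(value):
    """Apply the learned rule to a new humidity reading."""
    return "rain" if value >= threshold else "dry"
```

A full decision tree repeats this search on each side of the split, using other features as well, until the groups are pure enough.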

Regression:- A pioneering tool of the polymath Francis Galton, regression, put simply, is a numerical machine learning technique that connects the output to the inputs through equations.

Unlike the simple straight-line equations you might find in a student's notebook, real-life regression equations are rarely so simple or limited to first order. Just imagine correlating the price movements of a stock over several years with a number of factors and building an equation out of it. It's tougher than it seems, and yet regression can be used to develop equations for realistic cases such as predicting consumer engagement with a product from online interactions.
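
For contrast, here is the notebook-style case: fitting a straight line to one input by ordinary least squares. The hours and marks are invented numbers and `fit_line` is a hypothetical helper; real problems involve many inputs and usually a statistics package.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / spread
    intercept = mean_y - slope * mean_x
    return slope, intercept

hours = [1, 2, 3, 4, 5]          # hypothetical hours spent studying
marks = [52, 54, 56, 58, 60]     # hypothetical test marks
slope, intercept = fit_line(hours, marks)
predicted_mark = slope * 6 + intercept   # estimate for a student who studies 6 hours
```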

Cross-Validation:- Cross-validation refers to dividing the data into a number of parts, or folds, each of which takes a turn as the testing set while the rest are used for training. You might be tempted to think, “Why not just test on the training data alone and get 100% accuracy?”

A model evaluated with a very high training-to-testing ratio will usually report good accuracy, since there is only a small testing set to compare against, but it may ultimately fail when applied to new data in the future.
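
The fold bookkeeping can be sketched in a few lines. The function name `k_fold_splits` and the choice of ten rows in five folds are assumptions for illustration; each row ends up in the testing set exactly once.

```python
def k_fold_splits(n_rows, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_rows))
    fold_size, remainder = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra row so every row is used.
        end = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test
        start = end

splits = list(k_fold_splits(n_rows=10, k=5))
```

A model would be trained and scored once per pair in `splits`, and the scores averaged to give a steadier estimate than a single split.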

What makes a good machine learning model

Most machine learning experts would tell you that it’s not just the accuracy but rather the program’s ability to deal with new instances of the dataset in the future that makes the model useful. This adaptability is what gives machine learning algorithms their predictive value; without it, they would produce erroneous results on new data.

That being said, a sensible training-to-testing ratio is preferred, with most simulations using ten-fold cross-validation. Another aspect is the issue of bias and variance. Bias represents how far the predicted values are, on average, from the target values, and variance relates to how much the predicted values differ among themselves.

Any good machine learning program will try to minimize both and, in most cases, settle on a trade-off between the two in order to correctly identify each data instance.
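
The two quantities can be made concrete with a handful of numbers. In the sketch below, the target and the four predictions are invented; they stand for the outputs of models trained on different samples of the same data and asked about the same input.

```python
from statistics import mean, pvariance

target = 10.0                            # hypothetical true value
predictions = [9.0, 11.0, 12.0, 12.0]    # hypothetical outputs of four trained models

# Bias: how far the average prediction sits from the target.
bias = mean(predictions) - target
# Variance: how much the predictions scatter around their own average.
variance = pvariance(predictions)
```

A model that always answered 11.0 would have the same bias but zero variance, while predictions scattered evenly around 10.0 would have zero bias but still carry variance.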

Tracking machine learning in real life

Apart from the more lucrative uses of machine learning in self-driving cars and personalized assistants, which we'll discuss in the next segment, let's stroll past some simple cases right from the history of data:-

Google’s Flu Predictor:- Having collected search data over long periods of time, Google published findings on the dynamics of flu-related searches all over the world. The algorithm was supported by a series of support vector machines (SVMs) that measured search indexes on a number of parameters.

Based on past data, the company was able to distinguish patterns throughout the year and concentrate on the regions that reported the most such searches. This enabled the company to build a theoretical model that predicted the volume of flu searches in the coming months and could thus alert health officials to stockpile medicines in the regions with the highest number of searches.

Ohio University’s CHD Case:- Here's a fun little machine learning exercise that uses a well-known technique called logistic regression. Analysts from Ohio University had spent many years studying the incidence of coronary heart disease (CHD) in patients classified by age group, gender, race, lifestyle and dietary habits.

Logistic regression is a machine learning technique for cases where the output variable can take one of two values. In this case, an output of 1 meant that the patient would be diagnosed with CHD, and 0 meant that the patient wouldn't. The tricky part was analyzing and confirming whether the statistical algorithm would hold up for future patients, as the research took place when computing technology wasn't so advanced.
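
Here is a minimal sketch of how a fitted logistic regression turns an input into a 1 or a 0. The intercept and age coefficient are invented for illustration and are not taken from the study; a real model would estimate them from the patient data and use several predictors.

```python
import math

def predict_chd_probability(age, intercept=-5.0, age_coefficient=0.1):
    """Logistic function of a linear score; the coefficients are hypothetical."""
    z = intercept + age_coefficient * age
    return 1.0 / (1.0 + math.exp(-z))

def classify(age, cutoff=0.5):
    """Return 1 (CHD predicted) when the probability reaches the cutoff, else 0."""
    return 1 if predict_chd_probability(age) >= cutoff else 0

younger = classify(30)   # linear score is -2, probability about 0.12
older = classify(70)     # linear score is +2, probability about 0.88
```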

Thanks to the potential of SAS, a predictor model with a reduced error rate could be created, helping analysts understand what led to the disease. Want to know the best part? The data set, learning material and software code are all available for free on the University's online module.

Online Search Engines:- All content online is distributed among an endless multitude of websites. So how is it that you still get exactly what you searched for in such a short time?

The answer lies with the software. Web search engines use a ‘back-linking’ algorithm that latches onto the meta-tags, URL indexes and NAP indexes of websites (fun fact: Google’s original name was BackRub). These backlinks work like tiny creepers that spread through the internet, searching for unique tags and delivering the content. A majority of these use advanced classification and decision tree systems. Web scraping software that stores user data and cookies works in a similar fashion, albeit with a very different purpose. You can use the Twitter or Facebook API and watch how software collects data about users in a matter of seconds. Think of it as a form of data farming.

Netflix’s Recommendation Engine:- Part of the reason Netflix became a truly unique viewing experience is that the streaming service is designed to keep viewers glued to their screens as long as possible. The idea here is to recommend shows and movies that closely match the content in the user’s history.

Netflix even released its user data for open competitions aimed at improving how its recommendations work. Teams from all over the world pitched in to build a recommendation engine that works behind the scenes while viewers watch.

Many such engines typically use augmented classification algorithms like C4.5, PART, decision stumps, decision tables, and multilayer perceptrons.
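
The idea of matching titles to a viewer's history can be sketched with a simple similarity score. This is not how Netflix's engine works internally; the titles, the genre scores and the `recommend` helper are all hypothetical, and cosine similarity is just one convenient way to compare two profiles.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors by the angle between them (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical genre scores per title: [action, comedy, documentary]
catalogue = {
    "Space Chase": [0.9, 0.1, 0.0],
    "Laugh Factory": [0.1, 0.9, 0.0],
    "Ocean Life": [0.0, 0.1, 0.9],
}

def recommend(history_profile, catalogue):
    """Pick the title whose genre scores best match the viewer's history."""
    return max(catalogue, key=lambda title: cosine_similarity(history_profile, catalogue[title]))

# A viewer whose history is mostly action with a little comedy.
best = recommend([0.8, 0.2, 0.0], catalogue)
```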

Jumping to advanced cases

Most cases you’ve seen till now use simple machine learning tools and algorithms. But as technology marches on, so does the need to handle complex cases in which the machine learns faster and carries on without any sort of developer supervision.

It sounds like creating self-thinking machines that run on AI, but no, it's just machines put to good use. Some advanced concepts in machine learning cover the ideas of Bayesian clustering, multi-instance learning, locally weighted linear regression, model tree induction, and the often discussed multi-layered neural network.

If you've ever laid eyes on a neural network schema, you would have seen that it has several smaller nodes that connect to other nodes, all joining together at the output. Behind these nodes and junctions, more complex techniques are at work, such as gradient descent, stochastic gradient descent, linear chains, Levenberg-Marquardt, and sigmoid functions.

These smaller nodes carry their own weights and biases, given starting values by the analyst and then adjusted during training. Another important parameter to control here is the number of iterations, which can affect the final accuracy of the model while also determining how well the model fits future input.
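
To show how the number of iterations comes into play, here is a minimal sketch of gradient descent adjusting one weight and one bias. The data, the learning rate and the iteration count are assumptions chosen for illustration; a real network has many such parameters but updates them on the same principle.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = weight * x + bias by repeatedly stepping against the error gradient."""
    weight, bias = 0.0, 0.0          # starting values chosen by the analyst
    n = len(xs)
    for _ in range(iterations):
        errors = [(weight * x + bias) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to weight and bias.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]            # invented data that follows y = 2x + 1
weight, bias = gradient_descent(xs, ys)
```

With too few iterations the weight and bias stop short of the best fit; on noisy real-world data, far too many can make the model cling to quirks of the training set.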

You might be wondering where these complex machine learning tools are used, and the answer is: pretty much everywhere.

Natural Language Processing (NLP):- Natural Language Processing is a growing part of the machine learning industry that has seeped into AI, marketing and contextual analysis.

All over the internet, you'll find corpus files storing gigabytes of text and voice data from average consumers. NLP's best use comes as a stand-in customer service agent that quickly directs customers to the information they're looking for.

More importantly, it's being used as a means to interpret texts from many languages (mostly in legal circles) and convert them into plain language to help people understand them. Microsoft, too, once launched a similar project titled Tay, released as an online bot whose linguistic abilities were meant to improve as more and more people conversed with it.

Security And Malware Detection:- A great self-learning algorithm was produced by the company Deep Instinct in conjunction with Kaspersky Labs, using a high-end support vector machine with SMO (sequential minimal optimization). The result is software that has ‘self-learned’ how to detect fraud and malware in files.

Malware tends to evolve, with only minor changes to its code from iteration to iteration. It thus becomes important to have a security system that evolves alongside it, checking how the cloud is accessed, finding anomalies and predicting security breaches.

Financial Trading:- This is tough ground to work on, but several teams of machine learning enthusiasts have created predictor programs that study price patterns for certain stocks across a bundle of parameters. Many of these programs tend to use the Naïve Bayes classifier enhanced by regression techniques. Some of the more accurate ones end up on the pricing lists of top sellers and distributors.

One must understand, however, that it’s called ‘naïve’ because it runs on the assumption that all input parameters are independent of each other. Nevertheless, such programs are still the go-to tools for traders and hedgers online.
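
The independence assumption is easiest to see in code. In this minimal sketch the trading signals, the labels and the function names are all invented; each feature's probability is simply multiplied in, as if the features had nothing to do with one another.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count how often each class occurs and how often each feature value occurs per class."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)   # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def predict(row, class_counts, feature_counts):
    """Return the class with the highest prior multiplied by per-feature likelihoods."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total               # prior probability of the class
        for i, value in enumerate(row):
            # The naive step: treat every feature as independent and multiply.
            # Add-one smoothing, assuming each feature has two possible values.
            score *= (feature_counts[(label, i)][value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical history: (price trend, trading volume) -> what a trader did
rows = [("up", "high"), ("up", "high"), ("up", "low"),
        ("down", "low"), ("down", "low"), ("down", "high")]
labels = ["buy", "buy", "buy", "sell", "sell", "sell"]
class_counts, feature_counts = train_naive_bayes(rows, labels)
decision = predict(("up", "high"), class_counts, feature_counts)
```

In real markets, trend and volume are rarely independent, which is why the assumption is called naïve, yet the classifier often performs respectably anyway.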

The future and beyond

Smart cars, personalized marketing systems and automated drug delivery channels are just a few of the achievements that machine learning has gifted to the world. As we teach machines to ‘learn’, they teach us how to solve some of the more pressing issues of the modern age.