This year, the UCL Data Science Society aimed to create a complete Data Science curriculum with the purpose of helping other students on their Data Science journey. To this end we created a series of workshops, building on the work of previous years, to cover three main areas of any Data Scientist's journey:
  • Introduction to Python: A series of four workshops covering the basic notation and structures used in Python to be able to understand the coding used in later workshops.
  • A Data Scientist's Toolkit: A series of five workshops covering three key libraries in any Data Scientist's toolkit (NumPy, Matplotlib, and Pandas), alongside the key tools of Git, GitHub, and SQL.
  • Data Science with Python: A series of nine workshops covering the four main categories of Machine Learning models: Regression, Classification, Clustering, and Dimensionality Reduction through several of the most commonly used algorithms applied on different datasets and examples.

The purpose of this is to cover all of the basics that any Data Scientist would need on their journey without going into too much detail too quickly. Below you will find a description of all of the tools and methods that were presented in each workshop that fall under each of these headings, including their use, advantages, and disadvantages that you will need to understand on your Data Science journey. This includes links to all the Medium articles that were created throughout the year to cover an overview of each topic, each with further links to the full workshop and problem sheet.

We hope this is able to help you on your Data Science journey in the future!

Introduction to Python

For any Data Science beginner, one of the first questions you need to answer is which language to choose. While there are a few options out there, including Python and R, we start off with Python due to its applicability and usability beyond Data Science and the broad range of libraries that can be used to support any Data Science workflow. We cover the fundamentals that anyone would need to continue a career in Data Science and beyond in Python by introducing concepts such as Python fundamentals, sequences, logic, and object-oriented programming. This lays the foundation for understanding what the code does in later workshops and for finding solutions to coding challenges you may come across.

Python fundamentals

The first task of anyone learning Python is setting up your environment and then learning what Python code represents. In this workshop we provide an introduction of how to set up your programming environment through Anaconda, talk through what is a Jupyter notebook and cover the basics of variables, data types, and basic operations in Python. This will help you understand how to read Python code and how to start interacting with your own code as well.
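As a small illustration of the variables, data types, and basic operations mentioned above, here is a minimal sketch. The variable names and values are made up for this example and do not come from the workshop itself.

```python
# Variables of the four most common basic types (values are illustrative).
workshop_title = "Python fundamentals"   # str
attendees = 40                           # int
average_score = 7.5                      # float
is_beginner_friendly = True              # bool

# Basic arithmetic operations.
total_points = attendees * average_score   # multiplication gives a float
groups_of_six = attendees // 6              # floor division
left_over = attendees % 6                   # remainder

# Strings can be joined with +.
label = workshop_title + " workshop"

print(type(attendees).__name__, type(average_score).__name__)
print(total_points, groups_of_six, left_over, label)
```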

Python sequences

Python has various built-in data structures which can be used to store multiple data points rather than creating many separate variables within your environment. It becomes important early on to understand what each of these structures can and cannot do so that you know how to store your data in the future. To this end, we cover the main functionality of lists, tuples, sets and dictionaries, which are the main built-in data structures that you will come across in your data science journey. Strictly speaking, only lists and tuples are sequences; sets are unordered collections of unique items and dictionaries are key-value mappings.
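The differences between these four structures are easiest to see side by side. The following is a minimal sketch with made-up values, not an excerpt from the workshop.

```python
# A list is ordered and mutable: items can be added or changed.
scores = [72, 85, 91]
scores.append(64)

# A tuple is ordered but immutable: useful for fixed records.
point = (51.5, -0.13)

# A set keeps only unique items, so the duplicate "stats" is dropped.
modules = {"stats", "python", "stats"}

# A dictionary maps keys to values.
student = {"name": "Ada", "year": 2}
student["course"] = "Data Science"

print(scores[0], point[1], len(modules), student["name"])
```

Trying to assign to `point[0]` would raise a `TypeError`, which is one quick way to check that tuples really are immutable.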

Python Logic

When it comes to building more advanced programs, understanding how logic works in Python becomes key. This includes creating code that runs when a given condition is met or running alternative code if it is not, performing repeated actions, and defining pieces of code that can be reused throughout your program. To this end, we introduce conditions, conditional statements (if, elif, and else), loops (for and while), and functions in this workshop, where you can see how they work separately and also together to make more complex code.
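To show how these pieces fit together, here is a short sketch that combines a function, a conditional, and both kinds of loop. The function name `classify_mark` and the grade thresholds are assumptions chosen purely for illustration.

```python
def classify_mark(mark):
    """Return a grade band for a percentage mark (thresholds are illustrative)."""
    if mark >= 70:
        return "first"
    elif mark >= 50:
        return "pass"
    else:
        return "fail"


# A for loop repeats an action for every item in a collection.
marks = [82, 55, 31]
bands = []
for mark in marks:
    bands.append(classify_mark(mark))

# A while loop repeats until its condition becomes False.
countdown = 3
while countdown > 0:
    countdown -= 1

print(bands, countdown)
```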

Python Object Oriented Programming

While most Data Science workflows tend to use Procedural Programming when working within Jupyter Notebooks, it is useful to understand the benefits and use cases of Object Oriented Programming. This is a programming paradigm that structures code so that both characteristics and behaviours of data can be bundled together into a single structure and often forms the basis of the majority of libraries that you will encounter in your programming journey. This means that understanding how this form of code is structured is important for being able to interact with many of the libraries that you will be introduced to on your Data Science journey.
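As a rough sketch of what bundling characteristics and behaviours together looks like, the class below stores some data (attributes) alongside a calculation on that data (a method). The `Dataset` class is a hypothetical example, not something defined in the workshop or in any library.

```python
class Dataset:
    """A hypothetical container that bundles data with behaviour."""

    def __init__(self, name, values):
        # Characteristics (attributes) of the object.
        self.name = name
        self.values = list(values)

    def mean(self):
        # Behaviour (method) that works on the object's own data.
        return sum(self.values) / len(self.values)


heights = Dataset("heights_cm", [150, 160, 170])
print(heights.name, heights.mean())
```

Library objects you will meet later follow the same pattern: you create an object, then call its methods and read its attributes.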

Toolkits for Data Scientists

Once you’ve covered the basics of Python so that you can create basic programs, it then becomes important to learn some of the tools that you will be using day in and day out. These tools are libraries and software that have already been created by others so that you don’t have to reinvent the wheel, and they will make your code much easier to read and understand. Three of the main libraries that you will often encounter as a Data Scientist are NumPy, Pandas and Matplotlib, and you will often use the tools of Git, GitHub and SQL as well in your data science journey.

Numpy

NumPy focuses on mathematical functionality and is a fundamental library that underpins many methods and functions in other Python packages. This makes it a foundational package that is very useful to understand. To this end, we cover the basics of NumPy arrays, mathematical operations within the package, and how to interact with the random number functionality.
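A minimal sketch of those three topics (arrays, mathematical operations, and random numbers) might look like the following. The values and the seed are arbitrary choices for illustration.

```python
import numpy as np

# Arrays hold numeric data and support element-wise maths.
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

total = a + b    # element-wise addition
scaled = a * 2   # every element is multiplied by 2
dot = a @ b      # dot product: 1*10 + 2*20 + 3*30

# Random numbers: a seeded generator makes the draws reproducible.
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=5)

print(total, scaled, dot)
print(samples.mean())
```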

Pandas

The second fundamental package that you will often encounter is Pandas. This is a package used widely for data science and analysis tasks that builds on top of NumPy. It is one of the most popular data-wrangling packages for data science workflows and integrates well with many other libraries used during the Data Science workflow, such as scikit-learn. In this workshop we cover how to create a Pandas Series and a Pandas DataFrame, how to access information from these data structures, and what operations we can perform once the data is in them.
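The sketch below walks through the same steps on a tiny made-up table: creating a Series and a DataFrame, accessing data, and running an operation on it. The column names and values are invented for this example.

```python
import pandas as pd

# A Series is a single labelled column of values.
ages = pd.Series([21, 34, 29], name="age")

# A DataFrame is a table made of several columns.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cal"],
        "age": [21, 34, 29],
        "city": ["London", "Leeds", "London"],
    }
)

# Accessing information: a column, a row by position, and a filtered subset.
names = df["name"]
second_row = df.iloc[1]
londoners = df[df["city"] == "London"]

# An operation on the structure: mean age per city.
mean_age_by_city = df.groupby("city")["age"].mean()

print(londoners)
print(mean_age_by_city)
```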

Matplotlib

Being able to visualise data is a key skill for any Data Scientist, as it lets you communicate your results and findings to both technical and non-technical audiences. While there are many different packages that you can use for this in Python, one of the main ones you will encounter, and a good one to start with, is Matplotlib. In this workshop we cover how to construct a basic plot, how to plot multiple sets of information on the same graph, and how to plot information across multiple axes.
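As a rough sketch of those three steps, the example below draws two lines on one set of axes and a separate scatter plot on a second set of axes within the same figure. The data is made up for illustration.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
squares = [v ** 2 for v in x]
cubes = [v ** 3 for v in x]

# One figure with two side-by-side axes.
fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# Multiple sets of information on the same axes.
left.plot(x, squares, label="x squared")
left.plot(x, cubes, label="x cubed")
left.set_xlabel("x")
left.set_ylabel("value")
left.legend()

# A different view of the data on the second axes.
right.scatter(x, squares)
right.set_title("Squares only")

fig.tight_layout()
# In a Jupyter notebook the figure is displayed automatically;
# in a script, call plt.show() or fig.savefig("plot.png").
```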

Git and GitHub

Beyond the Python libraries that form part of a Data Scientist's toolkit, there are many other pieces of software and tools that are useful in a Data Science workflow. One of the main tools that any Data Scientist should be familiar with is Git, together with GitHub, as a means of version control. This ensures that you are versioning your work in a controlled manner rather than naming files draft1, draft2, draft3 and so on. A Git repository can then be linked to GitHub so that you can store these versions somewhere other than your own machine, which gives you and your team access to them from anywhere in the world with an internet connection. In this workshop we cover the basics of creating a local Git repository, committing changes to that repository, and then linking it to GitHub.

SQL

The final tool in our Data Scientist's toolkit is SQL. SQL stands for Structured Query Language and is one of the most widely used languages for working with databases within relational database management systems. It is used for performing various operations such as selecting data, querying data, grouping data and extracting summary measures from data outside of a Python environment. Its main benefit is in working with large stores of data, especially when these are held on a centrally managed system. In this workshop we cover setting up an SQL instance on your own machine and then using it to manipulate a dataset, including selecting data, searching on conditions, computing summary statistics and grouping data.
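To give a feel for these operations without installing a database server, the sketch below runs SQL through Python's built-in `sqlite3` module against an in-memory SQLite database. This is an assumption made for convenience: the workshop may use a different SQL system, and the `sales` table and its values are invented for the example.

```python
import sqlite3

# An in-memory SQLite database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 50.0)],
)

# Selecting data and searching on a condition.
large_sales = conn.execute(
    "SELECT region, amount FROM sales WHERE amount > 60 ORDER BY amount DESC"
).fetchall()

# Grouping data and extracting a summary statistic.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

conn.close()
print(large_sales)
print(totals)
```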

Data Science with Python

Once you know and understand how to use Python and its key libraries, alongside other key software in a Data Science workflow, you can then move onto learning and understanding each of the Machine Learning algorithms that can be used. To this end, there are two main distinctions between machine learning tasks which can then be split into four overall machine learning groups.

The first split is between supervised and unsupervised machine learning tasks. The first of these means that we have a defined target that we want to work towards, such as detecting diabetes or cancer, modeling house prices or modeling the position of NBA players. This is often done using regression or classification machine learning methods which aim to get as close as possible to the defined target.

The second means that we don’t have a clearly defined target but we still want an outcome, such as grouping consumers based on their shopping habits or identifying behaviors within a set of outcomes. These tasks are often done with clustering or dimensionality reduction methods, which aim to identify groups of similar data points, or to reduce the number of dimensions in order to visualise the data or feed it into another machine learning algorithm.

Regression

The first of these groups is regression in machine learning. This is a method for modeling the dependence or relationship between two or more quantities such as house prices and house characteristics or energy efficiency on building characteristics. The aim of this methodology is to find the strength and direction of these relationships to either model unseen results based on data that you have or to understand the relationship between two variables.

Linear regression

The first method that you come across under this umbrella is linear regression, which models the relationship between the variables in a linear way. The aim of this method is to minimise the distance between the actual values and the predicted values using least squares, and it allows you to extract parameters that show the strength and direction of the relationship between each input variable and the target variable. In this workshop, we cover how to implement a basic regression through scikit-learn and then how to implement and interpret a multiple linear regression equation.
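A minimal sketch of a multiple linear regression with scikit-learn is shown below. The data is synthetic and built without noise from a known equation, so the fitted parameters can be checked by eye. It is not the dataset used in the workshop.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated exactly from y = 3*x1 + 2*x2 + 5 (no noise).
X = np.array([[1, 0], [2, 1], [3, 0], [4, 3], [5, 2]], dtype=float)
y = 3 * X[:, 0] + 2 * X[:, 1] + 5

# Fit by ordinary least squares.
model = LinearRegression().fit(X, y)

# The coefficients give the direction and size of each relationship.
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("R squared:", model.score(X, y))
```

With real, noisy data the recovered coefficients would only approximate the underlying relationship, and you would evaluate the model on held-out data rather than on the training set.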

Logistic regression

A second common method under the regression umbrella is logistic regression. While linear regression is often applied to predicting continuous variables (those that can take an infinite number of values within a given range), logistic regression can be applied to predicting categorical outputs (those that take one of a finite number of categories). The main aim of this method is to predict which category a data point belongs to, rather than an exact value, such as whether or not a patient has diabetes. It therefore falls under the regression umbrella but also acts as a classification method. In the workshop we cover how to implement and evaluate a basic logistic regression to model the incidence of diabetes.
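The workflow of fitting and evaluating a logistic regression can be sketched as follows. Note the assumption: this uses a synthetic binary dataset from `make_classification` as a stand-in for the diabetes data used in the workshop, and the split sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data (a stand-in for a real medical dataset).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Hold out a quarter of the data to evaluate on unseen points.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted class labels and the probability of each class.
predictions = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)

accuracy = accuracy_score(y_test, predictions)
print("test accuracy:", accuracy)
```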

Advanced regression

Beyond linear and logistic regression there are a variety of other regression methods that are often useful to understand. This can include Lasso and Ridge Regression that build on basic linear regression methodologies by introducing regularisation to try and avoid the problem of overfitting, or machine learning methods of decision trees and random forests that are able to model the non-linear relations between variables. These have both advantages and disadvantages in that these methods are often better able to model the relationships between variables but this can come at the cost of increased computing resources required or reduced interpretability.
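To make the effect of regularisation concrete, the sketch below fits ordinary least squares, Ridge and Lasso to the same synthetic data, in which only the first feature truly matters. The data, the seed and the `alpha` values are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: only the first of five features drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 4.0 * X[:, 0] + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty can set some to exactly zero

print("OLS:  ", np.round(ols.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))
print("Lasso:", np.round(lasso.coef_, 3))
```

Comparing the printed coefficients shows how the penalties pull the estimates towards zero. In practice `alpha` would be chosen by cross-validation rather than fixed by hand.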

Classification

After regression, another common supervised machine learning task is classification. The aim of this, rather than modeling a specific value, is to model which group or class a data point belongs to, based on a dataset where the correct classes are already known. This can include modeling whether a patient has diabetes, whether a patient has cancer, whether a user will resubscribe or not, or what position an NBA player plays based on their statistics. There are many methods that fall under this umbrella, many of which can also be used for regression, but three common methods are Decision Tree Classification, Random Forest Classification, and Support Vector Machine Classification.

Decision Tree Classification

A decision tree follows a tree-like structure (hence the name), similar to the decision trees that you probably made in primary or high school at some point. This method performs classification by following a series of decisions to reach a prediction as to which class a data point belongs to. Specifically, it works by repeatedly splitting the dataset according to different attributes, choosing each split to optimise a given selection criterion such as Gini impurity or entropy. In this workshop, we cover how to implement a basic decision tree, how to visualise the resulting tree, and then how to evaluate the performance of the model.
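A minimal sketch of this workflow with scikit-learn is shown below. As an assumption for illustration, it uses the built-in iris dataset rather than the workshop's data, limits the tree to a depth of three, and prints the learned rules as text instead of drawing them.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Splits are chosen to reduce Gini impurity; max_depth keeps the tree small.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# A text view of the decisions the tree has learned.
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)

# Accuracy on data the tree has not seen.
accuracy = tree.score(X_test, y_test)
print("test accuracy:", accuracy)
```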

Random Forest Classification

A Random Forest Classifier is an ensemble method that builds on the Decision Tree classifier, but instead of creating a single decision tree it creates many. Each tree is trained on a random sample of the data and considers a random subset of the features, which reduces the risk of overfitting and often produces better predictions as a result. This follows the logic that the performance of the crowd is better than the performance of an individual. If you are able to implement a Decision Tree, it is often better to implement this method instead, although this can come at the cost of increased computational resources. This workshop covers the basic implementation of a random forest along with how to evaluate the results.
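The idea can be sketched in a few lines with scikit-learn. The iris dataset, the number of trees and the five-fold cross-validation are assumptions chosen for this example rather than the workshop's settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 100 trees, each trained on a bootstrap sample of the rows and
# considering a random subset of features at every split.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
)

# Cross-validation gives a more reliable estimate than a single split.
scores = cross_val_score(forest, X, y, cv=5)
print("accuracy per fold:", scores)

# After fitting, the individual trees and feature importances are available.
forest.fit(X, y)
print("number of trees:", len(forest.estimators_))
print("feature importances:", forest.feature_importances_)
```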

Support Vector Machine Classification

Beyond Decision Trees and Random Forests there are many other classification methods that can be used, and the Support Vector Machine is one that is often encountered. It works towards classification by trying to find a boundary in the data that separates the two or more classes that we are trying to distinguish. The model can then be used for prediction by finding which side of the boundary a point lies on and thus which group the point belongs to. The usefulness of this algorithm is that the boundary can take many different forms, whether that is linear, non-linear, or defined by the user. In this workshop we cover the basic implementation of the model along with how to visualise the result.
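To illustrate how the form of the boundary can be changed, the sketch below fits two support vector classifiers to the same data, one with a linear kernel and one with a non-linear (RBF) kernel. The two-moons dataset is a synthetic assumption chosen because its classes cannot be separated by a straight line.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Same algorithm, two different shapes of decision boundary.
linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

linear_acc = linear_svm.score(X, y)
rbf_acc = rbf_svm.score(X, y)
print("linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)

# Predicting which side of the boundary new points fall on.
new_points = [[0.0, 0.5], [1.0, -0.5]]
print(rbf_svm.predict(new_points))
```

The scores here are measured on the training data purely to compare the two boundaries. A held-out test set would be needed for a fair evaluation.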

Clustering

Clustering is part of the unsupervised branch of machine learning, meaning that we don’t have a defined target to work towards like we would with regression or classification tasks. The objective of these algorithms is to identify distinct groups of objects that share similar characteristics, such as shoppers, films, or stores. This enables decision makers to focus on these groups and see how they may be able to serve them better, for example by improving customer retention or incentivising customers to spend more money. To this end, two common clustering algorithms are k-means clustering and hierarchical clustering.

K-means clustering

K-means clustering is one of the most commonly used clustering algorithms. It works by first defining the target number of groups to create, which the algorithm then seeks to form by assigning each point to the group whose centre is closest. In this workshop, we cover how to implement a k-means clustering algorithm, how to choose the optimal number of clusters, and then how to evaluate the results.
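A rough sketch of these steps is shown below on synthetic data made of three well-separated blobs. Comparing the inertia (within-cluster sum of squares) across values of k, often called the elbow method, is one common way to choose the number of clusters. It is used here as an assumption, since the workshop may use a different approach.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs of 30 points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

# Fit k-means for several values of k and record the inertia.
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
print(inertias)

# Fit the final model with the chosen number of clusters.
final_model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = final_model.fit_predict(X)
print(final_model.cluster_centers_)
```

The inertia drops sharply up to the true number of groups and flattens afterwards, and that bend is the "elbow".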

Hierarchical clustering

Hierarchical clustering works by creating these groups in a hierarchy based on a distance metric that is used to separate different groups. The uniqueness of this algorithm is that you can identify a hierarchy of how different groups may fit into each other or separate from each other based on the distance that is chosen. This means that we don’t need to know the number of clusters before performing the algorithm, although this can come at increased time complexity. In this workshop, we cover how to implement and evaluate this algorithm.
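As a minimal sketch, the example below builds the hierarchy with SciPy and then cuts it to obtain flat groups. The six points, the Ward linkage and the choice to cut into two clusters are all assumptions for illustration. The workshop may use a different library or linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Six points forming two obvious groups.
X = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)

# Build the full hierarchy: each row of Z records one merge
# (the two clusters joined, the distance, and the new cluster size).
Z = linkage(X, method="ward")
print(Z)

# Cut the hierarchy afterwards to get a chosen number of flat clusters.
two_groups = fcluster(Z, t=2, criterion="maxclust")
print(two_groups)
```

Because the whole tree is stored in `Z`, the number of clusters can be decided after the algorithm has run, for example by inspecting a dendrogram drawn with `scipy.cluster.hierarchy.dendrogram`.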

Dimensionality reduction

Finally, we have dimensionality reduction, which also comes (in most cases) under the heading of unsupervised machine learning algorithms. The main aim of this method is either to reduce the number of features in a dataset, so as to reduce the resources required for the model, or to aid in visualising the data before any analysis is performed. This is done by reducing the number of attributes or variables in the dataset while attempting to keep as much of the variation in the original dataset as possible. It is a preprocessing step, meaning that it is mostly performed before we create or train any model. There are two main families of methods, based on linear algebra or on manifold learning; in the workshop we introduce Principal Component Analysis from the former and t-distributed Stochastic Neighbor Embedding from the latter.
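A minimal sketch of Principal Component Analysis with scikit-learn is shown below. The iris dataset and the choice of two components are assumptions for illustration, and standardising the features first is a common (though not mandatory) step.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)   # 150 rows, 4 features

# Put every feature on the same scale so none dominates the components.
X_scaled = StandardScaler().fit_transform(X)

# Project the four features down to two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Share of the original variation kept by the two components.
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape)
print("variance retained:", explained)
```

The reduced `X_2d` can be plotted directly or passed to another model. `sklearn.manifold.TSNE` follows the same `fit_transform` pattern for the manifold learning approach.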

Conclusion

This curriculum is aimed at giving individuals a start in Data Science, with each of the libraries, software, and techniques presented in a way that covers the basics of how they work and how to implement them without going into too much detail. This should give any new data scientist a platform from which they can explore the topics they are more interested in, whether that is further regression, classification, clustering, and dimensionality reduction methods or a deeper dive into each of the models we have already presented. We wish you the best of luck on your Data Science journey!