Data Mining

From APL

To see an example of taking a dataset and applying machine learning techniques go to


Data science is a buzzword that means different things to different people. Broadly, it can mean any use of data and data analysis in a scientific process. It is also a growing field at the intersection of machine learning, statistics, information science, and business.

History as I know it

Machine learning is a subfield of artificial intelligence. It focuses on building algorithms that predict or classify outputs from a data set. The field draws on discrete math and graph theory to provide a rigorous understanding of, and process for, extracting knowledge. Machine learning itself is part of data mining, which is finding trends and patterns in large sets of data.


As the amount of data generated by the internet has grown, machine learning has exploded. Many applications we use every day rely on these algorithms: face recognition, voice recognition, word prediction, spam detection, cancer detection, genetics, handwritten digit recognition, and text analytics.

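There are three types of learning: supervised, unsupervised, and reinforcement. Supervised learning tasks are regression and classification, where the model learns from labeled examples. Reinforcement learning is for sequential decision problems, such as trading in the stock market, where an agent learns from rewards. Unsupervised learning looks for structure in unlabeled data (clustering, for example) and is a more advanced topic.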

Overview of the process

1. Get the data
2. Explore the data
3. Create model/score model
4. Tweak things

Collecting data: There are many interesting data sets available for exploration: the UCI Machine Learning Repository, Kaggle, reddit's r/datasets, CKAN (government documents), the Library of Congress, and website APIs.

Clean the data: Decide what to do with missing values and explore how complete the data is. Create graphs and summary statistics relevant to each variable. Find the type of each variable and convert categorical variables to numeric.
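The cleaning steps can be sketched with pandas. This is only a minimal illustration: the table, its column names, and the choice to fill gaps with the median are assumptions made up for the example, not a recommendation for every data set.

```python
import numpy as np
import pandas as pd

# A tiny made-up table with gaps (column names are hypothetical).
df = pd.DataFrame({
    "age": [34, np.nan, 52, 41],
    "income": [48000.0, 61000.0, np.nan, 39000.0],
    "city": ["Austin", "Boston", "Austin", None],
})

# How complete is the data? Count missing entries per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# What type is each variable, and what do the numeric ones look like?
print(df.dtypes)
print(df.describe())

# One possible policy: fill numeric gaps with the column median
# and give missing categories their own label.
clean = df.copy()
for col in ["age", "income"]:
    clean[col] = clean[col].fillna(clean[col].median())
clean["city"] = clean["city"].fillna("unknown")

# Convert the categorical column to numeric indicator columns.
encoded = pd.get_dummies(clean, columns=["city"])
print(encoded.columns.tolist())
```

Dropping incomplete rows with `df.dropna()` is the other common option; which one is appropriate depends on how much data is missing and why.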

The model: There are many models out there, and they work with different degrees of success depending on the problem. The main families are linear models, clustering, networks (such as neural networks), and trees.
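As a small illustration of the simplest family, here is a sketch of a one-variable linear model fit by ordinary least squares, using only the standard library. The function name `fit_line` and the data are hypothetical; in practice a library such as scikit-learn would normally do this for you.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up data that lies exactly on the line y = 2x + 1.
hours = [1, 2, 3, 4, 5]
scores = [3, 5, 7, 9, 11]

slope, intercept = fit_line(hours, scores)
prediction = slope * 6 + intercept  # predict for an unseen input
print(slope, intercept, prediction)  # prints 2.0 1.0 13.0
```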

Evaluation: Hold out a separate set of data for validation (or use cross-validation) so that you can detect overfitting of the model to the training data. There are many metrics, such as accuracy, recall, and F1. It's best to choose one and stick with it.
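To make the metrics concrete, here is a rough sketch in plain Python of holding out a validation set and scoring predictions. The helper names `train_validation_split` and `classification_metrics` and the label lists are made up for illustration; scikit-learn ships tested equivalents that are the better choice in real work.

```python
import random

def train_validation_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for validation."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

train_rows, validation_rows = train_validation_split(range(8))

# Made-up true labels and model predictions on a validation set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy counts every correct prediction, while recall only asks how many of the true positives were found, so the two can disagree sharply when one class is rare.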

What to do when struggling:

R and Python both have thorough documentation, so you can find information about any function. In IPython, pressing Tab after a dot shows the options you can write next; RStudio does this automatically. Both also have very active communities of users and developers, so information is plentiful.
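Outside IPython, the same information is available from plain Python. The sketch below uses the standard `math` module purely as an example target.

```python
import inspect
import math

# dir() lists the names an object exposes, much like pressing Tab after a dot.
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names[:10])

# inspect.getdoc() returns the docstring, the same text help(math.sqrt) shows.
sqrt_doc = inspect.getdoc(math.sqrt)
print(sqrt_doc)
```

In an interactive session, `help(math.sqrt)` shows the same documentation.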

Python tutorials and blogs

If you've never touched Python, start by doing the first 20 exercises of this: [LPTHW]

Then follow the SciPy primer on this wiki.

To practice Python skills: [Subreddit]

To learn machine learning: This progression of tutorials takes you from a little knowledge of NumPy to a modern neural net with all the bells and whistles. It'll make you feel amazing about what you can do with Python. First tutorial: [NeuralNet] All: [[1]]

intro to working with datasets: