Tuesday, October 15, 2013

Knowledge Discovery and Machine Learning

Knowledge discovery (data mining) cycle: The following steps constitute the entire cycle and are used to solve a data problem.
  • Data cleansing: Remove noise and inconsistent data.
  • Data integration: Combine data from multiple sources.
  • Data selection: Pick the specific subset of the data that is relevant to the problem.
  • Data transformation: Format and convert the data set into a form suitable for mining.
  • Data mining: Apply specific algorithms to the formatted data set to extract patterns or the expected output.
  • Pattern evaluation: Analyze and interpret the results.
  • Knowledge presentation: Use visualization and other techniques to present the findings to the end user.
Sometimes the first four steps (cleansing, integration, selection, and transformation) are grouped together under the term "data pre-processing".
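
The four pre-processing steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two sources, the field names (customer, region, amount), and the cleaning rule (drop missing or non-positive amounts) are all made up for the example.

```python
# Hypothetical records from two made-up sources; field names are assumptions.
store_sales = [
    {"customer": "a1", "region": "urban", "amount": "120.50"},
    {"customer": "a2", "region": "rural", "amount": None},   # missing value
    {"customer": "a3", "region": "urban", "amount": "80.00"},
]
online_sales = [
    {"customer": "b1", "region": "urban", "amount": "200.00"},
    {"customer": "b2", "region": "rural", "amount": "-5"},    # inconsistent value
]


def cleanse(records):
    """Data cleansing: drop records whose amount is missing or not positive."""
    cleaned = []
    for record in records:
        try:
            amount = float(record["amount"])
        except (TypeError, ValueError):
            continue  # missing or unparseable amount
        if amount > 0:
            cleaned.append({**record, "amount": amount})
    return cleaned


def integrate(*sources):
    """Data integration: combine records from several sources into one list."""
    return [record for source in sources for record in source]


def select(records, region):
    """Data selection: keep only the subset relevant to the problem."""
    return [record for record in records if record["region"] == region]


def transform(records):
    """Data transformation: add a min-max scaled amount in the range [0, 1]."""
    amounts = [record["amount"] for record in records]
    low, high = min(amounts), max(amounts)
    span = (high - low) or 1.0  # avoid division by zero when all amounts match
    return [
        {**record, "amount_scaled": (record["amount"] - low) / span}
        for record in records
    ]


combined = integrate(cleanse(store_sales), cleanse(online_sales))
prepared = transform(select(combined, "urban"))
```

After these steps, prepared holds only the valid urban records, each with a scaled amount that a mining algorithm could consume.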

Dual usage of the term data mining: The term "data mining" is used in two senses. The first is the specific step in the knowledge discovery process explained above. The second is more generic and refers to the entire "knowledge discovery from data" process.

Descriptive vs predictive tasks: Descriptive tasks characterize or summarize the properties of the target data set. Predictive tasks, in contrast, try to find hidden patterns in order to make predictions.


  • Class and concept: Consider the Best Buy data set. In its sales data, mobiles, laptops, and cameras are classes, while concepts can be urban buyers and rural buyers. Concepts are therefore more abstract, derived entities.
  • Characterization: We could divide customers into upper, middle, and lower classes based on income and report the average income of the upper class.
  • Discrimination: We would compare the factors/parameters of geographical regions where sales increased by 10% against those of regions where sales decreased by 10%.
  • Frequent pattern mining: Here we use association rule mining to determine frequent patterns in the data.
  • Classification and regression: Here we predict a label (classification) or an exact value (regression) from the independent variables.
  • Clustering analysis: In this kind of task there are no labels. Instead, input records are grouped together based on their attributes.
  • Outlier analysis: Sometimes we want to find outliers and check whether they follow a pattern. In credit card transactions, fraudulent transactions tend to be outliers, which helps in identifying them. Another example is fake profiles/bots on Twitter.
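
As a small illustration of frequent pattern mining, the sketch below counts item pairs by brute force and computes the support and confidence of one association rule. It is not Apriori or FP-growth, just a direct count, and the transactions are invented to echo the retail example.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions; the items are assumptions for illustration only.
transactions = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "camera"},
    {"mobile", "camera"},
    {"laptop", "camera"},
    {"laptop", "mouse"},
]


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs that meet the support threshold.

    Support is the fraction of transactions that contain both items.
    """
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    with_antecedent = [items for items in transactions if antecedent in items]
    if not with_antecedent:
        return 0.0
    hits = sum(consequent in items for items in with_antecedent)
    return hits / len(with_antecedent)


pairs = frequent_pairs(transactions, min_support=0.5)
rule_confidence = confidence(transactions, "laptop", "mouse")
```

With this data, only the pair (laptop, mouse) reaches 50% support, and the rule "laptop implies mouse" holds in three of the four transactions that contain a laptop.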

Machine learning

"Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders." [1]

Roughly, we can group machine learning techniques into four main categories.

Supervised learning: Systems are trained and optimized on a data set with known results.

  • Classification: Using a data set with known outputs, we can train a system (algorithm) to classify emails into "spam" or "not spam" categories. Once the model is trained and optimized, it can be used to categorize emails automatically. In short, we are predicting the category of a given input (email) based on different parameters.
  • Regression: Regression is similar to classification, except that it is used when the output is continuous. It helps us predict the dependent attribute (output) from the independent attributes (inputs). Sales forecasting is one problem that can be solved using regression.
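
The regression case can be sketched with ordinary least squares on a single input. This is a minimal illustration, assuming one independent attribute (month) and invented sales figures that happen to lie exactly on a line; real forecasting data would be noisier and usually need more than one input.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: month number (input) and sales (known output).
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)

# Predict the dependent attribute for an unseen input (month 6).
forecast = slope * 6 + intercept
```

The known outputs play the role of the labels in supervised learning: the model is fitted to them and then applied to an input it has not seen.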
Unsupervised learning: Systems are expected to find structure or patterns in data for which no outputs are known.
  • Clustering: Given a data set, we can use clustering techniques to find patterns and group similar observations. For example, in a large data set of customers with many attributes, clustering can identify different customer segments.
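
One common clustering technique is k-means. The sketch below runs it on a single attribute to keep the code short; the spending values and the two starting centroids are assumptions chosen for illustration, and a real customer data set would have many attributes.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group one-dimensional values around centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(value - centroids[i]),
            )
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up monthly spend per customer; no labels are provided.
spend = [10, 12, 11, 95, 100, 105]
centroids, clusters = kmeans_1d(spend, centroids=[0, 50])
```

No labels are supplied anywhere; the two customer segments (low and high spenders) emerge from the attribute values alone.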
Semi-supervised learning: These techniques fall between supervised and unsupervised learning. The data set under study contains both labelled and unlabelled records.
  • In real-life scenarios labelled data is hard to get, so we often have to work with a very small amount of labelled data together with a large amount of unlabelled data. This is where semi-supervised learning is useful.
  • Because the training data consists of both labelled and unlabelled records, the unlabelled data points help us refine the model built from the labelled data points.
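
One way to let unlabelled points refine a model is self-training, sketched below with a nearest-centroid classifier on one attribute. Self-training is only one of several semi-supervised approaches, and everything specific here (the data, the labels "low" and "high", the margin value) is an assumption for the example. The sketch also assumes at least two classes are present in the labelled seed.

```python
def centroids_of(labelled):
    """Mean feature value per label."""
    sums, counts = {}, {}
    for x, label in labelled:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Label of the centroid nearest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def self_train(labelled, unlabelled, margin=2.0, rounds=5):
    """Grow the labelled set with confidently predicted unlabelled points."""
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        centroids = centroids_of(labelled)
        confident, rest = [], []
        for x in pool:
            distances = sorted(abs(x - c) for c in centroids.values())
            # Confident when the nearest centroid is clearly closer
            # than the runner-up.
            if distances[1] - distances[0] >= margin:
                confident.append((x, predict(centroids, x)))
            else:
                rest.append(x)
        if not confident:
            break  # nothing new to learn from
        labelled.extend(confident)
        pool = rest
    return centroids_of(labelled), pool


# Two labelled records and a larger set of unlabelled ones (all made up).
seed = [(1.0, "low"), (9.0, "high")]
unlabelled = [2.0, 3.0, 8.0, 7.0, 5.5]

model, still_unlabelled = self_train(seed, unlabelled)
```

The centroids start at the two labelled points and shift once the confidently labelled points are added, while the ambiguous point near the middle stays unlabelled.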
Active learning: This lets users play an active role in the model building process. In this approach a user is asked to label a few examples, which can come from the actual unlabelled data set or be synthesized by the learning program. In short, we keep acquiring knowledge from the user (under some constraint, such as a limit on how many times we may ask) in order to build a better model.
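
A minimal sketch of this loop, assuming a one-attribute threshold model and uncertainty sampling as the query strategy (the strategy is a choice made for this example, not something prescribed above). The oracle function stands in for the human user, its rule is invented, and budget is the constraint on how many times we may ask.

```python
def oracle(x):
    """Stand-in for the human user; the rule x >= 6.2 is a made-up assumption."""
    return "high" if x >= 6.2 else "low"


def boundary(labelled):
    """Midpoint between the largest 'low' and the smallest 'high' example.

    Assumes the labelled set contains at least one example of each class.
    """
    lows = [x for x, label in labelled if label == "low"]
    highs = [x for x, label in labelled if label == "high"]
    return (max(lows) + min(highs)) / 2


def active_learn(labelled, pool, ask, budget):
    """Query the user about the most uncertain points, up to budget times."""
    labelled, pool = list(labelled), list(pool)
    queries = []
    for _ in range(budget):
        if not pool:
            break
        current = boundary(labelled)
        # Uncertainty sampling: the point nearest the current boundary
        # is the one the model is least sure about.
        x = min(pool, key=lambda point: abs(point - current))
        pool.remove(x)
        labelled.append((x, ask(x)))
        queries.append(x)
    return boundary(labelled), queries


seed = [(0.0, "low"), (10.0, "high")]
pool = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

learned_boundary, queries = active_learn(seed, pool, ask=oracle, budget=3)
```

With only three questions the boundary moves from 5.0 to 6.5, close to the oracle's hidden rule, because each question targets the region where the model is least certain.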
