Thursday, December 10, 2009

Data Mining

Data mining evolved from a process called Knowledge Discovery in Databases (“KDD”), which is about finding valid, novel, and useful patterns in data. The phrase “knowledge discovery in databases” was coined at a data industry workshop in 1989 to emphasize that knowledge is the end product of a data-driven discovery process.

There are many parts to KDD, but it roughly comprises pre-processing raw data, mining it, and interpreting the results; it embodies the overall process of discovering useful knowledge from data. Data mining is an application of KDD that uses statistical analysis and automated methods to discover patterns in, and extract information from, data; its overriding goal is to discover hidden facts contained in the data. However, the term data mining has also been referred to as knowledge management, knowledge discovery, and “sense-making,” among other labels.

By mining and extracting data, it is possible to prove or disprove existing hypotheses or ideas regarding data or information, while also discovering new or previously unknown information. In particular, unique or valuable relationships between and within data can be identified and used proactively to categorize or anticipate additional data. Through the use of exploratory graphics, in combination with statistics, machine learning tools, and artificial intelligence, critical bits of information can be “mined” from large repositories of data, thereby aiding in decision making.

However, a basic problem faced by users is mapping low-level data, which are typically too voluminous to understand and digest easily, into forms that might be: 1) more compact, such as a short report; 2) more abstract, such as a conceptual or logical model of the process that generated the data; or 3) more useful, such as a predictive tool for estimating the value of future instances.

The traditional method of converting data into knowledge relies on manual analysis and interpretation. That is, a database user such as a financial or data analyst might periodically analyze current trends and changes in the data and then provide a report detailing the analysis to a supervisor or company management. These reports then become the basis for future decision making and planning.

This has been the classical approach which relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and management. However, this form of manual probing of a data set can be slow, expensive, and potentially subjective.

Furthermore, this type of manual data analysis has become impractical as databases have grown in size, both in (1) the number of records or objects in the database and (2) the number of fields or attributes recorded per object. Databases containing a large number of records and attributes have become the norm in industry and science; in a field such as medical diagnostics, for example, the number of attributes can easily exceed 100. Therein lies a raison d’être for data mining.

Data mining is typically applied through software that analyzes relationships and patterns in stored data based on open-ended user queries. The objective is prediction and description. Common sources of data are data marts and data warehouses, and generally, any of the following types of relationships are sought:

Classes:
  Stored data categorized into predetermined groups. For example, a restaurant chain could mine
  customer purchase data to determine when customers visit and what they typically order. This
  information could be used to increase traffic by having daily specials.

Clusters:
  Data items grouped according to logical relationships or consumer preferences. For example, data
  could be mined to identify market segments or consumer affinities.

Regression:
  Functions which model the data with the least error. An example would be estimating the probability
  that a patient will survive given the results of a set of diagnostic tests, or predicting consumer demand
  for a new product as a function of advertising expenditure.

Associations:
  Data mined to identify associations. The classic “beer-diaper” case study is an example of associative
  mining.

Sequential patterns:
  Data mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer
  could predict the likelihood of a backpack being purchased based on a consumer's purchase of
  sleeping bags and hiking shoes.
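To make the “regression” relationship above concrete, here is a small Python sketch that fits a least-squares line to hypothetical (advertising spend, demand) data; the numbers are invented purely for illustration.

```python
# Fitting a least-squares line: "functions which model the data with the
# least error." The toy data below (spend vs. demand) is invented.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

spend  = [10, 20, 30, 40, 50]       # advertising spend (thousands)
demand = [120, 190, 260, 330, 400]  # units sold

slope, intercept = fit_line(spend, demand)
print(slope, intercept)  # this toy data is exactly linear: 7.0 and 50.0
```

The fitted line can then be used predictively, e.g. to estimate demand at a spend level not yet tried.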

The data mining process typically consists of the following major elements:

• Extract, transform, and load transaction data into a data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data with application software.
• Present the data in a useful format, such as a graph or table.
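The five elements above can be walked through in miniature. The following Python sketch substitutes a small in-memory dataset for a real transaction system, and a plain list for the warehouse and multidimensional store, purely to show the shape of the process.

```python
# Toy end-to-end sketch: extract, transform, load, analyze, present.
# The "transaction source" and "warehouse" here are stand-ins.

raw = [                        # extract: rows pulled from a transaction source
    "2009-12-01,store_a,beer,4.00",
    "2009-12-01,store_a,diapers,9.50",
    "2009-12-02,store_b,beer,4.00",
]

warehouse = []                 # load target: stand-in for a warehouse table
for line in raw:               # transform: parse and type-convert each row
    date, store, product, price = line.split(",")
    warehouse.append({"date": date, "store": store,
                      "product": product, "price": float(price)})

# analyze: aggregate revenue per product (a flattened "multidimensional" view)
revenue = {}
for row in warehouse:
    revenue[row["product"]] = revenue.get(row["product"], 0.0) + row["price"]

# present: a simple tabular view
for product, total in sorted(revenue.items()):
    print(f"{product:10s} {total:6.2f}")
```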

In addition, different levels of analysis are available such as:

  • Artificial neural networks: Non-linear predictive models that learn through training and resemble
    biological neural networks in structure.
  • Genetic algorithms: Optimization techniques that use processes such as genetic combination,
    mutation, and natural selection in a design based on the concepts of natural evolution.
  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate
    rules for the classification of a dataset.
  • Nearest neighbor method: A technique that classifies each record in a dataset based on a combination
    of the classes most similar to it in a historical dataset.
  • Rule induction: The extraction of useful if-then rules from data based on statistical significance.
  • Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics
    tools are used to illustrate data relationships.
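As a concrete illustration of one of these levels of analysis, here is a from-scratch sketch of the nearest neighbor method; the records, features, and risk labels are invented for illustration.

```python
# Nearest neighbor method: classify a new record by the class of the
# most similar record in a historical dataset. Data is made up.
import math

def nearest_neighbor(point, history):
    """history is a list of (features, label); return the label of the closest record."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(history, key=lambda rec: dist(point, rec[0]))
    return label

history = [((1.0, 1.0), "low_risk"),
           ((1.2, 0.8), "low_risk"),
           ((8.0, 9.0), "high_risk")]

print(nearest_neighbor((1.1, 0.9), history))  # closest to the low_risk records
```

In practice one usually votes among the k closest records rather than using only the single nearest one, which is more robust to noisy data.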

Data mining is used by companies in the retail, financial, communication, and marketing fields. For example, in retail marketing, the primary applications are database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Other marketing applications are market-basket analysis systems, which find patterns such as “if a customer bought X, he/she is also likely to buy Y and Z.” These patterns can be valuable to retailers.

Another example is a retailer using point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments. Also, a global firm such as Wal-Mart employs data mining to manage inventory and supplier relationships and to identify customer buying patterns and new merchandising opportunities.

Data mining is used in other areas such as science, where one of the primary application areas is astronomy. In the financial field, numerous investment and mutual fund companies use it for quantitative analysis and portfolio management. Data mining is also used by banks and credit card issuers for fraud detection, and the U.S. Treasury, through its Financial Crimes Enforcement Network, has used it to identify financial transactions that might indicate money laundering and terrorist activity.

It is through its use by government agencies that controversy over data mining arises. Previous data mining efforts to stop terrorism under the U.S. government include the Total Information Awareness (TIA) program, Secure Flight (formerly known as the Computer-Assisted Passenger Prescreening System, or CAPPS II), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE), and the Multistate Anti-Terrorism Information Exchange (MATRIX). These programs were discontinued due to controversy over whether they violated the Fourth Amendment to the U.S. Constitution, although many programs that were formed under them continue to be funded by different organizations or under different names.

Two plausible data mining techniques in the context of combating terrorism are "pattern mining" and "subject-based data mining". Pattern mining involves finding existing patterns in data; in this context, patterns often mean association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the products purchased. As a tool to identify terrorist activity, pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity. Subject-based data mining, by contrast, searches for associations between individuals in data.
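The association-rule idea behind pattern mining can be sketched in a few lines of Python. The transactions below are invented, echoing the classic beer/diapers supermarket example mentioned earlier.

```python
# Association rules: measure how often a candidate rule X -> Y holds
# over a set of transactions. The baskets here are made up.

transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "wipes"},
    {"beer", "chips"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent), estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))       # 2 of 4 transactions -> 0.5
print(confidence({"diapers"}, {"beer"}))  # 2 of 3 diaper baskets -> ~0.67
```

Real association-rule miners (e.g. the Apriori algorithm) search over all candidate itemsets rather than scoring one hand-picked rule, but the support and confidence measures are the same.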

Today, data mining applications are available on systems of all sizes, from mainframes to client/server and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million per terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. There are generally two critical technological drivers:

• Size of the database: the more data being processed and maintained, the more powerful the system
  required.
• Query complexity: the more complex the queries and the greater the number of queries being processed,
  the more powerful the system required.

Relational database storage and management technology is adequate for many data mining applications smaller than 50 gigabytes, but this infrastructure needs to be significantly enhanced to support larger applications. Some vendors have added extensive indexing capabilities to improve query performance; others use new hardware architectures such as massively parallel processors (MPP) to achieve order-of-magnitude improvements in query time.

While data mining can be used to uncover patterns in data samples, it is important to understand that non-representative samples of data may produce results that are not indicative of the domain. In addition, data mining can only uncover patterns already present in the data; the target dataset must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. Furthermore, “blind” applications of data-mining methods (aka "data dredging") can be a risky activity, potentially leading to the discovery of meaningless and invalid patterns.

Also, data mining will not find patterns that may be present in the domain, if those patterns are not present in the sample being "mined". There is a tendency for some observers to attribute "magical abilities" to data mining, treating the technique as a sort of crystal ball. Like any other tool, it only functions in conjunction with the appropriate raw material: in this case, indicative and representative data that the user must first collect. Further, the discovery of a particular pattern in a particular set of data does not necessarily mean that pattern is representative of the whole population from which that data was drawn. Thus, an important part of the process is the verification and validation of patterns on other samples of data.
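That verification step can be sketched as re-checking a rule discovered in one sample against a held-out sample before trusting it. The data and the 60% acceptance threshold below are illustrative assumptions.

```python
# Validate a discovered pattern on a second, held-out sample.
# All baskets and the 0.6 threshold are invented for illustration.

def rule_confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the given transactions."""
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / ante if ante else 0.0

discovery_sample = [{"beer", "diapers"}, {"beer", "diapers"}, {"beer"}]
holdout_sample   = [{"beer", "diapers"}, {"diapers"}, {"beer"}, {"beer", "diapers"}]

rule = ({"beer"}, {"diapers"})
found    = rule_confidence(discovery_sample, *rule)   # confidence in sample 1
verified = rule_confidence(holdout_sample, *rule)     # confidence in sample 2

print("pattern holds" if verified >= 0.6 else "pattern did not replicate")
```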

Some people believe that data mining is ethically neutral. Others claim that it is unethical, an invasion of privacy in violation of the Fourth Amendment of the U.S. Constitution. With respect to the latter, data mining requires data preparation which can uncover information or patterns which may compromise confidentiality and privacy obligations.

A common way for this to occur is through data aggregation: data are accrued, possibly from various sources, and put together so that they can be analyzed. This is not data mining per se, but a result of preparing the data for analysis. The threat to an individual's privacy comes into play when the compiled data enable the data miner, or anyone with access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous. Thus, the ways in which data mining can be used raise questions regarding privacy, legality, and ethics.
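A minimal sketch of that aggregation risk, using two invented datasets: records that are anonymous on their own can identify individuals once joined on shared quasi-identifiers (here, zip code and birth year; every record below is fictional).

```python
# Re-identification via aggregation: joining an "anonymous" dataset
# with a public one on shared quasi-identifiers. All records invented.

medical = [  # released without names
    {"zip": "90210", "birth_year": 1960, "diagnosis": "flu"},
    {"zip": "10001", "birth_year": 1975, "diagnosis": "asthma"},
]
voter_roll = [  # public record that includes names
    {"name": "J. Smith", "zip": "90210", "birth_year": 1960},
    {"name": "A. Jones", "zip": "10001", "birth_year": 1975},
]

reidentified = [
    {"name": v["name"], "diagnosis": m["diagnosis"]}
    for m in medical
    for v in voter_roll
    if (m["zip"], m["birth_year"]) == (v["zip"], v["birth_year"])
]
print(reidentified)  # names now attached to diagnoses
```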

In summary, data mining has proven to be a useful tool on the discovery path for knowledge and truth. It provides tangible benefits to our society in areas such as national security and crime prevention, and in economic terms it can enhance the efficient allocation of scarce resources, lowering costs for both producers and consumers of goods and services. The risks and costs to society lie in the manner in which data mining is deployed. This remains a delicate balancing act, as what may be beneficial for society as a whole may infringe on an individual’s right to privacy. One way to maintain this balance is to seek prior, informed consent from those whose private data are being “mined.” To this end, it would be appropriate for an individual to be made aware of the following before data is collected:

• the purpose of the data collection and any data mining projects,
• how the data will be used,
• who will be able to mine the data and use it,
• the security surrounding access to the data,
• how collected data can be updated.

In addition, methods may be employed to modify the data so that it is anonymous and individuals may not be readily identified. However, even de-identified data sets can contain enough information to identify individuals. (Note: this situation occurred a few years ago with AOL.)

Nonetheless and notwithstanding informed consent and privacy concerns, with rapid advances in technology and artificial intelligence, I believe the applications of data mining will continue to increase.

