Monday, December 28, 2009

The Relational Model and Data Normalization

The relational model is based on the concept that any predefined path through a data structure is too restrictive, particularly in light of ad hoc requests for information. This model presents data in a two-dimensional table format, similar to a spreadsheet, and provides the ability to relate records as needed irrespective of how the data records were first stored in a database.

Normalization is a data modeling technique developed by Dr. E.F. Codd at IBM in 1972 for producing a set of relations that possess certain properties. Through his work with relational theory, Codd found that unnormalized data is subject to certain anomalies (insert, delete, and update anomalies) when changes are made to the data.

The normalization process is straightforward. You take any relation (data presented logically in column and row format) and choose a primary key. Then, through a series of steps (the normal forms), the data is reorganized to eliminate these anomalies. Let's consider a sample invoice:

                                                             Sample Company
                                                                   Invoice

Customer No: 12345                                                                         Terms: Net 30
Customer: ABC, Inc.                                                                         Ship Terms: USPS
               2222 Main Street
               Fallbrook, CO 83555                                                         Order Date: 11/30/2009
              (222)222-2222

Product No.          Description             Qty              Unit Price         Extended Amt
-----------            -------------          ------            -----------        -------------
RPS-4490            Springs                    2                   15.00                  $  30.00
STR-3300            Belts                        2                    5.00                      10.00
MLT-1200            Straps                      1                    3.00                       3.00

TOTAL ORDER AMOUNT                                                               $  43.00

This invoice would be considered a user view, and it exhibits the data problems identified by Codd.  If one were to build a database table directly from this view, a new customer could not be inserted into the database without that customer having bought something, because the customer data is embedded in the invoice.  This is the insert anomaly.

Similarly, if this were the last invoice for a customer, deleting it would cause the loss of all data on that customer.  Again, this is because data from two entities are mixed into a single relation.  This is the delete anomaly.  Finally, if one wanted to change the customer's address, it would have to be changed on every invoice, and the possibility of an update being missed is high.  This is the update anomaly.

To begin normalizing the data, one should first identify a unique identifier for the invoice, e.g., the invoice number, and then separate the customer and product data into separate relations. Each relation gets its own unique identifier, called a primary key (marked with # below). The normalization process would result in the following tables:

INVOICE:  #Invoice Number, Customer Number, Terms, Ship Terms, Order Date
INVOICE LINE ITEM:  #Invoice Number, #Product Number, Quantity, Sales Price Unit
PRODUCT:  #Product Number, Description, List Unit Price
CUSTOMER:  #Customer Number, Customer Name, Customer Address, City, State, Zip Code, Phone No.
A relation is in 1st normal form if it contains no multivalued attributes.  That is, every intersection of a row and column must contain at most one data value. With this example, the product line items are placed into a separate relation with a composite key combining the invoice# and product#.

A relation is in 2nd normal form if all non-key attributes are dependent on the entire primary key.  Here, the product information depends only on the product number and not on the composite key of invoice number and product number.  As such, product is moved to a separate relation.

A relation is in 3rd normal form if all the non-key attributes depend only on the primary key.  Here, attributes of another entity (Customer) are included in the invoice relation.  An attribute that depends on another attribute that is not the primary key of the relation is considered to be transitively dependent.  The solution is simply to move any transitively dependent attributes to relations where they depend only on the primary key.

The end result of the data normalization process is that every non-key attribute depends entirely on the primary key. Although there are now four relations instead of one, the insert, update, and delete anomalies are eliminated.  The data is linked through the unique key identifiers, and ad hoc requests for information are not dependent on how the records are stored in the database.
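For illustration, here is a minimal SQL sketch of the four relations above; the column data types are assumptions chosen only to make the example concrete:

-- CUSTOMER, PRODUCT, INVOICE, and INVOICE_LINE_ITEM correspond to the four relations above;
-- the foreign keys (REFERENCES) are what link the data back together for ad hoc queries.
CREATE TABLE customer (
  customer_number   NUMBER         PRIMARY KEY,
  customer_name     VARCHAR2(50),
  customer_address  VARCHAR2(50),
  city              VARCHAR2(30),
  state             VARCHAR2(2),
  zip_code          VARCHAR2(10),
  phone_no          VARCHAR2(15)
);

CREATE TABLE product (
  product_number    VARCHAR2(10)   PRIMARY KEY,
  description       VARCHAR2(50),
  list_unit_price   NUMBER(9,2)
);

CREATE TABLE invoice (
  invoice_number    NUMBER         PRIMARY KEY,
  customer_number   NUMBER         REFERENCES customer,
  terms             VARCHAR2(20),
  ship_terms        VARCHAR2(20),
  order_date        DATE
);

CREATE TABLE invoice_line_item (
  invoice_number    NUMBER         REFERENCES invoice,
  product_number    VARCHAR2(10)   REFERENCES product,
  quantity          NUMBER,
  sales_price_unit  NUMBER(9,2),
  PRIMARY KEY (invoice_number, product_number)
);

Notice that quantity and sales_price_unit live in INVOICE_LINE_ITEM, where they depend on the whole composite key, which is exactly what 2nd normal form requires.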

Friday, December 18, 2009

Non-Equi Joins

Typically in SQL, tables are joined on an equality condition based on the concept of primary and foreign keys. However, as the name implies, a non-equi join query is based on the "inequality" of a condition. For example, a result set based on the inequality of an employee ID in an address table would join each address record to all the other employees, but not to the employee record to which it actually belongs, which is rarely useful. The practical non-equi join therefore compares fields other than the primary and foreign keys, typically against a range of values. Take a look at an example.  Suppose we had the following two tables, Employees and Salary_Grade:
Employees
emp_id, first_name, last_name, job_id, salary
Salary_Grade
grade, min_salary, max_salary
Here the salary field in the Employees table is joined against the range between the minimum and maximum salaries in the Salary_Grade table, a join condition that is not a simple equality on matching key columns.
SQL> select E.last_name,E.salary,G.grade
  2  from employees E INNER JOIN salary_grade G
  3  on E.salary BETWEEN G.min_salary AND G.max_salary
  4  order by G.grade;
The result is shown below:
LAST_NAME SALARY G
------------------------- ---------- -
Philtanker 2200 A
Markle 2200 A
Olson 2100 A
Gee 2400 A
Landry 2400 A
Perkins 2500 B
Himuro 2600 B
Matos 2600 B
OConnell 2600 B
Grant 2600 B
Mikkilineni 2700 B
Although not very common, there are some situations where a non-equi join makes sense.
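If you would like to reproduce this example, here is a minimal sketch of the salary grade table used above; the table and column names follow the query, while the grade boundaries are assumptions consistent with the sample output:

-- Salary bands; each employee falls into the band whose range contains his or her salary.
CREATE TABLE salary_grade (
  grade       VARCHAR2(1) PRIMARY KEY,
  min_salary  NUMBER(8,2),
  max_salary  NUMBER(8,2)
);

INSERT INTO salary_grade VALUES ('A', 2000, 2499);
INSERT INTO salary_grade VALUES ('B', 2500, 2999);
INSERT INTO salary_grade VALUES ('C', 3000, 4999);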

Wednesday, December 16, 2009

XML and Web Services

This post is the first of a series on XML and its use in Web Services.  XML stands for EXtensible Markup Language and is a markup language you can use to create your own tags.  XML was created by the World Wide Web Consortium (W3C) to overcome the limitations of HTML, the Hypertext Markup Language that is the basis for all Web pages.

Like HTML, XML is based on SGML -- Standard Generalized Markup Language. Although SGML has been used in the publishing industry for decades, its perceived complexity intimidated many people that otherwise might have used it. XML was designed with the Web in mind.

XML is not a replacement for HTML as XML and HTML were designed with different goals. XML was designed to transport and store data, with focus on what data is. In contrast, HTML was designed to display data, with focus on how data looks. HTML is about displaying information, while XML is about carrying information.

HTML is the most successful markup language of all time. You can view the simplest HTML tags on virtually any device, from palmtops to mainframes, and you can even convert HTML markup into voice and other formats with the right tools. Given the success of HTML, why did the W3C create XML? To answer that question, take a look at this example HTML code:

<p><b>Mr. John Smith</b><br>
325 Knightsbridge Court<br>
Some Place, CA 94503-4160</p>

The problem with HTML is that it was designed with humans in mind. Even without viewing the above HTML code in a browser, a person can tell it is a mailing address. (Specifically, it's a mailing address for someone in the United States; even if you're not familiar with all the components of U.S. postal addresses, you could probably guess what this represents.)

As humans, we have the intelligence to determine the meaning and intent of most documents. A computer or machine can't do that yet. While the tags in this document tell a browser how to display this information, the tags don't tell the browser what the information is. We know it's an address, but a machine doesn't.

Now let's look at an example of XML coding. With XML, you can create your own tags and assign some meaning to them in the document; in HTML, the tags are pre-defined. More importantly, it's easy for a machine to process the information as well. You can extract the postal code from this document by simply locating the content surrounded by the <postalCode> and </postalCode> tags, technically known as the postalCode element (a sketch of doing exactly that follows the example below).
<address>
<name>
<title>Mr.</title>
<firstName>John</firstName>
<lastName>Smith</lastName>
</name>
<street>325 Knightsbridge Court</street>
<city>Some Place</city>
<state>CA</state>
<postalCode>94503-4160</postalCode>
</address>
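To make the "machine processing" point concrete, here is one way to pull the postal code out of that document. Since this is a database blog, this is only a sketch using Oracle's EXTRACTVALUE function; any XML parser or XPath library would do the same job:

-- Returns 94503-4160 by walking the /address/postalCode path.
SELECT EXTRACTVALUE(
         XMLTYPE('<address>
                    <name>
                      <title>Mr.</title>
                      <firstName>John</firstName>
                      <lastName>Smith</lastName>
                    </name>
                    <street>325 Knightsbridge Court</street>
                    <city>Some Place</city>
                    <state>CA</state>
                    <postalCode>94503-4160</postalCode>
                  </address>'),
         '/address/postalCode') AS postal_code
FROM dual;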
XML was not designed to be a replacement for HTML. HTML was designed to display data, with the emphasis on how the data looks in different browsers. XML was designed to store and transmit data, with the emphasis on what the data is. XML is about creating documents with self-describing data.

How is XML changing the Web? Here are a few key areas:

* XML simplifies data exchange. Different organizations (or even different parts of the same organization) rarely standardize on a single set of tools. As such, it can take a significant amount of work for applications to communicate with each other. Using XML, each organization creates a single utility that transforms their internal data formats into XML and vice versa. In addition, there's a good chance that their software vendors already provide tools to transform their database records (or LDAP directories, or purchase orders, and so forth) to and from XML.

* XML enables smart code as XML documents can be structured to identify every important piece of information (as well as the relationships between the pieces). It's possible to write code that can process XML documents without human intervention. The fact that software vendors have spent massive amounts of time and money building XML development tools means writing that code is a relatively simple process.

* XML enables smart searches. Although search engines have improved steadily over the years, it's still quite common to get erroneous results from a search. If you're searching HTML pages for someone named "Amber," you might also find pages on amber colors, amber light, and other non-relevant matches. Searching XML documents for elements that contained the text Amber would give you a much better set of results.

XML is now as important to the Web as HTML was to the foundation of the Web. It has become the most common tool for data transmission between all sorts of applications, and it is becoming more and more popular for storing and describing information.

Thursday, December 10, 2009

Data Mining

Data Mining evolved from a process called Knowledge Discovery in Databases (“KDD”), which is about finding new, useful, and interpretable knowledge in data. The phrase “knowledge discovery in databases” was coined at a data industry workshop in 1989 to emphasize that knowledge is the end product of a data-driven discovery.

There are many parts to KDD, but roughly it consists of pre-processing raw data, mining it, and interpreting the results; it embodies the overall process of discovering useful knowledge from data. Data Mining is an application of KDD that uses statistical analysis and automated methods to discover patterns in, and extract information from, data; the overriding goal is to discover hidden facts contained in the data. However, the term data mining has also been referred to as knowledge management, knowledge discovery, and “sense-making”, among other euphemisms.

By mining and extracting data, it is possible to prove or disprove existing hypotheses or ideas regarding data or information, while also discovering new or previously unknown information. In particular, unique or valuable relationships between and within data can be identified and used proactively to categorize or anticipate additional data. Through the use of exploratory graphics, in combination with statistics, machine learning tools, and artificial intelligence, critical bits of information can be “mined” from large repositories of data, thereby aiding in decision making.

However, a basic problem faced by users is mapping low-level data, which are typically too voluminous to understand and digest easily, into other forms that might be: (1) more compact, such as a short report; (2) more abstract, such as a conceptual or logical model of the process that generated the data; or (3) more useful, such as a predictive tool for estimating the value of future instances.

The traditional method of converting data into knowledge relies on manual analysis and interpretation. That is, a database user such as a financial or data analyst might periodically analyze current trends and changes in the data and then provide a report detailing the analysis to his or her supervisor or company management. These reports then become the basis for future decision making and planning.

This has been the classical approach which relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and management. However, this form of manual probing of a data set can be slow, expensive, and potentially subjective.

Furthermore, this type of manual data analysis has become impractical as databases have grown in size because of: (1) the number of records or objects in the database and (2) the number of fields or attributes being recorded per object. That is, databases containing a large number of entities and/or objects have become the norm in industry and science. For example, in a field such as medical diagnostics, the number of attributes can easily exceed 100. Therein lies a raison d’être for data mining.

Data mining is typically applied through software that analyzes relationships and patterns in stored data based on open-ended user queries. The objective is prediction and description. Common sources of data are data marts or data warehouses, and generally any of the following types of relationships are sought:

Classes:
  Stored data categorized into predetermined groups. For example, a restaurant chain could mine
  customer purchase data to determine when customers visit and what they typically order. This
  information could be used to increase traffic by having daily specials.

Clusters:
  Data items grouped according to logical relationships or consumer preferences. For example, data
  could be mined to identify market segments or consumer affinities.

Regression:
  Functions which model the data with the least error. An example would be estimating the probability
  that a patient will survive given the results of a set of diagnostic tests, or predicting consumer demand
  for a new product as a function of advertising expenditure.

Associations:
  Data mined to identify associations. The classic “beer-diaper” case study is an example of associative
  mining.

Sequential patterns:
  Data mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer
  could predict the likelihood of a backpack being purchased based on a consumer's purchase of
  sleeping bags and hiking shoes.

The data mining process typically consists of the following major elements:

• Extract, transform, and load transaction data into a data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a graph or table.

In addition, different levels of analysis are available such as:

  • Artificial neural networks: Non-linear predictive models that learn through training and resemble
    biological neural networks in structure.
  • Genetic algorithms: Optimization techniques that use processes such as genetic combination,
    mutation, and natural selection in a design based on the concepts of natural evolution.
  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate
    rules for the classification of a dataset.
  • Nearest neighbor method: A technique that classifies each record in a dataset based on a combination
    of the classes most similar to it in a historical dataset.
  • Rule induction: The extraction of useful if-then rules from data based on statistical significance.
  • Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics
    tools are used to illustrate data relationships.

Data mining is used by companies in the retail, financial, communication, and marketing fields. For example, in retail marketing, the primary applications are database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Other marketing applications are market-basket analysis systems, which find patterns such as, if a customer bought X, he/she is also likely to buy Y and Z. These patterns can be valuable to retailers.


Another example is a retailer using point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments. Also, a global firm such as Wal-Mart employs data mining to manage inventory and supplier relationships and to identify customer buying patterns and new merchandising opportunities.

Data mining is used in other areas such as science, where one of the primary application areas is astronomy. In the financial field, numerous investment and mutual fund companies use it for quantitative analysis and portfolio management. In addition, data mining is used by banks and credit card issuers for fraud detection, and the U.S. Treasury, through its Financial Crimes Enforcement Network, has used data mining to identify financial transactions that might indicate money laundering and terrorist criminal activity.

It is through its use by governmental agencies that controversy over data mining arises. Previous data mining programs employed by the U.S. government to stop terrorism include the Total Information Awareness (TIA) program, Secure Flight (formerly known as the Computer-Assisted Passenger Prescreening System, or CAPPS II), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE), and the Multistate Anti-Terrorism Information Exchange (MATRIX). These programs were discontinued due to controversy over whether they violated the Fourth Amendment to the U.S. Constitution, although many programs that were formed under them continue to be funded by different organizations or under different names.

Two plausible data mining techniques in the context of combating terrorism are "pattern mining" and "subject-based data mining". Pattern mining is a data mining technique that involves finding existing patterns in data. In this context, patterns often mean association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the products purchased. In the context of pattern mining as a tool to identify terrorist activity, pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity. Subject-based data mining involves the search for associations between individuals in data.

Today, data mining applications are available on all size systems for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. There are generally two critical technological drivers:

• Size of the database: the more data being processed and maintained, the more powerful the system
  required.
• Query complexity: the more complex the queries and the greater the number of queries being processed,
  the more powerful the system required.

Relational database storage and management technology is adequate for many data mining applications less than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support larger applications. Some vendors have added extensive indexing capabilities to improve query performance. Others use new hardware architectures such as Massively Parallel Processors (MPP) to achieve order-of-magnitude improvements in query time.

While data mining can be used to uncover patterns in data samples, it is important to understand that non-representative samples of data may produce results that are not indicative of the domain. In addition, data mining can only uncover patterns already present in the data; the target dataset must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. Furthermore, “blind” applications of data-mining methods (aka "data dredging") can be a risky activity, potentially leading to the discovery of meaningless and invalid patterns.

Also, data mining will not find patterns that may be present in the domain, if those patterns are not present in the sample being "mined". There is a tendency for some observers to attribute "magical abilities" to data mining, treating the technique as a sort of crystal ball. Like any other tool, it only functions in conjunction with the appropriate raw material: in this case, indicative and representative data that the user must first collect. Further, the discovery of a particular pattern in a particular set of data does not necessarily mean that pattern is representative of the whole population from which that data was drawn. Thus, an important part of the process is the verification and validation of patterns on other samples of data.

Some people believe that data mining is ethically neutral. Others claim that it is unethical, an invasion of privacy in violation of the Fourth Amendment of the U.S. Constitution. With respect to the latter, data mining requires data preparation which can uncover information or patterns which may compromise confidentiality and privacy obligations.

A common way for this to occur is through data aggregation. Data aggregation is when the data is accrued, possibly from various sources, and put together so that it can be analyzed. This is not data mining per se, but a result of the preparation of data before and for the purposes of the analysis. The threat to an individual's privacy comes into play when the data, once compiled, causes the data miner, or anyone who has access to the newly-compiled data set, to be able to identify specific individuals, especially when originally the data was anonymous. Thus, the ways in which data mining can be used can raise questions regarding privacy, legality, and ethics.

In summary, data mining has proven to be a useful tool on the discovery path for knowledge and truth. It provides tangible benefits to our society in areas such as national security and crime prevention, and in economic terms it can enhance the efficient allocation of scarce resources, thereby lowering costs for both producers and consumers of goods and services. The risks and costs to society lie in the manner in which data mining is deployed. This remains a delicate balancing act, as what may be beneficial for society as a whole may infringe on an individual’s right to privacy. One way to maintain this balance is to seek prior, informed consent from those whose private data is being “mined”. To this end, it would be appropriate for an individual to be made aware of the following before data is collected:

• the purpose of the data collection and any data mining projects,
• how the data will be used,
• who will be able to mine the data and use it,
• the security surrounding access to the data,
• how collected data can be updated.

In addition, methods may be employed to modify the data so that it is anonymous and individuals may not be readily identified. However, even de-identified data sets can contain enough information to identify individuals. (Note: this situation occurred a few years ago with AOL.)

Nonetheless and notwithstanding informed consent and privacy concerns, with rapid advances in technology and artificial intelligence, I believe the applications of data mining will continue to increase.

References

http://www.the-data-mine.com/bin/view/Misc/IntroductionToDataMining

Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, Fall 1996.

Brand, Estelle, and Rob Gerritsen. The DBMS Guide to Data Mining Solutions (1998). A collection of articles including Data Mining and Knowledge Discovery; Predicting Credit Risk; Neural Networks; Naïve Bayes and Nearest Neighbors; and Decision Trees.

"Data-Mining." California Computer News (October 27, 2004).

Edelstein, Herb. "Data Mining: Exploiting the Hidden Trends in Your Data." DB2 Online Magazine (Spring 1997).

Mitchell, Tom M. "Machine Learning and Data Mining." Communications of the ACM, Vol. 42, No. 11 (1999), pp. 30-36. Center for Automated Learning and Discovery, Carnegie Mellon University.

Worthen, Ben. "IT Versus Terror: Preventing a terror attack is invaluable. But even invaluable IT projects need realistic business case analysis to succeed." CIO (August 2006).

Verity, John W. "Coaxing Meaning Out of Raw Data." Business Week (February 3, 1997).

Monday, December 7, 2009

Preventing SQL Injection Attacks

In this posting, I will show how to use PHP prepared statements to prevent SQL injection attacks when using a MySQL database.  An SQL injection attack, as the name implies, is an attempt to insert malicious code into a site's SQL queries.  One aim of such attacks is to create a syntactically invalid query, thereby revealing something about the script or database in the resulting error message.  An even bigger goal would be to alter, destroy, or expose your stored data.

To prevent such an attack, you should validate all data to be used in queries and perform typecasting whenever possible.  In addition, you can use a PHP function like mysqli_real_escape_string(), which makes the data safe to use in queries.

An alternative is to use prepared statements, which were introduced in MySQL version 4.1 and made available to PHP 5 through the MySQL Improved (mysqli) extension.  Without prepared statements, the entire query, including the SQL syntax and the specific values, is sent to the MySQL parser and executed as one long string.  With a prepared query, the SQL syntax is sent to the MySQL parser first.  Then the specific values are sent separately.  MySQL assembles the query using those values and then executes it.

Prepared statements can be created out of any SELECT, INSERT, UPDATE, or DELETE query.  Start by structuring your query, marking placeholders using question marks. An example is shown below:

   Take a query such as:
   $q = "SELECT first_name, last_name, email FROM customers WHERE customer_id=$cust_id";
   As a prepared statement, this query becomes:
   $q= "SELECT first_name, last_name, email FROM customers WHERE customer_id=?";
   Next, prepare the statement in MySQL:
   $stmt=mysqli_prepare($dbconnection, $q);
   At this point, MySQL will parse the query for valid syntax, but won't execute it.
   Next, bind the PHP variables to the query's placeholders:
   mysqli_stmt_bind_param($stmt, 'i', $cust_id);
   The i part of the command indicates what kind of variable, integer in this case, to expect.
The table below shows the characters to use when binding variables:

Letter     Represents
------     ----------
d          Double (floating-point number)
i          Integer
b          Blob (binary data)
s          String (all other types)
Prepared statements are generally more secure than running queries the traditional way.  In addition, they may also be faster.  If a PHP script sends the same query to MySQL multiple times, using different values each time, prepared statements can speed things up: the query itself is parsed by MySQL only once, and then just the values are sent separately.

Preventing Cross-Site Scripting Attacks with PHP

This post will discuss cross-site scripting attacks against your HTML forms: what they are and how to prevent them. HTML is simply plain text, like the bold tag, that is given special meaning by web browsers. As such, visitors to your web site could easily place HTML in their form data, for example in the comments field of an email form.

What could be the problem? Well, many dynamic Web applications take information submitted by a user, store it in a database, and re-display that information on another page; a message board or forum are two examples. If a user were to enter HTML code in their form data, this code could throw off the layout and appearance of your site.

In addition, suppose a user placed JavaScript into your web form. JavaScript is also plain text, but text that has special meaning: it is executable code within a browser. If such code were entered into a form and re-displayed in the browser, it could create popup windows, steal cookies, or redirect the browser to another page. This is known as a cross-site scripting, or XSS, attack.

PHP includes several functions for handling HTML and other code found within strings, thus preventing XSS attacks.  These include:

  • htmlspecialchars(), which turns &, ', ", <, and > into their HTML entity equivalents.
  • htmlentities(), which turns all applicable characters into their HTML entity equivalents.
  • strip_tags(), which removes all HTML and PHP tags.
These functions are listed from the least disruptive to the most.  A couple of examples of the syntax are shown below:

// Check for the form submission and handle the submitted data.
if (isset($_POST['submitted'])) {
    // Echoing the raw value could mess up the page or attack the browser:
    // echo $_POST['data'];
    // To keep submitted information from messing up a page or hacking the web browser:
    echo htmlentities($_POST['data']);
    // The strip_tags() function completely removes any JavaScript, HTML, or PHP tags
    // and is the most effective way to prevent XSS attacks:
    echo strip_tags($_POST['data']);
}
Unrelated to security but quite useful is the nl2br() function. It turns every return (such as those entered into a textarea) into an HTML br tag.

Saturday, December 5, 2009

Oracle SQL Merge Statement

The Oracle SQL Merge Statement is a powerful utility that allows the combination of inserts, updates and deletes all in one statement. It is particularly useful in data warehousing operations.

Let's review an example. Suppose, under a hypothetical scenario, there are an employees table and a bonus table, where the bonus table contains a flat bonus for all employees in a sales position (commission_pct is not null).

Suppose HR decided that all sales employees should receive an additional bonus of 1% of their salary if their salary is $50,000 or less, while all other employees (those not in sales) receive a bonus of 10% of their salary if it is $50,000 or less. The appropriate SQL MERGE statement is depicted below:
MERGE INTO Bonus B
USING (SELECT employee_id,salary,department_id
FROM employees) S
ON(S.employee_id=B.employee_id)
WHEN MATCHED THEN
UPDATE SET B.Bonus=round(B.Bonus+0.01*S.salary,2)
DELETE WHERE (S.salary >50000)
WHEN NOT MATCHED THEN
INSERT (B.employee_id,B.Bonus) VALUES (S.employee_id,round(S.salary*0.1,2))
WHERE (S.salary <= 50000);

To verify the results, run:

SELECT B.Bonus, B.employee_id, S.salary, S.commission_pct
FROM employees S LEFT OUTER JOIN Bonus B
ON B.employee_id = S.employee_id
ORDER BY B.employee_id;

Now for an explanation:
MERGE INTO:
The target table into which you want to insert, update, or delete data.
USING:
The source data; it can be a table, a view, or a subquery.
ON:
The condition on which you want to match the source and target tables.
WHEN MATCHED:
If there is a match, perform an update and, possibly, a delete based on the DELETE WHERE condition.
WHEN NOT MATCHED:
If there is no match, perform an insert, optionally filtered by its WHERE condition.

And there you have it, a complex, but useful data warehousing solution.
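If you want to experiment with this statement, here is a minimal sketch of the BONUS table assumed above (EMPLOYEES is the standard Oracle HR sample table, and the flat bonus amount of 100 is just an assumption for the example):

CREATE TABLE bonus (
  employee_id  NUMBER       PRIMARY KEY,
  bonus        NUMBER(8,2)
);

-- Seed the table with a flat bonus for every employee currently in a sales position.
INSERT INTO bonus (employee_id, bonus)
SELECT employee_id, 100
FROM employees
WHERE commission_pct IS NOT NULL;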

Pagination Using PHP

PHP is a server-side scripting language used to make dynamic web pages. Ever wonder how to paginate results from a MySQL database using PHP? Well, here is a script that I have written that could help you.
// First, make the MySql database connection:
$con=mysqli_connect('localhost','username','password','db');
if (!$con)
{
die('Could not connect: ' . mysqli_connect_error());
}
Next, determine the number of records per page you want to see and then determine how many pages. One way to do this is by using an IF conditional and the GET method.
$display=10;
if (isset($_GET['p']) && is_numeric($_GET['p'])) { //Already been determined
$pages=$_GET['p'];
} else {// Need to determine:
// Next, count the number of records:
$q= "SELECT COUNT(id) FROM db";
$r=@mysqli_query($con,$q);
$row=@mysqli_fetch_array($r,MYSQLI_NUM);
$records=$row[0];
//Calculate the number of pages...
if ($records > $display) { // More than 1 page
$pages=ceil ($records/$display);
} else {
$pages=1;
}
}
// End of p IF.
//Determine where in the database to start returning results...
if (isset($_GET['s']) && is_numeric($_GET['s'])) {
$start = $_GET['s'];
} else {
$start=0;
}
//Make the query
$q= "SELECT id FROM db ORDER BY id ASC LIMIT $start,$display";
$r= mysqli_query($con,$q);
//Count the number of returned rows:
//$num= mysqli_num_rows($r);
//if ($num> 0) { //If it ran ok, display the records
//Make the links to the other pages, if necessary.
if ($pages > 1) {
// Determine what page the script is on:
$current_page = ($start/$display) + 1;
// If it's not the first page, make a Previous button.
// (The link targets assume this script is saved as index.php.)
if ($current_page != 1) {
echo '<a href="index.php?s=' . ($start - $display) . '&p=' . $pages . '">Previous</a> ';
}
// Make all the numbered pages:
for ($i = 1; $i <= $pages; $i++) {
if ($i != $current_page) {
echo '<a href="index.php?s=' . ($display * ($i - 1)) . '&p=' . $pages . '">' . $i . '</a> ';
} else {
echo $i . ' ';
}
} // End of FOR loop.
// If it's not the last page, make a Next button:
if ($current_page != $pages) {
echo '<a href="index.php?s=' . ($start + $display) . '&p=' . $pages . '">Next</a>';
}
} // End of links section.
Happy coding!

Correlated Subqueries

This post will review the correlated subquery in SQL programming. To understand a correlated subquery, let's first discuss a non-correlated subquery. A non-correlated subquery is a sub-select where the inner select makes no reference to the outer select that contains it. Here is an example:
SELECT column1, column2...
FROM table1
WHERE column1 NOT IN
(SELECT column1
FROM table2);

In this query, the inner select can be run first, and the result set of the inner select is then used in the outer select. A correlated subquery works fundamentally differently, as the inner select refers to values produced by the outer select. The syntax is shown below:
SELECT Out.column1, Out.column2,...
FROM table1 "Out"
WHERE Out.column2 operator (SELECT In.column2...
FROM table 2 "In"
WHERE In.column1=Out.column1);
The correlated subquery references a column of the main query to establish a link to the main query. This means that for every row of the outer query, the sub(inner) query is evaluated. Note the alias assigned to the tables in the outer and inner queries and the use of them in the WHERE clause in the inner query.

Also note, a correlated subquery can be performed where the outer and inner tables are the same. A scenario would be where you want to select all employees whose salary exceeds the average salary of all employees. Furthermore, a correlated subquery can be used as a column expression in the SELECT list, as a filter expression in the WHERE clause, or as a data source expression in the FROM clause.
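To make the same-table case concrete, here is a sketch that adapts the scenario slightly so the inner query is genuinely correlated (it assumes an employees table with department_id and salary columns, as in the Oracle HR sample schema): it returns every employee who earns more than the average salary of his or her own department.

-- For each outer row, the inner query recomputes the average for that employee's department.
SELECT e.employee_id, e.last_name, e.salary
FROM employees e
WHERE e.salary > (SELECT AVG(i.salary)
                  FROM employees i
                  WHERE i.department_id = e.department_id);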

However, correlated subqueries are far less efficient than non-correlated subqueries because the inner query must be invoked for each row found by the outer query. With a non-correlated subquery, the inner query is run only once.

PHP Connection Script to MySQL

This post will discuss using the XAMPP for Windows tool to install a local Apache web server, PHP, and a MySql database server. I will also show how to connect to MySql by using PHP.

XAMPP can be downloaded from http://www.apachefriends.org/en/xampp-windows.html. Besides a PHP parser and a MySQL database server, XAMPP also includes an FTP server, a mail server, and phpMyAdmin, a web-based MySQL administration client. There are 3 install options: Basic Package, EXE, and a ZIP file. EXE is a self-extracting RAR archive. The Basic Package is a Windows installer and is the one I used.

With the Windows installer, XAMPP is installed in a directory on the C drive. There is a subdirectory called htdocs which is the root folder for loading your html or php files. To view pages in your browser, you must type http://localhost or http://127.0.0.1 followed by the file name and file extension.

PHP script to connect to MySQL database:
A database connection variable is established through the mysql_connect function. The function takes three parameters: the host system, username, and password. On a local server, the host name would be localhost or 127.0.0.1; the username is root and the password is user defined.

An IF conditional tests whether a connection to MySQL occurs. An error message is displayed if there is no connection. Once a connection is established, you select the database with the mysql_select_db function, which takes the name of the database and the connection.

<?php # Connection Script to MySQL

$dbc = mysql_connect('hostname', 'username', 'password');
if (!$dbc) {
die('Not connected : ' . mysql_error());
}

// make dbname the current database
$db = mysql_select_db('dbname', $dbc);
if (!$db) {
die ('Can\'t use dbname : ' . mysql_error());
}

?>

Geographic Information Systems

A Geographic Information System (“GIS”) is a computer system capable of capturing, storing, analyzing, and displaying geographically referenced information; that is, data identified according to location. Technically, a GIS includes mapping software and its application with remote sensing, land surveying, aerial photography, mathematics, photogrammetry, geography, and tools that can be implemented with GIS software. In addition, a GIS can also be defined as the procedures, operating personnel, and spatial data that go into the system. However, a GIS typically doesn’t include all tools connected to topology.

The computer equipment used in GIS is like any other computer hardware: keyboards, display monitors, cables, and an Internet connection. In addition, the hardware configuration may include extra components such as large printers and plotters to produce map output, and digitizers and scanners to input data from maps into the GIS.

However, not every GIS needs this configuration, as the essential element is the type of information being stored. That is, a computer being used for GIS would include maps and images. In addition, GIS functions can be much more sophisticated such as to:
• measure distances, areas
• combine maps of the same area together
• keep inventories of what is where
• manage properties, facilities
• judge the suitability of areas for different purposes

The power of a GIS comes from the ability to relate different information in a spatial context and to reach a conclusion about this relationship. For example, most of the information we have about our world contains a location reference, placing that information at some point on the globe. In the strictest sense, the term “GIS” describes any information system that integrates, stores, edits, analyzes, shares, and displays geographic information. In a more generic sense, GIS applications are tools that allow users to create interactive queries (user-created searches), analyze spatial information, edit data and maps, and present the results of all these operations.

One of the most common products of a GIS is a map. Maps are generally easy to make using a GIS, and they are often the most effective means of communicating the results of the GIS process. A GIS can also be used to emphasize the spatial relationships among the objects being mapped. While a computer-aided mapping system may represent a road simply as a line, a GIS may also recognize that road as the boundary between wetland and urban development, or between two census statistical areas.

Data capture—putting the information into the system—involves identifying the objects on the map, their absolute location on the Earth's surface, and their spatial relationships. Software tools that automatically extract features from satellite images or aerial photographs are gradually replacing what has traditionally been a time-consuming capture process. Objects are identified in a series of attribute tables—the "information" part of a GIS. Spatial relationships, such as whether features intersect or whether they are adjacent, are the key to all GIS-based analysis. As such, databases are extended to include GIS by using object oriented programming.

Thus, in terms of database design, a GIS combines features of the relational model with the object-oriented model and could therefore be described as an “object-relational” system.

GIS technology can be used for scientific investigations, resource management, asset management, archaeology, environmental impact assessment, urban planning, cartography, criminology, geographic history, marketing, logistics, mapping, and other purposes.

For example, when rainfall information is collected, it is important to know where the rainfall is located. This is done by using a location reference system, such as longitude and latitude, and perhaps elevation. Comparing the rainfall information with other information, such as the location of marshes across the landscape, may show that certain marshes receive little rainfall. This fact may indicate that these marshes are likely to dry up, and this inference can help us make the most appropriate decisions about how humans should interact with the marsh. A GIS, therefore, can reveal important new information that leads to better decision making.

In addition, GIS might allow emergency planners to easily calculate emergency response times in the event of a natural disaster, help find wetlands that need protection from pollution, or be used by a company to site a new business location to take advantage of a previously under-served market.

An active commercial GIS market (for example Autodesk is a major participant) has resulted in lower costs and continual improvements in the hardware and software components. These developments will, in turn, result in a much wider use of the technology throughout science, government, business, and industry, with applications including real estate, public health, crime mapping, national defense, sustainable development, natural resources, landscape architecture, archaeology, regional and community planning, transportation and logistics.

GIS is also appearing in location-based services (“LBS”). LBS allows GPS-enabled mobile devices to display their location in relation to fixed assets (the nearest restaurant, gas station, or fire hydrant), mobile assets (friends, children, a police car), or to relay their position back to a central server for display or other processing. These services continue to develop with the increased integration of GPS functionality with increasingly powerful mobile electronics (cell phones, PDAs, laptops). For example, web mapping services such as MapQuest utilize GIS.

GIS industry applications are diverse such as:
Utilities:
This category includes gas, phone, electric, water, and cable TV companies. A single company may have hundreds of thousands of customers, each with a connection to the network, plus thousands of miles of wires and underground pipes with transformers, switches, and poles, representing billions of dollars of installed infrastructure.  A utility company receives thousands of maintenance calls per day. As such, utilities need to:
• keep track of all this activity
• maintain accurate information about what is where
• keep records up to date
• make daily work assignments to crews
• provide information to others e.g. another company wishes to dig up a street, what do they need to avoid?

Transportation:
Commonly encountered applications are the information systems used by airlines and travel agents to make reservations, check in passengers, etc. In addition, a state department of transportation needs to:
• store information on the state of pavement everywhere on the state highway network
• maintain an inventory of all highway signs
• analyze data on accidents, look for 'black spots'

Also, a company could use GIS for enabling traveling salespeople to find locations and time-saving routes. Firms such as Federal Express, UPS, and government agencies such as the U.S. Postal Service use GIS to:
• keep track of shipments, know where they are
• plan efficient delivery routes

Agriculture:
Farmers increasingly use GIS to produce detailed maps and images for crop planting to:
• analyze yields
• plan efficient application of fertilizers, chemicals
These techniques are known as precision agriculture.

Foresters:
Use GIS to keep track of what timber is growing. They need to be able to plan timber harvests that provide for timber needs now while maintaining a healthy forest resource for the future. In addition, they use GIS to plan the locations of roads and the methods of cutting and removing logs in order to comply with environmental regulations. There is also a need to manage forests for many purposes, including recreation.

In summary, a GIS stores relational geographic information including maps and images. With rapid advances in technology and decreasing costs for storage and memory devices, the industry applications for GIS have grown in recent years from a small niche market. Similar to other relational databases, GIS enables users to perform interactive queries, and its object-oriented features aid in decision making where data is linked to geography.

Electronic Databases And The Right to Privacy

Collecting personal information is often a source of controversy. While obtaining personal information has a clear value in criminal justice applications, it also raises civil liberty and privacy issues in other areas. In addition, with the World Wide Web and electronic record keeping, cyber criminal and terrorist organizations pose increasing threats to individual financial well-being and national security.

Personal information already exists in a wide variety of databases. For example, health and life insurance companies possess personal information such as medical, criminal, motor vehicle, and credit records. The Medical Information Bureau (“MIB”) contains approximately 15 million files in a central database. Every time an insurance claim is filed, a copy of this information is forwarded to the MIB. In addition, governmental agencies such as Medicare and the Social Security Administration collect personal information.

Here are some ways collecting personal information (including DNA) in electronic databases is helpful to society:

  • Accurate and fast identification of offenders and potential terrorist threats
  • Critical leads for police investigators
  • Exonerating innocent suspects
  • Increased conviction of offenders
  • Earlier detection and treatment of disease; better organized and legible medical files
  • With encryption technology, computerized records may actually be more secure than paper records. A fax or a copy of a paper record can be stolen without leaving a trace.

However, some potential negative consequences that could arise are:

  • Denial of health coverage
  • Denial of life insurance
  • Discrimination by employers
  • Violation of privacy through leakage of information
  • A step on the road to a 'Big Brother' state
  • National databases could be massively expensive and bureaucratic
  • Digitizing records will allow many more people legitimate access; personal information can easily be misused if it is accessed by so many individuals.
  • Some institutions gather medical information and sell it to drug companies.

In the US, the current health care reform debate has centered on methods of reducing costs in the delivery of patient care. One area that is getting strong attention is the implementation of a national medical computerized record keeping system. Legislators and health care policy advocates view eliminating the paperwork burden on practitioners and third-party payers as a potential huge cost savings to society, as well as the means to offer some form of medical coverage for all citizens, if a government health care option is offered.

Already, some health care providers and insurance companies are forming regional information networks to share electronic medical records. Their reasoning for setting up these data banks is to help with the reduction of paperwork, help with billing, identify the most cost-effective treatment, and to fight against false claims.

In this scenario, a person's medical information is immediately available for the attending doctor. Therefore if an individual was injured in another part of the country, the attending physicians would have the patient's entire medical history at their fingertips. Included in this information could be life saving information that would be invaluable to the attending doctor.

Thus, with advances in technology, more personal information is being collected and maintained in databases. However, as databases grow, so too have fears about privacy. Many people fear that recording and storing personal information such as DNA samples in a national database violates the Fourth Amendment of the U.S. Constitution, which protects American citizens from unlawful searches and seizures without probable cause. Many argue that DNA databases treat all people with database profiles as potential suspects without probable cause. In other words, you are guilty until proven innocent. If this argument is correct, a national DNA database would be a violation of the Fourth Amendment. Others argue that a DNA database is an effective law enforcement tool.

Another area of concern is iris scans, which raise privacy issues similar to DNA collection. The human iris is as distinctive as a fingerprint and is increasingly used by law enforcement agencies for identification. As such, privacy advocates charge that iris scans and DNA privacy issues are symptoms of an increasingly surveillance-based society.

Nonetheless, I believe the electronic collection of sensitive personal information is here to stay. Furthermore, emerging technological advances such as virtualization and cloud computing will make it easier and cheaper to collect and maintain an increasing amount of personal information.

So, I believe the collection of personal data should be seen in the context of the personal information that is already held by outside agencies. For example, insurance brokers commonly require an extensive medical history of their clients. Mortgage lenders usually demand a full credit record of each applicant. (Note: all real estate transactions in California require a notary public to fingerprint the mortgagee.) Employers subject their employees to random urine tests for drug and alcohol consumption. However, these actions are governed by the individual’s informed consent. When the citizen releases information to outside agencies, he or she receives a service or benefit in return. If not explicit, there is an implied consent.

Thus, I don’t believe we as a society can turn back the clock and there will be an increasing societal need to collect and maintain personal information in electronic databases, as a means of reducing costs and making government more responsive and efficient to its citizens.

In areas like criminal justice and law enforcement, the collection of personal data such as DNA would have to be mandatory, for otherwise those liable to commit crime would simply refuse to provide a sample. In the health care industry, medical records are being computerized into databases and, through HIPAA, are subject to a significant degree of protection. For example, the HIPAA Security and Privacy Rules require all covered entities to protect personal information that they use or disclose to business associates, trading partners, or other entities. New standards and technologies have significantly simplified the way in which data is transmitted throughout the healthcare industry and created tremendous opportunities for improvements in the healthcare system. However, these technologies have also created complications and increased the risk of loss and unauthorized use and disclosure of this sensitive information. Furthermore, the use of genetic information by medical insurance companies remains highly controversial, as there is considerable potential for abuse of information that is so private.

On balance, I am inclined to allow collecting personal information as long as it comes with informed consent. Relative to consumer, financial, real estate, employment and insurance transactions, an individual must consent in writing before such information is collected. Initially, this consent is within the application and subsequently in the form of an agreement or contract. The legal issue is one of consideration. That is in consideration for providing this service, I agree to provide or allow you to collect such and such information on me.

However, I do feel there are public policy issues on what type of information should be collected and the appropriate security mechanisms on the data. The invasiveness of the database resides in the information being maintained on file, rather than in the procedure for obtaining the data. The extent of the information collected should be governed by what is reasonable, prudent, and customary in the specific line of business and what is governed by law. Furthermore, the individual should be entitled to “opt-in” and not “opt-out” privileges.

I believe the needs of society for information can be balanced against individual rights to privacy by the security measures taken. This can be addressed through risk analysis and risk management strategies areas such as the access, storage and transmission of personal data. In addition, collectors of personal data should have access policies and procedures ensuring that users only access data for which they are appropriately authorized.

Database information is accessible through a wide variety of devices such as laptops; home-based personal computers; PDAs and smart phones; hotel, library, or other public workstations and Wireless Access Points (WAPs); USB flash drives and memory cards; floppy disks; CDs; DVDs; backup media; email; smart cards; and remote access devices (including security hardware). The security issue is the vulnerability of such devices. Remote access should only be granted to authorized users based on their role within the organization and their need for access.

Storage policies and procedures should address the security requirements for media and devices which are moved beyond the covered entity’s physical control. Such media and devices include laptops, hard drives, backup media, USB flash drives and any other data storage item which could potentially be removed from the organization’s facilities.

Transmission policies should be in place ensuring the integrity and safety of data sent over networks, and include both the direct exchange of data (for example, in trading partner relationships) and the provisioning of remote access to applications hosted by the organization (such as a provider’s home access to ePrescribing systems or “web mail” in organizations where personal data might be included in internal communications).

However, no amount of risk analysis will be effective if the collector’s workforce does not have an appropriate security awareness and training program; it is important that a covered entity’s security awareness and training program specifically address any vulnerability associated with remote access. Training should provide, at minimum, clear and concise instructions for accessing, storing and transmitting personal data.

If applicable, collectors should include in their workforce awareness and training programs password management procedures (for changing and safeguarding passwords); remote device/media protection to reinforce policies that prohibit leaving devices or media in unattended cars or public thoroughfares; as well as training on policies prohibiting transmission over open networks (including email) or downloading information to public or remote computers. Also, when addressing the development and implementation of data security policies, a collecting entity should consider at least requiring employees to sign a statement of adherence to security policies and procedures as a prerequisite to employment.

In summary, collecting sensitive information from individuals (particularly DNA) is a difficult public policy issue where the economic and national security interests of society must be balanced against the privacy rights of the individual under the U.S. Constitution. In the opinion of informed parties, since Sept. 11, 2001, there have been substantial national security benefits to collecting personal information. In the case of an imminent threat, the national interests of defense may have to trump the privacy rights of individuals. But this is still a tough call.

Furthermore, relative to the treatment and prevention of disease, there are tangible economic benefits to the nation through the collection of genetic information. However, there remains the question of how much information is needed? When is enough, enough?

Notwithstanding, I believe that unless the individual is receiving a direct service or benefit, or for law enforcement purposes, the collection of personal information (specifically DNA) should be voluntary. An example of this situation would be a parent voluntarily having their infant fingerprinted out of fear of a stranger abduction.

Nonetheless, this public debate will continue and I foresee these issues and concerns being resolved through future legislative means and judicial review.

SQL Pivot Statement

This post reviews the PIVOT statement introduced with the Oracle database, version 11g.  Tables in relational databases are displayed in a tabular format with a column-value pair structure.  Consider the table named EMPLOYEES. Using the desc command, the table structure is shown below:


SQL> desc employees;

Name                     Null?               Type
------------       --------------       -------------
EMPLOYEE_ID         NOT NULL        NUMBER(4)
FIRST_NAME                               VARCHAR2(25)
LAST_NAME           NOT NULL        VARCHAR2(25)
BRANCH_ID            NOT NULL        VARCHAR2(3)
CALLS_MADE                               NUMBER(5)
If the following query is run:
SELECT employee_id,branch_id,calls_made
FROM employees
ORDER BY employee_id;
The output would be as follows:
EMPLOYEE_ID BRANCH_ID   CALLS_MADE
----------       ----------          ---------------
1                  BR1                 7
2                  BR2                10
3                  BR3                 5 
... etc....
Here, the data is represented as rows of values. For each employee, the output shows the branch_id and number of calls made by each.  As an employee makes more calls, the column CALLS_MADE is updated.

Now suppose you want a report of calls made by branch, and a count of employees who made 1 call, 2 calls, 3 calls, etc.  You would write the following SQL query:
SELECT branch_id, calls_made, count(1) cnt
FROM employees
GROUP BY branch_id, calls_made;

Here is how the output would appear:

BRANCH_ID     CALLS_MADE          CNT
------------       --------------            ----------------
BR2                       1                                51
BR2                       2                                19
BR2                       3                                15
BR2                       4                                24
BR2                       5                                19
... etc. ...
This is the data you requested but it could be difficult to read. Another way to represent the same data would be through a cross-tab report, where the calls_made is shown vertically and branch_id's shown horizontally, in a spreadsheet format:
Calls_made            BR1          BR2              BR3
-----------     -------       -------           ---------
1                          60             51                 40
2                          13             19                 17
3                          11             15                 10
4                          10             24                 15
5                           8              19                 12
... etc. ...
Prior to Oracle Database 11g, you could have used the DECODE function for each value and written each distinct value as a separate column. However, this approach is not very intuitive. Now, you can use the Oracle PIVOT clause to obtain results in a cross-tab format.  Depicted below is how you would write the query:
SELECT * FROM (
SELECT calls_made, branch_id
FROM employees
)
PIVOT
(
count(branch_id)
FOR branch_id in ('BR1','BR2','BR3')
)
ORDER BY calls_made;
Here is the output:
CALLS_MADE         'BR1'         'BR2'                 'BR3'
---------------   ----------     ----------       ---------------
1                          60             51                 40
2                          13             19                 17
3                          11             15                 10
4                          10             24                 15
5                           8              19                 12
... etc. ...
This shows the utility of the PIVOT operator. The branch IDs are presented on the header row instead of in a column. In other words, in the cross-tab report the BRANCH_ID column is transposed to the header row: the column becomes the row, rotated 90 degrees anti-clockwise. This rotation requires a pivot point, and in this situation the pivot point is the count(branch_id) expression.
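For comparison, the pre-11g DECODE approach mentioned earlier might look something like the sketch below. Each distinct branch has to be written out by hand, which is why the PIVOT clause is the more intuitive choice:

-- COUNT only counts non-null values, so each DECODE contributes to exactly one branch column.
SELECT calls_made,
       COUNT(DECODE(branch_id, 'BR1', 1)) AS br1,
       COUNT(DECODE(branch_id, 'BR2', 1)) AS br2,
       COUNT(DECODE(branch_id, 'BR3', 1)) AS br3
FROM employees
GROUP BY calls_made
ORDER BY calls_made;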

There you have it, a nice addition to your SQL toolkit.