Category Archives: General

101 – Data Mining and Predictive Analytics

With text, Web and media (unstructured data) now being mined alongside structured data, information mining is a more appropriate label than data mining. By mining a combination of these, companies are able to make the best use of structured data, unstructured text and social media. Static and stagnant predictive models of the past don’t work well in the world we live in today. Predictive analytics should be agile enough to adapt to, and capitalize on, quickly changing customer behaviors, which are often identified online and through social networks.

Better integration of data mining software with the source data at one end and with the information consumption software at the other end has led to improvement in the integration of predictive analytics with day-to-day business. Even though there haven’t been significant advancements in predictive algorithms, the ability to apply large data sets to models and the ability to enable better interaction with business has led to improvements in the overall outcome of the exercise.

There is a great introduction to the world of data mining and predictive analytics here.

BI market consolidation: What does it mean for you?

A must-read article for BI enthusiasts on the recent consolidation of the BI industry. A well-researched and informative article by Stuart Lauchlan:

Still, it’s encouraging that the BI market is still showing signs of life after a period of considerable turmoil and consolidation over the past two years with IBM, Oracle and SAP swallowing up Business Objects, Cognos and Hyperion Solutions. Since March 2007, the three enterprise giants have dished out $15 billion to bolster their BI credentials. Oracle offered $3.3 billion for Hyperion, SAP pitched $6.8 billion for Business Objects while picking up Cognos cost IBM $5 billion.


Another development is the blurring of boundaries as BI starts to encroach on other technology areas. For example, Forrester Research cites the merging of BI and search technologies to provide business people with better context and information to make daily decisions. “As search and BI get ever closer, the lines could eventually blur to the point of simply going away,” said Forrester in its ‘Search + BI = Unified Information Access’ report. “This will help bridge the artificial system boundaries between structured data and unstructured content. It will not only affect the interfaces we use to search for, discover, analyse, and report on what we need to know, but help us learn more about what we don’t know.”

This is one of the immediate advantages of convergence between BI and search – the ability to discover things you didn’t know you didn’t know. Forrester noted: “As search gets more powerful and begins to understand the meaning behind unstructured text, entity extraction and other linguistic analysis methods will be able to be used to reveal unforeseen and highly illuminating connections among documents or between documents and data.”

Text Analytics Accuracy

Seth Grimes has written a very interesting article on the B-Eye Network about text analytics and how accurate it is in deployments.

A Must Read for Text Analytics Teams.

Here’s an interesting paragraph from the article -

The accuracy of information retrieval (for instance, the results returned by a search) and of information extraction (where important entities, concepts and facts are pulled from “unstructured” sources) is typically measured by an f-score, a value based on two factors – precision and recall.

Precision is the proportion of information found that is correct or relevant. For example, if a Web search on “John Lennon” turns up 17 documents on Lennon and also 3 exclusively about Yoko Ono, who is of little interest but was associated with Lennon due to co-occurrence of the two individuals’ names in a large number of documents, then the precision proportion would be 17/20 or 85%.

Recall, by contrast, is the proportion of information found of information available. If there were actually 8 documents legitimately about John Lennon that were not found, perhaps because only a small portion of each was devoted to Lennon, leading to low “term density,” then the recall would be 17/25 or 68%.
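The arithmetic in the quoted example can be checked with a short Python sketch. The function name `precision_recall_f1` is hypothetical, and the balanced F1 form (the harmonic mean of precision and recall) is an assumption here, since the excerpt says only that the f-score is based on the two factors.

```python
def precision_recall_f1(relevant_found, irrelevant_found, relevant_missed):
    """Return (precision, recall, F1) from raw document counts."""
    found = relevant_found + irrelevant_found      # everything the search returned
    available = relevant_found + relevant_missed   # everything that should have been returned
    precision = relevant_found / found if found else 0.0
    recall = relevant_found / available if available else 0.0
    # Assumed balanced F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts from the John Lennon example: 17 relevant hits,
# 3 hits about Yoko Ono only, 8 relevant documents not found.
p, r, f1 = precision_recall_f1(17, 3, 8)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
# prints: precision=0.85 recall=0.68 f1=0.756
```

A single f-score is handy for comparing deployments, but it hides the trade-off: a search tuned to return fewer documents usually gains precision and loses recall.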

Using Marketing Analytics to Slingshot Sales

According to Ian Michiels, an analyst with Aberdeen Group Inc., successful companies find a way to more tightly integrate marketing and sales. “How do you prioritize the leads that go into the sales pipeline? Metrics that help you do that are going to make you more effective,” he said. The goal is to find ways to define the best leads, whether that’s by the number of times a potential customer has gone to a Web site, attended a webinar or visited a trade-show booth. Only send the best leads to sales.

Once you provide this information, Rego said, your reps have the information they need to close more sales. “The carrot is providing sales with as much ammunition as needed to close the sale. The way you need [to] do that is tight integration from the marketing automation system into the SFA system.” Once you do this, you begin to build a consensus around the definition of a “good lead.” Rego said this approach gives marketing a view into the sales funnel and can provide marketing-level information as the sales staff approaches a sales opportunity.
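One way to picture an agreed definition of a “good lead” is a simple scoring rule over the activities mentioned above. The Python sketch below is only an illustration: the weights, the threshold and the names `WEIGHTS`, `THRESHOLD`, `lead_score` and `best_leads` are all hypothetical and do not come from the article.

```python
# Hypothetical points per activity; a real team would tune these against closed deals.
WEIGHTS = {"site_visits": 1, "webinars": 5, "booth_visits": 8}
THRESHOLD = 10  # assumed cut-off for a "good lead"


def lead_score(activity):
    """Weighted sum of a lead's recorded activities (missing ones count as 0)."""
    return sum(weight * activity.get(name, 0) for name, weight in WEIGHTS.items())


def best_leads(leads):
    """Names of leads at or above the threshold, highest score first."""
    scored = sorted(((lead_score(a), name) for name, a in leads.items()), reverse=True)
    return [name for score, name in scored if score >= THRESHOLD]


leads = {
    "lead_a": {"site_visits": 3, "webinars": 1, "booth_visits": 1},  # 3 + 5 + 8 = 16
    "lead_b": {"site_visits": 2},                                    # 2
    "lead_c": {"site_visits": 4, "webinars": 2},                     # 4 + 10 = 14
}
print(best_leads(leads))  # prints: ['lead_a', 'lead_c']
```

Only the leads that clear the threshold would be passed on to sales; the rest stay with marketing until their activity picks up.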

Read the entire article on Marketing analytics at Inside CRM.

Data Warehousing on a Shoestring Budget

TDWI is running a three-part series on developing and deploying data warehousing frugally. Read Parts 1 and 2.

Although seemingly difficult, you can make choices, which allow for the beneficial realization of data warehousing while also minimizing costs. By balancing technology and carefully positioning your business, your organization can quickly create cost-effective solutions using data warehousing technologies.

There are a few simple rules to help you develop a data warehouse on a shoestring budget:

* Use what you have
* Use what you know
* Use what is free
* Buy only what you have to
* Think small and build in phases
* Use each phase to finance or justify the remainder of the projects

It’s also a must read for businesses that have ample sponsorship and enormous resources. Tough times in the marketplace like these call for an economical way of staying ahead of the business curve. And that’s exactly the point of this series.

I like the detailed approach Nathan Rawling takes towards this topic.

Data Mining Prescribed To Ensure Drug Safety

From Info Week -

This week, WellPoint, one of the nation’s largest health insurers, revealed it’s investing millions of dollars in a three-year project to build such a drug surveillance system in collaboration with the FDA and several academic institutions, including Harvard University, University of Pennsylvania, and the University of North Carolina. The Safety Sentinel System will mine and analyze aggregate claims, lab, and pharmaceutical data from WellPoint’s 35 million members, who generate 1.4 billion “claim lines” of data each year, said Marcus Wilson, president of HealthCore, WellPoint’s medical outcomes research subsidiary, which WellPoint acquired in 2003 and which is overseeing the new project.

9 Cost-Cutting Tactics in Data Management and Integration

“When aiming to optimize costs in data management and integration initiatives, it is critical to know what steps to take and where significant savings can be realized while maintaining success in these projects,” said Ted Friedman, vice president and distinguished analyst at Gartner. “In most cases, the cost of implementing the steps will be far outweighed by the savings that can be realized.”

Gartner identified nine key areas in which CIOs can significantly reduce costs during 2008 as they continue to support data management and integration-related initiatives:

From Gartner, these are high-performance, efficient ways to cut costs on data management initiatives.

Deploying the Integrated Customer Database

An excellent case study by Andres Perez on how a company tried to deploy a single integrated customer database, and the practical challenges it faced, from financing to ROI questions. A must read.

The demand for integrated information has created a vendor response that has spawned a market for what many call customer data integration (CDI) or master data management (MDM). These approaches are characterized in many ways; however, they are typically presented as a “federation” or “consolidation” of disparate databases and applications to present an “integrated” or “unified” view of the customer, product, supplier, etc. The vendors offering customer relationship management (CRM) tools, CDI or MDM capabilities usually focus on facilitating and accelerating data movement from one or more databases or files to another using extract, transform and load (ETL), messaging (message queues), and other capabilities. How are these “solutions” meeting the customers’ expectations? In a previous article, I mentioned that data movement increases costs (adds more complexity to the information management environment), information float or delays (whether batch or messaging), reduces semantic value (much semantic value is casted in the context of the existing applications), and significantly increases the opportunity for introducing information defects. Customers are realizing that these “solutions” are more focused on attacking the symptoms (e.g., moving data around faster) instead of attacking the root cause (e.g., keeping the information integrated in one place in the first place).

Trends driving Real-Time Data Access

Chris McAllister at TDWI lays out convincingly the reasons behind the need for real-time data access, and why these trends will peak in 2008. A very interesting read.

With a growing number of business users and activities dependent on real-time access to real-time information, it is nearly impossible to find a company or function that wouldn’t benefit from having accurate, up-to-date data. For equity markets and currency changes, account balances and user authentication, help desks, marketing promotions, supply chain, patient care, and sales and manufacturing, any organization can justify a demand for faster and more accurate information. Key trends will drive the demand for real-time data in 2008, including: standardization of low-latency data integration across disparate systems, stricter regulations and service level agreements (SLAs), heterogeneous IT environments, management and maintenance of very large database (VLDB) implementations, and globalization.

How to avoid a plane crash – Crunch Numbers

From the detailed Washington Post report

Pilots and executives at 16 other airlines have similar data-monitoring initiatives approved by the Federal Aviation Administration that are known as flight operations quality assurance programs. The carriers scour the flight data, which is often combined with pilot reports, to identify potential “precursors,” a buzzword in aviation circles used to describe events that often go unnoticed until they lead to an accident. The data are amazingly detailed — small onboard memory discs (not the “black boxes”) capture hundreds of parameters that include airspeed, pitch angles, engine temperatures and movements.

Such data initiatives have grown so extensive in recent years that the FAA has launched its own effort to mine the information in search of precursors. Seven carriers have signed on to the initiative, which began in October. The FAA, which already combs government safety databases looking for precursors, thinks the flight data will be a powerful tool when combined with other information, including pilot reports and radar plots.