Category Archives: Data Mining

Posts on data mining.

101 – Data Mining and Predictive Analytics

In today’s world of mining text, Web, and media content (unstructured data) alongside structured data, “information mining” is a more appropriate label. By mining a combination of these sources, companies can make the best use of structured data, unstructured text, and social media. The static, stagnant predictive models of the past don’t work well in the world we live in today. Predictive analytics should be agile enough to adapt to, and monetize, quickly changing customer behaviors, which are often identified online and through social networks.

Better integration of data mining software with source data at one end and with information-consumption software at the other has brought predictive analytics closer to day-to-day business. Even though there haven’t been significant advances in predictive algorithms, the ability to apply large data sets to models, and to enable better interaction with the business, has improved the overall outcome of the exercise.

There is a great introduction to the world of data mining and predictive analytics here.

The Jargon of the Novel, Computed

Scholars in the growing field of digital humanities can tackle this question by analyzing enormous numbers of texts at once. When books and other written documents are gathered into an electronic corpus, one “subcorpus” can be compared with another: all the digitized fiction, for instance, can be stacked up against other genres of writing, like news reports, academic papers or blog posts.

One such research enterprise is the Corpus of Contemporary American English, or COCA, which brings together 425 million words of text from the past two decades, with equally large samples drawn from fiction, popular magazines, newspapers, academic texts and transcripts of spoken English. The fiction samples cover short stories and plays in literary magazines, along with the first chapters of hundreds of novels from major publishers. The compiler of COCA, Mark Davies at Brigham Young University, has designed a freely available online interface that can respond to queries about how contemporary language is used. Even grammatical questions are fair game, since every word in the corpus has been tagged with a part of speech.
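As a rough illustration of how such subcorpus comparisons work, here is a minimal sketch in Python. The tiny tagged corpora, the tag set, and the helper function are all invented for illustration; they only mimic the idea of querying a part-of-speech-tagged corpus like COCA, not its actual interface.

```python
# Toy stand-in for a tagged corpus: each subcorpus is a list of
# (word, part_of_speech) pairs, mimicking how every token in COCA
# carries a POS tag. The texts and tags here are invented.
fiction = [("she", "PRP"), ("whispered", "VBD"), ("softly", "RB"),
           ("the", "DT"), ("dark", "JJ"), ("whispered", "VBD")]
news = [("officials", "NNS"), ("said", "VBD"), ("the", "DT"),
        ("report", "NN"), ("said", "VBD"), ("today", "NN")]

def freq_per_million(subcorpus, word, pos=None):
    """Frequency of `word` (optionally restricted to one POS tag),
    normalized per million tokens so subcorpora of different sizes
    can be stacked up against each other directly."""
    hits = sum(1 for w, p in subcorpus
               if w == word and (pos is None or p == pos))
    return hits / len(subcorpus) * 1_000_000

# Fiction leans on "whispered"; news reporting prefers "said".
print(freq_per_million(fiction, "whispered", pos="VBD"))
print(freq_per_million(news, "said", pos="VBD"))
```

Because every token is tagged, a query can distinguish, say, "run" the verb from "run" the noun, which is what makes grammatical questions fair game.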


Microsoft Unveils Apps for Crime-Fighting Data Mining

Once again, software is fighting crime. Microsoft unveiled a suite of tools and initiatives for law-enforcement groups “specifically designed to improve public security and safety,” the company said.
It’s also the latest example of law enforcement officials arming themselves with better technology to help fight crime. The FBI, for instance, said that new database and data-sharing efforts have resulted in solving a number of difficult highway serial killings.

Gathering that data is key. That’s why Microsoft this week said it is giving INTERPOL a free tool called the Computer Online Forensic Evidence Extractor (COFEE), an application that “uses common digital forensics tools to help officers at the scene of the crime.”

The company is working on a mobile version for future release, Richard Domingues Boscovich, senior attorney for Microsoft’s Internet security program, said in an e-mail.

A larger tool set for large-scale crimes is the Microsoft Intelligence Framework, which is aimed at helping intelligence and law enforcement agencies coordinate information to detect and prevent terrorism, and to solve organized and major crime cases. The framework offers tools for storing and analyzing evidence and information across a variety of sources.

From EarthWeb article.

SPSS Rebrands Its Analytical Offerings

The new version of the SPSS modeling product — the erstwhile Clementine — is now known as PASW Modeler 13; its text analysis product (formerly Text Mining for Clementine) is now PASW Text Analytics 13. SPSS says that, over the course of the year, the rest of the SPSS product line will update under the PASW umbrella — including Statistics and Data Collection.

David Vergara, director of product marketing for SPSS, explains that the change was intended to help customers and prospects understand what the products do and how each offering fits within the broader portfolio.

Aside from the name change, the new versions of SPSS products focus on usability — and not just for data experts. Wettemann says that SPSS has “recognized that moving beyond the data analyst audience is where you get the real power.” PASW Modeler 13 features a drag-and-drop interface and functionality that will appeal to business users. Two integral updates are a “comments” tool, which lets users flag notes within the software, and automated data preparation. Automating data preparation mitigates human error and avoids common data quality issues.

From Destination CRM.

Data Mining Moves to HR

For most of its eight-year history, Cataphora has focused on digital sleuthing. The company hunts for statistical signs of fraud. But in the past few years, Cataphora has been dispatching its data miners into a new market: statistical studies of employee performance.

The trend, though early, is unmistakable, and it extends far beyond Redwood City. Number crunching, a staple for decades in the quantifiable domains of engineering and finance, has spread in recent years into marketing and sales. Companies can now model and optimize operations, and can calculate the return on investment on everything from corporate jets to Super Bowl ads. These successes have led to the next math project: the worker. “You have to bring the same rigor you bring to operations and finance to the analysis of people,” says Rupert Bader, director of workforce planning at Microsoft (MSFT).

Such a mission might have been laughable a decade ago. But as the role of computers in the workplace expands, employees leave digital trails detailing their behavior, their schedule, their interests, and expertise. For executives to calculate the return on investment of each worker, their human resources departments are starting to open their doors to the quants.

From Business Week, an insightful article on how HR departments use data mining and analytics to determine the value of each employee.

What your cellphone knows about you – Reality Mining

Here’s a follow-up on Reality Mining and Surprise Modelling, which have been named among the 10 technologies most likely to change the way we live.

Read more from an interview with Sandy Pentland, director of MIT’s Human Dynamics Research program. What is “reality mining?”

Sandy Pentland: Reality mining is about using sensors to understand human beings. The sensors could be security cameras, they could be devices that you wear on yourself, they could be cell phones. The point is it’s about people. Data mining is about finding patterns in digital stuff. I’m more interested specifically in finding patterns in humans. I’m taking data mining out into the real world.

What kind of reality-mining experiments have you actually performed?

We developed this thing called a sociometer, a little badge that you wear around your neck that records your body language, your motion and your tone of voice–the tone, not the words. It gives us a nice little package for reality mining.

We’ve done all sorts of interesting things with this. Just by listening to people’s tones of voice and how they move, we can measure interest level and attention, factors that account for 40% of the variation in the outcomes of things like salary negotiations, dating scenarios, closing a sale, and pitching a business plan.

Microsoft Sets Sights on Data Mining Dominance

“[We don't] have all the functionality of something like a SAS or an SPSS, because that’s just not our market,” he concedes. It comes down to a difference of scale, Farmer argues: SAS and SPSS typically target larger, more expensive deployments, with users who are well versed in those tools. Microsoft is targeting a different kind of data mining consumer: the Excel analyst, for example, who might not have much (if any) experience with data mining, predictive analytics, or statistical analysis for that matter.

“By the way, I don’t mean to say we can’t hit the high end. Within Microsoft, we have our own database marketing team. We’re one of the largest companies in the world. We have a huge database marketing team who do classic customer analysis. These guys were all SAS users, but when they joined Microsoft, they started using our tools. The entire process runs on our database, and they actually use the Excel [data mining] add-ins to do it. It’s not that there’s nothing they miss, [it's that] they are able to achieve the same business results using our tools.”

Last year, Microsoft released a data mining and predictive analytics add-on for its Excel 2007 product. The add-on, which is similar to Microsoft’s well-known SQL Server BI Accelerator products, integrates natively with Excel 2007. It introduces a new “Data Mining” tab that exposes several pre-built functions, including forecasting, accuracy charting, cross-validation, exception highlighting, category detection, key influencers, shopping basket analysis (the last is a SQL Server 2008-only function), and many others.
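The shopping basket analysis exposed by such a tool is classic association-rule mining: find item pairs that co-occur in many transactions and measure how strongly one item predicts the other. Here is a minimal sketch of that idea, with invented transaction data; this is not the add-in’s actual implementation.

```python
from itertools import combinations
from collections import Counter

# Toy transactions; in a spreadsheet tool these would be rows of
# purchase data, with the shopping-basket function doing the
# equivalent co-occurrence counting behind the scenes.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "milk"},
]

def pair_rules(baskets, min_support=0.5):
    """Support and confidence for each frequent item pair:
    support    = fraction of baskets containing both items,
    confidence = P(B in basket | A in basket)."""
    n = len(baskets)
    item_counts = Counter(i for b in baskets for i in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2))
    rules = []
    for (a, b), c in pair_counts.items():
        support = c / n
        if support >= min_support:
            rules.append((a, b, support, c / item_counts[a]))
    return rules

for a, b, sup, conf in pair_rules(baskets):
    print(f"{a} -> {b}: support={sup:.2f}, confidence={conf:.2f}")
```

A rule like “bread -> butter” with high confidence is the kind of result a business user would read straight off the worksheet, no statistics background required.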

From an article on ESJ.

Data Mining Prescribed To Ensure Drug Safety

From Info Week -

This week, WellPoint — one of the nation’s largest health insurers — revealed it’s investing millions of dollars in a three-year project to build such a drug surveillance system in collaboration with the FDA and several academic institutions, including Harvard University, the University of Pennsylvania, and the University of North Carolina. The Safety Sentinel System will mine and analyze aggregate claims, lab, and pharmaceutical data from WellPoint’s 35 million members, who generate 1.4 billion “claim lines” of data each year, said Marcus Wilson, president of HealthCore, WellPoint’s medical outcomes research subsidiary, which WellPoint acquired in 2003 and which is overseeing the new project.

MS in Verticals – Buys Predictive Analytics company, Farecast

Seattle Pi’s Venture Blog has the full story from start to finish.

Farecast was started by University of Washington computer scientist Oren Etzioni, initially bankrolled by Madrona, built with people from local companies such as Alaska Airlines and AdRelevance and, ultimately, acquired by Microsoft.

Though Farecast had multiple bidders, McIlwain said Microsoft was a good fit since the two companies had worked together in the past and had a similar vision for online search. The proximity of the two companies also played a part, he said.

The acquisition follows the merger of Kayak and SideStep, the market leader in next-generation travel search. That deal led to new opportunities for Farecast, including discussions with Microsoft that heated up in the past 90 days.

“That consolidation presented opportunities for Farecast … partly differentiated because of their predictive capabilities but also because of who they might have been able to align with in the industry to be a strong and differentiated number two, hoping some day to overtake and become number one,” he said.

Madrona has produced a number of hits recently, with the sales of ShareBuilder, World Wide Packets and iConclude.

Also, a quick analysis from The Motley Fool on this buy -

Microsoft needs more deals like this one, especially if the Microhoo deal comes undone, and the software giant has the means to go shopping. I’ve suggested that Microsoft pursue potential buyout candidates like The Knot (Nasdaq: KNOT) and Bankrate (Nasdaq: RATE) for the same reason that Farecast works. Whether it’s wedding planning, home refinancing, or booking that flight to visit your parents in Chicago, this is the quality traffic that Microsoft and Yahoo! lack right now.

Stanford students working on Netflix Algorithms

Anand Rajaraman, the co-founder of Kosmix, also teaches Data Mining at Stanford. Here’s an interesting note from his blog.

Some of his students are working to crack the ongoing Netflix “Better Recommendation Logic” prize of $1 million. Read it!

Here’s how the competition works. Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. The first team to beat the accuracy of Netflix’s proprietary algorithm by a certain margin wins a prize of $1 million!

Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?
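To make the setup concrete, here is a minimal sketch of the kind of simple predictor a team like Team B might start from: the global mean rating plus per-user and per-movie offsets, scored by root-mean-square prediction error on held-out ratings. The data and function names are invented for illustration; the real Netflix set is vastly larger, and the competitive algorithms far more elaborate.

```python
from collections import defaultdict
from math import sqrt

# Toy (user, movie, rating) triples standing in for the Netflix data.
train = [("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 4),
         ("u2", "m3", 2), ("u3", "m2", 4), ("u3", "m3", 3)]
held_out = [("u1", "m3", 4), ("u2", "m2", 2)]

def fit_baseline(ratings):
    """Global mean plus average per-user and per-movie deviations:
    the simple baseline that fancier entries must beat."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    user_dev, movie_dev = defaultdict(list), defaultdict(list)
    for u, m, r in ratings:
        user_dev[u].append(r - mu)
        movie_dev[m].append(r - mu)
    b_u = {u: sum(d) / len(d) for u, d in user_dev.items()}
    b_m = {m: sum(d) / len(d) for m, d in movie_dev.items()}
    return mu, b_u, b_m

def predict(model, user, movie):
    mu, b_u, b_m = model
    return mu + b_u.get(user, 0.0) + b_m.get(movie, 0.0)

def rmse(model, ratings):
    se = sum((predict(model, u, m) - r) ** 2 for u, m, r in ratings)
    return sqrt(se / len(ratings))

model = fit_baseline(train)
print(round(rmse(model, held_out), 3))
```

Team B’s lesson maps directly onto this sketch: rather than making `fit_baseline` more sophisticated, joining in an outside signal such as IMDB genre data can move the error more than algorithmic cleverness alone.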