Archive for the ‘Data Mining’ tag
The Petabyte BI World - Wired

Sensors everywhere. Infinite storage. Clouds of processors. Our ability to capture, warehouse, and understand massive amounts of data is changing science, medicine, business, and technology. As our collection of facts and figures grows, so will the opportunity to find answers to fundamental questions. Because in the era of big data, more isn’t just more. More is different.
This month’s Wired magazine covers one of the scientific community’s fastest-growing concerns: the uncontrollable growth of data. This growth, in many directions at once, is nearly killing theory as everything becomes more and more data-driven.

There is a series of articles, ranging from what data miners are digging into today, to elaborate algorithms that predict air ticket prices, to how we can monitor epidemics hour by hour.
Whether or not you are a BI enthusiast, this month’s Wired cover story will challenge all your predictions about science and technology, even if you have a petabyte of data to support them!! Read it, like, right now!!
From Text Analytics to Data Warehousing
I liked Seth Grimes’s recent article on text analytics accuracy. His article today on Intelligent Enterprise pointed me to an IBM article on IBM® OmniFind™ Analytics Edition, which discusses in detail how to extract unstructured data from e-mail, Web pages, news and blog articles and build a data warehouse out of it to unlock huge, previously untapped potential.
In recent weeks and months, the focus on unstructured data has been growing as businesses and vendors start to understand its power and how it can be text mined and used to the benefit of enterprises. And it’s a good thing.
A must read. Highly Recommended.

Text analytics enables you to extract more business value from unstructured data such as emails, customer relationship management (CRM) records, office documents, or any text-based data. IBM® OmniFind™ Analytics Edition provides rich text analysis capabilities and interactive visualization to enable you to find patterns and trends hidden in large quantities of unstructured information. The text analysis results from OmniFind Analytics Edition are in XML-format and can also be stored, indexed, and queried in a DB2 database. This allows you to incorporate your text analysis results into existing business applications and reporting tools by using regular SQL or SQL/XML queries. This article provides an overview of text analytics with OmniFind Analytics Edition and describes several ways of bringing its analysis results into DB2, in relational or pureXML™ format.
..
..
OmniFind Analytics Edition provides the ability to interactively explore and mine the results of text analysis, as well as structured data that is typically associated with unstructured text. For those of you familiar with business intelligence applications, you can think of it as content-centric business intelligence, in that it aggregates the results of text analysis to detect frequencies, correlations, and trends. Typical use cases include:
Analysis of customer contact information (e-mails, chats, problem tickets, contact center notes) for insight into quality or satisfaction issues
Analysis of blogs and wikis for reputation monitoring
Analysis of internal e-mail for compliance violations or for expertise location
Microsoft Sets Sights on Data Mining Dominance
“[We don't] have all the functionality of something like a SAS or an SPSS, because that’s just not our market,” he concedes. It comes down to a difference of scale, Farmer argues: SAS and SPSS typically target larger, more expensive deployments — typically with users well-versed in the usage of their tools. Microsoft is targeting a different kind of data mining consumer: the Excel analyst, for example, who might not have much (if any) experience — with data mining, predictive analytics, or statistical analysis for that matter.
“By the way, I don’t mean to say we can’t hit the high-end. Within Microsoft, we have our own database marketing team. We’re one of the largest companies in the world. We have a huge database marketing team who do classic customer analysis. These guys were all SAS users, but when they joined Microsoft, they started using our tools. The entire process runs on our database, they actually use the Excel [data mining] add-ins to do it. It’s not that there’s nothing they don’t miss, [it's that] they are able to achieve the same business results using our tools.”
Last year, Microsoft released a data mining and predictive analytic add-on for its Excel 2007 product (see http://www.microsoft.com/downloads/details.aspx?FamilyId=7c76e8df-8674-4c3b-a99b-55b17f3c4c51&DisplayLang=en). The add-on, which is similar to Microsoft’s well-known SQL Server BI Accelerator products, integrates natively with Excel 2007. It introduces a new “Data Mining” tab that exposes several pre-built functions, including forecasting, accuracy charting, cross-validation, exception highlighting, category detection, key influencers, shopping basket analysis (the last is a SQL Server 2008-only function) and many others.
From an article on ESJ.
Data Mining Prescribed To Ensure Drug Safety
From Info Week -
This week, WellPoint, one of the nation’s largest health insurers, revealed it’s investing millions of dollars in a three-year project to build such a drug surveillance system in collaboration with the FDA and several academic institutions, including Harvard University, University of Pennsylvania, and the University of North Carolina. The Safety Sentinel System will mine and analyze aggregate claims, lab, and pharmaceutical data from WellPoint’s 35 million members, who generate 1.4 billion “claim lines” of data each year, said Marcus Wilson, president of HealthCore, WellPoint’s medical outcomes research subsidiary, which WellPoint acquired in 2003 and is overseeing the new project.
MS in Verticals - Buys Predictive Analytics company, Farecast
The Seattle P-I’s Venture Blog has the full story from start to finish.
Farecast was started by University of Washington computer scientist Oren Etzioni, initially bankrolled by Madrona, built with people from local companies such as Alaska Airlines and AdRelevance and, ultimately, acquired by Microsoft.
Though Farecast had multiple bidders, McIlwain said Microsoft was a good fit since the two companies had worked together in the past and had a similar vision for online search. The proximity of the two companies also played a part, he said.
The acquisition follows the merger of Kayak.com and SideStep, the market leader in next generation travel search. That deal led to new opportunities for Farecast, including discussions with Microsoft which heated up in the past 90 days.
“That consolidation presented opportunities for Farecast … partly differentiated because of their predictive capabilities but also because of who they might have been able to align with in the industry to be a strong and differentiated number two, hoping some day to overtake and become number one,” he said.
Madrona has produced a number of hits recently, with the sales of ShareBuilder, World Wide Packets and iConclude.
Also, a quick analysis from the Motley Fool on this buy -
Microsoft needs more deals like this one, especially if the Microhoo deal comes undone, and the software giant has the means to go shopping. I’ve suggested that Microsoft pursue potential buyout candidates like The Knot (Nasdaq: KNOT) and Bankrate (Nasdaq: RATE) for the same reason that Farecast works. Whether it’s wedding planning, home refinancing, or booking that flight to visit your parents in Chicago, this is the quality traffic that Microsoft and Yahoo! lack right now.
Stanford students working on Netflix Algorithms
Anand Rajaraman, the co-founder of Kosmix, also teaches data mining at Stanford. Here’s an interesting note from his blog.
Some of his students are working on algorithms for the ongoing Netflix “Better Recommendation Logic” Prize of $1 million. Read it!!
Here’s how the competition works. Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. The first team to beat the accuracy of Netflix’s proprietary algorithm by a certain margin wins a prize of $1 million!
Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?
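To make the prediction task concrete, here is a minimal sketch of a rating predictor and the error measure used to score it. It is not Netflix’s proprietary algorithm or either student team’s method. It is a plain “global mean plus movie bias plus user bias” baseline, scored with root mean squared error (RMSE), which is the accuracy measure the Netflix Prize uses. The function names and the tiny data set are made up for illustration.

```python
from collections import defaultdict
from math import sqrt


def fit_baseline(ratings):
    """Fit a bias baseline from (user, movie, stars) triples.

    Returns (global_mean, user_bias, movie_bias).
    """
    mean = sum(stars for _, _, stars in ratings) / len(ratings)

    # Movie bias: how far a movie's ratings sit from the global mean.
    movie_dev = defaultdict(list)
    for _, movie, stars in ratings:
        movie_dev[movie].append(stars - mean)
    movie_bias = {m: sum(d) / len(d) for m, d in movie_dev.items()}

    # User bias: what is left over after removing mean and movie bias.
    user_dev = defaultdict(list)
    for user, movie, stars in ratings:
        user_dev[user].append(stars - mean - movie_bias[movie])
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}

    return mean, user_bias, movie_bias


def predict(model, user, movie):
    """Predict a rating, clamped to the 1-5 star scale."""
    mean, user_bias, movie_bias = model
    raw = mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)
    return min(5.0, max(1.0, raw))


def rmse(model, held_out):
    """Root mean squared error over held-out (user, movie, stars) triples."""
    errors = [(predict(model, u, m) - stars) ** 2 for u, m, stars in held_out]
    return sqrt(sum(errors) / len(errors))


# Hypothetical toy data: three users, three movies.
train = [
    ("ann", "Alien", 5), ("ann", "Heat", 4),
    ("bob", "Alien", 3), ("bob", "Up", 2),
    ("cat", "Heat", 5), ("cat", "Up", 4),
]
held_out = [("ann", "Up", 4), ("bob", "Heat", 3), ("cat", "Alien", 5)]

model = fit_baseline(train)
print("held-out RMSE:", round(rmse(model, held_out), 3))
```

Extra data of the kind Team B used, such as genres, would enter a model like this as additional features, for example a per-user bias for each genre.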
Baseball Association Analyzes Statistics with Cognos
It’s interesting that more and more sports associations are starting to use Business Intelligence software to analyze statistics. As Ian Ayres points out in his latest book, Super Crunchers, the competition between traditional experts and number-crunching software has ended. And number-crunching software is increasingly being used by traditional “intuitive” experts to analyze the data better.
These are clearly the days of data mining software. This one is about IBM Cognos. Read more -
“Our analysis of player performance is as complex and dynamic as the work of high-powered business analysts in Fortune 500 companies, and we need to use the same robust, flexible interface to achieve reliable results,” said Doyle Pryor, Assistant General Counsel of the MLBPA. “Conducting complex analysis in real-time allows us to improve our planning processes and IBM Cognos TM1 Executive Viewer enables the agents themselves to view reports and perform almost limitless ‘what-if’ scenarios for further analysis of the data.”
“The interface for analysis will provide sophisticated users with the tools they’re familiar with and the ability to quickly modify views and reports with as little effort as possible,” said Doug Barton, vice president, product marketing, Cognos, an IBM Company. “Users of IBM Cognos TM1 Executive Viewer continue to gravitate to its features that provide interactivity, immediacy, and flexibility, which, in turn, enable them to accelerate the management of their business’s performance.”
Data Visualization Helps Panoratio Data Mining Users
From the Press Release -
Panoratio, a provider of innovative technology that maps statistical content from large and complex datasets, has selected OpenViz data visualization software from Advanced Visual Systems (AVS) to be incorporated into its Data Explorer product.
Panoratio uses OpenViz to provide highly interactive and graphical displays of dense imagery in near-real time, with virtually no restrictions on the complexity or amount of data that can be analyzed.
Panoratio’s Data Explorer is a smart data analysis tool that rapidly queries Panoratio Portable Database Images and delivers results in seconds with built-in intelligence that assists analysts in finding patterns and relationships in the data which they might not otherwise discover.
According to Dr. Oliver Mihatsch, Chief Technology Officer of Panoratio, “We selected OpenViz because it was by far the most flexible data visualization system and was more-than-able to meet the real-time demands of our high performance data mining technology.”
Independent software makers such as Panoratio use OpenViz to serve as an embedded graphics platform for interactive analytics and data visualization. Designed to overcome the limitations of static charting packages, OpenViz enables application designers and product managers to create high performance solutions from extremely complex data, algorithms and integrated corporate content.
Deepest Data Mining
The New York Times ran a story last week about an online data mining company called Phorm. While the data that the company mines is controversial, the company is starting to be talked about in the industry. Read more about Phorm at NYT -
Amid debate over how much data companies like Google and Yahoo should gather about people who surf the Web, one new company is drawing attention — and controversy — by boasting that it will collect the most complete information of all.
The company, called Phorm, has created a tool that can track every single online action of a given consumer, based on data from that person’s Internet service provider. The trick for Phorm is to gain access to that data, and it is trying to negotiate deals with telephone and cable companies, like AT&T, Verizon and Comcast, that provide broadband service to millions.
Phorm’s pitch to these companies is that its software can give them a new stream of revenue from advertising. Using Phorm’s comprehensive views of individuals, the companies can help advertisers show different ads to people based on their interests.
Reality Mining and Surprise Modeling - Future Tech
Reading this Technology Review article, it seems inevitable that such advanced mining technologies will pop up in the near future. The world has a wealth of information and every single thing will be data mined in the future. And what a movement that will be.
By the way, MIT Technology Review names Reality Mining as one of the 10 technologies it thinks are most likely to change the way we live. Exciting, ain’t it?
Surprise Modeling, which combines data mining and machine learning to help people do a better job of anticipating and coping with unusual events, is also on MIT Technology Review’s top 10 list. It is being advocated by Eric Horvitz of Microsoft Research.
From the article on Reality Mining -
Reality mining, he says, “is all about paying attention to patterns in life and using that information to help [with] things like setting privacy patterns, sharing things with people, notifying people–basically, to help you live your life.”
Within the next few years, Pentland predicts, reality mining will become more common, thanks in part to the proliferation and increasing sophistication of cell phones. Many handheld devices now have the processing power of low-end desktop computers, and they can also collect more varied data, thanks to devices such as GPS chips that track location. And researchers such as Pentland are getting better at making sense of all that information.
To create an accurate model of a person’s social network, for example, Pentland’s team combines a phone’s call logs with information about its proximity to other people’s devices, which is continuously collected by Bluetooth sensors. With the help of factor analysis, a statistical technique commonly used in the social sciences to explain correlations among multiple variables, the team identifies patterns in the data and translates them into maps of social relationships. Such maps could be used, for instance, to accurately categorize the people in your address book as friends, family members, acquaintances, or coworkers. In turn, this information could be used to automatically establish privacy settings–for instance, allowing only your family to view your schedule. With location data added in, the phone could predict when you would be near someone in your network. In a paper published last May, Pentland and his group showed that cell-phone data enabled them to accurately model the social networks of about 100 MIT students and professors. They could also precisely predict where subjects would meet with members of their networks on any given day of the week.
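As a rough illustration of the idea in the excerpt above, here is a small sketch that combines Bluetooth proximity sightings with a call log and sorts contacts into categories. It is not Pentland’s method: the article describes factor analysis, while this sketch swaps in a much simpler hand-written rule. The work-hours window, the thresholds, the category names and the sample data are all assumptions made up for the example.

```python
from collections import Counter

WORK_HOURS = range(9, 18)  # assumption: 09:00-17:59 on weekdays counts as "work"


def summarize(proximity_events, call_log):
    """Build per-contact features.

    proximity_events: (contact, weekday 0-6, hour 0-23) Bluetooth sightings.
    call_log: list of contact names, one entry per call.
    """
    work, off, calls = Counter(), Counter(), Counter(call_log)
    for contact, weekday, hour in proximity_events:
        if weekday < 5 and hour in WORK_HOURS:
            work[contact] += 1
        else:
            off[contact] += 1
    contacts = set(work) | set(off) | set(calls)
    return {c: {"work": work[c], "off": off[c], "calls": calls[c]} for c in contacts}


def label(features, min_events=3, work_share=0.7):
    """Hand-written rule standing in for the statistical model (thresholds are made up)."""
    sightings = features["work"] + features["off"]
    if sightings + features["calls"] < min_events:
        return "acquaintance"
    if sightings and features["work"] / sightings >= work_share:
        return "coworker"
    return "friend or family"


# Hypothetical sample data.
events = [
    ("dana", 1, 10), ("dana", 2, 14), ("dana", 3, 11),  # weekday office hours
    ("eli", 5, 20), ("eli", 6, 13), ("eli", 2, 21),     # weekend and evening
]
calls = ["eli", "eli", "fay"]

labels = {c: label(f) for c, f in summarize(events, calls).items()}
print(labels)
```

A privacy setting like the one the article mentions could then be a simple lookup on these labels, for example showing your schedule only to contacts in a chosen category.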