FREE MOBILE CLOUD
COMPUTING CONCEPTS - TRAINING_MODULES_WITH_TONS_OF_VIDEOS
Cloud Mining is a new approach to applying Data Mining to customer data. This article gives a quick overview of Cloud Mining.
Data Mining is an established technique for analysing data in CRM, marketing and distribution. For example, it helps optimize customer interaction and reveals customers' buying potential and churn probability by applying statistical and mathematical methods to large amounts of data. Companies can thereby target their marketing efforts more precisely: they spend less and achieve better results.
Data Mining is practically only used by big companies
Data Mining can be useful with fewer than 10,000 customers. But since the relative cost per customer of applying it decreases as the number of customers grows, it is mostly big companies with millions of customers that use Data Mining right now.
For smaller companies, the high costs of personnel, hardware and software licensing just don't make it cost-effective. The main source of the costs is the Data Mining experts who are needed to prepare the data and understand the company's domain knowledge. They have to be hired or provided by external service providers (consulting companies). Yearly license fees that equal the cost of a Data Mining expert have to be paid as well. The hardware has to be powerful and causes considerable costs. That is why small companies usually can't afford Data Mining.
Reduce Data Mining costs by SaaS – Cloud Mining is born
The SaaS (Software-as-a-Service) distribution model helps to reduce costs by providing flexible license options and outsourcing the hardware effort. With SaaS, the software is not installed at the company; it runs on a software service provider's server. That means the provider deals with the hardware, looks after software updates and handles all technical maintenance.
In Cloud Mining, the servers that provide the software are the Cloud. This can be a public cloud from Google, Amazon.com etc., or a private cloud on the servers of a single provider. That has two main effects. On the one hand, the customer pays only for the Data Mining tools he needs, which saves a lot compared to complex Data Mining suites that he does not use to the full. On the other hand, he pays only for the costs generated by using the Cloud. He does not have to maintain a hardware infrastructure; he can apply Data Mining simply via his browser.
This reduces the barriers that keep small companies from benefiting from Data Mining.
Later we will come to the pros and cons of Cloud Mining, and present some companies that already provide this service.
[Figure: cloud data mining layout]
Business Intelligence is all about making better decisions from the data you have. However, all too often, the data you have is difficult for typical BI tools to process. These difficulties generally fall into two areas.
The data is too voluminous to be properly digested by your BI system.
The data records are messy, inconsistent and difficult to join together.
Each of these problems is commonplace, and relatively easy to solve in isolation. High‐volume datasets can
be mastered by simply (if expensively) throwing more hardware and software at the problem—larger servers, cluster licenses,
faster networks, bigger memories, faster disks, etc. Messy data can be cleansed with the appropriate use of script and SQL
logic to make records consistent and well‐defined.
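As a minimal sketch of the kind of script logic meant here, a cleansing step might normalize names, dates and amounts so that records from different sources become comparable. The field names and date formats below are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical date formats that different source systems might use.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")

def clean_date(raw):
    """Return the date as ISO text (YYYY-MM-DD), or None if unparseable."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean_record(record):
    """Normalize one raw record (a dict of strings) into a consistent shape."""
    return {
        # Collapse repeated whitespace and unify capitalization.
        "customer": " ".join(record["customer"].split()).title(),
        "date": clean_date(record["date"]),
        # Drop thousands separators before converting to a number.
        "amount": round(float(record["amount"].replace(",", "")), 2),
    }

raw = {"customer": "  aCME   corp ", "date": "31.12.2009", "amount": "1,250.50"}
cleaned = clean_record(raw)
```

On a small dataset a script like this is enough; the difficulty discussed next is running the same logic over volumes that no single machine handles comfortably.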
But what do you do when datasets are both large and messy? As we have learned from a recent project with a large financial institution, large amounts of data that are difficult to correlate can bring down even a state-of-the-art BI system.
But large volumes of messy data are a fact of life; indeed, they probably make up the bulk of the data in the enterprise. Add to that the complexity and cost of trying to tame them, and the business case for analyzing them is overwhelmed.
Enter the Cloud
One of the primary concepts in cloud computing is low-cost scalability: systems that can grow to handle larger volumes of users and data by adding more low-cost hardware. Google's entire infrastructure is built on this approach of distributing work out to thousands of inexpensive servers, instead of relying on centralized "supercomputers" to provide the horsepower.
The scalability strategy that Google uses is called MapReduce. The MapReduce
model provides a conceptual framework for dividing work up into small, manageable sets that can be distributed across 1 or
10 or 100 or 1000 or even 10000 servers, which can all work in parallel. This technology can be used with BI to meet the challenge
of large‐scale, messy data, but you can’t use Google’s infrastructure to run your own MapReduce system.
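The idea of dividing work into small sets that run in parallel can be sketched on a single machine. In this toy example threads stand in for servers, and the function names are illustrative only; it shows the principle, not how Google's or Hadoop's scheduler works.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, n_workers):
    """Divide items into at most n_workers roughly equal, contiguous chunks."""
    size = max(1, -(-len(items) // n_workers))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_words(lines):
    """The per-chunk work: count the words in a list of text lines."""
    return sum(len(line.split()) for line in lines)

def parallel_word_count(lines, n_workers):
    """Split the input, count each chunk independently, combine the results."""
    chunks = split_into_chunks(lines, n_workers)
    # Threads stand in for separate servers in this toy example.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return sum(partial_counts)
```

The result is the same whether the work is done by 1 worker or 10; only the way the work is divided changes, which is what makes this style of processing easy to scale out.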
Luckily, there’s Hadoop – an open source implementation of the Google MapReduce system.
Even though it's technically still in "Beta", Hadoop is in use at many large organizations, including:
Amazon
Yahoo
Facebook
Adobe
The New York Times
AOL
Twitter
Rackspace
Introducing Hadoop
In 2003 and 2004, Google published papers describing its Google File System (GFS) and MapReduce algorithms. Doug Cutting, a Yahoo employee and Open Source Evangelist, partnered with a friend to create Hadoop (named after his son's stuffed elephant), an open source implementation of GFS and MapReduce.
In essence, Hadoop is a software system that can store arbitrarily large amounts of data in a distributed file system and distribute that data to be worked on by an arbitrary number of workers, using MapReduce. Adding more storage or more workers is simply a matter of connecting new machines to the network; there is no need for larger devices, specialized disks or specialized networking.
The two main parts of Hadoop are:
HDFS
MapReduce
HDFS
HDFS (Hadoop Distributed File System) is a system for managing files that runs "on top of" standard computers and standard operating systems. When a file is loaded into HDFS, it is invisibly broken into large chunks, which are stored in multiple places (for redundancy) on the native file systems of the computers in the "cluster". The master "Name Node" keeps track of which machines hold which chunks.
There's no requirement that the disks be the same size or that the computers in the "cluster" be identical. When a file is retrieved from HDFS, the Name Node identifies the machines that hold its chunks, and the chunks are fetched from those machines, re-assembled and delivered to the caller. (This is a simplification of the actual process, which is a lot more technically sophisticated.)
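The chunk-and-replicate idea can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not how HDFS is implemented: the round-robin placement, the function names and the tiny block size are all invented for the example, and real HDFS uses its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Break a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(n_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Returns the kind of bookkeeping a Name Node keeps:
    block index -> list of node names holding a copy.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(n_blocks)
    }

def store(data, nodes, block_size, replication):
    """Split the data and copy every block to the nodes chosen for it."""
    blocks = split_into_blocks(data, block_size)
    placement = place_blocks(len(blocks), nodes, replication)
    storage = {node: {} for node in nodes}  # node -> {block index: bytes}
    for index, holders in placement.items():
        for node in holders:
            storage[node][index] = blocks[index]
    return placement, storage

def retrieve(placement, storage, failed=()):
    """Re-assemble the file in block order, skipping nodes listed in `failed`."""
    parts = []
    for index in sorted(placement):
        holder = next(n for n in placement[index] if n not in failed)
        parts.append(storage[holder][index])
    return b"".join(parts)
```

Because every block lives on more than one node, `retrieve` still returns the complete file when one node is listed as failed, which is the redundancy property described above.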
MapReduce
MapReduce is a three‐step process that provides a structure for analyzing data and manipulating it in a scalable
way. The three steps are:
Map
Shuffle/Sort
Reduce
Map
Raw data is translated, standardized or otherwise manipulated, usually in fairly "lightweight" ways. The output of the Map step is a key-value pair. The key is some sort of unique or nearly-unique identifier (in many cases, you want the same key to be used for multiple records, for grouping purposes), and the value is whatever data is needed later, in the Reduce step.
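A Map function can be sketched in a few lines. The input format ("region, product, amount") and the function name are hypothetical; the point is only that each raw record is standardized and emitted as a key-value pair, and that several records may share a key.

```python
def map_record(line):
    """Map step: turn one raw comma-separated line into a (key, value) pair.

    Hypothetical input format: "region, product, amount".
    The key is the normalized region, so every sale in a region shares a key.
    """
    region, _product, amount = (field.strip() for field in line.split(","))
    return region.upper(), float(amount)

raw_lines = [
    "north, widget, 10.0",
    " North ,gadget,5.5",
    "south, widget, 7.25",
]
pairs = [map_record(line) for line in raw_lines]
```

Note that the two differently spelled "north" records end up with the same key, which is what allows them to be grouped later.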
Shuffle/Sort
The "secret sauce" of Hadoop
is the distributed sort, where the records are all sorted by key (using either a default alphabetical sort or another comparator
of your choice). Once records are sorted by key, all the records with the same key are sent to the same Reduce processor ‐
essentially this represents a way to group data intelligently.
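The effect of the Shuffle/Sort step can be imitated on one machine as follows. This is a simplified sketch: Hadoop performs the sort in a distributed fashion, and the CRC-based partition function below is only a stand-in for a hash-based partitioner, chosen here because it gives the same answer on every run.

```python
from itertools import groupby
from operator import itemgetter
import zlib

def shuffle_sort(pairs):
    """Sort (key, value) pairs by key and collect the values for each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

def partition(key, n_reducers):
    """Pick the reducer for a key; the same key always maps to the same one."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers

grouped = shuffle_sort([("b", 1), ("a", 2), ("b", 3)])
```

After this step each key appears exactly once, together with all of its values, and `partition` decides which Reduce processor receives it.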
Reduce
In the last step, these groups of records with the same keys are handed
one‐by‐one to the Reduce task. Sometimes, the Reduce step will cache all of the records, so it can operate on
the entire group at one time (often to perform aggregations). Other translations and manipulations may occur here ‐
for example, the data might be output in a format that's easier to import into a database. Finally, the resulting data is
written back out to HDFS, and the job is done.
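A Reduce function for a simple aggregation might look like the sketch below. The grouped input and the tab-separated output format are assumptions made for the example; in a real job the output would be written to HDFS rather than kept in a list.

```python
def reduce_group(key, values):
    """Reduce step: aggregate all values that share one key."""
    values = list(values)  # cache the whole group so it can be aggregated
    total = sum(values)
    return key, len(values), total, total / len(values)

def to_tsv(row):
    """Format a result row so it is easy to bulk-load into a database."""
    key, count, total, mean = row
    return f"{key}\t{count}\t{total:.2f}\t{mean:.2f}"

# Hypothetical output of the Shuffle/Sort step: one entry per key.
grouped = [("NORTH", [10.0, 5.5]), ("SOUTH", [7.25])]
output_lines = [to_tsv(reduce_group(key, values)) for key, values in grouped]
```

Each output line holds the key, the record count, the sum and the average, separated by tabs.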
How Does This Help BI?
Imagine you have data where some of the dimensions are well defined, but others change over
time, in non‐trivial ways. Imagine that the data is in many different places, in many different formats, and you want
to create a "holistic" view of the data. Last, but not least, imagine that the size of the overall dataset is so
large that it will swamp the capabilities of your BI tool. How do you solve this problem?
The Traditional Way (Take a deep breath)
You can write translators for the different datasets ‐ but if the dataset is large, those translators will
take a long time to run. So you consider manually splitting these large datasets into smaller sets, but then, of course, you
have to get the data onto all of the computers, get the scripts running, and the computers need to have enough storage for
the subset of the data.
Once you clean the data, you have to rejoin the cleansed data together again from the
multiple machines, and if you need to sort the data to help you aggregate it, you're going to have to find a sort solution
that works on the huge volumes of data that you're dealing with. Odds are, you'll have to sort smaller subsets of the overall
dataset, and then find ways to merge the subsets back together.
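The sort-then-merge step mentioned above can be sketched with the standard library. Here the subsets are slices of one in-memory list; in the scenario described they would be files spread over several machines, which is where most of the real effort goes.

```python
import heapq

def sort_in_subsets(records, subset_size):
    """Sort a dataset piece by piece, then merge the sorted pieces."""
    subsets = [sorted(records[i:i + subset_size])
               for i in range(0, len(records), subset_size)]
    # heapq.merge lazily combines already-sorted inputs into one sorted stream.
    return list(heapq.merge(*subsets))

merged = sort_in_subsets([5, 3, 9, 1, 4, 8, 2], 3)
```

The merge itself is short; the hard part in practice is moving the subsets between machines and keeping track of them.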
Even then, you still haven't dealt with the fact that you need to aggregate the data, so you have to write more scripts, divide the data into subsets again, and make sure all the records that belong in a group end up on the same machine.
Every step in the above scenario is error-prone, complex, difficult to predict and, if you have to do this process on a regular basis, probably maddeningly tedious.
Hadoop to the Rescue
Instead, consider this option:
Load all the datasets into HDFS. Hadoop will take care of how to partition the data and where to put it, and it also handles redundancy. (In other words, you don't need to use RAID in your Hadoop solution.)
Write Map jobs that take the data from each of the formats and clean it, organizing the data into a general format.
Specify how to sort the data to properly group it.
Write Reduce jobs to aggregate the data, averaging and summing columns as needed, and then output the final aggregated data in a SQL-friendly format.
Run the Map/Sort/Reduce process on the data.
At the end of this process, pull the aggregated data out of HDFS and load it into your BI system.
You'll need to perform some quality checking on the output, but if the steps have been done properly, you have a
repeatable, scalable process for generating aggregated data, with minimal manual intervention.
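The whole sequence can be imitated in memory to show how the pieces fit together. This is a single-machine sketch, not a Hadoop job: the two source formats, the function names and the CSV output layout are all hypothetical.

```python
import csv
import io
from collections import defaultdict

def map_csv(line):
    """Source A (hypothetical format): 'customer,amount'."""
    customer, amount = line.split(",")
    return customer.strip().lower(), float(amount)

def map_pipe(line):
    """Source B (hypothetical format): 'amount|CUSTOMER'."""
    amount, customer = line.split("|")
    return customer.strip().lower(), float(amount)

def run_job(sources):
    """sources: list of (map function, lines). Returns the result as CSV text."""
    # Map: clean every record into the general (key, value) format.
    pairs = [mapper(line) for mapper, lines in sources for line in lines]
    # Shuffle/Sort: bring all values with the same key together.
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    # Reduce: sum and average per key, written as SQL-friendly CSV.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["customer", "total", "average"])
    for key, values in groups.items():
        total = sum(values)
        writer.writerow([key, f"{total:.2f}", f"{total / len(values):.2f}"])
    return out.getvalue()

report = run_job([
    (map_csv, ["Alice, 10.00", "bob,4.50"]),
    (map_pipe, ["2.50|ALICE", "1.50|Bob"]),
])
```

Each source gets its own Map function, but everything after the Map step is shared, which is what makes the process repeatable when a new format is added.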
Some Current Uses of Hadoop:
Product Search Index generation
Data Aggregation & Rollup
Data mining for ad targeting
ETL
Analyzing & Storing Logs
Data Analytics
RDF
Indexing
Where Hadoop Doesn't Fit
Hadoop is a tool like any other, and it is not applicable to every problem.
Some areas where Hadoop is not the right solution:
Highly Interdependent Data
Hadoop is not well suited for data where
each record is heavily dependent on a number of other records. For example, consider weather forecasting ‐ predicting
what's going to happen to a storm front over time requires a view of pretty much the entire dataset at once. This is the realm
for supercomputers.
Ad‐Hoc, "Casual" BI
Hadoop provides a framework for data analysis, but it usually requires a fairly sophisticated user to create queries and aggregations. And, of course, there's little-to-no support for visualizations, etc. Hadoop's higher-level query sub-projects, such as Hive (SQL-like queries) and Pig (a data-flow scripting language), help mitigate some of this, but casual reporting is still somewhat difficult.
Real‐time Processing
Hadoop is designed to trade off startup speed for scalability and parallelism. In other words, Hadoop is more like a locomotive than a sports car: it takes a fairly long time to get everything set up, but once it's moving, it's doing a lot of work.
Dependencies on Other Systems
If you set up a 10,000-node Hadoop cluster to process a huge dataset, and one of the steps involves a query to a database on an old computer in a dusty corner of the datacenter, that database server is going to be a bottleneck for the entire cluster. Hadoop jobs work best when they have few (or, in the best case, no) dependencies on external systems. There are various tricks and strategies that can mitigate this problem, but in general, remove as much external dependency as possible from your Hadoop jobs.
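One strategy of this kind is to fetch the reference data once and do all lookups locally, instead of querying the external system for every record. The text above does not name specific techniques, so this is offered only as an illustrative assumption; the class and function names are invented for the sketch.

```python
class SlowDatabase:
    """Stand-in for the external system; counts how often it is queried."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.queries = 0

    def lookup(self, key):
        """Fetch a single value: one query per call."""
        self.queries += 1
        return self._rows.get(key)

    def dump(self):
        """Fetch the whole table in a single query."""
        self.queries += 1
        return dict(self._rows)

def enrich_naive(records, lookup_remote):
    """One remote query per record: the database is hit len(records) times."""
    return [(r, lookup_remote(r)) for r in records]

def enrich_preloaded(records, load_all_remote):
    """Fetch the reference data once, then do every lookup in local memory."""
    table = load_all_remote()  # a single call to the external system
    return [(r, table.get(r)) for r in records]
```

Both functions return the same result, but the second queries the external system once regardless of how many records are processed. It only works when the reference table is small enough to fit in memory on each worker.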
Conclusions
In terms of Business Intelligence, Hadoop is a tool that makes the ETL process easier, and
can bring the size and quality of data under control. It provides low‐cost scalability, a reliable and easily expandable
file system, and a framework for dramatically increasing the scope and robustness of your data‐mining and data‐analysis
business strategies. In situations where your data is too large, too messy or both, Hadoop can help you get it under control,
and focus on your business, instead of focusing on one‐off IT infrastructure data analysis projects.