


Cloud Mining is a new approach to applying Data Mining to customer data. This article gives a quick overview of it.

Data Mining is an established technique for analysing data in CRM, marketing and distribution. For example, it helps optimize customer interaction and reveals customers' buying potential and churn probability by applying statistical and mathematical methods to large amounts of data. Companies can thereby target their marketing more precisely: they spend less and achieve better results.

Data Mining is practically only used by big companies

Data Mining can be useful with fewer than 10,000 customers. But because the cost per customer of applying it falls as the number of customers grows, it is mostly big companies with millions of customers that use Data Mining today.

For smaller companies, the high costs of personnel, hardware and software licensing mean that it is not cost-effective. The main cost driver is the Data Mining experts who are needed to prepare the data and understand the company's domain.

These experts have to be hired or provided by external service providers (consulting companies). Yearly license fees, which can equal the cost of a Data Mining expert, have to be paid as well. The hardware has to be powerful and causes considerable costs. That is why small companies usually cannot afford Data Mining.

Reduce Data Mining costs by SaaS – Cloud Mining is born

The SaaS (Software-as-a-Service) distribution model helps to reduce costs by providing flexible license options and outsourcing the hardware effort. With SaaS, the software is not installed at the company; it runs on a software service provider's server.

That means the provider deals with the hardware, looks after software updates and handles all technical maintenance. In Cloud Mining, the servers that provide the software are the Cloud. This can be a public cloud from Google, Amazon.com etc., or a private cloud on the servers of a single provider. This has two main effects. On the one hand, the customer pays only for the Data Mining tools he needs.

That saves him a lot compared to complex Data Mining suites whose features he does not use fully. On the other hand, he pays only for the costs generated by his use of the Cloud. He does not have to maintain a hardware infrastructure; he can apply Data Mining simply via his browser. This lowers the barriers that keep small companies from benefiting from Data Mining.

Later we will come to the pros and cons of Cloud Mining and present some companies that already provide this service.

[Figure: cloud data mining layout]

Business Intelligence is all about making better decisions from the data you have. However, all too often, the data you have is difficult for typical BI tools to process. The difficulties generally fall into two areas.

  1. The data is too voluminous to be properly digested by your BI system.
  2. The data records are messy, inconsistent and difficult to join together.

Each of these problems is commonplace, and relatively easy to solve in isolation. High‐volume datasets can be mastered by simply (if expensively) throwing more hardware and software at the problem: larger servers, cluster licenses, faster networks, bigger memories, faster disks, etc. Messy data can be cleansed with the appropriate use of scripts and SQL logic to make records consistent and well‐defined.

But what do you do when datasets are both large and messy? As we have learned from a recent project with a large financial institution, large amounts of data that are difficult to correlate can bring down even a state‐of‐the‐art BI system. But large volumes of messy data are a fact of life; indeed, they are probably the bulk of the data in the enterprise. Add the complexity and cost of trying to tame that data, and the business case for analyzing it is overwhelmed.


Enter the Cloud

One of the primary concepts in cloud computing is low‐cost scalability—systems that can grow to handle larger volumes of users and data by adding more low‐cost hardware. Google's entire infrastructure is built on this approach of distributing work out to thousands of inexpensive servers, instead of relying on centralized "supercomputers" to provide the horsepower.

The scalability strategy that Google uses is called MapReduce. The MapReduce model provides a conceptual framework for dividing work up into small, manageable sets that can be distributed across 1 or 10 or 100 or 1000 or even 10000 servers, which can all work in parallel. This technology can be used with BI to meet the challenge of large‐scale, messy data, but you can’t use Google’s infrastructure to run your own MapReduce system. Luckily, there’s Hadoop – an open source implementation of the Google MapReduce system.

Even though it’s technically still in “Beta”, Hadoop is in use at many large organizations, including:

  • Amazon
  • Yahoo
  • Facebook
  • Adobe
  • The New York Times
  • AOL
  • Twitter
  • Rackspace

Introducing Hadoop

In 2003 and 2004, Google published papers describing its Google File System and MapReduce. Doug Cutting, a Yahoo employee and Open Source Evangelist, partnered with a friend to create Hadoop (named after his son’s stuffed elephant), an open source implementation of GFS and MapReduce.

In essence, Hadoop is a software system that can store arbitrarily large amounts of data in a distributed file system and distribute that data to an arbitrary number of workers using MapReduce. Adding more storage or more workers is simply a matter of connecting new machines to the network; there is no need for larger devices, specialized disks or specialized networking.

The two main parts of Hadoop are:

  1. HDFS
  2. MapReduce

HDFS


HDFS (Hadoop Distributed File System) is a system for managing files that runs "on top of" standard computers and standard operating systems. When a file is loaded into HDFS, the master “Name Node” invisibly breaks it into large chunks and stores them in multiple places (for redundancy) on the native file systems of the computers in the "cluster".

There's no requirement that the disks be the same size or that the computers in the "cluster" be identical. When a file is retrieved from HDFS, the Name Node fetches the chunks from the appropriate machines, re‐assembles them and delivers the file to the caller. (This is a simplification of the actual process, which is a lot more technically sophisticated.)
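
The chunk-and-replicate idea can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions, not the HDFS API: the function names, the node names, the tiny block size and the round-robin placement are all hypothetical, and real HDFS placement is considerably more sophisticated.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def reassemble(blocks):
    """Concatenate the blocks in order to recover the original file."""
    return b"".join(blocks)


# Hypothetical file contents and cluster.
data = b"customer-records-" * 10
blocks = split_into_blocks(data, block_size=64)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c", "node-d"], replication=3)
restored = reassemble(blocks)
```

Because every block lives on several nodes, losing one machine in this model still leaves at least two copies of each block available for reassembly.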

MapReduce

MapReduce is a three‐step process that provides a structure for analyzing data and manipulating it in a scalable way. The three steps are:

  1. Map
  2. Shuffle/Sort
  3. Reduce

Map

Raw data is translated/standardized/manipulated ‐ usually in fairly "lightweight" ways. The output of the Map step is a set of key‐value pairs. The key identifies the group a record belongs to (it may be unique, but in many cases you want the same key to be used for multiple records, for grouping purposes), and the value carries whatever data is needed later in the Reduce step.
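
A minimal sketch of a Map function in plain Python (not Hadoop's own API) may make this concrete. The input format `name;region;amount`, the function name `map_record` and the sample lines are assumptions made up for illustration.

```python
def map_record(raw_line):
    """Turn one messy 'name;region;amount' line into a (key, value) pair.

    Lines that cannot be parsed produce no output at all.
    """
    parts = [p.strip() for p in raw_line.split(";")]
    if len(parts) != 3:
        return
    _name, region, amount = parts
    try:
        value = float(amount.replace(",", "."))  # accept a decimal comma
    except ValueError:
        return
    # Key: normalized region (shared by many records). Value: the amount.
    yield region.upper(), value


# Hypothetical raw input with inconsistent spacing, case and number formats.
raw_lines = ["alice; east ;10,5", "bob;WEST;4", "broken line", "carol;East;2.5"]
pairs = [pair for line in raw_lines for pair in map_record(line)]
```

Note that the two "east" records end up with the same key even though they were spelled differently in the raw data, which is what makes the later grouping work.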

Shuffle/Sort

The "secret sauce" of Hadoop is the distributed sort, where the records are all sorted by key (using either a default alphabetical sort or another comparator of your choice). Once records are sorted by key, all the records with the same key are sent to the same Reduce processor ‐ essentially this represents a way to group data intelligently.
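
The following single-process sketch imitates that behavior: pairs are partitioned by key so that equal keys land in the same place, then each partition is sorted and grouped. The helper names and the CRC-based partitioning are assumptions chosen for illustration; Hadoop's real shuffle runs across the network between machines.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition_for(key, num_reducers):
    """Pick a reducer for a key; equal keys always map to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs, num_reducers):
    """Return, per reducer, a key-sorted list of (key, [values]) groups."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    grouped = []
    for part in partitions:
        part.sort(key=itemgetter(0))  # default alphabetical sort on the key
        grouped.append(
            [(k, [v for _, v in grp]) for k, grp in groupby(part, key=itemgetter(0))]
        )
    return grouped


# Hypothetical Map output.
pairs = [("EAST", 10.5), ("WEST", 4.0), ("EAST", 2.5), ("NORTH", 1.0)]
grouped = shuffle_and_sort(pairs, num_reducers=2)
```

Swapping the `key=` function passed to `sort` corresponds to supplying "another comparator of your choice".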

Reduce

In the last step, these groups of records with the same keys are handed one‐by‐one to the Reduce task. Sometimes, the Reduce step will cache all of the records, so it can operate on the entire group at one time (often to perform aggregations). Other translations and manipulations may occur here ‐ for example, the data might be output in a format that's easier to import into a database. Finally, the resulting data is written back out to HDFS, and the job is done.
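
As a rough illustration in plain Python (again, not the Hadoop API), a Reduce function might cache one group, compute a few aggregates and emit a tab-separated row that a database bulk loader could read. The function name, the choice of aggregates and the sample groups are hypothetical.

```python
def reduce_group(key, values):
    """Aggregate all values that share one key into a single output row."""
    values = list(values)  # cache the whole group so several aggregates can be computed
    total = sum(values)
    return key, len(values), total, total / len(values)


# Hypothetical grouped input, as it would arrive after the Shuffle/Sort step.
groups = [("EAST", [10.5, 2.5]), ("WEST", [4.0])]
rows = [reduce_group(key, values) for key, values in groups]

# Tab-separated text is one format that is easy to import into a database.
lines = ["\t".join(str(field) for field in row) for row in rows]
```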


How Does This Help BI?

Imagine you have data where some of the dimensions are well defined, but others change over time, in non‐trivial ways. Imagine that the data is in many different places, in many different formats, and you want to create a "holistic" view of the data. Last, but not least, imagine that the size of the overall dataset is so large that it will swamp the capabilities of your BI tool. How do you solve this problem?

The Traditional Way (Take a deep breath)

You can write translators for the different datasets ‐ but if the dataset is large, those translators will take a long time to run. So you consider manually splitting these large datasets into smaller sets, but then, of course, you have to get the data onto all of the computers, get the scripts running, and the computers need to have enough storage for the subset of the data.

Once you clean the data, you have to rejoin the cleansed data from the multiple machines, and if you need to sort the data to help you aggregate it, you're going to have to find a sort solution that works on the huge volumes of data that you're dealing with. Odds are, you'll have to sort smaller subsets of the overall dataset, and then find ways to merge the subsets back together.

Even then, you still haven't dealt with the fact that you need to aggregate the data, so you have to write more scripts, divide the data into subsets again, and make sure that all the records that belong to a group end up on the same machine.

Every step in the above scenario is error‐prone, complex, difficult to predict and, if you have to do this process on a regular basis, probably maddeningly tedious.
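
To give a feel for just one of those manual steps, here is a small sketch of "sort the subsets, then merge them back together" using Python's standard `heapq.merge`. The chunk size and sample records are made up, and a real job would also have to move the chunks between machines and handle failures, which is where much of the tedium comes from.

```python
import heapq


def sort_in_chunks(records, chunk_size):
    """Sort fixed-size chunks independently, as separate machines would."""
    return [sorted(records[i:i + chunk_size]) for i in range(0, len(records), chunk_size)]


def merge_sorted_chunks(chunks):
    """Merge already-sorted chunks into one fully sorted list."""
    return list(heapq.merge(*chunks))


# Hypothetical record keys.
records = [42, 7, 19, 3, 88, 61, 5, 23]
chunks = sort_in_chunks(records, chunk_size=3)
merged = merge_sorted_chunks(chunks)
```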


Hadoop to the Rescue

Instead, consider this option:

  1. Load all the datasets into HDFS. Hadoop will take care of how to partition the data and where to put it, and it also handles redundancy. (In other words, you don't need to use RAID in your Hadoop solution.)
  2. Write Map jobs that take the data from each of the formats and clean it, organizing the data into a general format.
  3. Specify how to sort the data to properly group it.
  4. Write Reduce jobs to aggregate the data ‐ averaging and summing columns as needed ‐ and then output the final aggregated data in a SQL‐friendly format.
  5. Run the Map/Sort/Reduce process on the data.
  6. At the end of this process, pull the aggregated data out of HDFS, and load it into your BI system.
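
The steps above can be sketched end to end in a single Python process. This is a toy model under stated assumptions, not a Hadoop job: the two input formats, the field names (`Region`, `amt`) and the helper names are hypothetical, and everything here runs on one machine rather than a cluster.

```python
import csv
import io
import json
from collections import defaultdict


def map_csv(line):
    """Source A (assumed format): 'region,amount' text lines."""
    region, amount = line.split(",")
    yield region.strip().upper(), float(amount)


def map_json(line):
    """Source B (assumed format): JSON objects with 'Region' and 'amt' fields."""
    rec = json.loads(line)
    yield rec["Region"].strip().upper(), float(rec["amt"])


def run_job(sources):
    # Map: clean every source into the same (region, amount) format.
    pairs = [pair for mapper, lines in sources for line in lines for pair in mapper(line)]
    # Shuffle/Sort: bring all values with the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: sum and average per key, emitted in key order.
    return [(key, sum(vals), sum(vals) / len(vals)) for key, vals in sorted(groups.items())]


def to_csv(rows):
    """Render the aggregated rows as CSV, a format most databases can bulk-load."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["region", "total", "average"])
    writer.writerows(rows)
    return buf.getvalue()


sources = [
    (map_csv, ["east, 10", "West,4"]),
    (map_json, ['{"Region": "East ", "amt": "2"}']),
]
rows = run_job(sources)
report = to_csv(rows)
```

Supporting another input format in this sketch means adding one more mapper; the grouping and aggregation code stays untouched, which is the property that makes the process repeatable.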

You'll need to perform some quality checking on the output, but if the steps have been done properly, you have a repeatable, scalable process for generating aggregated data, with minimal manual intervention.

Some Current Uses of Hadoop:

  • Product Search Index generation
  • Data Aggregation & Rollup
  • Data mining for ad targeting
  • ETL
  • Analyzing & Storing Logs
  • Data Analytics
  • RDF Indexing

Where Hadoop Doesn't Fit

Hadoop is a tool like any other, and it is not applicable to every problem. Some areas where Hadoop is not the right solution:

Highly Interdependent Data

Hadoop is not well suited for data where each record is heavily dependent on a number of other records. For example, consider weather forecasting ‐ predicting what's going to happen to a storm front over time requires a view of pretty much the entire dataset at once. This is the realm of supercomputers.

Ad‐Hoc, "Casual" BI

Hadoop provides a framework for data analysis, but it usually requires a fairly sophisticated user to create queries and aggregations. And, of course, there's little‐to‐no support for visualizations, etc. Hadoop's SQL‐related sub‐projects, such as Hive and Pig, help mitigate some of this, but casual reporting is still somewhat difficult.

Real‐time Processing

Hadoop is designed to trade off startup speed for scalability and parallelism ‐ in other words, Hadoop is more like a locomotive than a sports car ‐ it takes a fairly long time to get everything set up, but once it's moving, it's doing a lot of work.

Dependencies on Other Systems

If you set up a 10,000 node Hadoop cluster to process a huge dataset, and one of the steps involves a query to a database on an old computer in a dusty corner of the datacenter, that database server is going to be a bottleneck for the entire cluster. Hadoop jobs work best when they have few (or, in the best case, no) dependencies on external systems. There are various tricks and strategies that can mitigate this problem, but in general, remove as much external dependency as possible from your Hadoop jobs.
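
One such strategy, shown here only as an illustrative assumption rather than a method named by this article, is to read the external reference data once before the job and hand every worker a local copy, instead of querying the slow system once per record. The class and function names below are hypothetical stand-ins.

```python
class SlowLookupService:
    """Stand-in for the old database server; counts how often it is queried."""

    def __init__(self, table):
        self._table = dict(table)
        self.queries = 0

    def get(self, key):
        self.queries += 1
        return self._table.get(key)

    def dump(self):
        self.queries += 1
        return dict(self._table)


def enrich_per_record(records, service):
    """One external query per record: the external system becomes the bottleneck."""
    return [(cid, amount, service.get(cid)) for cid, amount in records]


def enrich_with_snapshot(records, service):
    """One bulk read up front, then purely local lookups."""
    snapshot = service.dump()
    return [(cid, amount, snapshot.get(cid)) for cid, amount in records]


# Hypothetical customer records and reference table.
records = [("c1", 10.0), ("c2", 4.0), ("c1", 2.5), ("c3", 1.0)]
table = {"c1": "EAST", "c2": "WEST"}

per_record_service = SlowLookupService(table)
slow_result = enrich_per_record(records, per_record_service)

snapshot_service = SlowLookupService(table)
fast_result = enrich_with_snapshot(records, snapshot_service)
```

Both variants produce the same enriched records, but the number of external queries drops from one per record to one per job. The trade-off is that the snapshot must fit in each worker's memory and may be slightly stale.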


Conclusions

In terms of Business Intelligence, Hadoop is a tool that makes the ETL process easier, and can bring the size and quality of data under control. It provides low‐cost scalability, a reliable and easily expandable file system, and a framework for dramatically increasing the scope and robustness of your data‐mining and data‐analysis business strategies. In situations where your data is too large, too messy or both, Hadoop can help you get it under control, and focus on your business, instead of focusing on one‐off IT infrastructure data analysis projects.