Keith Dawson: Develop in the Cloud
This story was written by Keith Dawson for UBM DeusM’s community Web site Develop in the Cloud, sponsored by AT&T. It is archived here for informational purposes only because the Develop in the Cloud site is no more. This material is Copyright 2012 by UBM DeusM.

A Data Pipeline in the Cloud

Amazon pre-announces a data orchestration service for big data.

AWS Data Pipeline will offer a graphical UI to integrate data from within and outside Amazon's services. It's part of a growing suite aimed squarely at big data.

Last week, Amazon held its first-ever conference for Amazon Web Services in Las Vegas -- AWS re:Invent -- and used the occasion to unleash a small blizzard of product announcements. One of them, AWS Data Pipeline, will be of interest to developers who use AWS and need to deal with data scattered across Amazon's services and elsewhere.

Amazon CTO Werner Vogels announced AWS Data Pipeline (see beginning about 1:25:00 on the video), characterizing it as "a data-driven workflow service that helps you periodically move data through several processing steps to get it where you want it to go." Data Pipeline is integrated with AWS's storage services: DynamoDB (the NoSQL database tool), S3 (the Simple Storage Service), the archival Glacier service, and the new "data warehousing as a service" offering, Redshift, which had been announced only the day before. AWS Data Pipeline can also be connected to data sources external to Amazon, such as databases on your own servers or in a private cloud.

Looks real
Amazon's chief data scientist Matt Wood gave a live demo of Data Pipeline (see around 1:27:30 on the video), using it first from a pre-built template and then building a workflow from scratch. Wood outlined the four elements that need to be defined to set up such a workflow: the data sources to read from, the activities to run against them, any preconditions that must hold before an activity fires, and a schedule.

Defining a workflow is done via a graphical, drag-and-drop interface. Wood demonstrated setting up a Data Pipeline that would, daily, pull log-file data from a DynamoDB database; spin up a Hadoop cluster of whatever size was needed to process it via Elastic MapReduce, then decommission the cluster; and store the results in a storage bucket on S3. Once a week the job would pull the seven daily logs from S3, produce a summary report, and store that back on S3. Wood concluded that Data Pipeline facilitates the "point-and-click creation of complex, light-weight data environments to enable your data analytics from your log files."
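Under the hood, a pipeline built in the drag-and-drop editor is represented as a JSON definition of scheduled objects. The sketch below shows roughly what the daily half of Wood's demo workflow might look like in that format; the object IDs, table name, and bucket path here are illustrative placeholders, not values from the demo.

```json
{
  "objects": [
    { "id": "DailySchedule", "type": "Schedule",
      "period": "1 day", "startDateTime": "2012-12-01T00:00:00" },

    { "id": "ClickLogs", "type": "DynamoDBDataNode",
      "tableName": "click-logs",
      "schedule": { "ref": "DailySchedule" } },

    { "id": "LogCluster", "type": "EmrCluster",
      "coreInstanceCount": "4",
      "schedule": { "ref": "DailySchedule" } },

    { "id": "ProcessLogs", "type": "EmrActivity",
      "input":  { "ref": "ClickLogs" },
      "output": { "ref": "DailyResults" },
      "runsOn": { "ref": "LogCluster" },
      "schedule": { "ref": "DailySchedule" } },

    { "id": "DailyResults", "type": "S3DataNode",
      "directoryPath": "s3://my-log-bucket/daily/",
      "schedule": { "ref": "DailySchedule" } }
  ]
}
```

The key idea is that the EMR cluster is itself a scheduled object: it exists only for the duration of each daily run, which is how the demo's spin-up-then-decommission behavior falls out of the definition.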

AWS Data Pipeline is evidently real enough and working -- there were no glitches during the demo -- but Vogels did not announce an availability date for the service.

More big data
Other announcements at Amazon's conference filled in more of the big-data offering that the company is assembling piece by piece. Vogels described new Elastic Compute Cloud (EC2) instances designed for analytics: "This is the instance type you want to use for in-memory," he said. This instance will have 240 GB of RAM and two 120-GB solid-state drives. A second new instance type is a high-storage one with 117 GB of RAM and 48 TB of disk space on 24 hard drives.

Vogels goes into detail about Redshift, the other big-data enabling announcement, on his blog, All Things Distributed. Redshift is priced to disrupt existing data-warehouse business models. A single 2-TB Redshift node will cost you $0.85 per hour, or $3,723 per terabyte annualized. With a 3-year reserved instance you can get the price under $1,000 per terabyte per year, "one tenth the price of most data warehousing solutions available to customers today," according to Vogels.
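The quoted per-terabyte figure follows directly from the hourly rate. A quick back-of-the-envelope check, assuming one 2-TB node running around the clock for a 365-day year:

```python
HOURLY_RATE = 0.85        # USD per node-hour, as announced
NODE_TB = 2               # storage per node, in terabytes
HOURS_PER_YEAR = 24 * 365

# Cost of running one node continuously for a year
annual_cost = HOURLY_RATE * HOURS_PER_YEAR

# Normalize to cost per terabyte per year
cost_per_tb_year = annual_cost / NODE_TB

print(f"${annual_cost:,.0f} per node-year")    # $7,446 per node-year
print(f"${cost_per_tb_year:,.0f} per TB-year") # $3,723 per TB-year
```

Which matches the $3,723/terabyte figure Amazon quotes for on-demand pricing.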

If you have a big-data problem, increasingly it's looking like Amazon is the place to go for solutions.