
Big Data Top Ten



What do you get when you combine Big Data technologies….like Pig and Hive? A flying pig?

No, you get a “Logical Data Warehouse”.

My general prediction is that Cloudera and Hortonworks are both aggressively moving to fulfill a vision which looks a lot like Gartner's "Logical Data Warehouse": namely, "the next-generation data warehouse that improves agility, enables innovation and responds more efficiently to changing business requirements."

In 2012, Infochimps (now CSC) leveraged its early use of stream processing, NoSQLs, and Hadoop to create a design pattern which combined real-time, ad hoc, and batch analytics. This concept of combining best-of-breed Big Data technologies will continue to advance across the industry until the entire legacy (and proprietary) data infrastructure stack is replaced with a new (and open) one.

As this is happening, I predict that the following 10 Big Data events will occur in 2014.

1. Consolidation of NoSQLs begins

A few projects have strong commercialization companies backing them. These are companies who have reached "critical mass", including DataStax with Cassandra, 10gen with MongoDB, and Couchbase with Couchbase Server. Leading open source projects like these will pull further and further away from the pack of 150+ other NoSQLs, which are either fighting for the same value propositions (with a lot less traction) or solving small niche use cases (and markets).

2. The Hadoop Clone wars end

The industry will begin standardizing on two distributions. Everyone else will become less relevant. (It's Intel vs. AMD; let's not forget the other x86 vendors like IBM, UMC, NEC, NexGen, National, Cyrix, IDT, Rise, and Transmeta.) If you are a Hadoop vendor, you're either the Intel or the AMD. Otherwise, you'd better be acquired or get out of the business by the end of 2014.

3. Open source business model is acknowledged by Wall Street

Because the open source, scale-out, commodity approach to Big Data is fundamental to the new breed of Big Data technologies, open source now becomes a clear antithesis of the proprietary, scale-up, our-hardware-only, take-it-or-leave-it solutions. Unfortunately, the promises of international expansion, improved traction from sales force expansion, new products and alliances, will all fall on deaf ears of Wall Street analysts. Time to short the platform RDBMS and Enterprise Data Warehouse stocks.

4. Big Data and Cloud really means private cloud

Many claimed that 2013 was the "year of Big Data in the Cloud". However, what really happened is that the Global 2000 immediately began their bare-metal projects under tight control. Now that those projects are underway, 2014 will exhibit the next phase: Big Data on virtualized platforms. Open source projects like Serengeti for vSphere, Savanna for OpenStack, and Ironfan for AWS, OpenStack, and VMware combined, or venture-backed proprietary solutions like BlueData, will enable virtualized Big Data private clouds.

5. 2014 starts the era of analytic applications

Enterprises become savvy to the new reference architecture of combined legacy and new-generation IT data infrastructure. Now it's time to develop a new generation of applications that take advantage of both to solve business problems. System integrators will shift resources, hire data scientists, and guide enterprises in their development of data-driven applications. This, of course, realizes concepts like the 360-degree view, the Internet of Things, and marketing to one.

6. Search-based business intelligence tools will become the norm with Big Data

Having a "Google-like" interface that allows users to explore structured and unstructured data with little formal training is where the new generation is headed. Just look at Splunk for searching machine data. Imagine a marketer being able to simply "Google search" for insights on their customers.

7. Real-time in-memory analytics, complex event processing, and ETL combine

The days of ETL in its pure form are numbered. It's either 'E', then 'L', then 'T' with Hadoop, or it's EAL (extract, apply analytics, and load) with new real-time stream-processing frameworks. Now that high-speed social data streams are the norm, so are processing frameworks that combine streaming data with micro-batch and batch data, performing complex processing on that data and feeding applications with sub-second response times.
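To make the EAL idea concrete, here's a minimal sketch in Python. The event format, field names, and the running-mean analytic are all hypothetical; a real stream-processing framework would distribute this work across a cluster.

```python
def extract(raw):
    """E: parse a raw event string (assumed format: 'user,action,amount')."""
    user, action, amount = raw.split(",")
    return {"user": user, "action": action, "amount": float(amount)}

class RunningMean:
    """A: an in-stream analytic applied before the load step."""
    def __init__(self):
        self.count, self.total = 0, 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count

def eal_pipeline(raw_events, sink):
    """Extract each event, apply the analytic, then load the enriched event."""
    stat = RunningMean()
    for raw in raw_events:
        event = extract(raw)                                 # E: extract
        event["mean_amount"] = stat.update(event["amount"])  # A: apply analytics
        sink.append(event)                                   # L: load

sink = []
eal_pipeline(["alice,buy,10.0", "bob,buy,30.0"], sink)
```

The point is the ordering: the analytic runs on the data in motion, so what lands in the sink is already enriched rather than waiting for a downstream transform step.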

8. Prescriptive analytics become more mainstream

After descriptive and predictive, comes prescriptive. Prescriptive analytics automatically synthesizes big data, multiple disciplines of mathematical sciences and computational sciences, and business rules, to make predictions and then suggests decision options to take advantage of the predictions. We will begin seeing powerful use-cases of this in 2014. Business users want to be recommended specific courses of action and to be shown the likely outcome of each decision.
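A toy illustration of that prescriptive step, with made-up actions, costs, and a stand-in "predictive model"; real systems would use trained models and far richer business rules.

```python
def predict_outcome(action, context):
    # Hypothetical predictive model: expected revenue lift for each action,
    # scaled by the customer's propensity to respond.
    lift = {"discount_10pct": 120.0, "free_shipping": 95.0, "do_nothing": 40.0}
    return lift[action] * context["propensity"]

def prescribe(actions, context, cost):
    # Prescriptive step: score every candidate action by predicted outcome
    # minus its cost, and rank the decision options for the business user.
    scored = [(a, predict_outcome(a, context) - cost[a]) for a in actions]
    return sorted(scored, key=lambda s: s[1], reverse=True)

options = prescribe(
    ["discount_10pct", "free_shipping", "do_nothing"],
    {"propensity": 0.5},
    {"discount_10pct": 25.0, "free_shipping": 8.0, "do_nothing": 0.0},
)
# options[0] is the recommended course of action with its expected net benefit
```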

9. MDM will provide the dimensions for big data facts

With Big Data, master data management will now cover both internal data that the organization has been managing over years (like customer, product and supplier data) as well as Big Data that is flowing into the organization from external sources (like social media, third party data, web-log data) and from internal data sources (such as unstructured content in documents and email). MDM will support polyglot persistence.

10. Security in Big Data won't be a big issue

Peter Sondergaard, Gartner's senior vice president of research, says that when it comes to big data and security, "You should anticipate events and headlines that continuously raise public awareness and create fear." I'm not dismissing the fact that with MORE data come more responsibilities, and perhaps liabilities, for those that harbor the data. However, in terms of the infrastructure security itself, I believe 2014 will end with a clear understanding of how to apply familiar best practices to your new Big Data platform, including trusted Kerberos, LDAP integration, Active Directory integration, encryption, and overall policy administration.

Posted in Big Data.


SAP & Big Data


SAP customers are confused about the positioning between SAP Sybase IQ and SAP Hana as it applies to data warehousing. Go figure, so is SAP. You want to learn about their data warehousing offering, and all you hear is “Hana this” and “Hana that”.

It reminds me of the time after I left Teradata when the BI appliances came on the scene. First Netezza, then Greenplum, then Vertica and Aster Data, then ParAccel. Everyone was confused about what the BI appliance was in relation to the EDW. Do I need an EDW, a BI appliance, an EDW + BI appliance?

With SAP, Sybase IQ is supposed to be the data warehouse and Hana is the BI or analytic appliance that sits off to its side. OK. SAP has a few customers on Sybase IQ, but are they the larger well-known brands? Let's face it: since its acquisition of Sybase in 2010, SAP has struggled with positioning it against incumbents like Teradata, IBM, and even Oracle.

SAP Roadmap


SAP's move from exploiting its leadership position in enterprise ERP to exploring the new BI appliance and Big Data markets has been impressive, IMHO. Its acquisition of EDW and RDBMS company Sybase in 2010, following the earlier acquisition of BI leader Business Objects in 2007, was necessary to stay relevant in the race to provide an end-to-end data infrastructure story. This was, however, a period of "catch-up", a late entry to the race.

Its true exploration began with SAP Hana and continues with its strategic partnership with Hadoop commercialization company Hortonworks. The ability to rise ahead of data warehouse and database management system leaders will require defining a new Gartner quadrant: the Big Data quadrant.

SAP Product Positioning

Let's look back in time at SAP's early positioning. We have the core ERP business, the new "business warehouse" business, and the soon-to-be-launched Hana business. The SAP data warehouse equation is essentially Business Objects + Sybase IQ + Hana. Positioning Hana, as with most data warehouse vendors' appliances, is a struggle, since it can be positioned as a data mart within larger footprints, or as THE EDW database altogether in smaller accounts. One would think that with proper guidelines, this positioning would be straightforward. But platform choice depends on more than database size and query complexity; customer organizational requirements and politics are a very challenging variable as well. As shown above, you can tell that SAP struggled with simplifying its message for its sales teams early on.

SAP Hana – More than a BI Appliance

SAP released its first version of their in-memory platform, SAP HANA 1.0 SP02, to the market on June 21st 2011. It was (and is) based on an acquired technology from Transact In Memory, a company that had developed a memory-centric relational database positioned for “real-time acquisition and analysis of update-intensive stream workloads such as sensor data streams in manufacturing, intelligence and defense; market data streams in financial services; call detail record streams in Telco; and item-level RFID tracking.” Sound familiar to our Big Data use-cases today?

As with most BI appliances back then, customers spent about $150K for a basic 1TB configuration (SAP partnered with Dell) for the hardware only; add software and installation services and we were looking at $300K, minimally, as the entry point. SAP started off with either a BI appliance (HANA 1.0) or a BW Data Warehouse appliance (HANA 1.0 SP03). Both of these used the SAP IMDB Database Technology (SAP HANA Database) as their underlying RDBMS.

BI Appliances come with analytics, of course


When SAP first started marketing Hana analytics, you were promised a suite of sophisticated analytics as part of the Predictive Analysis Library (PAL), which can be called directly in an "L wrapper" within SQLScript. The inputs and outputs are all tables. PAL includes several well-known predictive analysis algorithms across these data mining categories:

  • Cluster analysis (K-means)
  • Classification analysis (C4.5 Decision Tree, K-nearest Neighbor, Multiple Linear Regression, ABC Classification)
  • Association analysis (Apriori)
  • Time Series (Moving Average)
  • Other (Weighted Score Table Calculation)
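To make the clustering entry concrete, here's a toy one-dimensional K-means in Python. It is only a sketch of the algorithm PAL exposes in-database (with a deliberately naive initialization), not PAL's actual implementation.

```python
def kmeans_1d(points, k, iters=20):
    """Toy 1-D K-means: the same alternate assign/re-center idea that
    PAL's cluster analysis runs inside the database."""
    centroids = sorted(points)[:k]  # naive init: first k sorted points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

centroids = kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], k=2)
```

The difference with PAL is purely architectural: PAL runs this next to the data, with tables in and tables out, instead of pulling rows into the application tier.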

HANA’s main use case started with a focus around its installed base with a real-time in-memory data mart for analyzing data from SAP ERP systems. For example, profitability analysis (CO-PA) is one of the most commonly used capabilities within SAP ERP. The CO-PA Accelerator allows significantly faster processing of complex allocations and basically instantaneous ad hoc profitability queries. It belongs to accelerator-type usage scenarios in which SAP HANA becomes a secondary database for SAP products such as SAP ERP. This means SAP ERP data is replicated from SAP ERP into SAP HANA in real time for secondary storage.

BI Appliances are only as good as the application suite

Other use-cases for Hana include:

  • Profitability reporting and forecasting,
  • Retail merchandizing and supply-chain optimization,
  • Security and fraud detection,
  • Energy use monitoring and optimization, and,
  • Telecommunications network monitoring and optimization.

Applications developed on the platform include:

  • SAP COPA Accelerator
  • SAP Smart Meter Analytics
  • SAP Business Objects Strategic Workforce Planning
  • SAP SCM Sales and Operations Planning
  • SAP SCM Demand Signal Management

Most opportunities were initially “accelerators” with its in-memory performance improvements.

Aggregate real-time data sources

There are two main mechanisms that HANA supports for near-real-time data loads. First is the Sybase Replication Server (SRS), which works with SAP or non-SAP source systems running on Microsoft, IBM or Oracle databases. This was expected to be the most common mechanism for SAP data sources. There used to be some license challenges around replicating data out of Microsoft and Oracle databases, depending on how you license the database layer of SAP. I’ve been out of touch on whether these have been fully addressed.

SAP has a second choice of replication mechanism called System Landscape Transformation (SLT). SLT is also near-real-time and works from a trigger from within the SAP Business Suite products. This is both database-independent and pretty clever, because it allows for application-layer transformations and therefore greater flexibility than the SRS model. Note that SLT may only work with SAP source systems.
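A toy analogue of the trigger-based SLT idea, using SQLite (table and column names are invented): a database trigger captures each change into a changelog table, which a replicator then drains into the target store.

```python
import sqlite3

# Source system: an orders table plus a changelog fed by a trigger.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
src.execute("CREATE TABLE changelog (id INTEGER, amount REAL)")
src.execute("""
    CREATE TRIGGER capture AFTER INSERT ON orders
    BEGIN
        INSERT INTO changelog VALUES (NEW.id, NEW.amount);
    END
""")

# Normal application writes; the trigger records them as they happen.
src.execute("INSERT INTO orders VALUES (1, 99.5)")
src.execute("INSERT INTO orders VALUES (2, 10.0)")

# The 'replicator' reads the changelog and applies rows to the target.
target = dict(src.execute("SELECT id, amount FROM changelog"))
```

The trigger fires inside the source system, which is what makes this style near-real-time and database-independent from the target's point of view.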

High-performance in-memory performance

HANA stores information in electronic memory, which is 50x faster (depending on how you calculate) than disk. HANA stores a copy on magnetic disk, in case of power failure or the like. In addition, most SAP systems have the database on one system and a calculation engine on another, and they pass information between them. With HANA, this all happens within the same machine.

Why Hadoop?

SAP HANA is not a platform for loading, processing, and analyzing huge volumes (petabytes or more) of unstructured data, commonly referred to as big data. Therefore, HANA is not suited for social networking and social media data analytics. For such use cases, enterprises are better off looking to open source big data approaches such as Apache Hadoop, or even MPP-based next-generation data warehousing appliances like Pivotal Greenplum or similar.

SAP’s partnership with Hortonworks enables the ability to migrate data between HANA and Hadoop platforms. The basic idea is to treat Hadoop systems as an inexpensive repository of tier 2 and tier 3 data that can be, in turn, processed and analyzed at high speeds on the HANA platform. This is a typical design pattern between Hadoop and any BI appliance (SMP or MPP).
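A simple sketch of that tiering pattern (the day-partitioned layout and 30-day "hot" threshold are assumptions): recent partitions go to the fast HANA-like tier, while everything older stays in the cheap Hadoop-like tier.

```python
from datetime import date

def tier_partitions(partitions, today, hot_days=30):
    """Split day-partitioned data into a hot tier (fast, in-memory,
    HANA-like) and a cold tier (cheap tier 2/3 storage, Hadoop-like)."""
    hot, cold = {}, {}
    for day, rows in partitions.items():
        age_days = (today - day).days
        (hot if age_days <= hot_days else cold)[day] = rows
    return hot, cold

partitions = {
    date(2013, 11, 28): ["recent-row"],
    date(2013, 6, 1): ["old-row"],
}
hot, cold = tier_partitions(partitions, today=date(2013, 11, 30))
```

In practice a scheduler would promote and demote partitions as they age, but the design choice is the same: keep the expensive tier small and the full history queryable.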


SAP “Big Data White Space”?

Where do SAP customers need support? Where is the "Big Data white space"? SAP seems to think that persuading customers to run core ERP applications on HANA is all that matters. Are customers responding? Answer: not really.

Customers are saying they're not planning to use it, with most of them citing high costs and a lack of clear benefit (aka use case) behind their decision. Even analysts are advising against it: Forrester Research said the HANA strategy is "understandable but not appealing".

"If it's about speeding up reporting of what's just happened, I've got you, that's all cool, but it's not helping me process more widgets faster." (an SAP customer)

SAP is betting its future on HANA + SaaS. However, what is working in SAP's favor for the moment is the high level of commitment among existing (European) customers to on-premise software.

This is where the “white space” comes in. Bundling a core suite of well-designed business discovery services around the SAP solution-set will allow customers to feel like they are being listened to first, and sold technology second.

Understanding how to increase REVENUE with new greenfield applications around unstructured data that leverages the structured data from ERP systems can be a powerful opportunity. This means architecting a balance of historic “what happened”, real-time “what is currently happening”, and a combined “what will happen IF” all together into a single data symphony. Hana can be leveraged for more ad-hoc analytics on the combined historic and real-time data for business analysts to explore, rather than just be a report accelerator.

This will require:

  • Sophisticated business consulting services: to support uncovering the true revenue upside
  • Advanced data science services: to support building a new suite of algorithms on a combined real-time and historic analytics framework
  • Platform architecture services: to support the combination of open source ecosystem technologies with SAP legacy infrastructure

This isn't rocket science. It just takes focused tactical execution, leading with business cases first. The SAP-enabled Big Data system can then be further optimized with cloud delivery as a cost reducer and time-to-value enhancer, along with a further focus on application development. Therefore, other white space includes:

  • Cloud delivery
  • Big Data application development

SAP must keep its traditional customers and SI partners (like CSC) engaged with “add-ons” to its core business applications with incentives for investing in HANA, while at the same time evolving its offerings for line of business buyers.

Some think that SAP can change the game by reaching/selling to marketers with new analytics offerings (e.g. see SAP & KXEN), enhanced mobile capabilities, ecosystem of start-ups, and a potential to incorporate its social/collaboration and e-commerce capabilities into one integrated offering for digital marketers and merchandisers.

Is there a path to define a stronger CRM vision for marketers? SAP won't get there without credible SI partners who have experience with new media, digital agencies, and specialty service providers who are defining the next wave of content- and data-driven campaigns and customer experiences.

Do you agree?

Posted in Big Data.


Infochimps, a CSC Company = Big Data Made Better


What’s a $15B powerhouse in information technology (IT) and professional services doing with an open source based Big Data startup?

It starts with “Generation-OS”. We’re not talking about Gen-Y or Gen-Z. We’re talking Generation ‘Open Source’.

Massive disruption is occurring in information technology as businesses are building upon and around recent advances in analytics, cloud computing and storage, and an omni-channel experience across all connected devices. However, traditional paradigms in software development are not supporting the accelerating rate of change in mobile, web, and social experiences. This is where open source is fueling the most disruptive period in information technology since the move from the mainframe to client-server – Generation Open Source.

Infochimps = Open Standards based Big Data

Infochimps delivers Big Data systems with unprecedented speed, scale and flexibility to enterprise companies. (And when we say “enterprise companies,” we mean the Global 2000 – a market in which CSC has proven their success.) By joining forces with CSC, we together will deliver one of the most powerful analytic platforms to the enterprise in an unprecedented amount of time.

At the core of Infochimps’ DNA is our unique, open source-based Big Data and cloud expertise. Infochimps was founded by data scientists, cloud computing, and open source experts, who have built three critical analytic services required by virtually all next-generation enterprise applications: real-time data processing and analytics, batch analytics, and ad hoc analytics – all for actionable insights, and all powered by open-standards.

CSC = IT Delivery and Professional Services

When CSC begins to insert the Infochimps DNA into its global staff of 90,000 employees, focused on bringing Big Data to a broad enterprise customer base, powerful things are bound to happen. Infochimps Inc., with offices in both Austin, TX and Silicon Valley, becomes a wholly-owned subsidiary, reporting into CSC's Big Data and Analytics business unit.

The Infochimps’ Big Data team and culture will remain intact, as CSC leverages our bold, nimble approach as a force multiplier in driving new client experiences and thought leadership. Infochimps will remain under its existing leadership, with a focus on continuous and collaborative innovation across CSC offerings.

I regularly coach F2K executives on the important topic of “splicing Big Data DNA” into their organizations. We now have the opportunity to practice what we’ve been preaching, by splicing the Infochimps DNA into the CSC organization, acting as a change agent, and ultimately accelerating CSC’s development of its data services platform.

Infochimps + CSC = Big Data Made Better

I laugh many times when we’re knocking on the doors of Fortune 100 CEOs.

“There’s a ‘monkey company’ at the door.”

The Big Data industry seems to be built on animal-based brands like the Hadoop Elephant. So to keep running with the animal theme, I’ve been asking C-levels the following question when they inquire about how to create their own Big Data expertise internally:

“If you want to create a creature that can breathe underwater and fly, would it be more feasible to insert the genes for gills into a seagull, or splice the genes for wings into a herring?”

In other words, do you insert Big Data DNA into the business-savvy with simplified Big Data tools, or insert business DNA into your Big Data-savvy IT organization? In the case of CSC and Infochimps, I doubt that Mike Lawrie, CSC's CEO, wants to be associated with either a seagull or a herring, but I do know he and his senior team are executing on a key strategy to become the thought leader in next-generation technology, starting with Big Data and cloud.

Regardless of your preference for animals (chimpanzees, elephants, birds, or fish), the CSC and Infochimps combination speaks very well to CSC’s strategy for future growth with Big Data, cloud, and open source. Infochimps can now leverage CSC’s enterprise client base, industrialized sales and marketing, solutions development and production resources to scale our value proposition in the marketplace.

“Infochimps, a CSC company, is at the door.”

 Jim Kaskade
Infochimps, a CSC Company

Posted in Big Data, Cloud Computing.


Real-time Big Data or Small Data?


Have you heard of products like IBM's InfoSphere Streams, TIBCO's event processing product, or Oracle's CEP product? All are good examples of commercially available stream processing technologies that help you process events in real time.

I’ve been asked what I consider as “Big Data” versus “Small Data” in this domain. Here’s my view.

Real-Time Analytics: Small Data vs. Big Data

  • Data Volume. Small Data: none. Big Data: none.
  • Data Velocity. Small Data: ~100K events/day (<<1K events/second). Big Data: billion+ events/day (>>1K events/second).
  • Data Variety. Small Data: 1-6 structured sources and a single destination (an output file, a SQL database, a BI tool). Big Data: 6+ structured and 6+ unstructured sources and many destinations (a custom application, a BI tool, several SQL databases, NoSQL databases, Hadoop).
  • Data Models. Small Data: used mainly for "transport"; little to no ETL, in-stream analytics, or complex event processing performed. Big Data: transport is the foundation, but distributed ETL and linearly scalable in-memory and in-stream analytics are applied, and complex event processing is the norm.
  • Business Functions. Small Data: one line of business (e.g. financial trading). Big Data: several lines of business, up to a 360-degree view.
  • Business Intelligence. Small Data: no queries are performed against the data in motion; this is simply a mechanism for transporting a transaction or event from the source to a database. Transport times are <1 second. Example: connect to desktop trading applications and transport trade events to an Oracle database. Big Data: ETL, sophisticated algorithms, complex business logic, and even queries can be applied to the stream of events while in motion; analytics span all data sources and, thus, all business functions. Transport and analytics occur in <1 second. Example: connect to desktop trading applications, market data feeds, and social media, and provide instantaneous trending reports; allow traders to subscribe to information pertinent to their trades and have analytics applied in real time for personalized reporting.
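The velocity criterion is the easiest one to make concrete. A sliding-window counter like the pure-Python sketch below (times in milliseconds, names invented) is the basic mechanism for deciding whether a stream is in <<1K or >>1K events-per-second territory:

```python
from collections import deque

class WindowRate:
    """Count events in a sliding time window, the 'data velocity' test
    that separates small-data from big-data streaming workloads."""
    def __init__(self, window_ms=1000):
        self.window_ms = window_ms
        self.times = deque()

    def observe(self, t_ms):
        self.times.append(t_ms)
        # Drop events that have fallen out of the window.
        while self.times and t_ms - self.times[0] > self.window_ms:
            self.times.popleft()
        return len(self.times)  # events seen in the last window

rate = WindowRate(window_ms=1000)
# Simulate one event per millisecond: ~1,000 events/second sustained.
peak = max(rate.observe(t) for t in range(5000))
```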

Want to see my view of Batch Analytics? Go Here.

Want to see my view of Ad Hoc Analytics? Go Here.

Here are a few other products in this space:


Posted in Big Data.


Ad Hoc Queries with Big Data or Small Data?


Do you think that you're working with "Big Data"? Or is it "Small Data"? If you're asking ad hoc questions of your data, you'll probably need something that supports "query-response" performance or, in other words, "near real-time". We're not talking about batch analytics, but more interactive / iterative analytics. Think NoSQL, or "near real-time Hadoop" with technologies like Impala. Here's my view of Big versus Small for ad hoc analytics in either case.

Ad Hoc Analytics: Small Data vs. Big Data

  • Data Volume. Small Data: megabytes to gigabytes. Big Data: terabytes (1-100TB).
  • Data Velocity. Small Data: updated in near real-time (seconds). Big Data: updated in real-time (milliseconds).
  • Data Variety. Small Data: 1-6 structured data sources. Big Data: 6+ structured AND 6+ unstructured data sources.
  • Data Models. Small Data: aggregations with tens of tables. Big Data: aggregations with hundreds to thousands of tables.
  • Business Functions. Small Data: one line of business (e.g. sales). Big Data: several lines of business, up to a 360-degree view.
  • Business Intelligence. Small Data: queries are simple, covering basic transactional summaries/reports; response times are in seconds across a handful of business analysts. Example: retrieve a customer's profile and summarize their overall standing based on current market values for all assets. This is representative of the work performed when a business asks "What is my customer worth today?" The transaction is read-only, and questions vary based on what the business analyst needs to know interactively. Big Data: queries can be as complex as with batch analytics, but are generally still read-only and processed against aggregates; queries span business functions. Response times are in seconds across large numbers of business analysts. Example: retrieve a customer profile and summarize activities across all customer touch points, calculating lifetime value based on past and current activities. This is representative of the work performed when a business asks "Who are my most profitable customers?" Questions vary based on what the business analyst needs to know interactively.
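The "most profitable customers" example boils down to an aggregation across touch points. A minimal sketch with made-up events (a real system would run this against pre-built aggregates to hit seconds-level response times):

```python
# Hypothetical touch-point events: (customer, channel, revenue)
events = [
    ("c1", "web", 120.0), ("c1", "store", 80.0),
    ("c2", "web", 40.0),  ("c1", "support", -15.0),
]

def lifetime_value(events):
    """Aggregate net revenue per customer across every touch point,
    then rank: the 'who are my most profitable customers?' query."""
    ltv = {}
    for customer, _channel, revenue in events:
        ltv[customer] = ltv.get(customer, 0.0) + revenue
    return sorted(ltv.items(), key=lambda kv: kv[1], reverse=True)

ranked = lifetime_value(events)  # most profitable customer first
```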

Want my view on Batch Analytics? Look here.

Want my view on Real-time analytics? Look here.

Here are a few products in this space:

Posted in Big Data.


Batch with Big Data versus Small Data


How do you know whether you are dealing with Big Data or Small Data? I’m constantly asked for my definition of “Big Data”. Well, here it is…for batch analytics, now addressed by technologies such as Hadoop.

Batch Analytics

Batch Analytics: Small Data vs. Big Data

  • Data Volume. Small Data: gigabytes. Big Data: terabytes to petabytes.
  • Data Velocity. Small Data: updated periodically at non-real-time intervals. Big Data: updated both in real-time and through bulk timed intervals.
  • Data Variety. Small Data: 1-6 structured sources. Big Data: 6+ structured AND 6+ unstructured sources.
  • Data Models. Small Data: store data without cleaning, transforming, or normalizing. Big Data: store data without cleaning, transforming, or normalizing, then apply schemas based on application needs.
  • Business Functions. Small Data: one line of business (e.g. sales). Big Data: several lines of business, up to a 360-degree view.
  • Business Intelligence. Small Data: queries are complex, requiring many concurrent data modifications, a rich breadth of operators, and many selectivity constraints, but they are applied to a simpler data structure; response times are in minutes to hours, issued by one or maybe two experts. Example: determine how much profit is made on a given line of parts, broken out by supplier, by geography, by year. Big Data: queries are similarly complex and span business functions; response times are in minutes to hours, issued by a small group of experts. Example: determine how much profit is made on a given line of parts, broken out by supplier, geography, and year; then determine which customers purchased the higher-profit parts, by geography and year; determine the profile of those high-profit customers; and find which products purchased by high-profit customers were NOT purchased by other similar customers, in order to cross-sell / up-sell.
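The small-data example is essentially a grouped profit rollup. A minimal sketch with invented rows (a Hadoop job would express the same group-by as a map of keys to reduced sums, sharded across the cluster):

```python
from collections import defaultdict

# Hypothetical part-sales rows: (supplier, geography, year, revenue, cost)
sales = [
    ("acme",   "EMEA", 2013, 100.0, 60.0),
    ("acme",   "EMEA", 2013,  50.0, 20.0),
    ("zenith", "APAC", 2013,  80.0, 70.0),
]

def profit_rollup(rows):
    """Batch-style rollup: profit broken out by supplier, geography, year."""
    totals = defaultdict(float)
    for supplier, geo, year, revenue, cost in rows:
        totals[(supplier, geo, year)] += revenue - cost
    return dict(totals)

rollup = profit_rollup(sales)
```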

Want to see my view on Ad Hoc and Interactive Analytics? Go here.

Want to see my view on Real-Time Analytics? Go here.

Here are a few other products in this space:

ICS Hadoop
Posted in Big Data.


Splice data scientist DNA into your existing team

As organizations continue to grapple with big data demands, they may find that business managers who understand data may meet their "data scientist" needs better than hard-core data technologists.

There's little doubt that data-derived insight will be a key differentiator in business success, and even less doubt that those who produce such insight are going to be in very high demand. Harvard Business Review called "data scientist" the "sexiest" job of the 21st century, and McKinsey predicts a shortfall of about 140,000 data scientists by 2018. Yet most companies are still clueless as to how they're going to meet this shortfall.

Unfortunately, the job description for a data scientist has become quite lofty. Unless your company is Google-level cool, you’re going to struggle to hire your big data dream team (well, at least right now), and few firms out there could recruit them for you. Ultimately, most organizations will need to enlist the support of existing staff to achieve their data-driven goals, and train them to become data scientists. To accomplish this, you must determine the basic elements of data scientist “DNA” and strategically splice it into the right people.


Image credit: Thinkstock

Posted in Big Data.


Why the Pivotal Initiative’s Fate will Mirror VMware’s


An enterprise PaaS must truly be agnostic to the underlying elastic infrastructure, and fully support open standards. So the big question is whether the Pivotal Initiative will be able to break away from its roots with EMC and VMware and the associated ties to vSphere.

Let's itemize just a few of the major components of the stack from top to bottom:

  • Pivotal Labs: Besides the source of Paul Maritz‘s new company name, this is an agile software development consulting firm focused on Ruby on Rails, pair programming, test-driven development and behavior driven development. It is known for Pivotal Tracker, a project management and collaboration software package.
  • OpenChorus: real-time social collaboration on predictive analytics projects, allowing businesses to iterate faster and more effectively.
  • Cetas: End-to-End analytics platform from data ingestion, to data source connectors, to data processing and analytics, and visualization to recommendations.
  • Vfabric SpringSource: Eclipse-based application development framework for building Java-based enterprise applications.
  • Vfabric Data Director: Database provisioning, high availability, backup, and cloning. This product includes the ability to provision Hadoop on VSphere using open source project Serengeti (powered by the open source orchestration project Ironfan).
  • Vfabric Gemfire: An in-memory stream processing technology that combines the power of stream data processing capabilities with traditional database management. It supports ’Continuous Querying‘ which eliminates the need for application polling and supports the rich semantics of event driven architectures.
  • Vfabric RabbitMQ: Enterprise messaging middleware implementation of AMQP supporting a full range of Internet protocols for lightweight messaging— including HTTP, HTTPS and STOMP – enabling you to connect nearly any imaginable type of applications, components, or services.
  • Greenplum: An ad hoc query and analytics database. The Greenplum database is based on PostgreSQL. It primarily functions as a data mart / analytic appliance and utilizes a shared-nothing, massively parallel processing (MPP) architecture. It has a parallel query optimizer that converting SQL AND MapReduce into a physical execution plan.
  • Pivotal HD (Hadoop Distribution): The distribution is competitive with Cloudera’s. EMC (now Pivotal) created its own distribution so it could improve query response time (though this occurred before they were aware of the introduction of Impala). Many believe that Pivotal HD was created solely to boost struggling sales of its Greenplum software and appliances.
  • Cloud Foundry: An open source cloud computing Platform as a Service (PaaS) written in Ruby.
  • Bosh: An open source tool chain for release engineering, deployment, and lifecycle management of large-scale distributed services. It was initially developed to manage the Cloud Foundry PaaS, but because Cloud Foundry is itself a large-scale distributed application, Bosh evolved into a general-purpose orchestration tool chain that can handle any application. Bosh currently supports four different IaaS providers: OpenStack, AWS, vSphere, and vCloud.
  • IaaS – OpenStack, AWS, vSphere & vCloud: Support starts with the vCloud and vCenter APIs, and extends with the later additions of OpenStack and AWS (via the Bosh orchestration layer).

So when you look at this sample of technologies (and I’m sure I’m leaving many off the list), you might see through the EMC/VMware veil to a collection of open source projects. We’ll see how Paul Maritz pulls this all together – clearly a powerful collection of teams and technologies.

So why do I refer to VMware’s “fate”? Well, it’s no secret that VMware’s business has begun to plateau under pressure from open source projects like OpenStack. Did Paul get out right in the “nick of time”? Can he create a long-term sustainable business on open source?

Posted in Big Data, Cloud Computing.


Big Data and Banking – More than Hadoop

Fraud is definitely top of mind for all banks. Steve Rosenbush at the Wall Street Journal recently wrote about Visa’s new Big Data analytic engine, which has changed the way the company combats fraud. Visa estimates that its new Big Data fraud platform has identified $2 billion in potential annual incremental fraud savings. With Big Data, their new analytic engine can study as many as 500 aspects of a transaction at once. That’s a sharp improvement from the company’s previous analytic engine, which could study only 40 aspects at once. And instead of using just one analytic model, Visa now operates 16 models, covering different segments of its market, such as geographic regions.

Do you think Visa, or any bank for that matter, uses just batch analytics to provide fraud detection? Hadoop can play a significant role in building models. However, only a real-time solution will allow you to take those models and apply them in a timeframe that can make an impact.
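This division of labor can be sketched in a few lines of Python: the coefficients below stand in for the output of an offline (batch, e.g. Hadoop) model-building job, while the scoring function is what a real-time pipeline would apply to each transaction as it arrives. All feature names, weights, and thresholds here are hypothetical illustrations, not anything Visa or any bank actually uses.

```python
import math

# Hypothetical coefficients produced by an offline (batch) training job.
# In practice these would be refreshed periodically by a Hadoop pipeline.
MODEL_WEIGHTS = {"amount_zscore": 1.8, "foreign_country": 2.1, "night_hour": 0.9}
MODEL_BIAS = -4.0

def fraud_score(txn):
    """Apply the batch-trained logistic model to one transaction in real time."""
    z = MODEL_BIAS + sum(MODEL_WEIGHTS[f] * txn.get(f, 0.0) for f in MODEL_WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # score in [0, 1]

def flag(txn, threshold=0.5):
    """Decision a streaming system could make within the transaction itself."""
    return fraud_score(txn) >= threshold

# A normal-looking transaction vs. a suspicious one
ok_txn = {"amount_zscore": 0.1, "foreign_country": 0, "night_hour": 0}
bad_txn = {"amount_zscore": 2.0, "foreign_country": 1, "night_hour": 1}
```

The point of the sketch: the model is built in batch, but applied in real time, transaction by transaction, which is what makes the insight actionable.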

The banking industry is based on data – the products and services in banking have no physical presence – and as a consequence, banks have to contend with ever-increasing volumes (and velocity, and variety) of data. Beyond the basic transactional data concerning debits/credits and payments, banks now:

  • Gather data from many external sources (including news) to gain insight into their risk position;
  • Chart their brand’s reputation in social media and other online forums.

This data is both structured and unstructured, as well as very time-critical. And, of course, in all cases financial data is highly sensitive and often subject to extensive regulation. By applying advanced analytics, the bank can turn this volume, velocity, and variety of data into actionable, real-time and secure intelligence with applications including:

  • Customer experience
  • Risk Management
  • Operations Optimization

It’s important to note that applying new technologies like Hadoop is only a start (it addresses 20% of the solution). Turning your insights into real-time actions will require additional Big Data technologies that help you “operationalize” the output of your batch analytics.

Customer Experience

Banks are trying to become more focused on the specific needs of their customers and less on the products that they offer. They need to:

  • Engage customers in interactive/personalized conversations (real-time)
  • Provide a consistent, cross-channel experience including real-time touch points like web and mobile
  • Act at critical moments in the customer sales cycle (in the moment)
  • Market and sell based on customer real-time activities

Noticing a general theme here? Big Data can assist banks with this transformation and reduce the cost of customer acquisition, increase retention, increase customer acceptance of marketing offers, increase sales through targeted marketing activities, and increase brand loyalty and trust. Big Data presents a phenomenal opportunity. However, the definition of Big Data HAS to be broader than Hadoop.

Big Data promises the following technology solutions to help with this transformation:

  • Single View of Customer (all detailed data in one location)
  • Targeted Marketing with micro-segmentation (sophisticated analytics on ALL of the data)
  • Multichannel Customer Experience (operationalizing back out to all the customer touch points)

Risk Management

Risk management is also critically important to the bank. Risk management needs to be pervasive within the organizational culture and operating model of the bank in order to make risk-aware business decisions, allocate capital appropriately, and reduce the cost of compliance. Ultimately, this means making data analytics as accessible as it is at Yahoo! If the bank could provide a “data playground” where all data sources were readily available with tools that were easy to use…well, let’s just say that new risk management products would be popping up left and right.

Big Data promises a way of providing the organization integrated risk management solutions, covering:


  • Financial Risk (Risk Architecture, Data Architecture, Risk Analytics, Performance & reporting)
  • Operational Risk & Compliance
  • Financial Crimes (AML, Fraud, Case Management)
  • IT Risk (Security, Business Continuity and Resilience)

The key is to focus on one use-case first, and expand from there. But no matter which risk use-case you attack first, you will need batch, ad hoc, and real-time analytics.

Operations Optimization

Large banks often become unwieldy organizations through many acquisitions. Increasing flexibility and streamlining operations is therefore even more important in today’s more competitive banking industry. A bank that is able to increase its flexibility and streamline operations by transforming its core functions will be able to drive higher growth and profits; develop more modular back-office systems; and respond quickly to changing business needs in a highly flexible environment.

This means that banks need new core infrastructure solutions. Examples might involve reducing loan origination times by standardizing loan processes across all entities using Big Data. Streamlining and automating these business processes will result in higher loan profitability, while complying with new government mandates.

Operational leverage improves when banks can deliver global, regional and local transaction and payment services efficiently and also when they use transaction insights to deliver the right services at the right price to the right clients.

Many banks are seeking to innovate in the areas of processing, data management and supply chain optimization. For example, in the past, when new payment business needs would arise, the bank would often build a payments solution from scratch to address it, leading to a fragmented and complex payments infrastructure. With Big Data technologies, the bank can develop an enterprise payments hub solution that gives a better understanding of product and payments platform utilization and improved efficiency.

Are you a bank, interested in new Big Data technologies like Hadoop, NoSQL datastores, and real-time stream processing? Interested in one integrated platform for all three?


Posted in Big Data.

Expansion Stage Companies

The Webster definition of “expansion stage” is:

“Financing provided by a venture capital firm to a company whose service or product is commercially available. Though the company’s revenues may look strong and show significant growth, the company may not be profitable. Typically, a company that receives an expansion stage investment has been in business three years or longer.”

This definition was “ok”, but I was looking for something with more depth. That’s when I read an old blog post from Scott Maxwell at OpenView Partners where he states that “expansion stage” begins with:

  1. Whole Product: You have a core product vision and “whole product” offering with enough functionality and enough of a competitive differentiation that your target market customers are purchasing/using your “whole product” with a high enough win/conversion rate.
  2. Referenceable Customers: You have a set of happy (or at least satisfied) customers (that are willing to be used as references, used for case studies, and/or say good things about your company online and offline) and your customers and target market are generally happy with your product and go to market approach.
  3. G2M that works: You have a core Go-To-Market Strategy and are executing it in a way that gets solid economic results (sometimes we call this “sales economics”, “sales and marketing economics”, “distribution economics”, or “funnel economics”). Ideally, the management team has gone down the learning curve far enough that the benefits of growing the resources outweigh the more difficult continuous improvement of a larger set of resources (that you will have as you go through the expansion stage).
  4. Foundation: You have adequate organizational and operational methodologies and people to support additional resources and additional business.

I like Scott’s view of “expansion stage”. It is a high-tech startup CEO’s focus and goal to break free of the early-stage startup phase and enter the beginning of this phase. So it is worth defining it for your organization and managing the change required to transition into it successfully. Below, I reflect on this using my own experience with Infochimps’ transformation over the past two quarters.

Customers Buying “Whole Product”

If you are a high-tech CEO and you are reading this, you might ask, “What is a ‘Whole Product’?” Let’s be clear: you are never finished developing, improving, and adding to your product/service.

However, you know you are still in the early-stage if you are only delivering on a part of your promise, with plans to “add XYZ when we’re further along”….and the part you are missing is still required for your customer to really receive the value you are promising.

For example, had Infochimps only delivered “Hadoop as a Service” as our cloud service for our customers, we would have provided HUGE potential. However, we would have fallen short on our promise of solving our customers’ business problems. ALL of our customers require MORE than Hadoop to solve their problems. Therefore, we needed to make sure that our cloud services provided real-time analytics, ad hoc analytics, and batch analytics – this was required to provide a “whole product” to our customers.

On the flip side, we have SO many ideas on how to improve the developer’s experience including a rich number of data flow and analytic libraries developed specifically around customer use-cases, customer GUIs that give our customers operational views into how our cloud service is operating, etc. However, all these ideas fall into the category of improving on the “whole product” we have today. In other words, our customers are receiving the value associated with solving their business problem, even if there are still some “rough edges” to iron out.

Scott  also mentions another important element of “whole product” where you are experiencing a high enough “win rate”. This is a requirement that I have always struggled with throughout my 15 years as a startup CEO.

What is a high enough win rate? I can personally translate this into two things, which the team at Infochimps has spent several months perfecting:

  • Having a sales process which is well defined. This means that although you may not have every phase of the sales cycle understood to the point that you can predict the odds of closing them perfectly (e.g. we have a “measurable phase” called “Relationship Building” which we associate with 50% odds of closing), you continue to measure and adjust your sales process with the goal of statistically closing what you predict (e.g. at least half of the customers who reach the “Relationship Building” phase in the sales cycle). What is important is that your entire sales team (sales operations, inside sales, direct sales, systems engineers) and even marketing are: a) aligned on what is being measured, b) know exactly what criteria the organization is using to define each measurable phase, and c) constantly improving the process.
  • Knowing how/when to say “No” to customers. In some cases, this means you don’t take on new customer prospects unless you have a high degree of certainty that you can make THEM successful. This statement is loaded because it involves understanding your customer’s business, their targeted use-case, and your ability to provide a solution that successfully solves that use-case. It also has a lot to do with knowing your target market (and knowing what markets you want to avoid). In the case of Infochimps, we would like to work exclusively with Fortune 1000 companies, scrutinizing smaller company prospects.

At Infochimps, we use a number of well-defined and measurable phases to help us understand how to qualify and obtain customers with a high win rate. Here’s how we define our MPs (measurable phases):

  1. MP1 (measurable phase 1 = Potential Opportunity): We have performed “business discovery” and we understand the customer’s use-case which equates to a “big data problem”, we know their goals (success criteria), we know their deployment requirements (e.g. public, virtual private, private cloud), we have confirmed that they have budget and can spend it, there is a clear champion, and ultimately there is a compelling and/or impending event…all driving the need for our cloud services. Inside sales focuses on potential opportunities.
  2. MP2 (Confirmed Opportunity): Our direct sales force has a detailed dialogue with the champion to confirm all criteria in MP1, but digs further into the use-case and the potential impact on our customer’s business (how do they win? does it increase revenue? to what extent? how far does the needle move?). At the end of this phase we have determined that a significant level of our investment is justified (e.g. system engineering is brought in). We then begin to build what is called a “Client Specification” which details the opportunity.
  3. MP3 (Relationship): Creating a relationship requires understanding the tasks/timelines involved, knowing the people involved in authorizing the spend (thumbs up), getting the customer to speak to their success criteria succinctly, reaching a point where the customer believes that our solution solves their problem (technical and business validation begins), and frankly getting to know the customer (creating a level of trust). The output of this phase is a “Proposal” to the customer.
  4. MP4 (Negotiation): This phase occurs when you have submitted your proposal for solving their problem, including the economics involved. You enter into a level of “technical” and “business” due diligence that can result in a “bake-off” where you may lose their business, or you potentially “fire your customer prospect” because they don’t fit your target market model, or you win their business and begin to move forward, formalizing the partnership. Our team then puts together what’s called a “Mutual Action Plan” (MAP) which outlines the steps to a formal partnership.
  5. MP5 (Procurement): This is the phase where business terms are drafted into a legal contract. At Infochimps, we have a standard “Cloud Services Agreement”. However, many of our Fortune 1000 customers arm-wrestle on certain terms and many provide their own “paper”. Note: don’t think that because you are in this “legal” or “procurement” period that customers will stay quiet on business terms. I find that customers still like to negotiate during this period (especially if timing gets close to the end of your quarter….which they always leverage). We’ve also seen deals fall apart during this phase.
  6. MP6 (Customer Win): Of course, execution of the contract is not the “end” of the sales cycle for Infochimps. In fact, our sales directors are incentivized based on a smooth process of “pre-sales” to “post-sales” so that our customers don’t feel an abrupt transition. We know that the hard work starts at this point, making our customers successful.

Our sales team reviews our customer prospects every week and shares these with the entire company every two weeks! We don’t expect to have perfect probabilities associated with each of the above phases, but we do know what the goals are, how to measure them, and that we need to constantly improve in order to achieve and exceed our goals.

Referenceable Customers

If you don’t have this as a company S.M.A.R.T. goal, you are failing to meet one of the most important requirements of becoming an expansion stage company.

This is MORE than just saying, “Oh yeah! You can call any of our customers and they will vouch for us.” For us at Infochimps, we have a specific goal that all in the company understand.

I have an example to emphasize the importance of this objective from the senior team level down. Remember “Delivering Happiness” by Tony Hsieh? Our entire executive team visited Zappos and met with Tony Hsieh and Fred Mossler, with the opportunity to spend personal one-on-one time with them. The topic? How should Infochimps create a corporate culture of its own which makes customers the #1 focus throughout the organization? What are the important ways to establish a customer-centric culture and make sure it is sustained with the addition of each new team member?

We too go out of our way to establish a personal connection with our customers. We will go the “extra mile” for our customers….maybe not quite as extreme as ordering pizza for them, but close. We’re also doing so in a way that scales economically. Unfortunately, our business can become very “professional services” heavy with each customer “touch” potentially leading to a “sucking sound” (customers love the fact that we are the ‘experts’ in Big Data….which can quickly result in them leaning on us to do a lot of customer work…and a business that is far from profitable).

So what processes have we established over the past two quarters that address this important characteristic of an “expansion stage company”? Some key milestones for us include:

  • Establishing clear success criteria with our customers based on a clearly stated use-case.
  • Clearly defining what is expected of our customers so that they become accountable for the success of the project.
  • Involving our customers throughout the process with distinct checkpoints where we ask consistent questions – starting on day 1 (as part of a kickoff meeting), and at various phases of being deployed on our cloud (we have three phases today), and then on a regularly scheduled basis after “going live”. This is led by both our “expert services” and “customer service” departments.
  • Onsite visits by our System Engineering and Product Management teams to assess “what we need to do better” which plays nicely into our cloud service roadmap.
  • Executive sponsorship, involving our VP of Sales talking to our customer champion, as well as CEO to CEO dialogue. Yep, I personally take time out to have calls with my peers within every customer account. In some cases where the “CEO” is clearly not going to take my call (top ten bank globally), I go as high as I can within the organization to communicate our commitment and establish a direct connection. This may not fit your model (e.g. direct to consumer), but I find that it remains an invaluable opportunity for even expansion stage companies.

By the way, all our customers are “referenceable”…even those who were and are now not currently using our cloud service  ;-)

Go-To-Market That Works

There are three important aspects here:

  • Having a Go-To-Market strategy (yep, you actually need one of these to then measure your results against)
  • Executing it in a way that gets solid economic results (your G2M needs to produce a profitable business)
  • Reaching the point where the benefits of growing the resources outweigh the more difficult continuous improvement of a larger set of resources (“just add water”)

These may seem like simple ideas, but I generally find that only 10% of my peers really understand the mechanics of these three things. Let’s take a brief look at each.

Having a G2M

One of the FIRST things we discussed as a team was G2M. I technically spent an entire quarter defining it, testing it with customers, and going back to defining it. This was a very iterative process that involved talking to real customer prospects, as well as ecosystem companies we felt were necessary in helping us execute on it. I’ll mention a few components involved:

  • Outline your ecosystem with your company in the center.
  • Understand your target market – early adopters versus early majority
  • Profile decision makers within your target market(s)
  • Understand the sales process involved
  • Define the Why, How, What of your product
  • Create a plan around lead-gen & qualification
  • Establish your Direct vs. Indirect plan
  • Define your sales process for both
  • Create competitive positioning
  • Distill your market messaging
  • Set your revenue goals
  • Define the resources required and when

Then we knew we had to measure how well our work was paying off. For us this meant having marketing and sales tied to mutual objectives and measuring the lead process from “marketing campaign” all the way to “customer win”. It also meant A LOT of work with Salesforce.com (SFDC). Wow, you’d think SFDC would be easier…but if you feel that you’re creating everything from scratch in this tool, you are not alone. Key metrics we measure include:

  • Customer Acquisition Cost (CAC) Ratio: Bruce Cleveland from InterWest as well as the Bessemer folks talk about this metric quite a bit. Our CAC ratio is at 48%. Costs include campaign costs per win + marketing salaries allocated + direct sales commissions/salaries + inside sales commissions/salaries + pre-sales SE costs.
  • Sales cycle (from lead generation to close). Ours is 4 months.
  • Duration within the lead/sales lifecycle (all measurable phases). Email me for this.
  • Conversion rates throughout each phase. Email me for this.
  • Time from contract close to deployment (realizing revenue/value). This is 30 days.
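For illustration, here is one plausible way to compute a CAC ratio as described above: the per-win acquisition cost buckets divided by the value of the first year of the resulting contract. The cost breakdown and dollar amounts below are hypothetical, chosen only to land near the 48% figure; the post itself doesn't disclose the inputs.

```python
def cac_ratio(campaign_costs, marketing_salaries, direct_sales_comp,
              inside_sales_comp, presales_se_costs, first_year_contract_value):
    """One plausible reading of the CAC ratio: all per-win acquisition
    costs divided by the first-year contract value of that win."""
    total_cost = (campaign_costs + marketing_salaries + direct_sales_comp
                  + inside_sales_comp + presales_se_costs)
    return total_cost / first_year_contract_value

# Hypothetical per-win dollar amounts (not the post's actual numbers)
ratio = cac_ratio(30_000, 40_000, 45_000, 15_000, 14_000, 300_000)
# 144,000 / 300,000 = 0.48
```

Whatever exact formula you settle on, the value of the metric comes from computing it the same way every quarter so the trend is comparable.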

Other “dashboard” items that assist us in measuring the business include:

  • Monthly Recurring Revenue: I look not only at what is committed under contract (CMRR) but also at the annual recurring revenue (ARR), or the first 12 months of recurring revenue (CMRR x 12). Our average is around $20K MRR or $240K ARR.
  • Total annual contract value (TACV): Which includes both recurring and one-time fees (expert or professional services) within the first 12 months of contract. I measure from start of recurring (when deployed) + 12 months out. This is currently around $300K.
  • Pipeline (Pipe): All-odds pipeline is everything. Then there’s what is forecast in the current quarter, plus the upside/backfill (which could come in this quarter or next), and everything else (next quarter+). I measure both MRR + one-time = Total Contract Value. This is around $18M currently.
  • Churn: How many customers do you lose after 12 months (we require a 12-month upfront commitment, so this milestone is important)? This is 20% for us (due to our early focus on startups…need I say more?).
  • Customer Lifetime Value (CLTV): For our target customers, we’re seeing a 3-year minimum lifetime on projects at an average of $780K.
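The recurring-revenue arithmetic above can be sanity-checked in a few lines. The MRR-to-ARR conversion is straight from the figures in the post; the CLTV decomposition (recurring revenue over the project lifetime plus one-time fees) is a hypothetical reading of how a $240K ARR customer could reach $780K over three years, since the post doesn't break that number down.

```python
def arr(mrr):
    """Annual recurring revenue: the first 12 months of committed MRR."""
    return mrr * 12

def simple_cltv(annual_recurring, years, one_time_fees=0):
    """A naive lifetime value: recurring revenue over the project lifetime
    plus one-time (expert services) fees. Ignores churn and discounting."""
    return annual_recurring * years + one_time_fees

assert arr(20_000) == 240_000  # the post's average: $20K MRR -> $240K ARR

# Hypothetical decomposition of the $780K figure: 3 years of $240K ARR
# plus $60K of one-time fees. (An assumption, not the post's own math.)
assert simple_cltv(240_000, 3, one_time_fees=60_000) == 780_000
```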

Economic Results

We look at EVERY prospect customer deployment on our cloud as it compares to our “target profile” or what we call a “Standard Reference Platform (SRP)” customer. Infochimps has a “model customer” where we define everything from revenue to profitability – it’s a full “margin model”. We look at variable costs (all cost of goods, customer support, delivery services) that contribute to gross profit, and then allocated indirect costs (allocated sales and marketing, R&D, etc.) that drive profit (for a complete P&L view) for all the supported cloud configurations for our customers. My guidance to my peers is that you make sure your gross profits (gross margins) are 70-90%, and your profit margins are 25-35%. This is a healthy operating model ;-).
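A minimal sketch of such a margin model, with hypothetical numbers chosen to fall inside the stated 70-90% gross margin and 25-35% profit margin ranges (none of these inputs come from the post):

```python
def margin_model(revenue, cost_of_goods, allocated_indirect_costs):
    """Return (gross margin, profit margin) for one customer's P&L.
    Inputs are annual dollars; all figures below are hypothetical."""
    gross_profit = revenue - cost_of_goods
    operating_profit = gross_profit - allocated_indirect_costs
    return gross_profit / revenue, operating_profit / revenue

# A made-up SRP customer: $300K revenue, $60K COGS, $150K allocated costs
gross_margin, profit_margin = margin_model(300_000, 60_000, 150_000)
# 80% gross margin and 30% profit margin, inside the 70-90% / 25-35% targets
```

Running every prospective deployment through the same model is what makes "does this deal fit our target profile?" a quantitative question rather than a gut call.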

Just Add Water

We all know that if your operational model requires many “human” moving parts, adding more customers will actually exponentially raise the cost of doing business. In this case, “just adding water” doesn’t equate to scalable growth, but rather a business which will implode. Over the past six months the Infochimps team has been focused on two things to make sure we avoid this outcome and can scale with lots of operational leverage:

  • Hardening our existing cloud services so that they support the largest deployments (making sure that as our size of customers, our size of problems, and the size of our cloud deployments all grow, we don’t experience a non-linear amount of effort in supporting them)
  • Automation of our cloud deployment and management such that we have superior operational leverage. Where we spend a tenth of an operations engineer, our customers would need 10 people to do the same…and by achieving this, we’ll always be 24 months ahead of our customers.

This concept also means that when we add another “sales team” (which consists of a combination of direct sales, inside sales, pre-sales SE, and post-sales expert services), the number of customers, the resulting revenue, and ultimately the net profits scale well. I’m proud to say that over the past six months we have surpassed most companies our size with a process and “equation” which supports this. A message to my peers – it comes down to really understanding your margin model, and making sure the entire organization also understands their contribution to improving it.

Foundation For Growth

This is the most subjective characteristic, and yet the one that could have the greatest impact on your organization. The good news is that with the right leadership you can apply changes which ensure your ability to scale your business…and this, indeed, is all about “scaling” with “people processes”.

Remember, this means having adequate organizational and operational methodologies to scale your business. The areas within this category we’ve focused on at Infochimps include:

  • Establishing a vision that all understand and support
  • Creating an ROI-focused organization (people understand that everything affects the P&L)
  • Innovating by focusing on sustainable competitive advantage
  • Being nimble and comfortable with change
  • Hiring people with passion & commitment
  • Fostering a level of communication that is “straight but sensitive”
  • Creating a business that is centered around the customer (solving real problems)
  • Creating a corporate culture which is about the “we” not “I”

I’ll add that we also deploy many typical processes like an agile engineering process, and a lean but effective product realization process. But let me focus on the more “fluffy” for a moment. I’ll give you an example (which my executive team didn’t exactly appreciate at first).

As a CEO, I know my job is to set the vision/direction of the company; make sure that we have the right people/resources; make sure we’re executing well; and ultimately remove obstacles. However, what I believe many of my peers discount is that if you establish the proper executive team communication practices, and push those down into the organization, your company can withstand the challenges associated with scaling to any size.

I can’t tell you how many times an executive has told me they are good at operating under stress, only to fall significantly short…and it always has something to do with communication at its core. One example of how we address “communication” issues is at our executive meetings. Our weekly executive meetings have a seemingly standard agenda…except for one major difference. Here’s our agenda:

  1. Good news check-in
  2. Discussion around “Real Issues” & top priorities
  3. Customer and employee hassles
  4. Review of overall quarterly status
  5. Commitments/cascading messages
  6. Wrap – one sentence close

Notice the discussion around “real issues”? Here’s where our staff meetings stray from most. Our definition of a “real issue”:

  • A topic that would make your stomach linings churn, if brought up as a team
  • Something that you are uncomfortable talking about (especially as a team)
  • Event(s) which are affecting the group (staff, company) negatively

Why does our executive meeting need to address “real-issues”?

  • Teams (companies) fail based on process (team dynamics) not content (what is actually being talked about)
  • Every team “hits a wall”. Great teams work through the “real issues”
  • Every “real issue” that has the potential of “blowing the team apart” is exactly what makes it stronger
  • Reality always wins. It’s our job to get in touch with it.
  • There are no secrets in teams, just dysfunctional dynamics that pretend there are.

Our executive team actually works through issues which assist us in facilitating change needed to grow. That’s what every expansion stage company needs, and most early-stage companies lack. As an executive team we constantly assess and work to improve our ability to operate (see Five Dysfunctions of a Team).

Curious about my management style? Have other ideas about “expansion stage” companies? I’m always available for beers after work.

Posted in Leadership.

