The Hong Kong Summit attracted more than 3,000 developers, users, investors, media and analysts this week. Presentations and panels in which Cloudscalers participated are linked below.
AWS Repatriation: Bring Your Apps Back
Running your own infrastructure *can* be as little as half the cost of running on AWS once you are at scale. OpenStack-based cloud systems can provide the same or similar economies of scale if you leverage the lessons of AWS and GCE when building your cloud. This talk will discuss the economic factors in designing a cost-efficient AWS + OpenStack hybrid cloud. We look at the issues involved in repatriating existing applications, and we show a couple of real-world demonstrations of tools that can assist in the repatriation process. Repatriation isn't quite as simple as hitting the Easy button, but if you plan your deployment correctly, you can make it work, both technically and economically.
This 2nd major State of the Stack address is a complete refresh of the spring 2013 edition, broadcast live on BrightTALK from the OpenStack Summit in Hong Kong.
Randy Bias, CEO and Co-founder of Cloudscaling, examines the progress from Grizzly to Havana and delves into new areas like refstack, TripleO, bare metal server provisioning, the move from “projects” to “programs”, and public/hybrid cloud compatibility. Check out the updated statistics on project momentum and look more closely at big upgrades in Havana, including OpenStack Orchestration (Heat), which has the opportunity to change the game for OpenStack in the greater private and hybrid cloud market. We also discuss the “what is ‘core’” debate and examine the idea that OpenStack is a kernel, not a complete cloud OS.
DevOps is changing how cloud applications and infrastructure are deployed and managed, with the aim of speeding up the process of delivering new features and releases of our applications. OpenStack is itself a fast-growing infrastructure platform. In this panel we discuss the specific challenges of delivering DevOps and continuous deployment models on top of an OpenStack infrastructure that is continuously evolving. We’ve assembled a panel of top industry experts to discuss how to push updates to our applications without downtime, how to update our OpenStack infrastructure at the same time, and the tools and frameworks available in the OpenStack ecosystem that can help us implement a successful DevOps project. The panel represents Rackspace (Tony Campbell), Cloudscaling (Azmir Mohamed), Scalr (Sebastian Stadil), Canonical (Mark Ramm-Christensen) and LivePerson (Toby Holzer). Nati Shalom (GigaSpaces) led the discussion as moderator.
Panel on Application Portability
It’s a multi-cloud world and your code needs to run somewhere. However, the cloud you choose today may not be the cloud you need tomorrow. Changes in reliability, performance, cost, and privacy may drive you to research alternative public clouds, a private cloud, or a hybrid of the two. Considering application portability upfront can be crucial in avoiding lock-in. The tools you use to interact with the cloud will play a large part in how portable your application is between clouds. This panel will discuss the different approaches to application portability, e.g., API compatibility, multi-cloud SDKs, image portability, and application architecture portability. Is application portability a myth or reality? What are the pros and cons? Bring your questions to be answered by our panel of experts.
Stack Debate: Understanding OpenStack’s Future
Which is more important to the future of OpenStack: A vibrant community of users who collectively steer the project toward features and technologies that matter most to them? Or, is the future of OpenStack best served through a logical series of sound engineering decisions that steer the project toward an architecture that’s ready for a new world of scale-out apps?
In this debate, moderated by Stefano Maffulli, Scott Sanchez and Randy Bias will take point and counterpoint. We’ll discuss the hyper-growth of OpenStack and debate whether or not the technology has kept up. Participants will hear differing points of view on how projects like Neutron, Ironic, refstack and TripleO are or are not taking OpenStack in the right direction. We’ll look at who is putting OpenStack into production and examine what that means for the larger community.
The debate is designed to draw participants out of their default points of view and challenge them to consider alternatives. As OpenStack continues its rapid growth, a continual re-examination of our thinking is critical.
The cloud space is overflowing with buzzwords: cloud, agile, devops, shadow IT… it can be exhausting. Beneath all that hype, though, are some powerful technologies – game-changing, even. Based on the trends I see in the industry, I believe OpenStack is one of those game-changers, and that the skills needed to be an OpenStack ninja are skyrocketing in demand.
With that in mind, Cloudscaling decided it would be super cool to offer public training classes, to share the learnings picked up over some battle-hardened years of making OpenStack run in production.
Here’s a rundown of the courses offered this December in San Francisco:
The first, Cloud Computing with OpenStack, is geared toward the business and strategy crowd. It covers the industry shift toward cloud computing, the value proposition of OpenStack, and formulas to help you decide whether it’s right for your organization. If you’re an engineer who’s chomping at the bit to implement OpenStack but are having trouble getting the higher-ups to understand the value in it, then this might just be the direction to send them!
The other two, the Bootcamp for OpenStack and Cloud Storage with OpenStack Swift, are targeted at a more hands-on, technical audience: network, systems, and operations engineers, or end-users. These courses dive into the guts of what it takes to deploy and operate an OpenStack cloud. The beginning of the course touches briefly on the purpose, history, and success of OpenStack as a project, but that’s mostly just to set the stage. The bulk of time is spent stepping through each of the components (Keystone, Glance, Neutron, Nova, Cinder, Swift, and Horizon) in detail, covering…
What it is
How to install it
How to configure it (and trade-offs of various options)
How to operate it as an end-user
Real-world example use-cases and incidents
Hands-on lab exercises, to help things stick
The lab exercises are performed in a virtualized environment on your workstation, so everyone can build, hack, and rebuild without worrying about stomping on anyone else’s work. The courses also give a peek into Cloudscaling’s hosted demo cloud, OCSgo, just to get a better feel for how things operate in a true distributed model.
Cloudscaling is already fairly active in knowledge-sharing via meetups, conferences, and blogs. Most of the time, however, those of us in Engineering and Ops have had our heads down, building clouds with OpenStack and trying to make them bulletproof. We’ve experimented, we’ve iterated, and we’ve learned the hard way what works well and what doesn’t. Now, we’re excited to be able to share even more of those learnings with the community.
Oh and Cloudscaling happily does on-site trainings too, so feel free to shoot us an email if you have a group of folks interested. Even if you’re not in the Bay Area, we don’t mind the occasional airplane ride.
Earlier this morning, Randy Bias posted an open letter to the OpenStack community here. It presents a case for immediately and deliberately embracing Amazon and the AWS APIs, as well as those of other established public clouds.
In the post, Randy presents an overview of the history of OpenStack’s current position, the evolution of the project’s governance, an analysis of how Amazon now dominates in public cloud, the “innovation curve” in public cloud, and how OpenStack can dominate in private and hybrid cloud. He also looks at why fears of legal action are ungrounded, and how the entire OpenStack community can win by embracing Amazon and other leading public clouds.
I want the community to know how integral the continuous integration (CI) system (TripleO), plus integration tests (Tempest), plus unit tests (per project), are to our success. Previously I interviewed Monty Taylor on this topic and he had a ton of fabulous insight to share on how the CI system works. However, in looking back on the last three years and trying to understand why OpenStack continues to grow and hit every milestone, I think we should “do the numbers.”
First up, notice the total number of unit and integration tests, which is now well over 15,000. Due to lack of time, I am missing a few key projects like Swift, Ceilometer, and Heat (will try and update soon!). (UPDATED: Graphic below updated to include Heat and Swift.)
This is impressive, but perhaps most impressive is observing the trajectory of the creation of unit tests. Just looking at Nova you can see that the community has been hard at work over the last three years adding test after test:
This is incredible velocity and it really tells us about the commitment of the OpenStack community to deliver a high quality, production-grade, Cloud OS kernel.
More importantly, the OpenStack infrastructure team’s continuous integration system is deploying and testing OpenStack over 700 times a day using the Tempest integration tests, which have doubled in the last year:
This is why we are able to move so fast and why no other Infrastructure-as-a-Service (IaaS) open source software development community will be able to catch us.
From the Cloudscaling engineering team: thanks so much for the continuing hard work!
Last week, I had the privilege of being involved in the authorship of a book. It was an intense and amazing experience and I’m still surprised that it worked, that we successfully outlined, wrote, and edited a book in only 5 days!
Our outcome was the OpenStack Security Guide, a 154-page instructional on securing cloud deployments. This was a collaboration of individuals from CloudPassage, Cloudscaling, HP, Intel, Johns Hopkins University Applied Physics Lab, the NSA, Nebula, Nicira (VMware), Rackspace, and Red Hat.
In writing, we managed to squeeze in practical guidance on configuration topics that were previously undocumented or otherwise “hidden” features, such as using SSL client certificates for MySQL authentication. (Thanks Nathanael Burton!) We also found a number of feature requests, questions-to-investigate-later, plus some serious and not-so-serious bugs.
Notably, most of the feature requests, concerns, and bugs were found in the first day. Many would not have surfaced had we not had this mix of vendors and personalities together in the same room. We would not have had nearly the same amount of energy had we attempted this over the wire. It reminds me that we as a community would do well to continue various sprints and in-person events, lest we forget the value they bring.
Ultimately, I believe our book accomplishes a reasonable balance between scope and depth. This was difficult because several topics could be books in and of themselves. It is impossible to be entirely happy with any creative work, but I’m pleased with our output. There is plenty of room for additional insight, especially when it comes to topics we know we couldn’t scope — we were too light on storage and didn’t cover Compute Cells at all. However, we’ll be releasing the book as Creative Commons and it will be put into ‘git’ as a living project, so we welcome future community participation.
Finally, I really need to thank Adam Hyde in particular for bringing his expertise to this exercise; without him, we couldn’t have done this. Additionally, I thank Bryan Payne (Nebula) and Robert Clark (HP) for their efforts in making this come together, Keith Basil (Red Hat) and Ben de Bont (HP) for both their efforts and footing the bills, and everyone else with whom I shared “an overly air-conditioned room” for 5 days.
The OpenStack Security Guide is immediately available for download as an ePub. Soon, we will be making HTML and PDF versions available for download. A printed edition will also be available.
Russell Bryant is the PTL (Project Technical Lead) for OpenStack Compute (Nova) and has spent the past two years working on Compute in his role as a Principal Software Engineer for Red Hat. That combination of experience and technical leadership gives Russell a useful perspective on the complexity of running the Nova project and the evolution of OpenStack overall.
At the OpenStack Summit in Portland a few weeks ago, I talked to Russell about the technical challenges he manages in his role as PTL:
spinning nova-volume and nova-network out of OpenStack Compute into OpenStack Block Storage (Cinder) and OpenStack Networking (Quantum) respectively
progress toward feature-completeness of nova-network functionality under OpenStack Networking (Quantum) in Havana
the ongoing need for cross-project collaboration
shout outs to some of the key folks who’ve helped drive Nova forward
the development of sub-teams in OpenStack Compute interested in similar functionality
speeding up the developer feedback loop on performance and scaling issues
perspectives on nova-conductor, a traffic control layer above the hypervisor to facilitate communication between compute nodes and to abstract out database operations for security reasons
Mark McLoughlin is a principal engineer at Red Hat who’s also the company’s OpenStack technical lead. He serves on both the OpenStack Technical Committee and as individual director on the OpenStack Foundation Board of Directors. More importantly, he’s the top committer to the Grizzly release.
We spoke at the OpenStack Summit in Portland about the often overlooked Oslo (openstack-common) project within OpenStack, which Mark leads. The Oslo project produces a set of Python libraries containing code shared by various OpenStack projects. The goal is to provide a common set of high-quality API libraries for the project, to follow a Don’t Repeat Yourself (DRY) model across projects, and to create a model for cross-project collaboration.
In the video, we discuss:
an overview of Oslo (openstack-common) and how it enables DRY and cross-project collaboration
addressing technical debt to help OpenStack move more quickly and keep up with the six-month release cycles
how the governance model for OpenStack provides a balance among the interests of users, operators and developers
brief comparison of different governance models (Gnome Foundation vs. OpenStack Foundation)
Thierry Carrez handles release management for the OpenStack Foundation and is chair of the project’s Technical Committee. Thierry was involved with the earliest incarnations of OpenStack while at Rackspace. We caught up with him at the OpenStack Summit in Portland to get Thierry’s insights into the release cycle, governance and his wish list for the project.
In the video, we discuss:
drivers behind the shift from a 3-month to a 6-month release cycle for OpenStack
managing the release cycle as OpenStack has grown from two to nine projects
the logic behind aligning the release cycle with the semi-annual Summits
the role of CI in improving interoperability and quality across all the projects
complementary roles of the board (resources, brand, trademark) and the technical committee (meritocracy of developers and code quality)
importance of motivating corporate contributors to invest more in long-term, strategic projects like documentation, security, QA, and test suite
One of our favorite sayings at Cloudscaling is “Simplicity Scales.” This saying has a slightly-less-well-known coda, “Complexity Fails.”
Let’s walk through a real-world example of this.
In Open Cloud System (OCS), our high-availability (HA) strategy for services that have persistent datastores is to use a UCARP IP to make sure that one and only one of the backend servers is active at any given time. Then we replicate data between all the backend servers so that if one fails, another can take over the UCARP VIP and the cloud continues operating normally. UCARP works basically like VRRP – multiple devices share a virtual IP address (VIP) and communicate using CARP to figure out which one of them should be active at any given time.
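To make the mechanism concrete, a UCARP instance on each backend server is started with the shared VIP and that node's real address. This is a hedged sketch using the standard ucarp command-line flags; the interface name, addresses, VHID, password, and script paths are illustrative placeholders, not OCS's actual configuration:

```shell
# Run on every backend server that may hold the VIP (values are placeholders).
# --srcip is this node's real address; --addr is the shared virtual IP.
# The up/down scripts attach or detach the VIP when this node's role changes.
ucarp --interface=eth0 \
      --srcip=10.0.0.5 \
      --vhid=1 \
      --pass=secret \
      --addr=10.0.0.100 \
      --upscript=/etc/ucarp/vip-up.sh \
      --downscript=/etc/ucarp/vip-down.sh
```

With identical invocations on each node (differing only in `--srcip`), the nodes elect one master, which runs the up-script to claim the VIP; the others stay in backup.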
The typical server in an OCS installation has four NICs: one (1G) for hardware management (IPMI), one (1G) for systems management (PXEbooting, chef), and two 10G NICs. In our canonical network design, one of these 10G NICs is used for intra-cloud traffic between VMs and storage resources, and the other is used for external access for VMs to talk to the Internet (or other resources outside the cloud).
Here is a diagram of the standard network layout without bonding.
This is a simple and well-understood network design, easily implemented with standard networking models that have been around for decades. But there’s another option for how OCS can be deployed: using bonded interfaces on the servers and port channels on the switches to take those two 10G NICs and make them appear as a single 20G network link, and pass both intra-cloud and external traffic across that higher-bandwidth virtual link. Many of our customers have preferred this option, which in theory provides higher burst bandwidth and greater resilience to failure of a NIC. Bonding sounds great, right?
Diagram of the network architecture with bonding.
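On the server side, bonded mode might be set up roughly like this. This is a minimal sketch assuming a Linux host with the bonding driver and iproute2; `eth2`/`eth3` stand in for the two 10G NICs and are not OCS's actual interface names:

```shell
# Create an 802.3ad (LACP) bond and enslave the two 10G NICs.
# The corresponding switch ports must be configured as a matching port channel,
# or the bond will not pass traffic.
ip link add bond0 type bond mode 802.3ad
ip link set eth2 down
ip link set eth2 master bond0
ip link set eth3 down
ip link set eth3 master bond0
ip link set bond0 up
```

Note the dependency this creates: the link is only usable once both the server-side bond and the switch-side port channel have negotiated, which matters later in this story.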
The Trouble Begins
Let me tell you a story. It’s kind of a detective story. Like everything else in OCS, we do extensive testing of our HA/failover solutions, and during such testing we discovered some odd behavior when running in bonded interface mode. In most of our tests, failover worked great. When a node failed, the other node would take over. Because everything had been replicated from the active node, no data was lost. When the failed node comes back up, it’s supposed to see the broadcasts from the existing master and join the cluster as a backup. This happened most of the time in our tests, but in a certain environment we saw the wrong behavior, where a failed node would come up and take over as master. In some cases, this could happen before replication had finished, which is obviously a big problem. After a ton of time spent debugging and a lot of red herrings, we finally figured out what was happening. If you use the default values for UCARP configurations, you get the following behavior when a node comes up and joins an existing cluster:
new node listens for 3 seconds for an announcement from an existing master
if the new node does not hear such an announcement it promotes itself to master
also important, if a master node hears an announcement from another master, it will demote itself to backup IF the other master has a numerically higher IP address
Here’s what was happening. During the boot process on the new node, it was taking several seconds (more than three) for the port channel on the bonded interfaces to be set up between the server and the switch – until that happened, each port had link, but no frames (or packets) were being passed. During this time, UCARP was starting and listening for announcements – announcements that it couldn’t see because they come over the bonded interface, which wasn’t working yet. After three seconds the node was declaring itself a master; then the port channel would finish coming up, and now both the new node and the previous master would see announcements from a second master. Because the new node had a numerically lower IP address, the other master demoted itself, and you wound up with the new node becoming master – potentially before it had replicated data back over from the previous master.
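The race is easier to see as code. This is a toy simulation of the election rules described above, not UCARP's actual implementation; the IP addresses and timings are illustrative:

```python
def elect(listen_window, link_up_delay, my_ip, master_ip):
    """Return the role a rejoining node ends up with.

    listen_window: seconds a new node listens for a master announcement
                   (3 with default UCARP settings)
    link_up_delay: seconds before the bonded link actually passes frames
    my_ip/master_ip: addresses as tuples, compared numerically
    """
    # Announcements are invisible until the port channel passes traffic.
    heard_master = link_up_delay < listen_window
    if heard_master:
        return "backup"
    # No announcement heard, so the node promotes itself. When the link
    # finally comes up, two masters see each other; the rule says the
    # master with the numerically higher IP demotes itself.
    return "master" if my_ip < master_ip else "backup"

# Unbonded NIC: frames flow immediately, so the node joins as backup.
print(elect(3, 0, my_ip=(10, 0, 0, 5), master_ip=(10, 0, 0, 9)))  # backup
# Bonded NIC behind a slow switch: port channel takes >3s, and the
# rejoining node (lower IP) wrongly ends up as master.
print(elect(3, 5, my_ip=(10, 0, 0, 5), master_ip=(10, 0, 0, 9)))  # master
```

The failure needs both conditions at once – a link-up delay longer than the listen window *and* a numerically lower IP on the rejoining node – which is why it only appeared in one test environment.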
The following diagram depicts UCARP under normal conditions.
And under failure conditions.
We never saw this behavior with unbonded interfaces, because there is no setup delay for the network in that case. The new node comes up, starts UCARP, hears the announcement from the previous master, and joins the cluster as a backup just like it’s supposed to. We also didn’t see this behavior with all models of network switches – some set up the port channels faster than others, and as long as it takes less than three seconds for the port channel to start passing traffic to the node, we see the proper behavior. We only saw it with a certain network switch that took more than three seconds, and we only had that switch in one test environment.
Bringing It Home
So back to “simplicity scales, complexity fails.” Interface bonding and port channels are newer technologies than basic switching and routing, and their implementation is more complicated on both the server and the switch sides. Because they are newer and more complex, the implementations from one vendor to another differ in significant ways (and have different bugs). In this case, the complexity added by bonding introduced a new failure mode that manifested in a way that is extremely hard to diagnose. Relying on simpler (and older) technologies can prevent having to deal with these kinds of hard-to-diagnose problems. For example, in other parts of OCS we use ECMP at layer 3 to provide HA to servers. This is a time-tested and well-understood mechanism that has been used by ISPs for decades. We’re planning on switching our existing UCARP implementations to such a mechanism in the future, for what should by now be obvious reasons.
The Moral Of The Story: Keep It Simple
The worst part about this story is that by adding something that was aimed at making the system more reliable (redundant NICs) we introduced a new failure mode (likely multiple new failure modes) that wound up making the system less reliable. This is unfortunately a common theme with HA strategies. What appears at first glance to be a great idea has unexpected (and often negative) consequences on the overall system. The best way to avoid this is to use the simplest and most time-tested strategies you can to keep your systems up and running.
We talked with Monty Taylor of HP at the OpenStack Summit in Portland. Monty is the automation and deployment lead for cloud at HP. He’s also a member of both the OpenStack Technical Committee and the OpenStack Foundation Board of Directors.
Monty leads the CI (continuous integration) project for OpenStack. In that role, he and his group have built testing systems that have made it possible for the OpenStack project to scale from a few dozen contributors for the Bexar release to more than 700 developers now pushing patches *daily* to the project.
Watch the video to learn more about:
OpenStack’s integrated code review system and gated commits
running the CI system as a single app across two public clouds, with resources donated by HP, Rackspace and eNovance
merging about 150 patches each day into the code base, and the 500+ that don’t make it