Hi, my name is Arthur de Haan and I am responsible for Test and System Engineering in Windows Live. To kick things off, I’d like to give you a look behind the scenes at Hotmail, and tell you more about what it takes to build, deploy and run the Windows Live Hotmail service on such a massive global scale.
Hosting your mail and data (and our own data!) on our servers is a big responsibility and we take quality, performance, and reliability very seriously. We make significant investments in engineering and infrastructure to help keep Hotmail up and running 24 hours a day, day in and day out, year after year. You will rarely hear about these efforts – you will only read about them on the rare occasion that something goes wrong and our service has run into an issue.
Hotmail is a gigantic service in all dimensions. Here are some of the highlights:
You can imagine that the Hotmail user interface you see in the browser is only the tip of the iceberg – a lot of innovations happen beneath the surface. In this post I will give a high level overview of how the system is architected. We will do deeper dives into some specific features in later posts.
Architecture
Hotmail and our other Windows Live services are hosted in multiple datacenters around the world. Our Hotmail service is organized in logical “scale units,” or clusters. Furthermore, Hotmail has infrastructure that is shared between the clusters in each datacenter:
A cluster hosts millions of users (how many depends on the age of the hardware) and is a self-contained set of servers including:
Preventing outages and data loss is our top priority and we take the utmost care to keep them from happening. We’ve designed our service to handle failure – our assumption is that anything that can fail will do so eventually. We do have hardware failures: with hundreds of thousands of hard drives in use, some are bound to fail. Fortunately, because of the architecture and failure management processes we have in place, customers rarely experience any impact from these failures.
Here are a few of the ways we keep failures contained:
Engineering process
I’ve talked a little bit about our architecture and the steps we are taking to ensure uninterrupted service. No service is static, however; in addition to growth due to usage, we push out updates on a regular basis. So our engineering processes are just as important as our architecture in providing you with a great service. From patches to minor updates to major releases, we take a lot of precautions during our development and rollout process.
Testing and deployment – For every developer on our staff we have a test engineer who works hand in hand with him or her to give input on the design and specs, set up a test infrastructure, write and automate test cases for new features, and measure quality. When we talk about quality, we mean it in the broadest definition of the word: not just stability and reliability, but also ease of use, performance, security, accessibility (for customers with disabilities), privacy, scalability, and functionality in all browsers and clients that we support, worldwide. Given our scale, this is not an easy feat.
And because we’re a free service funded largely by advertising, we need to be highly efficient on an operational basis. So deployment, configuration, and maintenance of our systems are highly automated. Automation also reduces the risk of human error.
Code deployment and change management – We have thousands of servers in our test lab where we deploy and test code well before it goes live to our customers. In the datacenter we have some clusters reserved for testing “dogfood” and beta versions in the final stages of a project. We test every change in our labs, be it a code update, hardware change or security patch, before deploying it to customers.
After all the engineering teams have signed off on a release (including Test and System Engineering) we start gradually upgrading the clusters in the datacenter to push the changes out to customers worldwide. Typically we do this over a period of a few months – not only because it takes time to perform the upgrades without causing downtime for customers, but also because it allows us to watch and make sure there is no loss of quality or performance.
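As an illustration only (this is not Hotmail's actual deployment tooling), the gradual upgrade described above can be sketched in Python. The function name `staged_rollout`, the `upgrade` and `is_healthy` callbacks, and the cluster names are all hypothetical; the sketch assumes that clusters are upgraded in small batches and that the rollout stops as soon as a batch fails its health check.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    clusters: Iterable[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Upgrade clusters batch by batch and stop at the first unhealthy batch.

    Returns the clusters that were upgraded and passed the health check.
    All names here are hypothetical stand-ins, not a real deployment API.
    """
    pending = list(clusters)
    verified: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for name in batch:
            upgrade(name)
        # Watch the freshly upgraded batch before touching more clusters.
        if not all(is_healthy(name) for name in batch):
            break  # halt: the remaining clusters stay on the old version
        verified.extend(batch)
    return verified


if __name__ == "__main__":
    touched: List[str] = []
    ok = staged_rollout(
        ["cluster-1", "cluster-2", "cluster-3", "cluster-4"],
        upgrade=touched.append,                  # stand-in for a real upgrade step
        is_healthy=lambda n: n != "cluster-3",   # pretend cluster-3 regresses
        batch_size=2,
    )
    print("verified:", ok)
    print("touched:", touched)
```

Keeping the batch size small limits how many customers a bad release can reach before the health check catches it, at the cost of a longer overall rollout.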
We can also turn individual features on or off. Sometimes we deploy updates but postpone or delay turning them on. In rare cases we have temporarily turned features off, say for security or performance reasons.
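One common way to separate shipping code from enabling it is a feature flag. The following is a minimal sketch under that assumption, not a description of how Hotmail implements it; the `FeatureFlags` class and the feature name `new_inbox_view` are hypothetical.

```python
class FeatureFlags:
    """Tracks which deployed features are currently switched on.

    Hypothetical illustration: a feature can be deployed (its code is live)
    while remaining off until someone explicitly turns it on.
    """

    def __init__(self) -> None:
        self._enabled: dict = {}

    def deploy(self, name: str, enabled: bool = False) -> None:
        # Deploying registers the feature; it stays off unless requested.
        self._enabled[name] = enabled

    def _set(self, name: str, value: bool) -> None:
        if name not in self._enabled:
            raise KeyError(f"feature {name!r} has not been deployed")
        self._enabled[name] = value

    def turn_on(self, name: str) -> None:
        self._set(name, True)

    def turn_off(self, name: str) -> None:
        self._set(name, False)

    def is_enabled(self, name: str) -> bool:
        # Unknown features are treated as off.
        return self._enabled.get(name, False)


if __name__ == "__main__":
    flags = FeatureFlags()
    flags.deploy("new_inbox_view")      # code is deployed, feature still off
    flags.turn_on("new_inbox_view")     # enabled later, without a redeploy
    flags.turn_off("new_inbox_view")    # switched off again, e.g. for a security fix
    print(flags.is_enabled("new_inbox_view"))
```

Treating unknown features as off is a deliberately conservative default: code that asks about a flag that was never deployed takes the old path.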
Conclusion
This should begin to give you a sense of the size and scope of the engineering that goes into delivering and maintaining the Hotmail service. We are committed to engineering excellence and continuous improvements of our services for you. We continue to learn as the service grows, and we take all your feedback seriously, so do leave me a comment with your thoughts and questions. I am passionate about our services and so are all the members of the Windows Live team – we may be engineers but we use the services ourselves, along with hundreds of millions of our customers.
Arthur de Haan
Director, Windows Live Test and System Engineering
Hello –
Thanks to all for the first round of comments. As I said in my previous post, this is the start of a two-way conversation intended to discuss how we build and operate our services. We love all the comments and suggestions and we’ll read every one. I’ve responded privately to more specific issues raised in the comments, and I will respond publicly to others, but there were some general trends that seemed quite important, and most of these relate to the purpose of this blog and what type of topics we intend to cover here. The purpose of this post is to set expectations for the responses you should expect to get depending on the type of comment that is posted.
As I mentioned at the beginning, we will “dig a little deeper into how we build our services and how they’re used worldwide” and we dedicate it to “software engineers, web industry insiders, and to our most passionate Windows Live customers.” As a result this blog is about the products and services we have in use today, as well as additional detail about new releases as we roll them out. As such, there are many types of questions this blog intends to answer, including what’s the architecture for the mail system, how many photos get uploaded every day to SkyDrive, how do we detect and prevent spam, and what are we doing about SPIM.
Most of your initial comments center around three other areas – the schedule for the “next release,” feature suggestions, and feedback on “shipping sooner.” Because these will probably come up quite frequently and have the same general answer I thought I’d take the opportunity now to write up our response and frame the general approach.
The first category covers questions about the next release of Windows Live, including its future direction and features. As a general rule, we will only discuss these as we near the release or availability of product updates. This blog is not intended to be a breaking news blog, but rather a blog that provides an engineering perspective that details the work behind the product, our implementation, and our decisions. As a result you should expect we won’t comment specifically on those questions.
The second category is feature requests. In most cases, we will not respond to these directly, and instead we will note these down and consider them alongside all the other feedback we receive. When we decide whether to fix an issue you’ve reported or to add a feature that you’ve requested, naturally we have to weigh factors like how much work it will take to get this done, how that will impact all of the other features and fixes on our current schedule, how many customers will benefit versus how many would benefit more from putting those resources into a different feature, and all the other tradeoffs that are made in the process of developing products. Often we get competing requests where some people say “add more features” and others say “keep it simple” and as a result we can’t say “yes” to everything. Our goal of course is always going to be to create the best possible set of products and services for users, but it’s always going to be a complicated equation to figure out which features get priority in getting built, and naturally, we can’t promise anything until we’ve built it and tested it, and know that it is going to work on a global scale. As one commenter pointed out (using different words), often when we respond too soon the response feels empty. So while you are free to comment with feature suggestions, expect that we will note these down as we would other suggestions and ideas and consider them with the other feedback we receive, and that our response will be coordinated with the delivery of updated software and services.
A third category of comments relates to ship schedule and shipping more frequently. As with any project, there is a balance between frequency of updates (how often we release), size of updates (how much change we release at one time), and retraining (how much existing customers have to learn with any change). Each project or team has its own balance. Some products ship “every day” and make small changes; others take longer to ship and make larger changes. And often, projects change based on the needs and requirements of customers. In the end we need to balance all of these into our overall schedule, including what we intend to accomplish, how long we think it will take, and the expected customer benefit. There are certainly folks (and many commented on this blog) who would like to see us ship “sooner” and “change more.” There are others that we hear from in other forums who “don’t like change” and want us to “keep things as they are.” And then there are the questions of which features we pick and how long those will take to be delivered with quality. In the end, I’ll simply say that we are generally happy with our release rhythm and we recognize as well that our customers and competitors continue to innovate, which increases the importance of planning well.
I hope this helps to frame the blog and our goals in this discussion. Thanks for taking the time to read and comment.
- Chris
Engineering is a process, with trial and error, analysis, weighing of pros and cons, planning for the unexpected, and discovering unexpected issues along the way. It’s exciting precisely because we’re always learning as we go.
I want to welcome you to our new blog, which is about just that: the engineering behind Windows Live.
Over the last year, we’ve consolidated our blogging efforts for all of the different Windows Live teams into a single blog, Windows Live team blog (or “Windows Live Wire”), so you wouldn't have to chase all over the web to find out what we’re up to and what’s new in our products. But as we’ve brought the different blogs together, some of you let us know that you wanted to see more details about not only what we’re building, but why and how.
This blog, Inside Windows Live, is where we’ll do that.
The posts here are intended to complement those on the Windows Live team blog, which will continue to provide Windows Live customers with essential news and information about using our products and services.
The new blog, on the other hand, will be dedicated to software engineers, web industry insiders, and to our most passionate Windows Live customers, those who want to dig a little deeper into how we build our services and how they’re used worldwide.
We’ll start by giving you the current state of our software and services, including Hotmail, Messenger, SkyDrive, and our Essentials suite of client software. We’ll share with you how we build and operate our services, explain what’s going on when there are service interruptions, and talk about how we see people using our services worldwide. As we release new or updated products, we’ll provide an inside look into the changes we made and why we made them.
But we won’t just be telling you what we think. We’ll also be asking you what you think. We strongly believe that success for Windows Live must include an open and honest two-way discussion about how we operate and design our products in order to balance the different interests of customers and partners who rely on us every day.
We’ve created this blog with that two-way conversation in mind. We decided to host Inside Windows Live here on The Windows Blog in part to reflect the great synergy between Windows Live and the new Windows 7 operating system, and in part because this site gives us better options for monitoring and gathering your feedback—via comments, direct mail, and even hosted IM conversations. Over the next few months you will also see this site become integrated with Windows Live ID, giving you even more ways to interact with us on Windows Live.
I’ll be blogging here regularly, along with the lead engineers on my engineering team, who will be able to give you deeper insights into each of the products they work on. Because each blog post is just the start of a conversation, you’ll see us respond directly to comments, and follow up on other comments with new blog posts.
In short, we will take the blog where you want to take it—so if you have questions or topic suggestions, please leave a comment! We’d like to thank you for your interest in Windows Live, and we’re looking forward to getting a good discussion going with you in the next few posts.
Chris Jones
Corporate Vice President, Windows Live