I recently went to the SharePoint Conference and attended several great sessions. I am going to put up some of my notes from the conference, as there was a ton of great information for the 10,000 people who attended. I openly admit that I am a little focused on SharePoint Online because I am constantly talking with customers about all our solutions available in Office 365. Also excuse the grammar – I am just trying to get content up as quickly as I can…
1.1 Cloud First and Aligned Management
From a SharePoint perspective, Office 365 has really driven Microsoft towards building solutions that are more scalable and manageable. For instance, in SharePoint 2010 we were given this “new SharePoint Federated Services model” that not many organizations stood up on-premise; however, it is heavily used in SharePoint Online. Now with SharePoint 2013 we can really see that everything is being built for a cloud environment. Take SharePoint 2013 upgrades, wow. If you look at the way we will be doing upgrades from SharePoint 2010 to 2013 you are going to say “this is so much more well thought out now”. Yes, we have lessons learned since the product has been around since 2001, but the cloud has really driven Microsoft to deliver better solutions. Why? Because Office 365 delivers a financially backed SLA with a promise to keep customers moving forward on the latest and greatest solutions in the cloud. We will not get stuck on an old version, and this has forced Microsoft to deliver even better upgrade capabilities.
In this blog I am going to be capturing some information from two specific sessions I attended on “How We Do It” for SharePoint Online. Throughout these sessions it really resonated with me how we were changing the architecture of the SharePoint product to be cloud first. The Microsoft SharePoint Product Group is the same group of people supporting SharePoint Online. I will talk about this later in the blog, but this is a big deal as it really demonstrates Microsoft’s commitment to having people, process and technology strategically aligned.
1.2 Sessions on How We Do It at SharePoint Online
There were two amazing sessions at the SharePoint Conference. One was called Operating SharePoint Online and the other was called Building and Managing SharePoint Online. If you are a SharePoint developer you may not have attended these two sessions, but they will blow your mind. If you have ever managed a production SharePoint farm, you will really appreciate what they have done, and not just from a physical and logical architecture perspective: there is program management and governance built into Office 365 that, frankly, organizations have a very tough time building and delivering on-premise. This is driven by the 99.9% financially backed SLA that Microsoft delivers with Office 365.
1.3 Session on Operating SharePoint Online
Here are some of my notes (formalized a little) from the presentation.
Some Current Stats - Microsoft has invested over $3.28 billion in the data centers that support Office 365. This really demonstrates Microsoft’s commitment to the cloud. At the time of this presentation, in support of SharePoint Online, there are currently more than 13,000 servers with over 37,000 SQL servers in the cloud data center. They indicated they are currently bringing on 30,000 companies a week! They have a 24/7 development staff. Plus they have actually maintained 99.95% YTD delivery of service for SharePoint Online, basically beating their stated SLA. This is absolutely amazing. When you hear stats like this and you have done SharePoint administration, it really makes you understand the value that Microsoft is delivering to your organization. Stop and think for a second about all the people, business processes, governance, management, etc. that are needed to deliver this.
Goals – The SharePoint Online Operations team discussed their focus on several things: zero downtime, zero loss of data, always up to date, and security/compliance. These are all things that organizations deploying SharePoint themselves try to adhere to, but they are very challenging to implement because they can require serious investment in people, process and technology. It is not simply a matter of deploying some SharePoint servers.
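To put the 99.9% SLA and the 99.95% year-to-date figure in perspective, here is a quick back-of-the-envelope sketch in Python. It assumes the percentage is applied over a full 365-day year, which is my own simplification; the service descriptions define how the SLA is actually measured. The function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (an assumption)


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget implied by an availability percentage."""
    return period_hours * (1 - availability_pct / 100)


for pct in (99.9, 99.95):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% availability -> {hours:.2f} hours of downtime per year")
```

Under that assumption, 99.9% leaves a budget of about 8.76 hours a year, and running at 99.95% means using roughly half of it.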
Zero Downtime – During the session they spent a lot of time discussing this.
- In order to support this, the SharePoint Online team is constantly monitoring from multiple different angles. SCOM is highly utilized in support of this activity. They specifically implement rather comprehensive scenario-based monitoring, so it is not just checking server machine status. As part of this they do a lot of live traffic monitoring and watch for patterns.
- They stated that one of the biggest reasons for their success thus far is their alignment to the SharePoint Product Group. They have direct integration with the people who write the actual code for SharePoint, and these same people have direct responsibilities to support SharePoint Online. As I mentioned earlier, it is this sort of alignment which drives great delivery, as there is direct access to the people who wrote SharePoint to support SharePoint Online.
- They also stated that even though they have access to the people who built SharePoint Online, from a support perspective they have a goal to “automate everything”. Really this is the only way they can ever scale and they have demonstrated that with the level they are currently delivering at.
- They stated they are doing close to 172 million probes per month to make sure there are no issues. They stated that this can result in roughly 600,000 anomalies a month that SCOM may identify. Through correlation systems, they are able to identify roughly 200 escalations a month they may need to deal with. This is pretty amazing when you look at the number of probes and the amount of automation they put in place to discover and automatically resolve issues. Plus they continually find ways to reduce this.
- When issues are discovered, they have an entire automated system that will communicate with engineers, manage workflows and tasks, and proactively initiate meetings between the responsible engineers. The system even provides a full report and a list of past resolutions on how to immediately resolve the issue if it is something that has been encountered before.
- Additionally they talked a little about an internal solution they created with Microsoft Research that can parse ULS logs. When I have had to debug SharePoint on-premise production issues in the past I had to work with ULS logs, which is never a fun task. However this tool provides a dashboard, drill-down capabilities and pattern analysis across every ULS log across the entire SharePoint Online cloud. It is impressive.
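To get a feel for how much the automation filters out, here is a small Python sketch of the monitoring funnel using the monthly numbers quoted in the session. The variable and function names are mine, and treating every non-escalated anomaly as "handled without a human" is my interpretation of the notes, not something the presenters stated in those terms.

```python
# Monthly figures as quoted in the session (approximate).
PROBES = 172_000_000
ANOMALIES = 600_000
ESCALATIONS = 200


def funnel_summary(probes: int, anomalies: int, escalations: int) -> dict:
    """Compute simple ratios between the stages of the monitoring funnel."""
    return {
        "anomaly_rate_pct": 100 * anomalies / probes,
        "escalation_rate_pct": 100 * escalations / anomalies,
        "anomalies_per_escalation": anomalies // escalations,
        "probes_per_escalation": probes // escalations,
    }


summary = funnel_summary(PROBES, ANOMALIES, ESCALATIONS)
for name, value in summary.items():
    print(f"{name}: {value:,.3f}" if isinstance(value, float) else f"{name}: {value:,}")
```

With these numbers, only about one anomaly in 3,000 (or one probe in 860,000) ends up as an escalation for an engineer.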
Zero Loss of Data – Here is how a document ends up being stored in multiple places:
- The document is first stored in the content database associated with the site, so that is one place it is stored.
- Second, all the SQL databases are using RAID 10, so there is an immediate duplication.
- Third, there is synchronous SQL Mirroring built up to a DR SQL server in the immediate data center, so that is 4 copies of the file.
- Fourth, there is asynchronous log shipping from the primary cloud data center to the secondary data center. So that is roughly 4 additional copies of the file into the secondary data center.
- Fifth, there are scheduled backups at the primary data center and then asynchronous replication of those backups to the secondary data center.
On top of all this, remember users have the Recycle Bin to recover items that they may have deleted. Note I re-checked the service descriptions and they state the Recycle Bin will keep deleted items for 30 days and backups are stored for 14 days. Also note that with SharePoint Online, the Recycle Bin can also be used to recover objects such as sites and even site collections (through tenant administration).
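The copy counts above can be tallied with a tiny Python sketch. It assumes RAID 10 doubles each database, synchronous mirroring doubles it again, and the secondary data center has the same layout as the primary (which is how I read "roughly 4 additional copies"). Backups and the Recycle Bin are not counted, and the function name is hypothetical.

```python
def live_copies(raid_factor: int = 2,
                mirrored_sql_servers: int = 2,
                data_centers: int = 2) -> int:
    """Count live copies of a document across disks, SQL servers and data centers."""
    return raid_factor * mirrored_sql_servers * data_centers


primary_only = live_copies(data_centers=1)  # RAID 10 x synchronous mirror
both_centers = live_copies()                # plus log shipping to the secondary

print(f"Copies in the primary data center: {primary_only}")
print(f"Copies across both data centers:   {both_centers}")
```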
The Operations team stated that Disaster Recovery for them is a hot standby where data centers are always paired with each other. They adhere to an Active-Passive farm set-up with automated failover using DNS. They do tons of monitoring, testing, and production failover tests. They stated specifically that each data center is production; there is no such thing as one primary data center taking all the traffic while a secondary is just sitting there waiting. They ensure that all data centers are performing primary workloads, and if there ever is a disaster, they would just re-distribute that workload across the data centers. They stated that as part of this they demand resiliency at both the hardware and software layers. They also indicated that for SharePoint Online they have never really run into the situation where a whole data center is gone. More realistic scenarios are connectivity issues or something to that effect, where they will do a DNS flip and keep operations going.
Always Up to Date – Again, one of the biggest reasons why customers want to move to SharePoint Online is to ensure they are always up to date with the latest SharePoint software, plus all the security and feature patching that is provided to ensure the best, most secure user experience is being delivered. The SharePoint Online Operations team discussed some of the change management and governance they implement to support this for their customers. They need to make sure that security patches, platform upgrades, escalation responses and the latest and greatest features are deployed.
Doing this across a large cloud environment requires a significant amount of automation, and they built internal tools that orchestrate these changes. There is a Change Manager application that manages all the physical and virtual machines. The manager knows the state of every machine, its patch level and how it is being utilized, and has deep logic to know how to apply patches based on scenarios. Plus, the VMs (where the SharePoint server roles for a SharePoint farm run) are not all located on the same physical servers. VMs are deployed across multiple physical machines and “availability groups” are created so that when a patch is run, it is executed one availability group at a time to ensure there are no performance issues during patching. The manager handles lock management across VMs and SharePoint farms. They stated they do patching roughly every two weeks worldwide, but this could be more frequent depending on the need.
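Here is a minimal Python sketch of what patching "by availability group" could look like. This is not the actual Change Manager logic, which was not shown in the session. The `VM` class, the role names and the safety check are all my own assumptions, meant only to illustrate the idea of taking one group down at a time while other groups keep serving.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    role: str                # hypothetical role label, e.g. "WFE" or "APP"
    availability_group: int  # VMs in one group may be patched together
    patch_level: int = 0


def patch_by_availability_group(vms: list, target_level: int) -> list:
    """Patch one availability group at a time; return VM names in patch order."""
    groups = defaultdict(list)
    for vm in vms:
        groups[vm.availability_group].append(vm)

    patched_order = []
    for group_id in sorted(groups):
        # Safety check (assumed): each role in this group must still have
        # at least one VM in a different group to keep serving traffic.
        for vm in groups[group_id]:
            if not any(o.role == vm.role and o.availability_group != group_id
                       for o in vms):
                raise RuntimeError(
                    f"Patching group {group_id} would take role {vm.role} offline")
        for vm in groups[group_id]:
            if vm.patch_level < target_level:
                vm.patch_level = target_level
                patched_order.append(vm.name)
    return patched_order


farm = [
    VM("wfe-1", "WFE", 0), VM("wfe-2", "WFE", 1),
    VM("app-1", "APP", 0), VM("app-2", "APP", 1),
]
order = patch_by_availability_group(farm, target_level=1)
print(order)
```

In this toy farm, group 0 (wfe-1 and app-1) is patched first while group 1 keeps serving, then the roles swap.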
The Operations team also noted that changes are not rolled out whenever they feel like it :) There is a phased roll-out process including a change approval board which analyzes every proposed change. They have an automated, multi-step process with numerous environments where they test these changes before they ever go into production. The SharePoint Online Operations team even said “we eat our own dog food” by pushing all completely vetted patches into Microsoft’s own corporate SharePoint Online production tenant before they go to customers. SharePoint Online is highly utilized by Microsoft employees.
Secure and Compliant – The final goal they discussed was security and compliance.
- First there was a good discussion on how they are fully patched 100% of the time. They have a team of security specialists (they joked hackers) whose job it is to continually search and test for vulnerabilities.
- Security by Design was something the team stressed. Role-based access is required at all times, regardless of the task or operation at hand. If there is an operation that must be done by a human, then there are secure consoles that are provided based on your role. Plus permissions are managed using an on-demand access model. They said almost no operations require admin access levels. They also stated the operations people do not need access to customer data to perform the tasks they need to complete; they work with system logs and such. If support needs to work with customer data, that would be done as part of a customer request. The goal is to be extremely respectful of customer data. They even discussed that for the US Government cloud, personnel must be US citizens.
- They discussed something I talk about a lot: Office 365 support for compliance through audits, such as ISO 27001, EU Model Clauses, HIPAA, FISMA, etc. This is the only way to scale, and Microsoft has demonstrated that it adheres to most of them.
- One last thing they discussed is that they always assume there could be a breach. This is basically to ensure that they are always proactive, checking, monitoring and improving. To assist them with this they actually have a Big Data solution (which is compliant with all our standards and scrubs out PII) that consumes log data so they can do proactive searching and security analysis. For instance, they said SharePoint Online today generates roughly 2TB of ULS logs per day (that is amazing). They scrub and then push this data into the system, and they can search, for instance, SharePoint correlation logs going back three or more years in less than a second.
1.4 Session on Building and Managing SharePoint Online
The second session I sat in on and took notes from covered how SharePoint Online is built and managed. In this session they discussed at length how SharePoint farms are provisioned.
Layers of Office 365 – They had a good discussion on how they logically break out the layers of Office 365.
- Office 365 Portals – This is the sign-up experience and the tenant administration services that allow customers to manage purchased services.
- Office 365 Platform Services – This is made up of Commerce / Billing, Identity Platform, authentication, and DNS.
- Office 365 Services – These are the services that you know and purchase today – SharePoint, Lync, Exchange and Office Web Apps.
Layers of SharePoint Online – They then broke out the layers of SharePoint Online as being three core layers:
- Physical – this is all the data centers, machines and physical networks that are used to support SharePoint Online.
- Virtual Machines – they then discussed how Hyper-V was central to their delivery strategy. They also discussed how they break out units of scale by “networks”. Now the term network does not really mean what you normally think. Let’s come back to that a little later.
- Services – they noted that every service that runs in SharePoint Online has a 1+ redundancy strategy. There are thousands of services that are running and everything must be integrated.
- First they have a network. On that network they have a lot of common services that are available, for instance AD synchronization, provisioning services, SCOM, DNS, administration, back-up, etc.
- Then within each network they create what they call a stamp. A stamp is a set of SharePoint farms that customers are brought into. First, within the stamp they have a SharePoint Federated Services farm. This was introduced in SharePoint 2010 as a way to create scaled-out services for such things as search, the managed metadata service, etc. The second farm in the stamp is the SharePoint farm itself, including all the WFEs, crawl WFEs, app servers, timer jobs, sandboxes, etc. They said this will usually be around 10 or more SharePoint servers. The third farm is a SQL Server farm. Finally there is a local Active Directory with accounts for the customers who have been provisioned to that stamp. Remember this could be a mixture of cloud-based IDs or federated IDs from on-premise. Once a stamp is built, there will be a second identical stamp set up on the network. They stated that each one of these stamps could support roughly 100,000 users.
- Third they discussed this component of Office 365 called the Grid Manager. This is the component of SharePoint Online that is responsible for basically running, coordinating and automating almost everything. Then there are other services such as the Global directory, tenant administration, commerce backend, DNS, authentication, incident management, Azure service and CDN services.
Provisioning Process – The operations team then discussed at a high level how the Grid Manager would provision a new stamp. Many of these are operations that SharePoint administrators do themselves, but here it is completely automated. For instance, they have steps such as bringing in the standard VMs, deploying the local AD and SQL farms, creating the federated services farm, then the content management farm, then post-deployment patching of VMs and SharePoint, etc.
Provisioning New Customers – The operations team then had another interesting discussion on how they provision customers based on the layered architecture they described earlier. They also gave some interesting stats: they onboard roughly 30K new tenants a week, or roughly 4K new tenants a day. They then discussed some of the rules that determine which network and stamp a customer is provisioned to. The Grid Manager basically has tons of factors that it evaluates as part of that, such as geography, capacity of existing farms, operational activities currently occurring within a stamp, tenant version (is it primarily a SP 2010 or 2013 farm), and dependency of services (for instance a government customer will go into a government network and stamps). Once that is done there is a whole other set of provisioning services that are responsible for setting up the initial site collections for the customer, creating DNS entries, creating user groups, etc. They even discussed how they have become pretty smart about pre-provisioning tenants in advance and then just adjusting them as customers come into the service, to be even more efficient with delivery.
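To make the placement factors above a bit more concrete, here is a rough Python sketch of how a stamp could be picked for a new tenant. The real Grid Manager logic was not shown, so everything here is hypothetical: the `Stamp` and `TenantRequest` classes, the field names, and the tie-breaking rule (pick the eligible stamp with the most free capacity) are my own assumptions. Only the factors themselves and the roughly 100,000 users per stamp come from the session.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stamp:
    name: str
    network: str            # e.g. "commercial" or "government"
    geography: str
    version: str            # primarily "2010" or "2013"
    capacity_users: int
    current_users: int
    maintenance_in_progress: bool = False

    @property
    def free_capacity(self) -> int:
        return self.capacity_users - self.current_users


@dataclass
class TenantRequest:
    name: str
    geography: str
    version: str
    seats: int
    network: str = "commercial"


def choose_stamp(request: TenantRequest, stamps: list) -> Optional[Stamp]:
    """Filter stamps on the factors from the session, then pick the emptiest one."""
    candidates = [
        s for s in stamps
        if s.network == request.network          # dependency of services
        and s.geography == request.geography     # geography
        and s.version == request.version         # tenant version
        and not s.maintenance_in_progress        # operational activity
        and s.free_capacity >= request.seats     # capacity
    ]
    # Assumed tie-breaker: most free capacity wins; None if nothing fits.
    return max(candidates, key=lambda s: s.free_capacity, default=None)


stamps = [
    Stamp("stamp-a", "commercial", "NA", "2013", 100_000, 95_000),
    Stamp("stamp-b", "commercial", "NA", "2013", 100_000, 40_000),
    Stamp("stamp-c", "commercial", "NA", "2013", 100_000, 10_000,
          maintenance_in_progress=True),
    Stamp("stamp-gov", "government", "NA", "2013", 100_000, 5_000),
]
chosen = choose_stamp(TenantRequest("contoso", "NA", "2013", seats=8_000), stamps)
print(chosen.name if chosen else "no stamp available")
```

In this example stamp-a is too full, stamp-c is busy with maintenance and stamp-gov is on a different network, so the tenant lands on stamp-b.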
Upgrades – The operations team had a very interesting discussion on this but my next blog will be focused on that with notes I captured from another session. Will post a link here once I have that done.
You can draw a ton of conclusions from this. The point everyone should take away is that building this on your own, even if it is nowhere near as automated as SharePoint Online, is a major task for many organizations to take on. Why? Organizations are in the business of providing goods and services. Even though organizations create IT groups in support of their mission, it is really hard to justify this level of automation and management for an organization that may have just a 12 server farm on-premise. The value of SharePoint Online is that your business can focus IT resources on building solutions versus running them.
1.6 Additional References
There is more information in the service descriptions. In this case read the SharePoint Online, Security and Continuity, and Support Service Descriptions and you will see how all this information plays into supporting them.