Friday, January 18, 2008

Demystify MOSS Content Database Sizing

1) Introduction

A common question that has arisen how big content databases should be and what is the governance around the management of those databases. Hopefully this will demystify a lot around content database sizing for MOSS. Much of this information I have gathered in trying to gain my own understanding of the topic. I will discuss considerations, assumptions, various best practices, how to use central administration functionality, managing growth, recommendations, etc. One conclusion is not all of this can be done out of the box and tools like AvePoint DocAve and K2 BlackPearl can be used to create governance around an enterprise deployment of MOSS.

2) Content Database Sizing
MOSS has no restrictions on the size of content databases. Many available resources available state that a content database should range from 30 to 50GB but this guidance originated from WSS v2/SPS 2003 in regards to quick recovery of documents and site collections. If configured correctly, WSS v3 and MOSS 2007 implementations can have content databases that can handle terabytes of data. [1] Still the ability to quickly backup and restore content database is important. Most state that a best practice is keep content databases capped at 20 GBs of data to support backup/restore requirements. Tools like AvePoint can be purchased that can alleviate this issue however both the logical and physical architecture of the SharePoint implementation should be evaluated.

When evaluating the size of a SharePoint content database things like high availability, disaster recovery and backup methods must be understood. [2] The size of the database should always be discussed with database administrators. Knowing their preferences and what their environment can support is critical to understanding how many and acceptable sizes for the content databases that can be manage moving forward.

It is important to know how the database administrators strategy for managing back-ups (clustering, database mirroring, log shipping, etc.). For instance if mirroring is used (new to SQL 2005) a higher availability could be achieved but there is a higher operational cost and learning experience associated to it. Using SQL Clustering is the most common solution offering greater processing power and scalability but uses shared storage. A log shipping strategy could be used with either clustering or mirroring but things such as latency between publishing and subscribing servers should be considered. [3] This may or may not affect your end decisions but is value information to know.

In the end there are two architectural questions that need to be answered are:

  • How large to make the content databases?
  • How many content database instances should be created?

3) Types of High Capacity MOSS Sites

MOSS supports high-capacity document storage which can allow for the storage of a 5 million documents with a document library. Performance of a site containing a large document library can degrade depending on how the content is used. Microsoft has provided some guidance to help support the design of large-scale content management solutions. Design decisions and governance made around the capacities of site collections, sites, and libraries in MOSS must be taken into account not only for the physical storage constraints but also the content usage and viewing patterns of the users.

Microsoft classifies three types of large scale content management databases. [9]

  • Large Scale Authoring Environment: Contains a library within which users actively edit 50,000 or more documents across 500 or more folders. There is heavy collaboration on documents, versioning, workflows, custom content types, etc. The database could be as large as 150GB of data. [9]
  • Large Scale Content Archive: A large-scale archive is a repository where users view or upload documents. Little to no collaboration occurs on the site. These sites can be classified as a knowledge base or records management solution. A knowledge base site would typically have a single version of most documents, containing around1,000,000 or more documents. The content would be stored in a single database as large as 400 GB. On the other hand a records center site template features and functionality that are used to manage the retention and disposition of company records. Records are can be described as evidence of activities or transactions performed by the organization and that must be retained for some time period for compliance or regulatory requirements. [9]
  • Extremely Large Scale Content Archive: Microsoft describes this as an archive might contain up to 10,000,000 documents distributed across 5,000 or more folders with a database larger than 3TB. There could be 50,000 or more users primarily browsing content by searching. [9]
Personally I have not run into any scenario yet where MOSS has had to go to that scale.

4) Database Sizing Considerations
For large content databases understanding the topology of the website and the organization of the content will help drive how many site collections and content databases are needed. A single large content database can provide a better overall picture of all content and the ability to aggregate data. They typically allow for easier backup/restore policies as there is only a single content database. Putting content that is highly related to each other into the same site collection allows administrators to focus on a single content database. [4]

The following are some performance recommendations that have been made by Microsoft. Before going down the path of using a few large content databases there are a few things that should be considered. First is the potential performance implication if the site collection will host a large amount of content (either at the root level or in the sub webs). Typically, performance will become progressively worsened as the number of total objects exceeds 2,000 (sites, libraries, documents, etc.). [4] The following are recommendations provided by Microsoft for improving performance at each level of storage when high volumes of content are being stored. [9]
  • Site Collection: 2,000 subsites of any site is the recommended limit.
  • Site: 2,000 libraries and lists is the recommended limit.
  • Library: 10,000,000 documents is the recommended limit. 2,000 items per view is the recommended limit.
  • Folder: 2,000 items per folder is the recommended limit.
Delete content is can become problematic as the size of the content database grows significantly. It is possible that SQL blocking can occur when a large number of items are deleted because the high number of transactions occurring on the SQL database server during a delete operation. Additionally, sites that will have high amounts of concurrent users actively collaborating can also have negative implications as SQL transaction logs can grow rapidly creating challenges for backup/recovery. [4]

If a number of site collections can be accommodated with multiple content databases to manage them some benefits can be realized. For instance content can be managed by business line and usage of analysis reporting can be more targeted which can drive governance, manageability and metrics which can be used to gain insight into the business. Even though content would be broken apart over site collections aggregation of data can still be achieved by creating an aggregation portal. [4]

An aggregation portal would be a top level site collection that has the links to all of the site collections below it. There is currently no out of the box way to elegantly link site collections together. The aggregation portal will have some custom code built to make it aware that there are site collections below it and provide navigation functionality to them. The aggregation portal will more than likely require views of data across all site collections. This can be accomplished by using custom web parts written to retrieve data this data and present it. Another option for an aggregation portal is a records center where content is published it.

Joel Oleson, a SharePoint MVP, specifically notes that "in WSS 3.0/MOSS 2007 I recommend you pick a maximum site collection quota of no larger than 15GB, excluding the top level site collection and to move sites to dedicated databases where they need to grow beyond this. SharePoint content databases can scale to hundreds of GB and SharePoint farms can handle TBs of content. These recommendations are based on my experience with MS IT and in conversations with product support, and your experience may vary." [10]

This statement is consistent with other colleagues that I work with that have been confronted with this challenge.

5) Facilities to Manage Content Database Sizing
Content databases are databases used to store all MOSS content. By default a single content database will be created for each web application. As new site collections are added to a web application the content for that site collection will be stored in the database assigned to the parent web app. Central Administration site can be used to add new databases.

Within central administration there are is some basic information indicating the database location, the account used to make a connection to the database and capacity setting which control how many sites can be added to the content database. Using the capacity setting can help control the size of the databases from a governance perspective. If the site capacity is low, more content databases will be required to provision new sites in a site collection. Now this does not control the actual size of the database nor control the amount of content that is in a specific site. But if it can be planned on large "typical" sites within the site collection are, when certain "thresholds" are exceeded administrations can be warned.

Site quotas should be created to help with the governance of the actual size of the content databases. First a site quota template should be created that specifies that when a site collection reaches a specific size the site collection administrator will be notified.

Then when a new site collection is created the quota template can be selected.

It is possible to change the site quota template after a site collection has been created. It is possible to assign unique quota size values to specific site collection. As well, locks can be placed on the site collection to block new content from being added. This allows an administrator to take action before the site collection reaches the maximum size.

As well SharePoint Central Administration provides workflows that manage content that is not being actively used. This is turned on at the site collection level and provides a proactive way to delete web content that is no longer in use. When turned on SharePoint will email the site collection administrator if the content is still being used. If the confirmation remain un-challenged by the site collection administer the site collection will be removed. This functionality would be used in collaborative environment where team, project and workspace sites are heavily used and not needed to be retained (i.e. never to be used with a records center).

6) Managing Growth
As a site collection grows to a point where performance suffers administrators can break content up using the stadmn operations to manage the fastest growing sites into their own site collection. To support the possible partitioning of content across site collections the topology of content must be considered to it is agile. This can be as simple as having capturing guidance on should be considered a top level site such that it can be moved easily.

Note that site collections, even defined within the same web applications, are not linked to each other by using the out of the box portal template. The following screenshot shows a screen where an administrator can create a new site collection. In the Web Site Address section there are two options by default. The first is a "/" which signifies the site collection being created will be assigned to the root. The second is "/sites/" is a managed path. Managed paths in MOSS are used to provide a location to where site collections are hosted. This is a different definition from what they were in previous versions of SharePoint.

There are two types of managed paths, explicit and wildcard inclusion which can be used to create a hierarchy of site collections. It would be possible to have a top level site collection then create a "/dept/" wildcard managed path under which site collections like "/dept/hr/", "/dept/fin", "/dept/it", etc. are placed. However the out of the box site templates do not have the intelligence built into them to instinctively know that first the site collection is within a managed path and second there is possibly a parent site collection. The left navigation, breadcrumb, top right parent link, etc. will not reference the parent site collection and the user will not be able to navigate up or between site collections.

A work around for this issue would be to customize the site collections to have a link to the parent site. There is a MOSS configuration to assign an address to the site to site collection but the link is not managed. If the address were to change it would have to be manually updated everywhere. A better option would be to create a custom site definition and make it configurable to point to a parent site collection. Even another option would be to create a custom Feature which when turned on will make a site collection aware that is has parent and provide links to navigate back to it.

Note that enterprise search supports crawling content across site collections so there are no issues other than managing search. It is not trivial to just re-index content at drop of a hat so understanding how users will be searching across site collections is very important.

Repartitioning a Content Database

If the situation occurs when a content database has grown to the point where it has reached its quote limit, a new content database can be created and site collections can be moved to the new one. [9] The following steps can be used:

  • Lock down the site collection that will be moved so that no users can update the site collection.
  • Make sure that no one else will create any new site collections while this is being executed if the intent of the new content database is to be fully dedicated to the site collection to be moved.
  • Backup the site collection using the stsadm tool.
  • Then go to Central Administration and create a new content database in the web application where the site collection currently resides. Make sure to set the maximum number of sites to a very high number like 50,000 way more than the others.
  • Next backup the content databases just in case any issues could arise. This could be done through SQL Server.
  • Delete the site collection through Central Administration.
  • Restore the site collection using stsadm and it will be moved to the new site collection.
  • In Central Administration set the maximum number of site collections to 1 for the new content database if it is to be dedicated to a single site collection.

Assign a Site Collection to a Content Database

It is possible to designate a specific content database for a new site collection. The following steps can be done to accomplish this:

  • Use the Stsadm command-line tool to create a site collection in a specific database.
  • Dedicate a database to a single site collection by applying the following database capacity settings: Number of sites before a warning event is generated = 1 and Maximum number of sites that can be created in this database = 1
  • Add a group of site collections to a dedicated database by performing the following steps:
    • Within the Web application, create the database and set the database status to Ready.
    • Set the status of all other databases to Offline. While content databases are offline, new site collections cannot be created. However, existing site collections in offline databases are still accessible for both read and write operations.
    • Create the site collections. They are automatically added to the database.
    • Set the status of all other databases back to Ready.

Note that tools such as AvePoint DocAve provide comprehensive functionality to allow administrators to backup/restore content at a grandular level. I highly recommend in investing in these sort of tools.

7) Monitoring Best Practices

Content in site collections need to be continually reviewed by designated owners. Typically a "librarian" role needs to be created where this person's job is to review the content and ensure the content is well organized. If a succinct plan for maintaining site or sites is developed from the beginning they will stay current, useful, and usable.

"Plan for site maintenance and management" by Microsoft provides a great set questions that should be asked on a regular basis: [13]

  • Discuss with users in their sites and perform surveys to determine what users.
  • Make sure logging and reporting are used to determine which sites are used the most and map that back to user survey results. This will help you determine if content should be preemptively partitioned.
  • Archive or delete obsolete content, sites, etc.
  • Regularly review user site permissions and remove permissions to users whom no longer need access. Workflows can be created to manage this (K2 BlackPearl).
  • Have a plan for regular backups of site content, determine how often backups will be made, and the process for restoring content.
  • Ensure that site collections are being used correctly and content is being grouped into the correct site collection. From what I have seen this is extremely important and having librarians whose job is to manage this content is extremely important.
  • Creating too many sub-sites within a site collection will affect performance and usability. Microsoft recommends a limit of 2,000 sub-sites be in any site collection. If you have more than that, you should put in policies, procedures and a topology design to move this content and organize it better.
  • It is highly recommended that if a site collection will be extremely active just move it into its own site collection.

8) Recommendations

The following are some final recommendations and conclusions:

  • Utilize DocAve - The usage of DocAve has a lot of functionality to better allow administrators to manage content. Because of this the logical architecture site collections can be organized to support larger content databases making management less complex implementation.
  • Adaptable Architecture - Create an adaptable environment that can be scaled both logically and physically. The content placed in site collections should be organized that if content will grow over a threshold the content be partitioned across many content database and site collections.
  • Use Content Quotas - A good practice is to set a maximum site collection quota template that can be supported using SharePoint administration tools (stsadm). With the usage of a more robust back/up and restore tool like DovAve, the of setting quota to 20 GBs is less about supporting the stsadm tool versus knowing which content databases are growing more quickly.
  • Aggregate Portal - Create an aggregation portal solution if it is known that certain areas will grow exponentially.
  • Dedicated Content Databases - Create dedicated site collections and content databases for very active sites.
  • Agile Environment - Design the topology of site collections such that if the size becomes an issue a based predetermined thresholds it can be moved without adversely affecting organization of the content.
  • Audit Content Regularly - Perform regular reviews of site content using standardized criteria. Ensure to used the reports that are available in Central Administration.
  • Site Creation Governance with K2 BlackPearl - Limit site collection creation to a unique administration group to provide oversight and management of the environments content. Using workflow products like K2 BlackPearl can assist with creating automated processes for this. Make sure to hide site collection templates making them unavailable for self-service site creation. It is also suggested that standard site templates be designed or created.
  • Content Lifecycle Plan for Content - Establish a life-cycle management plan for sites, documents and content. Usage reporting can be used to help determine the life-cycle which may not be immediately apparent. Life cycle management can be tuned according to individual growth, retention, and data management planning. Again automated workflows using K2 BlackPearl can used assist with this management.

9) Resources

  1. SharePoint Content Database Restrictions... There are none!
  2. How large for a single SharePoint content database?
  3. Cluster or mirror?
  4. Site Collection Sizing Considerations
  5. MOSS 2007 - Import/Export Lessons Learnt
  6. Site Quotas
  7. Create quota templates (Office SharePoint Server)
  8. Logical architecture model: Corporate deployment
  9. Plan enterprise content storage
  10. Tips on Site Collection Sizing
  11. MOSS 2007 - Import/Export Lessons Learnt
  12. Export/Import traps for young players
  13. Plan for site maintenance and management (Office SharePoint Server)
  14. Can SharePoint Backup Harm Your Business
  15. Microsoft Office SharePoint Server (MOSS) 2007 Backup Strategies


The Wilk said...

Nice Work! Thank you for posting this, it was very helpful in answering my questions regarding Content Databases and other subjects for MOSS Architecture and Administration. And I bet you thought people don't read this stuff. Heheh...

RB said...

thank you so much for posting this, amazing work, i especially like the 'Assign a Site Collection to a Content Database' portion, its exactly what I needed to know.

Just curious, if you have two content databases active in the site collection, will it write to only one? Or does it toggle new content between them?

Jason Apergis said...

I live to serve.

Good question RB. The site collection will be assigned to a single content db. Subsequently all the content will go to that db.