Monday, November 2, 2009

What is a FAST Enterprise Search Project Part 2

Series

Introduction

In my previous blog (Why is FAST Enterprise Search Important) I discussed why an Enterprise Search project is important. In this blog posting I will discuss what is needed for a successful Enterprise Search project. This should hopefully give you enough information to anticipate what will be needed in an Enterprise Search project.

What is an Enterprise Search Project?

A few years ago I had to make the transition from a custom application developer to an application server consultant with Microsoft products. Project plans for implementing SharePoint, K2 or BizTalk were really not much different, other than having several new tasks associated with the configuration, integration, sustainability and maintenance of the new application server. With application server projects you still have lots of custom artifacts and components that have to be developed. This too is the case with FAST.

When posed the question of what an Enterprise Search project is, I at first did not know where to start. I wanted to draw from my past experience. I also knew that Enterprise Search projects can be complex, but I did not understand what a search project would entail.

Content Processing and Transformation

Enterprise Search within an organization has many complexities. First we have to be able to index content wherever it may be (in a custom database, a third-party enterprise application server, a file share, a mainframe, etc.). Custom code may have to be written to bring this content over to FAST so that it can be indexed. Knowing this, a comprehensive analysis must be completed to understand all the content/data that is spread across the organization. A common mistake is that a company indexes bad data and gets the old "garbage in; garbage out" problem. There must be plans for handling both good and bad data, formatting unstructured data, making data relevant, normalizing data (removing duplicates), etc. We will need to understand the entire life-cycle of that data and how it can be effectively pulled or pushed into the FAST Search index. This is very similar to a data warehouse project, however the context is a little different.

An Enterprise Search project is also very similar to a complex ETL project because you will have to create several transformation processes/workflows. These processes must transform the content into a document that can be recognized by the FAST index. FAST refers to anything in the index as a document, even if the indexed item comes from a database. A document for FAST is a unique piece of data with metadata which gives it relevancy. FAST provides several out-of-the-box connectors that do this transformation, and it provides an API to write custom ones. In many cases you may have to build or extend connectors. Just as important as the ETL pre-processing, there are post-processing routines that must be executed before the search results are passed back to the user interface layer. Again, more relevancy rules or aggregation of search results may be incorporated here. I was happy to hear that the FAST team also draws comparisons to an ETL project when discussing what an Enterprise Search project is.
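
To make the ETL comparison concrete, below is a minimal sketch of the kind of pre-processing a custom connector might do: take a raw source record, normalize its fields into metadata, and produce a document the index can understand. The types and field names are hypothetical placeholders, not the actual FAST connector API.

using System;
using System.Collections.Generic;

//Hypothetical shapes for illustration only; the real FAST connector API differs.
public class IndexDocument
{
    public string Id;
    public string Body;
    public IDictionary<string, string> Metadata = new Dictionary<string, string>();
}

public class CustomerRecordConnector
{
    //Transform a raw source record (e.g. a database row) into a document:
    //one unique id plus normalized metadata that relevancy rules can use.
    public IndexDocument Transform(IDictionary<string, object> row)
    {
        IndexDocument doc = new IndexDocument();
        doc.Id = "customer-" + row["CustomerId"];
        doc.Body = Convert.ToString(row["Notes"]);

        //Normalize metadata so every source system ends up with consistent fields
        doc.Metadata["title"] = Convert.ToString(row["CompanyName"]).Trim();
        doc.Metadata["region"] = Convert.ToString(row["Region"]).ToUpperInvariant();
        doc.Metadata["lastmodified"] = Convert.ToDateTime(row["ModifiedOn"]).ToUniversalTime().ToString("o");

        return doc;
    }
}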

User Interface

Most Enterprise Search platforms like FAST do not have a traditional GUI; FAST is an Enterprise Search engine that can be plugged into new or existing platforms. FAST does provide several controls that can be integrated into any UI platform, but in many cases you will be extending them or building completely new controls. FAST provides a rich API that is accessible in languages such as .NET, Java and C++.

User Profile

An important element of a FAST Enterprise Search project is to understand the profile of the user performing the search. Things such as their current location, where they sit within the organization, what specialties they have, what past searches they have done, who they have worked for, and past or future projects, tasks or initiatives they have supported can all be used to give a more relevant search result. This requires integration with systems that can infer these relationships and pass this information along with the query to the FAST Query and Results server, which will return a relevant result.
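
As a rough illustration of the idea (the types and parameter names below are hypothetical, not the FAST Query API), the user's profile attributes can be gathered at query time and passed along as extra context so the query and results tier can boost documents that match the user's location, role or past activity:

using System.Collections.Generic;

//Hypothetical profile context for illustration only.
public class UserSearchContext
{
    public string Location;                 //e.g. pulled from the HR system
    public string Department;               //e.g. pulled from the directory
    public List<string> RecentQueries = new List<string>();
}

public class ProfileAwareQueryBuilder
{
    //Combine the raw query text with profile attributes so the search tier
    //can apply relevancy boosts for the current user.
    public IDictionary<string, string> Build(string queryText, UserSearchContext user)
    {
        Dictionary<string, string> query = new Dictionary<string, string>();
        query["q"] = queryText;
        query["boost.location"] = user.Location;
        query["boost.department"] = user.Department;
        query["context.recentqueries"] = string.Join(";", user.RecentQueries.ToArray());
        return query;
    }
}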

Security

The profile is also important for incorporating security. FAST has numerous ways in which documents can be securely exposed to the end user. For instance, there is an Access Control List (ACL) which is part of the document instance in the search index. The ACL is populated during the indexing of content, and this may require customizations to set the ACL appropriately. As well, more customizations may be added to do real-time authorization to ensure that documents being returned from the index have not been removed from the user's visibility. Another consideration is to partition indexes based on boundaries such as internet, extranet and intranet. There are several more considerations, so time must be set aside in the plan to ensure that content is secured properly.
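
The sketch below shows the general shape of that real-time authorization step: after results come back from the index, each one is re-checked against the user's current permissions before it is rendered. The result and authorization types are hypothetical placeholders for whatever security service the organization exposes.

using System.Collections.Generic;

//Hypothetical result and authorization shapes for illustration only.
public class SearchResult
{
    public string DocumentId;
    public List<string> Acl = new List<string>();   //principals captured at index time
}

public interface IAuthorizationService
{
    //Returns true if the user can still see the document right now.
    bool CanAccess(string userName, string documentId);
}

public class ResultTrimmer
{
    private readonly IAuthorizationService _authorization;

    public ResultTrimmer(IAuthorizationService authorization)
    {
        _authorization = authorization;
    }

    //Drop results the user can no longer see, even if the indexed ACL allowed them.
    public List<SearchResult> Trim(string userName, IEnumerable<SearchResult> results)
    {
        List<SearchResult> visible = new List<SearchResult>();
        foreach (SearchResult result in results)
        {
            if (_authorization.CanAccess(userName, result.DocumentId))
            {
                visible.Add(result);
            }
        }
        return visible;
    }
}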

Installation and Configuration

A major portion of the project plan needs to be devoted to the installation and configuration of the FAST server. There are several important things that need to be accounted for when doing this. For instance: how many queries will be executed concurrently, what are the peak usage scenarios, how much content will be indexed, what sort of complexities/exceptions are there in the indexing process, what is the anticipated growth, etc. All of this must be known to properly scale the FAST server and design the custom components.

Testing

With all of the custom transformation and GUI components to support the Enterprise Search implementation, there will need to be a focus on system integration testing, system application testing, and user acceptance testing. There will be specific tests for search to ensure that indexing, query performance and result relevancy are accurate and within acceptable ranges. This is nothing new, but we need to be sure that a proportionate amount of time is incorporated into the plan to ensure that a quality solution is put in place.

Sustainment and Governance

Sustainment also needs to be part of the plan, and it is commonly neglected. Too often the plan is focused on the short-term end result while the long-term management is not incorporated into the solution. What sort of organizational changes are required to support and maintain the search implementation? What sort of configuration management business processes will need to be introduced to continually tune the index and relevancy model based on usage? What sort of new roles and responsibilities need to be incorporated into employee performance plans (from both a systems and a business user perspective)? How is the enterprise taxonomy going to be maintained? What sort of key performance metrics and reporting are needed to consistently evaluate the success of the project? What is the process for incorporating change back into the solution (which is extremely important for Enterprise Search)? If questions like these are not incorporated into the early design of the project, there will be long-term challenges with the adoption and integration of the Enterprise Search investment.

Closing

As you can see, the key to a successful Enterprise Search project is to understand the needs of the business and how the solution will be supported. Many of the tasks that were discussed are very standard; we just needed to put them in context.

Why is FAST Enterprise Search Important Part 1

Series

Introduction

The first thing that many will ask before beginning a major Enterprise Search initiative with a product like FAST is: why is Enterprise Search important? Secondly, what is an Enterprise Search project? My approach is to understand these questions not from a sales perspective but from a technology management and consulting perspective.

Why is Enterprise Search important?

Users have to work with massive amounts of data that is stored either internally or externally. Search can mean different things to different industries, however the goal is simple: display the right information to the right person at the right time without distraction. At the same time we must have a flexible and configurable search platform that will surface the most relevant information to the business user from wherever it is stored.

Information Workers have to search for and then utilize data. How do they do this? They typically have to log into an application and perform a search. Or when they enter an application, there may be some data contextually rolled up to them based upon who they are. There is a demand by business users to make search easier. We have heard many times "how can I search my enterprise data the same way I Google something on the internet?" Users want the ability to go to a single place, run a search query and receive results from across the entire enterprise. This is very different from performing a public internet search or a search function contained within the scope of a single application. Public internet searching has its own complexities, however it typically indexes content on websites. Enterprise Search becomes complex because the data being indexed can come in numerous formats (document file, database, mainframe, etc.). From the user perspective this complexity must be transparent. They must be given a single result set that will allow them to research a problem, complete a task or even initiate a business process.

Organizations are challenged with providing comprehensive search solutions that can access content no matter where the data resides. Public search engines have also created demand for highly relevant search experiences. Relevancy is the key to success for a search solution. To have accurate relevancy it is important to know as much as we can about the user entering the query. Profile relevancy can be determined by a number of things: for example, where the person is located, what their job function is, and what past searches they or their colleagues have done. Relevancy can also be determined by the attributes associated with a piece of content: for example, whether the author is considered trusted, whether the content itself is fresh, or whether the content is highly recommended by other users. The search platform must have an adaptive relevancy model. It must be able to change based on business demands and subsequently learn how to provide better results using the factors that are incorporated into the relevancy model. An Enterprise Search platform like FAST can provide this advanced capability.
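
As a toy illustration of what a relevancy model blends together (this is not how FAST actually computes rank), think of a score that combines profile signals and content signals with weights the business can tune over time; adjusting those weights based on usage is the "adaptive" part:

//Toy relevancy scoring for illustration only; real engines use far richer models.
public class RelevancySignals
{
    public bool MatchesUserLocation;    //profile signal
    public bool MatchesUserJobFunction; //profile signal
    public bool AuthorIsTrusted;        //content signal
    public double FreshnessScore;       //content signal, 0.0 (stale) to 1.0 (brand new)
    public int RecommendationCount;     //content signal
}

public class RelevancyModel
{
    //Tunable weights; changing these based on usage is what makes the model adaptive.
    public double LocationWeight = 1.0;
    public double JobFunctionWeight = 2.0;
    public double TrustedAuthorWeight = 1.5;
    public double FreshnessWeight = 1.0;
    public double RecommendationWeight = 0.25;

    public double Score(RelevancySignals signals)
    {
        double score = 0.0;
        if (signals.MatchesUserLocation) score += LocationWeight;
        if (signals.MatchesUserJobFunction) score += JobFunctionWeight;
        if (signals.AuthorIsTrusted) score += TrustedAuthorWeight;
        score += FreshnessWeight * signals.FreshnessScore;
        score += RecommendationWeight * signals.RecommendationCount;
        return score;
    }
}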

The vision of going to a single place to find data is not really a new concept. We have seen a major push for data warehouses to create a single location to facilitate enterprise reporting. We have seen enterprise portals created which give users a single user interface that provides contextual data from disparate systems. We have seen SOA trying to consolidate business services, and now we are seeing cloud services gaining traction in the market. The reality is that enterprise architecture, by and large, will be disparate. Companies have made significant investments in many technologies at one time or another, and consolidating them to a single platform is not always realistic. This is why we are constantly trying to find new solutions to work with data in a uniform manner. This is an important justification for an Enterprise Search solution such as FAST.

To restate, the goal is to have an Enterprise Search platform that can create a single result set using disparate data from across the enterprise. Where a lot of organizations fall short is that they do not have the tools to navigate this data. Business users are required to have deep domain knowledge of the organization, the format of the data, and the business processes. The domain expert must know what is good or bad based upon experience which is not transferable, making continuity of operations challenging. This is yet another reason why an Enterprise Search platform provides significant value to an organization.

Here are some examples of how organizations have used Enterprise Search.

  • Several major ecommerce sites like Best Buy and Autotrader.com used FAST to better advertise to their customers, expose products to customers significantly more quickly, provide better navigation of search results and provide integration with OEM partners.
  • A business data brokerage firm was able to provide more relevant results, increase user satisfaction, provide data from multiple disparate locations, create better customer retention, create a collaborative data rating system and allow for communication between subject matter experts.
  • A community facilitator for the natural resource industry was able to create a B2B solution that provided dynamic drill-down/navigation of industry data, created automated extraction policies to mine for important data, regionalized their search results, created a pay model for more high-end results, and improved their sales model by using relevancy.
  • A major computer production company used FAST to improve economies of scale for support personnel. They significantly lowered call-center costs by directing users to search first, provided customers with more up-to-date support information and allowed their worldwide staff of engineers to use their native languages when performing a search.
  • A global law firm used FAST to create a knowledge management solution that allowed them to reduce research personnel and created a consolidated search experience. They significantly reduced the ramp-up time of new lawyers, greatly improved result relevancy with advanced content navigation, and provided better communication of best practices.
  • A law enforcement agency was able to allow investigators to electronically research mass amounts of data across the government which they normally did not have access to. This subsequently increased productivity, shortened the length of investigations and helped them comply with government regulations.
  • Another government agency created a solution using FAST which would search the public domain for information on persons who are potentially breaking laws and initiate business processes to bring them to justice.

All these examples provide strong justification for the value of an Enterprise Search solution. With FAST, costs were reduced, regulations were met, organizations performed more efficiently, and more revenue was generated for goods and services.

What is an Enterprise Search Project?

This will be discussed in my next blog, What is a FAST Enterprise Search Project.

Saturday, October 24, 2009

FAST Search Whitepapers

Here are some great whitepapers you should read if you want to start learning about FAST. I know there is a lot of buzz around it with its integration with SharePoint 2010, finally providing SharePoint with a robust search engine. This is a great starting point for understanding what Enterprise Search is and how it can be strategically introduced and aligned with your Enterprise Architecture.

http://www.microsoft.com/enterprisesearch/en/us/FAST-technical.aspx

Tuesday, October 20, 2009

FAST Introduction and SharePoint Search Evolution

There is a lot of information coming out of the SharePoint 2010 conference and one of the biggest items is the integration of FAST into SharePoint 2010. What is FAST? FAST is an enterprise search engine that Microsoft acquired and has placed a significant investment into. The most important thing you should know right off the bat is that FAST does not equal SharePoint. FAST is an enterprise search platform which can be used as the search engine for SharePoint. Up to this point Microsoft has not provided a way to search for content across the enterprise. What we have done to compensate for this is build custom applications or purchase products like FAST and Google Search Appliances to do enterprise search.

This is what I have seen with the evolution of search solutions in the context of SharePoint. SharePoint 2001 had nothing to really discuss, but with SharePoint 2003 we started to get a taste of what we wanted from Search. We found that search did not really work well in SharePoint 2003 (cross-site searching did not work) and many customers who were using SharePoint 2003 said it simply did not work. It did basic text searching of content within SharePoint but it was missing key things like relevancy. This created a small market of third-party vendors who created search solutions for SharePoint. Remember, at this time Google had become the search engine of choice, as everyday business users would just say go Google something and get the answer. The problem was we did not have the same kind of search engine that we could use internally within a company, organization or enterprise. As a result, FAST, Google, Autonomy, etc. created enterprise search solutions that could be used within a company's enterprise and that had many of the features required by business users.

Then SharePoint 2007 came out with Enterprise Search. It was a significant improvement over what we had with SharePoint 2003, but it was still far off from being an enterprise search solution. They improved the user interface, allowed for targeted content taxonomy searching, added a relevancy model, best bets, synonyms, administrative features, reporting, an API we can build customizations against, security using an access control list (ACL), and business data search using the Business Data Catalog (BDC). All the stuff needed when creating an enterprise search platform. We now had the ability to search for data inside and outside of SharePoint, we could rank the search results based on who the user was, we could analyze searches to improve the user experience, etc., however it still seemed to fall short. The core problem I go back to is that users are expecting that Google experience, not just text searching. SharePoint tried to solve some of that but in the end it fell short.

One thing that had always been the most interesting to me is the introduction of the Business Data Catalog (BDC) to provide a single result set of data from multiple disparate data sources. This was the most interesting search feature for me when SharePoint 2007 came out. This is where they tried to become an enterprise search engine, because you go to one place, enter something to search on, query against many different places and get back a single result set. I personally was able to use it successfully to index custom SQL databases of HR-related data for several clients. So when they searched for a person, they were able to get more information about that person beyond what was stored in Active Directory. Now the BDC had lots of limitations, including only being able to call databases, stored procedures and web services, no ability to do data transformation, an API that was very hard to develop with, and limited scalability.

With the introduction of FAST as part of the Microsoft stack, they really have a true enterprise search engine. FAST has a significant amount of features and functionality which I have not even touched upon. In my next blog, I intend to write about some of the core features and capabilities that are needed for an enterprise search solution and how they are used to meet your business users' need to find data.

For more information on the value proposition of FAST, I have written the following two blogs:

Friday, October 16, 2009

SharePoint GB 2057 Localization

I was recently asked to dig around into an issue with an international SharePoint site we are setting up. I personally have little experience with globalization other than having to read about it to pass a MS certification test.

There are language packs for SharePoint which are used to support configurable text for globalization. The issue was: how is LCID 2057 for England handled? The English language pack supports 1033, which is US English. LCID 2057 is considered a sub-language of 1033. So, would it be possible to create a unique resx file for GB that maps to 2057? After digging and stumbling around, the answer is that it is not possible.

The only resolution would be to set the regional settings of the web application to LCID 2057 (GB), and then modify the resx for US English (1033) in that specific web application.

This is what I was able to find out:

  • There is only a language pack for English (1033).
  • It is possible to have formatted text, like dates, formatted for 2057. It is possible to change the locale to 2057 by setting SPWeb.Locale (see the sketch after this list). You can try to change the locale through the SharePoint Regional Settings screen in Site Settings, but you will not see a GB option, only US. Another way to change the locale is to go to the Webs table in the site collection database; HOWEVER, that is not supported by Microsoft.
  • In the Webs table you will see another column called Language. What I was able to find out is that the value in this column MUST correspond to a language pack that has been installed; otherwise SharePoint will bomb. So setting Language = 2057 and Locale = 2057 will not work. However, Language = 1033 and Locale = 2057 will work. What this will do is make sure that things like dates are formatted correctly. The reason why it fails is because in several places, including the 12 hive, SharePoint builds a relative path to resources installed with the Language Pack. You will see 1033 folders throughout the 12 hive. So if the Language is set to 2057, it will start looking for 2057 folders and things will start breaking. At this point I concluded it would not be possible to create a dedicated resx file for GB. Bummer.
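
For completeness, here is a minimal sketch of changing the locale through the object model in a small console utility (the site URL is just a placeholder); the Language value is left alone since only the 1033 pack is installed:

using System;
using System.Globalization;
using Microsoft.SharePoint;

class SetLocale
{
    static void Main()
    {
        //Placeholder URL for illustration
        using (SPSite site = new SPSite("http://intranet/sites/uk"))
        using (SPWeb web = site.OpenWeb())
        {
            //Locale drives date/number formatting; 2057 = English (United Kingdom)
            web.Locale = new CultureInfo(2057);
            web.Update();
        }
    }
}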

Here are some references:

Wednesday, October 14, 2009

Copy SPListItem.Version (SPListItemVersion) Part 3

Background and Considerations

A while back I wrote a blog that discussed the issues with copying SPListItems from one list to another. I recently needed to create a utility and thought my old blog would solve the problem – I am unhappy to say it did not. It definitely unlocks the issue of copying SPListItems with versions, however I found a couple of shortcomings in what I wrote. Let's try again.

Here are some considerations I had to understand before starting to build this.

  • After doing some research with Reflector, I found that the SPListItem CopyTo() and CopyFrom() methods do not work.
  • You will need to loop over the versions backwards and add each version of the list item into the destination list.
  • Moving documents is different from moving list items.
  • Recursively looping over items within an SPList or SPDocumentLibrary is not straightforward. You usually want to maintain the folder structure when moving items from one list to another. You cannot simply loop over all items in the SPList, nor does an SPFolder object have a collection of items within it. The only easy way of achieving this is to use a CAML query to get all the items for a specific folder.
  • If you need to preserve the Created and Modified time stamps on the version items, you need to set the times correctly because they are stored as GMT in the SharePoint database.
  • If you want to move items cleanly into a new or existing list, I recommend writing code that will first remove all the items from the destination list, then remove all the content types from the destination list, and finally add the needed content types back into the destination list (see the sketch after this list). There are a number of reasons to do this. It is possible to write a routine to reconcile the content types from the source list to the destination list, however that can become complicated. The important thing to know is that if a column is missing in the destination list, the move of the SPListItem or document item will fail. The code I have written is not dependent on the content type ID, which is a good thing, because if a content type is defined within the SharePoint UI a unique GUID is created for that content type. If you are moving items across SharePoint servers, you cannot be guaranteed that the content type ID will be the same, but the column names and types should be the same.
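
Here is a rough sketch of that clean-up step, assuming both lists are in webs you can open with the object model and that content types can be matched by name from the destination web's gallery; the helper name is mine.

//Rough sketch: clear the destination list, drop its list content types,
//then re-attach the content types the source list uses (matched by name).
private static void PrepareDestinationList(SPList sourceList, SPList destList)
{
    //Delete existing items in reverse so the collection can shrink safely
    for (int i = destList.ItemCount - 1; i >= 0; i--)
    {
        destList.Items[i].Delete();
    }

    //Remove the list content types that are safe to remove
    for (int i = destList.ContentTypes.Count - 1; i >= 0; i--)
    {
        SPContentType existing = destList.ContentTypes[i];
        if (!existing.Sealed && !existing.ReadOnly)
        {
            destList.ContentTypes.Delete(existing.Id);
        }
    }

    //Add back the content types the source list uses, matched by name
    foreach (SPContentType sourceType in sourceList.ContentTypes)
    {
        SPContentType webType = destList.ParentWeb.AvailableContentTypes[sourceType.Name];
        if (webType != null && destList.ContentTypes[sourceType.Name] == null)
        {
            destList.ContentTypes.Add(webType);
        }
    }

    destList.Update();
}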

Create Copy Folders Structure

I created a method called MoveFolderItems which will recreate the folder structure in a new library. All you need to do to initiate it is something like the following.

MoveFolderItems(sourceList, sourceList.RootFolder, destList, destList.RootFolder);

As you can see, this method first gets all the items for the specified folder. Then it checks whether the item is a folder. If so, it creates a new folder; otherwise it moves the item over using the appropriate document or list item routine shown later.

private static void MoveFolderItems(SPList sourceList, SPFolder sourceFolder, SPList destList, SPFolder destFolder)
{
    //Query for items in the source folder
    SPQuery query = new SPQuery();
    query.Folder = sourceFolder;
    SPListItemCollection queryResults = sourceList.GetItems(query);

    foreach (SPListItem existingItem in queryResults)
    {
        if (existingItem.FileSystemObjectType == SPFileSystemObjectType.Folder)
        {
            Console.WriteLine(existingItem.Name);

            //Create new folder item
            SPListItem newSubFolderItem = destList.Items.Add(destFolder.ServerRelativeUrl,
                SPFileSystemObjectType.Folder, null);

            //Set folder fields
            foreach (SPField sourceField in existingItem.Fields)
            {
                if ((!sourceField.ReadOnlyField) && (sourceField.Type != SPFieldType.Attachments))
                {
                    newSubFolderItem[sourceField.Title] = existingItem[sourceField.Title];
                }
            }

            //Save the new folder
            newSubFolderItem.Update();

            if (newSubFolderItem.ModerationInformation != null)
            {
                //Update folder moderation status
                newSubFolderItem.ModerationInformation.Status = SPModerationStatusType.Approved;
                newSubFolderItem.Update();
            }

            //Get the source folder and the new folder created
            SPFolder nextFolder = sourceList.ParentWeb.GetFolder(existingItem.UniqueId);
            SPFolder newSubFolder = destList.ParentWeb.GetFolder(newSubFolderItem.UniqueId);

            //Recursive call
            MoveFolderItems(sourceList, nextFolder, destList, newSubFolder);
        }
        else
        {
            //Move the item
            Console.WriteLine(existingItem.Name);

            if (sourceList.BaseTemplate == SPListTemplateType.DocumentLibrary)
            {
                MoveDocumentItem(existingItem, destFolder);
            }
            else
            {
                MoveItem(existingItem, destFolder);
            }
        }
    }
}

Move SPListItem

Here is the code for moving the SPList item with its history. First we create the list item. Then we loop over the versions backwards and add each version into the destination list.

private static void MoveItem(SPListItem sourceItem, SPFolder destinationFolder)
{
    //Create a new item
    SPListItem newItem;

    if (destinationFolder.Item != null)
    {
        newItem = destinationFolder.Item.ListItems.Add(
            destinationFolder.ServerRelativeUrl,
            sourceItem.FileSystemObjectType);
    }
    else
    {
        SPList destinationList = destinationFolder.ParentWeb.Lists[destinationFolder.ParentListId];
        newItem = destinationList.Items.Add(
            destinationFolder.ServerRelativeUrl,
            sourceItem.FileSystemObjectType);
    }

    //Loop over the source item versions backwards and restore each one
    for (int i = sourceItem.Versions.Count - 1; i >= 0; i--)
    {
        //Set the values into the new item
        foreach (SPField sourceField in sourceItem.Fields)
        {
            SPListItemVersion version = sourceItem.Versions[i];

            if ((!sourceField.ReadOnlyField) && (sourceField.Type != SPFieldType.Attachments))
            {
                newItem[sourceField.Title] = version[sourceField.Title];
            }
            else if (sourceField.Title == "Created"
                || sourceField.Title == "Modified")
            {
                //Times are stored as GMT, so convert them back to local time
                DateTime date = Convert.ToDateTime(version[sourceField.Title]);
                newItem[sourceField.Title] = sourceItem.Web.RegionalSettings.TimeZone.UTCToLocalTime(date);
            }
            else if (sourceField.Title == "Created By"
                || sourceField.Title == "Modified By")
            {
                newItem[sourceField.Title] = version[sourceField.Title];
            }
        }

        //Update the new item with the version data
        newItem.Update();
    }

    //Get the new item again
    SPList list = destinationFolder.ParentWeb.Lists[destinationFolder.ParentListId];
    newItem = list.GetItemByUniqueId(newItem.UniqueId);
    newItem["Title"] = sourceItem["Title"];
    newItem.SystemUpdate(false);

    if (sourceItem.Attachments.Count > 0)
    {
        //Now get the attachments; they are not versioned
        foreach (string attachmentName in sourceItem.Attachments)
        {
            SPFile file = sourceItem.ParentList.ParentWeb.GetFile(
                sourceItem.Attachments.UrlPrefix + attachmentName);

            newItem.Attachments.Add(attachmentName, file.OpenBinary());
        }

        newItem.Update();
    }
}

Move Document

As I mentioned earlier, moving a document is a little bit different. Here is the code that will copy a document, its metadata and its versions over to a new library.

private static void MoveDocumentItem(SPListItem sourceItem, SPFolder destinationFolder)
{
    //Loop over the source item versions backwards and restore each one
    for (int i = sourceItem.Versions.Count - 1; i >= 0; i--)
    {
        Hashtable htProperties = new Hashtable();

        //Set the values into the property bag for the new file
        foreach (SPField sourceField in sourceItem.Fields)
        {
            SPListItemVersion version = sourceItem.Versions[i];

            if (version[sourceField.Title] != null)
            {
                if ((!sourceField.ReadOnlyField) && (sourceField.Type != SPFieldType.Attachments))
                {
                    htProperties[sourceField.Title] = Convert.ToString(version[sourceField.Title]);
                }
                else if (sourceField.Title == "Created"
                    || sourceField.Title == "Modified")
                {
                    //Times are stored as GMT, so convert them back to local time
                    DateTime date = Convert.ToDateTime(version[sourceField.Title]);
                    htProperties[sourceField.Title] = sourceItem.Web.RegionalSettings.TimeZone.UTCToLocalTime(date);
                }
                else if (sourceField.Title == "Created By"
                    || sourceField.Title == "Modified By")
                {
                    htProperties[sourceField.Title] = Convert.ToString(version[sourceField.Title]);
                }
            }
        }

        //Get the binary for this version of the document
        byte[] document;
        if (i == 0)
        {
            document = sourceItem.File.OpenBinary();
        }
        else
        {
            document = sourceItem.File.Versions.GetVersionFromLabel(
                sourceItem.Versions[i].VersionLabel).OpenBinary();
        }

        //Create the new item. Overwriting it will treat it as a
        //new item.
        SPFile newFile = destinationFolder.Files.Add(
            destinationFolder.Url + "/" + sourceItem.File.Name,
            document,
            htProperties,
            true);

        newFile.Item["Created"] = htProperties["Created"];
        newFile.Item["Modified"] = htProperties["Modified"];
        newFile.Item.UpdateOverwriteVersion();
    }
}

Wednesday, October 7, 2009

.NET 4.0 WF Initial Impressions

A couple months ago I was asked some very direct questions about the viability of K2 and other such tools with .NET 4.0 and Dublin. I personally have just not had a lot of time to go off and research this. However, I attended a quick one-hour virtual session put on by Microsoft for WF in .NET 4.0.

The big thing I found out is that the State Machine workflow will not be available in the initial release of .NET 4.0. That was a big surprise to me. All you will have are Sequential and Flow Chart workflows. The presenter said that you can achieve something similar to a State Machine workflow by doing a Flow Chart workflow. This would lead me to believe that many of the workflow challenges we had with WF in MOSS 2007 have not been resolved.

They talked a little about Workflow Services and I found out that you cannot do as much with Workflow Services as you can with WF. I did not get any details on what those specifics were.

A lot of the discussion was about how ISVs can use WF to augment their frameworks and even provide the ability to allow customizations into their products using visual tools. This is what I have been preaching for a while now. You cannot adopt WF as the business process automation platform for a company. It does not come anywhere close. It is a framework for building business process automation frameworks.

I have had conversations where companies believe that since they have SharePoint to host their WF workflows, that is all they need. In the long run your costs will be significantly higher to maintain, extend and manage. I have a personal issue with WF in SharePoint because I do not like the fact that the workflows can only be tied to a piece of content. If a company wanted to do finance or accounting process automation (that would span across enterprise systems), the workflow instance would have to be tied to a SharePoint list item, which is not even an actor in the process itself. So ask: why do we need this SharePoint list item? It serves no real purpose in the process. Plus, if someone deletes the item or the associated task, the process will just end. There is no reporting, and the list goes on.

The point is that WF in MOSS should be used just to manage content in SharePoint. It is not a good platform for human workflow – you really need to look at other tools if you need human workflow. Plus, it really does not look like Microsoft is chasing after companies like K2 and Nintex, so they should have a healthy future.