Monday, January 11, 2010

FAST ESP SharePoint Connector Series

Introduction

After introducing the components and providing a preview of design considerations for scaling a FAST ESP implementation, let us take a look at how FAST ESP works with SharePoint today. In this post I will introduce you to the architecture of the FAST SharePoint Connector and explain how content is fed, processed, stored and queried. We will also cover considerations and strategies for a successful implementation.

In the next set of postings, we will discuss in detail what has been planned for SharePoint 2010.

Note: if you have no FAST ESP experience or training, you may need to read the earlier posts in this blog to understand some of the concepts.

FAST Connector for SharePoint Today

FAST supports both SharePoint 2003 and 2007 in the same manner it supports any other enterprise application it indexes. FAST provides an API (Java, .NET, C++) and the FAST Content Connector Toolkit, which facilitates the building of Connector applications. The SharePoint Connector is built on these frameworks to feed content from SharePoint into FAST.

There are three features of the FAST SharePoint Connector you should be aware of:

  1. It will index sites, lists, list items, files and associated metadata from SharePoint.
  2. It can incrementally retrieve content from SharePoint.
  3. It will capture SharePoint item permissions and incorporate them into the access control lists.

Architecture

The architecture of the FAST SharePoint Connector is pretty simple and well contained.

  1. A custom web service will be installed into the SharePoint farm. This web service will be accessible just like the out-of-the-box web services provided in SharePoint. Side note: if you are interested in writing your own custom web service for SharePoint, read this blog.
  2. The FAST SharePoint Connector must be installed on a machine that can access the SharePoint web services and is able to connect to FAST ESP Content Distributors. It really does not matter where this is installed as long as it can make the required connections. That being said, the Connector could be installed on either the SharePoint Farm or on the FAST Farm.
  3. The Windows Authentication Proxy must be installed onto the FAST Farm.
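
To make this concrete, here is a minimal Java sketch of the kind of call the Connector makes into the SharePoint farm. The endpoint name FastContent.asmx is something I made up for illustration; the actual web service name and contract come with the Connector installation.

    import java.net.Authenticator;
    import java.net.HttpURLConnection;
    import java.net.PasswordAuthentication;
    import java.net.URL;

    public class ConnectorCallSketch {
        public static void main(String[] args) throws Exception {
            // Windows credentials for the SharePoint farm (NTLM).
            Authenticator.setDefault(new Authenticator() {
                protected PasswordAuthentication getPasswordAuthentication() {
                    return new PasswordAuthentication(
                        "DOMAIN\\svc_fast", "password".toCharArray());
                }
            });

            // Hypothetical endpoint name; the real web service is installed
            // into _vti_bin by the Connector setup.
            URL url = new URL("http://sharepoint/_vti_bin/FastContent.asmx");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            System.out.println("HTTP status: " + conn.getResponseCode());
        }
    }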

[Figure: FAST ESP SharePoint Connector architecture]

Basic Processing Flow

The installed components work together to retrieve content from SharePoint and make it searchable within FAST. Here is the process:

  1. The SharePoint Connector calls the FAST SharePoint Web Service to retrieve content.
  2. The FAST Connector connects to the FAST Content Distributor and sends along the SharePoint data.
  3. The FAST Windows Authentication Proxy may be used to retrieve additional SharePoint data (see Document Feeding below).
  4. The document processors process the content into FIXML documents so an index can be built.
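
To give you a feel for steps 1 and 2, here is a compressed sketch in Java. The FastContentFeeder type is a stand-in I invented for illustration; the real FAST ESP Content API (Java, .NET, C++) has its own classes, but the shape of the interaction is the same.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class FeedFlowSketch {
        // Stand-in for the real FAST ESP Content API client (illustration only).
        interface FastContentFeeder {
            void submit(String docId, Map<String, String> fields);
        }

        static void feed(FastContentFeeder feeder) {
            // Step 1: pretend we pulled one list item from the FAST SharePoint
            // Web Service, with its metadata already flattened to fields.
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("title", "Project Plan.docx");
            fields.put("url", "http://sharepoint/sites/proj/Shared Documents/Project Plan.docx");

            // Step 2: hand the document to the Content Distributor. Steps 3 and 4
            // (authentication proxy fetches, document processors emit FIXML)
            // happen inside FAST ESP after this call.
            feeder.submit("spdoc-1", fields);
        }
    }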

Now let’s dive a little bit deeper into some of the details about how this works.

Incremental Loading

The FAST SharePoint Connector will perform incremental loading of content. The first load will be heavy because all of the SharePoint data must be retrieved; subsequent loads will only retrieve changes. The interval for incremental loading is configurable.
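
To show the pattern behind incremental loading, here is a minimal Java sketch of a delta pull using SharePoint 2007's standard Lists.asmx web service and its change token. The FAST web service layers its own contract on top of this, but the idea is the same; the server and list names are placeholders.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class IncrementalPullSketch {
        public static void main(String[] args) throws Exception {
            // GetListItemChangesSinceToken returns only items changed since the
            // token was issued. An empty token means a full (first) load.
            String changeToken = "";
            String soap =
                "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">" +
                "<soap:Body>" +
                "<GetListItemChangesSinceToken xmlns=\"http://schemas.microsoft.com/sharepoint/soap/\">" +
                "<listName>Shared Documents</listName>" +
                "<changeToken>" + changeToken + "</changeToken>" +
                "</GetListItemChangesSinceToken>" +
                "</soap:Body></soap:Envelope>";

            URL url = new URL("http://sharepoint/_vti_bin/Lists.asmx");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            conn.setRequestProperty("SOAPAction",
                "\"http://schemas.microsoft.com/sharepoint/soap/GetListItemChangesSinceToken\"");
            conn.setDoOutput(true);
            try (OutputStream os = conn.getOutputStream()) {
                os.write(soap.getBytes(StandardCharsets.UTF_8));
            }
            // The response carries the changed items plus a new change token
            // to persist for the next incremental run.
            System.out.println("HTTP status: " + conn.getResponseCode());
        }
    }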

Incremental Loading Strategy

If you need to completely reload the data, you must clear the Collection the documents were fed to. Doing this has ramifications that you should be aware of. The most important one is that a Collection can contain documents from other locations; if so, all of that content will have to be re-indexed too! That can be a big deal, so it is important to organize your Collections and anticipate whether you will ever have to do this.

You may be wondering how you can control the amount of data that is indexed at any given time. There are many ways, but here are some options that come to mind first.

Create multiple pipeline instances for processing SharePoint data, then configure the pipelines to include or exclude specific URLs within SharePoint. I might create a dedicated pipeline for processing content in areas where I know there will be lots of updates. For instance, collaboration or project sites will have data updated on a regular basis, so I would configure that pipeline with a short refresh interval. The advantage is that a smaller subset of regularly updated data will be polled more frequently. I would then create a separate pipeline for a publishing site, where data is updated less frequently, and give it a longer interval between retrievals.

Another thing to take into consideration is pushing data to different Collections. For instance, you can have dedicated Collections for intranet, extranet and Internet content (remember, I am not talking about SharePoint site collections; I am talking about FAST Collections). Typically in the SharePoint world you logically group data into different content databases, shared service providers and even different hardware. It may be good to maintain that logical separation, since it is recommended to feed content into different FAST Collections based on these logical boundaries. Doing this will also give you more control over Search Profiles and over which Collections people have access to.
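
As a simple illustration of that routing decision, a connector-side helper could pick the target FAST Collection from the site URL. The URLs and Collection names here are made up.

    public class CollectionRouting {
        // Illustrative only: pick the target FAST Collection from the logical
        // boundary the content came from.
        static String collectionFor(String siteUrl) {
            if (siteUrl.startsWith("http://intranet")) return "sp_intranet";
            if (siteUrl.startsWith("http://extranet")) return "sp_extranet";
            return "sp_internet"; // public-facing publishing content
        }

        public static void main(String[] args) {
            System.out.println(collectionFor("http://intranet/sites/hr")); // sp_intranet
        }
    }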

Document Feeding

When SharePoint data is read through the web service by the SharePoint Connector, both metadata and security information will be sent to the Connector. However, depending on the configuration you set, SharePoint files may or may not be part of that payload. By default, a reference to the file (a URL) will be part of the information sent from the SharePoint Connector to the Content Distributor(s). During Document Processing, the Windows Authentication Proxy will use this reference to retrieve the actual document from SharePoint. You have the ability to change this configuration and send the file as a BLOB.
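
The two payload shapes look roughly like this in Java. Again, the document structure is a stand-in for illustration; the real Content API differs in names but not in shape.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class PayloadSketch {
        // By reference: only the URL travels with the feed; the Windows
        // Authentication Proxy fetches the file during Document Processing.
        static Map<String, Object> byReference(String fileUrl) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("url", fileUrl);
            return doc;
        }

        // As a BLOB: the Connector reads the bytes up front and ships them,
        // so Document Processing never has to call back into SharePoint.
        static Map<String, Object> asBlob(String localCopy) throws Exception {
            Map<String, Object> doc = new HashMap<>();
            doc.put("data", Files.readAllBytes(Paths.get(localCopy)));
            return doc;
        }
    }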

Document Feeding Strategy

Why are you provided with this option? Mostly for flexibility reasons. If you pass the files by reference:

  • The Connector is going to perform more quickly because it has less data to work with.
  • Your network I/O will be better utilized because the document will not have to be passed twice; it will only be retrieved once from SharePoint. This is a big deal if you have large files.

If you choose to pass the file immediately:

  • Document Processing will be quicker because it does not have to go out and retrieve the file from SharePoint.
  • The machines where Document Processing is located do not need to have access to the SharePoint sites because all of the content is available.

Processing SharePoint Data

Working with SharePoint data is not really different from working with other data that comes into FAST, but there are some strategies you may want to employ. First, you should know how the data type mapping from SharePoint to FAST will be handled. Second, metadata from columns in SharePoint will be mapped into the fields within the Index Profile. This mapping is based on the unique name of the field. For example, if you have a column called Last Update Date in SharePoint, the FAST Index Profile must have a field called lastupdatedate (notice it is lower case, with no spaces and no special characters). If this is the case, the data from the SharePoint column will be automatically mapped into that index field and become searchable. Note that if the SharePoint column data is not mapped, Document Processing will discard it.
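
Here is a quick Java sketch of the normalization rule just described (lower case, no spaces, no special characters):

    public class FieldNameSketch {
        // "Last Update Date" -> "lastupdatedate"
        static String toIndexFieldName(String sharePointColumn) {
            return sharePointColumn.toLowerCase().replaceAll("[^a-z0-9]", "");
        }

        public static void main(String[] args) {
            System.out.println(toIndexFieldName("Last Update Date")); // lastupdatedate
        }
    }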

If you have a good understanding of SharePoint, this will raise a red flag for you, because you know that columns of SharePoint data can be added in an ad-hoc fashion. The question on your mind is: how can data be made searchable if it is not mapped to a field? FAST has a concept called Scope fields. Scope fields take metadata and store it in a structured format (similar to XML) in a single field in the index. Scope fields are specifically provided to support storing index data without having to know the schema of that data in advance. When you store data in a Scope field, you have the ability to query it back out using FQL, the FAST Query Language (similar to writing an XPath query).
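
To make the Scope field idea concrete, here is a sketch of ad-hoc columns being folded into a single structured value. The XML shape is purely illustrative; the actual scope format and the FQL syntax for querying it back out are defined by ESP, so check the product documentation for the exact forms.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ScopeFieldSketch {
        // Fold arbitrary, schema-less SharePoint columns into one structured
        // value that can live in a single index field (illustrative format).
        static String toScopeValue(Map<String, String> columns) {
            StringBuilder sb = new StringBuilder("<columns>");
            for (Map.Entry<String, String> e : columns.entrySet()) {
                sb.append("<col name=\"").append(e.getKey()).append("\">")
                  .append(e.getValue()).append("</col>");
            }
            return sb.append("</columns>").toString();
        }

        public static void main(String[] args) {
            Map<String, String> cols = new LinkedHashMap<>();
            cols.put("ProjectCode", "X-42");
            System.out.println(toScopeValue(cols));
        }
    }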

Processing SharePoint Data Strategy

There are some considerations that you must take into account. First, you can add new fields to the Index Profile to match all of the data that you want to bring in from SharePoint. This is fine; however, adding a field is considered to be a “warm update.” After making the change, only new documents will have the data; previously indexed documents will not. For old documents to have the data, they must be completely re-processed, which will require you to clear the collection and completely re-index (discussed above). A second consideration is that using Scope fields to support querying all column data carries a query performance penalty.

Here are some additional recommendations:

  • Come up with a hybrid approach where important SharePoint columns are mapped to fields in the Index Profile, and allow all other columns to be indexed automatically into a Scope field. This is a common practice. It will give you good query performance on the most common columns of data and still allow you access to all other column data.
  • Earlier, we mentioned potentially creating separate Collections for publishing sites versus collaboration sites. In that scenario, do not turn on Scope fields for the publishing Collection because the metadata should be very well defined. This way you can get better query performance. All you need to do is either add new fields that map directly to SharePoint columns or add document processing stage(s) that will save the data in existing index fields.

Wrap Up

This post provided you with some insight into how FAST ESP indexes data from SharePoint. Hopefully you will take these factors into consideration before you start to index your content. This is why we say it is so important to understand the life-cycle of the data you are indexing - because it will influence your approach.

5 comments:

Mikael Svenson said...

Is this based on the connector released in December last year? https://extranet.fastsearch.com/static/softwareupdate.html / https://extranet.fastsearch.com/static/customers/Sharepoint_1_0.html

Jason Apergis said...

Mikael,

Those are some good references if you have access to the FAST extranet. Yes, this is based on the features of the release in Dec 2009.

Jason

RaviChandra Maniyani said...

We have installed sharepoint_connector_1_0_sp2_win32 on the SharePoint 2010 box and configured the content connector. We are using FAST ESP 5.3. It is installed on a separate machine but under the same domain.

But when I try to start the Connector, I get the error below.

[2012-07-04 10:40:42 AM] PROGRESS : SharePointConnector : Published Documents: Total: 0; Rate: 0.00dps. Callbacks (success/failure): [0/0].
[2012-07-04 10:41:11 AM] ERROR : SharePointConnector : Unable to communicate with content distributor(s). Make sure you have configured the correct content distributor(s) host name and port (Parameter ESPSubmit/ContentDistributors), and that your FAST ESP installation is running. Currently listed content distributor(s): ps23fastat1.test.com:16100
[2012-07-04 10:41:11 AM] FATAL : SharePointConnector : Error running connector SharePointConnector. Will terminate. Error: Exception has been thrown by the target of an invocation.


Any links/hints would be great help.

Thanks!
Ravi

Jason Apergis said...

Ravi,

I have not used FAST ESP for a long time. I have a colleague who may have more insight. Will forward the post along to him...

Jason

Anonymous said...

For the SP connector to fail like that, it is probably a network issue.

Just to confirm: you are using ESP 5.3 with the ESP SharePoint connector. Why are you installing the connector on the SP box?

Install the connector on a neutral box, or the ESP server, and try again. If it works, then there was port contention or some such thing from running on the SP box. If it doesn't work, you now have an isolated box to work with in tracking down the issue.

ESP and SP are both simple-to-use, but complex, platforms. Isolate the problems whenever possible.