Wednesday, January 6, 2010

FAST ESP Components Introduction

Series

Introduction

I am going to provide an introduction to the architecture of FAST ESP 5.3. As many of you know, FAST is part of the enterprise MOSS 2010 CAL but not the MOSS 2007 CAL. I will delve into the future architecture of FAST with MOSS later in this series. Right now I want to introduce the architecture of FAST as it is today. Many of the concepts will carry over to MOSS 2010, so having a good understanding of FAST will be valuable.

In this blog I am going to start by giving you an introduction to the various servers, components, and logical architecture that make up a FAST ESP 5.3 implementation. This blog is meant to provide a tip-of-the-iceberg view of FAST.

The technical focus will be on understanding how content is actually made searchable with FAST. One of the first things you should understand is the purpose of each component and its role in that process.

MOSS 2007 Search Architecture

I am going to assume the reader is familiar with SharePoint and is learning FAST from a SharePoint perspective. Some of the Enterprise Search concepts in MOSS 2007 are similar to those in FAST. Let me touch on these so you have a frame of reference before we jump into FAST. In MOSS:

  • There is an index server that builds the index.
  • There is a physical index file that is pushed to each WFE server.
  • There is a query service that runs on the WFE that queries the index file.
  • There are Shared Services Providers (SSPs) which host the search services and control what content is indexed.
  • The BDC is used to add external data (from databases and web services) to the index.
  • There is an access control list which is used to manage security to items.

[Diagram: MOSS 2007 search architecture]

FAST ESP 5.3

The most current version of FAST is called ESP 5.3. If you are familiar with FAST and its value proposition (further reading), you will know that it is a really powerful engine for crawling content and providing relevant results to users.

Let us look at FAST ESP 5.3 components from a 20,000-foot level (if you are new to Enterprise Search, please review this blog first). A simple way of looking at FAST is to think of it as a big ETL project. First you need to get content from various locations across your enterprise. Then you need to process and format the content so it can be indexed. Once content is indexed, you need to quickly return relevant results. There are numerous components within FAST that perform this processing, indexing, and searching:

  • Connectors – Applications that can feed content into FAST.
  • Content Distributors – Connectors feed content to Content Distributors, which send the content on for document processing.
  • Document Processors – Processing workflows that create searchable documents from submitted content.
  • Collections – Logical grouping of searchable documents.
  • Index Dispatchers – Route processed documents to the appropriate Index Node.
  • Index Node – Builds searchable indexes from processed documents.
  • Search Node – Queries built indexes for matching documents.
  • Top Level Dispatcher (TLD) – Manages communications and performance between the QR Server and multiple Search Nodes.
  • Query/Result (QR) Servers – Prepare queries to be sent to the Search Nodes and refine the search results that are returned.
  • Search Front End Servers (SFE) – Applications where a user makes a search request.
  • Administration Server – Administrative features, configuration management, etc. for the enterprise installation.

The diagram below depicts how all of these components interact with each other at a high level. This should give you a general idea of the life-cycle of a document.

[Diagram: FAST ESP 5.3 components and their interactions]

Connectors

Connectors are also commonly referred to as content feeders because that is exactly what they do. Whether they push or pull content, Connectors are built on a common API (available in Java, C#, and C++) that submits content to FAST. Think of Connectors as standalone applications that submit content into FAST. FAST provides several Connector applications. Those that you should be immediately familiar with include:

  • Enterprise Crawler – an extremely powerful crawler for web content.
  • File Traverser – This app can crawl file directories and supports over 270 file formats out of the box. It also provides extensive support for consuming XML content.
  • JDBC Connector – This app is used for submitting structured data/content from databases into FAST.
  • SharePoint Connector – For the SharePoint people reading this blog, there is a connector for SharePoint 2003 and 2007. We will discuss details later in this series.

This is by no means an exhaustive list of Connectors. Many FAST partners have built FAST connectors for the well-known enterprise systems we work with today. You also have the ability to create your own feeder applications using the FAST API.
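To make the push model a little more concrete, here is a minimal sketch of what a custom feeder application could look like. This is not the real FAST content API; the EspClient class, its submit_document method, and the server address are all invented for illustration. The point is simply that a feeder walks some content source, tags each item with a target Collection, and pushes it through a Content Distributor.

    # Hypothetical sketch of a custom feeder application. The real FAST content
    # API (Java, C#, C++) has its own classes; EspClient and submit_document are
    # made-up names used only to illustrate the push model.
    import os

    class EspClient:
        """Stand-in for a content API client pointed at a Content Distributor."""
        def __init__(self, content_distributor, collection):
            self.content_distributor = content_distributor  # example address, e.g. "esp-cd01:16100"
            self.collection = collection                     # logical destination for the content

        def submit_document(self, doc_id, fields):
            # A real connector would serialize the document and push it to the
            # Content Distributor; here we just print what would be sent.
            print(f"-> {self.content_distributor} [{self.collection}] {doc_id}: {list(fields)}")

    def feed_directory(client, root):
        """Walk a directory and push each file as a document (File Traverser style)."""
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    data = f.read()
                client.submit_document(doc_id=path, fields={"title": name, "data": data})

    if __name__ == "__main__":
        client = EspClient(content_distributor="esp-cd01:16100", collection="intranet")
        feed_directory(client, "./docs")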

Collections

Collections are one of the fundamental concepts you need to know about when implementing FAST. Content is always submitted to Collections, which are logical groupings of searchable documents. In FAST terminology, a “document” is anything that has been indexed. This can be a file, some web content, a database record, or something else that can be processed by FAST. Each document has fields which are populated based on document processing rules. FAST has a very powerful relevancy model and fields are used to provide more relevant results to users.

Why would you create Collections? For example, you may want different Collections for different types of content (Internet versus intranet), or you may want to index content differently based on business rules and relevance models.

We will touch more on Collections later when discussing Document Processing.

Note that if you have a SharePoint background you may be tempted to say a Collection is similar to an SSP, but they are not the same. The SSP is a concept unique to SharePoint.

Content Distributors

When Connectors feed content to FAST, they must provide two things: the Collection (which is the logical destination for the content) and the Content Distributor. The purpose of Content Distributors is to provide fault tolerance and increased Document Processing throughput for FAST. They are responsible for routing content directly to the Document Processing servers. Specifically, the Content Distributor sends content to a Collection which has a Document Processing pipeline mapped to it. Content only passes through the Content Distributor; it is never modified there.
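As a quick illustration of that routing rule, here is a conceptual sketch. The Collection names, pipeline names, and function are all invented; the real Content Distributor also handles things like batching and failover that are not shown.

    # Conceptual sketch of what a Content Distributor does with submitted content:
    # look up the pipeline mapped to the target Collection and forward the
    # document untouched. Names here are invented examples.
    COLLECTION_PIPELINES = {
        "intranet": "SiteSearch",   # pipeline suited to unstructured web/file content
        "products": "XMLMapper",    # pipeline suited to structured XML content
    }

    def distribute(collection, document):
        pipeline = COLLECTION_PIPELINES[collection]   # one pipeline mapped to the Collection
        # The Content Distributor never modifies the content; it only routes it.
        return pipeline, document

    pipeline, doc = distribute("intranet", {"url": "http://portal/page1"})
    print(pipeline)   # SiteSearch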

Document Processors

Content processing occurs within the Document Processing servers of FAST. Pipelines in these servers support an end-to-end process that transforms submitted content into a FAST searchable document. Within each pipeline there are stages which perform the actual tasks. FAST comes with numerous pipelines already preconfigured with stages. It is possible to write your own stages in Python and add them to existing pipelines. You can also create your own pipelines for custom content processing. The goal is to create a document in a proprietary format called FIXML. FIXML is a physical file that is used by the FAST index servers to build an index that can be searched. A common comparison is a pipeline and its stages to an ETL job, such as one in SQL Server Integration Services (SSIS).
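To give you a feel for what a pipeline and a stage are, here is a heavily simplified sketch. This is not the actual FAST ESP stage API; the class and method names are invented. It only illustrates the idea that a stage reads and writes named fields on a document as it flows through an ordered list of stages.

    # Hypothetical shape of a custom document processing stage and a pipeline.
    # The real FAST ESP stage API differs; this only illustrates that stages
    # transform named fields on a document, one after another.
    class NormalizeTitleStage:
        """Example stage: clean up the 'title' field before indexing."""
        def process(self, document):
            title = document.get("title", "")
            document["title"] = title.strip().title()
            return document

    class Pipeline:
        """A pipeline is an ordered list of stages applied to each document."""
        def __init__(self, stages):
            self.stages = stages

        def run(self, document):
            for stage in self.stages:
                document = stage.process(document)
            return document   # in FAST the processed result is written out as FIXML

    doc = Pipeline([NormalizeTitleStage()]).run({"title": "  fast esp components  "})
    print(doc["title"])   # Fast Esp Components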

When a Collection is defined, a single pipeline is assigned to perform the Document Processing for that Collection. Pipelines can be reused between Collections, but it is important to note that a Collection will only have one defined pipeline. This is because some pipelines are geared toward processing unstructured content (web) versus structured content (database and XML). Also, documents are uniquely identified by their internal ID along with the name of the Collection they belong to. It is possible for the same document to be in the index more than once, but in different Collections. They are not considered to be the same document (even though the document source is the same) because each “may” have passed through a different document processing pipeline.
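The identity rule is easy to see with a tiny sketch. The dictionary below is just an illustration of the idea, not how the index is actually stored, and the names are invented.

    # Illustration of document identity: the key is the (collection, internal id)
    # pair, so the same source fed into two Collections yields two documents.
    index = {}

    def add_document(collection, doc_id, fields):
        index[(collection, doc_id)] = fields

    add_document("intranet", "http://portal/page1", {"pipeline": "SiteSearch"})
    add_document("archive",  "http://portal/page1", {"pipeline": "XMLMapper"})

    print(len(index))   # 2 -- same source, different Collections, different documents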

Note there is a significant amount of work (beyond the scope of this blog) that is performed on the Document Processor servers, including everything from extracting complex entities to applying linguistics.

Index Dispatchers

Before data is sent to the Index Nodes, a component called the Index Dispatcher routes the FIXML to the correct Index Node. The job of the Index Dispatcher is to hide the topology of the Index Nodes from the Document Processing servers. I will discuss the details of the Index Node topology later.

Index Nodes

The Index Nodes are responsible for building binary indexes from the FIXML files created by the Document Processors. Searches are completed against the built indexes, not the FIXML files. There may be many Index Nodes to support fault tolerance, performance, or the amount of content that must be indexed. We will dive into this in the next section.

Search Nodes

Search Nodes are the processes that perform queries against Index Nodes. There will always be at least one Search Node for every Index Node. The Search Node will only search the Index Node that it is assigned to. We will go into the details of topology of Index and Search nodes when we discuss scaling FAST.

Within a Search Node there is a process called fSearch, which is created for each index partition within the Index Node. The fSearch process searches the index partition for matching documents. Then a single process within the Search Node called FDispatch takes all the results from each index partition and merges them into a single result set that is ranked and sorted appropriately based on the rules specified in the Index Profile.
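If you want a mental model of fSearch and FDispatch, think of a simple scatter-gather: one search per partition, then a merge of the ranked hits. The sketch below is only that mental model; the data, matching, and scoring are invented, and the real processes obviously do far more.

    # Scatter-gather sketch in the spirit of fSearch/FDispatch: each partition is
    # searched independently and the per-partition hits are merged into a single
    # ranked result set. Data and scoring are illustrative only.
    import heapq

    def fsearch(partition, query):
        """Return (score, doc_id) hits for one index partition."""
        return [(score, doc_id) for doc_id, score in partition.items() if query in doc_id]

    def fdispatch(partitions, query, limit=10):
        """Merge the per-partition hit lists, best score first."""
        hits = []
        for partition in partitions:
            hits.extend(fsearch(partition, query))
        return heapq.nlargest(limit, hits)

    partitions = [
        {"hr/policy.doc": 0.8, "hr/benefits.doc": 0.6},
        {"it/policy.doc": 0.9},
    ]
    print(fdispatch(partitions, "policy"))   # [(0.9, 'it/policy.doc'), (0.8, 'hr/policy.doc')]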

We will not go into all the details of the Index Profile, but since it was mentioned I will give you the highlights. The Index Profile (an XML file) defines the search index schema for the search cluster. It defines document fields, document processing features, search features, and result features. Almost every server in the search cluster configuration uses this file in some way.

Note that a Search Node is very similar in concept to the Query Service in MOSS 2007. The Query Service has the responsibility of searching the index file that is built by the Index service.

Query/Result (QR) Server

The QR Server is responsible for preparing queries to be sent to the Search Nodes and for refining the results before they are returned to the calling Search Front End Server (SFE). Query transformation includes spell checking, query-side lemmatization, query-side synonym expansion, anti-phrasing, and stop word removal. It is applied to ensure that the best possible query is submitted. Some of this processing can be controlled by providing parameters with the query.

The QR Server is also responsible for preparing the results for the calling SFE. Results transformation performs result-side duplicate removal, builds document clusters, and builds shallow navigators based on the query parameters that were given to the QR Server.
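Here is a toy sketch of the query-side idea, just to make the transformations above tangible. The stop word list, synonym dictionary, and output format are all invented; the real QR Server is driven by the Index Profile and linguistic resources and is far more sophisticated.

    # Toy sketch of query-side transformation: stop word removal plus synonym
    # expansion. Word lists and the output format are invented for illustration.
    STOP_WORDS = {"the", "a", "of", "for"}
    SYNONYMS = {"laptop": ["notebook"], "auto": ["car", "automobile"]}

    def transform_query(raw_query):
        terms = [t for t in raw_query.lower().split() if t not in STOP_WORDS]
        expanded = []
        for term in terms:
            expanded.append(term)
            expanded.extend(SYNONYMS.get(term, []))   # query-side synonym expansion
        return " OR ".join(expanded)

    print(transform_query("the price of a laptop"))   # price OR laptop OR notebook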

Note that much of the configuration for the QR Server is managed in the Index Profile (mentioned in the previous section).

Top Level Dispatcher (TLD)

Between the Search Nodes and the QR server is the Top Level Dispatcher (TLD). This must be installed anytime there is more than one Search Node. The TLD is responsible for distributing queries across the Search Nodes to improve query performance. The TLD is also responsible for merging search results from each FDispatch in each Search Node into a consolidated result set.

Search Front End Servers (SFE)

A Search Front End Server (SFE) is simply the front-end application that calls out to the FAST server. FAST provides an SFE application, but it is not recommended for production use. A majority of the time, SFEs are custom built or integrated into existing applications. There is an API available in C#, Java, and C++.
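In practice an SFE does little more than build a query, hand it to the QR Server, and render the results. Here is a hypothetical sketch of that flow; the SearchClient class, its search method, and the server address are invented, and the results are canned, since I am not showing the real query API here.

    # Hypothetical sketch of an SFE talking to the QR Server. SearchClient and
    # its search method are invented; results are canned for illustration.
    class SearchClient:
        def __init__(self, qr_server):
            self.qr_server = qr_server   # example address, e.g. "esp-qr01:15100"

        def search(self, query, collection):
            # A real front end would send the query to the QR Server and parse
            # the response; here we return a canned hit list.
            return [{"rank": 0.9, "title": "Example hit", "url": "http://portal/page1"}]

    client = SearchClient(qr_server="esp-qr01:15100")
    for hit in client.search("expense policy", collection="intranet"):
        print(hit["rank"], hit["title"], hit["url"])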

For MOSS 2007, there is a set of web parts available on CodePlex that allow you to display results from FAST ESP 5.3. This will be integrated into SharePoint 2010.

Administration Server

There are several administrative components in FAST, such as the CORBA Name Service, License Manager, Resource Service, Log Server, Config Server, Cache Manager, Admin Server, etc. Going into their details is beyond the scope of this article.

Round Up

Hopefully this was a good introduction to the major components of FAST ESP 5.3 and their roles. In the next post, I am going to discuss how these components are scaled and the design decisions you should consider as part of a FAST deployment.
