ILearnable .Net

June 2, 2009

Using EasySearch as the backbone of an EPiServer site

Filed under: Uncategorized — andreakn @ 14:16

We have had great success using EasySearch as the backbone of our EPiServer site. Anyone working with EPiServer will have experienced GetPagesByCriteria and how well it appears to find pages during development. They will also have seen it crash and burn once realistically large sets of pages are introduced: I have seen search times of up to 10 seconds when searching through 10k pages.

The problem with GetPagesByCriteria is that it needs to load each page from the database before deciding whether the page belongs in the result set.

For our current project we chose to buy a component called “EasySearch” from a company called Networked Planet. It’s basically an EPiServer-aware wrapper around the Lucene search engine, with functionality for on-demand indexing on EPiServer events (PageCreated / PageDeleted / PageSaved).
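
To make the idea concrete, here is a minimal sketch of event-driven indexing in plain C#. It is not EasySearch's or Lucene's implementation: `TinyPageIndex`, `IndexedPage` and the handler names are hypothetical, and a real index does far more (analysis, scoring, persistence). The point is only that the index is updated when a page changes, so a query consults the index instead of loading every page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an indexed page; not an EPiServer or EasySearch type.
public sealed class IndexedPage
{
    public int Id { get; set; }
    public string Text { get; set; }
}

// A toy inverted index: word -> ids of the pages containing that word.
public sealed class TinyPageIndex
{
    private readonly Dictionary<string, HashSet<int>> _postings =
        new Dictionary<string, HashSet<int>>(StringComparer.OrdinalIgnoreCase);
    private readonly Dictionary<int, string[]> _wordsByPage = new Dictionary<int, string[]>();

    // Call from a "page created" or "page saved" handler.
    public void OnPageSaved(IndexedPage page)
    {
        OnPageDeleted(page.Id); // drop stale entries before re-indexing
        var words = (page.Text ?? string.Empty)
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        _wordsByPage[page.Id] = words;
        foreach (var word in words)
        {
            HashSet<int> ids;
            if (!_postings.TryGetValue(word, out ids))
            {
                ids = new HashSet<int>();
                _postings[word] = ids;
            }
            ids.Add(page.Id);
        }
    }

    // Call from a "page deleted" handler.
    public void OnPageDeleted(int pageId)
    {
        string[] words;
        if (!_wordsByPage.TryGetValue(pageId, out words)) { return; }
        foreach (var word in words)
        {
            HashSet<int> ids;
            if (_postings.TryGetValue(word, out ids))
            {
                ids.Remove(pageId);
                if (ids.Count == 0) { _postings.Remove(word); }
            }
        }
        _wordsByPage.Remove(pageId);
    }

    // A lookup touches only the entry for the word, never the pages themselves.
    public IList<int> Find(string word)
    {
        HashSet<int> ids;
        return _postings.TryGetValue(word, out ids)
            ? ids.OrderBy(id => id).ToList()
            : new List<int>();
    }
}
```

The cost of a lookup depends on how many pages contain the word, not on how many pages exist, which is the property that a database-loading scan lacks.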

Lucene is a document indexer that can be configured to index anything into its weird binary index format, which can later be queried. To get the ball rolling, we need to install EasySearch (oh, and pay the guys 30,000 NOK for their effort). We then set up a config file enumerating all the properties we want searchable:

Configuring

<?xml version="1.0" encoding="utf-8"?>
<indexconfiguration xsi:type="lucene:LuceneIndexConfiguration" xmlns:lucene="http://www.networkedplanet.com/schema/easysearch/configuration/lucene" xmlns="http://www.networkedplanet.com/schema/easysearch/configuration" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" type="lucene:LuceneIndexConfiguration">
<lucene:configuration RelativeDirecoryPath="..\EasySearchLuceneIndex" AppendInstanceIdToDirectory="false" />

<fileindexing Enabled="false">
<!--    <exclude extension="jpg" />-->
<!--    <include />-->
</fileindexing>
<enterprise Enabled="true" />
<pagetype Name="Artikkel">
<property Name="EPi_StartPublishDate" IncludeInCommonContent="false">
      <lucene:field Name="EPi_StartPublishDate" FieldStore="YES" FieldTermVector="NO" FieldIndex="UN_TOKENIZED" />
    </property>
<property Name="EPi_StopPublishDate" IncludeInCommonContent="false">
      <lucene:field Name="EPi_StopPublishDate" FieldStore="YES" FieldTermVector="NO" FieldIndex="UN_TOKENIZED" />
    </property>
<property Name="EPi_PageTypeId" IncludeInCommonContent="false">
      <lucene:field Name="EPi_PageTypeId" FieldStore="YES" FieldTermVector="NO" FieldIndex="UN_TOKENIZED" />
    </property>
<property Name="EPi_PageName" IncludeInCommonContent="true">
      <lucene:field Name="EPi_PageName" FieldStore="YES" FieldTermVector="NO" FieldIndex="TOKENIZED" />
    </property>
<property Name="Heading" IncludeInCommonContent="true">
      <lucene:field Name="Heading" FieldStore="NO" FieldTermVector="NO" FieldIndex="TOKENIZED" />
    </property>
   ...
</pagetype>
...
</indexconfiguration>

TOKENIZED means that the field value is split into words by the analyzer and each word is indexed separately. UN_TOKENIZED indexes the whole value as a single term, which suits values such as dates and IDs that are matched exactly.
IncludeInCommonContent means that EasySearch will grab all those fields and concatenate them into one searchable blob. Typically we do this for all content that should be found by free-text searches.
FieldStore means that the value is stored on the document (the entity that Lucene operates on), so that we do not necessarily need to load the EPiServer PageData for a particular hit. Typically we do this for field values we want to operate on, for filtering etc.
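
The difference between the two FieldIndex settings can be illustrated with a small sketch. This is a deliberate simplification and not Lucene's analyzer: the class and method names are hypothetical, and splitting on whitespace plus lowercasing only approximates what a real analyzer does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FieldIndexSketch
{
    // TOKENIZED: the value is broken into words and each word becomes a searchable term.
    // Whitespace splitting and lowercasing stand in for a real analyzer here.
    public static IList<string> TokenizedTerms(string value)
    {
        return value
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => word.ToLowerInvariant())
            .ToList();
    }

    // UN_TOKENIZED: the whole value is kept as one term, so only an exact match finds it.
    public static IList<string> UntokenizedTerms(string value)
    {
        return new List<string> { value };
    }
}
```

With this sketch, `TokenizedTerms("Derby Day results")` yields the three terms `derby`, `day` and `results`, while `UntokenizedTerms("Derby Day results")` yields the single term `Derby Day results`. That is why a page name is a good fit for TOKENIZED and a page type id for UN_TOKENIZED.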

The config file is referenced from web.config as a config section with a file source.

Searching

For searching we have created three distinct layers of functionality:
1) Basic terms / queries


 public class EPiSearchQuery
    {
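        // Builds a Term whose text is the UTC date in the sortable "s" format
        // (yyyy-MM-ddTHH:mm:ss), lowercased. Returns null when no date is given.
        public static Term DateTerm(string fieldName, DateTime? dateTime)
        {
            if (dateTime == null) { return null; }
            return new Term(fieldName, dateTime.Value.ToUniversalTime().ToString("s").ToLower());
        }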

        public static Query ValidUntilQuery(DateTime validUntil)
        {
            return DateRangeQuery("ValidUntil_no", validUntil, null, false);
        }

        public static Query PublishedBetween(DateTime? fromDate, DateTime? toDate, bool inclusive)
        {
            return DateRangeQuery("EPi_StartPublishDate_no", fromDate, toDate, inclusive);
        }

        public static Query AppearInRightColumnQuery(bool value)
        {
            return new TermQuery(new Term("AppearInRightColumn_no", value.ToString()));
        }

        public static Query PageTypeQuery(DomainBasePage pageType)
        {
            return new TermQuery(new Term("EPi_PageTypeId_no", pageType.PageTypeId.ToString()));
        }

        public static Query PageNameQuery(string pageName)
        {
            return new TermQuery(new Term("EPi_PageName_no", pageName));
        }

      ...
}
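
The date formatting in DateTerm is worth a closer look. The "s" format writes dates as year, month, day, hour, minute, second with fixed widths, so comparing two such strings character by character gives the same order as comparing the dates. The DateRangeQuery helper is not shown in this post, so the sketch below does not reproduce it. It only demonstrates, with hypothetical names, why string terms in this format can be range-checked at all.

```csharp
using System;

public static class SortableDateTerms
{
    // Mirrors the formatting used in DateTerm above: UTC, "s" (sortable) format, lowercased.
    public static string ToTermText(DateTime value)
    {
        return value.ToUniversalTime().ToString("s").ToLowerInvariant();
    }

    // True when 'candidate' falls inside [lower, upper], comparing only the term strings.
    public static bool IsInRange(string candidate, string lower, string upper)
    {
        return string.CompareOrdinal(candidate, lower) >= 0
            && string.CompareOrdinal(candidate, upper) <= 0;
    }
}
```

For a UTC time of 2 June 2009 at 14:16:00, `ToTermText` returns `2009-06-02t14:16:00`. A format such as `dd.MM.yyyy` would not have this property, because the day would be compared before the year.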

2) Queries that group atomic queries / terms together (AND and OR semantics)


  public class EPiSearchObject
    {

        private readonly BooleanQuery _query;

        public BooleanQuery Query { get { return _query; } }

        private EPiSearchObject()
        {
            _query = new BooleanQuery();
        }

        private List<Document> Search()
        {
            var queryEngine = (LuceneQuery)EasySearchConfiguration.Instance.QueryInterface;

            if (queryEngine == null)
            {
                throw new ApplicationException("EasySearchConfiguration.Instance.QueryInterface is null. Bad or missing configuration?");
            }
            return queryEngine.Search(_query);
        }

        public static EPiSearchObject And(params Query[] subQueries)
        {
            var ePiSearch = new EPiSearchObject();

            foreach (var query in subQueries)
            {
                if (query != null)
                {
                    ePiSearch._query.Add(query, BooleanClause.Occur.MUST);
                }
            }
            return ePiSearch;
        }
....
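
Only the And method is shown above. As a rough idea of what the OR counterpart could look like, here is a sketch. It is an assumption and not the code from our project. It is written as a standalone helper that returns a BooleanQuery, because the EPiSearchObject constructor is private. The only Lucene.Net pieces it relies on are the ones already used above, plus `BooleanClause.Occur.SHOULD`, the flag for optional clauses, and it needs a reference to the Lucene.Net assembly.

```csharp
using Lucene.Net.Search;

// Hypothetical helper for illustration; not part of the original code base.
public static class BooleanQuerySketch
{
    // Combines the non-null sub-queries so that a document matching any of them is a hit.
    public static BooleanQuery Or(params Query[] subQueries)
    {
        var combined = new BooleanQuery();
        foreach (var query in subQueries)
        {
            if (query != null)
            {
                combined.Add(query, BooleanClause.Occur.SHOULD);
            }
        }
        return combined;
    }
}
```

Skipping null sub-queries follows the same convention as And, which lets helpers such as DateTerm return null for "no restriction".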

As everything put into the Lucene index is basically an EPiServer PageData, we have created a domain model (probably worth a blog post in and of itself) where each PageType in EPiServer is represented by a class, and each property (that we need to be programmatically aware of) is represented as a property on objects of that class.
The domain objects contain logic for taking a Lucene Document, finding out which PageData it represents, newing up a corresponding instance of the domain object and setting the properties on it.

3) Top-level search interface with customized queries resulting in domain objects representing PageData objects (with caching through the Policy Injection Application Block from Microsoft Patterns and Practices)

For example:

        [PIABCache(AbsoluteExpiration = 60 * 60, CacheDependencyFactoryType = typeof(EPiServerCacheDependencyFactory))]
        public Collection<DomainBasePage> GetRightColumnArticles()
        {
            Query pageTypeQ = EPiSearchQuery.PageTypeQuery(new List<DomainBasePage> { DomainPageType.ArticlePage, DomainPageType.NewsPage });
            Query rightColumnQ = EPiSearchQuery.AppearInRightColumnQuery(true);
            EPiSearchObject search = EPiSearchObject.And(pageTypeQ, rightColumnQ);
            Collection<DomainBasePage> pages = search.Search<DomainBasePage>();

            pages = pages.OrderBy(page => page.Changed).ToCollection();

            return pages;
        }

The cache attribute is defined like this:


    public class PIABCacheAttribute : HandlerAttribute
    {
        /// <summary>
        /// Absolute expiration (in seconds). Reads as 0 when not set.
        /// </summary>
        public int AbsoluteExpiration
        {
            get { return _absoluteExpiration.HasValue ? (int)_absoluteExpiration.Value.TotalSeconds : 0; }
            set { _absoluteExpiration = TimeSpan.FromSeconds(value); }
        }

        /// <summary>
        /// Sliding expiration (in seconds). Reads as 0 when not set.
        /// </summary>
        public int SlidingExpiration
        {
            get { return _slidingExpiration.HasValue ? (int)_slidingExpiration.Value.TotalSeconds : 0; }
            set { _slidingExpiration = TimeSpan.FromSeconds(value); }
        }

        /// <summary>
        /// Used to set the item priority. Default to <c>CacheItemPriority.Normal</c>.
        /// </summary>
        public CacheItemPriority ItemPriority { get; set; }

        public string CacheName { get; set; }
        public string KeyName { get; set; }
        public Type CacheKeyFactoryType { get; set; }
        public Type CacheExpirationFactoryType { get; set; }
        public Type CacheDependencyFactoryType { get; set; }

        private TimeSpan? _absoluteExpiration;
        private TimeSpan? _slidingExpiration;

        public PIABCacheAttribute()
        {
            ItemPriority = CacheItemPriority.Normal;
        }

        public override ICallHandler CreateHandler()
        {
            return new PIABCache(CacheName, KeyName, _absoluteExpiration, _slidingExpiration, CacheExpirationFactoryType, CacheDependencyFactoryType, ItemPriority, CacheKeyFactoryType, Order);
        }
    }

and the cache dependency factory used for EPiServer content looks like this:


public class EPiServerCacheDependencyFactory : ICacheDependencyFactory
    {
        public CacheDependency GetCacheDependency(object obj)
        {
            return new CacheDependency(null, new[] { "DataFactoryCache.Version" }, null); 
        }
    }

(The cache item named “DataFactoryCache.Version” is updated on every publish in EPiServer.)
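
The effect of depending on a version key can be modelled in a few lines. The sketch below is a toy model with hypothetical names, not ASP.NET's cache and not EPiServer's implementation, and it is not thread-safe. It only shows the principle: every cached result remembers the version it was computed under, and a publish bumps the version so that all older results are recomputed on next use.

```csharp
using System;
using System.Collections.Generic;

// Toy model of version-based invalidation (hypothetical, for illustration only).
public sealed class VersionedCache<TValue>
{
    private readonly Dictionary<string, KeyValuePair<long, TValue>> _entries =
        new Dictionary<string, KeyValuePair<long, TValue>>();
    private long _version;

    // Stands in for a publish updating the "DataFactoryCache.Version" item.
    public void BumpVersion()
    {
        _version++;
    }

    // Returns the cached value if it was computed under the current version,
    // otherwise recomputes it and stores it with the current version.
    public TValue GetOrAdd(string key, Func<TValue> compute)
    {
        KeyValuePair<long, TValue> entry;
        if (_entries.TryGetValue(key, out entry) && entry.Key == _version)
        {
            return entry.Value;
        }
        var value = compute();
        _entries[key] = new KeyValuePair<long, TValue>(_version, value);
        return value;
    }
}
```

The trade-off is coarse invalidation: a single publish discards every cached listing, which is simple and safe but means the first request after each publish pays for the search again.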

The only queries not cached are free text queries, and they are capped at 100 documents in the result set.

We didn’t have to wrap the pages in custom domain objects, but that strategy has given us a super smooth API to work with (albeit at a higher initial cost and with a steep learning curve when setting it up).

Now we have a site where it’s super easy to fetch data for all sorts of listings, with content coming from all over the place. For instance, we can easily get news/tips/weather info/track info etc. for any given event (the domain is betting on horse races). This frees the content editors to create content in the places where it makes sense for them to do so.
