Sitecore Search: Index content of PDF and MS Office formats DOC, PPT, Excel

Intro:

My log is flooded with entries like the one below. Why? Where are they coming from?

ManagedPoolThread #4 17:17:35 ERROR Could not compute value for ComputedIndexField: _content for indexable: sitecore://master/{7E5F66DF-2A4E-448F-B8DF-656BE6D4DA19}?lang=en&ver=1
Exception: System.Runtime.InteropServices.COMException
Message: Error HRESULT E_FAIL has been returned from a call to a COM component.
Source: Sitecore.ContentSearch
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.IClassFactory.CreateInstance(Object pUnkOuter, Guid& refiid, Object& ppunk)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.LoadFilterFromDll(String dllName, String filterPersistClass)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.LoadAndInitIFilter(String fileName, String extension)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterReader..ctor(String fileName)
   at Sitecore.ContentSearch.ComputedFields.MediaItemIFilterTextExtractor.ComputeFieldValue(IIndexable indexable)
   at Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor.ComputeFieldValue(IIndexable indexable)
   at Sitecore.ContentSearch.LuceneProvider.LuceneDocumentBuilder.AddComputedIndexFields()

Explanation:

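I have tons of PDFs in my database. Lucene tries to index them through the IFilter text extraction, and that is not working properly: as the stack trace shows, loading the IFilter fails with a COMException.

What I did instead was use PDFBox for .NET (or iTextSharp) to extract the PDF text for indexing.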

Downloads:

http://www.squarepdf.net/pdfbox-in-net
http://sourceforge.net/projects/itextsharp/

How To

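In the Lucene config file Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config, comment out the default _content computed field and register the custom extractor in its place: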

 <fields hint="raw:AddComputedIndexField">
     <!-- IC TSWK we use our custom PDF Media indexer!!!!
     <field fieldName="_content" storageType="no" indexType="tokenized">Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor,Sitecore.ContentSearch</field>
     -->
     <field fieldName="_content" storageType="no" indexType="tokenized">ICR.SC.Internet.Website.Logic.Search.FieldCrawlers.MediaContentExtractor, ICR.SC.Internet.Website.Logic</field>
 </fields>

Extraction method for PDFs using PDFBox 1.8.2

        // Requires: using System; using org.apache.pdfbox.pdmodel; using org.apache.pdfbox.util;
        //           using Sitecore.Data.Items; using Sitecore.Diagnostics;
        private string ParsePDF(MediaItem mediaItem)
        {
            if (mediaItem == null)
            {
                Log.Warn("pdfbox failed: Could not load mediaitem", this);
                return String.Empty;
            }

            Log.Info(String.Format("SearchUtils.ParsePdf: Parsing path '{0}' Mimetype: '{1}'", mediaItem.Path, mediaItem.MimeType), this);

            PDDocument doc = null;
            ikvm.io.InputStreamWrapper wrapper = null;

            try
            {
                // Dispose the media stream when done, even if PDFBox throws.
                using (var stream = mediaItem.GetMediaStream())
                {
                    if (stream == null)
                    {
                        Log.Warn(String.Format("pdfbox failed: Could not load stream from mediaitem '{0}'", mediaItem.Path), this);
                        return String.Empty;
                    }

                    // PDFBox is a Java library compiled with IKVM, so the .NET stream must be wrapped.
                    wrapper = new ikvm.io.InputStreamWrapper(stream);
                    doc = PDDocument.load(wrapper);
                    return new PDFTextStripper().getText(doc);
                }
            }
            catch (Exception ex)
            {
                // Pass the exception as its own argument so the stack trace reaches the log.
                Log.Error(String.Format("pdfbox failed: Exception: '{0}' parsing '{1}'", ex.Message, mediaItem.Path), ex, this);
            }
            finally
            {
                // Close each resource independently: the wrapper exists even when PDDocument.load fails.
                if (doc != null) doc.close();
                if (wrapper != null) wrapper.close();
            }

            return String.Empty;
        }

Extraction method for Office documents using the standard Sitecore ContentSearch extractor

        private string ParseOfficeDocSitecore(IIndexable indexable)
        {
            var extractor = new Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor();

            try
            {
                // Delegate to Sitecore's built-in IFilter-based extractor.
                object value = extractor.ComputeFieldValue((Sitecore.ContentSearch.SitecoreIndexableItem)indexable);
                return (value ?? String.Empty).ToString();
            }
            catch (Exception ex)
            {
                CrawlingLog.Log.Error(ex.ToString(), ex);
                Log.Warn(String.Format("ParseOfficeDocSitecore failed: Could not extract content from mediaitem '{0}'", indexable.AbsolutePath), this);
                return String.Empty;
            }
        }
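
How the two methods fit together

The post does not show the computed field class itself, so here is a rough sketch of how MediaContentExtractor (the class registered in the config) might dispatch between the two methods. The choice to branch on the file extension "pdf" and to skip non-media items is my assumption, not something taken from the original code. ParsePDF and ParseOfficeDocSitecore are the two methods shown above and need to be pasted into the class for it to compile.

```csharp
using System;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Items;

namespace ICR.SC.Internet.Website.Logic.Search.FieldCrawlers
{
    // Sketch only: the dispatch logic below is an assumption, not the original class.
    public class MediaContentExtractor : IComputedIndexField
    {
        // Set by Sitecore from the fieldName / returnType attributes in the config.
        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        public object ComputeFieldValue(IIndexable indexable)
        {
            // Only Sitecore items can be media items; ignore anything else.
            var indexableItem = indexable as SitecoreIndexableItem;
            if (indexableItem == null)
                return null;

            Item item = indexableItem.Item;
            if (item == null || !item.Paths.IsMediaItem)
                return null;

            var mediaItem = new MediaItem(item);
            string extension = (mediaItem.Extension ?? String.Empty).ToLowerInvariant();

            // Assumption: PDFs go to PDFBox, everything else to the standard Sitecore extractor.
            string text = extension == "pdf"
                ? ParsePDF(mediaItem)
                : ParseOfficeDocSitecore(indexable);

            // Returning null keeps empty values out of the index document.
            return String.IsNullOrEmpty(text) ? null : text;
        }

        // Paste ParsePDF(MediaItem) and ParseOfficeDocSitecore(IIndexable) from above here.
    }
}
```

After deploying the assembly and the config change, rebuild the index so that existing media items are run through the new extractor.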
