Intro:
My log is flooded with entries like the one below. Why is this happening, and where do they come from?
ManagedPoolThread #4 17:17:35 ERROR Could not compute value for ComputedIndexField: _content for indexable: sitecore://master/{7E5F66DF-2A4E-448F-B8DF-656BE6D4DA19}?lang=en&ver=1
Exception: System.Runtime.InteropServices.COMException
Message: Error HRESULT E_FAIL has been returned from a call to a COM component.
Source: Sitecore.ContentSearch
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.IClassFactory.CreateInstance(Object pUnkOuter, Guid& refiid, Object& ppunk)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.LoadFilterFromDll(String dllName, String filterPersistClass)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.LoadAndInitIFilter(String fileName, String extension)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterReader..ctor(String fileName)
   at Sitecore.ContentSearch.ComputedFields.MediaItemIFilterTextExtractor.ComputeFieldValue(IIndexable indexable)
   at Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor.ComputeFieldValue(IIndexable indexable)
   at Sitecore.ContentSearch.LuceneProvider.LuceneDocumentBuilder.AddComputedIndexFields()
Explanation:
I have tons of PDFs in my database. While building the Lucene index, Sitecore tries to extract their text through an IFilter, and that call fails with the COM error shown above.
My workaround is to extract the PDF text myself, using either PDFBox for .NET or iTextSharp.
Downloads:
http://www.squarepdf.net/pdfbox-in-net
http://sourceforge.net/projects/itextsharp/
How To
In the Lucene config file Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config, comment out the default _content field and register the custom extractor instead:
<fields hint="raw:AddComputedIndexField">
  <!-- IC TSWK: we use our custom PDF media indexer instead of the default one.
  <field fieldName="_content" storageType="no" indexType="tokenized">Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor,Sitecore.ContentSearch</field>
  -->
  <field fieldName="_content" storageType="no" indexType="tokenized">ICR.SC.Internet.Website.Logic.Search.FieldCrawlers.MediaContentExtractor, ICR.SC.Internet.Website.Logic</field>
</fields>
Extract method for PDFs using PDFBox 1.8.2
private string ParsePDF(MediaItem mediaItem)
{
    PDDocument doc = null;
    ikvm.io.InputStreamWrapper wrapper = null;

    if (mediaItem != null)
        Log.Info(String.Format("SearchUtils.ParsePdf: Parsing path '{0}' Mimetype: '{1}'", mediaItem.Path, mediaItem.MimeType), new Object());

    try
    {
        if (mediaItem != null)
        {
            var stream = mediaItem.GetMediaStream();
            if (stream != null)
            {
                // PDFBox is a Java library; IKVM wraps the .NET stream as a Java InputStream.
                wrapper = new ikvm.io.InputStreamWrapper(stream);
                doc = PDDocument.load(wrapper);
                var stripper = new PDFTextStripper();
                if (doc != null)
                    return stripper.getText(doc);
            }
            else
            {
                Log.Warn(String.Format("pdfbox failed: Could not load stream from mediaitem '{0}'", mediaItem.Path), new Object());
            }
        }
        else
        {
            Log.Warn("pdfbox failed: Could not load mediaitem", new Object());
        }
    }
    catch (Exception ex)
    {
        Log.Error(
            mediaItem != null
                ? String.Format("pdfbox failed: Exception: '{0}' parsing '{1}'", ex.Message, mediaItem.Path)
                : String.Format("pdfbox failed: Exception: '{0}'", ex.Message),
            ex);
    }
    finally
    {
        // Close each resource on its own, so the wrapper is released even if loading the document failed.
        if (doc != null)
            doc.close();
        if (wrapper != null)
            wrapper.close();
    }

    return String.Empty;
}
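The intro mentions iTextSharp as an alternative to PDFBox, but only the PDFBox variant is shown. The following is a minimal sketch of the same extraction with iTextSharp, assuming a 5.x release that ships PdfTextExtractor in iTextSharp.text.pdf.parser. The class name ITextSharpPdfText and the method name Extract are hypothetical and not part of the original solution.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper (not from the original post): extracts plain text
// from a PDF stream with iTextSharp 5.x instead of PDFBox.
public static class ITextSharpPdfText
{
    public static string Extract(Stream pdfStream)
    {
        var reader = new PdfReader(pdfStream);
        try
        {
            var text = new StringBuilder();
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            return text.ToString();
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Inside ParsePDF this could be called as ITextSharpPdfText.Extract(mediaItem.GetMediaStream()) after the null check on the stream, keeping the same logging and exception handling around it.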
Extract method for Office documents using the standard Sitecore content search extractor
private string ParseOfficeDocSitecore(IIndexable indexable)
{
    // Delegate to the default Sitecore extractor for everything that is not a PDF.
    var extractor = new Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor();
    object value = null;
    try
    {
        value = extractor.ComputeFieldValue((Sitecore.ContentSearch.SitecoreIndexableItem)indexable);
    }
    catch (Exception ex)
    {
        CrawlingLog.Log.Error(ex.ToString(), ex);
        Log.Warn(String.Format("ParseOfficeDocSitecore failed: Could not load stream from mediaitem '{0}'", indexable.AbsolutePath), new Object());
        return "";
    }
    return (value ?? "").ToString();
}
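The config entry registers a custom MediaContentExtractor class, but only its two helper methods are shown here. The following is a rough sketch of how the class itself might tie them together. The namespace and class name are taken from the config entry; dispatching on the file extension is an assumption, not necessarily how the original class decides.

```csharp
using System;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Items;

namespace ICR.SC.Internet.Website.Logic.Search.FieldCrawlers
{
    // Sketch only: ParsePDF and ParseOfficeDocSitecore are the two methods
    // from this post and are assumed to live in the other part of this class.
    public partial class MediaContentExtractor : IComputedIndexField
    {
        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        public object ComputeFieldValue(IIndexable indexable)
        {
            // SitecoreIndexableItem converts implicitly to Item.
            Item item = indexable as SitecoreIndexableItem;
            if (item == null || !item.Paths.IsMediaItem)
                return null;

            var mediaItem = new MediaItem(item);

            // Assumption: PDFs are recognized by their file extension.
            bool isPdf = string.Equals(mediaItem.Extension, "pdf", StringComparison.OrdinalIgnoreCase);

            string text = isPdf
                ? ParsePDF(mediaItem)
                : ParseOfficeDocSitecore(indexable);

            // Returning null leaves _content unset for this item.
            return string.IsNullOrEmpty(text) ? null : text;
        }
    }
}
```

With this in place, PDFs bypass the IFilter path that throws the COM exception, while other media items still go through the default Sitecore extractor.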