Wednesday 15 July 2009

PDF extensions are not indexed when crawling

Hi guys,

Recently one of my colleagues had a problem crawling .pdf files in MOSS 2007. I tried a couple of articles to resolve the issue for him and found a Microsoft article that describes two scenarios for it: http://support.microsoft.com/kb/944447/en-us

Apart from the Microsoft article above, I found the following steps on one of the blogs on the internet.


To enable PDF indexing use the following steps:

➢ Download Adobe Reader 9.0, which includes IFilter 9.0.0.0, from http://www.adobe.com/products/acrobat/

➢ Download the Acrobat PDF picture, to display in front of PDF search result items, from http://www.adobe.com/misc/linking.html

➢ Add the PDF file type to the Extensions List for WSS search by editing the registry

• Start regedit


• Open the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\{Random GUID}\Gather\Search\Extensions\ExtensionList

• Add pdf to the list as a new String Value. Name it with the next unused number, e.g. if 37 is the highest existing name, create a value named "38" with the data "pdf"
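To make the numbering rule concrete, here is a small Python sketch that works out which entry to add. It does not touch the registry; the `existing` dictionary is a hypothetical stand-in for what regedit shows under ExtensionList, and the function name is my own.

```python
def next_extension_entry(existing, extension="pdf"):
    """Return the (name, data) pair to add, or None if already listed.

    `existing` maps value names (numeric strings) to extensions,
    mirroring the entries shown in regedit under ExtensionList.
    """
    ext = extension.lower().lstrip(".")
    if any(data.lower() == ext for data in existing.values()):
        return None  # the extension is already in the list
    numbers = [int(name) for name in existing if name.isdigit()]
    # Use the next number after the highest existing name.
    return str(max(numbers, default=0) + 1), ext


# Hypothetical sample of an ExtensionList, not read from a real server.
sample = {"1": "ascx", "2": "asp", "37": "xml"}
print(next_extension_entry(sample))
```

If the function returns None, pdf is already listed and this step can be skipped.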

➢ Add the Acrobat PDF picture to the SharePoint templates directory. Copy the Acrobat PDF picture called pdficon_small.gif to the 12 Hive\TEMPLATE\IMAGES folder, e.g. %programfiles%\Common Files\Microsoft Shared\Web Server Extensions\12\TEMPLATE\IMAGES.

➢ Bind the Acrobat PDF picture to the PDF file type

• Open the 12 Hive\TEMPLATE\XML\DOCICON.XML file

• Find the part

• Add the following mapping:
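A rough Python sketch of this step, assuming the usual DOCICON.XML layout where icons are listed as Mapping elements (with Key and Value attributes) inside a ByExtension section. That layout is my assumption and is not spelled out in the steps above, so check it against your own file first. The function name and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def add_pdf_mapping(docicon_xml, icon="pdficon_small.gif"):
    """Return DOCICON.XML text with a pdf mapping under <ByExtension>."""
    root = ET.fromstring(docicon_xml)
    by_ext = root.find("ByExtension")  # assumed section name
    if by_ext is None:
        raise ValueError("no <ByExtension> element found")
    for mapping in by_ext.findall("Mapping"):
        if mapping.get("Key", "").lower() == "pdf":
            mapping.set("Value", icon)  # update an existing pdf entry
            break
    else:
        # No pdf entry yet, so append a new one.
        ET.SubElement(by_ext, "Mapping", {"Key": "pdf", "Value": icon})
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily shortened stand-in for the real file.
sample = """<DocIcons>
  <ByExtension>
    <Mapping Key="doc" Value="icdoc.gif" />
  </ByExtension>
</DocIcons>"""
print(add_pdf_mapping(sample))
```

Take a backup copy of DOCICON.XML before changing it, whether by hand or by script.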


➢ Change the IFilter mapping in the registry

• Start regedit

• Open the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\

• Add (or modify) the .pdf key

• Add a Multi-String value with the data {E8978DA6-047F-4E3D-9C78-CDBE46041603}, or change the existing value if it holds another GUID.

• Open the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\

• Add (or modify) the .pdf key

• Add a Multi-String value with the data {E8978DA6-047F-4E3D-9C78-CDBE46041603}, or change the existing value if it holds another GUID.
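If you prefer not to click through regedit twice, the same two changes can be written as `reg add` commands. The Python sketch below only builds and prints the command lines, it does not run them. It assumes the GUID goes into the (Default) value of the .pdf key, which is why `/ve` is used; the steps above do not name the value, so treat that as my assumption.

```python
import subprocess

IFILTER_GUID = "{E8978DA6-047F-4E3D-9C78-CDBE46041603}"
EXTENSION_KEYS = [
    r"HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0"
    r"\Search\Setup\ContentIndexCommon\Filters\Extension",
    r"HKLM\SOFTWARE\Microsoft\Office Server\12.0"
    r"\Search\Setup\ContentIndexCommon\Filters\Extension",
]


def ifilter_commands(guid=IFILTER_GUID):
    """Build one `reg add` command per Extension key for the .pdf subkey."""
    return [
        # /ve = (Default) value (assumption), /f = overwrite without prompting
        ["reg", "add", key + r"\.pdf", "/ve", "/t", "REG_MULTI_SZ", "/d", guid, "/f"]
        for key in EXTENSION_KEYS
    ]


# Print the commands so they can be reviewed before running them by hand.
for cmd in ifilter_commands():
    print(subprocess.list2cmdline(cmd))
```

Review the printed lines and run them from an elevated command prompt on the server.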

➢ Add the Adobe Reader folder to the Path environment variable

• Right Click on My Computer

• Open Properties

• Open the Advanced tab

• Click Environment Variables

• Edit the Path variable

• Add your Reader folder to the Path list, e.g. C:\Program Files\Adobe\Reader 9.0\Reader
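When editing Path by hand it is easy to add the folder twice or to break the semicolon separators. A small Python sketch of the intended edit, working on the Path text only (it does not change any system setting, and the function name is my own):

```python
def add_to_path(path_value, folder):
    """Append `folder` to a semicolon-separated Path string if it is missing."""
    entries = [entry for entry in path_value.split(";") if entry]
    wanted = folder.rstrip("\\").lower()
    # Windows paths are case-insensitive, so compare in lower case.
    if any(entry.rstrip("\\").lower() == wanted for entry in entries):
        return path_value  # already present, leave Path unchanged
    return ";".join(entries + [folder])


# Example Path value, made up for illustration.
current = r"C:\Windows\system32;C:\Windows"
print(add_to_path(current, r"C:\Program Files\Adobe\Reader 9.0\Reader"))
```

Paste the result into the Path field of the Environment Variables dialog.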

➢ Restart the Search service by restarting your server or executing the following commands:

• Run: net stop osearch
• Run: net start osearch
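The two commands can also be wrapped in a short Python helper, for example as part of a larger setup script. This is only a sketch: the `run` parameter is there so the commands can be swapped out for a dry run, and the helper name is my own.

```python
import subprocess


def restart_search(run=subprocess.run, service="osearch"):
    """Stop and then start the search service with `net`."""
    for action in ("stop", "start"):
        # check=True raises an error if a command fails.
        run(["net", action, service], check=True)


# On the SharePoint server, from an elevated prompt:
# restart_search()
```

Stopping comes first on purpose, so a failed stop raises an error before the start is attempted.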

➢ Crawl the PDF documents

• Existing PDF documents that were crawled before the Adobe PDF IFilter was installed are not indexed during an incremental crawl. You would have to edit each existing PDF file to make the crawler reindex it during an incremental crawl. It's easier to run a full crawl after you have installed the Adobe PDF IFilter.

Now that all PDF documents have been crawled, you can query on content inside a PDF document.

Hope this is a useful tip for resolving the issue.


Cheers !!!
