Tuesday, February 9, 2010

Search HTML Files in Alfresco: Search only WYSIWYG text, and not HTML tags, style declarations, etc.

By default, when we add an HTML file to Alfresco, it is treated as a text file and everything in it is indexed. That means all tags, the content inside tags, style declarations, etc. get indexed.
For example, if the content of the HTML file is:


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<table class="AlternateCyanColorGrid" cellspacing="0px" width="100%">
<col style="width: 33.333%;" />
<col style="width: 33.333%;" />
<col style="width: 33.333%;" />
<tr class="t1Row">
<td>heading1</td>
<td>chkheading</td>
<td>newest</td>
</tr>
<tr class="t2Row">
<td>search</td>
<td>results</td>
<td>logged</td>
</tr>
<tr class="t1Row">
<td>cruel</td>
<td>verify</td>
<td>broom</td>
</tr>
</table>
</body>
</html>

Now, if we search for table, body, style, etc., the file will be included in the search results.
Sometimes the requirement is to search only the WYSIWYG text of HTML files, and not the markup and metadata around it.
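
To see why markup words end up matching, here is a minimal Python sketch of what happens when HTML is tokenized as plain text. The tokenizer is a deliberately simplistic stand-in and is not the analyzer Alfresco actually uses; the sample markup is a shortened version of the file above, and the names RAW_HTML and raw_terms are made up for this illustration.

```python
import re

# Shortened version of the sample file above (illustrative only).
RAW_HTML = """<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>
<body>
<table class="AlternateCyanColorGrid">
<col style="width: 33.333%;" />
<tr class="t1Row"><td>heading1</td><td>search</td></tr>
</table>
</body>
</html>"""


def raw_terms(text):
    """Return lower-cased alphanumeric tokens, roughly what a naive
    plain-text indexer would extract. Assumption: not Alfresco's analyzer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


terms = raw_terms(RAW_HTML)

# Markup words are indexed right next to the real content word "search".
print(sorted(t for t in ("table", "body", "style", "search") if t in terms))
# prints ['body', 'search', 'style', 'table']
```

Because tag and attribute names become ordinary terms, a search for "table" matches every HTML file that contains a table, whatever its visible text says.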

What we need to do is give Alfresco the WYSIWYG text of HTML files instead of the raw text. Alfresco provides transformers for this purpose, and we will use one that already ships with Alfresco.
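
Conceptually, a text/html to text/plain transformation keeps only the text a reader would see and throws the markup away. The Python sketch below illustrates that idea with the standard library's html.parser module. It is not how Alfresco's transformer is implemented, and VisibleTextExtractor and html_to_text are hypothetical names used only here.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, attributes, and the
    contents of <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []       # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><style>td { color: red; }</style></head>"
    '<body><table class="t1Row"><tr><td>search</td><td>results</td></tr>'
    "</table></body></html>"
)
plain = html_to_text(sample)
print(plain)  # prints: search results
```

Only "search" and "results" survive, so words such as table, body, or style would no longer be available to match once the plain-text version is what gets indexed.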

To achieve this, create a custom extension file named custom-content-services-context.xml in the extension folder [for Alfresco deployed inside JBoss, the extension folder is server\\conf\alfresco\extension; for standalone Alfresco, it is tomcat\webapps\alfresco\WEB-INF\classes\alfresco\extension] with the following content:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
    <bean id="transformer.HtmlParser"
          class="org.alfresco.repo.content.transform.HtmlParserContentTransformer"
          parent="baseContentTransformer">
        <property name="explicitTransformations">
            <list>
                <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                    <property name="sourceMimetype"><value>text/html</value></property>
                    <property name="targetMimetype"><value>text/plain</value></property>
                </bean>
            </list>
        </property>
    </bean>
</beans>
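
A malformed extension file is an easy mistake to make, so it can be worth checking the XML before restarting Alfresco. The following Python sketch parses a copy of the configuration above and lists the declared source and target MIME types. It only checks that the XML is well-formed; it does not validate the file against the Spring DTD or confirm that Alfresco will load it. The variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Copy of the extension file shown above.
CONFIG = """<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
  <bean id="transformer.HtmlParser"
        class="org.alfresco.repo.content.transform.HtmlParserContentTransformer"
        parent="baseContentTransformer">
    <property name="explicitTransformations">
      <list>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>text/html</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
      </list>
    </property>
  </bean>
</beans>
"""

# Raises xml.etree.ElementTree.ParseError if the file is not well-formed.
root = ET.fromstring(CONFIG.encode("utf-8"))

# Collect every (source, target) MIME type pair declared on a bean.
pairs = []
for bean in root.iter("bean"):
    props = {p.get("name"): p.findtext("value") for p in bean.findall("property")}
    if "sourceMimetype" in props and "targetMimetype" in props:
        pairs.append((props["sourceMimetype"], props["targetMimetype"]))

print(pairs)  # prints [('text/html', 'text/plain')]
```

To check the real file, read its bytes from the extension folder and pass them to ET.fromstring in place of CONFIG.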

Now, if you add the same HTML file, terms like body, table, etc. should not get indexed.