For example, suppose the content of the HTML file is:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<table class="AlternateCyanColorGrid" cellspacing="0px" width="100%">
<col style="width: 33.333%;" />
<col style="width: 33.333%;" />
<col style="width: 33.333%;" />
<tr class="t1Row">
<td>heading1</td>
<td>chkheading</td>
<td>newest</td>
</tr>
<tr class="t2Row">
<td>search</td>
<td>results</td>
<td>logged</td>
</tr>
<tr class="t1Row">
<td>cruel</td>
<td>verify</td>
<td>broom</td>
</tr>
</table>
</body>
</html>
Now, if we search for table, body, style, etc., the file will be included in the search results, because the raw markup is indexed along with the visible text.
Sometimes the requirement is to search only the WYSIWYG (visible) text of HTML files, and not the tags, attributes and other metadata.
What we need to do is provide Alfresco with the WYSIWYG text of HTML files rather than the raw text. Alfresco provides transformers for this purpose, and we will use one that already ships with Alfresco.
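To make the difference between raw text and WYSIWYG text concrete, here is a minimal sketch in Python. It is only an illustration, not how Alfresco's transformer is implemented: the class `VisibleTextExtractor`, the function `visible_text` and the shortened `SAMPLE` document are hypothetical names made up for this example. It uses the standard library's `html.parser` to keep text nodes and drop tags and attributes.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text found between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def visible_text(html):
    """Return the text nodes of `html`, joined by single spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Shortened version of the sample file shown earlier (assumption: same shape).
SAMPLE = """<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head><body>
<table class="AlternateCyanColorGrid">
<tr class="t1Row"><td>heading1</td><td>chkheading</td><td>newest</td></tr>
<tr class="t2Row"><td>search</td><td>results</td><td>logged</td></tr>
</table>
</body></html>"""

text = visible_text(SAMPLE)
print(text)  # heading1 chkheading newest search results logged
```

The raw string contains words such as table and class, while the extracted text contains only the cell values. That extracted form is roughly what we want the search index to see. Note that this simple sketch would also keep the contents of script or style elements, which a real transformer may treat differently.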
To achieve this, create a custom extension file named custom-content-services-context.xml in the extension folder [For Alfresco deployed inside JBoss, the extension folder is server\
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
<bean id="transformer.HtmlParser"
class="org.alfresco.repo.content.transform.HtmlParserContentTransformer"
parent="baseContentTransformer" >
<property name="explicitTransformations">
<list>
<bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" >
<property name="sourceMimetype"><value>text/html</value></property>
<property name="targetMimetype"><value>text/plain</value></property>
</bean>
</list>
</property>
</bean>
</beans>
Now, if you add the same HTML file, terms like body, table, etc. should no longer be indexed.