Community Discussions and Support
Content control ignores some HTML tags?

Thank you Thomas for a long answer!

I also guessed that Content Control checks (only) the "readable" form of the message (and with that the tags that contribute to it), although checking the HTML tag itself does not fit into that.

It is possible that the person who wrote Content Control simply decided not to check (many) other html tags in order to speed things up.

I think that it would be a good thing to also check the HTML tags that usually contain information about the rest of the HTML document (at least !doctype and meta tags), because they very often contain telltale signs of spam. It could be done as one of the "Specialized Content Control Tests". For example, a new test named "HTMLTag" with parameters for an HTML tag name and a string to search for.
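To make the suggestion concrete, here is a rough Python sketch of what such an "HTMLTag" test could do. It is only an illustration of the idea, not anything that exists in Pegasus Mail; the names TagSearch and html_tag_contains are made up, and the sample message is assumed to be already decoded (no "=3D").

```python
from html.parser import HTMLParser


class TagSearch(HTMLParser):
    """Collect the source text of every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == self.tag_name:
            # get_starttag_text() returns the tag exactly as written in the source.
            self.matches.append(self.get_starttag_text())

    def handle_decl(self, decl):
        # <!DOCTYPE ...> arrives here as "DOCTYPE ...".
        if self.tag_name == "!doctype" and decl.lower().startswith("doctype"):
            self.matches.append("<!" + decl + ">")


def html_tag_contains(html, tag_name, needle):
    """Return True if any <tag_name ...> tag contains needle, ignoring case."""
    parser = TagSearch(tag_name)
    parser.feed(html)
    parser.close()
    return any(needle.lower() in text.lower() for text in parser.matches)


SAMPLE = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">'
    '<html><head><META content="MSHTML 6.00.2900.3132" name=GENERATOR></head>'
    "<body><p>Hello</p></body></html>"
)
```

With this sketch, html_tag_contains(SAMPLE, "meta", "generator") is true while html_tag_contains(SAMPLE, "img", "generator") is false, which is the kind of per-tag test I have in mind.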

 

 


It seems that Content Control (Pmail v 4.41) ignores some HTML tags when parsing messages.

For example, when a message with HTML contents contains these tags (among others):

<META content=3D"MSHTML 6.00.2900.3132" name=3DGENERATOR>
<img alt="" src="cid:part1.07050806.06030705@mrainc.com" height="480" width="452">

and I have these rules:

if body contains "generator" weight 50
if body contains "cid:" weight 51

Content Control reports a match only for the rule looking for "cid:".

Now, when I add the attribute "name=3DGENERATOR" to the img tag, like this:

<img alt="" src="cid:part1.07050806.06030705@mrainc.com" height="480" width="452" name=3DGENERATOR>

Content Control reports a match for both rules.

Also, CC seems to ignore complete tags, not just attributes - rules looking for "<meta" turn negative while rules looking for "<img" turn positive!

I did a few tests with other tags and CC seems to ignore these html tags: !DOCTYPE, head, meta, body, div, span, br
These html tags are not ignored: a, img

Why is Content Control ignoring these HTML tags?



Hello!

 

In order to answer your question, I want to describe what content control is able to do. This might help explain why content control sometimes does not work as you would have expected it to.

 

When working with content control, you should understand what it does and does not do. In general, it checks both (a) the decoded version of the mail body and (b) the headers at the beginning of the mail message.
 

 

As far as (a) is concerned, there is an easy rule of thumb: what you can read will be checked by content control.

Imagine reading a mail message in Pegasus Mail: you have opened the message in a Message Reader window of its own, and you have several tabs like "Message", "Raw view" and so on. Compare the "Message" tab and the "Raw view" tab: the "Message" tab contains a so-called decoded version of the message, whereas the "Raw view" tab contains the encoded version of the message text (and several headers). The decoded version is the human-readable text; the encoded version is a kind of source text. In simple words: the decoded version is what you read, the encoded version is what the machine reads.

Content control checks the decoded version of the message body, i.e. what is displayed to you in the "Message" tab (or in the message preview when in preview mode). That is why I have said that content control checks what you can read.

There might be some complicated variants to that rule of thumb, for example if a message contains two human-readable versions of the text (which usually happens if the message has both an HTML- and a "pure text"-version of the message text). However, this does not affect the way content control basically works.
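As a small illustration of the difference between the encoded and the decoded text, the following Python sketch decodes the META line from the original question. It only covers one layer of "decoding" (the quoted-printable transfer encoding, where "=3D" stands for "="), and it says nothing about how Pegasus Mail itself does this; the function name decode_quoted_printable is made up for the example.

```python
import quopri


def decode_quoted_printable(raw: bytes) -> str:
    """Undo quoted-printable transfer encoding, e.g. turn '=3D' back into '='."""
    return quopri.decodestring(raw).decode("ascii")


# The line as it appears in raw view (taken from the question above).
RAW_LINE = b'<META content=3D"MSHTML 6.00.2900.3132" name=3DGENERATOR>'

# The same line after decoding.
DECODED_LINE = decode_quoted_printable(RAW_LINE)
```

Here DECODED_LINE contains name=GENERATOR, whereas RAW_LINE only contains name=3DGENERATOR, so a search string written against one form may not match the other.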

 

 
As far as (b) is concerned, content control checks the initial headers.

Reading the raw view of a message, you see the headers at the beginning of the message, followed by a blank line and the so-called body. As a rule of thumb, content control is able to check the headers above the first blank line of the raw view. (Technically speaking, the first blank line in the raw view separates the initial headers from the message body.)

Note that a message can have some more headers. You usually see this when a message consists of several parts, for example an HTML-version of the message text and a "pure text"-version of the text and some inline-graphics (i.e. pictures that are embedded in the HTML-text and sent within the mail message).

Then, each part of the message is introduced by some boundary lines (some additional headers before each part). You can see them only when looking at the raw view of the message, but not within the decoded version of the message text. These boundary lines are not checked by content control; only the headers above the very first blank line are.
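The distinction between the initial headers and the additional headers of each part can be seen with Python's standard email package. This is just a sketch with an invented two-part test message (addresses, subject and boundary are made up); it shows the structure described above, not what content control does internally.

```python
from email import message_from_string

# Invented example: one plain-text part and one HTML part.
RAW = (
    "From: sender@example.com\n"
    "Subject: Test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Hello in plain text\n"
    "--XYZ\n"
    "Content-Type: text/html; charset=us-ascii\n"
    "\n"
    "<html><body><p>Hello in HTML</p></body></html>\n"
    "--XYZ--\n"
)


def split_headers(raw):
    """Return (initial header names, list of header names for each part)."""
    msg = message_from_string(raw)
    # Headers above the first blank line of the raw message.
    initial = [name for name, _ in msg.items()]
    part_headers = []
    for part in msg.walk():
        if part is msg:
            continue  # walk() yields the message itself first
        # Headers that follow a boundary line and introduce one part.
        part_headers.append([name for name, _ in part.items()])
    return initial, part_headers


INITIAL, PARTS = split_headers(RAW)
```

INITIAL holds From, Subject, MIME-Version and Content-Type, i.e. the headers that rule of thumb (b) covers; PARTS holds the per-part Content-Type headers that sit behind the boundary lines.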

 

----

 

This is the technical background about how content control works... now back to your question. I have to admit that I am only guessing what is happening because I am just another user, not the software author. Nevertheless, I hope my answer is helpful.

I expect most of the HTML-tags not to be checked by content control. Any HTML-tag itself is visible only in raw view, but not in the decoded version of the message text. As described above, the main goal of content control is to check the decoded version of the text - that is why an HTML-tag is not checked itself (as it is only visible in raw view). From that point of view, it is no surprise that meta or body are left out.

 

You said that some HTML-tags were indeed checked by content control. This seems to contradict the rule in (a), saying that only the human-readable text is checked by content control (that is why I am somewhat surprised to read that some HTML-tags are processed).

Reading your description, I think the tags that are seen by content control link to and define a visible element. As far as content control is concerned, there seem to be two kinds of HTML-tags: those that only "fine-tune" an existing text, and those that define an element of its own that could not exist without the respective HTML-tag.

To me, the "fine-tuning" tags are probably those about Bold, Underlined, etc. (where the text elements to be fine-tuned would also be there without any formatting HTML-tags), whereas the "defining" tags may be those about defining an inline-graphic or a link (where the defined elements could not be displayed at all without the defining HTML-tags). In other words: the latter would not be human-readable at all without the HTML-tags, but the former would. I guess this may be the crucial point: if it defines and creates an element of its own, an HTML-tag can be checked by content control because the element defined by that HTML-tag is human-readable (and therefore subject to our rule of thumb (a)).
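To show what this guess would mean in practice, here is a toy model in Python. It is purely an assumption for illustration and certainly not the real content control code: it treats the visible text plus the "a" and "img" tags (the two tags reported as not ignored) as the text that gets checked, and drops every other tag. The names VisibleView, checked_text and body_contains are invented.

```python
from html.parser import HTMLParser

# Assumption: only these "defining" tags are kept, as observed in the question.
DEFINING_TAGS = {"a", "img"}


class VisibleView(HTMLParser):
    """Toy model: keep visible text and the source of 'defining' tags only."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in DEFINING_TAGS:
            self.chunks.append(self.get_starttag_text())

    def handle_data(self, data):
        # Text between tags; a real renderer would also skip title/script/style.
        self.chunks.append(data)


def checked_text(html):
    view = VisibleView()
    view.feed(html)
    view.close()
    return "".join(view.chunks)


def body_contains(html, needle):
    """Mimic a rule like: if body contains "needle" (case-insensitive)."""
    return needle.lower() in checked_text(html).lower()


IMG = '<img alt="" src="cid:part1.07050806.06030705@mrainc.com" height="480" width="452"'
META = '<META content="MSHTML 6.00.2900.3132" name=GENERATOR>'
FIRST = "<html><head>" + META + "</head><body>" + IMG + "></body></html>"
SECOND = "<html><head>" + META + "</head><body>" + IMG + " name=GENERATOR></body></html>"
```

Under this model the "generator" rule fails on FIRST but matches on SECOND, and the "cid:" rule matches both, which is the same pattern as in the original report. That the model reproduces the report does not prove the guess, of course.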

 

I hope I have made myself clear. I cannot stress enough that I am only guessing why some HTML-tags are processed while others are not. I am sure about the basic description (see the rules of thumb in (a) and (b)), but my answer to your actual question is somewhat speculative, I have to admit. Nevertheless, you may now have a place to start from, I hope.

 

 


Thomas, thanks for that very nice description of how content control works-- I learned some good things from that.

Cheers, Dave in ABQ
