Content control ignores some HTML tags?

lesles

posted Aug 20 '07 at 12:58 pm

Thank you Thomas for a long answer!

I also guessed that Content Control checks only the "readable" form of the message (and, with it, the tags that contribute to that form), although checking the text of an HTML tag itself does not fit that explanation.

It is possible that whoever wrote Content Control simply decided not to check many of the other HTML tags in order to speed things up.

I think it would be a good thing to also check the HTML tags that usually carry information about the rest of the HTML document (at least the !doctype and meta tags), because they very often contain telltale signs of spam. It could be done as one of the "Specialized Content Control Tests": for example, a new test named "HTMLTag" with parameters for an HTML tag name and a string to search for.
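To make the proposal a little more concrete, here is a rough Python sketch of what such a test might do. It is not Pegasus Mail code and says nothing about how Content Control is implemented; the names `TagTextCollector` and `html_tag_test` are made up for illustration, and the case-insensitive matching is an assumption.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the source text of every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag == self.tag_name:
            self.matches.append(self.get_starttag_text())

    def handle_decl(self, decl):
        # <!DOCTYPE ...> arrives here without the surrounding "<!" and ">".
        if self.tag_name == "!doctype" and decl.lower().startswith("doctype"):
            self.matches.append("<!" + decl + ">")


def html_tag_test(html, tag_name, needle):
    """Return True if `needle` occurs inside any `tag_name` tag of `html`."""
    collector = TagTextCollector(tag_name)
    collector.feed(html)
    collector.close()
    needle = needle.lower()
    return any(needle in text.lower() for text in collector.matches)


sample = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">'
    '<html><head><META content="MSHTML 6.00.2900.3132" name=GENERATOR></head>'
    '<body><img src="cid:example"></body></html>'
)

print(html_tag_test(sample, "meta", "generator"))
print(html_tag_test(sample, "img", "generator"))
print(html_tag_test(sample, "!doctype", "transitional"))
```

The point of the design is that the search is restricted to one named tag, so a rule could target the meta or !doctype tags without scanning all of the markup.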


lesles

posted Aug 10 '07 at 4:46 pm

It seems that Content Control (Pmail v 4.41) ignores some HTML tags when parsing messages.

For example, when a message with HTML contents contains these tags (among others):

<META content=3D"MSHTML 6.00.2900.3132" name=3DGENERATOR>
<img alt="" src="cid:part1.07050806.06030705@mrainc.com" height="480" width="452">

and I have these rules:

if body contains "generator" weight 50
if body contains "cid:" weight 51

Content Control reports a match only for the rule looking for "cid:".

Now, when I add the attribute "name=3DGENERATOR" to the img tag, like this:

<img alt="" src="cid:part1.07050806.06030705@mrainc.com" height="480" width="452" name=3DGENERATOR>

Content Control finds both rules positive.

Also, CC seems to ignore complete tags, not just attributes: rules looking for "<meta" do not match, while rules looking for "<img" do!

I did a few tests with other tags, and CC seems to ignore these HTML tags: !DOCTYPE, head, meta, body, div, span, br
These HTML tags are not ignored: a, img

Why is Content Control ignoring these HTML tags?


Thomas_N_

posted Aug 17 '07 at 6:27 pm

Hello!

In order to answer your question, I want to describe what content control is able to do. This might help you understand why content control sometimes does not work as you would expect.

When working with content control, you should understand what it does and what it does not do. In general, it checks both (a) the decoded version of the mail body and (b) the headers at the beginning of the mail message.

As far as (a) is concerned, there is an easy rule of thumb: what you can read will be checked by content control.

Imagine reading a mail message in Pegasus Mail: you have opened the message in a Message Reader window of its own, and you have several tabs such as "Message", "Raw view" and so on. Compare the "Message" tab and the "Raw view" tab: the "Message" tab contains a so-called decoded version of the message, whereas the "Raw view" tab contains the encoded version of the message text (and several headers). The decoded version is the human-readable text; the encoded version is a kind of source text. In simple words: the decoded version is what you read, the encoded version is what the machine reads.

Content control checks the decoded version of the message body, i.e. what is displayed to you in the "Message" tab (or in the message preview when in preview mode). That is why I have said that content control checks what you can read.
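One small, concrete piece of the difference between the encoded and the decoded version is the transfer encoding. The "=3D" in the META tag quoted in the question is quoted-printable for "=", so a search string can match the raw text without matching the decoded text, and vice versa. The Python sketch below only illustrates that general point with the standard `quopri` module; it is not a description of what Content Control does internally, and the helper `contains` and its case-insensitive matching are assumptions for illustration.

```python
import quopri

# The META tag exactly as it appears in the raw (encoded) message.
raw_line = b'<META content=3D"MSHTML 6.00.2900.3132" name=3DGENERATOR>'

# Quoted-printable decoding turns "=3D" back into "=".
decoded_line = quopri.decodestring(raw_line)


def contains(haystack: bytes, needle: str) -> bool:
    """Case-insensitive substring test (hypothetical helper)."""
    return needle.lower().encode("ascii") in haystack.lower()


print(decoded_line.decode("ascii"))
print(contains(raw_line, "name=3Dgenerator"))      # matches the raw text
print(contains(decoded_line, "name=3Dgenerator"))  # no longer matches after decoding
print(contains(decoded_line, "name=generator"))    # matches the decoded text
```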

There might be some complicated variants to that rule of thumb, for example if a message contains two human-readable versions of the text (which usually happens if the message has both an HTML- and a "pure text"-version of the message text). However, this does not affect the way content control basically works.

As far as (b) is concerned, content control checks the initial headers.

Reading the raw view of a message, you see the headers at the beginning of the message, followed by a blank line and the so-called body. As a rule of thumb, content control is able to check the headers above the first blank line of the raw view. (Technically speaking, the first blank line in the raw view separates the initial headers from the message body.)

Note that a message can have some more headers. You usually see this when a message consists of several parts, for example an HTML-version of the message text and a "pure text"-version of the text and some inline-graphics (i.e. pictures that are embedded in the HTML-text and sent within the mail message).

Then, each part of the message is introduced by some boundary lines (some additional headers before each part). You can see them only when looking at the raw view of the message, but not within the decoded version of the message text. These boundary lines are not checked by content control; only the headers above the very first blank line are.
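The layout described above (initial headers, first blank line, then parts that each carry their own small header block) can be seen with Python's standard `email` package. This is only a sketch of the message structure; the sample message, its addresses and the boundary string "XYZ" are invented for illustration, and nothing here is taken from Pegasus Mail itself.

```python
from email import message_from_string

# A made-up two-part message: plain-text and HTML versions of the same text.
RAW = """\
From: sender@example.com
Subject: Test message
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="us-ascii"

Plain text version.
--XYZ
Content-Type: text/html; charset="us-ascii"

<html><body><b>HTML version.</b></body></html>
--XYZ--
"""

# Everything above the first blank line is the initial header block.
head, _, rest = RAW.partition("\n\n")
initial_headers = head.splitlines()

msg = message_from_string(RAW)
top_level = [name for name, _ in msg.items()]

# Each part has its own headers, which live below the first blank line.
part_types = [part.get_content_type() for part in msg.get_payload()]

print(initial_headers)
print(top_level)
print(part_types)
```

The per-part "Content-Type" lines appear only in `rest`, never in `head`, which matches the rule of thumb that only the headers above the very first blank line count as initial headers.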

----

This is the technical background about how content control works... now back to your question. I have to admit that I am only guessing what is happening because I am just another user, not the software author. Nevertheless, I hope my answer is helpful.

I expect most of the HTML-tags not to be checked by content control. Any HTML-tag itself is visible only in raw view, but not in the decoded version of the message text. As described above, the main goal of content control is to check the decoded version of the text - that is why an HTML-tag is not checked itself (as it is only visible in raw view). From that point of view, it is no surprise that meta or body are left out.

You said that some HTML-tags were indeed checked by content control. This seems to contradict the rule in (a), saying that only the human-readable text is checked by content control (that is why I am somewhat surprised to read that some HTML-tags are processed).

Reading your description, I think the tags that are seen by content control link to and define a visible element. As far as content control is concerned, there seem to be two kinds of HTML-tags: those that only "fine-tune" an existing text, and those that define an element of their own that could not exist without the respective HTML-tag.

To me, the "fine-tuning" tags are probably those for Bold, Underlined, etc. (where the text elements to be fine-tuned would also be there without any formatting HTML-tags), whereas the "defining" tags may be those that define an inline graphic or a link (where the defined elements could not be displayed at all without the defining HTML-tags). In other words: the latter would not be human-readable at all without the HTML-tags, but the former would. I guess this may be the crucial point: if it defines and creates an element of its own, an HTML-tag can be checked by content control because the element defined by that HTML-tag is human-readable (and therefore subject to our rule of thumb (a)).

I hope I have made myself clear. I cannot stress enough that I am only guessing why some HTML-tags are processed while others are not. I am sure about the basic description (see the rules of thumb in (a) and (b)), but my answer to your actual question is somewhat speculative, I have to admit. Nevertheless, I hope you now have a place to start from.
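To show what this guess would mean in practice, here is a small Python model of it. It is purely a sketch of the hypothesis, not of Content Control's real behaviour: it assumes that the scanned text consists of the visible text plus the source of the "defining" tags, and it takes the set of defining tags (a and img) from the test results reported in the question. The names `ScannedTextModel`, `scanned_text` and `rule_matches` are invented, and the sample tags are shown in decoded form ("=" instead of "=3D").

```python
from html.parser import HTMLParser

# Assumption: only these tags are kept, as reported in the question.
DEFINING_TAGS = {"a", "img"}


class ScannedTextModel(HTMLParser):
    """Keeps visible text and the source of 'defining' tags; drops all other tags."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag in DEFINING_TAGS:
            self.pieces.append(self.get_starttag_text())

    def handle_data(self, data):
        self.pieces.append(data)


def scanned_text(html):
    model = ScannedTextModel()
    model.feed(html)
    model.close()
    return "".join(model.pieces)


def rule_matches(html, needle):
    """Model of: if body contains "needle" (case-insensitive, as an assumption)."""
    return needle.lower() in scanned_text(html).lower()


META = '<META content="MSHTML 6.00.2900.3132" name=GENERATOR>'
IMG = '<img alt="" src="cid:part1.07050806.06030705@mrainc.com" height="480" width="452">'
IMG_WITH_NAME = IMG[:-1] + " name=GENERATOR>"

first = META + IMG             # the original message from the question
second = META + IMG_WITH_NAME  # the variant with the attribute added to img

print(rule_matches(first, "generator"), rule_matches(first, "cid:"))
print(rule_matches(second, "generator"), rule_matches(second, "cid:"))
print(rule_matches(first, "<meta"), rule_matches(first, "<img"))
```

Under these assumptions the model reproduces the pattern described in the question: only the "cid:" rule matches the first message, both rules match the second, and "<img" is found while "<meta" is not.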


Dave.In.ABQ

posted Aug 17 '07 at 9:44 pm

Thomas, thanks for that very nice description of how content control works - I learned some good things from that.

Cheers, Dave in ABQ

