Using SpamHalter and Content Control in concert

chriscw

posted Mar 9 '09 at 12:38 pm

It is possible to use an expression filtering rule to check the *X-UC-Weight* setting in the header of emails and automatically forward those with this header to the is_spam account on your system. You can then use the same rule to file the messages wherever you want.
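
As a rough illustration of what such a rule tests, here is a minimal Python sketch. It is not Pegasus or Mercury filter syntax. It only checks whether a message carries an X-UC-Weight header and, if so, picks the is_spam account as the forward target. The function names and the sample header value are assumptions made up for the example.

```python
from email import message_from_string

def has_weight_header(raw_message, header="X-UC-Weight"):
    """Return True if the message has the given header, whatever its value."""
    return message_from_string(raw_message).get(header) is not None

def forward_target(raw_message, spam_account="is_spam"):
    """Return the account to forward to, or None to leave the message alone."""
    return spam_account if has_weight_header(raw_message) else None

# Hypothetical messages; the header value is only a placeholder.
flagged = "From: a@example.com\nX-UC-Weight: 51\nSubject: hi\n\nbody\n"
clean = "From: b@example.com\nSubject: hello\n\nbody\n"

print(forward_target(flagged))
print(forward_target(clean))
```

The check is on the presence of the header only, which matches the "those with this header" wording above; a real rule could also look at the weight value.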


pmerik

posted Jan 29 '08 at 1:45 pm

Hello all,

[Background]
I'm running Pegasus 4.41 and use both SpamHalter and Content Control to filter out junk email. This works very well in my experience, with SH filtering out maybe 95% of the junk - and almost no false positives. CC catches some two thirds of the rest. A big Thank You to David Harris and Lukas Gebauer!

[Initial problem]
In my original setup, Content Control would put junk in the same place as SpamHalter: the standard "Junk or suspicious mail" folder. It didn't seem like SH learned from CC's decisions (judging from the headers and the "Explain classification" dialog box), so junk messages that didn't quite reach CC's threshold would end up in my Inbox despite being very similar to other junk.

[Work-around]
I created a new folder "Junk - Content Control" for the output from CC plus a "Quick Move" action that moves messages into the standard junk folder. Checking headers and the "Explain..." dialog, it now looks like SH learns lessons from CC's decisions. When a new class of junk appears it only takes a few days for SH to filter out the bulk, relieving CC. However, this comes at the cost of two spam folders to browse and an extra copy action to train SH, before finally deleting junk messages.

[Questions/Issues]
(1) SH is trained every time a message is moved into the "Junk..." folder. However, my observations seem to indicate that training is not triggered when CC moves a message. Am I right on the money here?

(2) Any ideas on a better (=less extra manual actions) solution than the work-around I use?

Cheers,

Erik


Marc

posted Feb 6 '08 at 2:46 pm

[quote user="pmerik"](1) SH is trained every time a message is moved into the "Junk..." folder. However, my observations seem to indicate that training is not triggered when CC moves a message. Am I right on the money here?[/quote]Your observations are right AFAIK.

[quote user="pmerik"](2) Any ideas on a better (=less extra manual actions) solution than the work-around I use?[/quote]My opinion is: Try to improve Spamhalter's spam detection ratio and make Content Control superfluous.


pmerik

posted Feb 6 '08 at 8:53 pm

Thank you for confirming my observation on SpamHalter vs. Content Control behaviour.

Making Content Control superfluous means training SpamHalter to recognize any spam that CC would otherwise catch. That is not happening automatically (as observed), which is what led me to the work-around I mentioned. This works, but involves more manual handling than I enjoy.

Although my SH now does learn from CC's decisions, it will never catch things like Arbitrary1ViagraArbitrary2. Far too many such concatenated pseudo-words exist for SH to use them as indications of spam. That is, IMHO, why rule-based methods cannot be replaced by Bayesian methods.
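
To make the concatenation point concrete, here is a toy Python sketch. It assumes a filter that only knows whole-word tokens and a rule that matches a substring; it does not reproduce how SpamHalter tokenizes mail or how Content Control rules are written, and all names in it are hypothetical.

```python
import re

# Tokens a word-based filter has already learned as spammy (toy data).
known_spam_tokens = {"viagra"}

# A rule-style check: match the word anywhere, even inside a longer string.
RULE = re.compile(r"viagra", re.IGNORECASE)

def word_tokens(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_filter_hits(text):
    """True if any whole word is a known spam token."""
    return any(t in known_spam_tokens for t in word_tokens(text))

def rule_hits(text):
    """True if the rule pattern occurs anywhere in the text."""
    return RULE.search(text) is not None

plain = "Buy Viagra now"
concatenated = "Buy qzxViagrapwl now"

print(token_filter_hits(plain), rule_hits(plain))
print(token_filter_hits(concatenated), rule_hits(concatenated))
```

In the concatenated case the only token the word-based check sees is "qzxviagrapwl", which it has never met before, while the substring rule still fires.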


Thomas R. Stephenson

posted Feb 6 '08 at 10:09 pm

[quote user="pmerik"]

Thank you for confirming my observation on SpamHalter vs. Content Control behaviour.

Making Content Control superfluous means training SpamHalter to recognize any spam that CC would otherwise catch. That is not happening automatically (as observed), which is what led me to the work-around I mentioned. This works, but involves more manual handling than I enjoy.

Although my SH now does learn from CC's decisions, it will never catch things like Arbitrary1ViagraArbitrary2. Far too many such concatenated pseudo-words exist for SH to use them as indications of spam. That is, IMHO, why rule based methods can not be replaced by Bayesian methods.

[/quote]

I'm catching 99.77% of the spam right now with Bayesian filtering using POPFile. Not too sure it's worth all that much effort to use CC ( I have one entry in the CC file) at all to catch the 0.13% that get through the Bayesian filtering. In looking at Spamhalter on the same system it's getting well over 99% of the spam as well. YMMV though. ;-)

FWIW, I use TOE, Not Spam boost of 1 and threshold of 50% in Spamhalter.


pmerik

posted Feb 7 '08 at 11:24 am

[quote user="Thomas R. Stephenson"]

I'm catching 99.77% of the spam right now with Bayesian filtering using POPFile. Not too sure it's worth all that much effort to use CC ( I have one entry in the CC file) at all to catch the 0.13% that get through the Bayesian filtering. In looking at Spamhalter on the same system it's getting well over 99% of the spam as well. YMMV though. ;-)

FWIW, I use TOE, Not Spam boost of 1 and threshold of 50% in Spamhalter.

[/quote]

Thomas, thanks for your notes on SpamHalter settings. I'll try changing the threshold and boost. I haven't done that before to avoid false positives.

Indeed, my mileage has varied ;-) : In the beginning, SH would catch 75% and CC 20%. After introducing my "dual spam folder work-around" SH got better and caught perhaps 99%. And now SH is down to 90% again, because of the "new" concatenation strategy used by spammers to fool word-based Bayesian filters. My figures are like 90% spam caught by SH, 9% caught by CC and 1% caught by me.

I will probably remove CC rules that SH makes superfluous. Still, the ones that do trigger, trigger often. A lot of the spam I have received over the last few weeks is of the xxxViagrayyy type that SpamHalter will never catch. I'm not keen on throwing out CC and handling that myself ;-)

It seems this boils down to my work-around being needed (because SH does not learn from CC's decisions) and no easier solution existing. I'll post a suggestion about removing the non-learning behaviour.

Cheers,
Erik


dilberts_left_nut

posted Feb 7 '08 at 1:07 pm

Try raising the "Probability level for unknown tokens"

I put mine up to 80 and it seems to work very well, especially on the "xxxViagrayyy type" (these tokens are very unlikely to be found in our legit mail [:)]).

I also use 'Train Always' so SH has a good (and up to date) idea of what our 'good mail' looks like.

I suspect a level this high may cause some FPs if you use TOE and/or get a lot of mail with new words.
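
A small Python sketch of why this setting matters. It assumes a Graham-style naive Bayes combination of per-token probabilities; whether SpamHalter combines tokens exactly this way is an assumption, and the token probabilities below are invented for illustration.

```python
from math import prod

def spam_score(tokens, token_probs, unknown_prob):
    """Combine per-token spam probabilities into one score (Graham-style).

    Tokens missing from token_probs get unknown_prob instead.
    """
    probs = [token_probs.get(t, unknown_prob) for t in tokens]
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Toy database: only a couple of words seen in good mail (hypothetical values).
learned = {"hello": 0.2, "meeting": 0.1}

# A message made only of never-seen concatenated pseudo-words.
junk = ["qzxviagrapwl", "abcviagradef"]
# A legit message that happens to contain two new words.
legit = ["hello", "new1", "new2"]

print(spam_score(junk, learned, unknown_prob=0.4))
print(spam_score(junk, learned, unknown_prob=0.8))
print(spam_score(legit, learned, unknown_prob=0.8))
```

With the unknown-token probability at 0.4 the junk message scores below one half; at 0.8 it scores well above. The last line shows the false-positive risk mentioned above: a good message with a few new words is also pulled toward spam.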


pmerik

posted Feb 8 '08 at 3:39 pm

Dil, thanks for the suggestion. That sounds like a good idea.

However, I can't find the "unknown tokens" setting, not even after upgrading to SpamHalter 1.1.0.160. (Before that I used version 1.0.0.whatever, which was included with Pegasus 4.41.) I even looked in the WI_sph.ini file, to no avail.

Please tell me if there is something more I need to do, or if I'm looking in the wrong place.

Thanks,
Erik


pmerik

posted Feb 8 '08 at 3:41 pm

BTW, I use 'Train always', too. Spam level = 70% and Non-spam boost=2.

Erik


dilberts_left_nut

posted Feb 9 '08 at 12:16 am

This is the relevant section of my spamhalter.ini

[quote]

[bayDynamic]
bayForcedWrites=0
bayNoSpamBoost=1
bayClasifyMaxTokens=20
bayUnknownProb=80 << unknown token setting
baySpamProb=40
bayMaxCorrCnt=50
bayOldDays=30
bayExpire=180
bayWhiteOldDays=365
[/quote]


Marc

posted Feb 9 '08 at 12:09 pm

I've never seen a file called "spamhalter.ini". You are probably talking about Spamhalter for Mercury. "WI_sph.ini" doesn't have these configuration options.

dilberts_left_nut

posted Feb 9 '08 at 11:29 pm

Yes, sorry. I haven't used it with pmail, I assumed it worked the same. [:$]


Marc

posted Feb 10 '08 at 12:21 pm

But it would be nice if both Spamhalters would work the same. [:)]

@ Erik:
I recommend using TOE. 'Train always' can easily mess up your whole database (I wouldn't call that Bayesian poisoning but it's something similar).
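
For readers unfamiliar with the two terms, here is a minimal Python sketch of the generic difference between the policies. It describes the usual meaning of "train on error" and "train always" in Bayesian filters, not SpamHalter's internal code; the function and policy names are hypothetical.

```python
def should_train(policy, predicted_spam, actually_spam):
    """Decide whether the filter's database is updated for one message.

    policy: "toe" (train on error) or "always".
    predicted_spam: the filter's own verdict.
    actually_spam: the final label, e.g. after the user moved the message.
    """
    if policy == "always":
        return True
    if policy == "toe":
        # Only learn when the filter got it wrong.
        return predicted_spam != actually_spam
    raise ValueError(f"unknown policy: {policy}")

# Correctly classified spam: TOE skips it, train-always learns it again.
print(should_train("toe", predicted_spam=True, actually_spam=True))
print(should_train("always", predicted_spam=True, actually_spam=True))
# Missed spam that the user moved to the junk folder: both policies train.
print(should_train("toe", predicted_spam=False, actually_spam=True))
```

Under train-always every message feeds the database, so if the label comes from the filter's own uncorrected verdict, any mistake is learned as if it were true. That is one plausible reading of the "mess up your whole database" warning.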


pmerik

posted Feb 11 '08 at 12:53 pm

Well, I added the 'unknown token' setting to the .ini file anyway.

It seems it neither hurts, nor helps. I agree that it would be nice if both SpamHalters worked the same.

As for 'train on error' I might try that, although I haven't seen any problems stemming from 'train always'.

Of course, neither of these options helps with the original issue of this thread: the fact that SpamHalter for Pegasus only learns from a subset of all messages moved to the 'Junk...' folder.


Thomas R. Stephenson

posted Feb 11 '08 at 4:42 pm

Spamhalter works with all files moved manually to the folder. You can have CC and Spamhalter dumping into different folders and then manually move the mail from the CC folder to the Spamhalter folder for training.


pmerik

posted Feb 11 '08 at 5:51 pm

Thomas, thanks for the suggestion.

Perhaps this thread is now getting a bit too long, obscuring the actual issue at hand. Please let me know if I can make the problem description any clearer.

I really do appreciate that you are trying to help me. The suggested solution is exactly the [Work-around] in my original post, though. It does work.

However, I want to get rid of the extra actions that I have to take manually. Extra actions needed because SpamHalter only learns from a subset of the mail in the 'Junk...' folder.

This non-behaviour caused very slow initial training of SH and would have caused an equally slow adaptation to changing spammer practices, had I not devised the work-around. Spam caught by Content Control contains extra stuff, which my CC rules are not sophisticated enough to care about. It turns out, though, that it is enough for SH detection to improve considerably. And that's good, because SH has a higher detection ratio, at least in my experience. YMMV, of course.

IMHO it's very difficult to see a reason for not having SH train on every piece of junk mail. It seems more like an oversight in the subscription mechanism SH uses for notification about messages moved to/from the 'Junk...' folder. Although I can live with it, because I have used Pegasus for several years, it is definitely an obstacle when I try to introduce Pegasus to Outlook users.

Imagine explaining this to Mum, 75+ years old. "You have to move all junk mail from this 'Junk...' folder to the other 'Junk...' folder". Isn't that the kind of boring, no-brain, repetitive task best left to computers?

Also, I think quirks like this one deter even the more computer-savvy people. That is the other reason leading me to ask for a change in the 'Suggestions' forum (/forums/thread/6874.aspx).

But I digress. Again, thank you for trying to help!

Best regards,

Erik

