Simple command-line plain-text spam or ham classifier

I have a load of database entries that were saved and are full of spam. I'd like to be able to pipe the text of each one into spamassassin or a similar tool to get a score for how likely it is to be spam, but without the whole machine-learning-from-mailboxes business, and without having to run it on a mail server. Everything I've found seems heavily geared toward email, rather than being a simple stdin > process > stdout kind of tool.

If there's one written in a scripting language, that's fine, but I'd prefer something that works on a CentOS machine out of the box. Any help appreciated.

    
by Matt Fletcher 20.10.2014 / 12:47

1 answer

It's interesting that you mention spamassassin, because it has a mode that seems to be exactly what you want (/tmp/spammy in this case contains a single candidate email):

[me@lory tmp]$ spamassassin < /tmp/spammy 
Oct 20 11:54:47.097 [19986] warn: netset: cannot include 127.0.0.1/32 as it has already been included
From: "REDACTED" <redacted>
To: REDACTED
Subject: Pharmacy
Date: 20 Oct 2014 02:22:04 +0100
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on lory.teaparty.net
X-Spam-Flag: YES
X-Spam-Level: *********
X-Spam-Status: Yes, score=9.2 required=3.9 tests=BAYES_20,MISSING_MID,
        NO_RECEIVED,NO_RELAYS,TVD_SPACE_RATIO,URIBL_BLACK,URIBL_DBL_SPAM,
        URIBL_JP_SURBL,URIBL_SBL,URIBL_WS_SURBL autolearn=no version=3.3.1
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----------=_5444E9FB.89EA3D9F"

This is a multi-part message in MIME format.

------------=_5444E9FB.89EA3D9F
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Spam detection software, running on the system "lory.teaparty.net", has
identified this incoming email as possible spam.  The original message
has been attached to this so you can view it (if it isn't spam) or label
similar future email.  If you have any questions, see
the administrator of that system for details.

Content preview:  Good medicines special http://canadiantabletstore.com/ [...]


Content analysis details:   (9.2 points, 3.9 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 2.5 URIBL_DBL_SPAM         Contains a spam URL listed in the DBL blocklist
                            [URIs: canadiantabletstore.com]
 1.7 URIBL_BLACK            Contains an URL listed in the URIBL blacklist
                            [URIs: canadiantabletstore.com]
 1.6 URIBL_WS_SURBL         Contains an URL listed in the WS SURBL blocklist
                            [URIs: canadiantabletstore.com]
 1.2 URIBL_JP_SURBL         Contains an URL listed in the JP SURBL blocklist
                            [URIs: canadiantabletstore.com]
-0.0 NO_RELAYS              Informational: message was not relayed via SMTP
 1.6 URIBL_SBL              Contains an URL's NS IP listed in the SBL blocklist
                            [URIs: canadiantabletstore.com]
-0.0 BAYES_20               BODY: Bayes spam probability is 5 to 20%
                            [score: 0.1750]
 0.5 MISSING_MID            Missing Message-Id: header
-0.0 NO_RECEIVED            Informational: message has no Received headers
 0.0 TVD_SPACE_RATIO        TVD_SPACE_RATIO



------------=_5444E9FB.89EA3D9F
Content-Type: message/rfc822; x-spam-type=original
Content-Description: original message before SpamAssassin
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Date: 20 Oct 2014 02:22:04 +0100
From: "REDACTED" <REDACTED>
To: REDACTED
Subject: Pharmacy

Good medicines special
http://canadiantabletstore.com/


------------=_5444E9FB.89EA3D9F--
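
To apply the same approach to plain-text database entries, one option is to wrap each entry in a minimal message, pipe it through spamassassin, and read the score back from the X-Spam-Status header shown above. The following is only a sketch under stated assumptions: the function names and the example.invalid addresses are hypothetical placeholders, and it assumes the spamassassin binary is on the PATH and emits the header in the format seen in the transcript. Note that synthetic headers can trigger header rules of their own (MISSING_MID and NO_RECEIVED both fired above), so the scores are best compared against each other rather than taken as absolute.

```python
import re
import subprocess
import sys
from email.message import EmailMessage

# Matches the header SpamAssassin adds, for example:
#   X-Spam-Status: Yes, score=9.2 required=3.9 tests=...
STATUS_RE = re.compile(
    r"^X-Spam-Status:\s*(Yes|No),\s*"
    r"score=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)",
    re.MULTILINE,
)


def wrap_as_message(text, subject="database entry"):
    """Wrap plain text in a minimal email message.

    The From/To addresses are placeholders (assumption), not real mailboxes.
    """
    msg = EmailMessage()
    msg["From"] = "entry@example.invalid"
    msg["To"] = "checker@example.invalid"
    msg["Subject"] = subject
    msg.set_content(text)
    return msg.as_string()


def parse_spam_status(output):
    """Return (is_spam, score, required), or None if no status header is found."""
    match = STATUS_RE.search(output)
    if match is None:
        return None
    return match.group(1) == "Yes", float(match.group(2)), float(match.group(3))


def score_text(text):
    """Pipe one entry through spamassassin on stdin and parse its stdout."""
    result = subprocess.run(
        ["spamassassin"],
        input=wrap_as_message(text),
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # reports may contain non-UTF-8 bytes
        check=True,
    )
    return parse_spam_status(result.stdout)


if __name__ == "__main__":
    # Usage: some-command-printing-one-entry | python score_entry.py
    print(score_text(sys.stdin.read()))
```

Keeping the parsing separate from the subprocess call means the header handling can be checked against saved output, such as the transcript above, without spamassassin installed.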
    
answered 20.10.2014 / 12:59
