Working with the blacklists

Suggested Regexp syntax

If you're working with blacklists and whitelists you need to know something about regular expressions (regexp). There is plenty of information on the internet, but here are some basics.

Try the simple examples below in the Regexp Editor; it shouldn't take long before you become familiar with the syntax.

First of all, it's good to have a little understanding of how Net Responsibility (NR) works. The program takes each blacklist entry and matches it against a given string. It first tries to divide each URL into natural groups of words, and then tests each group against the blacklist.

Multiple words

If the blacklist entry contains more than one word, NR will try to find each word in the string. If any of the words is missing, there will be no warning. The words don't need to appear in the same order as in the entry. For example, the entry jenna jameson matches the first two strings below, but not the last two, because each of those lacks one of the words:

  • www.google.com/q=jenna+jameson
  • www.google.com/q=jameson+mrs+jenna
  • www.google.com/q=james+jameson
  • www.google.com/q=jenna
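The rule above can be sketched in a few lines of Python. This is only an illustration of the "every word, in any order" idea, not NR's actual implementation: the helper name matches_all_words is made up for this example, and the sketch ignores the way NR divides URLs into groups.

```python
import re

def matches_all_words(entry: str, text: str) -> bool:
    """Return True if every word of the entry occurs somewhere in text."""
    # Each space-separated word is searched on its own, ignoring case,
    # so the order of the words in the text does not matter.
    return all(re.search(word, text, re.IGNORECASE) for word in entry.split())

urls = [
    "www.google.com/q=jenna+jameson",
    "www.google.com/q=jameson+mrs+jenna",
    "www.google.com/q=james+jameson",
    "www.google.com/q=jenna",
]
results = [matches_all_words("Jenna Jameson", url) for url in urls]
print(results)  # [True, True, False, False]
```

The last two strings fail because one of the two words is missing in each of them.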

Case doesn't matter

The matches are always case-insensitive. For example:

  • porn will match against porn, Porn, PORN, pOrn, etc.

Alphabetic characters

You might want to think of regular expressions as patterns. The simplest ones are only made of alphabetic characters. For example:

  • porn
  • pornstar
  • jenna jameson

Question mark (?) with one character

Some characters have certain functions. For instance, a question mark (?) is used to tell the filter to match one or zero of the preceding character. For example:

  • boobs? will match boob or boobs because the s is optional.

Question mark (?) with multiple characters

You can also use the question mark for more than one character, but then you need to group them together with parentheses. For example:

  • porn(star)? will match porn or pornstar because star is optional.
  • porn(stars?)? will match porn, pornstar or pornstars, but not porns.
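If you'd like to check the question mark examples outside the Regexp Editor, the following Python sketch may help. It assumes that Python's re module treats these simple patterns the same way NR does, and the helper accepted is a made-up name. It uses fullmatch so that only the words the pattern itself describes are listed.

```python
import re

def accepted(pattern: str, words: list) -> list:
    """Return the words that the pattern matches in full, ignoring case."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [word for word in words if compiled.fullmatch(word)]

candidates = ["boob", "boobs", "porn", "porns", "pornstar", "pornstars"]

single = accepted(r"boobs?", candidates)         # optional single character
group = accepted(r"porn(star)?", candidates)     # optional group
nested = accepted(r"porn(stars?)?", candidates)  # optional character inside an optional group

print(single)  # ['boob', 'boobs']
print(group)   # ['porn', 'pornstar']
print(nested)  # ['porn', 'pornstar', 'pornstars']
```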

Vertical bar (|)

Another useful character is the vertical bar (|). It works like the word "or": match this or that. It's best to use it inside parentheses. For example:

  • porn(star|ography) will match pornstar or pornography, but not porn.
  • masturbat(e|ing|ion) will match masturbate, masturbating and masturbation.

Question mark (?) with vertical bar (|)

Of course, you can also use the question mark together with the vertical bar. For example:

  • porn(star|ography)? will match porn, pornstar or pornography.
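The vertical bar examples, with and without the trailing question mark, can be compared side by side in a short Python sketch. As before, this assumes Python's re module behaves like NR for these patterns; the word list is only test data.

```python
import re

candidates = [
    "porn", "pornstar", "pornography",
    "masturbate", "masturbating", "masturbation",
]
patterns = [
    r"porn(star|ography)",    # one of the endings is required
    r"masturbat(e|ing|ion)",  # one of three endings is required
    r"porn(star|ography)?",   # the ending is optional
]

# For each pattern, collect the candidate words it matches in full.
accepted = {
    pattern: [w for w in candidates if re.fullmatch(pattern, w, re.IGNORECASE)]
    for pattern in patterns
}

for pattern, words in accepted.items():
    print(pattern, "->", words)
```

Only the last pattern accepts the bare word porn, because the question mark makes the whole group optional.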

Conditional word matches

Sometimes it is useful to match one word and, if it is present, a second one as well. The following example is valid:

  • sex (anal|oral|group)? will match sex, and if anal, oral or group also exists, it will be included in the match.
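One way to picture a conditional word match is sketched below in Python. It is an assumption about the behaviour described above, not NR's real code: the function matched_parts is hypothetical, and it simply treats every space-separated part of the entry as its own pattern.

```python
import re
from typing import List, Optional

def matched_parts(entry: str, text: str) -> Optional[List[str]]:
    """Return the matched fragments, or None if a required word is missing."""
    parts = []
    for word in entry.split():
        hits = [m.group(0) for m in re.finditer(word, text, re.IGNORECASE)]
        if not hits:
            return None  # a required word was not found at all
        # An optional word such as "(anal|oral|group)?" may match nothing;
        # it is only included when it matched some actual text.
        non_empty = [hit for hit in hits if hit]
        if non_empty:
            parts.append(non_empty[0])
    return parts

entry = "sex (anal|oral|group)?"
print(matched_parts(entry, "sex oral"))  # ['sex', 'oral']
print(matched_parts(entry, "sex"))       # ['sex']
print(matched_parts(entry, "oral"))      # None
```

The first word is required, while the optional second word only adds to the match when it is present.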

Other characters (*) (+) (.)

Similar to (?) we have (*) and (+). While (?) means zero or one of the preceding character, (*) means zero or more of it, and (+) means one or more. It is recommended to avoid these characters with Net Responsibility because they may cause many false positives, but you may use them at your own risk. For example:

  • xxx+ will match xxx, xxxxxxx, xxxxxxxxxxxxxxxxxx and so on.
  • po+rn will match porn, pooorn, poooooooorn and so on.
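The difference between (+) and (*) is easy to see with a few test words. The Python sketch below assumes the standard meaning of these characters, as implemented by Python's re module; the sample words are made up.

```python
import re

samples = ["xx", "xxx", "xxxxxxx", "prn", "porn", "pooorn"]

# "+" needs at least one of the preceding character, "*" also accepts none.
plus_x = [s for s in samples if re.fullmatch(r"xxx+", s)]
plus_o = [s for s in samples if re.fullmatch(r"po+rn", s)]
star_o = [s for s in samples if re.fullmatch(r"po*rn", s)]

print(plus_x)  # ['xxx', 'xxxxxxx']
print(plus_o)  # ['porn', 'pooorn']
print(star_o)  # ['prn', 'porn', 'pooorn']
```

Note that po*rn also accepts prn, which shows how quickly these characters widen a pattern.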

One character that is often used inside regular expressions is the dot (.). A dot matches any single character. It is often used together with the asterisk (.*) to create a pattern that matches anything or nothing. It is strongly suggested to avoid this pattern since it will produce many false positives, and the reports will be harder to understand. Since Net Responsibility doesn't care about what comes before or after a match, the following patterns would be totally unnecessary:

  • .*porn.*
  • porn.*star
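To see why the surrounding .* adds nothing, compare the two searches in this Python sketch. It assumes that NR looks for a pattern anywhere in the string, as described above, which corresponds to re.search in Python; the URL is an invented example.

```python
import re

text = "www.example.com/free-porn-videos"  # invented URL, for illustration only

plain = re.search(r"porn", text, re.IGNORECASE)
padded = re.search(r".*porn.*", text, re.IGNORECASE)

# Both searches succeed, but the padded pattern swallows the whole string.
print(plain.group(0))   # porn
print(padded.group(0))  # www.example.com/free-porn-videos
```

Both patterns find the same strings, but the padded one reports the entire URL as the match, which is what makes the reports harder to read.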

Making the reports easier to read

In the reports, every match is categorized according to the blacklist keyword it matched against. So both porn and pornstar would fall into the category porn(star|ography)?, but to make the reports a little easier to understand, Net Responsibility tries to strip out irrelevant information. Therefore, all question marks are removed along with the preceding character, or characters in parentheses. The example above would simply be displayed as porn. Also, for other parentheses only the first word will be displayed, so you should put the easiest to understand first. For example:

  • boobs? will be displayed as boob
  • porn(star|ography)? will be displayed as porn
  • masturbat(e|ing|ion) will be displayed as masturbate
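The simplification rules can be approximated with three substitutions. The Python sketch below is a guess at the behaviour described above, not the code NR uses; the function name displayed_as is made up, and it has only been checked against the three examples listed.

```python
import re

def displayed_as(entry: str) -> str:
    """Simplify a blacklist entry roughly the way it is shown in reports."""
    # 1. Drop optional groups such as "(star|ography)?".
    text = re.sub(r"\([^()]*\)\?", "", entry)
    # 2. Drop an optional single character such as the "s?" in "boobs?".
    text = re.sub(r".\?", "", text)
    # 3. Keep only the first alternative of any remaining group.
    text = re.sub(r"\(([^()|]*)[^()]*\)", r"\1", text)
    return text

for entry in ["boobs?", "porn(star|ography)?", "masturbat(e|ing|ion)"]:
    print(entry, "->", displayed_as(entry))
```

This prints boob, porn and masturbate for the three entries, in line with the list above.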

That's it. Feel free to try out your own keywords in the Regexp Editor.

Using the Regexp Editor

When you're editing your personal blacklist or whitelist it's strongly recommended that you use the Regexp Editor. The main reason is that it will validate your regular expressions (regexp), so you can be sure they'll work as expected.

You might want to open the Regexp Editor and try it out while you're reading this guide.

Make sure you've read Suggested Regexp syntax (above) before you proceed. It's useful to know the guidelines mentioned there.

Matching text

Here you can type any text you'd like to match. It could be a forbidden URL, or it could be some ugly words. Let's say you want to create a regexp that finds the words jenna and jameson; you simply type these words in this box. You can enter many words here and later see which of them are matched. Every new line (created with [Enter], not by automatic wrapping) is independent of the others, so you can have several groups of words and see which ones are matched. For example, you could insert the following words:

  • jenna jameson
  • jameson jenna
  • jenna
  • jameson
  • jen jameson
  • jennas jameson
  • www.google.com/search?q=jenna+jameson
  • www.jenna.com/jameson
  • jenna jamesson

Regexp

Here you type the regular expression (regexp), written with the syntax described in Suggested Regexp syntax (above).

Match

This shows the text you entered in the 'Matching text' section, filtered with the regexp you typed in the 'Regexp' section. Any match is shown in bold, and the result is updated instantly when you edit the matching text or the regexp. With the matching text suggested above, the regexp jenna jameson would match lines 1, 2, 6 and 7. If you only have jenna as a regexp, the third line will be a match as well. To also match the last line, you can change the regexp to the following: jenna jamess?on.
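The experiment above can be roughly reproduced in Python. This sketch only covers the plain-text lines from the suggested matching text; the two URL lines are left out because the way NR groups the words of a URL is not modelled here. The helper names are made up, and the matching rule is the simple "every word, in any order" assumption.

```python
import re

def line_matches(entry: str, line: str) -> bool:
    """True if every word of the entry is found in the line, ignoring case."""
    return all(re.search(word, line, re.IGNORECASE) for word in entry.split())

# The plain-text lines from the suggested matching text (URLs left out).
lines = [
    "jenna jameson", "jameson jenna", "jenna", "jameson",
    "jen jameson", "jennas jameson", "jenna jamesson",
]

def matching(entry: str) -> list:
    """Return the lines that the entry matches."""
    return [line for line in lines if line_matches(entry, line)]

first = matching("jenna jameson")
second = matching("jenna jamess?on")
print(first)   # ['jenna jameson', 'jameson jenna', 'jennas jameson']
print(second)  # the same lines, plus 'jenna jamesson'
```

Making the second s optional is what brings the misspelled jamesson into the result.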

You can experiment like this until you're happy with the output.

Displayed as

This shows how the regexp will be shown in the report. We try to make it more readable. See Making the reports easier to read (above) for more details on how it works.

Method

Here you choose which filtering method to use. If you're using the Tokenmatch (Advanced) method, select it here as well. The difference only shows if you've inserted a URL (or several, separated by [Enter]) in the 'Matching text' box.

Add to

When you're done and happy with your brand new regexp, simply hit 'Blacklist' or 'Whitelist' to add it. Note that you'll have to save and log out for the new entries to actually be saved.

Conclusion

That's it! Feel free to play around with it, and try to develop the best regexp ever seen. ;-) If you want to share your regexps, please do. Just send us an email at responsibility-devel@lists.sourceforge.net.