Ketarin forum

An example regex and mini-tutorial -- finding multiple strings


appyface

I just wanted to comment that the most useful regular expression for me right now is one that matches a particular string, then matches a second string that marks where the variable selection starts, followed by a third string that marks the end of the variable.

 

Easiest way to see how this works is to try it.

 

This example locates a string, then continues on and locates another string, then scrapes something for a version number, based on terminating that scrape with yet another string.

 

1. Start a new download entry in Ketarin

 

2. Go to define variables and start to define one

 

3. Paste this into Contents from URL: http://raproducts.org/javara.html and click LOAD button

 

4. Paste this into Use regular expression: (?<=JavaRa Version History.*?JavaRa ).+?(?=<)

 

 

At this moment in time, you should see this snippet somewhere around line 70:

EDIT: 2009-02-20 RaProducts site is down. However, the below snippet can be pasted into any regex builder tool and this tutorial still followed, should the RaProducts site never return.

 

-------------------------------------------------------------------------------------------------------------

<h2>Windows Versions Supported</h2>

 

Currently, JavaRa supports Windows 9x, 2k, XP, and Vista without UAC.<br><br>

 

<h2>JavaRa Version History</h2>

 

[28dec08] JavaRa 1.13<br>

- [Fixed] JavaRa crashing upon not finding "JavaRa.def" file.<br>

- [Fixed] Minor typo<br><br>

 

[14dec08] JavaRa 1.12<br>

- [Added] JavaRa registry definitions file.<br>

- [Added] Program now asks Windows to delete locked folders upon reboot.<br>

-------------------------------------------------------------------------------------------------------------

 

Ketarin's variable will be loaded with "1.13", exactly what I wanted.
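
For anyone without Ketarin to hand, the same scrape can be sketched in Python. This is only an illustration under stated assumptions: Python's built-in `re` module accepts only fixed-width lookbehinds, so the sketch matches the two leading strings normally and returns the version through a capture group instead. `re.DOTALL` and `re.IGNORECASE` stand in for the behaviour described for Ketarin, and `PAGE` and `scrape_version` are names made up for this example.

```python
import re

# Abbreviated copy of the web page snippet from the post.
PAGE = """<h2>Windows Versions Supported</h2>
Currently, JavaRa supports Windows 9x, 2k, XP, and Vista without UAC.<br><br>
<h2>JavaRa Version History</h2>

[28dec08] JavaRa 1.13<br>
- [Fixed] JavaRa crashing upon not finding "JavaRa.def" file.<br>
- [Fixed] Minor typo<br><br>

[14dec08] JavaRa 1.12<br>
- [Added] JavaRa registry definitions file.<br>
"""

# Capture-group stand-in for (?<=JavaRa Version History.*?JavaRa ).+?(?=<)
PATTERN = re.compile(
    r"JavaRa Version History.*?JavaRa (.+?)(?=<)",
    re.DOTALL | re.IGNORECASE,
)

def scrape_version(page):
    """Return the text between the second marker and the next '<', or None."""
    match = PATTERN.search(page)
    return match.group(1) if match else None

version = scrape_version(PAGE)
print(version)
```

The capture group plays the role of "what goes into the variable"; everything outside the parentheses is matched but thrown away, much like the lookbehind in the Ketarin regex.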

 

 

There are excellent tutorials available on the 'net. Here's a good one: http://www.regular-expressions.info/tutorial.html

 

There are also excellent programs available to help guide you through the construction of complex regexes. Personally I use RegexBuddy (pay-for product) http://www.regexbuddy.com/ and there is also Expresso http://www.ultrapico.com/Expresso.htm which is free. I have not used it, but it gets good marks.

 

 

I'll break down the components of the above regex and explain each piece. What follows is a pretty long tutorial in itself, and yet it covers only a small part of the full power of regex.

 

(?<=JavaRa Version History.*?JavaRa ).+?(?=<)
^^^^

 

(?<= starts a "positive lookbehind" command to the regex engine. The important points to know about this command are its influence on the regex engine's "cursor" position when the match is successful, and that the match is NOT put into the result. That is exactly what we need. More on that in a bit.

 

(?<=JavaRa Version History.*?JavaRa ).+?(?=<)
    ^^^^^^^^^^^^^^^^^^^^^^

 

"JavaRa Version History" is the first literal string I'm looking to match.

 

(?<=JavaRa Version History.*?JavaRa ).+?(?=<)
                          ^^^

 

.*? Matches any characters (and I don't care what they are, including any line terminators) for an unknown length. I'll explain this kind of instruction in more detail later on.

 

(?<=JavaRa Version History.*?JavaRa ).+?(?=<)
                             ^^^^^^^^

 

"JavaRa " is the second literal string I'm looking to match on. Note that it is "JavaRa" plus a trailing space, which is what I want. The closing parenthesis ends the "positive lookbehind" command.

 

 

Now let's look at just those parts of the regex, together:

(?<=JavaRa Version History.*?JavaRa ).+?(?=<)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 

I'm telling the regex engine that a successful match consists of finding this potentially giant string:

 

1. Locating the literal string "JavaRa Version History"

2. Locating some unknown number (0 to infinity) of characters, even if they are line terminator characters

3. Locating the literal string "JavaRa "

 

Using the webpage snippet above, this is the part that the "positive lookbehind" successfully matches (without the dashed lines of course):

 

-------------------------------------------------------------------------------------------------------------

JavaRa Version History</h2>

 

[28dec08] JavaRa

-------------------------------------------------------------------------------------------------------------

 

The regex "cursor" is now poised between the END of the match and the next character following the match. AND this part of the match will NOT be given to Ketarin's variable. Both of these points are crucial to the next part, which is the actual scrape for the variable.

 

 

(?<=JavaRa Version History.*?JavaRa ).+?(?=<)
                                     ^^^

 

.+? tells the regex to match some unknown characters for unknown length. Now we are actually grabbing these characters to return into our Ketarin variable, not just looking for them. (The use of the '+' instead of '*' changes the matching slightly, more on that later.) The regex engine's "cursor" will be poised at the end of whatever these characters are. We don't know just where the end is located -- yet.

 

(?<=JavaRa Version History.*?JavaRa ).+?(?=<)
                                        ^^^^^

 

(?= is a "positive lookahead" command to the regex engine.

< is a literal string to be matched - we're looking for "<"

) closing parenthesis ends the "positive lookahead" command

 

Like the "positive lookbehind" we used in the beginning, nothing from a successful match is included in the result. The difference in this command is the cursor position. A successful match BEGINS wherever the regex "cursor" is right now. We don't know just where that cursor is, because our .+? command is still gobbling up some unknown number of characters for Ketarin's variable. However, the regex "cursor" position is advancing with each character grabbed. At some point that position will start a successful match of our "positive lookahead" string. That is how we are able to mark the end of the unknown number of characters put into Ketarin's variable.
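
The "nothing from the lookaround goes into the result" point can be checked with a small Python sketch. It assumes a single line of the page as input and uses a short, fixed-width lookbehind, because Python's built-in `re` module does not allow the variable-length lookbehind used in the Ketarin regex. `LINE` is just a name chosen for this example.

```python
import re

LINE = "[28dec08] JavaRa 1.13<br>"

# Fixed-width lookbehind and lookahead around the text we want.
match = re.search(r"(?<=JavaRa ).+?(?=<)", LINE)

# Only the text between the two assertions is part of the match.
print(match.group(0))

# The characters just before the match are the lookbehind text...
print(repr(LINE[match.start() - 7:match.start()]))

# ...and the "<" sits right after the match, still unconsumed.
print(repr(LINE[match.end()]))
```

Both assertions were tested against the text, yet neither "JavaRa " nor "<" ends up inside the matched span.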

 

To sum up, you can imagine the regex looks like this:

 

<-------------------{vers}------------------->

 

Where there is something we're looking for, just before the version number, and there is something we're looking for, just after the version number :-)

 

 

Some notes regarding the use of .*? and .+? in the above regex.

 

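The '.' (dot) is a wildcard that matches any character. In Ketarin it also matches line terminators, because (as comes up later in this thread) Ketarin runs the .NET engine with the Singleline option switched on. That is how you get the regex to continue through the webpage as if it were one giant string, instead of stopping at the end of each line, which is how the dot behaves by default in most regex tools.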

 

The '*' (asterisk) says the '.' matches an unknown quantity, from -0- to infinity. A '+' (plus) character says the '.' must match at least ONE character, to infinity.
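
A small Python sketch of the difference between '*' and '+', using made-up one-line inputs. A capture group is used in place of the lookarounds to keep the example short, and `grab` is a hypothetical helper, not anything from Ketarin.

```python
import re

star = re.compile(r"JavaRa (.*?)<")  # zero or more characters
plus = re.compile(r"JavaRa (.+?)<")  # one or more characters

def grab(pattern, text):
    """Return the captured text, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None

for text in ("JavaRa 1.13<br>", "JavaRa <br>"):
    print(repr(text), repr(grab(star, text)), repr(grab(plus, text)))
```

When there is something between the two markers, both forms return it. When the "<" follows the marker immediately, the '*' form succeeds with an empty result, while the '+' form finds no match at all.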

 

Which one to use? In our example webpage here, it actually doesn't matter, we can use .* in both places or .+ in both places of the regex, as they both return our desired result because of what is in the webpage. The heart of creating a regex is taking into account these subtle differences, so that you are assured of getting a successful match only on the part of the webpage that is what you are looking for, and do not get a match anywhere before that point. (Explaining further is beyond the scope of this example.)

 

The '?' is an optional modifier of the .* and .+ commands. Without the '?' modifier, the .* and .+ are called "greedy"; adding the optional '?' makes them "lazy". A thorough explanation is again beyond the scope of this tutorial, but I can give an example by using our same webpage snippet and changing our regex to be "greedy" instead of "lazy".

 

Here's the snippet again:

 

-------------------------------------------------------------------------------------------------------------

<h2>Windows Versions Supported</h2>

 

Currently, JavaRa supports Windows 9x, 2k, XP, and Vista without UAC.<br><br>

 

<h2>JavaRa Version History</h2>

 

[28dec08] JavaRa 1.13<br>

- [Fixed] JavaRa crashing upon not finding "JavaRa.def" file.<br>

- [Fixed] Minor typo<br><br>

 

[14dec08] JavaRa 1.12<br>

- [Added] JavaRa registry definitions file.<br>

- [Added] Program now asks Windows to delete locked folders upon reboot.<br>

-------------------------------------------------------------------------------------------------------------

 

Here's our regex again, but this time we let the '.+' be GREEDY instead of LAZY (the '?' is gone).

 

(?<=JavaRa Version History.*?JavaRa ).+(?=<)

 

Try it. Instead of picking up "1.13" for our variable, Ketarin gets everything starting from this:

1.13<br>

 

...all the way up to the </html> closing tag at the very bottom of the webpage. The match still starts right after the first "JavaRa " that follows the heading; what changes is where it ends. Obviously not what we wanted!

 

So you could describe the behavior like this: "lazy" makes the .* or .+ grab the fewest number of characters and still be able to complete the rest of the regex. The default behavior of "greedy", however, is to grab as many characters as possible -- even if some of them match the rest of the regex -- and keep going until it reaches the last possible match that completes the regex successfully.

 

In this regex our scrape into Ketarin's variable was intended to be stopped by the first appearance of a "<" character. Using "lazy" gets us that. Without requesting "lazy", the default behavior of "greedy" keeps advancing the cursor position until the very last "<" match is found. That might be what you want to do on a different webpage, but on this one it does not give us the result we desire.
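
The lazy and greedy variants can be compared side by side in Python. As a sketch only: the page text is shortened and given an invented closing `</body></html>` line, and a capture group replaces the variable-length lookbehind, which Python's built-in `re` module does not support.

```python
import re

PAGE = """<h2>JavaRa Version History</h2>

[28dec08] JavaRa 1.13<br>
- [Fixed] Minor typo<br><br>

[14dec08] JavaRa 1.12<br>
- [Added] JavaRa registry definitions file.<br>
</body></html>
"""

PREFIX = r"JavaRa Version History.*?JavaRa "

# Lazy: stop at the first "<" that lets the pattern complete.
lazy = re.search(PREFIX + r"(.+?)(?=<)", PAGE, re.DOTALL).group(1)

# Greedy: run to the end, then back up to the last "<" on the page.
greedy = re.search(PREFIX + r"(.+)(?=<)", PAGE, re.DOTALL).group(1)

print(repr(lazy))
print(len(greedy), "characters, ending in", repr(greedy[-7:]))
```

The only change between the two patterns is the single '?', yet the greedy capture runs on until the last "<" in the text.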

 

I hope the above mini-tutorial is helpful.

 

--appyface

 

 

NOTE: Not all 'flavors' of regex engine behave the same way. So I tend to preserve "case" as a habit in my regex's, but Ketarin's regex engine (the .NET regex engine) is not case-sensitive by default. Also, as mentioned earlier, if the '.' (dot) wildcard is used, the .NET's regex engine will search right through any line terminators (carriage return, linefeed, etc.) This effectively makes the entire web page one giant string to search, which is very handy for our purposes. Keep this in mind when building searches in Ketarin, as other types of regex patterns will only be successful if the match is found between the start and end of a single line.

Edited by appyface

@Stalker -- great links, thanks for adding them!

 

@CybTekSol, MadDog -- Thank you for letting me know this was useful to you :-)

 

@kwe -- Yes regex can be VERY intimidating. The same symbols mean different things in different contexts, they behave differently depending on WHAT regex engine you're using... the list goes on and it is a long one.

 

But it really isn't as cryptic as it first appears! If I can give newbies any particular points, it would be to pay attention to two basics, which are illustrated in the above tutorial:

 

  1. When this piece of the regex finishes/stops, are successfully matched characters returned to me? Or is this piece of the regex just an "assertion" *** ?
  2. When this piece of the regex finishes/stops, where is the regex "cursor"? Does this piece of the regex move the cursor position at all? Where does the cursor end up when the match is successful? Unsuccessful?

 

(***An assertion is a zero-width expression, meaning it gives back no characters in the result -- the result is always "zero width" or 'empty'. Our 'lookarounds' in this tutorial are assertions. If their match is successful, the matching characters themselves are discarded, and the regex continues on to do whatever it is supposed to do when it gets a successful match. If unsuccessful, it's the same as any other command, the regex does whatever it is supposed to do next, when there is no match.)

 

 

Spend the time to closely examine any regex you're working with, and see how the above two basics apply. This is the best investment you can make in yourself when it comes to learning regex!

 

And the best way to learn regex is to just jump in. Start with working examples like the one in this tutorial (it works as long as the author of JavaRa doesn't change his webpage much).

 

Watch it work. Then use the regex on your own webpage scrapes. You'll be changing just the three search strings to adapt it and it will work for you. But then one day it won't work! "Wha? It always worked and I changed it the same way I have all the others..."

 

The most likely reason is that the source webpage just doesn't play by the same rules. Then the regex needs further alteration to give you what you want.

 

A common variation I use, is to limit the 'reach' of the wildcard commands. In my above tutorial we used .* (0 to infinity) and .+ (1 to infinity). Instead we could substitute a maximum number of characters for 'infinity'. The command .{0,400} behaves the same as .* except that it goes from 0 to 400 characters only. The command .{1,350} behaves the same as .+ except that it goes from 1 to 350.

 

You can also use something like .{57,213} -- which means a successful match starts at least 57 characters forward from "here" ("here" being wherever the regex "cursor" is at that moment), and cannot be farther away than 213 characters.

 

We can add the optional '?' to these as well, making them "lazy" instead of the default "greedy". "lazy" will make the regex take the FIRST successful match within the limits provided, whereas "greedy" will make the regex take the LAST possible match within the limits provided.

 

This is very useful. Consider that regex lets you pick "first" (lazy) or "last" (greedy). What about the second match? Or the fifth match of six? Perhaps limiting the minimum and/or maximum "reach" and using "lazy" or "greedy", is what will work. By limiting the reach you can change what part of the material contains the first match, or what part of the material contains the last match...
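
Here is a hedged Python sketch of limiting the "reach". The two-entry page text and the particular limits (40, 10, 20 to 60) are invented for the illustration, `scrape` is a hypothetical helper, and a capture group again stands in for the lookbehind.

```python
import re

PAGE = (
    "<h2>JavaRa Version History</h2>\n\n"
    "[28dec08] JavaRa 1.13<br>\n"
    "[14dec08] JavaRa 1.12<br>\n"
)

def scrape(reach):
    """Scrape with a bounded, lazy wildcard such as '.{0,40}?' in place of '.*?'."""
    pattern = r"JavaRa Version History" + reach + r"JavaRa (.+?)(?=<)"
    match = re.search(pattern, PAGE, re.DOTALL)
    return match.group(1) if match else None

print(scrape(r".{0,40}?"))   # reach is long enough for the first entry
print(scrape(r".{0,10}?"))   # reach is too short, so nothing matches
print(scrape(r".{20,60}?"))  # minimum reach skips past the first entry
```

In this text the first "JavaRa " sits 17 characters after the heading string and the second one 43 characters after it, which is why a minimum of 20 steps over the first entry and lands on the second.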

 

There are lots of Character Classes in regex. The '.' we have been working with here matches EVERYTHING including line terminators (in the .NET regex engine).

 

But maybe we only want to consider numeric digits, or maybe only capital letters? Then you use one of the character classes instead of the '.' wildcard (\d for digits, or [A-Z] for capital letters). For example, the character class \s can be substituted for the '.' and then the matches only look for 'whitespace' characters. 'Whitespace' is defined as being any of: space character, tab character, carriage return, line feed, form feed, vertical tab.
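
A short Python sketch of swapping the '.' for a character class. The input strings are made up for the example, and the fixed-width lookbehind is used only because it is short enough for Python's built-in `re` module to accept.

```python
import re

LINE = "[28dec08] JavaRa 1.13<br>"

# [\d.]+ accepts only digits and dots, so it stops by itself at the "<".
version = re.search(r"(?<=JavaRa )[\d.]+", LINE).group(0)

# \s+ matches runs of whitespace only; here each run is replaced by one space.
tidy = re.sub(r"\s+", " ", "JavaRa \t 1.13\r\n")

print(version)
print(repr(tidy))
```

With a restrictive class like `[\d.]` no closing lookahead is needed on this line, because the class itself cannot run past the version number.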

 

Here's an example of a regex that has been restricted, perhaps not in a good way. See this post:

http://ketarin.canneverbe.com/forum/viewtopic.php?id=21

 

Load the target webpage and use the regex I gave. What does it do? Alter the regex a little bit. What does it do?

 

I did not spend a lot of time examining that webpage and building the regex I offered ttheil. It might be possible to make the regex a bit more 'generic' -- that is, one that doesn't rely so heavily on finding some very specific strings within specific limits, to come up with our intended scrape.

 

Sometimes a more-generic regex doesn't work so well. When you first write it, it works for awhile. Then the author of the webpage alters the content such that our regex starts matching on something other than what we want. We have to make the regex "less generic" in some fashion, to force it to pass up the new match and continue on to our desired target.

 

Similarly, a less-generic regex may stop working. If the author of the webpage alters the content in such a way that our limits are incorrect, our regex may never work again until we relax the limits, or substitute different limits for what we had been using.

 

A regex is only as good as the rules it expects the webpage to play by. When the rules change, the regex might need to change too :-)

 

--appyface

Edited by appyface

  • 1 month later...
I hope the above mini-tutorial is helpful.

 

It certainly is. Regular expressions are to me the equivalent of 'practising the dark arts', but after looking at the templates CybTekSol & others have created I went off and downloaded a few 'free' apps, in particular Expresso, and have a firm intention to learn more. And then you, appyface, post one of the most comprehensive threads I have ever read on the subject.

 

It will still take me some time but thanks for taking the time & effort to help me peer into the abyss lol

 

BTW I suggest you do take a look at Expresso, it does eventually require registration (free) for a key

 

some others I found were by a chap named Roy Osherove

 

these are written in asp.NET (I think) no install required and some source is available

 

Beginners (like me) should look at: Regulazy (2nd link) and once you start the app click the 'huh?' button for a nice little screencast demo

 

Advanced look at: Regulator (1st link)

 

http://weblogs.asp.net/rosherove/pages/tools-and-frameworks-by-roy-osherove.aspx

 

for those who need some 'visual' guidance I also found the 'ReAnimator' @ appyface me thinks you'll enjoy this lots lol

 

Flash plugin required:

 

http://osteele.com/tools/reanimator/

Edited by somerandomhash

appyface wrote: "I hope the above mini-tutorial is helpful."

somerandomhash wrote: "It certainly is"

I'm glad to hear this was useful to you, thank you for letting me know :)

I do have Expresso (and a key), but I already owned and used heavily RegexBuddy (pay-for product by JGSoft) long before I had heard of Expresso. Expresso gets high marks, so I think it must be pretty useful.

 

Regex can do so much more than what I've posted here. The author of RegexBuddy has a very nice tutorial and manual on different flavors of regex, here:

http://www.regular-expressions.info/

 

Just diving in, as you have done, is really the best way to learn IMO. But examples sure don't hurt :) and the link above will give you some good ones to practice with.

 

Best regards,

--appyface


  • 9 months later...

First off, great tutorial – thank you for taking the time to put this together and share, appyface.

 

NOTE: Not all 'flavors' of regex engine behave the same way. So I tend to preserve "case" as a habit in my regex's, but Ketarin's regex engine (the .NET regex engine) is not case-sensitive by default. Also, as mentioned earlier, if the '.' (dot) wildcard is used, the .NET's regex engine will search right through any line terminators (carriage return, linefeed, etc.) This effectively makes the entire web page one giant string to search, which is very handy for our purposes. Keep this in mind when building searches in Ketarin, as other types of regex patterns will only be successful if the match is found between the start and end of a single line.

 

This has got me confused... I understand what you’re saying, but according to the .NET Framework Developer's Guide Regular Expression Language Elements, the dot (“.”) wildcard matches any single character except \n, where \n matches a new line. The only reason I care is that these are the rules used by Expresso, so your sample doesn’t work in Expresso, even though it works in Ketarin. This suggests that Ketarin doesn’t use the .NET regex engine. Am I missing something here?


Ah, that explains it - thanks Flo.

 

Does anyone know of a way to change this behavior in Expresso? Expresso seems like a sweet tool for testing regular expressions, but I need a tool that behaves the same way as Ketarin. Alternatively, are there other free tools that behave the same as Ketarin?


Thanks for the suggestions, andreone and appyface.

 

I found the solution. If anyone else is using Expresso - click the Design Mode tab, then right down the bottom, check the "Single Line" checkbox.
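
The effect of that checkbox can be reproduced in other engines too. As a rough sketch, Python's `re.DOTALL` flag is the counterpart of the .NET Singleline option; the text and pattern below are shortened from the tutorial, with a capture group in place of the lookbehind because Python's built-in `re` module only accepts fixed-width lookbehinds.

```python
import re

TEXT = "JavaRa Version History</h2>\n\n[28dec08] JavaRa 1.13<br>"
PATTERN = r"JavaRa Version History.*?JavaRa (.+?)(?=<)"

# Default: '.' stops at "\n", so the pattern cannot cross the line break.
without_flag = re.search(PATTERN, TEXT)
print(without_flag)

# DOTALL: '.' also matches "\n", so the whole text acts as one long line.
with_flag = re.search(PATTERN, TEXT, re.DOTALL)
print(with_flag.group(1))
```

This is the same mismatch described above: a pattern that relies on the dot crossing line breaks only works in a tool where that option is switched on.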

 

Flo, looks like there are 10 .NET regex options: Singleline, IgnoreCase, IgnorePatternWhitespace, Multiline etc. How are these set in Ketarin? I know Singleline is set to On, but what about the others? I want to set Expresso to use the same rules as Ketarin...

 

Thanks!

