Ketarin forum

Ketarin 0.9.9.22 beta


floele

What do you mean? I don't see a problem.

I've been trying to determine how many app entries this occurs with on my PC, and so far I've found three. It so happens these are the ones using the template I was perfecting late last night, which has a complex regex. The site was very difficult, and the 'anchor points' for the regex cover several lines with wildcards, which is probably a larger amount of data than Ketarin wants to highlight. Is there a maximum character or line count for the highlighting buffer? It's not a major issue for me if this is the only template I have that it affects. ;)


After further investigation, it turns out it's an issue with my regex in the template... it was picking up 2 matches and the jump went to the one other than my intended match while my intended match was highlighted properly... Oh, the elusive goal of a 'universal' regex... Sorry for the false alarm... next time I'll dig a little deeper before posting. Doooooh! :o


it's an issue with my regex in the template... it was picking up 2 matches and the jump went to the one other than my intended match while my intended match was highlighted properly...

 

This is not uncommon with RegularExpression variables (in pages with lots of code and duplicate content). Maybe we could have some sort of warning? Something like, "a duplicate match has been found, do you want to proceed?" Besides Ketarin's match and an external validation of the regex (against a given content), there's not much else a user can do.
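
The kind of warning suggested here could be sketched roughly as follows. This is only an illustration in Python, not Ketarin's code (Ketarin uses the .NET regex engine); the function name `check_single_match` and the sample page text are made up for the example.

```python
import re

def check_single_match(pattern, content):
    """Return (first_match_text, warning); warning is None when the match is unique."""
    # DOTALL lets '.' cross line breaks, so the whole page acts as one string.
    matches = list(re.finditer(pattern, content, re.DOTALL))
    if not matches:
        return None, "no match found"
    if len(matches) > 1:
        return matches[0].group(), f"{len(matches)} matches found, using the first one"
    return matches[0].group(), None

# Hypothetical page content in which the same link appears twice.
page = "<a href='app-1.0.zip'>Download</a> ... <a href='app-1.0.zip'>Mirror</a>"
value, warning = check_single_match(r"(?<=href=').*?(?=')", page)
print(value, "|", warning)
```

The first match is still the value that gets used; the warning only tells the user that the anchors were not specific enough to make it unique.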

 

I've had one experience identical to CybTekSol's that caused Ketarin to download half of the page code while attempting to replace the app name variable. Oh yes, the DB also jumped from 75KB to 168KB after this failed "update".


@CybTekSol and FranciscoR

 

By definition, there can only be one match with regex, outside of a function or other looping construct. So you could say it only returns the first match.

 

CybTekSol, I know what you meant when you wrote, "it was picking up 2 matches and the jump went to the one other than my intended match" -- but in actuality it only matched the first one based on what you told it to do :-) I know that sounds like anal semantics, but it is an important distinction to be aware of with regex.

 

FranciscoR, I'm not sure how Ketarin would be able to help? There may be additional matches beyond the one first matched with the regex, do you want Ketarin to highlight those additional matches, as if a looping construct were in effect? E.g. Keep reapplying the regex on the portion of the material left, following each successful match, and highlight it?

 

The tool I use, RegexBuddy, does that. (I'm not sure if Expresso or other tools will do this.) Although the regex itself will always return the first match (by definition), RegexBuddy will highlight in the source material any additional matches that follow. This is purely informational in case it helps with the construction of the intended regex.
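
As a rough sketch of that "keep reapplying the regex on the material left" idea: the Python below is only an illustration (it says nothing about how RegexBuddy or Ketarin work internally), and `find_all_matches` and the sample text are hypothetical names and data.

```python
import re

def find_all_matches(pattern, content):
    """Reapply the regex after each successful match and collect (start, end, text)."""
    regex = re.compile(pattern, re.DOTALL)
    found = []
    pos = 0
    while pos <= len(content):
        m = regex.search(content, pos)
        if m is None:
            break
        found.append((m.start(), m.end(), m.group()))
        # Continue after this match; step one extra character on an empty
        # match so the loop cannot get stuck in place.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found

# Hypothetical source material with two candidate matches.
sample = "file=vlc-0.9.8a.exe and again file=vlc-0.9.6.exe"
for start, end, text in find_all_matches(r"(?<=file=)\S+", sample):
    print(start, end, text)
```

The start and end offsets are what a highlighter would need; the first tuple is the match a single regex call would have returned.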

 

Or did I misunderstand what you were proposing for Ketarin to do?

 

--appyface

Edited by appyface

CybTekSol, I know what you meant when you wrote, "it was picking up 2 matches and the jump went to the one other than my intended match" -- but in actuality it only matched the first one based on what you told it to do :-) I know that sounds like anal semantics, but it is an important distinction to be aware of with regex.
I understand this completely and should have described the issue a little differently... please forgive me if I misled anyone. ;)
RegexBuddy will highlight in the source material any additional matches that follow. This is purely informational in case it helps with the construction of the intended regex.

 

Or did I misunderstand what you were proposing for Ketarin to do?

If RegexBuddy has the capability to highlight additional matches that follow, I believe that this would be VERY productive as an added feature of Ketarin. I have full faith that Florian can code it if he chooses... maybe POST 1.0 Flo?

Hmm. Not related to multiple highlighting, but to highlighting in general...

 

I'm using Ketarin 0.9.9.22 (latest upload of it, I think).

 

I have an entry with three variables defined. All three variables use a regex, and all three scrape the SAME webpage.

 

1. First variable: urlpart1

I click on the variable name, the webpage loads, and I see a dark blue highlight of what will be placed into the variable: http://ftp.snt.utwente.nl/pub/software/videolan/ This is correct.

 

2. Second variable: urlpart2

I click on the variable name, the webpage loads (it is the same page as for #1), but there is no highlight anywhere. If I use "goto match" (ctrl-g) the cursor is placed at the start of the string that I know will be scraped into the variable. But there is no indication of the actual string that will be scraped.

 

3. Third variable: vers

I click on the variable name, the webpage loads (same page as #1 and #2), but there is no highlight anywhere for this one either. Again if I use "goto match" the cursor is placed at the start of the string that I know will be scraped into the variable. But there is no indication of the actual string that will be scraped.

 

So I run the update and this is Ketarin's logfile, exactly what I would expect:

 

1/25/2009 3:54:48 PM: Update started with 1 application(s)

1/25/2009 3:54:49 PM: Replacing {urlpart1} in '{urlpart1}{urlpart2}' with 'http://ftp.snt.utwente.nl/pub/software/videolan/'

1/25/2009 3:54:51 PM: Replacing {urlpart2} in 'http://ftp.snt.utwente.nl/pub/software/videolan/{urlpart2}' with 'vlc/0.9.8a/win32/vlc-0.9.8a-win32.exe'

1/25/2009 3:54:51 PM: VLC Media Player: Checking if update is required...

1/25/2009 3:54:51 PM: VLC Media Player: Update not required

1/25/2009 3:54:52 PM: Replacing {vers} in '{vers}' with '0.9.8a'

1/25/2009 3:54:53 PM: Update finished
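
For readers following the log, the placeholder replacement it records can be imitated with a few lines of Python. This is only a sketch of the idea, not Ketarin's implementation; `expand_variables` is a made-up helper, and the values are the ones shown in the log above.

```python
import re

def expand_variables(template, values):
    """Replace each {name} placeholder whose name is known; leave others untouched."""
    def substitute(match):
        name = match.group(1)
        return values.get(name, match.group(0))
    return re.sub(r"\{(\w+)\}", substitute, template)

# Values taken from the log entries above.
values = {
    "urlpart1": "http://ftp.snt.utwente.nl/pub/software/videolan/",
    "urlpart2": "vlc/0.9.8a/win32/vlc-0.9.8a-win32.exe",
}
print(expand_variables("{urlpart1}{urlpart2}", values))
```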

 

So I am thinking the new highlighting thing is not yet working quite right? Even before asking Ketarin to highlight all possible matches in a file?

 

--appyface


@appyface,

I have noticed this behavior also and believe it only occurs when the string is longer than the display window is wide... at least that has been my experience... YMMV. ;) In my cases, it fails to go to the match, but the match (and 'anchor' characters) is still highlighted, I just have to manually scroll to it.

Edited by CybTekSol

No... that doesn't sound the same as my experience?

 

My strings fit on the screen in one line. The "goto match" command goes right to them. Everything Ketarin is doing seems accurate except that nothing is highlighted anywhere in the source material. The update log proves to me that Ketarin is getting the string as I intend, it just doesn't highlight that string.

 

I'm thinking this has to do with loading the same webpage for multiple variables... but that's just conjecture at this point. I will do some testing.

 

--appyface

 

P.S. Where in the world are you, CybTekSol? Flo has long ago gone to bed and left us here :-)


FranciscoR, I'm not sure how Ketarin would be able to help? There may be additional matches beyond the one first matched with the regex, do you want Ketarin to highlight those additional matches, as if a looping construct were in effect? E.g. Keep reapplying the regex on the portion of the material left, following each successful match, and highlight it?

 

The tool I use, RegexBuddy, does that. (I'm not sure if Expresso or other tools will do this.)

 

@appyface

Thanks for reply.

 

As for Expresso: yes, it does that. It's the tool I'm using to validate regexes prior to use (but it doesn't load source code). And it's great: today, while working on another template, I tested more than 50 file URLs and versions with it. It's really an excellent companion for Ketarin. Thanks for your advice. ;)

 

As for the problem: right now I have it fixed, but I needed to go through Ketarin's error to see there was a problem with my regex. I had to redefine the anchors and the regex itself; "to highlight additional matches" is a possibility, but I think some sort of warning would be more useful (mostly in pages with lots of code where you don't see the end of it). To be honest, I didn't know that "By definition, there can only be one match with regex", so right now I doubt my initial description, based on CybTekSol's issue, is the most accurate.

 

But my problem remains: I had a regex match in Ketarin (also validated with Expresso), the page had part of the matched content doubled (and I didn't know), so while updating this caused Ketarin to extract an A4 page of code as "app name" and save all that "garbage" into the database.

Edited by FranciscoR

@appyface,

I haven't experienced any problem with failed highlighting, thank goodness. Since you asked, I live in the lovely Commonwealth of Virginia, U.S. and it's still early evening, GMT-5 I believe... and yes Flo is probably reaching REM stage of sleep by now unless he's a night owl like me! ;)


Beautiful state! I visited long ago but enjoyed my time there.

 

The only problem for me, this is referring to "back east" in general, is humidity. I am California born and raised, and still live here. If the humidity goes over 40% in the summer I really suffer. 15% is what I'm used to!

 

And "night owl" I'm definitely not! When you see me posting about the same time as Flo, it's because I really am up at 3am. Not as in, stayed up until 3am, but got up at 3am :) Bedtime for me in a couple of hours, and it's just shy of 5pm here right now :)


LOL Yes back on topic!

 

I understand this completely and should have described the issue a little differently... please forgive me if I misled anyone. ;)

No worries! It is always good to clarify regardless of reason for need of it :-)

 

If RegexBuddy has the capability to highlight additional matches that follow, I believe that this would be VERY productive as an added feature of Ketarin. I have full faith that Florian can code it if he chooses... maybe POST 1.0 Flo?

I have found it very helpful myself. I also use EditPadPro, another nice program by the same author. Since it can use regex for find loops or for search-and-replace loops, having RegexBuddy as a companion makes for a powerful combination. I pretty much live and breathe in EditPadPro and RegexBuddy anymore. :-)

 

But as to whether Ketarin should take on this function? You'll get no argument from me about how helpful it would be to have, but as you say it will be up to Flo to decide how far he wants to take Ketarin into more of a Regex development tool.

 

If it were up to me, regardless of highlighting, etc. that Ketarin might do, I would have a preference setting in Ketarin where I could state the full path and filename of my preferred external regex tool. Ketarin could then launch RegexBuddy (or Expresso or other tool) for me, with the click of a button right from Ketarin's variable definition window!

 

Taking the idea a step further, Ketarin might also be able to pass the regex and the source material contents to the tool, if it had some knowledge of a few common tools and this was possible to do?

 

EditPadPro does this. I'll be working on a text file and writing a regex in the find pane or search-and-replace pane in EditPadPro. I click on "RegexBuddy" icon and RegexBuddy is launched for me. My current regex from EditPadPro is loaded into the regex pane and the contents of my EditPadPro editor session is loaded into the working material pane of RegexBuddy. When I close RegexBuddy my regex from there is copied back to EditPadPro.

 

Of course, because both programs are by the same author, all this can happen. I certainly wouldn't expect Ketarin to receive the new regex when I close RegexBuddy (unless the author of RegexBuddy makes that kind of thing available to 3rd party programs).

 

I also don't know if RegexBuddy (or Expresso or other tool) will accept the source material and/or regex to be passed in from 3rd party programs. But how neat if Ketarin could do that too!

 

I'm just thinking (typing) out loud again...

 

--appyface


Usually, it would not be that easy to pass regular expressions back and forth between the applications, as long as there is no public API. I don't know whether or not there is, but such features seem somewhat detached at the moment.

 

I'd be interested in the XML of the application where the highlighting fails, I can't reproduce it myself.

 

And yeah, at 1:30 I certainly had my REM phase ;)

Don't know if this forum converts the date/time display for your area.


But my problem remains: I had a regex match in Ketarin (also validated with Expresso), the page had part of the matched content doubled (and I didn't know), so while updating this caused Ketarin to extract an A4 page of code as "app name" and save all that "garbage" into the database.

I'm glad you're getting good use out of Expresso. I've never used it, being that I had RegexBuddy before I heard about it.

 

RegexBuddy 'understands' different flavors of regex engine. I set RegexBuddy to '.NET' for regex engine and then all the regex I write there should function the same way when I copy it to Ketarin. Does Expresso account for the different flavors of regex engine too?

 

So. Without seeing the actual content of the webpage or your regex, my best guess would be you experienced a 'greedy' vs. 'lazy' issue, greediness being the most common reason for 'runaway' regex match. Recall from my mini-tutorial :) the + and * and curly braces {} are greedy by default, which means they will gobble up everything in their path until what follows is the LAST time the match can be made. (or, the regex cannot match at all and fails)

 

In .NET regex the limit for + and * is the end of the webpage whenever '.' is allowed to match line breaks (the Singleline option), since the multi-line file is then treated as a single giant string. The braces limit the reach with character count, but are still greedy.

 

So if I have a regex: somestring1.*somestring2

 

And my content looks like this:

 

     somestring1djdjdkjkdsfdfkjkdsomestring2sdjdfsjfdsfddfsomestring2skjfkdfs
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 

The large string I show there with '^' characters, is what is returned by my regex, because the * is greedy.

 

Now if I change the regex to this: somestring1.*?somestring2

 

     somestring1djdjdkjkdsfdfkjkdsomestring2sdjdfsjfdsfddfsomestring2skjfkdfs
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 

The shorter string is returned for my regex, because the ? changed the * to lazy.
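
The same comparison can be run as a quick check. The sketch below uses Python's `re` module as a stand-in for the .NET engine that Ketarin uses (an assumption made for convenience; the two behave the same way for these simple quantifiers), with the exact sample string from above.

```python
import re

content = "somestring1djdjdkjkdsfdfkjkdsomestring2sdjdfsjfdsfddfsomestring2skjfkdfs"

# Greedy: .* runs as far as it can, so the match ends at the LAST somestring2.
greedy = re.search(r"somestring1.*somestring2", content).group()

# Lazy: .*? stops as early as it can, so the match ends at the FIRST somestring2.
lazy = re.search(r"somestring1.*?somestring2", content).group()

print(len(greedy), greedy)
print(len(lazy), lazy)
```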

 

If you still have the example content and regex, see if this is what happened to you?

Edited by appyface

@Flo Yes the time you posted is converted to my time, "Reply by floele Today 00:06:13"!

 

For highlighting issue, here's the XML for the entry I wrote about above, that highlights only the first variable's regex:

 

<?xml version="1.0" encoding="utf-16"?>
<Jobs>
 <ApplicationJob xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Guid="a22f26bf-ff5e-4532-9d6a-fc324c4ef0ef">
   <DownloadBeta>Default</DownloadBeta>
   <DownloadDate xsi:nil="true" />
   <CanBeShared>true</CanBeShared>
   <ShareApplication>false</ShareApplication>
   <HttpReferer />
   <Variables>
     <item>
       <key>
         <string>urlpart1</string>
       </key>
       <value>
         <UrlVariable>
           <VariableType>RegularExpression</VariableType>
           <Regex>(?<=.*?mirror=).*?(?=&file=)</Regex>
           <Url>http://www.videolan.org/vlc/download-windows.html</Url>
           <Name>urlpart1</Name>
         </UrlVariable>
       </value>
     </item>
     <item>
       <key>
         <string>urlpart2</string>
       </key>
       <value>
         <UrlVariable>
           <VariableType>RegularExpression</VariableType>
           <Regex>(?<=.*?mirror=.*?&file=).*?(?='>Download)</Regex>
           <Url>http://www.videolan.org/vlc/download-windows.html</Url>
           <Name>urlpart2</Name>
         </UrlVariable>
       </value>
     </item>
     <item>
       <key>
         <string>vers</string>
       </key>
       <value>
         <UrlVariable>
           <VariableType>RegularExpression</VariableType>
           <Regex>(?<=.*?mirror=.*?&file=vlc/).*?(?=/)</Regex>
           <Url>http://www.videolan.org/vlc/download-windows.html</Url>
           <Name>vers</Name>
         </UrlVariable>
       </value>
     </item>
   </Variables>
   <ExecuteCommand />
   <Category>000100</Category>
   <SourceType>FixedUrl</SourceType>
   <PreviousLocation>d:\Stuff\filestore\cd_DVD_Video_burners_progs_utils\vlc-0.9.8a-win32.exe</PreviousLocation>
   <DeletePreviousFile>false</DeletePreviousFile>
   <Enabled>true</Enabled>
   <FileHippoId />
   <LastUpdated>2008-12-20T12:27:29.56975</LastUpdated>
   <TargetPath>d:\Stuff\filestore\cd_DVD_Video_burners_progs_utils\</TargetPath>
   <FixedDownloadUrl>{urlpart1}{urlpart2}</FixedDownloadUrl>
   <Name>VLC Media Player</Name>
 </ApplicationJob>
</Jobs>


RegexBuddy 'understands' different flavors of regex engine. I set RegexBuddy to '.NET' for regex engine and then all the regex I write there should function the same way when I copy it to Ketarin. Does Expresso account for the different flavors of regex engine too?

Yes it does, I've been using C# (alternatives are C++ and VB, no .NET support though).

So. Without seeing the actual content of the webpage or your regex, my best guess would be you experienced a 'greedy' vs. 'lazy' issue, greediness being the most common reason for 'runaway' regex match.

Well maybe, but the highlight was very clear and I'm pretty sure it was only capturing "app name" (=red). Right now I have inserted more than 10 different regexes into Ketarin and I only experienced an issue with this one (on this specific site). I don't have it anymore (it was overwritten with a backup) but I think I will be able to reproduce the problem to help flo. I will create a new post for this as I believe this is a different issue (i.e. not the strict highlight you are discussing).

 

And yes, I've also experienced several situations where highlight isn't working as it should (using latest .22 beta). Next time I'll post XML.

 

PS.: I'm aware of that problem, I use only ".*?"

Edited by FranciscoR

There you go:

 

<?xml version="1.0" encoding="utf-16"?>
<Jobs>
 <ApplicationJob xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <DownloadBeta>Default</DownloadBeta>
   <DownloadDate xsi:nil="true" />
   <VariableChangeIndicator />
   <CanBeShared>true</CanBeShared>
   <ShareApplication>false</ShareApplication>
   <HttpReferer />
   <Variables>
     <item>
       <key>
         <string>app</string>
       </key>
       <value>
         <UrlVariable>
           <VariableType>RegularExpression</VariableType>
           <Regex>http://download.sysinternals.com/Files/(\w*?\p{P}zip|(\w*?(\d{1,2}|\p{P})\w*?\p{P}zip))</Regex>
           <Url>http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx</Url>
           <Name>app</Name>
         </UrlVariable>
       </value>
     </item>
   </Variables>
   <ExecuteCommand />
   <Category>Test</Category>
   <SourceType>FixedUrl</SourceType>
   <PreviousLocation>D:\Programas\Test\ProcessExplorer.zip</PreviousLocation>
   <DeletePreviousFile>true</DeletePreviousFile>
   <Enabled>true</Enabled>
   <FileHippoId />
   <LastUpdated>2009-01-26T15:43:15.6908866+00:00</LastUpdated>
   <TargetPath>{target}\{category}\</TargetPath>
   <FixedDownloadUrl>http://download.sysinternals.com/Files/{app}</FixedDownloadUrl>
   <Name>Test</Name>
 </ApplicationJob>
</Jobs>

 

 

To test the incorrect replacement of {app} variable (warning: don't try this without DB backup):

1. Press update.

2. After a while you should see:

 

--------------------------------------------------

Application Error

Test The operation has timed out (http://download.sysinternals.com/Files/

 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head id="ctl00_Head1"><link id="ctl00_HeaderLink1" rel="stylesheet" type="text/css" href="http://i3.technet.microsoft.com/global/global-bn2090.0.css" /><link id="ctl00_HeaderLink2" rel="stylesheet" type="text/css" href="http://i3.technet.microsoft.com/Platform/MasterPages/TechnetPage/TechnetPage-bn2090.0.css" /><meta name="ROBOTS" content="NOINDEX,NOFOLLOW" /><meta name="MN" content="74838608-7:55:55 AM" /><meta name="ms.locale" content="en-us" /><meta name="Search.ShortId" content="bb896653" />

 

(etc, etc)

--------------------------------------------------

3. For full details, check log.

 

 

To test the highlight problem (one flavour of it) on a wide string:

1. Go to {app} variable and remove the full regex. Press OK.

2. Go into {app} variable again and insert "http://download.sysinternals.com/Files/(\w*?\p{P}zip|(\w*?(\d{1,2}|\p{P})\w*?\p{P}zip))" (without "")

3. There's no highlight. Press OK.

4. Go into {app} variable again. Now you can see regex highlighted.


@FranciscoR

 

OK 'greedy' vs 'lazy' wasn't it :-) Let's see what it is. Yes if you can recreate it, a new thread is great.... hopefully we can spot it.

 

I think 'C#' in Expresso ought to be the same as '.NET', so that should be fine. Oh, RegexBuddy doesn't load webpages either :-( I wish it did. I let Ketarin load the page and then I select all and copy it to RegexBuddy.

 

As I mentioned, EditPadPro loads RegexBuddy but then that's the same author... still, I will ask that author whether that is a publicly-available API call. It would be handy if Ketarin could not only launch a preferred tool, but load it too if it knows it can :-)

 

--appyface
