Ketarin forum

Request for quick Regex


Omniferum

Howdy, I've spent the last day trying to get my head around regex, and while I can do simple matching like defining the start and end of a string (even that I screw up), my ultimate goal is just 20% more complicated.

 

Essentially I want a regex expression that I can specify as such

 

Beginning of string is href= or http

the final match must find a string containing a certain specified word i.e. x64 or slim or ubuntu

The end of string to be a file extension like .exe or something like ?download

 

Also I assume there must be a way to stop it from capturing the ENTIRETY of text i.e. a character limiter? Like exclude matches that exceed 50 characters? so the final match can only be a certain number of characters long.

 

I feel that having this would be fine for a large portion of download links as to date all my programs are either href= then a partial url which I just pipe to my fixed url for a complete one or an ENTIRE download link.

 

There may be more but this is the start of thing for me :P I've tried finding similar regex expressions around but my head keeps exploding as i've sort of gone ketarin bonkers the last couple of days.

 

I have the download links sorted (I just feel that regex could perhaps streamline it better and make it more uniform)

The execute command after download is all sorted, that took a little while as I had to identify what each program needed. Some had silent/auto install switches, uniextract or 7-zip and to move programs/clean up extraneous files/folders.

 

I'm just in the making everything tidy phase, any help would be greatly appreciated. Happy to call myself useless and needing a hand :P

Link to comment
Share on other sites

You can play with these to see if it can be tweaked for your needs:

 

(?:href=("|'))([a-z]+://.*?\.(?:exe|7z|zip|zip2|bz|bz2|bzip|gz|gzip|jar|lha|lzh|lzw|pak|rar|sit|sit!|sit5|sitd|sithqx|sitx|tz|wsz|cab|msi|bin|img|iso|xpi|pbp))(?:"|')

 

(?:href=)(?:"|')(((\w+:\/\/([\w@][\w.:@]+))?\/?[\w\s_\.?=%&=\-@/$,]*\.[a-z]{2,3}))(?:"|')

 

You can place what I call anchor variables that you define within the regex to limit their capture, i.e. :

 

{anchor1} = 'string that always occurs on app page'

 

{anchor2} = 'another string'

 

{anchor3} = 'possibly a preferred file extension instead of the choices above'

 

Note: I highly recommend 'Expresso', for testing and tweaking regular expressions... just search Expresso here in the forum for more info.

Link to comment
Share on other sites

Well I left it be for a few days, general frustration.

 

Came back, and seeing as I know nothing of regex I can't really decipher what those expressions do, and with the URLs I've given them the results aren't very consistent.

 

The best i've been able to concoct is

 

http://.*?\.exe

 

So essentially i've got the start of the string, the middle is allowed to be whatever length and contain anything but must end at .exe. But this just gets me the first match all the way to exe, hence the need for character limiting.

So that's... a start. I feel retarded at going at this sort of pace but i'll get there eventually.

 

So now I just need to character limit it first. I know the quantifier is {1,50}, which lets the part before it repeat anywhere from 1 to 50 times, so anything longer is excluded. I still don't know how to make the middle string include a special word though.

 

I appreciate the ones given but if I can't tell what they do specifically then i'm sort of at a loss. As I said, this is my first time with Regex and really the only thing i'll use it for.
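
For anyone following along, here is a minimal sketch of why the lazy version overshoots and what a bounded quantifier changes. It uses Python's re module rather than Ketarin's own regex engine, and the sample HTML and example.com URLs are made up for illustration.

```python
import re

# Hypothetical page source: the first link is not an .exe, the second one is.
html = (
    '<a href="http://example.com/readme.html">doc</a> '
    '<a href="http://example.com/app-x64.exe">download</a>'
)

# Lazy .*? still starts at the FIRST http:// and runs on to the first .exe.
lazy = re.search(r"http://.*?\.exe", html).group(0)

# Allowing at most 50 characters in the middle makes the first start fail,
# so the engine moves on and matches from the second http:// instead.
bounded = re.search(r"http://.{1,50}?\.exe", html).group(0)

print(lazy)
print(bounded)
```

The length limit only helps when the unwanted stretch happens to be longer than the limit, so it is a blunt tool compared with excluding the characters a URL cannot contain.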

Link to comment
Share on other sites

Well now that i've creamed my panties oh so very slightly.

 

Someone now inform me how the bloody hell do I donate to Wonderful Mr. Floele?

 

I've never donated anything before to any program's creator but seriously i'm off the walls with how much I love this program. Having global user defined variables and just a lot of extra stuff makes this a brilliant program, i'm happy I found it.

 

Someone tell me, pwetty pwease.

 

Only thing left is trying to figure out why, even when I give Ketarin a direct link to sourceforge firefoxportable and thunderbirdportable, all I get is a junk exe file.

Link to comment
Share on other sites

Oh, for anyone else that wants to use this for future use.

 

This is the expression

 

http://[^'"]+bonk[^'"]+\.exe

 

 

The Breakdown

 

http:// = plaintext. Type in plaintext whatever you want the start of the string to be

 

[^'"]+bonk[^'"]+ = String must also contain this keyword. Replace bonk with whatever identifying text you need, i.e. x64

 

\.exe = End of string, This is also just plaintext. The \ is only there to make the . behave as straight text, otherwise it is a special regex character.

 

This works for ALL my apps, tested and enjoyed. Some don't have full http: paths so I just use what it gives me partial and add the rest manually. Some use href, others are just php links etc. etc. etc. but seriously this little bit of code identifies exactly the string I want every single time brilliantly. To the point it should be included as a default type thing, works better than the Content from URL (Start/End)
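
A rough sketch of the same expression outside Ketarin, in case it helps to test it: this assumes Python's re module, and link_pattern plus the example.com links are hypothetical names invented for the demo.

```python
import re

def link_pattern(start, keyword, ending):
    """Build start + [^'"]+ + keyword + [^'"]+ + ending, escaping the literals."""
    part = r"""[^'"]+"""  # one or more characters that are not a quote
    return re.compile(
        re.escape(start) + part + re.escape(keyword) + part + re.escape(ending)
    )

html = (
    '<a href="http://example.com/dl/tool-x86-setup.exe">32-bit</a> '
    "<a href='http://example.com/dl/tool-x64-setup.exe'>64-bit</a>"
)

# The x86 link is skipped because the match cannot cross a quote to reach x64.
match = link_pattern("http://", "x64", ".exe").search(html)
print(match.group(0))

# Because of the second +, at least one character must sit between the
# keyword and the ending, so a name like tool-x64.exe is not matched.
no_gap = link_pattern("http://", "x64", ".exe").search(
    '"http://example.com/tool-x64.exe"'
)
print(no_gap)
```

If the keyword can sit directly in front of the extension, changing the second + to * (zero or more) should cover that case as well.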

Edited by Omniferum
Link to comment
Share on other sites

Note that you'll also need to escape (that's what putting the "\" in front of the . and other special characters is called) parentheses, dots, question marks and any literal slashes, whenever you're trying to treat these characters as literal characters within a RegEx statement.
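
As a small illustration of what escaping buys you, here is a sketch using Python's re module (not Ketarin itself); the "setup.exe?download" ending is just an assumed example based on the first post.

```python
import re

ending = "setup.exe?download"

# re.escape puts a backslash in front of the special characters.
escaped = re.escape(ending)
print(escaped)

# Unescaped, the . means "any character" and the ? makes the e optional,
# so the pattern no longer matches its own literal text...
unescaped_hit = re.search(ending, "setup.exe?download")
# ...but it does match text that was never intended.
accidental_hit = re.search(ending, "setupXexdownload")

# Escaped, it matches the literal text and nothing else.
escaped_hit = re.search(escaped, "setup.exe?download")

print(unescaped_hit, accidental_hit, escaped_hit)
```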

 

We really need to see the XML file for the apps that aren't working (export them to XML format, then copy the contents into a post here between 'code' tags) in order to be able to figure out what's going wrong. With Sourceforge, the URL pattern is ABSOLUTELY important, and it MUST have a "spoofed referer" on the "advanced settings" tab.

Link to comment
Share on other sites

Because I love shawn so much!

 

Not that what I just said was a non sequitur, or this sentence either.

 

Anyway, What would I have to do to have multiple middle string matches? I tried just adding an extra + but it just went "I don't find shit!"

 

If it is too extra complicated doesn't matter, I don't actually 'need' it. More just future proofing for some programs that specify both language and 32/64 bit in the filename i.e. CPU-Z.

 

If I don't get the hand s'ok, I still love you.

Link to comment
Share on other sites

The specific string that I use to say "part of a URL goes here" is:

[^'"]+

 

If you want multiple matches, you need to duplicate ALL of that between each matched portion, as so:

[^'"]+64[^'"]+EN[^'"]+

 

That would look for something that looks like this:

stuffhere64morestuffENmorestuff

 

Follow?
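
In case it helps, a minimal sketch of building that chained pattern for any number of keywords. It assumes Python's re module, and ordered_keywords is a made-up helper name. Note that the keywords have to appear in the order given.

```python
import re

def ordered_keywords(*keywords):
    """Put [^'"]+ before, between and after the keywords, in the given order."""
    part = r"""[^'"]+"""
    return re.compile(part + part.join(re.escape(k) for k in keywords) + part)

pattern = ordered_keywords("64", "EN")
print(pattern.pattern)

# Matches when 64 comes before EN, with other text around both.
in_order = pattern.fullmatch("stuffhere64morestuffENmorestuff")
# Does not match when the keywords appear the other way round.
swapped = pattern.fullmatch("stuffhereENmorestuff64morestuff")

print(bool(in_order), bool(swapped))
```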

Link to comment
Share on other sites

Oh, and the point of using that specific string is what it means. The brackets mean "match the stuff inside here". The caret inside as the first character reverses its functionality, effectively turning it into "match everything EXCEPT what's inside here". The use of an apostrophe and a quote inside means that it's going to grab everything that matches without getting outside of the current attribute value, and since anchors in HTML are supposed to have quotes or apostrophes around the href (url) portion, it should only grab a continuous string - the URL. The + means "one or more of the characters (or group) before me", so grab all consecutive non-apostrophe and non-quote characters.

 

On some sites (particularly those with crappy HTML that doesn't include quotes around URLs), I convert it to this:

[^'"\s<>]+

 

That adds whitespace and angle brackets to the exclusions, so the URL will be picked up correctly even if it's not properly wrapped. Be careful though: this will work on some sites and not on others, since the stupid developers actually put a literal space or angle bracket in the URL instead of encoding it properly (as %20, %3c, %3e).
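
A quick sketch of the difference on an unquoted link, assuming Python's re module and an invented example.com snippet:

```python
import re

# Hypothetical sloppy HTML: no quotes around the href value.
html = "<a href=http://example.com/app.exe>Download app.exe</a>"

# With only quotes excluded, nothing stops the match at the >,
# so it runs on to the last .exe in the unquoted text.
loose = re.search(r"""http://[^'"]+\.exe""", html).group(0)

# Also excluding whitespace and angle brackets stops at the end of the URL.
strict = re.search(r"""http://[^'"\s<>]+\.exe""", html).group(0)

print(loose)
print(strict)
```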

Link to comment
Share on other sites

Add another to that Shawn... SF uses ; instead of ' or " to envelop URLs... They changed it months ago which forced me to alter my template and do a global search & replace on my xml. Stuff like this is why I use any other possible source. By contrast, I developed a SnapFiles template 18 months ago and have not had to make a single tweak that I can remember! So [^"';\s<>]+ to include SF's method... I don't use the above methods much as I like the regex to be more selective limiting the capture to only one possible URL per page on complex sites such as SF and others. Just my preference.

 

Correction: They added it (;) some months ago and just to some apps such as FileZilla... just another example of the SF inconsistencies that make a 100% successful template a pipe dream IMHO.

Link to comment
Share on other sites

Thanks, CybTekSol. I've only noticed the ";" and "&" splits on certain file links in the source within "/files/". You can avoid them in every instance I've found (including FileZilla), since the same URLs appear elsewhere *without* double-encoding of some characters.

 

I've created an SF template, which I'll be posting in the Templates section after I test it with a couple more apps tonight.

Link to comment
Share on other sites
