Ketarin forum

Search for beautiful Regex (to help newbies)



Edited to clean

 

Windows programs mainly come in two groups of file extensions, plus a third group of extras.

 

Binaries - rar/zip/7z (These only need to be unzipped to be installed)

Executables - exe/msi (These can be extracted using the program Universal Extractor, installed manually, or installed silently with switches/parameters)

Extras/Addons - Things like greasemonkey userscripts that are .user.js or thunderbird/firefox addons which are .xpi

 

I use only one regex, and it works for EVERY app I have. I edit it in just two ways for each app. The first is to change the last three letters of the regex to the file extension I actually want to find. So from this

 

[^"'=]+\.zip

 

if I want to find an exe I just do

 

[^"'=]+\.exe

 

The \. tells the regex it has to find a literal period character. If you just put .zip, the dot matches any character in front of zip, which you don't want. The regex could then stop at godilovezip in a link like godilovezip.html, which obviously isn't a download and will return an error.
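
To see the difference the backslash makes, here is a small sketch in Python. Ketarin has its own regex engine, which may differ in details, and the HTML snippet is made up for illustration.

```python
import re

# Made-up snippet: the first link merely has "zip" inside its name.
html = '<a href="godilovezip.html">About</a> <a href="files/tool-1.2.zip">Download</a>'

escaped = re.search(r'''[^"'=]+\.zip''', html)    # \. must be a real period
unescaped = re.search(r'''[^"'=]+.zip''', html)   # . accepts any character

print(escaped.group(0))    # the actual archive link
print(unescaped.group(0))  # stops early inside the first link
```

With the escaped dot the match is the archive link; without it the regex settles for the "zip" inside the page name.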

 

 

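The second way is adding keywords. To find specific words in the URL (like x32 or x64 builds), all you need to do is add the text after the first + sign, followed by [^"']+. To find multiple keywords, just keep adding each keyword followed by [^"']+. Keep in mind it is sequential: the keywords have to appear in the URL in the same order as in the regex.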

 

[^"'=]+64[^"']+\.zip

 

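Keep in mind that if the 64 comes directly before the file extension the regex won't match, because [^"']+ needs at least one character between the keyword and the extension. Just remove a character from the keyword and you'll be fine. So the above would turn into [^"'=]+6[^"']+\.zip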
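
A quick way to check the keyword behaviour is a sketch like the following in Python. The two links are hypothetical, and the third pattern (using * instead of +) is my own assumption rather than something from this post.

```python
import re

# Hypothetical links, made up for illustration.
with_gap = 'href="files/tool-x64-1.2.zip"'   # something between 64 and .zip
no_gap = 'href="files/tool-1.2-x64.zip"'     # 64 sits right before .zip

keyword = re.compile(r'''[^"'=]+64[^"']+\.zip''')
shortened = re.compile(r'''[^"'=]+6[^"']+\.zip''')
# Assumption, not from the post: * instead of + allows zero characters.
optional_gap = re.compile(r'''[^"'=]+64[^"']*\.zip''')

def found(pattern, text):
    """Return the matched text, or None when the pattern finds nothing."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(found(keyword, with_gap))      # matches the link
print(found(keyword, no_gap))        # nothing: no character after 64
print(found(shortened, no_gap))      # matches once the keyword is cut to 6
print(found(optional_gap, no_gap))   # also matches
```

The shortened keyword is looser than it looks: a bare 6 also matches links such as version 1.6, so it is worth checking what else on the page contains that character.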

 

If your download page lists the latest release from the bottom of the page instead of the top, just enclose your regex with .*( at the beginning and ) at the end. So it would end up looking like

 

.*([^"'=]+64[^"']+\.zip)
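
One thing worth testing before relying on the wrapped form: in Python's re module, at least, the greedy .* hands the group only the shortest tail that still matches, so the captured link can come back cut short. The sketch below uses a made-up page, and the variant that puts the opening quote in front of the group is my own assumption, not part of this post. I have not verified how Ketarin's engine and options behave here.

```python
import re

# Made-up page with the oldest release first and the newest last.
page = '''<a href="http://example.com/tool64-1.0.zip">1.0</a>
<a href="http://example.com/tool64-1.1.zip">1.1</a>'''

link = r'''[^"'=]+64[^"']+\.zip'''

# Plain pattern: the first link on the page wins.
first = re.search(link, page).group(0)

# Greedy prefix as written above; DOTALL lets "." cross line breaks.
wrapped = re.search(r'.*(' + link + r')', page, re.DOTALL).group(1)

# Assumed variant: the group has to start right after an opening quote.
anchored = re.search(r'.*"(' + link + r')', page, re.DOTALL).group(1)

print(first)
print(wrapped)
print(anchored)
```

In this sketch the wrapped pattern does land on the last link, but only the anchored variant returns it in full.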

 

I have yet to find a way to 'exclude' specific words, more specifically words like source or src as sometimes the first match is a source file which I have no use for. I've been asking on a few forums but sometimes I need to be spoonfed then hit on the head.

 

For those instances where I have run into the source/src problem I just add extra inclusion words that aren't in the found source link.

 

Anything and everything I've gleaned about regex came from asking the people in this forum, who were kind enough to give me stuff to fiddle with and chew on. I would not say I have a good grasp of regex, just that this 'functions' and is the simplest one I've seen so far. Seeing as our primary purpose for regex is to find the download link, I believe this fits the bill.

 

 

SPECIAL CIRCUMSTANCES

 

There may be a time where that regex doesn't accurately capture download links that have an equal sign in them. To get that as well you only need to do the following.

 

[^"'=]+=[^"']+\.zip

 

Another helpful hint: if you find it is capturing some weird part of the page, try adding / instead, so it looks for something that has a folder structure.

 

[^"'=]+/[^"']+\.zip
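
Both special cases can be tried out with a sketch like this one in Python. The two snippets are invented examples of the kind of page that trips up the basic regex.

```python
import re

basic = re.compile(r'''[^"'=]+\.zip''')
with_equals = re.compile(r'''[^"'=]+=[^"']+\.zip''')
with_slash = re.compile(r'''[^"'=]+/[^"']+\.zip''')

# Invented snippets, for illustration only.
query_link = '<a href="download.php?file=tool-1.2.zip">'
noisy_page = '<a title="tool.zip archive" href="files/tool-1.2.zip">'

# The basic regex stops at the = sign and only returns the tail.
print(basic.search(query_link).group(0))
print(with_equals.search(query_link).group(0))

# The basic regex grabs the title text; asking for a / skips past it.
print(basic.search(noisy_page).group(0))
print(with_slash.search(noisy_page).group(0))
```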

Edited by Omniferum

Small question: how do I exclude some "string"? For example:

 

I have two download links. One contains x64, the other doesn't. How do I match the one that doesn't include x64?

 

http://esteid.googlecode.com/files/Eesti_ID_kaart_2_8_0_x64.msi
http://esteid.googlecode.com/files/Eesti_ID_kaart_2_8_0.msi

Edited by Etz
How to get url:

http://ftp.drupal.org/files/projects/drupal-7.2.tar.gz

but exclude:

http://ftp.drupal.org/files/projects/drupal-8.x-dev.tar.gz

?

 

http://drupal.org/project/drupal

 

That page lists recommended releases first and then development releases.

 

You don't really need any special exclusion regex. Just using [^"']+\.zip works fine.

 

However, for future reference, you could just use [^"']+[^"'v]+\.zip and that would do what you asked, since it refuses a v (as in dev) directly before the extension. In this case there is no need, though: you only need an exclusion regex for download links that aren't the first or last link on the page.
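
Here is a rough sketch of that trick in Python. The page and the example.com links are made up, with the dev build deliberately listed first so the exclusion has something to do.

```python
import re

# Made-up page: the dev build comes first, the stable release second.
html = ('<a href="http://example.com/tool-8.x-dev.zip">dev</a> '
        '<a href="http://example.com/tool-7.2.zip">7.2</a>')

plain = re.compile(r'''[^"']+\.zip''')
no_v = re.compile(r'''[^"']+[^"'v]+\.zip''')

print(plain.search(html).group(0))  # first .zip on the page, the dev build
print(no_v.search(html).group(0))   # skips it: a v sits right before .zip
```

The trick only looks at the single character in front of the extension, so it would also reject any other file name that happens to end in v.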


hello again

[^"'=]+jdk[^"']+[^jre]+\.md5

Why does this regex give this link:

http://www.java.net/download/jdk6/6u27/promoted/b01/binaries/jre-6u27-ea-bin-b01-windows-i586-18_may_2011.md5

but not this:

http://www.java.net/download/jdk6/6u27/promoted/b01/binaries/jdk-6u27-ea-bin-b01-windows-i586-18_may_2011.md5

?

 

UPD: I got the second URL with this template:

[^"'=]+/jdk-[^"']+\.md5

but I'm curious what's wrong with the first template.

Edited by UksusoFF

It helps to know exactly what you want to capture, but the flaw in your logic is an assumption of sequence. A negated character class like [^jre] doesn't exclude a character sequence, only distinct characters. So if you're trying to exclude the string "jre" you can't use [^jre], as this only excludes the letters j OR r OR e. Adding the + after it matches a run of any length made of characters other than those three. Further, with "greedy" captures (the default) on either side of it, the regex just has to find "jdk" anywhere in the URL (the /jdk6/ folder will do), then any characters, then at least one NON-"jre" character right before .md5. It's best, where possible, to use an explicit string sequence you WANT that will always appear in a true match, placed between normal greedy captures. As so:

[^"'=]+i-want-this[^"']+\.md5
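
The behaviour described above can be checked with a short Python sketch using the two URLs from the question. The last pattern uses a negative lookahead, which is my own addition and not something suggested in this thread; Python and .NET both support it, but I have not tried it in Ketarin.

```python
import re

# The two URLs from the question above.
jre_url = "http://www.java.net/download/jdk6/6u27/promoted/b01/binaries/jre-6u27-ea-bin-b01-windows-i586-18_may_2011.md5"
jdk_url = "http://www.java.net/download/jdk6/6u27/promoted/b01/binaries/jdk-6u27-ea-bin-b01-windows-i586-18_may_2011.md5"

first_try = re.compile(r'''[^"'=]+jdk[^"']+[^jre]+\.md5''')
explicit = re.compile(r'''[^"'=]+/jdk-[^"']+\.md5''')
# Assumption, not from the thread: refuse file names that start with "jre".
lookahead = re.compile(r'''[^"'=]+/(?!jre)[^"'/]+\.md5''')

# [^jre]+ only refuses the single letters j, r and e.
print(re.findall(r'[^jre]+', 'jre-6u27'))

# "jdk" is found in the /jdk6/ folder, so the jre link still matches.
print(first_try.search(jre_url) is not None)

# Requiring the explicit "/jdk-" sequence rules the jre link out.
print(explicit.search(jre_url))
print(explicit.search(jdk_url).group(0))

# The lookahead variant also rejects the jre link and keeps the jdk one.
print(lookahead.search(jre_url))
print(lookahead.search(jdk_url).group(0))
```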

 

Now that I've actually looked at the code for the JDK page I sent you in the other thread, you might want to use a capture for a portion of the javascript instead. There's some very distinct code in there that REALLY looks beneficial, as so:

document.getElementById("winOffline64JDK").href = "/download/jdk6/6u27/promoted/b01/binaries/jdk-6u27-ea-bin-b01-windows-amd64-18_may_2011.exe";

 

I'd match that (assuming I wanted the "winOffline64JDK" version) with:

"winOffline64JDK"\)\.href = "([^'"]+)"

 

That'll extract this portion:

/download/jdk6/6u27/promoted/b01/binaries/jdk-6u27-ea-bin-b01-windows-amd64-18_may_2011.exe

 

Then in your "download" box you just need to build it as so:

http://jdk6.java.net{myvariablename}

 

...assuming you captured the variable above as "myvariablename".
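
For anyone who wants to try the capture outside Ketarin, this is roughly what it does, sketched in Python. The page string is just the JavaScript line quoted above, and joining the strings stands in for what the {myvariablename} placeholder does in the download box.

```python
import re

# The JavaScript line quoted above, standing in for the whole page.
page = ('document.getElementById("winOffline64JDK").href = '
        '"/download/jdk6/6u27/promoted/b01/binaries/'
        'jdk-6u27-ea-bin-b01-windows-amd64-18_may_2011.exe";')

pattern = r'''"winOffline64JDK"\)\.href = "([^'"]+)"'''

# Group 1 is the part between the quotes after .href =
myvariablename = re.search(pattern, page).group(1)

# Stand-in for http://jdk6.java.net{myvariablename}
download_url = "http://jdk6.java.net" + myvariablename

print(myvariablename)
print(download_url)
```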
