Omniferum Posted November 5, 2010 (edited)

Edited to clean

Primarily there are only two groups of extensions that Windows programs come in, plus the occasional extra/addon.

Binaries - rar/zip/7z (These only need to be unzipped to be installed)
Executables - exe/msi (These need to be either extracted using the program Universal Extractor, or installed manually or silently with switches/parameters)
Extras/Addons - Things like Greasemonkey userscripts that are .user.js or Thunderbird/Firefox addons which are .xpi

I use only one regex that works for EVERY app I have. It is edited in only two ways for each app. One is that I change the last three letters of my regex to the actual file extension I want to find. So from this

[^"'=]+\.zip

if I want to find an exe I just do

[^"'=]+\.exe

The \. tells the regex it has to find the period character. If you just put .zip, the dot matches any character, so the regex can also match text like godilovezip that merely ends in "zip" with no real extension, which you don't want and which will obviously return an error.

To find specific words in the URL (like x32 or x64 builds) all you need to do is add the text after the first + sign. To find multiple keywords just keep adding [^"']+ followed by the keyword. Keep in mind it is sequential: the keywords have to appear in the URL in the same order as in the regex.

[^"'=]+64[^"']+\.zip

Keep in mind that if the 64 is directly before the file extension the regex won't match, because [^"']+ needs at least one character between the keyword and the extension. Just remove a character from the keyword and you'll be fine. So the above would turn into

[^"'=]+6[^"']+\.zip

If your download page lists the latest release at the bottom of the page instead of the top, just enclose your regex with .*( at the beginning and ) at the end. So it would end up looking like

.*([^"'=]+64[^"']+\.zip)

I have yet to find a way to 'exclude' specific words, more specifically words like source or src, as sometimes the first match is a source file which I have no use for. I've been asking on a few forums but sometimes I need to be spoonfed then hit on the head.
For those instances where I have run into the source/src problem I just add extra inclusion words that aren't in the source link that was found.

Anything and everything I've gleaned about regex came from asking the people in this forum, who were kind enough to give me stuff to fiddle with and chew on. I would not say I have a good grasp of regex, just that this 'functions' and it is the simplest one I've seen so far. Seeing as our primary purpose for regex is to find the download link, I believe this fits the bill.

SPECIAL CIRCUMSTANCES

There may be a time where that regex doesn't accurately capture download links that have an equals sign in them. To get those as well you only need to do the following.

[^"'=]+=[^"']+\.zip

Also a helpful hint: if you find it is capturing some weird part of the page, try adding / instead to make it look for something that has a folder structure.

[^"'=]+/[^"']+\.zip

Edited January 16, 2011 by Omniferum
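A minimal Python sketch of the patterns from this post, run against a made-up page. The example.com URLs and the first_match helper are assumptions for illustration only, and Python's re module may not behave identically to the regex engine of whatever tool you paste these patterns into.

```python
import re

# Hypothetical page snippet; every URL here is made up for illustration.
PAGE = '''
<a href="http://example.com/files/tool-1.2-src.zip">source</a>
<a href="http://example.com/files/tool-1.2-x32.zip">32-bit</a>
<a href="http://example.com/files/tool-1.2-x64-setup.zip">64-bit</a>
'''

def first_match(pattern, text):
    """Return the first substring matching pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Base pattern: a run of non-quote, non-equals characters ending in .zip.
base = first_match(r'''[^"'=]+\.zip''', PAGE)

# Keyword added after the first + sign: the first link containing 64.
with_64 = first_match(r'''[^"'=]+64[^"']+\.zip''', PAGE)

# [^"']+ needs at least one character, so 64 directly before .zip fails,
# while searching for 6 lets the 4 be that one character.
tight = 'href="http://example.com/files/tool-x64.zip"'
tight_64 = first_match(r'''[^"'=]+64[^"']+\.zip''', tight)
tight_6 = first_match(r'''[^"'=]+6[^"']+\.zip''', tight)

# A link with an equals sign is cut at the = unless the = is asked for.
query = 'href="http://example.com/get.php?file=tool-1.2.zip"'
cut = first_match(r'''[^"'=]+\.zip''', query)
whole = first_match(r'''[^"'=]+=[^"']+\.zip''', query)

print(base, with_64, tight_64, tight_6, cut, whole, sep="\n")
```

Here the base pattern picks up the source archive first, which is exactly the source/src problem described above; the 64 keyword is what steers it to the wanted link.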
Reformed Pirate Posted November 8, 2010

I like those regexes... very smart. I've been doing things like "http\S+\.\w{3}" ...something like that (can't remember right now), but I may have to give yours a shot.
Omniferum Posted December 6, 2010

Edited to be a bit more user friendly
Omniferum Posted January 16, 2011

Updated again
Etz Posted January 29, 2011 (edited)

Small question: how do I exclude some "string"? For example, I have two download links, one contains x64 and the other doesn't. How do I match the one which doesn't include x64?

http://esteid.googlecode.com/files/Eesti_ID_kaart_2_8_0_x64.msi
http://esteid.googlecode.com/files/Eesti_ID_kaart_2_8_0.msi

Edited January 29, 2011 by Etz
Omniferum Posted January 29, 2011

Again, I don't know how to put in 'exclude strings', it's a tad fiddly. However, in your case

.*([^"'=]+\.msi)

should do the trick.
Etz Posted January 29, 2011

Quote:
Again I don't know how to put in 'exclude strings', tad fiddly. However in your case. .*([^"'=]+\.msi) should do the trick

Yes it does, but it breaks if they change the file order on the webpage...
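A hedged Python sketch of the "latest release is at the bottom of the page" case. It compares the .*( ... ) wrapper from the first post with simply collecting every match and taking the last one. The page content is made up, and re.DOTALL is an assumption added so that .* can run across line breaks. In Python's re module, at least, the wrapper does land on the last link, but the greedy .* leaves the group only the shortest tail that still matches; the thread does not confirm whether the tool used here behaves the same way.

```python
import re

# Hypothetical page with the newest release listed last.
PAGE = (
    '<a href="http://example.com/files/tool-x64-1.0.zip">old</a>\n'
    '<a href="http://example.com/files/tool-x64-1.1.zip">new</a>\n'
)
PATTERN = r'''[^"'=]+64[^"']+\.zip'''

# Alternative: collect every match in page order and keep the last one.
all_links = re.findall(PATTERN, PAGE)
last_link = all_links[-1] if all_links else None

# The wrapped form from the first post. DOTALL lets . match newlines.
wrapped = re.search(r'''.*([^"'=]+64[^"']+\.zip)''', PAGE, re.DOTALL)
wrapped_group = wrapped.group(1) if wrapped else None

print(last_link)
print(wrapped_group)
```

If the wrapped form returns a shortened link in your tool as well, taking the last of all matches is one way around it.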
Omniferum Posted January 29, 2011 (edited)

That would be annoying as all buggery. Still, this one will work so long as 4 is at the end of the filename for the 64-bit version

[^"'=]+[^4]\.msi

Edited January 29, 2011 by Omniferum
Etz Posted January 29, 2011

Quote:
That would be annoying as all buggery. Still this one will work so long as 4 is at the end of the filename for the 64bit version [^"'=]+[^4]\.msi

Wow... does exactly what's needed... thx
Omniferum Posted January 29, 2011

Yeah, the problem though is that the only way to add exclusion strings (that I know of) is to put stuff like 4 in this

[^]

to make

[^4]

and add those around until you have all the exclusions you need in the exact spot you need them. Sorta inelegant, but such is life.
shawn Posted January 31, 2011

I would recommend excluding for "x" instead of "4", otherwise when v2.8.4 (or similar) is released, it won't be downloaded.

[^"'=x]+\.msi
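A small Python sketch that runs the two exclusion patterns above against Etz's two links in both page orders. The page and first helpers, the surrounding HTML, and the 2_8_4 filename are hypothetical. The last pattern uses a negative lookahead, which is not something proposed in this thread; it is shown only as one possible way to exclude a whole string such as x64, and it assumes the regex engine in use supports lookaheads.

```python
import re

X64 = "http://esteid.googlecode.com/files/Eesti_ID_kaart_2_8_0_x64.msi"
X86 = "http://esteid.googlecode.com/files/Eesti_ID_kaart_2_8_0.msi"
# Hypothetical future release whose version number ends in 4.
FUTURE = "http://esteid.googlecode.com/files/Eesti_ID_kaart_2_8_4.msi"

def page(*urls):
    """Build a made-up page listing the given links in order."""
    return "\n".join('<a href="%s">download</a>' % u for u in urls)

def first(pattern, text):
    m = re.search(pattern, text)
    return m.group(0) if m else None

NOT_4 = r'''[^"'=]+[^4]\.msi'''   # no 4 directly before .msi
NO_X = r'''[^"'=x]+\.msi'''       # no x anywhere in the match
# Assumption, not from the thread: skip any link containing x64.
LOOKAHEAD = r'''http(?![^"']*x64)[^"']+\.msi'''

for name, pat in [("NOT_4", NOT_4), ("NO_X", NO_X), ("LOOKAHEAD", LOOKAHEAD)]:
    print(name, first(pat, page(X64, X86)), first(pat, page(X86, X64)))
```

In Python, excluding x still allows a match to start right after the x, so with the 64-bit link listed first that pattern returns the fragment 64.msi, while the [^4] pattern misses a release ending in 4, as noted above.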
UksusoFF Posted May 26, 2011

thanks! very helpful!
UksusoFF Posted May 26, 2011 (edited)

How do I get this URL:

http://ftp.drupal.org/files/projects/drupal-7.2.tar.gz

but exclude this one:

http://ftp.drupal.org/files/projects/drupal-8.x-dev.tar.gz

?

Edited May 26, 2011 by UksusoFF
necrox Posted May 26, 2011

And again a ban! Congratulation!
Omniferum Posted May 26, 2011

Quote:
How to get url: http://ftp.drupal.org/files/projects/drupal-7.2.tar.gz but exclude: http://ftp.drupal.org/files/projects/drupal-8.x-dev.tar.gz ?

http://drupal.org/project/drupal

That page lists recommended releases first and then development releases. You don't really need any special exclusion regex. Just using

[^"']+\.zip

works fine. However, for future reference you could just do

[^"']+[^"'v]+\.zip

and that would do what you asked. In this case there is no need, though; you only need exclusion regex for download links that aren't the first or last link on the page.
UksusoFF Posted May 26, 2011

thanx!
UksusoFF Posted May 31, 2011 (edited)

hello again

[^"'=]+jdk[^"']+[^jre]+\.md5

Why does this regex give this link:

http://www.java.net/download/jdk6/6u27/promoted/b01/binaries/jre-6u27-ea-bin-b01-windows-i586-18_may_2011.md5

but not this one:

http://www.java.net/download/jdk6/6u27/promoted/b01/binaries/jdk-6u27-ea-bin-b01-windows-i586-18_may_2011.md5

?

UPD: I got the second URL with this template:

[^"'=]+/jdk-[^"']+\.md5

but I'm still interested in what's wrong with the first template.

Edited May 31, 2011 by UksusoFF
shawn Posted May 31, 2011

It helps to know exactly what you want to capture, but the flaw in your logic is an assumption of sequence. A character class like [^...] has no notion of a character sequence; it only excludes individual characters. So if you're trying to exclude the string "jre" you can't use [^jre], as this will only exclude the letters j OR r OR e. Adding the + after it will exclude any combination of those three for any length. Further, if you have "greedy" captures (the default) on either side of it, then the pattern just has to capture any single character at the beginning, then any NON-"jre" character, then any length of other characters.

It's best, where possible, to use an explicit string sequence you WANT that will always appear in a true match between normal greedy captures. Like so:

[^"'=]+i-want-this[^"']+\.md5

Now that I've actually looked at the code for the JDK page I sent you in the other thread, you might want to use a capture for a portion of the JavaScript instead. There's some very distinct code in there that REALLY looks beneficial, like so:

document.getElementById("winOffline64JDK").href = "/download/jdk6/6u27/promoted/b01/binaries/jdk-6u27-ea-bin-b01-windows-amd64-18_may_2011.exe";

I'd match that (assuming I wanted the "winOffline64JDK" version) with:

"winOffline64JDK"\)\.href = "([^'"]+)"

That'll extract this portion:

/download/jdk6/6u27/promoted/b01/binaries/jdk-6u27-ea-bin-b01-windows-amd64-18_may_2011.exe

Then in your "download" box you just need to build it like so:

http://jdk6.java.net{myvariablename}

...assuming you captured the variable above as "myvariablename".
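The two points above can be sketched in Python. The surrounding HTML and the link order on the page (jre listed first) are assumptions; the URLs, the JavaScript line and the patterns come from the posts. Plain string concatenation stands in for the {myvariablename} substitution, which belongs to the download tool and not to Python.

```python
import re

BASE = "http://www.java.net/download/jdk6/6u27/promoted/b01/binaries/"
JRE = BASE + "jre-6u27-ea-bin-b01-windows-i586-18_may_2011.md5"
JDK = BASE + "jdk-6u27-ea-bin-b01-windows-i586-18_may_2011.md5"

# Assumed page layout: the jre link appears before the jdk link.
PAGE = '<a href="%s">jre</a>\n<a href="%s">jdk</a>\n' % (JRE, JDK)

# "jdk" is satisfied by the jdk6 folder in the path, and [^jre]+ only
# needs one character that is not j, r or e (here a digit) before .md5.
first_try = re.search(r'''[^"'=]+jdk[^"']+[^jre]+\.md5''', PAGE).group(0)

# An explicit wanted sequence, /jdk-, pins the match to the jdk file.
second_try = re.search(r'''[^"'=]+/jdk-[^"']+\.md5''', PAGE).group(0)

# Capturing the path from the JavaScript line quoted in the post.
SCRIPT = (
    'document.getElementById("winOffline64JDK").href = '
    '"/download/jdk6/6u27/promoted/b01/binaries/'
    'jdk-6u27-ea-bin-b01-windows-amd64-18_may_2011.exe";'
)
m = re.search(r'''"winOffline64JDK"\)\.href = "([^'"]+)"''', SCRIPT)
path = m.group(1)

# Stand-in for http://jdk6.java.net{myvariablename}.
download_url = "http://jdk6.java.net" + path

print(first_try)
print(second_try)
print(download_url)
```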
UksusoFF Posted May 31, 2011

shawn, thanks for the explanation about regex
UksusoFF Posted September 11, 2011

Hi All! Can you give a sample for getting the URL from this:

<a href="http://samlple.com/engine/download.php?id=123" >known_part_with_random_numbers</a>

?
Omniferum Posted September 11, 2011

[^"']+/[^"']+

That would do the job perfectly
UksusoFF Posted September 12, 2011

Didn't work.. It works if the known part is in the URL ("[^"']+download[^"']+/[^"']+), but not when it is between <a></a>.
Omniferum Posted September 12, 2011 (edited)

You only gave me a line of text. You should just give the full URL. Anyway, this should work

http[^"']+

Edited September 12, 2011 by Omniferum
UksusoFF Posted September 12, 2011

Quote:
You only gave me a line of text. You should just give the full URL

look at this http://terrariago.ru/download/client/6-terraria.html
Omniferum Posted September 12, 2011

[^"']+\?id=[^"']+

The above regex gives me http://terrariago.ru/engine/download.php?id=26 from that URL
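A short Python sketch of the ?id= pattern, applied to the sample anchor from the earlier question. The second pattern is an assumption that goes beyond the thread: it captures the href of the link whose visible text starts with the known part, which is what the question asked for. The extra id=7 link is made up to show that the link text, not the position, decides the match.

```python
import re

# The sample anchor from the question, preceded by a made-up link.
PAGE = (
    '<a href="http://samlple.com/engine/download.php?id=7">other</a>\n'
    '<a href="http://samlple.com/engine/download.php?id=123" >'
    'known_part_with_random_numbers</a>'
)

# Pattern from the post: anything up to ?id= plus what follows it.
# On this page it returns the first such link, whatever its text is.
by_query = re.search(r'''[^"']+\?id=[^"']+''', PAGE).group(0)

# Assumption: anchor on the link text and capture the href before it.
m = re.search(r'''href="([^"]+)"[^>]*>known_part''', PAGE)
by_text = m.group(1) if m else None

print(by_query)
print(by_text)
```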