Omniferum Posted April 27, 2011 I'm having a problem scraping http://www.gonvisor.com; it keeps giving me a 403 error. Anybody got any ideas how to resolve this? Previous issues point to a header error of some sort. I tried analyzing the HTTP traffic while loading the page but got nothing of any substance.
shawn Posted April 27, 2011 403 is "forbidden". Typically this means that your user-agent isn't allowed on their site. Trying with the following forged IE header works fine (for me): Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727) It's also possible that your IP has been banned due to abuse, or for exceeding a certain number of hits over a period (common rules for tools like APF/BFD).
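For anyone wanting to test this outside Ketarin, a minimal sketch in Python of sending the forged IE header shawn mentions (the user-agent string is copied from his post; the snippet only builds the request and does not fetch the page, since the site may no longer respond):

```python
import urllib.request

# Forge the IE user-agent from shawn's post so the server doesn't
# reject the request with 403 Forbidden.
UA = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; "
      ".NET CLR 1.1.4322; .NET CLR 2.0.50727)")

req = urllib.request.Request(
    "http://www.gonvisor.com/",
    headers={"User-Agent": UA},
)
# urllib.request.urlopen(req) would perform the actual fetch;
# here we just confirm the header is attached.
print(req.get_header("User-agent"))
```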
Omniferum Posted April 27, 2011 (Author) Whoops, I thought that referrer field was only for the download URL, not for everything. I was messing around with httpx://&header:accept stuff. Thanks shawn
CybTekSol Posted May 28, 2011 I'm finding it necessary to add a 'user agent' entry more often these days... your thoughts on this, Shawn?
shawn Posted May 28, 2011 Sadly, it's very common for servers to verify a valid connection now, especially if the data is being hosted in a cloud setup. I have the following custom variables set up in my Ketarin to help get around these issues:

ie32 = Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C)
ie64 = Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Media Center PC 5.0; SLCC1; Tablet PC 2.0; .NET4.0C)
firefox = Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10
opera = Opera/9.80 (Windows NT 6.1; U; en) Presto/2.6.30 Version/10.62
chrome = Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
wget = wget/1.9+cvs-stable+(red+hat+modified)
curl = pycurl/7.18.2

If you have UA header issues, start trying to fix it with curl and wget, and if they don't work, use ie32, ie64 and the others. Usually it'll work by the time you get to ie32.
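The "try them in order" strategy above can be sketched in Python like this. The user-agent strings are copied from the thread (only the first three are shown for brevity); the fetch_with_fallback helper and its opener parameter are my own illustration for testing, not part of Ketarin:

```python
import urllib.error
import urllib.request

# User-agent strings from shawn's post, in the order he suggests
# trying them (curl/wget first, then browser-like agents).
USER_AGENTS = {
    "curl": "pycurl/7.18.2",
    "wget": "wget/1.9+cvs-stable+(red+hat+modified)",
    "ie32": ("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; "
             "Trident/4.0; SLCC2; .NET CLR 2.0.50727; "
             ".NET CLR 3.5.30729; .NET CLR 3.0.30729; "
             "Media Center PC 6.0; .NET4.0C)"),
}

def fetch_with_fallback(url, agents=("curl", "wget", "ie32"),
                        opener=urllib.request.urlopen):
    """Retry a URL with increasingly browser-like user-agents.

    Moves on to the next agent only when the server answers 403;
    any other HTTP error is re-raised unchanged.
    """
    for name in agents:
        req = urllib.request.Request(
            url, headers={"User-Agent": USER_AGENTS[name]})
        try:
            with opener(req) as resp:
                return name, resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 403:
                raise
    raise RuntimeError("every user-agent was rejected with 403")
```

The opener parameter exists so the fallback logic can be exercised against a stub server instead of a live site.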
CybTekSol Posted May 31, 2011 WGet as 'user agent' has been very effective for me overall, so far.
shawn Posted May 31, 2011 Me too -- as long as the site is actually intended to distribute files. If it's a "mom and pop" or a very small business, it'll likely fail completely.