I am after contact data for a certain type of company.
There is an official (i.e. governing authority) web site that allows you to search for these companies and retrieve the contact data, although the search page only returns 100 results each time. This could lead to missed contacts (and duplicates but I can deal with that in other ways).
I know how the link to each piece of company data is created and I have recreated this in a database (basically a URL including a unique number). So now I have 500,000 web links (most of which will not work but some will) that my scraper can crawl through and retrieve the data without me carrying out a lot of manual searches to get 100 links each time.
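For anyone following along, building that list of links can be sketched roughly like this. The base URL and the `id` parameter are made up for illustration, since the actual site and link format aren't named here.

```python
from typing import Iterator

# Hypothetical URL pattern: the real site and parameter name are not given in the thread.
BASE_URL = "https://example.org/company?id={company_id}"


def candidate_urls(start: int, stop: int) -> Iterator[str]:
    """Yield one candidate URL for each ID in the half-open range [start, stop)."""
    for company_id in range(start, stop):
        yield BASE_URL.format(company_id=company_id)


# Small demonstration range; the thread talks about roughly 500,000 IDs.
first_three = list(candidate_urls(1, 4))
```

Using a generator means the half million links never have to sit in memory at once, and most of them are expected to lead nowhere anyway.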
So, given that the contact data is publicly available elsewhere (just a PITA to collect) would I be a bad boy if I clicked 'go'?
Screen scraping is a regular thing for a lot of price consolidation/comparison sites.
If someone doesn't want it to happen they should code to prevent it.
I would schedule the collection process. Hammering their website with half a million hits might not feel like a very nice thing to happen to them.
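As a rough idea of what scheduling could look like, here is a minimal sketch that spaces requests out with a fixed pause. The two-second delay is only an assumed figure, and `fetch` stands in for whatever function the scraper already uses to download a page. At that pace 500,000 links take about 278 hours, so roughly eleven and a half days.

```python
import time
from typing import Callable, Iterable, List


def estimated_hours(total_requests: int, delay_seconds: float) -> float:
    """Time spent waiting between requests, ignoring the download time itself."""
    return total_requests * delay_seconds / 3600


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 2.0,  # assumed pause; pick whatever the site can tolerate
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause needed before the very first request
        results.append(fetch(url))
    return results
```

The `sleep` argument is there so the pause can be swapped out when trying the function without actually waiting.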
I know if one of my web sites got half a million hits from one IP address I'd treat it as an attack and make an effort to track down the perps. In fact, some of our websites would block the traffic quite quickly as they'd deem it a DOS attack.
Yep I was going to leave the start until tonight.
I know if one of my web sites got half a million hits from one IP address I'd treat it as an attack and make an effort to track down the perps
I did think about this, but what would they do then? I'm only retrieving data that they make available anyway.
I may not bother as 500,000 hits isn't going to cover it really.
Depends who they are and what resources they have at their beck and call really ;-). My guys would be chomping at the bit to make a retribution hit for this sort of behaviour. (only joking)
As I say though, if you spread it out they probably wouldn't be too bothered, but the last thing you need is a complaint being raised with your ISP or even that they blacklist you.
Hire a botnet for an evening?
What would they do? They'd block or rate limit your IP, so if it is a work IP or similar then it could have a negative impact on other people being able to work against that site should they need to.
Keep it under control and I doubt you'll have a problem getting all the data.
I did think about this, but what would they do then? I'm only retrieving data that they make available anyway.
Big difference between getting a bit of data for you to book one holiday and hitting the same site 50,000 times for all their holiday data, for example.
I've seen sites drop in a CAPTCHA blocker after a certain number of hits from the same IP so you might not get all the data you were expecting.
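If that happens, the scraper would ideally notice and back off rather than keep collecting CAPTCHA pages. A minimal sketch, assuming the block shows up as an HTTP 403 or 429 status or as a page mentioning a CAPTCHA (the real site may signal it differently):

```python
from typing import List


def looks_blocked(status_code: int, body: str) -> bool:
    """Guess whether a response is a block page rather than real data."""
    # 403 (forbidden) and 429 (too many requests) are common block responses;
    # the word "captcha" in the page is an assumed marker, not a guarantee.
    return status_code in (403, 429) or "captcha" in body.lower()


def backoff_delays(base_seconds: float, attempts: int, cap_seconds: float = 3600.0) -> List[float]:
    """Doubling wait times for successive retries, capped at cap_seconds."""
    return [min(base_seconds * 2 ** n, cap_seconds) for n in range(attempts)]
```

With a base of 60 seconds the waits go 60, 120, 240, 480 and so on, up to an hour at most.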
Does the website in question have terms of use?
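Terms of use need reading by a person, but many sites also publish a robots.txt file that says which paths crawlers may visit, and that can be checked in code. A minimal sketch using Python's standard `urllib.robotparser`; the rules and URLs below are invented for illustration and are not taken from the site in question.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download the site's own file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Crawl-delay: 5
"""


def load_rules(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)
may_fetch_record = rules.can_fetch("my-scraper", "https://example.org/company?id=1")
may_fetch_search = rules.can_fetch("my-scraper", "https://example.org/search?q=x")
requested_delay = rules.crawl_delay("my-scraper")
```

A robots.txt file is not the same thing as the terms of use, so passing this check doesn't settle the question being asked.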