A custom ubot robot can do web mining projects but…
What is website scraping?
Website scraping is the process of collecting information from a website and storing that information on the collecting computer. This can be done with a custom ubot robot. There are a variety of ways to scrape this data, from something as simple as copying and pasting into a spreadsheet or word processing document to scripts and programs specifically designed to gather certain information from specific sites. These programs are typically known as ROBOTs or BOTs. Further, because of the proliferation of BOTs, whole development languages have been created to make common tasks easier to develop and to adapt to changing environments.
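To make the idea concrete, here is a minimal sketch of the "script or program" approach in Python, using only the standard library. The sample HTML is hypothetical; a real BOT would first download the page (for example with urllib.request) before parsing it.

```python
from html.parser import HTMLParser

# A minimal scraper: pull link targets (href attributes) out of an HTML page.
class LinkScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href attribute from every anchor tag we encounter.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Hypothetical page content standing in for a downloaded web page.
sample_html = """
<html><body>
  <a href="https://example.com/products">Products</a>
  <a href="https://example.com/contact">Contact</a>
</body></html>
"""

scraper = LinkScraper()
scraper.feed(sample_html)
print(scraper.links)  # → ['https://example.com/products', 'https://example.com/contact']
```

The same handler pattern extends to prices, headlines, or any other element the BOT is built to gather.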
What kind of information do BOTs collect?
ROBOT programs, such as a custom ubot robot, have been widely used to capture contact information for both people and companies. This information is used to prospect for new customers. In addition, it has become increasingly popular to scrape sites for content. This content is then spun into a unique new article and re-purposed for another site or sites, saving the site owners the time of creating their own unique content for search engines to find.
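Capturing contact information often comes down to simple pattern matching over scraped page text. Below is a hedged sketch using a deliberately simple email pattern; the page text is hypothetical, and production scrapers use more robust patterns and validation.

```python
import re

# Hypothetical page text; a real BOT would scrape this from live sites.
page_text = """
Contact our sales team at sales@example.com or call 555-0100.
Support: support@example.com
"""

# A simplified email pattern -- good enough to illustrate the idea,
# not a complete implementation of the email address grammar.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

emails = EMAIL_RE.findall(page_text)
print(emails)  # → ['sales@example.com', 'support@example.com']
```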
Is it legal to scrape a website for information?
Many BOT owners believe that information placed on the Internet is in the public domain and don’t think twice about copying the text and graphics. Obviously, some material is copyrighted or patent protected, and several lawsuits have arisen as a result. Many suits have been settled before reaching a conclusion in the courts, and because of the legal costs of prosecution and defense, the number of cases to date is somewhat limited. Surely there will be many more suits before the subject is well defined in the justice system in the US as well as globally.
How do websites try to block the ROBOT/BOTs?
Two main methods are being used to try to stop the use of custom ubot robots. Once the source of a BOT is identified, either by excessive traffic or by systematic probing, an IP block is placed on the web server, and the server refuses connections from that address or address range. Many ROBOT programs get around the block by using a proxy address to conceal their true location. The second common method of blocking is the CAPTCHA challenge, the idea being that only a human can solve the CAPTCHA question. CAPTCHA-solving services and advanced technology are defeating this method too.