import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// The wrapper class name is arbitrary; the original shows only the main method.
public class UrlFetchDemo {
    public static void main(String[] args) throws IOException {
        // Create a URL object, passing the address to the constructor
        URL url = new URL("https://www.jd.com/");
        // Open a connection to the URL. This is what a browser does when we visit the page;
        // the difference is that Java hands back a connection object instead of a rendered page
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        // Use the response status code to check whether the request succeeded
        int code = connection.getResponseCode();
        if (code != 200) {
            return;
        }
        // The connection's input stream carries the content of the page
        InputStream stream = connection.getInputStream();
        // Read the stream line by line with a BufferedReader, closing it when done
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String r;
            while ((r = reader.readLine()) != null) {
                System.out.println(r);
            }
        }
    }
}


Note that the URL must include its protocol. Passing just jd.com to the constructor fails with:

Exception in thread "main" java.net.MalformedURLException: no protocol: jd.com
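The "no protocol" error comes from parsing the string, before any network request is made. The following is a minimal sketch that reproduces it in isolation; the class name `UrlProtocolCheck` and the helper `parseError` are hypothetical names introduced for illustration only.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlProtocolCheck {

    /** Returns null if the string parses as a URL, otherwise the parser's error message. */
    static String parseError(String spec) {
        try {
            new URL(spec); // only parses the string; no connection is opened here
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseError("jd.com"));              // no protocol: jd.com
        System.out.println(parseError("https://www.jd.com/")); // null
    }
}
```

Prefixing the address with `https://` (or `http://`) is enough to make the constructor accept it.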


Jingdong (JD.COM)-Genuine low price, quality assurance, timely delivery, easy shopping! window.pageConfig = { compatible: true, preload: false, navId: 'jdhome2016', timestamp: 1521683291000, isEnablePDBP: 0,, surveyTitle: 'Survey Questionnaire', surveyLink : '//surveys.jd.com/index.php?r=survey/index/sid/889711/newtest/Y/lang/zh-Hans', leftCateABtestSwitch : 0, '' : '' }

Next, we move on to the main topic: how to crawl quickly with crawler4j. The source code can be downloaded from GitHub.


crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can setup a multi-threaded web crawler in few minutes.



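<repositories>
    <repository>
        <id>onebeartoe</id>
        <name>onebeartoe</name>
        <url>https://repository-onebeartoe.forge.cloudbees.com/snapshot/</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>edu.uci.ics</groupId>
        <artifactId>crawler4j</artifactId>
        <version>4.4-SNAPSHOT</version>
    </dependency>
</dependencies>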



import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler2 extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Filters out URLs you do not want to visit:
        // return false and the URL is dropped and never crawled
        return super.shouldVisit(referringPage, url);
    }

    @Override
    public void visit(Page page) {
        // Called for every URL that shouldVisit accepted.
        // The fetched content is already wrapped in the Page object
        super.visit(page);
    }
}

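Step 3: Create a controller class (in practice any class will do) and write its main method following the official documentation. You only need to change the number of threads and the URL, and that is all. The URL is called a seed. Once a seed is passed in, crawler4j extracts every URL found in that page's content and keeps crawling from there. URLs it has already seen are not crawled again, so the crawl cannot fall into an infinite loop.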
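The seed-and-deduplicate behaviour described above can be sketched with a queue and a set of already-seen URLs. This is only an illustration of the idea, not crawler4j's actual internals: the class `SeedCrawlSketch`, the method `crawlOrder`, and the in-memory link map standing in for real pages are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class SeedCrawlSketch {

    /** Breadth-first walk from the seed; each URL is scheduled at most once. */
    static List<String> crawlOrder(String seed, Map<String, List<String>> links) {
        Set<String> seen = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        List<String> visited = new ArrayList<>();
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            visited.add(url);
            // "links" plays the role of the URLs extracted from the fetched page
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) { // add() returns false for a URL already scheduled
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Pages a, b and c all link back to each other, yet the walk terminates
        Map<String, List<String>> links = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "c"),
            "c", List.of("a"));
        System.out.println(crawlOrder("a", links)); // [a, b, c]
    }
}
```

Because a URL enters the queue only the first time it is seen, cycles between pages cannot cause repeated visits.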

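import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root"; // where crawl data is stored
        int numberOfCrawlers = 7;                       // number of threads

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder); // apply the setting to the config object

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        // Create the crawl controller
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.ics.uci.edu/~lopes/"); // seed URL
        // The official example passes MyCrawler.class; here we pass the crawler class defined above
        controller.start(MyCrawler2.class, numberOfCrawlers); // start crawling
    }
}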



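Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/ssl/TrustStrategy
    at com.chenyao.Controller.main(Controller.java:25)
Caused by: java.lang.ClassNotFoundException: org.apache.http.ssl.TrustStrategy

This error was baffling, so I searched for it online. The fix is simply to add two jars, and their versions must be recent enough:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpcore</artifactId>
    <version>4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.4.1</version>
</dependency>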
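A `NoClassDefFoundError` like the one above means a class could not be located on the classpath at run time. One quick way to confirm whether the dependency change took effect is to probe for the class by name. The sketch below assumes nothing about crawler4j itself; `ClasspathCheck` and `isPresent` are hypothetical names used only for this illustration.

```java
class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class named in the stack trace; prints false while the jar providing it is missing
        System.out.println(isPresent("org.apache.http.ssl.TrustStrategy"));
    }
}
```

Run it with the same classpath as the crawler project; a result of `true` indicates the class from the stack trace can now be resolved.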



import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root"; // where crawl data is stored
        int numberOfCrawlers = 1;                       // number of threads

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder); // apply the setting to the config object

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        // Create the crawl controller
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Seed URL to crawl (addSeed is an instance method, so it is called on "controller")
        controller.addSeed("https://search.jd.com/Search?keyword=notebook&enc=utf-8&wq=notebook");
        controller.start(MyCrawler2.class, numberOfCrawlers); // start crawling
    }
}



  • Inspect the page's tags to locate the specific product information you want. Here I crawl the image links, but you can crawl other information depending on your needs.

    import java.util.regex.Pattern;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler2 extends WebCrawler {

        // Custom filter rule: skip static resources
        private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            String href = url.getURL().toLowerCase(); // lowercase the candidate URL
            // Define which URLs pass the filter. My requirement is to crawl only the product
            // pages of a JD search, so the URL must start with https://search.jd.com/search
            return !FILTERS.matcher(href).matches()
                && href.startsWith("https://search.jd.com/search");
        }

        @Override
        public void visit(Page page) {
            String url = page.getWebURL().getURL();
            System.out.println(url);
            // Check that the page is an actual HTML page
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                String html = htmlParseData.getHtml(); // HTML content of the page
                // Parse the HTML with jsoup (a quick search will explain it if it is new to you)
                Document doc = Jsoup.parse(html);
                // The selectors depend on the page's HTML structure;
                // press F12 in the browser to inspect it yourself
                Elements elements = doc.select(".gl-item");
                if (elements.isEmpty()) {
                    return;
                }
                for (Element element : elements) {
                    Elements img = element.select(".err-product");
                    // select() never returns null, so check for an empty result instead
                    if (!img.isEmpty()) {
                        // Print the image links
                        System.out.println(img.attr("src"));
                        System.out.println(img.attr("data-lazy-img"));
                    }
                }
            }
        }
    }
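The filter rule in shouldVisit can be tried out on its own, without crawler4j, to see which URLs it lets through. The sketch below copies the pattern and prefix from the crawler; the class name `SearchUrlFilter`, the method `accept`, and the second and third sample URLs are made up for illustration.

```java
import java.util.regex.Pattern;

class SearchUrlFilter {

    // Same pattern as FILTERS in the crawler: rejects URLs ending in a static-resource extension
    private static final Pattern FILTERS =
        Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");

    private static final String PREFIX = "https://search.jd.com/search";

    /** Mirrors the shouldVisit logic: lowercase first, then apply both conditions. */
    static boolean accept(String url) {
        String href = url.toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith(PREFIX);
    }

    public static void main(String[] args) {
        // The seed URL: passes because it is lowercased before the prefix check
        System.out.println(accept(
            "https://search.jd.com/Search?keyword=notebook&enc=utf-8&wq=notebook")); // true
        // Hypothetical static resource under the same prefix: rejected by the extension filter
        System.out.println(accept("https://search.jd.com/search/logo.png"));       // false
        // Different host: rejected by the prefix check
        System.out.println(accept("https://www.jd.com/"));                          // false
    }
}
```

Note that the seed URL contains `/Search` with a capital S; it only matches the lowercase prefix because the URL is lowercased before the comparison.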



