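With some free time on my hands, I decided to write a small program that quickly collects company information and exports it to an Excel sheet. So, heh, I started digging into Tianyancha. Without further ado:
1. The Tianyancha site is at https://www.tianyancha.com. Search for a keyword there, for example 教育 (education), and Tianyancha reports 100000+ matching companies. Once you start paging through the results, though, you find that without logging in you can only view 2 pages; after that it prompts you to log in to see more. So log in (Tianyancha offers quick login by SMS), and then start analyzing. Open the developer tools with F12 (Chrome recommended), press Ctrl+Shift+C, and click the block of information we want to grab. It turns out everything sits inside the node marked with the red box in the screenshot below!
Now that we know where the data lives, how do we fetch the page itself?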
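Classes that ship with Java are enough for this! The java.net.HttpURLConnection class (in the java.net package) can imitate a browser request. Straight to the code: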
package com.zsx.crawler.utils.TianYanChaCompanyCrawler;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WebUtil {

    public static String getPageContent(String url) {
        StringBuilder sb = new StringBuilder();
        HttpURLConnection httpUrlConn = null;
        try {
            // Open the connection
            URL u = new URL(url);
            httpUrlConn = (HttpURLConnection) u.openConnection();
            httpUrlConn.setDoInput(true);
            httpUrlConn.setRequestMethod("GET");
            // Request headers (the commented-out ones are optional)
            //httpUrlConn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
            //httpUrlConn.setRequestProperty("Accept-Encoding", "gzip, deflate, br");
            //httpUrlConn.setRequestProperty("Accept-Language", "zh-CN,zh;q=0.9");
            //httpUrlConn.setRequestProperty("Connection", "keep-alive");
            //httpUrlConn.setRequestProperty("Host", "www.tianyancha.com");
            //httpUrlConn.setRequestProperty("Referer", "https://www.tianyancha.com/");
            //httpUrlConn.setRequestProperty("Upgrade-Insecure-Requests", "1");
            // Placeholder: paste the Cookie header your logged-in browser sends to Tianyancha
            httpUrlConn.setRequestProperty("Cookie", "<cookie copied from your browser>");
            httpUrlConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36");
            // Wrap the byte stream in a buffered UTF-8 character reader;
            // try-with-resources closes the reader and the underlying stream
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(httpUrlConn.getInputStream(), StandardCharsets.UTF_8))) {
                // Read the response line by line. Lines are joined without separators,
                // so the page ends up as one long line of text.
                String data;
                while ((data = br.readLine()) != null) {
                    sb.append(data);
                    System.out.println(data);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the connection
            if (httpUrlConn != null) {
                httpUrlConn.disconnect();
            }
        }
        return sb.toString();
    }
}
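Call this method and print the return value, or save it to a txt file, and you can see that the page source was fetched successfully. Looking through that source, it turns out there are hidden elements in it (span nodes with class "tt hidden"), and each one holds the JSON data for a company listed on that page. That makes things even easier: pull the fields we want straight out of the JSON, write them to an Excel sheet, and we are done!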
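The extraction idea can be tried in isolation before touching the network. The following is only a minimal sketch: the class name `HiddenJsonExtractor` and the sample markup in `main` are made up for illustration, and the only thing taken from the article is the `<span class="tt hidden">` wrapper. It uses a regex capture group to pull out the text between the tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the article's code.
class HiddenJsonExtractor {

    // Group 1 is the text between the opening tag and the first closing </span>.
    // DOTALL lets the payload span line breaks if the HTML still contains them.
    private static final Pattern HIDDEN_SPAN =
            Pattern.compile("<span class=\"tt hidden\">(.*?)</span>", Pattern.DOTALL);

    // Returns the raw text of every hidden span, in document order.
    static List<String> extract(String html) {
        List<String> payloads = new ArrayList<>();
        Matcher m = HIDDEN_SPAN.matcher(html);
        while (m.find()) {
            payloads.add(m.group(1));
        }
        return payloads;
    }

    public static void main(String[] args) {
        // Made-up sample markup, not a real Tianyancha response.
        String html = "<div><span class=\"tt hidden\">{\"name\":\"Company A\"}</span>"
                + "<span class=\"tt hidden\">{\"name\":\"Company B\"}</span></div>";
        for (String json : extract(html)) {
            System.out.println(json);
        }
    }
}
```

One caveat of this approach: the lazy match stops at the first `</span>`, so a JSON value that itself contained that text would be cut short. A real HTML parser would be the sturdier choice if that ever shows up.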
Here is the full code:
package com.zsx.crawler.utils.TianYanChaCompanyCrawler;

import org.json.JSONObject;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlerCompanyUtil {

    public static void main(String[] args) {
        String web = WebUtil.getPageContent("https://www.tianyancha.com/search?key=%E6%95%99%E8%82%B2%E7%A7%91%E6%8A%80&base=bj");

        // Print the pagination footer of the result list
        Pattern pageCount = Pattern.compile("<div class=\"result-footer\">.*?</div>");
        Matcher matcher1 = pageCount.matcher(web);
        while (matcher1.find()) {
            System.out.println("Pages: " + matcher1.group());
        }

        // Each hidden span holds one company's JSON; group 1 is the text between the tags
        Pattern companyInfo = Pattern.compile("<span class=\"tt hidden\">(.*?)</span>");
        Matcher matcher = companyInfo.matcher(web);
        while (matcher.find()) {
            JSONObject json = new JSONObject(matcher.group(1));
            // Company name
            String companyName = json.get("name").toString();
            // Legal representative
            String legalperson = json.get("legalPersonName").toString();
            // Registered capital
            String registeredfund = json.get("regCapital").toString();
            // Date of establishment (the key really is spelled "estiblishTime" in the page data)
            String registeredtime = json.get("estiblishTime").toString();
            // Phone list
            String phone = json.get("phoneList").toString();
            // Email list
            String email = json.get("emailList").toString();
            // Registered address
            String address = json.get("regLocation").toString();
            // Other details: business scope plus the field the search matched on
            String qita = "Business scope: " + json.get("businessScope") + "\n" + json.get("matchField");

            System.out.println(companyName);
            System.out.println(legalperson);
            System.out.println(registeredfund);
            System.out.println(registeredtime);
            System.out.println(phone);
            System.out.println(email);
            System.out.println(address);
            System.out.println(qita);
            System.out.println();
        }
    }
}
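The search URL in the code above carries the keyword in percent-encoded UTF-8 form. To search for a different keyword, the URL can be assembled rather than typed by hand. This is a rough sketch only: `SearchUrlBuilder` is a made-up name, and the `key` and `base` parameters are simply copied from the URL above (what `base=bj` means is not stated in this article; presumably it is a region filter).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the article's code.
class SearchUrlBuilder {

    // Builds a search URL in the same shape as the one hard-coded above.
    // Both parameter names are taken from that URL; nothing else is assumed.
    static String build(String keyword, String base) {
        try {
            return "https://www.tianyancha.com/search?key="
                    + URLEncoder.encode(keyword, "UTF-8")
                    + "&base=" + URLEncoder.encode(base, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The keyword from the URL above, written with Unicode escapes
        // so the source file stays pure ASCII.
        String keyword = "\u6559\u80b2\u79d1\u6280";
        System.out.println(build(keyword, "bj"));
    }
}
```

Note that `URLEncoder` does form-style encoding, so a space in the keyword becomes `+` rather than `%20`.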
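Whether to store the data in a database or export it to an Excel sheet is up to you!
However, Tianyancha only lets ordinary users view the first 5 pages of results, so I went on to study the Qixinbao site as well; we will cover Qixinbao in the next post. That post includes detailed code for exporting the data to an Excel sheet and crawls without the page limit. Link: Java爬虫启信宝 (Java crawler for Qixinbao)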
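For the export step, one low-effort option is to write a CSV file, which Excel can open. To be clear, this is a stand-in and not a real .xlsx export (that would need a library such as Apache POI), and the class name `CompanyCsv` and the sample rows are invented for illustration. The sketch quotes every field and doubles embedded quotes, which is the usual CSV convention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the article's code.
class CompanyCsv {

    // Wraps a field in quotes and doubles any quotes inside it,
    // so commas and line breaks inside a value do not break the row.
    static String escape(String field) {
        String value = (field == null) ? "" : field;
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Joins rows into CSV text, one line per row, with CRLF line endings.
    static String toCsv(List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(escape(row[i]));
            }
            sb.append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data; the header names mirror JSON keys used in the crawler.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"name", "legalPersonName", "regCapital"});
        rows.add(new String[] {"Example Co., Ltd.", "Zhang San", "100"});
        System.out.print(toCsv(rows));
    }
}
```

When saving the text to a file, writing it as UTF-8 with a leading byte-order mark (`\uFEFF`) generally helps Excel display Chinese company names correctly.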