I'm trying to scrape a website using Node.js and the request module. So far, I can connect to the target site and retrieve the page's HTML using a query string from the site, but I now realize that the site renders its data client-side, using a script loaded by the page.
Here is the HTML that I'm getting back.
C:\Users\ZHarriott>node test
error: null
statusCode: 200
body: <!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<link rel="canonical" href="https://bethesda.net">
<title>Bethesda.net</title>
<meta name="description" content="The official site for Bethesda, publisher of Fallout, DOOM, Dishonored, Skyrim, Wolfenstein, The Elder Scrolls, more. Your source for news, features & community.">
<meta name="keywords" content="Fallout,Fallout 4,Fallout 3,Fallout New Vegas,DOOM,Dishonored,Dishonored 2,The Elder Scrolls,The Elder Scrolls Online,The Elder Scrolls Online Tamriel Unlimited,Wolfenstein,Wolfenstein The Old Blood,Wolfenstein The New Order,The Evil Within,The Elder Scrolls V Skyrim,Skyrim,Rage,Brink,Wet,The Elder Scrolls IV Oblivion,Oblivion,Bethesda Game Studios,ZeniMax Online Studios,id Software,Arkane Studios,MachineGames,Machine Games,Tango Gameworks,Bethesda Softworks,Todd Howard">
<meta name="author" content="">
<meta property="og:site_name" content="Bethesda.net">
<meta property="twitter:site" content="@bethesda">
<meta property="twitter:title" content="Bethesda.net">
<link href="https://bethesda.net" hreflang="x-default" rel="alternate">
<link href="https://bethesda.net/en/dashboard" hreflang="en" rel="alternate">
<link href="https://bethesda.net/de/dashboard" hreflang="de" rel="alternate">
<link href="https://bethesda.net/es/dashboard" hreflang="es" rel="alternate">
<link href="https://bethesda.net/fr/dashboard" hreflang="fr" rel="alternate">
<link href="https://bethesda.net/it/dashboard" hreflang="it" rel="alternate">
<link href="https://bethesda.net/pl/dashboard" hreflang="pl" rel="alternate">
<meta name="referrer" content="origin">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0">
<link rel="stylesheet" type="text/css" href="/main.css">
<script src="https://cdn02.bethesda.net/contentful@3.8.1/browser-dist/contentful.min.js"></script>
</head>
<body>
<!--[if lte IE 9]>
<p class="browserupgrade"><h3>You are using an <strong>outdated</strong> browser.</h3></p>
<p class="browserupgrade">Many things may be non-functional if you continue, please upgrade your browser to improve your experience.</p>
<![endif]-->
<noscript>
<p class="browserupgrade">Please enable javascript to use this site.</p>
<META HTTP-EQUIV="Refresh" CONTENT="0;URL=nojs.html">
</noscript>
<app></app>
<section id="_bnContent"></section>
<globalfooter></globalfooter>
<script>
// Please do not use www
if (window.location.hostname === 'www.bethesda.net') {
window.location.replace('https://bethesda.net/' + window.location.hash)
}
try {
// This ensures the user is using javascript, this is required for bethesda.net
document.getElementsByTagName('html')[0].classList.remove('no-js')
} catch (e) {
console.log(e)
}
</script>
<script src="/sites/main.js"></script>
</body>
</html>
C:\Users\ZHarriott>
It looks like, because I'm not actually using a web browser, the page's scripts never run and the content elements (app, _bnContent, globalfooter) come back empty. Any thoughts on how to handle this?
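To make that suspicion checkable, here is a minimal sketch that inspects a response body for empty placeholder elements alongside an external script tag. The function names (findEmptyMountPoints, looksClientRendered) and the heuristic itself are my own assumptions for illustration, not part of the request module or the site. The sample string is a trimmed copy of the body shown above, and the tag names are assumed to be plain identifiers with no regex special characters.

```javascript
// Hypothetical helper (illustrative only): returns the tags from tagNames
// that appear in the HTML as empty elements, e.g. <app></app>.
function findEmptyMountPoints(html, tagNames) {
  const empty = [];
  for (const tag of tagNames) {
    // Matches <tag ...></tag> with nothing but whitespace in between.
    const pattern = new RegExp('<' + tag + '\\b[^>]*>\\s*</' + tag + '>', 'i');
    if (pattern.test(html)) {
      empty.push(tag);
    }
  }
  return empty;
}

// Heuristic (an assumption, not a guarantee): the page is probably filled in
// by JavaScript if it has empty placeholders and loads an external script.
function looksClientRendered(html, tagNames) {
  const hasExternalScript = /<script\b[^>]*\bsrc=/i.test(html);
  return hasExternalScript && findEmptyMountPoints(html, tagNames).length > 0;
}

// Trimmed copy of the body returned by the request above.
const sample = [
  '<body>',
  '<app></app>',
  '<section id="_bnContent"></section>',
  '<globalfooter></globalfooter>',
  '<script src="/sites/main.js"></script>',
  '</body>'
].join('\n');

const placeholders = ['app', 'section', 'globalfooter'];
console.log(findEmptyMountPoints(sample, placeholders));
console.log(looksClientRendered(sample, placeholders));
```

If the check comes back true, the data is most likely inserted by /sites/main.js after the page loads, which would explain why it is missing from the body that request returns.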
via Zach Harriott