The Ultimate Guide to the Invisible Web
“Considering search engines reveal only a fraction of overall search results, perhaps the Invisible Web, or Deep Web, could hold the real information you seek. But what is it, and how do we get to it? The staff at OEDb provide us with the know-how to do just that in the following article.”
Search engines are, in a sense, the heartbeat of the internet; “googling” has become a part of everyday speech and is even recognized by Merriam-Webster as a grammatically correct verb. It’s a common misconception, however, that googling a search term will reveal every site out there that addresses your search. In fact, typical search engines like Google, Yahoo, or Bing actually access only a tiny fraction – estimated at 0.03% – of the internet. The sites that traditional searches yield are part of what’s known as the Surface Web, which comprises the indexed pages that a search engine’s web crawlers are programmed to retrieve.
So where’s the rest? The vast majority of the Internet lies in the Deep Web, sometimes referred to as the Invisible Web. The actual size of the Deep Web is impossible to measure, but many experts estimate it is about 500 times the size of the web as we know it.
Deep Web pages operate just like any other site online, but they are constructed so that their existence is invisible to Web crawlers. While recent news, such as the bust of the infamous Silk Road drug-dealing site and Edward Snowden’s NSA shenanigans, has spotlighted the Deep Web’s existence, it’s still largely misunderstood.
Search Engines and the Surface Web
Understanding how surface Web pages are indexed by search engines can help you understand what the Deep Web is all about. In the early days, computing power and storage space were at such a premium that search engines indexed a minimal number of pages, often storing only partial content. The methodology behind searching reflected users’ intentions; early Internet users generally sought research, so the first search engines indexed simple queries that students or other researchers were likely to make. Search results consisted of actual content that a search engine had stored.
Over time, advancing technology made it profitable for search engines to do a more thorough job of indexing site content. Today’s Web crawlers, or spiders, use sophisticated algorithms to collect page data from hyperlinked pages. These robots maneuver their way through all linked data on the Internet, earning their spidery nickname. Every surface site is indexed by metadata that crawlers collect. This metadata, consisting of elements such as page title, page location (URL) and repeated keywords used in text, takes up much less space than actual page content. Instead of the cached content dump of old, today’s search engines speedily and efficiently direct users to websites that are relevant to their queries.
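To make the indexing step above concrete, here is a minimal sketch, in Python with only the standard library, of the kind of metadata a crawler might record for a single page: the title, the outgoing links it would follow next, and repeated keywords. The sample HTML and its URLs are invented for illustration; real crawlers are vastly more sophisticated.

```python
from html.parser import HTMLParser
from collections import Counter

class PageIndexer(HTMLParser):
    """Toy model of crawler metadata: title, outgoing links, word counts."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []          # hyperlinks a spider would follow next
        self.words = Counter()   # repeated keywords in visible text
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.words.update(w.lower() for w in data.split() if w.isalpha())

# A made-up page standing in for anything on the Surface Web.
html = """<html><head><title>Deep Web Primer</title></head>
<body><p>Crawlers follow links, and links lead crawlers to pages.</p>
<a href="/surface">a linked surface page</a></body></html>"""

indexer = PageIndexer()
indexer.feed(html)
print(indexer.title)                  # Deep Web Primer
print(indexer.links)                  # ['/surface']
print(indexer.words.most_common(1))   # [('crawlers', 2)]
```

Note how the stored metadata (a title, a link list, a few keywords) is far smaller than the page itself, which is exactly why modern engines index this way instead of caching full content.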
To get a sense of how search engines have improved over time, Google’s interactive breakdown “How Search Works” details all the factors at play in every Google search. In a similar vein, Moz.com’s timeline of Google’s search engine algorithm will give you an idea of how nonstop the efforts have been to refine searches. How these efforts impact the Deep Web is not exactly clear. But it’s reasonable to assume that if major search engines keep improving, ordinary web users will be less likely to seek out arcane Deep Web searches.
How is the Deep Web Invisible to Search Engines?
Search engines like Google are extremely powerful and effective at distilling up-to-the-moment Web content. What they lack, however, is the ability to index the vast amount of data that isn’t hyperlinked and therefore immediately accessible to a Web crawler. This may or may not be intentional; for example, content behind a paywall or a blog post that’s written but not yet published both technically reside in the Deep Web.
Some examples of other Deep Web content include:
- Data that needs to be accessed by a search interface
- Results of database queries
- Subscription-only information and other password-protected data
- Pages that are not linked to by any other page
- Technically limited content, such as that requiring CAPTCHA technology
- Text content that exists outside of conventional http:// or https:// protocols
While the scale and diversity of the Deep Web are staggering, its notoriety – and appeal – comes from the fact that users are anonymous on the Deep Web, and so are their Deep Web activities. Because of this, it’s been an important tool for governments; the U.S. Naval Research Laboratory first launched intelligence tools for Deep Web use in 2003.
Just as Deep Web content can’t be traced by Web crawlers, it can’t be accessed by conventional means. The same Naval research group that developed intelligence-gathering tools created The Onion Router Project, now known by its acronym TOR. Onion routing refers to the process of removing encryption layers from Internet communications, similar to peeling back the layers of an onion. TOR users’ identities and network activities are concealed by this software. TOR, and other software like it, offers an anonymous connection to the Deep Web. It is, in effect, your Deep Web search engine.
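The onion-peeling idea above can be sketched in a few lines of Python. This is a toy illustration only: the XOR “cipher” stands in for the real public-key cryptography Tor uses, and the relay names are invented. The point is the layering, where each relay removes exactly one layer and only the last one sees the message.

```python
# Toy sketch of onion routing (NOT real cryptography).
# The sender wraps the message in one encryption layer per relay;
# each relay peels exactly one layer off as the message travels.

def xor_layer(data: bytes, key: bytes) -> bytes:
    """Symmetric XOR 'cipher' standing in for real per-relay encryption."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

relay_keys = [b"entry-key", b"middle-key", b"exit-key"]  # hypothetical relays

# Sender wraps innermost-first, so the entry relay's layer ends up outermost.
message = b"hello, hidden service"
onion = message
for key in reversed(relay_keys):
    onion = xor_layer(onion, key)

# Each relay peels its own layer in turn; the plaintext only appears
# after the final (exit) layer is removed.
for key in relay_keys:
    onion = xor_layer(onion, key)

print(onion)  # b'hello, hidden service'
```

Because no single relay holds all the keys, no single relay can connect the sender to the final message, which is where the anonymity comes from.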
But in spite of its back-alley reputation, there are plenty of legitimate reasons to use TOR. For one, TOR lets users avoid “traffic analysis,” the monitoring that commercial sites use to determine web users’ location and the network they are connecting through. These businesses can then use this information to adjust pricing, or even what products and services they make available.
According to the Tor Project site, the program also allows people to “[…] set up a website where people publish material without worrying about censorship.” While this is by no means a clearly good or bad thing, the tension between censorship and free speech is felt the world over. The Deep Web furthers that debate by demonstrating what people can and will do to overcome political and social censorship.
Reasons a Page is Invisible
When an ordinary search engine query comes back with no results, that doesn’t necessarily mean there is nothing to be found. An “invisible” page isn’t necessarily inaccessible; it’s simply not indexed by a search engine. There are several reasons why a page may be invisible. Keep in mind that some pages are only temporarily invisible, possibly slated to be indexed at a later date.
- Search engines have traditionally ignored Web pages whose URLs contain long strings of parameters, equal signs, and question marks, on the chance that the pages duplicate content already in their database – or, worse, that the spider will end up crawling in circles. A number of workarounds have been developed to help you access this content, sometimes known as the “Shallow Web.”
- Form-controlled entry that’s not password-protected. In this case, page content only gets displayed when a human applies a set of actions, mostly entering data into a form (specific query information, such as job criteria for a job search engine). This typically includes databases that generate pages on demand. Applicable content includes travel industry data (flight info, hotel availability), job listings, product databases, patents, publicly-accessible government information, dictionary definitions, laws, stock market data, phone books and professional directories.
- Passworded access, subscription or non-subscription. This includes VPNs (virtual private networks) and any website where pages require a username and password. Access may or may not be by paid subscription. Applicable content includes academic and corporate databases, newspaper or journal content, and academic library subscriptions.
- Timed access. On some major news sites, such as the New York Times, free content becomes inaccessible after a certain number of pageviews. Search engines retain the URL, but the page generates a sign-up form, and the content is moved to a new URL that requires a password.
- Robots exclusion. The robots.txt file, which usually lives in the main directory of a site, tells search robots which files and directories should not be indexed. Hence its name “robots exclusion file.” If this file is set up, it will block certain pages from being indexed, which will then be invisible to searchers. Blog platforms commonly offer this feature.
- Hidden pages. There is simply no sequence of hyperlink clicks that could take you to such a page. The pages are accessible, but only to people who know of their existence.
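The robots-exclusion case above is easy to demonstrate: Python’s standard library can read a robots.txt file and report which URLs a compliant crawler would skip. The robots.txt contents and example.com URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that hides a /private/ directory from all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved spider checks before fetching; disallowed pages stay invisible.
print(parser.can_fetch("*", "https://example.com/public/page.html"))    # True
print(parser.can_fetch("*", "https://example.com/private/report.pdf"))  # False
```

Note that robots.txt is a request, not an access control: the page is still reachable by anyone who knows the URL, which is exactly why it lands in the invisible-but-accessible category.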
Ways to Make Content More Visible
We have discussed what type of content is invisible and where we might find such information. Alternatively, the idea of making content more visible spawned the Search Engine Optimization (SEO) industry. Some ways to improve your search optimization include:
- Categorize your database. If you have a database of products, you could publish select information to static category and overview pages, thereby making content available without form-based or query-generated access. This works best for information that does not become outdated, like job postings.
- Build links within your website, interlinking between your own pages. Each hyperlink will be indexed by spiders, making your site more visible.
- Publish a sitemap. It is crucial to publish a serially linked, current sitemap to your site. It’s no longer considered a best practice to publicize it to your viewers, but publish it and keep it up to date so that spiders can make the best assessment of your site’s content.
- Write about it elsewhere. One of the easiest forms of Search Engine Optimization (SEO) is to find ways to publish links to your site on other webpages. This will help make it more visible.
- Use social media to promote your site. Link to your site on Twitter, Instagram, Facebook or any other social media platform that suits you. You’ll drive traffic to your site and increase the number of links on the Internet.
- Remove access restrictions. Avoid login or time-limit requirements unless you are soliciting subscriptions.
- Write clean code. Even if you use a pre-packaged website template without customizing the code, validate your site’s code so that spiders can navigate it easily.
- Match your site’s page titles and link names to other text within the site, and pay attention to keywords that are relevant to your content.
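As a sketch of the sitemap advice above, the following Python snippet builds a minimal sitemap.xml with the standard library. The example.com URLs are placeholders; each loc entry tells crawlers about a page even if nothing else links to it.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical pages you want spiders to discover, linked or not.
pages = ["https://example.com/", "https://example.com/products/widgets"]

urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
for page in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page

# Serialize with an XML declaration, ready to save as sitemap.xml
# in the site root where crawlers expect to find it.
xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
print(xml_bytes.decode())
```

Keeping this file regenerated whenever pages are added or removed is what makes it useful; a stale sitemap misleads spiders just as badly as none at all.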
How to Access and Search for Invisible Content
If a site is inaccessible by conventional means, there are still ways to access the content, if not the actual pages. Aside from software like TOR, there are a number of entities who do make it possible to view Deep Web content, like universities and research facilities. For invisible content that cannot or should not be visible, there are still a number of ways to get access:
- Join a professional or research association that provides access to records, research and peer-reviewed journals.
- Access a virtual private network via an employer.
- Request access; this could be as simple as a free registration.
- Pay for a subscription.
- Use a suitable resource. Use an invisible Web directory, portal or specialized search engine such as Google Book Search, Librarian’s Internet Index, or BrightPlanet’s Complete Planet.
Invisible Web Search Tools
Here is a small sampling of invisible web search tools (directories, portals, engines) to help you find invisible content. To see more like these, please look at our Research Beyond Google article.
- A List of Deep Web Search Engines – Purdue Owl’s Resources to Search the Invisible Web
- Art – Musée du Louvre
- Books Online – The Online Books Page
- Economic and Job Data – FreeLunch.com
- Finance and Investing – Bankrate.com
- General Research – GPO’s Catalog of US Government Publications
- Government Data – Copyright Records (LOCIS)
- International – International Data Base (IDB)
- Law and Politics – THOMAS (Library of Congress)
- Library of Congress – Library of Congress
- Medical and Health – PubMed
- Transportation – FAA Flight Delay Information
10 Search Engines to Explore the Invisible Web:
No, it’s not Spiderman’s latest web-slinging tool but something that’s more real world. Like the World Wide Web.
The Invisible Web refers to the part of the WWW that’s not indexed by the search engines. Most of us think that search powerhouses like Google and Bing are like the Great Oracle… they see everything. Unfortunately, they can’t, because they aren’t divine at all; they are just web spiders who index pages by following one hyperlink after the other.
But there are some places where a spider cannot enter. Take library databases which need a password for access. Or even pages that belong to private networks of organizations. Dynamically generated web pages in response to a query are often left un-indexed by search engine spiders.
Search engine technology has progressed by leaps and bounds. Today, we have real-time search and the capability to index Flash-based and PDF content. Even then, there remain large swathes of the web which a general search engine cannot penetrate. The term Deep Net, Deep Web, or Invisible Web lingers on.
To get a more precise idea of the nature of this ‘Dark Continent’ of the invisible web and its search engines, read what Wikipedia has to say about the Deep Web. The figures are attention grabbers – the size of the open web is 167 terabytes. The Invisible Web is estimated at 91,000 terabytes. Check this out – the Library of Congress, in 1997, was figured to have close to 3,000 terabytes!
How do we get to this mother lode of information?
That’s what this post is all about. Let’s get to know a few resources which will be our deep diving vessel for the Invisible Web. Some of these are invisible web search engines with specifically indexed information.
Infomine has been built by a pool of libraries in the United States. Some of them are the University of California, Wake Forest University, California State University, and the University of Detroit. Infomine ‘mines’ information from databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other resources.
You can search by subject category and further tweak your search using the search options. Infomine is not only a standalone search engine for the Deep Web but also a staging point for a lot of other reference information. Check out its Other Search Tools and General Reference links at the bottom.
This is considered to be the oldest catalog on the web and was started by Tim Berners-Lee, the creator of the web. So, isn’t it strange that it finds a place in the list of Invisible Web resources? Maybe, but the WWW Virtual Library lists quite a lot of relevant resources on quite a lot of subjects. You can go vertically into the categories or use the search bar. The screenshot shows the alphabetical arrangement of subjects covered at the site.
Intute is UK-centric, but it has some of the most esteemed universities of the region providing the resources for study and research. You can browse by subject or do a keyword search for academic topics, from agriculture to veterinary medicine. The online service has subject specialists who review and index other websites that cater to the topics for study and research.
Intute also provides over 60 free online tutorials for learning effective internet research skills. The tutorials are step-by-step guides arranged around specific subjects.
Complete Planet calls itself the ‘front door to the Deep Web’. This free and well-designed directory resource makes it easy to access the mass of dynamic databases that are cloaked from a general-purpose search. The databases indexed by Complete Planet number around 70,000 and range from Agriculture to Weather. Also thrown in are databases like Food & Drink and Military.
For a really effective Deep Web search, try out the Advanced Search options where among other things, you can set a date range.
Infoplease is an information portal with a host of features. Using the site, you can tap into a good number of encyclopedias, almanacs, an atlas, and biographies. Infoplease also has a few nice offshoots like Factmonster.com for kids and Biosearch, a search engine just for biographies.
DeepPeep aims to enter the Invisible Web through forms that query databases and web services for information. Typed queries open up dynamic but short lived results which cannot be indexed by normal search engines. By indexing databases, DeepPeep hopes to track 45,000 forms across 7 domains.
The domains covered by DeepPeep (Beta) are Auto, Airfare, Biology, Book, Hotel, Job, and Rental. Being a beta service, there are occasional glitches as some results don’t load in the browser.
IncyWincy is an Invisible Web search engine and it behaves as a meta-search engine by tapping into other search engines and filtering the results. It searches the web, directory, forms, and images. With a free registration, you can track search results with alerts.
DeepWebTech gives you five search engines (and browser plugins) for specific topics. The search engines cover science, medicine, and business. Using these topic specific search engines, you can query the underlying databases in the Deep Web.
Scirus has a pure scientific focus. It is a far reaching research engine that can scour journals, scientists’ homepages, courseware, pre-print server material, patents and institutional intranets.
TechXtra concentrates on engineering, mathematics and computing. It gives you industry news, job announcements, technical reports, technical data, full text eprints, teaching and learning resources along with articles and relevant website information.
Just like general web search, searching the Invisible Web is also about looking for the needle in the haystack. Only here, the haystack is much bigger. The Invisible Web is definitely not for the casual searcher. It is deep, but not dark, because if you know what you are searching for, enlightenment is a few keywords away.
Do you venture into the Invisible Web? Which is your preferred search tool?
Image credit: MarcelGermain
Resources to Search the Invisible Web:
The invisible web includes many types of online resources that normally cannot be found using regular search engines. The listings below can help you access these resources:
- Alexa: A website that archives older websites that are no longer available on the Internet. For example, Alexa has about 87 million websites from the 2000 election that are for the most part no longer available on the Internet.
- Complete Planet: Provides an extensive listing of databases that cannot be searched by conventional search engine technology. It provides access to lists of databases which you can then search individually.
- The Directory of Open Access Journals: Another full-text journal searchable database.
- FindArticles: Indexes over 10 million articles from a variety of different publications.
- Find Law: A comprehensive site that provides information on legal issues organized by category.
- HighWire: Brought to you by Stanford University, HighWire press provides access to one of the largest databases of free, full-text, scholarly content.
- Infomine: A research database created by librarians for use at the university level. It includes both a browsable catalogue and searching capabilities.
- MagPortal: A search engine that will allow you to search for free online magazine articles on a wide range of topics.