Editor’s note: this article is the second part of The Fix’s series on advanced search for journalists. Make sure you’re subscribed to our weekly newsletter so you don’t miss the next instalments.

Advanced search is fascinating because it combines technical skills (you need to know how the tools work and how to use them) with an old-fashioned nose for news.

Sometimes you simply do not know what you are looking for. Sometimes you need to trust serendipity. Sometimes you know that you need patience, method, and time. 

In the first part of this guide, we looked at some ideas behind advanced search, starting with the possibility of mapping the most frequently typed queries on any topic. In this instalment, let’s look at tools for organising your searches, at mastering advanced search operators, and more.

Stay Keen

There is a tool that I started using to build a repository of articles and sources I need.

It is called Keen. It was developed by Area 120 (Google’s in-house incubator), so don’t take it for granted: it could be shut down at some point, so use it while it’s available.

Keen is based on Google Search. 

When you start to create a new “keen”, you can insert whatever query you want.
Let’s try with the “Just Stop Oil” movement. 

You receive some suggestions for related queries or you can add your own – maybe queries that you’ve discovered mapping with the autocomplete suggestions (the technique covered in the first part) – and finally create a new keen.

At this point, the tool starts providing results related to your queries, just like any search engine. But you can manage those results as if you were working on a personal repository.

Moreover, you can hide irrelevant results and explain why you consider them irrelevant: the algorithm should learn from your choices.

In the image, for example, I’m hiding an article about Essential Oils, which is clearly not related to “Just Stop Oil”.

You can also save the so-called “Gems” (very relevant content) and refine your dashboard by adding new searches at any time.

Finally, you can choose to receive an email alert twice a week if there is anything new, share the collection you are creating, and take notes.

Keen is a simple yet powerful tool that helps you stay up to date when you need to follow a particular topic but don’t know which specific sources to follow.

Moreover, it’s a good way to practise all the concepts related to search before going deeper. Learning advanced search never ends, so it’s always useful to start with something very simple and then try something more difficult.

Search for any object reachable with a URL

Now, let’s say you want to practise more with advanced search operators. 

Using Google, you can search inside any repository reachable with a URL.

site:

is the operator you need, sometimes combined with other operators. The other thing you need to know is the domain. 

You can search among 

  • Telegram channels: site:t.me/ (public content only)
  • Facebook groups: site:facebook.com/groups (again, public content only, but you can see it even if you are not logged in to Facebook)
  • TikTok videos: site:tiktok.com inurl:video (in this last case, I combined site: with another operator, which looks for specific words in the URL)
  • Instagram Explore pages: site:instagram.com/explore/

If you know someone’s handle, you can look for it on a platform (@albertopi78 site:instagram.com, which is my Instagram handle!). Or you can search for a name or a nickname across different platforms: “alberto puliafito” site:twitter.com | site:linkedin.com | site:facebook.com | site:instagram.com | site:youtube.com is a string that will give you a lot of information about my digital presence, if you want to investigate it for any reason.
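If you need to run the same cross-platform check for several names, the string can be assembled programmatically. The following is a minimal Python sketch, not a feature of any search tool: the function name cross_platform_query and the PLATFORMS list are illustrative assumptions, and the code only builds the text you would paste into the search bar.

```python
def cross_platform_query(name, domains):
    """Quote the name and append one site: filter per domain, joined with |."""
    if not domains:
        raise ValueError("at least one domain is required")
    site_filters = " | ".join(f"site:{domain}" for domain in domains)
    return f'"{name}" {site_filters}'


# Hypothetical list of platforms to check; edit it to suit your investigation.
PLATFORMS = [
    "twitter.com",
    "linkedin.com",
    "facebook.com",
    "instagram.com",
    "youtube.com",
]

query = cross_platform_query("alberto puliafito", PLATFORMS)
print(query)
```

The sketch uses straight quotation marks around the name, which is the form search engines expect for an exact-phrase match.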

If you know the structure of a set of URLs, you can use that information to search within that set. In general, this requires two consecutive searches. Let’s see an example.

site:gov.uk filetype:pdf “Just Stop Oil”

returns a results page with PDF documents that are stored on official UK government websites and contain the words “Just Stop Oil”. Analysing the results, we find a PDF with this complex URL:

https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1100785/National_Policing_Board_minutes_27_June_2022.pdf

So, reading the URL and making some educated guesses, we can imagine that some files are stored under a folder called “attachment_data”. To see what kind of files, we can try another search that drops traditional keywords and uses only advanced operators. For example:

site:gov.uk inurl:attachment_data
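Reading a URL for folder names can also be done mechanically. The sketch below, in Python, splits the URL found above into its path segments and turns each one into a candidate inurl: query. The helper names path_segments and inurl_queries are illustrative assumptions, and treating the last segment as the file name is a simplification that will not hold for every URL.

```python
from urllib.parse import urlparse


def path_segments(url):
    """Return the unique folder names in a URL path, in order, without the file name."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    folders = parts[:-1]  # assumption: the last segment is the file itself
    return list(dict.fromkeys(folders))  # drop repeats, keep first-seen order


def inurl_queries(site, url):
    """Build one 'site:... inurl:...' query per folder name found in the URL."""
    return [f"site:{site} inurl:{segment}" for segment in path_segments(url)]


pdf_url = (
    "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/"
    "attachment_data/file/1100785/National_Policing_Board_minutes_27_June_2022.pdf"
)

for candidate in inurl_queries("gov.uk", pdf_url):
    print(candidate)
```

Most of the candidates will be too generic to be useful; the point is to list them all so that you can pick the ones worth trying, such as attachment_data.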

We can see that there are thousands of documents here. And again, we can refine our search. For example, let’s say that we want to remove Covid-19 as a topic.

site:gov.uk inurl:attachment_data -covid-19

And then we want to look for climate-related documents.

site:gov.uk inurl:attachment_data -covid-19 climate
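The step-by-step refinement above (start broad, exclude a topic, then add a keyword) can be sketched as a small query builder in Python. The function build_query and its parameter names are illustrative assumptions; the code only composes the search strings and does not contact any search engine.

```python
def build_query(site=None, filetype=None, inurl=None, exclude=(), keywords=()):
    """Join search operators, excluded terms and keywords into one query string."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    parts.extend(f"-{term}" for term in exclude)  # a leading minus excludes a term
    parts.extend(keywords)
    return " ".join(parts)


# The three searches from the text, each one narrower than the last.
broad = build_query(site="gov.uk", inurl="attachment_data")
no_covid = build_query(site="gov.uk", inurl="attachment_data", exclude=["covid-19"])
climate = build_query(
    site="gov.uk",
    inurl="attachment_data",
    exclude=["covid-19"],
    keywords=["climate"],
)

for query in (broad, no_covid, climate):
    print(query)
```

Keeping each refinement as a separate call makes it easy to note which version of the query produced which results.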


Once you start with this kind of search, you will see that the possibilities are endless. You’ll also see that some websites are more search-friendly than others.

You can also search the so-called dark web (please be careful doing this). One possible way to do this kind of search is this string: site:*.onion.* “green pass”

Knowing the structure of these URLs and using the wildcard operator (*), I’m asking Google to search among pages whose address contains .onion. and that contain the specific words “green pass”.

Note that the results you obtain are not necessarily safe.

So, be careful with the results: if you are browsing on your personal device, without a VPN and without a browser like Tor, you can still stay safer by clicking on the three dots next to a result and choosing the cached version: that way, you remain within the Google domain.

Dorking for journalism

The next step in deepening your knowledge of advanced search is Google Dorking (or Google Hacking). Here is a quick (old and famous) example to start with.

If you type the string inurl:“ViewerFrame?Mode=” into your search bar and press enter, you will find a list of public webcams.
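Strings like this one contain characters (quotation marks, ? and =) that have a special meaning in web addresses, so they must be encoded if you want to save or share the search as a link. A minimal Python sketch, assuming the familiar google.com/search?q= address format; the helper name search_url is an illustrative assumption.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def search_url(query):
    """Return a Google search link with the query safely percent-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query})


dork = 'inurl:"ViewerFrame?Mode="'
link = search_url(dork)
print(link)

# Decoding the link again should give back exactly the string we started with.
decoded = parse_qs(urlparse(link).query)["q"][0]
print(decoded == dork)
```

Saving the encoded link next to your notes lets you re-run exactly the same search later.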

Many of these strings are collected in online repositories and exchanged among communities with different interests (journalism, security, hacking, or all of these together).

The Google Hacking Database (GHDB) is one of these repositories. It defines itself as “a categorized index of Internet search engine queries designed to uncover interesting, and usually sensitive, information made publicly available on the Internet”.

“In most cases”, continues the official description, “this information was never meant to be made public, but due to many factors this information was linked in a web document that was crawled by a search engine that subsequently followed that link and indexed the sensitive information”.

In simple terms, sometimes the owners of a website don’t want some content or tools to go public, but they don’t adequately protect them. So, Google and other search engines are free to crawl and index that content. And you are free to find it: you are not violating the rules, you are not committing a crime. You’re using advanced search to see everything that the site makes available.

As you can imagine, we are now moving into a grey area: Google dorking techniques exploit, in some way, people’s lack of technical knowledge and websites’ weaknesses and vulnerabilities in order to extract information.

We have to set some ethical boundaries here. I suggest they depend on the topic or the people you are investigating and on how you use the information you obtain. As a journalist, you are not supposed to use that information to hurt someone: you are supposed to search for and use information to verify and find the truth about topics relevant to citizens, in order to help them make better decisions.

