

My dissertation project for my postgraduate MSc Information Technology degree focused on the Automatic Extraction Of Information From HTML Documents For Semantic Technologies.

Here is a PDF copy 

The cutting edge of research at the time was pushing the Web to become more than just linked pages. Imagine Amazon is selling a copy of Romeo and Juliet on DVD. That page shows you the DVD, who is in it, and that it was written by William Shakespeare. You can read that, but you will have to do some clicking to find out more about Mr Shakespeare or the lead actors, Leonardo DiCaprio and Claire Danes. You may find other movies that Leonardo DiCaprio has been in. However, work had already begun on representing the data on that page as a linked network of information.

Linked Data would make it easier for machines, i.e. Semantic Agents, to read the information and help you make decisions. It is more than just Amazon suggesting other items that may be of interest to you. Your Semantic Agent could suggest something from a totally different website altogether, quickly and easily.

My project wasn't about buying DVDs from Amazon.  It was about finding researchers who were already working in a similar field in different institutions and linking their contact details and research topics.  Work was being done on how the Linked Data could be represented.  Work was being done on how to extract data from documents.  My project attempted to bring this research together.


The World Wide Web (Web) is a global collection of documents containing information presented in a format that can only be readily understood by humans. The problem is to extract this information and make it useful to machines by presenting it in a machine-readable format. This project presents ‘FOAFextra’, an application designed to extract information from HTML documents, transcode the information into XML syntax, and produce adapted files containing the information. This project discusses the Friend-Of-A-Friend (FOAF) vocabulary model, in conjunction with other shared RDF ontologies, to represent knowledge regarding personal information.
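To make the extract-then-transcode idea concrete, here is a minimal sketch in Python. It is not FOAFextra's code: the sample page, the hand-written patterns, and the function names `extract_person` and `to_foaf_xml` are all assumptions invented for illustration. The only parts taken from the real vocabularies are the FOAF terms `foaf:Person`, `foaf:name`, and `foaf:mbox` and the RDF attribute `rdf:resource`.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"

# Hypothetical staff page fragment, invented for this example.
SAMPLE_HTML = """
<html><body>
<h1>Dr Jane Example</h1>
<p>Email: <a href="mailto:j.example@example.ac.uk">j.example@example.ac.uk</a></p>
</body></html>
"""


def extract_person(html):
    """Pull a name and mailbox out of the page using fixed patterns.

    A real extractor would learn its rules from examples; these two
    regular expressions are hard-coded assumptions about the page layout.
    """
    name = re.search(r"<h1>(.*?)</h1>", html, re.S)
    mbox = re.search(r'href="(mailto:[^"]+)"', html)
    return {
        "name": name.group(1).strip() if name else None,
        "mbox": mbox.group(1) if mbox else None,
    }


def to_foaf_xml(person):
    """Transcode the extracted fields into an RDF/XML string using FOAF."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("foaf", FOAF_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    node = ET.SubElement(root, f"{{{FOAF_NS}}}Person")
    if person["name"]:
        ET.SubElement(node, f"{{{FOAF_NS}}}name").text = person["name"]
    if person["mbox"]:
        # foaf:mbox points at a resource (the mailto: URI), not a literal.
        ET.SubElement(
            node, f"{{{FOAF_NS}}}mbox", {f"{{{RDF_NS}}}resource": person["mbox"]}
        )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_foaf_xml(extract_person(SAMPLE_HTML)))
```

Running the script prints a small RDF/XML document describing one person. Fixed patterns like these break as soon as a page is laid out differently, which is the reason for inducing extraction rules from examples instead.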

The project scope is limited to the personal information of academic staff at UK-based academic institutions. An object-oriented approach has been used to provide greater portability and maintainability of the application so that in the future it would be easier to widen the scope to include other geographical domains, topic disciplines, foreign languages, etc.

FOAFextra uses a supervised top-down covering algorithm for inducing extraction rules and comprises a desktop application structured as a three-tiered architecture for document acquisition, information extraction, and information transcoding. This approach highlights the complexities of Information Extraction and reinforces the need for future research.
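For readers unfamiliar with the term, a supervised top-down covering algorithm can be sketched as follows. This is a generic textbook-style version in Python, not the rule learner used in FOAFextra: the token features (`prev`, `has_at`), the tiny training set, and the scoring by precision then coverage are all assumptions chosen to keep the example short. Each rule starts out fully general and is specialised by adding conditions until it covers no negative examples; the positives it covers are then removed and the process repeats.

```python
def covers(rule, example):
    """A rule is a dict of feature -> required value; {} covers everything."""
    return all(example.get(f) == v for f, v in rule.items())


def learn_one_rule(positives, negatives):
    """Top-down step: start from the most general rule and specialise it."""
    rule = {}
    pos, neg = list(positives), list(negatives)
    while neg:
        # Candidate conditions come from the positives still covered.
        candidates = {
            (f, v) for ex in pos for f, v in ex.items() if f not in rule
        }
        best, best_score = None, None
        for f, v in sorted(candidates, key=repr):
            p = sum(1 for ex in pos if ex.get(f) == v)
            n = sum(1 for ex in neg if ex.get(f) == v)
            score = (p / (p + n), p)  # precision first, then coverage
            if best_score is None or score > best_score:
                best, best_score = (f, v), score
        if best is None:
            break  # nothing left to add; rule may still cover negatives
        rule[best[0]] = best[1]
        pos = [ex for ex in pos if covers(rule, ex)]
        neg = [ex for ex in neg if covers(rule, ex)]
    return rule


def learn_rules(positives, negatives):
    """Covering loop: learn a rule, remove the positives it explains, repeat."""
    rules, remaining = [], list(positives)
    while remaining:
        rule = learn_one_rule(remaining, negatives)
        if any(covers(rule, ex) for ex in negatives):
            break  # no consistent rule found for what is left
        rules.append(rule)
        remaining = [ex for ex in remaining if not covers(rule, ex)]
    return rules


# Hypothetical labelled tokens: is this token an email address?
POSITIVES = [
    {"prev": "Email:", "has_at": True},
    {"prev": "Contact:", "has_at": True},
]
NEGATIVES = [
    {"prev": "Email:", "has_at": False},
    {"prev": "Tel:", "has_at": False},
    {"prev": "Dr", "has_at": False},
]

if __name__ == "__main__":
    print(learn_rules(POSITIVES, NEGATIVES))
```

On this toy data a single condition on `has_at` is enough to separate the two classes. Real pages need far richer features and many more examples, which is part of the complexity the abstract refers to.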


Semantic Web, Information Extraction, Transcoder, Adapter, FOAFextra





