
21 June 2012

The power of tables on the Web

- author: tsissput

Hello everyone,
Below I present my final summary of three articles on a very interesting technology: Web Tables. It is an attempt to infer relational data by aggregating huge amounts of data retrieved from the web. On a daily basis I work on entirely different topics (SAP Defect Management / Test Management), so these articles were an interesting 'excursion' into the world of relational data. I would be glad to receive conclusions and comments at: agnieszka.krzeminska@ingenieur.de.
Best regards,
A. Krzeminska, Frankfurt am Main

1. Motivation
This assignment is prepared on the basis of three articles provided by Dr hab. Mikolaj Morzy. I tried to find out how the outcomes of web-table research could be used for searching web-based repositories at global IT organizations (such as Logica or Deutsche Bank AG).
2. First paper: ‘Entity Relation Discovery from Web Tables and Links’
The World-Wide Web contains a huge number of poorly structured texts as well as a vast amount of structured data, e.g. web tables. Such tables are a type of structured information which is pervasive on the web; therefore, Web-scale methods that automatically extract web tables are studied intensively.
In the database vernacular, a table is defined as a set of tuples which have the same attributes. By analogy, a web table is defined as a set of rows (corresponding to database tuples) which have the same column headers (corresponding to database attributes). Thus, extracting a web table means extracting a relation from the web. In databases, tables often contain foreign keys referencing other tables. Analogously, hyperlinks inside a web table sometimes function as foreign keys to other relations whose tuples are contained in the hyperlinks' target pages.
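The table-to-relation analogy above can be sketched in a few lines of code: header cells become attribute names and rows become tuples. This is a minimal illustration using Python's standard-library HTML parser with an invented example table, not the extraction pipeline from the paper (a real extractor must cope with far messier HTML).

```python
# Minimal sketch: map an HTML table to a relation (attributes + tuples).
# Assumes one well-formed header row of <th> cells; illustrative only.
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects <th> cells as attributes and <td> rows as tuples."""
    def __init__(self):
        super().__init__()
        self.headers, self.rows = [], []
        self._row, self._cell = [], None

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._cell = tag
        elif tag == "tr":
            self._row = []

    def handle_data(self, data):
        if self._cell and data.strip():
            target = self.headers if self._cell == "th" else self._row
            target.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(tuple(self._row))

page = """
<table>
  <tr><th>name</th><th>title</th><th>office</th></tr>
  <tr><td>A. Smith</td><td>Professor</td><td>2104</td></tr>
  <tr><td>B. Jones</td><td>Staff</td><td>1007</td></tr>
</table>
"""

p = TableExtractor()
p.feed(page)
print(p.headers)  # ['name', 'title', 'office']
print(p.rows)     # [('A. Smith', 'Professor', '2104'), ('B. Jones', 'Staff', '1007')]
```

The `name` column of such a table would typically carry the hyperlinks that the paper treats as foreign keys to each person's homepage.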
The key question pointed out in this article is:
• Is it possible to discover new attributes for web tables by exploring hyperlinks inside them?
The proposed solution takes a web table as input. Frequent patterns are generated as new candidate relations by following hyperlinks in the web table. The confidence of the candidates is evaluated, and trustworthy candidates are selected to become new attributes of the table. Experiments performed on a variety of web domains confirm the usefulness of this method.
The Web has a structural and relational character. The article illustrates this with the example of a page where the employees of an academic department are listed. The table has four columns, each with a domain-specific label and type, and the 'name' column contains a group of hyperlinks pointing to the homepages of the listed persons. According to Cafarella, such a web table has the character of a small relational database, even though it lacks the explicit meta-data traditionally associated with a database.
The professors listed in the table have links to their homepages, and these homepages contain information regarding 'teaching', 'publications', etc., but in slightly different forms and with different descriptions. If pieces of common information can be found in these professors' homepages, the web table can be expanded so that each piece of common information becomes a new attribute. This common information consists of the contents, hyperlinks and other structures/metadata of the homepages.
In this example such a new attribute could be 'teaching' or 'acm publication'. By observing which tuples contain the new attributes, 'employees' could be classified into 'professor' and 'staff'.
The authors had the following motivations:
• current methods retrieve web tables that are visually expressed in a single HTML page, and there is limited experience with discovering attributes across pages;
• tuples in a web table usually form a trustworthy entity group, which provides guidance for relation discovery, since a reliable entity group facilitates the discovery of relations;
• the discovery of table attributes and of relations will mutually reinforce each other.
The remaining issue was:
• How can new attributes be discovered using traditional methods of relation extraction?
Extensive studies in the area of extracting general relations had been done previously; the authors focused instead on discovering relations specific to a table which are not common across the whole web. In the general framework, an entity relation is a triple (ei, r, ej), where ei and ej denote two entity types and r denotes their relation. The framework proceeds as follows:
• In a web table, a classifier selects tuples that belong to the entity type ei.
• Table columns are examined one by one for the selected tuples. For a particular column, the destination pages (abbreviated as P) of the hyperlinks in the column are gathered, and the hyperlinks on P are collected to form a transactional database D, where a transaction dk ∈ D is a bag of words of any hyperlink-associated information.
• A frequent-pattern mining approach is adopted to generate frequent itemsets from D, and each itemset is regarded as a candidate relation.
• For each candidate r, the trustworthiness (denoted as trust(r)) is evaluated.
• The classifier is updated by adding r to its set of features.
This procedure repeats iteratively until the trustworthiness scores converge. Candidates whose trustworthiness is larger than a pre-defined threshold become new relations.
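The candidate-generation and filtering steps above can be sketched as a toy example. The transactions, support threshold, and trust score below are invented for illustration; in particular, the trust score here is a simple support ratio, whereas the paper evaluates trustworthiness iteratively together with the classifier.

```python
# Toy sketch: each table tuple yields a "transaction" (bag of words from
# the hyperlinks on its destination page); word sets frequent across many
# tuples become candidate relations, and candidates whose trust exceeds a
# threshold are accepted as new attributes. Data and thresholds invented.
from itertools import combinations
from collections import Counter

# One transaction per table tuple (words from each person's homepage links).
transactions = [
    {"teaching", "publications", "cv"},        # professor A
    {"teaching", "publications", "projects"},  # professor B
    {"teaching", "publications"},              # professor C
    {"contact", "directions"},                 # staff member
]

MIN_SUPPORT = 2        # itemset must appear in at least 2 transactions
TRUST_THRESHOLD = 0.5  # fraction of tuples that must support a candidate

def frequent_itemsets(transactions, max_size=2):
    """Count itemsets of size 1..max_size and keep the frequent ones."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= MIN_SUPPORT}

def trust(count, n_tuples):
    # Simplified trust: support ratio over the entity group.
    return count / n_tuples

candidates = frequent_itemsets(transactions)
accepted = sorted(s for s, c in candidates.items()
                  if trust(c, len(transactions)) > TRUST_THRESHOLD)
print(accepted)
# [('publications',), ('publications', 'teaching'), ('teaching',)]
```

Here 'teaching' and 'publications' survive because three of the four tuples support them, matching the intuition from the professors example above.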
The experimental datasets consisted of HTML pages crawled from four websites (www.cs.uiuc.edu, cis.ksu.edu, esteelauder.com, senate.gov), downloaded in January 2010.

Summing up, the four datasets contained:
• 65,452 pages
• 1,018,510 hyperlinks
• 104,596 web tables
• 44.09% of the tables contained hyperlinks.
The average number of generated candidates per web table was:
• UIUC: 108.2
• KSU: 92.5
• ESTEE: 20.1 (a smaller number of candidates because of the limited vocabulary of cosmetics)
• SENATE: 66.0
A web table was selected from each of the four datasets, and a gold standard was created for it by manually extracting new attributes. The authors ranked the candidates of each web table according to their trustworthiness and reported the precision-recall performance. Precision was generally high on all datasets.
This confirmed the hypothesis that a reliable entity group facilitates the discovery of relations. In terms of recall, some relations were missed because page authors sometimes expressed the same meaning using different words. The authors expected that recall could be improved if prior knowledge of word correlations were available.
3. Second paper: ‘WebTables: Exploring the power of Tables on the Web’

The work presented here was done while all the authors were employed at Google, Inc.
They extracted 14.1 billion HTML tables from Google's general-purpose web crawl and used statistical classification techniques to identify the estimated 154 million tables that contain high-quality relational data.
Since each relational table has its own "schema" of labeled and typed columns, each such table can be considered a small structured database. As a consequence, the resulting corpus of databases is larger than any other corpus the authors were aware of by at least five orders of magnitude.
Authors described the WebTables system in order to explore the following fundamental questions about this collection of databases:

• What are effective techniques for searching for structured data at search-engine scales?
• What additional power can be derived by analyzing such a huge corpus?

Key outcomes:
• The authors developed new techniques for keyword search over a corpus of tables. They demonstrated that it is possible to achieve substantially higher relevance than with solutions based on a traditional search engine.
• They introduced a new object derived from the database corpus: the attribute correlation statistics database (ACSDb), which records corpus-wide statistics on co-occurrences of schema elements. A distinguishing feature of the ACSDb is that it is the first time anyone has compiled such large amounts of statistical data about relational schema usage. This makes it possible to take data-intensive approaches to schema-related applications, similar in spirit to recent efforts on machine translation and spell-correction that leverage huge amounts of data.
• Apart from improving search relevance, the ACSDb makes possible several novel applications: schema auto-complete, which helps a database designer to choose schema elements; attribute synonym finding, which automatically computes attribute synonym pairs for schema matching; and join-graph traversal, which allows a user to navigate between extracted schemas using automatically generated join links.
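The schema auto-complete idea can be sketched as follows: given the partial schema entered so far, suggest the attributes that most often co-occur with it in the corpus. The tiny "corpus" of schemas below is invented for illustration, and the probability estimate is a deliberately naive version of what the ACSDb statistics support.

```python
# Toy sketch of ACSDb-style schema auto-complete: estimate p(a | S) for a
# partial schema S from co-occurrence counts over a corpus of extracted
# schemas, and suggest the highest-scoring attributes. Corpus is invented.
from collections import Counter

corpus = [
    {"make", "model", "year", "price"},
    {"make", "model", "year", "mileage"},
    {"make", "model", "color"},
    {"name", "email", "phone"},
    {"name", "email", "address"},
]

def suggest(partial, corpus, k=3):
    """Rank candidate next attributes for a partial schema."""
    partial = set(partial)
    # Schemas in the corpus that contain every attribute typed so far.
    matching = [s for s in corpus if partial <= s]
    counts = Counter(a for s in matching for a in s - partial)
    n = len(matching)
    # p(a | partial) estimated as the co-occurrence frequency.
    return [(a, c / n) for a, c in counts.most_common(k)]

print(suggest({"make", "model"}, corpus))
```

With the schemas above, 'year' is the top suggestion for the partial schema {make, model}, since it appears in two of the three matching schemas.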

The authors presented the WebTables system, the first large-scale attempt to extract and leverage the relational information embedded in HTML tables on the Web.

They showed how to support effective search over a massive collection of tables and demonstrated that current search engines do not support such search effectively.

Moreover, they pointed out that the recovered relations can be used to build a very valuable data resource, the attribute correlation statistics database.

The ACSDb presented in this paper is a breakthrough which will help to solve a number of schema-related problems, e.g.:
• improvement of relation ranking,
• construction of a schema auto-complete tool,
• creation of synonyms for schema matching,
• help for users in navigating the ACSDb itself.
The authors expected to find further uses for the statistical data embodied in the corpus of recovered relations, especially by combining it with a "row-centric" analogue of the ACSDb.

Summing up, there are tremendous opportunities for creating new data sets by integrating and aggregating data from WebTables relations, and for enabling users to combine this data with some of their private data. The WebTables relation search engine is built on the set of recovered relations and still offers room for improvement. An obvious path is to incorporate a stronger signal of source-page quality (such as PageRank), which is currently included only indirectly via the document search results.
The authors would also like to include relational data derived from non-HTML table sources, such as deep-web databases and HTML-embedded lists.

4. Third paper: ‘Recovering Semantics of Tables on the Web’
The authors described a system that attempts to recover the semantics of tables by enriching them with additional annotations. These annotations facilitate operations such as:
• searching for tables
• finding related tables.

To recover the semantics of tables, they leveraged a database of class labels and relationships automatically extracted from the Web. This database has very wide coverage but is also noisy. They attached a class label to a column if a sufficient number of the values in the column are identified with that label in the database of class labels, and proceeded analogously for binary relationships.
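The column-labeling rule just described can be sketched as a simple threshold vote. The label database and threshold below are invented for illustration; the paper's actual model weighs the evidence formally rather than using a fixed fraction.

```python
# Toy sketch: attach a class label to a column when a sufficient fraction
# of its cell values are known instances of that class in a (noisy) label
# database. Label database and threshold are invented for illustration.
label_db = {
    "golden retriever": {"dog breed"},
    "beagle": {"dog breed"},
    "poodle": {"dog breed"},
    "paris": {"city"},
}

def label_column(values, label_db, threshold=0.6):
    """Return every class label supported by >= threshold of the cells."""
    scores = {}
    for v in values:
        for label in label_db.get(v.lower(), ()):
            scores[label] = scores.get(label, 0) + 1
    n = len(values)
    return {lab for lab, c in scores.items() if c / n >= threshold}

column = ["Golden Retriever", "Beagle", "Poodle", "Rex"]
print(label_column(column, label_db))  # {'dog breed'}  (3 of 4 cells match)
```

Note that the unknown value "Rex" does not block the label: the rule asks only for sufficient, not unanimous, evidence, which is what makes it robust to a noisy label database.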

The authors described a formal model for reasoning about when sufficient evidence for a label has been seen, and showed that it performs substantially better than a simple majority scheme. They described a set of experiments which illustrate the utility of the recovered semantics for table search, showed that their method performs substantially better than previous approaches, and finally characterized what fraction of tables on the Web can be annotated using their approach.

The Web offers over 100 million high-quality tables on a wide variety of topics. These tables are embedded in HTML, and therefore their meaning is described only in the text surrounding them. Header rows exist in only a few cases, and even when they do, the attribute names are typically useless.

Without knowing the semantics of the tables, it is very difficult to leverage their content, either in isolation or in combination with others. The challenge initially arises in table search (for queries such as countries population, or dog breeds life span), which is the first step in exploring a large collection of tables.

The authors pointed to the WebTables system, the first large-scale attempt to extract and leverage the relational information embedded in HTML tables on the Web. It showed how to support effective search on a massive collection of tables and demonstrated that current search engines do not support such search effectively.
Finally, they noted that the recovered relations can be used to create the attribute correlation statistics database.

They believed that it is possible to find further uses for the statistical data embodied in the corpus of recovered relations, especially by combining it with a "row-centric" analogue of the ACSDb, which stores statistics about collocations of tuple keys rather than attribute labels. This could enable a "data-suggest" feature similar to the schema auto-complete tool.


5. Individual observations & conclusions

From my perspective, the first article provides the basics of searching web tables. It is a good introduction for everyone who is not very familiar with the topic.
The second article introduces a business-related example, because the research presented there was performed at Google, Inc. The authors developed new techniques for keyword search over a corpus of tables and demonstrated that it is possible to achieve substantially higher relevance than with solutions based on a traditional search engine.
They introduced a new object, the attribute correlation statistics database (ACSDb), which records corpus-wide statistics on co-occurrences of schema elements.

The third article is an interesting enhancement of both papers mentioned above. It describes a set of experiments which illustrate the utility of the recovered semantics for table search; the method performs substantially better than previous approaches, and the authors characterise what fraction of tables on the Web can be annotated using their approach.

From my point of view, the most important observation is that the search mechanisms presented in all these papers can be used at global corporations for the following purposes:
• to search global program/project SharePoint sites or other web-based repositories
• to search for information across the corporate intranet
• to search for information in web-based tools (Defect Management, e.g. JIRA)
• to search various web-based work instructions, regulations and other project references
• to search employee directories (my Logica or db Directory)
• to search various web-based Knowledge Management Systems (dbWiki, Logica Cortex)

6. References

• Entity Relation Discovery from Web Tables and Links
Cindy Xide Lin, Bo Zhao, Tim Weninger, Jiawei Han, Bing Liu
April 26-30, 2010, Raleigh, NC, USA

• WebTables: Exploring the Power of Tables on the Web
Michael J. Cafarella, University of Washington
Alon Halevy, Google Inc.
Daisy Zhe Wang, UC Berkeley

• Recovering semantics of tables on the Web
Petros Venetis, Stanford University
Alon Halevy, Google Inc.
Jayant Madhavan, Google Inc.
Marius Pasca, Google Inc.
Warren Shen, Google Inc.
Fei Wu, Google Inc.
Gengxin Miao, UC Santa Barbara
Chung Wu, Google Inc.
