Convolution networks have made significant advances in standard recognition tasks such as image classification and segmentation, yet the dominant paradigm remains stacking deeper and more complicated local convolutions. These networks compromise feature interpretability and lack the global reasoning capability that is crucial for complicated real-world tasks.

Some works have formulated graphical models and structural constraints on final convolution predictions (e.g., CRFs), but they cannot explicitly enhance feature representations and so do not generalize widely. Capsule networks can be extended to learn the sharing of knowledge across locations to find feature clusters, but they can only exploit an implicit and uncontrollable feature hierarchy. This lack of explicit reasoning over contexts and high-level semantics keeps convolution networks from recognizing objects in large concept vocabularies, where it is necessary to explore semantic correlations and constraints. Structured knowledge, by contrast, records human observations and common sense using symbolic words (e.g., nouns or predicates). We therefore want to bridge symbolic semantics with learned local feature representations for better graph reasoning.

### Using “Common Sense”

In our paper, we explore how to incorporate human common sense into intermediate feature representation learning beyond local convolutions to further achieve global semantic coherency. We represent “human common sense” as various undirected graphs consisting of rich relationships (e.g., semantic hierarchy, spatial/action interactions and attributes, and concurrence) between concepts. For example, the concepts “Shetland Sheepdog” and “Husky” share the superclass “dog” due to some common characteristics; people wear hats and play guitars, but not vice-versa; and orange is a yellowish color. After associating this structured knowledge with visuals, all of these symbolic entities (e.g., “dog”) can be connected with evidence from images, allowing us to integrate visual appearance with common-sense knowledge.

We then attempt to mimic this reasoning procedure and integrate it into convolution networks in three steps: first, we characterize representations of different symbolic nodes by voting from local features; second, we perform graph reasoning via graph propagation to enhance the visual evidence of these symbolic nodes and achieve semantic coherence; and, finally, we map the evolved features of the symbolic nodes back into each local representation. Importantly, we go beyond previous approaches by directly incorporating reasoning over external knowledge graphs into local feature learning, through what we call the Symbolic Graph Reasoning (SGR) layer. Note that we use “Symbolic” here to denote nodes with explicit linguistic meaning rather than conventional/hidden graph nodes used in graphical models or graph neural networks.

### Symbolic Graph Reasoning

The core of our SGR layer consists of three modules, illustrated in Figure 1.

Figure 1: An overview of the proposed SGR layer. Each symbolic node receives votes from all local features via a local-to-semantic voting module (long gray arrows), and its evolved features after graph reasoning are then mapped back to each location via a semantic-to-local mapping module (long purple arrows). For simplicity, we omit more edges and symbolic nodes in the knowledge graph.

#### Local-to-Semantic Voting Module

The first module, which we call the local-to-semantic voting module, produces personalized visual evidence for each symbolic node by voting from all local representations. The voting weights stand for the semantic agreement confidence of each local feature with a certain node. Given local feature tensors from convolution layers, our target is to leverage global graph reasoning to enhance local features with external structured knowledge. To do this, we first summarize the global information encoded in local features into representations of symbolic nodes: local features that are correlated with a specific semantic meaning (e.g., cat) are aggregated to depict the characteristics of their corresponding symbolic node. Formally, we use the feature tensor after the convolution layer as the module input.
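A minimal NumPy sketch of the voting step may make this concrete. It assumes the feature tensor is flattened to one row per location, and that voting weights come from a softmax over symbolic nodes applied to a learned linear projection. The names `W_a`, `W_ps`, and `local_to_semantic_voting`, the softmax normalization, and all dimensions are illustrative assumptions, not details confirmed by this post.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def local_to_semantic_voting(X, W_a, W_ps):
    """Aggregate local features into one feature vector per symbolic node.

    X    : (L, D_l) local features, L = height * width locations
    W_a  : (D_l, M) hypothetical voting projection, M symbolic nodes
    W_ps : (D_l, D_c) hypothetical feature projection
    """
    # Each location spreads one unit of vote across the M nodes (assumption).
    votes = softmax(X @ W_a, axis=1)        # (L, M)
    # Each node sums the projected local features, weighted by its votes.
    node_feats = votes.T @ (X @ W_ps)       # (M, D_c)
    return votes, node_feats

rng = np.random.default_rng(0)
height, width, D_l, D_c, M = 4, 4, 8, 6, 5
X = rng.standard_normal((height * width, D_l))
W_a = rng.standard_normal((D_l, M))
W_ps = rng.standard_normal((D_l, D_c))
votes, node_feats = local_to_semantic_voting(X, W_a, W_ps)
```

Here `votes[i, n]` plays the role of the agreement confidence between location `i` and node `n`, and `node_feats[n]` is the visual evidence gathered for node `n`.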

#### Graph Reasoning Module

The second module is graph reasoning, which uses structured knowledge based on visual evidence and semantic constraints to evolve global representations of the symbolic nodes. We incorporate both the linguistic embedding of each symbolic node and the knowledge connections (i.e., node edges). Formally, for each symbolic node $n \in \mathcal{N}$, we use off-the-shelf word vectors as linguistic embeddings. The graph reasoning module performs graph propagation over the representations of all symbolic nodes in matrix multiplication form, resulting in evolved features.
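The propagation step can be sketched in a few lines of NumPy. This version assumes the visual evidence of each node is concatenated with its word vector, the adjacency matrix gets self-loops and is row-normalized, and a ReLU follows the propagation. Those three choices, along with the names `normalize_adjacency`, `graph_reasoning`, and `W_g`, are assumptions made for illustration.

```python
import numpy as np

def normalize_adjacency(A):
    """Add self-loops and row-normalize so each row sums to 1 (assumption)."""
    A_hat = A + np.eye(A.shape[0])
    return A_hat / A_hat.sum(axis=1, keepdims=True)

def graph_reasoning(node_feats, word_vecs, A, W_g):
    """One round of graph propagation over symbolic nodes.

    node_feats : (M, D_c) visual evidence per node
    word_vecs  : (M, K) linguistic embeddings per node
    A          : (M, M) knowledge-graph adjacency (edge weights)
    W_g        : (D_c + K, D_c) hypothetical learned transform
    """
    B = np.concatenate([node_feats, word_vecs], axis=1)   # (M, D_c + K)
    A_norm = normalize_adjacency(A)
    # Propagation in matrix multiplication form, followed by ReLU.
    return np.maximum(A_norm @ B @ W_g, 0.0)              # (M, D_c)

rng = np.random.default_rng(0)
M, D_c, K = 4, 6, 5
node_feats = rng.standard_normal((M, D_c))
word_vecs = rng.standard_normal((M, K))   # stand-in for real word vectors
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W_g = rng.standard_normal((D_c + K, D_c))
evolved = graph_reasoning(node_feats, word_vecs, A, W_g)
```

Each row of `evolved` mixes a node's own evidence with that of its graph neighbors, which is how the edges inject semantic constraints.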

#### Semantic-to-Local Mapping Module

The final module is a dual semantic-to-local mapping module. Because the feature distribution of each symbolic node has changed after graph reasoning, this module learns appropriate associations between the evolved symbolic nodes and local features to join local and global reasoning. With the help of global reasoning, the evolved knowledge of a specific symbolic node drives the recognition of only semantically compatible local features. In practice, this amounts to learning a compatibility matrix between local features and symbolic nodes.
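The mapping back to locations might look like the sketch below. It scores each (location, node) pair with a bilinear form, normalizes the scores over nodes, and adds the mapped node features to the input as a residual. The bilinear scoring, the residual connection, and the names `W_q`, `W_k`, `W_sp`, and `semantic_to_local_mapping` are all illustrative assumptions and may differ from the scoring function used in the paper.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_to_local_mapping(X, evolved, W_q, W_k, W_sp):
    """Map evolved node features back onto local features.

    X       : (L, D_l) local features
    evolved : (M, D_c) node features after graph reasoning
    W_q     : (D_l, D_s), W_k : (D_c, D_s) hypothetical scoring projections
    W_sp    : (D_c, D_l) hypothetical projection back to local feature size
    """
    # Compatibility between every location and every node (bilinear, assumed).
    scores = (X @ W_q) @ (evolved @ W_k).T            # (L, M)
    compat = softmax(scores, axis=1)                  # rows sum to 1
    # Each location receives a compatibility-weighted mix of node features.
    update = np.maximum(compat @ evolved @ W_sp, 0.0) # (L, D_l)
    return compat, X + update                         # residual (assumed)

rng = np.random.default_rng(0)
L, D_l, M, D_c, D_s = 16, 8, 5, 6, 4
X = rng.standard_normal((L, D_l))
evolved = rng.standard_normal((M, D_c))
W_q = rng.standard_normal((D_l, D_s))
W_k = rng.standard_normal((D_c, D_s))
W_sp = rng.standard_normal((D_c, D_l))
compat, X_out = semantic_to_local_mapping(X, evolved, W_q, W_k, W_sp)
```

Note that `compat` is computed from the evolved node features rather than reused from the voting step, reflecting the point that node feature distributions change after graph reasoning.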

#### The SGR Layer

Each symbolic graph reasoning layer consists of a stack of a local-to-semantic voting module, a graph reasoning module, and a semantic-to-local mapping module. The SGR layer is instantiated by a specific knowledge graph with different numbers of symbolic nodes and distinct node connections. Combining multiple SGR layers with distinct knowledge graphs into convolutional networks can lead to hybrid graph reasoning behaviors. We implement the modules of each SGR layer via a combination of $1\times 1$ convolution operations and nonlinear functions, detailed in Figure 2.
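One reason $1\times 1$ convolutions are a natural building block here is that a $1\times 1$ convolution is just the same linear map applied independently at every spatial location. The short NumPy check below illustrates this equivalence; the helper name `conv1x1` and the tensor sizes are chosen only for the example.

```python
import numpy as np

def conv1x1(feature_map, weight):
    """1x1 convolution (no bias) on a channels-last feature map.

    feature_map : (H, W, C_in)
    weight      : (C_in, C_out)
    """
    return np.einsum("hwc,cd->hwd", feature_map, weight)

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 5, 8))
weight = rng.standard_normal((8, 3))

# Path 1: treat it as a convolution over the spatial grid.
out_conv = conv1x1(feature_map, weight)

# Path 2: flatten locations into rows, multiply, and restore the grid.
out_flat = (feature_map.reshape(-1, 8) @ weight).reshape(4, 5, 3)

same = np.allclose(out_conv, out_flat)
```

Because both paths agree, per-location projections inside the layer can be written either as matrix products over flattened features or as $1\times 1$ convolutions over the feature map.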

Figure 2: Implementation details of one SGR layer.

The key merits of our SGR layer are: a) by learning associations between image-specific observations and prior knowledge graphs, it allows for collaboration between local convolutions and global reasoning facilitated by common sense knowledge; b) each local feature is enhanced by its correlated incoming local features, whereas in standard local convolutions, local features are only based on comparisons with their own incoming features and a learned weight vector; c) after learning the representations of universal symbolic nodes, the learned SGR layer can be easily transferred to another dataset domain with discrete concept sets, and an SGR layer can be plugged between any convolution layers and subsequently personalized according to distinct knowledge graphs.

#### General-Purpose Graph Construction

The common sense knowledge graph depicts distinct correlations between entities (e.g., classes, attributes, and relationships) and can, in general, take any form. To support general-purpose graph reasoning, the knowledge graph can be formulated as a graph with a symbol set and an edge set. Here are three examples:

1. A class hierarchy graph that is made up of a list of entity classes (e.g., “person,” “motorcyclist”) and with graph edges made up of concept belongings (e.g., “is a kind of,” or “is a part of”). The networks equipped with this hierarchy of knowledge can encourage the learning of feature hierarchy by passing the shared representations of parent classes into their child nodes;
2. A class occurrence graph with edges that are defined as the occurrence of two classes across images, characterizing the rationality of predictions;
3. A semantic relationship graph, a higher-level semantic abstraction that can extend symbolic nodes to include more actions (e.g., “ride,” “play”), layouts (e.g., “on top of”), and attributes (e.g., color or shape), while graph edges are statistically collected from language descriptions. Incorporating high-level common sense knowledge like this can help networks prune spurious explanations, resulting in good semantic coherence.

Based on this general formula, graph reasoning must be compatible with and general enough for soft graph edges (e.g., occurrence probabilities) and hard edges (e.g., belongings), as well as diverse symbolic nodes. Various structure constraints can be modeled as edge connections over symbolic nodes, in the same way we use language tools. Our SGR layer is designed to achieve the general graph reasoning that is applicable for encoding a wide range of knowledge graph forms.
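To illustrate how a symbol set and an edge set with both hard and soft edges can share one representation, here is a small sketch that builds a symmetric weighted adjacency matrix. The helper `build_adjacency`, the toy symbols, and especially the soft edge weights are made up for the example and do not come from any dataset.

```python
import numpy as np

def build_adjacency(symbols, edges):
    """Build an undirected, weighted adjacency matrix.

    symbols : list of node names (the symbol set)
    edges   : list of (name_a, name_b, weight) triples (the edge set);
              weight 1.0 for hard edges, a value in (0, 1) for soft edges
    """
    index = {name: i for i, name in enumerate(symbols)}
    A = np.zeros((len(symbols), len(symbols)))
    for a, b, w in edges:
        i, j = index[a], index[b]
        A[i, j] = A[j, i] = w   # undirected: mirror every edge
    return A, index

symbols = ["person", "motorcyclist", "motorcycle", "road"]
edges = [
    ("motorcyclist", "person", 1.0),      # hard edge: "is a kind of"
    ("motorcyclist", "motorcycle", 0.8),  # soft edge: made-up co-occurrence
    ("motorcycle", "road", 0.6),          # soft edge: made-up co-occurrence
]
A, index = build_adjacency(symbols, edges)
```

A matrix like `A` can hold a class hierarchy, a co-occurrence graph, or a relationship graph, so the same reasoning code applies to all three.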

## Experiments

Our SGR layer is flexible and general enough to be injected between any local convolutions. However, because SGR is designed to incorporate high-level semantic reasoning, it’s better to use it in later convolution layers, as demonstrated in our experiments.

Tables 1, 2, and 3 show the results of our SGR layer compared to recent state-of-the-art methods on the COCO-Stuff, PASCAL-Context, and ADE20K datasets, respectively. Our SGR layer significantly outperforms existing methods on all three datasets, demonstrating its effectiveness in performing explicit graph reasoning beyond local convolutions for large-scale pixel-level recognition.

Figure 3 shows the qualitative comparison with the baseline “Deeplabv2.” Our SGR obtains better segmentation performance, especially for some rare classes (e.g., umbrella, teddy bear), benefiting from joint reasoning with frequent concepts over the concept hierarchy graph. It incorporates high-level semantic constraints designed for classification tasks into pixel-wise recognition, which is notable since associating prior knowledge with dense pixels is itself difficult. Unlike prior methods, our SGR layer achieves better results with only one additional reasoning layer, while preserving good computational and memory efficiency.

Note that our SGR learns distinct voting and mapping weights in the local-to-semantic and semantic-to-local modules, respectively. Comparing “Our SGR (ResNet-101)” with “SGR (w/o mapping)” on both testing performance and training convergence in Tables 1 and 4 shows that estimating new semantic-to-local mapping weights can improve the reasoning process on the evolved feature distributions after graph reasoning.

By pioneering the integration of common sense into the design of machine learning systems, we hope our SGR will help boost research into global reasoning in convolution networks. Check out our paper for more details: http://papers.nips.cc/paper/7456-symbolic-graph-reasoning-meets-convolutions
