• No results found

Can We Automate Diagrammatic Reasoning?

N/A
N/A
Protected

Academic year: 2022

Share "Can We Automate Diagrammatic Reasoning?"

Copied!
12
0
0

Laster.... (Se fulltekst nå)

Fulltekst

(1)

ContentslistsavailableatScienceDirect

Pattern Recognition

journalhomepage:www.elsevier.com/locate/patcog

Can we automate diagrammatic reasoning?

Arif Ahmed Sekh

a,

, Debi Prosad Dogra

b

, Samarjit Kar

c

, Partha Pratim Roy

d

, Dilip K. Prasad

a

aUiT The Arctic University of Norway, Tromsø, Norway

bIndian Institute of Technology, Bhubaneswar, India

cNational Institute of Technology, Durgapur, India

dIndian Institute of Technology, Roorkee, India

a rt i c l e i n f o

Article history:

Received 18 April 2019 Revised 26 April 2020 Accepted 29 April 2020 Available online 6 May 2020 Keywords:

Abstract reasoning

Raven,s Progressive Matrices (RPM) Diagrammatic reasoning

Visual IQ test

a b s t r a c t

Diagrammaticreasoning (DR)problemsarewellknown.However,solvingDR problemsrepresentedin 4 ×1 Raven’s Progressive Matrix(RPM)form usingcomputer visionand pattern recognitionhas not yetbeentried.Emergenceofdeeplearningtechniquesaidedbyadvancedcomputingcan beexploited tosolve suchDR problems.Inthispaper, weproposeanew learningframeworkby combiningLSTM and Convolutional LSTM tosolve 4 × 1 DR problems. Initially, theelementary geometricalshapesin suchproblemsare detectedusingatypicalCNN-baseddetector. Next,relations ofvarious shapes are analyzedandahigh-levelfeaturesetisproducedandprocessedintheLSTMframework.Anew4×1DR datasethasbeenpreparedandmadeavailabletotheresearchcommunity.Webelieve,itwillbehelpful inadvancingthisresearchfurther.Wehavecomparedourmethodwithsomeoftheexistingframeworks thatcanbeusedforsolvingRPM-guidedDRproblems.Wehaverecorded18–20%increaseintheaverage predictionaccuracy as comparedtothe priorframeworkswhen appliedtoRPM-guided DR problems.

Webelieve theCVresearchcommunitywillbeinterestedtocarryoutsimilar research,particularlyto investigatethefeasibilityofsolvingothertypesofknownDRproblems.

© 2020TheAuthor(s).PublishedbyElsevierLtd.

ThisisanopenaccessarticleundertheCCBYlicense.(http://creativecommons.org/licenses/by/4.0/)

1. Introduction

Abstract reasoning or diagrammatic reasoning requires visual representationsofobjectsordiagrams.Itinvolvestheunderstand- ing ofconceptsandideasfromimageswiththepatternsthat are used in visual IQ tests [1]. Solving such diagrammatic reasoning problems using artificial intelligence can help us to understand complexpatternsofobjectsinimages.Typically,atestindiagram- matic reasoning consistsof a set of questions.The questions are usually ofmultiplechoices.Thesequestionsgenerallyconsistofa seriesofpictures,eachofwhichisdifferent.Thetaskistochoose anotherpicture froma numberofoptionstocompletethe series.

Forexamples,Fig.1showsatypicaldiagrammaticreasoningprob- lem, wherethe firstrow representsthe questionandthe second row containsfour options out of which only one is correct. The objective is tolearn the rules that can be applied to a sequence

Corresponding author.

E-mail addresses: arif.ahmed.sekh@uit.no (A .A . Sekh), dpdogra@iitbbs.ac.in (D.P. Dogra), samarjit.kar@maths.nitdgp.ac.in (S. Kar), proy.fcs@iitr.ac.in (P.P. Roy), dilip.prasad@uit.no (D.K. Prasad).

andthenusethemtopickanappropriateanswer.Solvingaques- tionrequiresanalyzingasequenceofshapesorpatternsknownas theRaven’sProgressive Matrices(RPM)[2].Thisisalsoknownas abstractorinductivereasoningtest.

1.1. Relatedwork

Reasoning is the ability to make sense of things by verifying factsandapplyinglogic.Werefertomachinelearning-basedmeth- ods forreasoningasartificial reasoning(AR). AR usesknowledge completion, value approximation, andgoal-oriented reasoning to solvedifferentformsofreasoning[3].Knowledgecompletion uses knowledge graphsto expressthe factsandit extracts a common sense knowledge. It is used in various machine learning guided reasoning such asimage captioning and question answering [4]. Value approximation is a methodfor extractingnumeric facts. It isused inquantitative questionanswering fromnaturallanguage textsand images [5].Goal-oriented reasoning is a top-down ap- proachthatheuristicallysearchesforasolutiontoachieve agoal.

Itispopularinrobotics, intelligentagent, andcase-basedreason- ing [6]. Data and knowledge-driven statistical methods [7], logic https://doi.org/10.1016/j.patcog.2020.107412

0031-3203/© 2020 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license. ( http://creativecommons.org/licenses/by/4.0/ )

(2)

The second row presents the multiple choices typically shown to an examinee. Op- tion D is the right answer for the above problem.

Fig. 2. Visual reasoning datasets. (a) How many objects are either small cylinders or red things? (b) There is exactly one big yellow square not touching any edge (True/False) (c) How many slices are there in the pizza? (d) Is the umbrella upside down? . (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

programming [8], and neural network-based approaches [9] are alsopopularforsolvingvariousreasoningproblems.

AR methods are complex innature andsuch methods require logicalrepresentationof data,common sense,statistical informa- tion,andlearningtechniques.Hencethelearningguidedreasoning methodsneedfurtherimprovementthrough thefusionofknowl- edge representation, reasoning and learning techniques [10]. For examples, statistical/relational learning [11] and knowledge base reasoning [12] are used in various reasoning problems. Statisti- cal reasoningmay be definedasmaking sense of historical data.

It is widely used in psychology, health, economy, etc. In logical AI, first-order logic (FOL) and relationalrepresentations are used to gain knowledge. To know more about relational ML, the re- viewwork presentedin [13]maybe consulted. Athoroughanal-

tion andreasoning-based learningapproach.The method opened upnewinsightsofintractabilityinAI.Maoetal.[18]usedasimi- larconcepttolearnabstractknowledgefromvisualrepresentation andlanguage embedding.However,similartasksinvisualreason- ing havenotyetbeentried by theCVcommunity.Visual reason- ingisnotstraight-forwardascomparedtotheother typesofrea- soning dueto the difficulty in interpreting the objects andtheir relations [19].Therefore,existinglogic andstatisticalAImethods cannotdirectlybeappliedtosolvevisualreasoningproblems.Two similar domains of reasoning that have received the attentionof CVresearchcommunityarevisualquestionanswering[20]andvi- sual reasoning [5]. Visual questionanswering consists of images andquestions that can beanswered from theimages. Toanswer thequestions,wemayrequirepriorknowledge abouttheobjects, theircolour,position,etc.Inadditiontothesefeatures,visualrea- soningmayalsorequireshapeinformation,count,orientation,etc.

Fig. 2(a) depicts an example of visual reasoning taken from the popular Compositional Language and Elementary Visual Reason- ing (CLVR)[5]dataset.CLVR datasetis usedforreasoning colour, shape,quantity,andsize.Fig.2(b)depictsCornellNaturalLanguage Visual Reasoning (NLVR) synthetic dataset that is primarily used to categorize comments. The questions on NLVR demand under- standing ofnatural language, shape, position, color, etc. Fig.2(c) presents an example from the Visual Question Answering (VQA) dataset containing open-ended questions about the images [21]. Thesequestionsrequirean understandingofvision, languageand common sense to answer. Fig. 2(d) presents reasoning of image pairs[22].Thesequestionsrequirereasoningabouttherelationof objects.

Visual IQ questions that are based on RPM [2] vary in na- ture anddiverse in complexities. Answers of RPM-based reason- ing requires common sense, idea about the shapes, and knowl- edge ofmathematics. There existdifferenttypes ofDRproblems.

Forexamples,Fig.3(a)representsatypical2×2DRproblemand Fig.3(b) representsa typical 3× 3 DRproblem. DR problemsof the order 3 × 3 are considered as formal RPMs andtheir solu-

Fig. 3. Examples of various RPM-based DR problems. (a) Example of a 2 ×2 matrix reasoning problem. (b) Example of a 3 ×3 graphical reasoning problem.

(3)

Table 1

Summary of Raven’s Progressive Matrices (RPM) Solving Methods.

Ref. Problem Type Feature Method

Kunda et al. [25] 2 ×2 High-level(Symbol) Model-based

Lovett et al. [26] 3 ×3 High-level (Object) Model-based

Ragini et al. [27] 3 ×3 High-level (Structure) Model-based

Mcgreggor et al. [28] 3 ×3 High-level (Object) Model-based

Lovett et al. [29] 3 ×3 High-level (Structure) Model-based

Santoro et al. [23] 3 ×3 Low-level (Raw Image) Learning-based

Hill et al. [24] 3 ×3 Low-level (Raw Image) Learning-based

Zhang et al. [30] 3 ×3 Low-level (Raw Image) Learning-based

Proposed 4 ×1 High-level (Relation) + Low-level (Raw Image) Learning-based

tions are addressedin [23] usingneural networkand in[24] us- ingrelationalstructure.Kundaetal.[25]haveusedtwocomputa- tionalalgorithms,namely FractalEncodingAlgorithm(FEA)and Affine-Extended Algorithm (AEA) tosolve RPMs. FEAdecomposes theimagesofaproblemintoasetofsmallimagesbyapplyinga setofspecificaffinetransformations(copy,rotation,orflip). Next, the algorithm generates fractal solutions to RPMs by considering allpossiblepairwise transforms.Lovettetal.[26]haveusedcom- putationalmodeltosolveRPMs.Themethodusesstructuralinfor- mationsuchasshape,texture,etc.ItthenusesStructure-Mapping Enginetofindthepatternvarianceamongimages.Finally,asetof rulesisappliedtofindasolution.Ragnietal.[27]haveproposed agoal-orientedrule-basedmethodtosolveRPMs.Themethodfirst processesconsecutivecellsofthematrixtoidentifythegoal(rule and texture) and then processes the solution image by analyz- ing the difference. McGreggor et al. [28] have proposed a con- fidence score for solving such problems. Firstly, each cell of the matrixisrepresentedusingrelationalfractalrepresentations(fea- ture similarity). Then, an image is expressed asa transformation (union, rotation, etc.) of a single image or multiple images. The answeristhenchosenbasedonmaximumsimilaritywiththeop- tions. The work can be considered as a preliminary step toward structural representation of the features. Lovett et al. [29] have proposed computational model-based solution. Images are com- paredviastructuremapping,aligningthecommonrelationalstruc- turein2imagestoidentifycommonalitiesanddifferences.Barrett etal.[23]havereleasedadatasetandproposedaneuralnetwork- basedlearningmethodtosolveDRproblems.Hilletal.[24] have extendedthe methodusinganeuralnetwork.Theyconcludethat thestate-of-the-artimage-baseddeepneuralnetworksfailtosolve complexproblems.However, iftheruleisextractedcorrectly,the learningmethodsperformbetter.Zhangetal.[30]haveproposeda modeltogeneratedifferentRPMs.Theyhaveshownthatthestate- of-the-art structuralsimilarity-based, rule-based,anddeepneural networkscanachieve amaximumof65%accuracyonthedataset.

Different approaches forsolving RPMs can be categorizedby the choices of problem types, features, and the solution approaches.

State-of-the-art approacheseither use raw images as features or extract high-level informationsuch as texture, shape, colour, etc.

Therearebroadlytwotypesofsolutionapproaches,computational modeling approaches with rule-basedsystem and learning-based approaches.Afewmethods thataresimilartoourworkaresum- marizedinTable1.

Moreover,theexistingapproachestrytosolve2×2and3×3 RPMsusingcomputationalmodelsorlow-levelimage-basedlearn- ing methods. In this paper, we have considered 4 × 1 diagram- matic problemsand usehigh-level features for learningandrea- soning.Wehavemadethefollowingresearchcontributions:

We haveintroduced anew featurerepresentationthat can be usedbytypicallearningframeworkstosolvediagrammaticrea- soningproblems.

Table 2

Visual Reasoning Dataset.

Dataset Pattern

Abstract Reasoning [23,30] 3 ×3

Scanned Images [28] 3 ×3

Digital Images [31] 3 ×3

Proposed 4 ×1

Weproposeaproblemclassifierandsolvertoaddressthesolu- tion extractionoftypical 4×1DRproblems.The methodcan learn a concept with a knowledge-base using lessnumber of trainingsamples.

We haveintroducedanewdatasetcontaining4×1DRprob- lems represented in the form of RPMs. The dataset contains state-of-the-artRPMsthatcanbegeneratedusingrulesaswell ascomplex problems.The datasethasbeen madeavailable to theresearchcommunityforfurtherinvestigation.

1.2.Datasetsandbenchmarks

Theultimategoal ofvisualreasoning istolearn image under- standingandinterpretations.Lack ofdatasetsmakes itdifficultat thisstage toapply computer visiontechniques tosolve DRprob- lems.Despitetheadvancementofdeeplearningframeworks,train- ingthe networkswithasufficient amountofdata isa challenge.

To the best of our knowledge, only a 3 × 3 DR dataset has re- centlybeenreleasedbyBarrett etal.[23].Thedatasetisreferred to asProcedurally Generated Matrices (PGM) dataset. McGreggor etal.[28] haveused a smalldataset collectedfromscanned im- ages.Table 2presentssummary of variousdatasets. Therefore,at this juncture, we thought to prepare and release a DR dataset tothe research community sothat the area can advance further.

Thus, we have collectedimages of diagrammatic reasoning from the web and prepared a dataset of 4 × 1 diagrammatic reason- ing problems.The datasetcontains 619 problems. We havecate- gorizedtheseproblemsintofourgroups, namely(i) Rotation(RT), (ii)Counting(CT),(iii)ShapeScaling(SS),and(iv)OtherType(OT).

Fig.4depictsa samplequestionwithpossibleanswers fromeach categoryandFig.5depictsthedistributionoftheproblemsacross thevariouscategoriesinourdataset.

Rest of the paper is organized as follows. In Section 2, we presentthe proposed DR solving method. Experiment resultsare presentedinSection3.Conclusionandfuturework arepresented inSection4.

2. Proposedarchitecture

Theproposed methodisbased onaset offeaturesandan al- gorithmic pipeline. Majority of the reasoning problems are tack- led with relational learning [11] and reasoning capabilities [12], whereastheimage-centricneuralnetwork-basedlearningapplica- tionsdonotrequirerelationallearning.Wehaveintroducedanew

(4)

Fig. 4. Examples of four types of typical DR problems that are present in our dataset. (a) Example of a rotation problem (RT), where a pattern is rotated as compared to the first image with relative rotations that may be mentioned in a DR question as {0 , 90 , 180 , ?}. The prediction should be 270 and the correct answer is option B. (b) It is a typical problem of number series prediction (CT). The question consists of a set of filled circles. Here, the number of circles varies as 2, 4, 6, ?. Our task is to predict the picture with 8 filled-circles. The correct answer is option B. (c) Third one is an example of typical shape and scaling problem (SS). The pattern can be interpreted as { < Cicle, Large Triangle > , < Circle, Big Triangle > , < Circle, Small Triangle > , < ? > }. Our task is to predict < Circle, Tiny Triangle > which is option B. (d) The fourth one is a typical pattern understanding problem. We have categorized such problems into Other Type (OT). Our task is to predict the 4 th pattern. The correct answer is option A.

Fig. 5. Distribution of different DR problems in our dataset.

methodtoextracttherelationalfeatures (RF)ofimage sequences tosolve one specific type ofDR problems.The proposed method consistsof two major steps. During the first level of processing, thequestionandoptions are passedthrough aknowledge acqui- sitiontool to constructthe knowledge base.The knowledge con- sistsofa setof imagefeatures extractedfrom theindividual im- ageand a set of relational features extracted fromthe sequence ofimagesinthequestionandtheoptions.Next,theproblemtype isidentified usinga supervisedlearning method. We define two

typesofLong Short TermMemory (LSTM)networks. Eachone of them is responsible to solve one specific type of problems. One LSTM takestext-based features (forRT, CT, andSStype) andthe other one takes image-based features (for OT types). Unlike the test-basedreasoningproblems[11],wherethereasoningneeds to be defined by knowledge or FOL [32], we learn the logic using LSTM through training.Fig. 6depicts theproposed framework in detail. The pipeline consists of (i) a Knowledge extraction mod- ule, (ii) a problem classifier and LSTM chooser module, (iii) two LSTMs, and (iv) a matching module. Let the problem space (P) be defined in (1), where the question contains a set of images (Q)=

{

I1,I2,I3

}

andtheoptionsaregroupedinthesolutionspace (O)=

{

I4,I5,I6,I7

}

. Diagrammatic reasoning is to predict the an- swersuch that IanswerO.First, we representtheproblemusing a high-level knowledgestructure. Theindividual modules are de- scribedhereafter.

P=

{

I1,I2,I3,...,I7

}

(1)

2.1. RCNNModule

First, each image of P is passed through an RCNN module to extract the shapes and the bounding box information. We have

Fig. 6. Architecture of the proposed framework. The architecture consists of an RCNN module, a knowledge module, problem classifier module, two LSTM modules, and a solution module. We take raw question sequence and the options as input and construct a knowledge base by taking the output from the RCNN module. Next, the knowledge is used to classify the problems into 4 categories and select the suitable LSTM. Finally, it predicts the best possible option out of the four input options and produces a complete sequence of four patterns/images.

(5)

Fig. 7. We represent the rotation problem as a set of 7 images or patterns. In rotation problems, we consider the first image (red) as the reference image with 0 rotation and extract the rotation relation of other images. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. Sample images after applying rotations on the first image of P . Depiction of how a possible match is found at 90 for a given query image.

considered 7commongeometricalshapes, namelycircle, triangle, rectangle,square, diamond,star,hexagon thatare usually present in various DR problems. All the shapes are classified as either empty(onlyedges)orfilled.Wehaveexperimentedwiththestate- of-the-artRCNNtodetecttheseshapes.YOLO[33]hasbeenfound tobeagoodrecurrentclassifierascomparedtoResnet50/101[34], VGG16[35],orGoogleNet[36].

ArchitectureandTraining: YOLOpredicts theboundingboxes and class confidence of a given image. It consists of 53 no. of successive 3 × 3 and1 × 1 convolutional layers. We have used a transfer learning approach [37] to train the RCNN. We have used ImageNet as the base model. For training, we have gener- ated 14 types of images (class) consisting of 7 types of shapes with filled andunfilled objects.The synthetic dataset consistsof 100,000shapeimagessimilartotheFourShape1.Itisgeneratedby thevaryingsizeandapplyingrotation.Themodelhasbeentrained with3000epochswiththedefaultparametersofYOLOV3.

2.2. Knowledgeacquisitionmodule

Knowledgeacquisitionhasbeencarriedoutduringtrainingand solutiongeneration.Theknowledgebase(

κ

)isextractedfromthe

sequence ofimages inthe givenproblemand theset ofoptions.

Wehaveconsideredthetypesofshape,numberofshapes,andsize oftheshapesastherelativefeatures.First,theshapesineachim- ageinPandtheboundingboxesareextractedusingtheRCNN.We thenintroduceanewfeatureextractionmethodforsolving4×1 DR problems. The feature is referred to asthe relational feature (RF). Unlikeimage-basedfeatures such ascolor,texture, shapeor edge that are typically used in various computer vision applica- tions,we haveextractedthreerelationalfeatures(RF),namelyro- tation (

ρ

), counts(

χ

), andscaling (

σ

) from the setof the given

1https://www.kaggle.com/smeschke/four-shapes .

images.Thefeature-setisgivenin (2).Variouscomponentsofthe feature-set(k)aredescribedhereafter.

κ

=<

ρ (

Ik

)

,

χ (

Ik

)

,

σ (

Ik

)

>,

k,kP (2)

Rotation:Inatypicalrotationdiagrammaticreasoningproblem (Fig.7),thesolutionliesinrotatingthefigurecorrectlytocomplete thesequence.Weassumethefirstimage(I1)asthereferencewith arotationof0.Alltheother images(I2,...,I7)areexpressedus- ingrotationanglewithrespecttothereferenceimage. Toachieve this,360imagesaregeneratedby incrementallyrotatingthebase image by r, where r=1. A few samples of the rotated images corresponding to the DR problem describedin Fig. 7 are shown in Fig. 8.This set is denoted by R=

{

I1,I2,...,I360

}

.The similar-

ityscore(

ψ

)isdefinedin(3).First,aResNet50[34]networkwith

averagepoolinghas beenused toextract features ofimages.The networkusespretrainedimagenetastheweightvector. Thescore has been estimatedbetween a query image andall images of R usingtheResNet50byconsideringchi-squaredistance,whereIjis queryimageandIkistheimageinR.

ψ

jk=Similarity

(

Ij,Ik

)

(3)

Therelativerotation

ρ

(Ik) ofeach imageofPisthenextracted withrespecttoeachimageIjbelongingtoR.IftheimagesinPare differentfrom each others,we categorize the questionas a non- rotationproblemandthenotapplicable(NA)flagisset.Athresh- oldhasbeenusedtodecideaboutthesuccessofmatching.

ρ

(Ik)is settothevalue ofrotationifthematchingscorereturnedby the ResNet50isabove thethreshold.However, inthe eventofmulti- pleimagesbeingcategorizedabovethe threshold,the imagethat givesthehighestvalueisselectedanditsrotationangleistakenas thefinal input.Intheeventthatnoneisfoundsuitable, theprob- lemis categorizedas a non-rotationdiagramatic reasoningprob- lem.Forexample,the relative rotations ofthe diagrammatic rea- soningproblemdepictedinFig.4(a)are{0,90,180}fortheop-

(6)

Fig. 9. A typical counting DR problem with 7 pictures. The first three patterns represent the sequence given in the question 2, 4, 6, ? and the next four patterns represent the options for the probable answer with 8 as the correct option.

Fig. 10. (a) Detection of shapes achieved by YOLO. (b) The bounding boxes are grouped using DBSCAN. Each color represents a group of same size shapes.

tions inthe questionand{180,270,0, 90}for theoptions in theanswer.

Counting:Countingisareasoningproblemwherethesolution istoextractthecorrectnumberofshapespresentintheproblem sequence. First, the shapes are detected and the number of the sametypes of shapesis estimated. For example,Fig. 9 depictsa typicallyfilledcircledetectionandcountingusingRCNN.Eachim- ageoftheproblemspaceisexpressedusingthecountofshapesin asequenceas{2,4,6,?}.Thepredictedmissingnumberisneeded toselectfromtheset{6,8,4,10}.

Scaling:Relativescaling(

σ

)isthenextractedfromthebound-

ingbox ofthedetected shapes. First,the boundingboxesare ex- tracted fromthe shapesin P. Each shape in the question image sequenceandtheoptionsarerepresentedbywidth(w)andheight (h) of the bounding box. Next, each type of shapesare grouped using unsupervised density-based spectral clustering with appli- cation to noise (DBSCAN) [38] using w and h. The groups are thenrearrangedinincreasingorderofthearea(w× h) suchthat area(L1)<area(L2)...<area(Ln). These groups are labeled using rulesasextremelylarge, verylarge,large,normal,medium, small andtinybasedonthenumberofclusters.Thegroupingandlabel- ingofshapesaredescribedinAlgorithm1.

Fig. 10 (a) depictsa DR problemwhere size ofthe pattern is usedasaclueforthesolution.TheDBSCANalgorithmcanidentify fourclassesorgroups,wherethe problemhas beenexpressedas {< VeryLarge >, < Large > , < Small > ,?}, andthe solution optionsare{<Nil>, <Tiny>, <VeryLarge>, <Small>}.

Representationofknowledgebase:Foragivenproblemspace P, the shapes are detected and the relational features (RF) are extracted as mentioned earlier. The knowledge base consists of four sets, namely shapes, rotation (

ρ

), counting (

χ

), and scal-

ing (

σ

). Shapesstore informationabout thestructures andother sets represent various components of the relational features.

Table3showstheknowledge extractedfromfourdifferent4× 1 problems.

Algorithm1 Scaling-basedfeatureextraction.

Input: ProblemSpace(P)asdefinedinequation~(1) Output: Relationalscaling(

σ

)ofeachimage

1: S=DetectShapes(Ik),

k,IkP 2: ShapeGroup=DBSCAN(S)

3: Extractnumberofcluster(c)fromShapeGroup

4: Rearrange group and assign label L1,L2,...,Ln, where area(L1)<area(L2)...<area(Ln)

5: ifc=6then

6: L=

{

ExtremelyLarge,VeryLarge,Large,Medium,Small,Tiny

}

7: elseifc=5then

8: L=

{

VeryLarge,Large,Medium,Small,Tiny

}

9: elseifc=4then

10: L=

{

Large,Medium,Small,Tiny

}

11: elseifc=3then

12: L=

{

Large,Medium,Small

}

13: elseifc=2then 14: L=

{

Large,Small

}

15: elseifc=1then 16: L=

{

Normal

}

17: else 18: L=

{

Nil

}

19: endif

20:

σ

k=ShapeLabel(Ik)

21: Return

σ

2.3. Problemclassificationmodule

Problem classification plays an important role as it is used to select the appropriate LSTM module. Failure in classification may lead to a wrong solution selection. The knowledge base of the relative features extracted in the previous step is used to classify the problem and based on the problem category a spe- cificfeature ischosen to representtheproblem. We callthe fea-

(7)

Table 3

Typical examples of knowledge base extracted using the features described earlier (The first 3 rows are correctly ex- tracted, the last row is failure case).

DR Problem Constructed knowledge base

Shapes = {F il l edtriangl e , T riangle } ρ= {0 , 90 , 180 , 180 , 90 , 0 , 270 }

χ= {< 1 , 1 > < 1 , 1 > < 1 , 1 >, < 1 , 1 >, < 1 , 1 >, < 1 , 1 >, < 1 , 1 > } σ= {< N, N >, < N, N >, < N, N >, < N, N >,

< N, N >, < N, N >, < N, N > }, where N is Normal.

Shapes = {Filled circle } ρ= {NA }

χ= {< 2 >, < 4 >, < 6 >, < 6 >, < 8 >, < 4 >, < 10 > } σ= {< AN >, < AN >, < AN >, < AN >, < AN >,

< AN >, < AN > }, where AN is All Normal.

Shapes = {Filled triangle } ρ= {NA }

χ= {< 1 >, < 1 >, < 1 >, < 1 >, < 1 >, < 1 >, < 1 > } σ= {< Ver yLar ge >, < Large >, < Small >,

< Nil > , < Tiny > , < VeryLarge > , < Small > }

Shapes = {Filled triangle } ρ= {NA }

χ= {< 4 >, < 4 >, < 4 >, < 4 >, < 4 >, < 4 >, < 4 > } σ= {< AN >, < AN >, < AN >, < AN >, < AN >,

< AN >, < AN > }, where AN is All Normal.

Table 4

Details of the predicted knowledge and answers.

Predicted answer Predicted knowledge Detected category Correct?

ρ= {270 } Category 1 (RT) Yes

χ= {< 8 > } Category 1 (CT) Yes

σ= {< T iny > } Category 1 (SS) Yes

Not Applicable Category 2 (OT) No

ture asactive feature (

α

). First, thethree images ofthe problem

are chosen and

κ

is extracted for those images. Next, the rota-

tion (

ρ

) andcounting(

χ

) is replaced by ”Equal/Not Equal” if all

the values are equal or not. Next, all the features are encoded using the one-hot encoder and a supervised k-nearest neighbor (KNN) is applied to classify the problem into 4 classes (CT, RT, SS, andOT). We have empirically chosen k=10 and it produces good results. Based on the problem type, the active feature is

chosenasgivenbelow:

α

=

⎧ ⎪

⎪ ⎩

ρ

ifclass=RT

χ

ifclass=CT

σ

ifclass=SS I ifclass=OT

Table 4 shows reference feature prediction (answer) of the problemsshowninTable3.

(8)

Fig. 11. Interpolation model for solving Category 2 problems. E: Encoder, D: Decoder.

Fig. 12. Prediction model for solving Category 1 problems. Knowledge extractor is used to represent sequence of images to sequence of features ( ρ, χ, σ). Depending on the problem type corresponding feature is taken as the input (active feature) to the LSTM.

2.4.LSTM-basedsolutiongenerator

The final stage is to learn the patternfrom the question im- ages andpredict the correct answer from the givenoptions. We haveused two variations of LSTM to predict the answer option.

Inthe caseofCategory1,a simplervariation ofLSTM is usedas proposedin[39].Themethodhasbeenusedtogenerateacaption fromthe images.Wehavenot usedtheimage asinput.We have usedtheneuralarchitectureofthelanguagelearningandgenera- tionpartintheproposed method.Ratherthanusingtheconven- tionalimage-based features[40],we haveusedrelationalfeatures (RF)extractedbytheknowledgeextractor.Themethodisdepicted inFig. 12,where the knowledge extractor (KE) isthe process of extractingRFsasdiscussedearlier.Atthebeginning,therelational features(RF)areextractedfromalltrainingsamplesandtheactive feature (

α

) is chosen. Next, a text-based LSTM corresponding to Category1(

α

istheinputtotheLSTM)istrainedtobuildthepre-

dictionmodel.Inthetestingphase,asimilarknowledgebaseand theactive featureare extractedfromthe test samples. Unknown problems(Category2)aresolvedbyavariationoftheLSTM,called FlexibleSpatio-Temporal Network (FSTN)proposed in [41].Origi- nallythe method predicts the futurevideo framesfrom a set of observedsequences.Inthismethod,image-based featuresare se- quentiallypassedthroughaconvolutional-LSTM.Fig.11depictsthe methodindetail. The methodconsistsofa sequence ofconvolu-

tionalandpoolinglayers andLSTMmodules.The methodtakesa sequenceofimagesandfeaturesafterconvolutionandpoolingare passedthroughtheLSTM.Thenetworkistrainedusingtheimage sequencesconsistingoftheproblemandcorrectanswerimages.

Model architecture and training : Category1 LSTM is mod- elledbyanRNNconsideringp(St

|

K,S0,S1,...,St1),whereKisthe knowledge,Sisthe wordrepresentingknowledge words,andtis thetimestep(t=4).Thehiddenstateormemory(ht)isupdated afterreceivingtheinput(xt)bynonlinearfunctionf:

ht+1= f

(

ht,xt

)

(4)

Tomaketheabovenetworkapplicable toourdomain,twocrucial design choicesare to be made:(1) What is theexact form off? and(2)Howaretheimagesandwordsfedasinputsxt?Forf,we usea Long-Short TermMemory (LSTM)network, that hasshown state-of-the-artperformanceon sequenceclassification.The LSTM usesstate-of-the-artmodules [42] forhiddenlayersandthefinal predictionlayerdefinedin(6),whereUandVareinputandoutput weightvectorsandbisthebias.

ht=

σ (

bi+xtUi+ht1Vi

)

(5)

Thehiddenunitiscombinedwithforgetgatesandoutputgates andthefinallayerisasoftmaxlayerasgivenin(6).

pt+1=So ftmax

(

lastlayer

)

(6)

(9)

Inour case,we haveuseda multilayerLSTM with512 unitsper layer consistingof2LSTM layers. Moreover,we denotethe input problembyPandthesequenceofactivefeaturesfortheimageby S=(S0,S1,S2,S3).Theapproachusesthestepsasgivenin7–(9),

x1=ActiveFeature

( α )

(7)

xt=WeSt,t

{

0,.,N1

}

(8)

pt+1=LSTM

(

xt

)

,t

{

0,.,N1

}

(9)

whereeachknowledgedescriptorword(St)usesone-hotwordem- bedding(We).Wehaveusedsequenceloss functionminimization ineachstepasgivenin(10).

Loss

(

K,S

)

= N

t=1

logpt

(

St

)

(10)

The model has beentrained using supervisedactive features ex- tractedfromthetrainingsets. Theproblemshavebeensolved by thevolunteerstoobtainthegroundtruthsandtheactivefeatures are recorded andused fortraining. Forexample, thetraining se- quence usedforthefirst threeproblemsdescribedinTable4are {0, 90,180,270},{< 2> , < 4> , < 6 >, < 8> },and {<VeryLarge >, <Large>, <Small >, <Tiny>}.Thepro- posed LSTM is then trainedto learn and predict fromthe active featuresofthetrainingsamples.

Category 2 RNN is a Convolutional LSTM that consists of a spatio-temporal autoencoder,whichin turnconsistsofan image- based encoder-decoder with an LSTM cell acting as a temporal encoder. The encoder (E) containsone convolutional layer, leaky ReLU non-linearity, anda spatial max-pooling layer. The decoder (D)mirrorstheencoder,exceptforthenon-linearitylayer,anduses spatialupsamplingtobringtheoutputbacktothesizeoftheorig- inal input. The proposed method uses 64 × 64 input image se- quences with 5 × 5 kernel and3 × 3 pooling layer with batch normalization.TheLSTM modulesaremultilayered(wehaveused 2layers)timedistributedlayer.Wehaveusedtwotypesoflosses asreportedin[43]and[41].Thefirstlossisal2lossappliedonde- coderoutput asLDt =

Xˆt+1Xt+1

22,whereX isinputimage and Xˆispredictedimage.Thesecondlossisanencoderlossappliedon encoder output as LEt =

E(Xˆ)t+1E(Xt+1)

2, where E(Xˆ) is en- coder feature output and E(X) input feature to the encoder. The globallossisdefinedin(11).

L=

i

LE+LD (11)

Thenetworkhasbeentrainedusingtheimagesequencesofother typesofproblems(OT)withthesupervisedcorrectanswerimage.

Themethodhasbeentrainedusinga learningrateof0.1in3000 epochs.

2.5. Solutionmodule

The solution module consists of an active feature matching module andan image similarity module.In the caseof Category 1 problems,the predictedactive featureis matched withtheac- tivefeatureoftheavailableoptions.Forexampleinthefirstprob- lem presented in Table 4, if

α

=

ρ

and the predictedsolution is

270,the match modulesearches for the optionwhen

ρ

=270, i.e.option4isthecorrectanswer.InthecaseofCategory2prob- lems,thepredictedimage iscomparedwithalltheoptionimages usingResNet50featureextractor.The solutionischosenbasedon themaximummatchedimageoptions.

Table 5

Results of shape detection.

Algorithm Accuracy

ResNet50 (baseline) 57.19 ResNet101 [34] 62.19

VGG16 [35] 71.11

GoogleNet [36] 77.22

YOLO [33] 86.76

3. Experimentresults

We present the experiment results in this section. The pro- posedarchitecturestartswitha shapedetectionmethodfollowed byproblemclassificationandsolutionselection.

3.1. Shapedetectionresults

The first step of the method is to detect shapes from a givenimage. We haveexperimented with state-of-the-artconvo- lutionalnetworksincludingResNet50,ResNet101[34],VGG16[35], GoogleNet[36]andYOLO[33].YOLOhasbeenfoundtobethebest architectureforthepresentcase.70% ofthedatahavebeenused fortrainingand30%fortestingacrossallexperiments.Resultsus- ing 10-fold cross validation havebeen reported.Table 5 summa- rizestheshapedetectionresults.

3.2.Problemclassification

In the next stage, an analysis of the results of classification has been carried out. The confusion matrix for four types of problems is depicted in Fig. 13. It may be observed that the KNNwiththeproposed featurecansuccessfullyclassifytheprob- lemswithreasonablyhighaccuracy.Wehaveperformeda10-fold cross-validation andobserved that the proposed classifier classi- fies counting and rotation problems with 92% and 87% accuracy respectively.The accuracyof scalingandother types ofproblems hasbeenfound tobe 88%. Thisdeclineinaccuracy isdueto the complexnature anddiverse varietyof theproblems in theother group.Inourproposed method,theidentificationofscalingprob- lemsinvolvesscalingfactoridentificationandclustering.Afailure inany stepmay affectthe classification outcome. Moreover,ma- jorityofthecomplexothertypeofproblemscanbeclassifiedwith 81%accuracy.

Fig. 13. The confusion matrix for classifying DR problem. R: Rotation, C: Counting, S: Scaling, O: Other.

(10)

Fig. 14. A few samples from the our DR dataset when the proposed method correctly identifies the answers. The green boxes represent the ground truths and/or the correctly chosen answers. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 15. A few samples taken from our DR dataset where the proposed method fails to choose the correct option. The green boxes represent the ground truths and the red boxes represent wrongly predicted answers. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

3.3.Solutionselection

Inthefinalstage,LSTMisusedtofindthecorrectsolution.We have compared the proposed architecture with the state-of-the- artimage-based reasoningsolvers.The resultsaresummarized in Table6.Fig.14depictssomesuccesscases,where(a)and(d)rep- resentrotationproblems,(b)and(f)representcountingproblems, (e) representsa typical scaling problemand (c)represents other typeproblems.Itmaybeobservedthattheproposed methodcan solve different typesof DR problems with better accuracy when comparedwith existing techniques.Fig. 15 presents some failure cases.Ithasbeenobserved that theFSTNisapplicable when the sequence of the image contains continuous visual changes such ashuman motioninvideoorcompleteness problemasshownin

Fig. 14(c). It is not suitable for reasoning problems that contain high-levellogicinformation.Reasoningforhigh-levelconcepts de- mandsknowledge ofshape, counting,relation,etc. It maybeob- servedthattheproposedmethodhasfailedtolearncomplexpat- ternsofreasoningproblems.

There are problems which are more complex than counting, scaling,androtation,suchasinvolvingXORandANDoperationsor pattern-basedproblemsinvolvinglineorfigures [28].Suchprob- lemsmaybe solvedusingrules[23].Thesetypesofproblemsfol- lowsimplepatternsandneuralnetworkcan easilylearnthelogic andapply itto unknown problems. The otherstype ofproblems may be related to completeness (Fig. 15(a) and (c)), where each figure is incrementally completed or reversed by adding or sub- tractingdifferentpartsormaybemixedproblems(Fig.15(d))that

(11)

areusuallycombinationsofvariouscomplexconcepts.Solvingsuch problemsdemandsahigherdegreeofcognitiveskillsthathumans possess.Tosolvesuchproblems,neuralnetworksnotonlyrequire trainingonhowtodealwiththeproblems,butalsoneedtoapply the knowledge of numbers, operators, logic, visual patterns, and mathematicalrulestoobtainresults.

3.4. Comparativeanalysis

Computational model-based approaches such as structure- mapping [26], cognitivemodeling [27],reasoning-based [28], and modellingapproach [29] arebased ona setoffundamentalrules for solving RPMs. The methods use high-level features and are restricted to specific types of reasoning problems that can be solvedbyasetofwell-definedrules.Thesemethodsaimtomodel the rules to automate thesolving process. Such methods are not learning-based methods. Various rule-based problem generators such asrelation preservingmodel[23],rule-based structuregen- erator [24] and analogical reasoning [30] are proposed in litera- ture. The authors generate a large volume of dataset by apply- ing a set of rules anduse a singleneural network to solve vari- ous typesof DR problems. The methods utilize low-level image- basedfeatures forlearningthatdemands alargevolumeoftrain- ingsamples.Wehavemadeabridgebetweenthemodeling-based methods andlearning-basedreasoning.We haveuseda newfea- ture representationof the RPMs, referred to as relationalfeature (RF)toconstructaknowledge-base.RFsharessimilarfundamental concept of feature representation used in modeling-based meth- odsdiscussedearlier.Thefeaturesareextractedfromlow-levelim- agesusingcomputervisionmethods.Theyarethenrepresentedin astructuredmannersuchthattheycanbeusedinatypicallearn- ing framework such asLSTM. Ourmethodcan learn newknowl- edgeviatrainingandsolvetheRPMswithoutcomputationalmod- elingandrules.Wehavealsofoundthatdifferentneuralnetworks resultinto different accuracyfor differentproblems,and thereis noclearwinner.Ithasbeenobservedthat high-levelfeatures are highly suitable forsolving reasoningproblems. However, extract- inghigh-level informationfromlow-levelimagescanbecomplex.

It hasalso beenobservedthat different learningmethodscan be used to learn the low-level features such as the image andalso thehigh-level featuressuchasobjectsandrelations.Wehavead- dressed the problem of extracting high-level features from low- levelimages,representationofthefeaturesasknowledge,andpro- posedaframeworkforlearninglow-levelandhigh-levelfeatures.

The concept of relational feature is new and the proposed methodcanbeusefulinvariousimageunderstandingproblemsin computer vision [5,22]. Further, the proposed framework can be usefulfordifferentartificialreasoningtaskssuchasintelligent tu- tor,digitalassistant[46],intelligentrobot[47],etc.

4. Conclusion

In this paper, we have introduced a new dataset for solving 4 ×1DR problemsusingmachine learningandcomputervision.

ThedatasetcanbeusedbytheCVresearchcommunityforextend- ing theresearch inthisdomain.Wehaveexperimentedwithsev- eralstate-of-the-artlearningframeworkstosolveavarietyof4×1 DR problems. It has been observed that the image-based analy- sisusuallyfailstoanswercorrectlyinmanycases.Wehaveintro- ducedanewfeature-setreferredtoasrelationalfeaturestosolve 4 × 1 DR problems. Supervised learning withthe help of LSTM hasbeenusedtoclassifytheDRquestions.Resultsrevealthatthe proposedframeworkoutperformsexistingimage-basedanalysis.

It has been observed that the algorithmic pipeline definedin this work can be highlyeffective as it requires lesssamples for learning. However, the knowledge-base proposed in this work is

relatively simple in nature and it may not be sufficient to solve complex DR problems. Therefore, it may be necessary to rede- finethe feature-setfor solving complexDR problems.Inparticu- lar,other types(OT)ofDRproblemsneedfurtherattentionofthe researchcommunity.

Ethicalapproval

Thisarticle doesnot contain any studies withhuman partici- pantsoranimalsperformedbyanyoftheauthors.Informedcon- sent: Informed consent wasobtained from all individual partici- pantsincludedinthestudy.

DeclarationofCompetingInterest

Theauthorsdeclarethatthereisnoconflictofinterestregard- ingthepublicationofthispaper.

Acknowledgement

Wegratefullyacknowledge thesupport ofNVIDIACorporation withthedonationoftheQuadroP5000GPUusedforthisresearch.

Supplementarymaterial

Supplementary material associated with this article can be found,intheonlineversion,atdoi:10.1016/j.patcog.2020.107412. References

[1] P.A . Carpenter , M.A . Just , P. Shell , What one intelligence test measures: a the- oretical account of the processing in the raven progressive matrices test., Psy- chol. Rev. 97 (3) (1990) 404–431 .

[2] H.R. Burke , Raven’S progressive matrices: a review and critical evaluation, J.

Genet. Psychol. 93 (2) (1958) 199–228 .

[3] C. Diamantini , A. Freddi , S. Longhi , D. Potena , E. Storti , A goal-oriented, ontolo- gy-based methodology to support the design of aal environments, Expert Syst.

Appl. 64 (2016) 117–131 .

[4] Y. Zhou , Y. Sun , V. Honavar , Improving image captioning by leveraging knowl- edge graphs, in: IEEE Winter Conference on Applications of Computer Vision, 2019, pp. 283–293 .

[5] J. Johnson , B. Hariharan , L. van der Maaten , L. Fei-Fei , C.L. Zitnick , R. Girshick , Clevr: A diagnostic dataset for compositional language and elementary visual reasoning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1988–1997 .

[6] P. Giorgini , J. Mylopoulos , E. Nicchiarelli , R. Sebastiani , Reasoning with goal models, in: International Conference on Conceptual Modeling, Springer, 2002, pp. 167–181 .

[7] G.W. Mineau , R. Godin , Automatic structuring of knowledge bases by concep- tual clustering, IEEE Trans. Knowl Data Eng 7 (5) (1995) 824–829 .

[8] L.D. Raedt , K. Kersting , S. Natarajan , D. Poole , Statistical relational artificial in- telligence: logic, probability, and computation, Synth. Lect. Artif. Intell.Mach.

Learn. 10 (2) (2016) 1–189 .

[9] C.-U. Shin , J.-W. Cha , End-to-end task dependent recurrent entity network for goal-oriented dialog learning, Comput. Speech Lang. 53 (2019) 12–24 . [10] C. Huang , J. Li , C. Mei , W.-Z. Wu , Three-way concept learning based on cogni-

tive operators: an information fusion viewpoint, Int. J. Approximate Reasoning 83 (2017) 218–242 .

[11] M. Verbeke , V. Van Asch , R. Morante , P. Frasconi , W. Daelemans , L. De Raedt , A statistical relational learning approach to identifying evidence based medicine categories, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 579–589 .

[12] F. Yang , Z. Yang , W.W. Cohen , Differentiable learning of logical rules for knowl- edge base reasoning, in: Advances in Neural Information Processing Systems, 2017, pp. 2319–2328 .

[13] M. Nickel , K. Murphy , V. Tresp , E. Gabrilovich , A review of relational machine learning for knowledge graphs, Proc. IEEE 104 (1) (2015) 11–33 .

[14] H. Jaeger , Artificial intelligence: deep neural reasoning, Nature 538 (7626) (2016) 467–468 .

[15] L. Serafini , A.S.d. Garcez , Learning and reasoning with logic tensor networks, in: Conference of the Italian Association for Artificial Intelligence, Springer, 2016, pp. 334–348 .

[16] S.M. Kazemi , D. Poole , Relnn: A deep neural model for relational learning, in:

32nd AAAI Conference on Artificial Intelligence, 2018, pp. 6367–6375 . [17] A. Garcez , M. Gori , L. Lamb , L. Serafini , M. Spranger , S. Tran , Neural-symbolic

computing: an effective methodology for principled integration of machine learning and reasoning, J. Appl. Logics 6 (4) (2019) 611–632 .

Referanser

RELATERTE DOKUMENTER

The starting time of each activity will depend on the activ- ity’s precedence relations, release date, deadline, location, exclusiveness, the assigned resources’ traveling times,

Keywords: gender, diversity, recruitment, selection process, retention, turnover, military culture,

3 The definition of total defence reads: “The modernised total defence concept encompasses mutual support and cooperation between the Norwegian Armed Forces and civil society in

The system can be implemented as follows: A web-service client runs on the user device, collecting sensor data from the device and input data from the user. The client compiles

In April 2016, Ukraine’s President Petro Poroshenko, summing up the war experience thus far, said that the volunteer battalions had taken part in approximately 600 military

This report documents the experiences and lessons from the deployment of operational analysts to Afghanistan with the Norwegian Armed Forces, with regard to the concept, the main

Based on the above-mentioned tensions, a recommendation for further research is to examine whether young people who have participated in the TP influence their parents and peers in

Overall, the SAB considered 60 chemicals that included: (a) 14 declared as RCAs since entry into force of the Convention; (b) chemicals identied as potential RCAs from a list of