Lexical Analysis of Source Codes with Walking through Abstract Syntax Tree (AST)

Arghavan Moradi
4 min readApr 30, 2020

--

Source code is written in a programming language that is comprehensible by machine.

Machine learning on code (#MLonCode) is an area in machine learning that is focused on the content of code. In this topic, the data type is not text (natural language) or image, the data type is source code. Source code is written in a programming language that is comprehensible by machine. The grammar and structure of programming language are different from the human language. Therefore, the methods that apply for human language data cannot be used properly for a machine language. We need a method that is compatible with the complexities of machine language and knows the semantic behind keywords.

Abstract Syntax Tree (AST) is a more abstract version of the parsing tree in the programming language. To convert a code to semantic behind it and to extract words at different levels, we use AST. If you already know about AST, come with me and continue to read the following examples, otherwise please first read “Leveling Up One’s Parsing Game with AST”.

In this post, I will explain in detail how we can extract different parts of code from variable names to API calls and class types after converting source code into AST. I will focus on lexemes analysis of the “python” source code in these posts, but each programming language has its own grammar and parser.

To start the game, you first need to import ast. Here is our first practice. The python code that we are going to convert it into AST is a simple line “text = ‘Hello world!’ “. Then we call parse from ast to convert the code into the abstract syntax tree. After all, in a loop, we walk through different nodes of the tree and print the name of each node.

Sample #1

The output of the above code sample is here. Nodes are printed in order from the root of the tree into the leaves.

Module
Assign
Name
Str
Store

To view more details, we can print the dump of the code. Don’t forget that to view the dump, you first need to parse the code, third line of sample #1, code_tree= ast.parse(code).

print(ast.dump(code_tree))

And here is the output. I put the class names in Bold. But as you can see, there are other words in dump output. what are those words?

Module(body=[Assign(targets=[Name(id=’text’, ctx=Store())], value=Str(s=’Hello world!’))])

For each class_name or that is better to say object, we have different arguments. In sample #1, try to see what would be the output if we only print node ( not node.__class__.__name__).

For example, here is the definition of Assign class:

classAssign(targets, value, type_comment)

There are different arguments. The target could be the variable name and value is equal to the value sets into the variable. But do we need to extract all keywords? it totally depends on the goal of analyzing. For example, if we only want to collect API calls or only function names, we don’t need these details.

let’s check more samples. Suppose that we want to know how many functions are defined in source code and also collect all function names. This time, I put the python code into a file, named pycode.py and try to read code from the pycode.py file. Here is the content of the file (The content that we are going to analyze its functions).

def name(n):
return n
def plus(x,y):
return (x+y)
def minus (x,y):
if x>y:
return(x-y)
else:
return(x-y)

def multi(x,y):
return (x*y)
def divide(x,y):
return(x/y)

We create a list to collect all function names. First, we open the pycode.py file. Then same as sample #2, we parse the code with ast parser. After that, for each node which is functionDef, we collect the node’s name.

Sample #2
len(function_name)5print(function_name)[‘name’, ‘plus’, ‘minus’, ‘multi’, ‘divide’]

I show the dump of one the functions to understand better the output of sample #2. As you can see, there are more arguments in FunctionDef class, but here we only want to collect the “name” of the function.

Module(body=[FunctionDef(name=’plus’, args=arguments(args=[arg(arg=’x’, annotation=None), arg(arg=’y’, annotation=None)], vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[]), body=[Return(value=BinOp(left=Name(id=’x’, ctx=Load()), op=Add(), right=Name(id=’y’, ctx=Load())))], decorator_list=[], returns=None)])

The next sample is function calls. How if we want to analyze the function calls instead of function definition in a project to find the most famous function (A function that is used the most across the project) or top 3 function calls in a software project. Here is the code to collect function calls. Same as sample #2, I put the sample code in the pycode.py file. This time instead of ‘FunctionDef’, we are searching for ‘Call’. We walk through the tree till find ‘Call’ node, then we need to make sure it is function call node (not for example an API call). To do so, we check node.func, if the class_name is ‘Name’ then it is a function call but if it is ‘Attribute’ then it is an API call.

Sample #3

Here is the dump of a function call in a python code:

body=[Assign(targets=[Name(id=’keys’, ctx=Store())], value=Call(func=Name(id=’try_sort’, ctx=Load()), args=[Name(id=’mapping’, ctx=Load())], keywords=[]))

All I am trying to tell you is, to walk through the AST tree and extract different concepts, you need to be able to read the dump and then organize what you are going to collect. I put some other samples to collect API calls, Import libraries, and class names.

Sample #4 (API Call)
Sample #5 (libraries)
Sample #6 (Class Name)

--

--