API docs

compare50

class compare50.Comparator

Abstract base class for compare50 comparators which specify how submissions should be scored and compared.

compare(scores, ignored_files)

Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparisons.

score(submissions, archive_submissions, ignored_files)

Given a list of submissions, a list of archive submissions, and a set of distro files, return a list of compare50.Scores for each submission pair.

class compare50.Comparison(sub_a, sub_b, span_matches=NOTHING, ignored_spans=NOTHING)
Variables:
  • sub_a – the first submission
  • sub_b – the second submission
  • span_matches – a list of pairs of matching compare50.Spans, wherein the first element of each pair is from sub_a and the second is from sub_b.
  • ignored_spans – a list of compare50.Spans which were ignored (e.g. because they matched distro files)

Represents an in-depth comparison of two submissions.

exception compare50.Error

Base class for compare50 errors.

class compare50.File(name, submission)
Variables:
  • name – file name (path relative to the submission path)
  • submission – submission containing this file
  • id – integer that uniquely identifies this file (files with the same path will always have the same id)

Represents a single file from a submission.

classmethod get(id)

Find the File with the given id.

lexer()

Determine which Pygments lexer should be used.

path

The full path of the file.

read(size=-1)

Open file, read size bytes from it, then close it.

tokens()

Returns the preprocessed tokens of the file.

unprocessed_tokens()

Get the raw tokens of the file.

class compare50.Pass

Abstract base class for compare50 passes, which are essentially ways for compare50 to compare submissions. Subclasses must define a list of preprocessors (functions from tokens to tokens which will be run on every file compare50 receives) as well as a comparator (used to score and compare the preprocessed submissions).
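The way a pass chains its preprocessors can be sketched as follows. This is an illustration only: SketchPass is a stand-in rather than a real compare50.Pass subclass, the attribute names preprocessors and comparator are assumed from the description above, and tokens are plain strings to keep the example short.

```python
from functools import reduce


def strip_blank(tokens):
    """Hypothetical preprocessor: drop whitespace-only tokens."""
    return (t for t in tokens if t.strip())


def lowercase(tokens):
    """Hypothetical preprocessor: lower-case every token."""
    return (t.lower() for t in tokens)


class SketchPass:
    """Stand-in for a Pass subclass; the attribute names are assumptions."""
    preprocessors = [strip_blank, lowercase]
    comparator = None  # a real pass would supply a Comparator here


def run_preprocessors(pass_, tokens):
    """Feed the tokens through each preprocessor in list order."""
    return reduce(lambda toks, pre: pre(toks), pass_.preprocessors, tokens)


# Tokens are plain strings here, not compare50.Token objects.
result = list(run_preprocessors(SketchPass, ["Int", " ", "X", "=", "1"]))
```

Because each preprocessor consumes and produces a token stream, the order of the list matters: the output of one becomes the input of the next.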

class compare50.Score(sub_a, sub_b, score=0)
Variables:
  • sub_a – the first submission
  • sub_b – the second submission
  • score – a number indicating the similarity between sub_a and sub_b (higher meaning more similar)

A score representing the similarity of two submissions.

class compare50.Span(file, start, end)
Variables:
  • file – the ID of the File containing the span
  • start – the character index of the first character in the span
  • end – the character index one past the end of the span

Represents a range of characters in a particular file.
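Because end is one past the last character, a span maps directly onto Python slice notation. A minimal sketch, using a stand-in named tuple rather than the real compare50.Span, with a plain integer standing in for the file ID:

```python
from collections import namedtuple

# Stand-in for compare50.Span, for illustration only.
Span = namedtuple("Span", ["file", "start", "end"])

content = "int main(void) { return 0; }"
span = Span(file=0, start=4, end=8)

# Half-open range: start is included, end is excluded.
fragment = content[span.start:span.end]
length = span.end - span.start
```

With this convention the length of a span is simply end minus start, and two spans are adjacent exactly when one's end equals the other's start.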

class compare50.Submission(path, files, preprocessor=<function Submission.<lambda>>, is_archive=False)
Variables:
  • path – the file path of the submission
  • files – list of compare50.File objects contained in the submission
  • preprocessor – A function from tokens to tokens that will be run on each file in the submission
  • id – integer that uniquely identifies this submission (submissions with the same path will always have the same id).

Represents a single submission. Submissions may either be single files or directories containing many files.

classmethod get(id)

Retrieve the submission corresponding to the specified id.

class compare50.Token(start, end, type, val)
Variables:
  • start – the character index of the beginning of the token
  • end – the character index one past the end of the token
  • type – the Pygments token type
  • val – the string contents of the token

A result of the lexical analysis of a file. Preprocessors operate on Token streams.
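A preprocessor in this sense is just a function that takes a stream of tokens and yields a new one. The sketch below uses a stand-in named tuple instead of the real compare50.Token, plain strings instead of Pygments token types, and a hypothetical preprocessor named drop_blank. It shows that tokens which survive preprocessing keep their original character offsets.

```python
from collections import namedtuple

# Stand-in for compare50.Token, for illustration only.
Token = namedtuple("Token", ["start", "end", "type", "val"])


def drop_blank(tokens):
    """Hypothetical preprocessor: yield only tokens that are not whitespace."""
    for tok in tokens:
        if tok.val.strip():
            yield tok


# Tokens for the text "x = 1"; types are plain strings in this sketch.
tokens = [
    Token(0, 1, "Name", "x"),
    Token(1, 2, "Text", " "),
    Token(2, 3, "Operator", "="),
    Token(3, 4, "Text", " "),
    Token(4, 5, "Number", "1"),
]
kept = list(drop_blank(tokens))
```

Keeping start and end on every token is what lets a match found in the preprocessed stream be mapped back to a character range in the original file.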

compare50.compare(scores, ignored_files, pass_)
Parameters:
  • scores ([compare50.Score]) – Scored submission pairs to be compared more granularly
  • ignored_files ({compare50.File}) – files containing distro code
  • pass_ (compare50.Pass) – pass whose comparator should be used to compare the submissions
Returns:

Compare50Results corresponding to each of the given scores

Return type:

[compare50.Compare50Result]

Performs an in-depth comparison of each submission pair and returns a corresponding list of compare50.Compare50Results.

compare50.expand(span_matches, tokens_a, tokens_b)
Parameters:
  • span_matches ([(compare50.Span, compare50.Span)]) – span pairs to be expanded wherein the first element of every pair is from the same file and the second element of every pair is from the same file
  • tokens_a ([compare50.Token]) – the tokens of the file corresponding to the first element of each span_match
  • tokens_b ([compare50.Token]) – the tokens of the file corresponding to the second element of each span_match
Returns:

A new list of maximally expanded span pairs

Return type:

[(compare50.Span, compare50.Span)]

Expand all span matches. This is useful when e.g. two spans in two different files are identical, but there are tokens before/after these spans that are also identical between the files. This function expands each of these spans to include these additional tokens.
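The idea of expansion can be sketched for a single span pair as follows. This is a simplified illustration, not the library's implementation: Token and Span are stand-in named tuples, tokenize and expand_pair are hypothetical helpers, and the handling of whole lists of pairs is left out.

```python
from collections import namedtuple

# Stand-ins for compare50.Token and compare50.Span, for illustration only.
Token = namedtuple("Token", ["start", "end", "type", "val"])
Span = namedtuple("Span", ["file", "start", "end"])


def tokenize(words):
    """Hypothetical helper: one token per word, separated by single spaces."""
    tokens, pos = [], 0
    for word in words:
        tokens.append(Token(pos, pos + len(word), "Text", word))
        pos += len(word) + 1
    return tokens


def covered(span, tokens):
    """Indices of the first and last token lying inside the span."""
    idx = [i for i, t in enumerate(tokens)
           if span.start <= t.start and t.end <= span.end]
    return idx[0], idx[-1]


def expand_pair(span_a, span_b, tokens_a, tokens_b):
    """Grow one matching span pair outward while neighbouring tokens agree."""
    lo_a, hi_a = covered(span_a, tokens_a)
    lo_b, hi_b = covered(span_b, tokens_b)
    # Extend to the left while the preceding tokens are identical.
    while lo_a > 0 and lo_b > 0 and tokens_a[lo_a - 1].val == tokens_b[lo_b - 1].val:
        lo_a, lo_b = lo_a - 1, lo_b - 1
    # Extend to the right while the following tokens are identical.
    while (hi_a + 1 < len(tokens_a) and hi_b + 1 < len(tokens_b)
           and tokens_a[hi_a + 1].val == tokens_b[hi_b + 1].val):
        hi_a, hi_b = hi_a + 1, hi_b + 1
    return (Span(span_a.file, tokens_a[lo_a].start, tokens_a[hi_a].end),
            Span(span_b.file, tokens_b[lo_b].start, tokens_b[hi_b].end))


tokens_a = tokenize(["a", "=", "b", "+", "c", ";"])
tokens_b = tokenize(["z", "=", "b", "+", "c", "!"])
# Start from a match on the single token "b" in both files.
expanded = expand_pair(Span(0, 4, 5), Span(1, 4, 5), tokens_a, tokens_b)
```

Here the initial match on "b" grows to cover "= b + c" in both files and stops at the first differing token on each side.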

compare50.missing_spans(file, original_tokens=None, processed_tokens=None)
Parameters:
  • file (compare50.File) – file to be examined
  • original_tokens – the unprocessed tokens of file. May be optionally specified if file has been tokenized elsewhere to avoid tokenizing it again.
  • processed_tokens – the result of preprocessing the tokens of file. May optionally be specified if file has been preprocessed elsewhere to avoid doing so again.
Returns:

The spans of file that were stripped by the preprocessor.

Return type:

[compare50.Span]

Determine which parts of file were stripped out by the preprocessor.
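One way to picture this computation is to compare the character ranges of the original tokens with those that survived preprocessing. The sketch below assumes the preprocessor only drops tokens and leaves offsets untouched; Token and Span are stand-in named tuples and stripped_spans is a hypothetical helper, not the library function.

```python
from collections import namedtuple

# Stand-ins for compare50.Token and compare50.Span, for illustration only.
Token = namedtuple("Token", ["start", "end", "type", "val"])
Span = namedtuple("Span", ["file", "start", "end"])


def stripped_spans(file_id, original_tokens, processed_tokens):
    """Return merged spans for every original token absent after preprocessing."""
    kept = {(t.start, t.end) for t in processed_tokens}
    spans = []
    for tok in original_tokens:
        if (tok.start, tok.end) in kept:
            continue
        if spans and spans[-1].end == tok.start:
            # Adjacent to the previous gap: merge into one span.
            spans[-1] = Span(file_id, spans[-1].start, tok.end)
        else:
            spans.append(Span(file_id, tok.start, tok.end))
    return spans


# Tokens for the text "x = 1 # hi"; types are plain strings in this sketch.
original = [
    Token(0, 1, "Name", "x"),
    Token(1, 2, "Text", " "),
    Token(2, 3, "Operator", "="),
    Token(3, 4, "Text", " "),
    Token(4, 5, "Number", "1"),
    Token(5, 6, "Text", " "),
    Token(6, 10, "Comment", "# hi"),
]
# Pretend a preprocessor removed whitespace and comments.
processed = [t for t in original if t.type not in ("Text", "Comment")]
gaps = stripped_spans(0, original, processed)
```

The trailing space and the comment are contiguous, so they collapse into a single span covering characters 5 through 9.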

compare50.rank(submissions, archive_submissions, ignored_files, pass_, n=50)
Parameters:
  • submissions ([compare50.Submission]) – submissions to be ranked
  • archive_submissions ([compare50.Submission]) – archive submissions to be ranked
  • ignored_files ({compare50.File}) – files containing distro code
  • pass_ (compare50.Pass) – pass whose comparator should be used to rank the submissions
  • n (int) – number of submission pairs to return
Returns:

the top n submission pairs

Return type:

[compare50.Score]

Rank submissions and return the top n most similar pairs.
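The final selection step, picking the n highest-scoring pairs, can be sketched with the standard library. This covers only that last step: Score is a stand-in named tuple, top_n is a hypothetical helper, and the preprocessing and scoring that the pass performs beforehand are not shown.

```python
import heapq
from collections import namedtuple

# Stand-in for compare50.Score, for illustration only.
Score = namedtuple("Score", ["sub_a", "sub_b", "score"])


def top_n(scores, n=50):
    """Return the n pairs with the highest similarity, most similar first."""
    return heapq.nlargest(n, scores, key=lambda s: s.score)


scores = [
    Score("alice", "bob", 12),
    Score("alice", "carol", 40),
    Score("bob", "carol", 3),
]
best = top_n(scores, n=2)
```

heapq.nlargest avoids sorting the full list, which helps when the number of submission pairs is much larger than n.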

compare50.passes

class compare50.passes.structure

Compares code structure by removing whitespace and comments; normalizing variable names, string literals, and numeric literals; and then running the winnowing algorithm.

class compare50.passes.exact

Removes all whitespace, then uses the winnowing algorithm to compare submissions.

class compare50.passes.misspellings

Compares comments for identically misspelled English words.

compare50.preprocessors

compare50.preprocessors.by_character(tokens)

Make a token for each character.
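A minimal sketch of the idea, assuming each new token should cover exactly one character of the original and inherit its type. Token is a stand-in named tuple and split_chars is a hypothetical name, not the library function.

```python
from collections import namedtuple

# Stand-in for compare50.Token, for illustration only.
Token = namedtuple("Token", ["start", "end", "type", "val"])


def split_chars(tokens):
    """Yield one single-character token per character of each input token."""
    for tok in tokens:
        for offset, char in enumerate(tok.val):
            pos = tok.start + offset
            yield Token(pos, pos + 1, tok.type, char)


chars = list(split_chars([Token(3, 6, "Name", "abc")]))
```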

compare50.preprocessors.comments(tokens)

Remove all tokens that aren’t comments.

compare50.preprocessors.extract_identifiers(tokens)

Remove all tokens that don’t represent identifiers.

compare50.preprocessors.normalize_builtin_types(tokens)

Normalize builtin type names.

compare50.preprocessors.normalize_case(tokens)

Make all tokens lower case.

compare50.preprocessors.normalize_identifiers(tokens)

Replace all identifiers with v.
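The effect of this kind of normalization is that two files differing only in variable names produce the same token values. In the sketch below, Token is a stand-in named tuple, the string "Name" stands in for the Pygments identifier token type, and normalize_names is a hypothetical name.

```python
from collections import namedtuple

# Stand-in for compare50.Token, for illustration only.
Token = namedtuple("Token", ["start", "end", "type", "val"])


def normalize_names(tokens, placeholder="v"):
    """Replace the value of every identifier token, keeping its offsets."""
    for tok in tokens:
        if tok.type == "Name":  # assumed stand-in for a Pygments Name type
            yield tok._replace(val=placeholder)
        else:
            yield tok


# "total = count" and "x = y" differ only in their identifiers.
tokens_a = [Token(0, 5, "Name", "total"), Token(6, 7, "Operator", "="), Token(8, 13, "Name", "count")]
tokens_b = [Token(0, 1, "Name", "x"), Token(2, 3, "Operator", "="), Token(4, 5, "Name", "y")]
vals_a = [t.val for t in normalize_names(tokens_a)]
vals_b = [t.val for t in normalize_names(tokens_b)]
```

Only val changes; start and end still point at the original identifier, so a match can be reported against the unmodified source.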

compare50.preprocessors.normalize_numeric_literals(tokens)

Replace numeric literals with their types.

compare50.preprocessors.normalize_string_literals(tokens)

Replace string literals with empty strings.

compare50.preprocessors.split_on_whitespace(tokens)

Split values of tokens on whitespace into new tokens.
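A sketch of how such a split might keep offsets consistent, using a regular expression to find each run of non-whitespace characters. Token is a stand-in named tuple and split_ws is a hypothetical name; how the real function treats the whitespace itself is not specified here.

```python
import re
from collections import namedtuple

# Stand-in for compare50.Token, for illustration only.
Token = namedtuple("Token", ["start", "end", "type", "val"])


def split_ws(tokens):
    """Yield one token per whitespace-separated piece of each token value."""
    for tok in tokens:
        for match in re.finditer(r"\S+", tok.val):
            yield Token(tok.start + match.start(),
                        tok.start + match.end(),
                        tok.type,
                        match.group())


# A comment token starting at character 10 of its file.
pieces = list(split_ws([Token(10, 21, "Comment", "# teh  loop")]))
```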

compare50.preprocessors.strip_comments(tokens)

Remove all comments from tokens.

compare50.preprocessors.strip_whitespace(tokens)

Remove all whitespace from tokens.

compare50.preprocessors.text_printer(tokens)

Print token values. Useful for debugging.

compare50.preprocessors.token_printer(tokens)

Print each token. Useful for debugging.

compare50.preprocessors.words(tokens)

Split tokens into tokens containing just one word.

compare50.comparators

class compare50.comparators.Misspellings(dictionary)
compare(scores, ignored_files)

Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparisons.

score(submissions, archive_submissions, ignored_files)

Number of identically misspelled words.
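The scoring idea can be sketched as a set intersection: collect the words of each submission that are missing from the dictionary, then count those the two submissions share. The dictionary contents, the sample comments, and the helper misspelled are all made up for illustration.

```python
def misspelled(words, dictionary):
    """Return the set of lower-cased words not found in the dictionary."""
    return {w.lower() for w in words if w.lower() not in dictionary}


# A tiny hypothetical dictionary of correctly spelled words.
dictionary = {"the", "of", "loop", "over", "each", "element", "iterate", "array"}

comment_a = "iterate over each elemnt of the arrray".split()
comment_b = "loop over the arrray elemnt".split()

shared = misspelled(comment_a, dictionary) & misspelled(comment_b, dictionary)
score = len(shared)
```

Identical misspellings are a useful signal because two authors working independently are unlikely to make the same typo.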

class compare50.comparators.Winnowing(k, t)

Comparator utilizing the (robust) Winnowing algorithm as described in https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf

Parameters:
  • k – the noise threshold; any matching sequence of tokens shorter than this will be ignored
  • t (int) – the guarantee threshold; any matching sequence of tokens of length at least t is guaranteed to be matched
compare(scores, ignored_files)

Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparisons.

score(submissions, archive_submissions, ignored_files)

Number of matching k-grams.
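How k and t interact can be sketched with a basic version of winnowing: hash every k-gram, slide a window of w = t - k + 1 hashes over the result, and keep the minimum hash of each window as a fingerprint. The sketch uses the plain variant with ties broken by the rightmost minimum, not the robust variant the comparator refers to. The use of zlib.crc32 as the hash and the helper names are assumptions made for illustration.

```python
import zlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens, remembering its position."""
    return [
        (zlib.crc32(" ".join(tokens[i:i + k]).encode()), i)
        for i in range(len(tokens) - k + 1)
    ]


def winnow(tokens, k, t):
    """Select fingerprints: the minimum hash in each window of t - k + 1 hashes."""
    hashes = kgram_hashes(tokens, k)
    w = t - k + 1
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        # Lowest hash wins; ties go to the rightmost position.
        fingerprints.add(min(window, key=lambda hp: (hp[0], -hp[1])))
    return fingerprints


a = "for i in range ( n ) : total += i".split()
b = "x = 0 ; for i in range ( n ) : pass".split()

# Compare hash values only; positions differ between the two files.
hashes_a = {h for h, _ in winnow(a, k=3, t=5)}
hashes_b = {h for h, _ in winnow(b, k=3, t=5)}
shared = hashes_a & hashes_b
```

Any shared run of at least t tokens contains a complete window of k-gram hashes in both files, and the minimum of that window is selected in each, so at least one fingerprint must coincide. Runs shorter than k tokens never form a k-gram and therefore cannot match.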